Web-scraping the raw data behind the Scottish Government Health Survey.
The Scottish Health Survey is a comprehensive resource covering various health-related aspects, making it invaluable for research and policy making. Its raw data, although rich in potential insights, isn’t easily accessible for in-depth analysis. This blog aims to provide a guide on web scraping in Python to facilitate easier access to this data, enabling professionals and data enthusiasts to explore and utilize it effectively for evidence-based decision-making and novel research endeavors.
Web scraping allows us to collect data from websites automatically, making it a powerful tool for acquiring and analyzing valuable information from online sources. It’s important to note that this guide is intended for educational purposes and adheres to the website’s terms and conditions. We’ll be scraping only a small amount of data so as not to overwhelm the servers, while still saving a lot of manual time (& faff)!
Before we begin, make sure you have the following prerequisites: requests, BeautifulSoup, pathlib, and re. All the code needed is broken down step-by-step below. I’ve also rationalised this code into functions, which can be found in a Jupyter notebook on my GitHub page.
This was the first step in a personal project to investigate adults’ sporting activities, SIMD (the Scottish Index of Multiple Deprivation), and health forecasting. Be sure to check back soon if you’re interested in machine learning with public data!
import re
import requests
from bs4 import BeautifulSoup
from pathlib import Path
from urllib.parse import urljoin
These are the libraries we’ll use to request the web pages, parse the HTML content, handle URLs, clean up file names, and manage files and folders.
Special characters can be a pain when it comes to filenames, so one of the first things we are going to do is define a function to remove these, using the re package. This will be incorporated into the code later on.
def sanitize_filename(filename):
"""Remove characters that are not allowed in file names."""
return re.sub(r'[^\w\-_. ]', '', filename)
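As a quick illustration (the filename below is a made-up example, not one of the survey files), punctuation such as colons and brackets is stripped, while letters, digits, spaces, dots, hyphens, and underscores are kept:

sanitize_filename("Chapter 1: Physical activity (adults).xlsx")
# Returns 'Chapter 1 Physical activity adults.xlsx'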
To collect data spanning multiple years, we need to define a list of specific years we want to scrape. After a manual check, I’ve confirmed that supplementary data is available for the years 2011 to 2021.
years = list(range(2011, 2022))  # 2011 to 2021 inclusive
Now, let’s put it all together to scrape the data. We’ll iterate through the list of years
for year in years:
    base_url = f"https://www.gov.scot/publications/scottish-health-survey-{year}-supplementary-tables/"
and perform the following steps for each year:

We insert the current year into the base URL so that it points at the publication page holding that year’s supplementary tables.
    response = requests.get(base_url)
We send an HTTP GET request to the modified URL to retrieve the web page’s content.
    if response.status_code == 404:
        print(f"No data available for {year}. Skipping...")
        continue
We check if the page exists. If the website returns a 404 status code, we print a message noting that no data is available for that year and skip to the next one.
    response.raise_for_status()  # Check for any other HTTP errors
    soup = BeautifulSoup(response.content, "html.parser")
    excel_links = []
We parse the HTML content of the page and initialize a list to store links.
    for link in soup.find_all("a", href=True):
        href = link["href"]
        absolute_url = urljoin(base_url, href)  # Convert relative URLs to absolute URLs
        if absolute_url.endswith(".xls") or absolute_url.endswith(".xlsx"):
            excel_links.append((absolute_url, link.text.strip()))
We find all the links on the page (the “a” tags), convert relative URLs to absolute URLs, and filter for Excel file extensions. We store the links and their corresponding text labels in excel_links.
    folder_name = Path(str(year))
    folder_name.mkdir(parents=True, exist_ok=True)
We create a folder with the year as the name to store the downloaded files.
    for idx, (link_url, link_name) in enumerate(excel_links, start=1):
        if not link_name:
            print(f"Skipping download for {year} - File {idx} - Empty link name.")
            continue

        safe_name = sanitize_filename(link_name)
        if not safe_name:
            print(f"Skipping download for {year} - File {idx} - Invalid link name: {link_name}")
            continue

        # Keep the original extension so .xlsx files aren't mislabelled as .xls
        extension = ".xlsx" if link_url.endswith(".xlsx") else ".xls"
        file_name = folder_name / (safe_name + extension)

        response = requests.get(link_url)
        with open(file_name, "wb") as file:
            file.write(response.content)
        print(f"Downloaded {file_name} for {year}.")
We download the Excel files, saving each one under a sanitized filename (keeping its original extension) in the respective year’s folder.
Now all the files should be stored in subfolders corresponding to each year’s data!
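If you want a quick sanity check that everything has landed where expected, a short loop like the one below (purely optional, and not part of the scraper itself) counts the Excel files in each year’s folder:

for year in years:
    folder = Path(str(year))
    if folder.exists():
        print(f"{year}: {len(list(folder.glob('*.xls*')))} files downloaded")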
When conducting web scraping, it’s essential to adhere to ethical guidelines:
Check if the website has a robots.txt file that specifies rules for web crawlers. Follow these rules to avoid overloading the server or accessing restricted content.
Review the website’s terms of use and privacy policy to ensure compliance. Some websites may explicitly prohibit or limit web scraping activities.
Implement rate limiting to avoid sending too many requests to the server in a short time. Respect the website’s server capacity.
Consider setting a user-agent header in your requests to identify your scraper. This allows website administrators to contact you if any issues arise (see the short sketch after this list).
Be cautious about scraping sensitive or private data. Always anonymize and aggregate data to protect individuals' privacy.
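To make a few of these points concrete, here’s a minimal sketch of how robots.txt checking, a user-agent header, and a simple delay could be bolted onto the loop above. The user-agent string and contact address are placeholders for your own details, the two-second pause is just an illustrative choice, and the robots.txt URL assumes the usual root location:

import time
from urllib import robotparser

# Placeholder user-agent - swap in your own project name and contact details
headers = {"User-Agent": "shs-tables-scraper (contact: you@example.com)"}

# Read the site's robots.txt once before crawling
robots = robotparser.RobotFileParser("https://www.gov.scot/robots.txt")
robots.read()

for year in years:
    base_url = f"https://www.gov.scot/publications/scottish-health-survey-{year}-supplementary-tables/"
    if not robots.can_fetch(headers["User-Agent"], base_url):
        print(f"robots.txt disallows {base_url}. Skipping...")
        continue
    response = requests.get(base_url, headers=headers)
    # ... the rest of the scraping steps from above ...
    time.sleep(2)  # simple rate limiting: pause between requests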
Ethical web scraping is a valuable tool to access publicly available data for research and analysis. By following best practices, and respecting the terms of use of websites, you can gather data responsibly and ethically.
Happy scraping and data analysis!
*Summary image by Marvin Meyer on Unsplash