Web scraping public data

Web scraping the raw data behind the Scottish Health Survey.

Introduction

The Scottish Health Survey is a comprehensive resource covering various health-related aspects, making it invaluable for research and policy making. Its raw data, although rich in potential insights, isn’t easily accessible for in-depth analysis. This blog aims to provide a guide on web scraping in Python to facilitate easier access to this data, enabling professionals and data enthusiasts to explore and utilize it effectively for evidence-based decision-making and novel research endeavors.

Web scraping allows us to collect data from websites automatically, making it a powerful tool for acquiring and analyzing valuable information from online sources. It’s important to note that this guide is intended for educational purposes and adheres to the website’s terms and conditions. We’ll be scraping only a small amount of data so as not to overwhelm the servers, while still saving a lot of manual time (& faff)!

Prerequisites

Before we begin, make sure you have the following prerequisites:

  1. A Python installation.
  2. Required Python libraries: requests and beautifulsoup4 (both installable with pip); pathlib, re, and urllib.parse come with the standard library.

All the code needed is broken down step-by-step below. I’ve also rationalised this code into functions, which can be found in a Jupyter notebook on my GitHub page.

This was the first step in a personal project to investigate adults’ sporting activities, SIMD, and health forecasting. Be sure to check back soon if you’re interested in machine learning with public data!

Step 1: Import the necessary libraries

import re 
import requests 

from bs4 import BeautifulSoup 
from pathlib import Path  
from urllib.parse import urljoin 

These are the libraries we’ll use to talk to web pages, parse HTML content, clean up file names, and manage files and folders.

Step 2: Define a function to sanitize filenames

Special characters can be a pain when it comes to filenames, so one of the first things we are going to do is define a function to remove these, using the re package. This will be incorporated into the code later on.

def sanitize_filename(filename):
    """Remove characters that are not allowed in file names."""
    return re.sub(r'[^\w\-_. ]', '', filename)
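
As a quick illustration, here’s what the function does to a typical link label (the label below is made up for the example):

print(sanitize_filename("Chapter 5: Physical activity (Excel)"))
# Prints: Chapter 5 Physical activity Excel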

Step 3: Specify the years of interest

To collect data spanning multiple years, we need to define a list of specific years we want to scrape. After a manual check, I’ve confirmed that supplementary data is available for the years 2011 to 2021.

years = list(range(2011, 2022))  # range's end point is exclusive, so this covers 2011-2021

Step 4: Iterate through the years and scrape data

Now, let’s put it all together to scrape the data. We’ll iterate through the list of years and perform the following steps for each year.

4.1. Modify the base URL

for year in years:
    base_url = f"https://www.gov.scot/publications/scottish-health-survey-{year}-supplementary-tables/"

We modify the base URL with the current year to access the page holding that year’s data. The remaining steps (4.2 to 4.7) all sit inside this loop, which is why the code blocks below are indented.

4.2. Send an HTTP request

    response = requests.get(base_url)

We send an HTTP GET request to the modified URL to retrieve the web page’s content.

4.3. Check page exists

    if response.status_code == 404:
        print(f"No data available for {year}. Skipping...")
        continue

We check whether the page exists. If the website returns a 404 status code, we print a message to flag that no data is available for that year and skip to the next one.

4.4. Parse HTML content

    response.raise_for_status()  # Check for any HTTP errors
    soup = BeautifulSoup(response.content, "html.parser")
    excel_links = []

We parse the HTML content of the page and initialize a list to store links.

4.5. Find excel links

    for link in soup.find_all("a", href=True): 
        href = link["href"]
        absolute_url = urljoin(base_url, href)  # Convert relative URLs to absolute URLs
        if absolute_url.endswith(".xls") or absolute_url.endswith(".xlsx"): 
            excel_links.append((absolute_url, link.text.strip()))

We find all the anchor ("a") tags on the page, convert relative URLs to absolute URLs, and filter for Excel extensions (.xls and .xlsx). We store the links and their corresponding text labels in excel_links.

4.6. Create folders

    folder_name = Path(str(year)) 
    folder_name.mkdir(parents=True, exist_ok=True)

We create a folder with the year as the name to store the downloaded files.

4.7. Download excel files

    for idx, (link_url, link_name) in enumerate(excel_links, start=1):
        if not link_name:
            print(f"Skipping download for {year} - File {idx} - Empty link name.")
            continue

        safe_name = sanitize_filename(link_name)
        if not safe_name:
            print(f"Skipping download for {year} - File {idx} - Invalid link name: {link_name}")
            continue

        # Keep the original extension rather than assuming .xls
        extension = ".xlsx" if link_url.endswith(".xlsx") else ".xls"
        file_name = folder_name / (safe_name + extension)

        response = requests.get(link_url)
        response.raise_for_status()
        with open(file_name, "wb") as file:
            file.write(response.content)
        print(f"Downloaded {file_name} for {year}.")

We download the Excel files, using sanitized filenames and their original file extensions, and store them in the respective year’s folder.

Now all the files should be stored in subfolders corresponding to each year’s data!
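
If you’d like a quick sanity check, here’s a small sketch (assuming the script was run from the directory containing the year subfolders) that counts the workbooks downloaded for each year:

from pathlib import Path

# Count the Excel workbooks downloaded into each year folder.
for folder in sorted(Path(".").glob("20[0-9][0-9]")):
    if not folder.is_dir():
        continue
    files = list(folder.glob("*.xls*"))
    print(f"{folder.name}: {len(files)} files")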

Ethical Considerations

When conducting web scraping, it’s essential to adhere to ethical guidelines:

1. Respect robots.txt

Check if the website has a robots.txt file that specifies rules for web crawlers. Follow these rules to avoid overloading the server or accessing restricted content.

2. Check website’s Terms of Use

Review the website’s terms of use and privacy policy to ensure compliance. Some websites may explicitly prohibit or limit web scraping activities.

3. Rate limiting

Implement rate limiting to avoid sending too many requests to the server in a short time. Respect the website’s server capacity.

4. User-agent header

Consider setting a User-Agent header in your requests to identify your scraper. This allows website administrators to contact you if any issues arise. A small sketch covering points 1, 3, and 4 follows this list.

5. Data privacy

Be cautious about scraping sensitive or private data. Always anonymize and aggregate data to protect individuals' privacy.
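
To make points 1, 3, and 4 concrete, here’s a minimal sketch. It isn’t part of the scraper above; the User-Agent string and contact address are hypothetical placeholders, and the one-second delay is an arbitrary choice:

import time
import requests
from urllib import robotparser

# Identify the scraper; the contact address is a hypothetical placeholder.
HEADERS = {"User-Agent": "shs-tables-scraper/0.1 (contact: you@example.com)"}

# Point 1: check robots.txt before requesting a page.
robots = robotparser.RobotFileParser()
robots.set_url("https://www.gov.scot/robots.txt")
robots.read()

url = "https://www.gov.scot/publications/scottish-health-survey-2021-supplementary-tables/"
if robots.can_fetch(HEADERS["User-Agent"], url):
    response = requests.get(url, headers=HEADERS)  # point 4: identify ourselves
    time.sleep(1)  # point 3: pause before sending the next request
else:
    print("robots.txt disallows fetching this URL.")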

Conclusion

Ethical web scraping is a valuable way to access publicly available data for research and analysis. By following best practices and respecting websites’ terms of use, you can gather data responsibly and ethically.

Happy scraping and data analysis!

*Summary image by Marvin Meyer on Unsplash