Unlocking geospatial data for street (greenspace) smarts
I feel like I live different lives. There is the professional side, which has now moved away from using, to advocating for the use of data. The scientist who likes to find stuff out (and see pretty plots). Then the mother, whose parenting style can be described as “outside”. All that to say, this is an exciting blog post to write as it intersects the different me’s!
I’m going to go through some basic operations when working with geospatial datasets to see what insights can be discovered about access to outdoor space in Scotland. I’ve split this into two blogs - this one is a tutorial showcasing the sf package and working with multiple geospatial datasets. The next will be the pretty plots and my personal insights/musings.
I’m also putting an accompanying quarto document on my GitHub so anyone can run this themselves (sidenote: I should probably migrate this blog from blogdown to quarto, but time).
I want to look at the total area and access to greenspace in Scotland. I want to know if greenspace is equally distributed across population, deprivation quintiles, and local authorities.
Greenspace is defined as “public parks, playing fields, sports facilities, play areas, allotments and cemeteries.”
For this I’ll need the greenspace data, alongside the Scottish Index of Multiple Deprivation (SIMD).
I’m going to use the freely available greenspace product from Ordnance Survey. I really like the OS Data Hub for its incredibly user-friendly interface.
I downloaded all the tiles covering Scotland into a greenspace directory within a data directory in my project folder. The reason for this should become clear when it comes to loading in all these shapefiles.
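Before running any of the code below, we’ll load the packages used throughout this post (this assumes they’re already installed):

```r
# Packages used throughout this post
library(sf)      # reading and manipulating spatial data
library(dplyr)   # data wrangling (select, group_by, summarise, joins)
library(tmap)    # quick thematic maps (qtm)
library(plotly)  # interactive treemap
library(janitor) # clean_names()
```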
# Directory containing shapefiles
greenspace_data_path <- "data/greenspace/"
You could follow most of this tutorial just using one tile or tiles covering one local authority if you didn’t want to do this for the whole of Scotland.
To check the shapefiles in the directory:
# List all files with the .shp extension in the directory
shapefile_list <- list.files(path = greenspace_data_path, pattern = "\\.shp$", full.names = TRUE)
We’ll read each shapefile into a list of sf objects, then name each list element after its filename (without the extension) for easier manipulation.
# Read each shapefile into a list of sf objects
sf_list <- lapply(shapefile_list, st_read)
# Assign names to the list based on the file names
names(sf_list) <- tools::file_path_sans_ext(basename(shapefile_list))
We now want to check that all the column names are consistent because, if I’ve learnt anything, it’s that people producing and publishing data have no sympathy for data users when it comes to these things ;)
# Checking out the column names to check they match
column_names_list <- lapply(sf_list, function(sf_object) {
colnames(sf_object)
})
You’ll see the NT_GreenspaceSite file has additional columns, and its function_ column is named differently to the function. column in all the other files. As it’s just the one file, we can rename that column to match the others.
# Rename 'function_' to 'function.' only in the 'NT' shapefile
names(sf_list$NT_GreenspaceSite)[names(sf_list$NT_GreenspaceSite) == "function_"] <- "function."
Then run the lines above to check column names again to ensure it worked.
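As a quicker alternative to eyeballing the list again, a one-line check (a minimal sketch) confirms every file now shares an identical column set:

```r
# Sanity check: every sf object should now have the same (sorted) column names
stopifnot(length(unique(lapply(sf_list, function(x) sort(colnames(x))))) == 1)
```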
We’re going to merge all the shapefiles into one larger shapefile, for this we will ditch any data that’s not needed to make it as small a file as possible, by selecting only the needed columns.
# Select only 'id', function' and 'geometry' columns from each sf object in sf_list
selected_cols_sf <- lapply(sf_list, function(sf_object) {
sf_object %>%
select(id, function., geometry)
})
Then we merge:
# Merge all sf objects in selected_cols_sf into a single sf object
merged_greenspace_sf <- do.call(rbind, selected_cols_sf)
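Note that rbind() requires identical columns across all the objects, which is why we checked and fixed the names above. A quick sketch to confirm nothing was dropped in the merge:

```r
# The merged object should have one row per feature across all tiles
stopifnot(nrow(merged_greenspace_sf) == sum(vapply(selected_cols_sf, nrow, integer(1))))
```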
Optionally, you can make a quick plot using the tmap package to check that our geospatial file actually contains geospatial data.
# Fix factor levels for cleaner plotting
merged_greenspace_sf$function. <- as.factor(merged_greenspace_sf$function.)
# Quick plot
qtm(merged_greenspace_sf["function."])
I said at the top we’ll be looking at the different greenspace areas, so we need to convert the geometry into an area. I love this feature of sf that just does it for you!
# Add a new column with the area of each object
merged_greenspace_sf$area <- st_area(merged_greenspace_sf)
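Because the data is in a projected CRS, st_area() returns a units vector in square metres. If you’d rather work in hectares, the units package (assumed installed) can convert - a minimal sketch:

```r
library(units)
# Convert the square-metre areas to hectares for easier reading
merged_greenspace_sf$area_ha <- set_units(merged_greenspace_sf$area, "ha", mode = "standard")
```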
For fun, let’s look at the whole dataset and see what we can tell about greenspace in Scotland. We won’t make any more maps so we can drop the geometry and make a standard, much much smaller, df that will return quicker calculations.
# Dropping the geometry for quicker calculating
merged_sf_nogeom <- st_set_geometry(merged_greenspace_sf, NULL)
Let’s see how much area there is for each type of greenspace (or function) in Scotland.
# Calculate the amount of area of each greenspace type
function_area <- merged_sf_nogeom %>%
group_by(`function.`) %>%
summarise(function_area = sum(area, na.rm = TRUE)) %>%
arrange(desc(function_area))
And we can quickly visualise this with a treemap using plotly:
treemap <- plot_ly(data = function_area, ids = ~function., labels = ~function., parents = "",
values = ~function_area, type = "treemap") %>%
layout(plot_bgcolor = "black",
paper_bgcolor = "black")
If we just wanted to look at all of Scotland, we could just use the greenspace data and a map of Scotland (as one greenspace file crosses the boundary into England). But I want to see how greenspace compares across Local Authorities, especially since I already have the files from a previous blog post.
The SIMD dataset contains all the output areas across Scotland, with their SIMD value and associated Local Authority. This means we don’t need an additional dataset to look at Local Authority breakdowns.
The Scottish Index of Multiple Deprivation (SIMD) is an index of relative deprivation. It sections Scotland into around 7k datazones with roughly equivalent population sizes. These areas are then ranked against 7 risk domains (access to services, crime, education, employment, health, housing, income) and divided into quintiles. Areas that rank highest against multiple domains are described as the most deprived areas of Scotland (quintile 1).
I downloaded this data from the Scottish Spatial Data website and saved it to my project’s data folder.
# SIMD shapefiles
SIMD_shapefile_path <- "data/SG_SIMD_2020.shp"
# Read in SIMD shapefiles
Layer_SIMD <- st_read(SIMD_shapefile_path, layer = "SG_SIMD_2020")
# Filtering SIMD layer to only wanted columns
filtered_simd <- Layer_SIMD %>% select(DataZone, LAName, Quintilev2, Decilev2, SAPE2017)
# Cleaning column names
filtered_simd <- clean_names(filtered_simd)
First off, we’re going to ensure that both datasets are in the same projected CRS.
For more information on CRS and other geospatial need to knows check out this blog post
We can test this:
# Testing CRS projection matches
st_crs(filtered_simd) == st_crs(merged_greenspace_sf)
And if they match it will return TRUE.
We can also transform both datasets to EPSG:27700. If we wanted to convert to standard latitude/longitude instead, we’d switch the CRS value to 4326, for the WGS84 projection.
# Set same projected CRS for datasets
target_crs <- 27700 # British National Grid (BNG)
filtered_simd <- st_transform(filtered_simd, crs = target_crs)
merged_greenspace_sf <- st_transform(merged_greenspace_sf, crs = target_crs)
We’re now going to join the greenspace and SIMD datasets together based on overlapping geometry. By putting the greenspace dataframe first, I am adding a SIMD quintile to each greenspace feature.
# Joining greenspace and simd shapefiles
la_merged_shapefile <- st_join(merged_greenspace_sf, filtered_simd)
# Summarise total greenspace area by SIMD quintile
quintile_greenspace <- la_merged_shapefile %>%
st_drop_geometry() %>%
group_by(quintilev2) %>%
summarise(total_greenspace_area = sum(as.numeric(area), na.rm = TRUE)) %>%
arrange(quintilev2)
# Calculate population per quintile
pop_by_quintile <- filtered_simd %>%
st_drop_geometry() %>%
group_by(quintilev2) %>%
summarise(total_population = sum(sape2017, na.rm = TRUE))
# Join and calculate greenspace per person
quintile_summary <- left_join(quintile_greenspace, pop_by_quintile, by = "quintilev2") %>%
mutate(gs_per_person = total_greenspace_area / total_population)
Next we want to see where there are SIMD quintile 1 areas with poor access to greenspace, or “green deserts”. As the two spatial datasets contain a lot of complex geometries, we want to simplify or reduce the data to speed up calculations.
Let’s compare three approaches for joining the datasets and see how the outputs differ.
We are only interested in SIMD quintile 1 areas so that means we can filter down the SIMD dataframe.
# Filtering to Q1
simd_q1 <- filtered_simd %>% filter(quintilev2 == 1)
We’ll use three methods to join or overlap the data and compare outputs. First, we’ll use st_join() again, but this time with the SIMD dataset first, as we’re focused on all quintile 1 areas.
# Join datasets and create a column 'access_a' where there is no greenspace
method_a <- st_join(simd_q1, merged_greenspace_sf, join = st_intersects, left = TRUE) %>%
mutate(access_a = !is.na(function.))
For the second method we’ll create a representative point for each zone with st_point_on_surface() (unlike a true centroid, this point is guaranteed to fall inside the polygon) and look for greenspace within a 1000m buffer of those points.
simd_q1_pts <- st_point_on_surface(simd_q1)
has_nearby_greenspace <- lengths(
st_is_within_distance(simd_q1_pts, merged_greenspace_sf, dist = 1000)
) > 0
method_b <- simd_q1 %>%
mutate(access_b = has_nearby_greenspace)
And for the final method, we’ll buffer each greenspace polygon by 500m and flag the SIMD zones that intersect any of those buffers.
# Create 500m buffer around greenspace
greenspace_500m <- st_buffer(merged_greenspace_sf, dist = 500)
# Intersection of greenspace buffer and SIMD Q1 areas
in_buffer <- lengths(st_intersects(simd_q1, greenspace_500m)) > 0
method_c <- simd_q1 %>%
mutate(access_c = in_buffer)
We’ll then join all the output flags together into one dataframe. Because st_join() can return one row per overlapping greenspace, we reduce method A to one distinct flag per data zone first.
# Combine all access flags into one object
simd_compare <- simd_q1 %>%
select(data_zone, geometry) %>%
left_join(method_a %>% st_drop_geometry() %>% distinct(data_zone, access_a), by = "data_zone") %>%
left_join(method_b %>% st_drop_geometry() %>% select(data_zone, access_b), by = "data_zone") %>%
left_join(method_c %>% st_drop_geometry() %>% select(data_zone, access_c), by = "data_zone")
And then compare outputs across the 3 methods.
simd_compare %>%
st_drop_geometry() %>%
summarise(
overlap = sum(access_a, na.rm = TRUE),
within_1km = sum(access_b, na.rm = TRUE),
in_500m_buffer = sum(access_c, na.rm = TRUE)
)
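It can also be worth looking at where the three definitions disagree - a quick sketch:

```r
# Data zones where the three access definitions give different answers
simd_compare %>%
  st_drop_geometry() %>%
  filter(access_a != access_b | access_b != access_c) %>%
  count(access_a, access_b, access_c)
```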
Finally, we’ll calculate what percentage of each SIMD zone’s area falls within the 500m greenspace buffer, summarised by quintile.
# Recalculate area total
filtered_simd$area_total <- st_area(filtered_simd)
# Intersect SIMD zones with 500m greenspace buffer
intersection <- st_intersection(filtered_simd, greenspace_500m)
# Combine overlaps per SIMD zone (avoids duplicate area from multiple overlaps)
intersect_union <- intersection %>%
group_by(data_zone) %>%
summarise(geometry = st_union(geometry)) %>%
mutate(area_overlap = st_area(geometry)) %>%
st_drop_geometry()
# Join back to original SIMD data
simd_allq <- filtered_simd %>%
left_join(intersect_union, by = "data_zone") %>%
mutate(
area_overlap = ifelse(is.na(area_overlap), 0, area_overlap),
coverage_pct = (as.numeric(area_overlap) / as.numeric(area_total)) * 100
)
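Since the intersection of a buffer with a zone can never exceed the zone itself, coverage should be bounded - a quick sanity check:

```r
# Coverage percentages should fall between 0 and 100 for every zone
stopifnot(all(simd_allq$coverage_pct >= 0 & simd_allq$coverage_pct <= 100))
```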
# Summarise by quintile
area_overlap_percent <- simd_allq %>%
st_drop_geometry() %>%
group_by(quintilev2) %>%
summarise(
total_overlap_area = sum(as.numeric(area_overlap), na.rm = TRUE),
total_area = sum(as.numeric(area_total), na.rm = TRUE),
coverage_pct = (total_overlap_area / total_area) * 100
)
Since I started this blog, ChatGPT has really evolved into a go-to for code, so I’m left feeling these tutorials are no longer very useful (if they ever were!). However, I think having some actual datasets to play with and problems to try and address is a good way to explore new tools - something ChatGPT can’t give you without the right prompts. I was putting this code together to analyse the data anyway, so why not share it!
Summary image by Leonardo Fuzão Reis