1 Reproducible Spatial Analysis

This week’s lecture offered a comprehensive introduction to the Geocomputation module, highlighting how and why it differs from a traditional GIScience course. In this week’s tutorial, we will introduce you to using R and RStudio for working with spatial data, focusing specifically on how R can be used to make maps.

1.1 Lecture slides

You can download the slides of this week’s lecture here: [Link].

1.2 Reading list

Essential readings

  • Brunsdon, C. and Comber, A. 2021. Opening practice: Supporting reproducibility and critical spatial data science. Journal of Geographical Systems 23: 477–496. [Link]
  • Franklin, R. 2023. Quantitative methods III: Strength in numbers? Progress in Human Geography. Online First. [Link].
  • Longley, P. et al. 2015. Geographic Information Science & Systems, Chapter 1: Geographic Information: Science, Systems, and Society, pp. 1-32. [Link]

Suggested readings

  • Goodchild, M. 2009. Geographic information systems and science: Today and tomorrow. Annals of GIS 15(1): 3-9. [Link]
  • Franklin, S., Houlden, V., Robinson, C. et al. 2021. Who counts? Gender, Gatekeeping, and Quantitative Human Geography. The Professional Geographer 73(1): 48-61. [Link]
  • Schurr, C., Müller, M. and Imhof, N. 2020. Who makes geographical knowledge? The gender of Geography’s gatekeepers. The Professional Geographer 72(3): 317-331. [Link]
  • Yuan, M. 2001. Representing complex geographic phenomena in GIS. Cartography and Geographic Information Science 28(2): 83-96. [Link]

1.3 Europeans in London

In RStudio, scripts allow us to build and save code that can be run repeatedly. We can organise these scripts into RStudio projects, which consolidate all files related to an analysis such as input data, R scripts, results, figures, and more. This organisation helps keep track of all data, input, and output, while enabling us to create standalone scripts for each part of our analysis. Additionally, it simplifies managing directories and filepaths and allows us to keep track of our installed packages through renv.

Package management in R involves handling the installation, updating, and tracking of external libraries needed for your code. This ensures that your R scripts can run smoothly without issues related to missing or incompatible packages. Within an RStudio project, you can use renv to create a reproducible environment by capturing the specific package versions used in the project. This means that anyone working on or revisiting the project will have access to the same package setup, preventing problems caused by package updates or changes.

Navigate to File -> New Project -> New Directory. Choose a directory name, such as GEOG0030, and select the location on your computer where you want to save this project by clicking on Browse….

Ensure you select an appropriate folder to store your GEOG0030 project. For example, you might use your Geocomputation folder, if you have one, or another location within your Documents directory on your computer.

Please ensure that folder names and file names do not contain spaces or special characters such as * . " / \ [ ] : ; | = , < ? > & $ # ! ' { } ( ). Different operating systems and programming languages deal differently with spaces and special characters and as such including these in your folder names and file names can cause many problems and unexpected errors. As an alternative to using white space you can use an underscore (_) or hyphen (-) if you like.

Tick the checkbox for Use renv with this project and click on Create Project. You should now see your main RStudio window switch to this new project and when you check your files pane, you should see a new R Project called GEOG0030.

With our GEOG0030 project ready to go, in this first tutorial we will look at the distribution of the share of European immigrants across London. The data covers the number of people residing in London that are born in a European country, as recorded in the 2021 Census for England and Wales, aggregated at the Middle Layer Super Output Area (MSOA) level.

An MSOA is a geographic unit used in the UK for statistical analysis. It typically represents small areas with populations of around 5,000 to 15,000 people and is designed to ensure consistent data reporting. MSOAs are commonly used to report on census data, deprivation indices, and other socio-economic statistics.

The dataset has been extracted using the Custom Dataset Tool, and you can download the file via the link provided below. Save the file in your project folder under data/attributes. Along with this dataset, we also have access to a GeoPackage that contains the MSOA boundaries. Save this file under data/spatial, respectively.

You will to have create a folder named data within your RStudio Project directory, inside which you will have to have a folder named attributes and a folder named spatial.

File Type Link
London MSOA Census 2021 European Population csv Download
London MSOA 2021 Spatial Boundaries GeoPackage Download

To download a csv file that is hosted on GitHub, click on the Download raw file button on the top right of your screen and it should download directly to your computer.

You may have used spatial data before and noticed that we did not download a collection of files known as a shapefile but a GeoPackage instead. Whilst shapefiles are still being used, GeoPackage is a more modern and portable file format. Have a look at this article on towardsdatascience.com for an excellent explanation on why one should use GeoPackage files over shapefiles where possible: [Link]

To get started, let us create our first script. File -> New File -> R Script. Save your script as w01-european-population-london.r.

We will start by loading the libraries that we will need:

R code
# load libraries
library(tidyverse)
library(sf)
library(tmap)

You may have to install some of these libraries if you have not used these before.

For Linux and macOS users who are new to working with spatial data in R, the installation of the sf library may fail because additional (non-R) libraries are required which are automatically installed for Windows users. If you encounter installation issues,, please refer to the information pages of the sf library for instructions on how to install these additional libraries.

Once downloaded, we can load both files into memory:

R code
# read spatial dataset
msoa21 <- st_read("data/spatial/London-MSOA-2021.gpkg")
Reading layer `London-MSOA-2021' from data source 
  `/Users/justinvandijk/Library/CloudStorage/Dropbox/UCL/Web/jtvandijk.github.io/GEOG0030/data/spatial/London-MSOA-2021.gpkg' 
  using driver `GPKG'
Simple feature collection with 1002 features and 8 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 503574.2 ymin: 155850.8 xmax: 561956.7 ymax: 200933.6
Projected CRS: OSGB36 / British National Grid
# load attribute dataset
msoa_eur <- read_csv("data/attributes/London-MSOA-European.csv")
Rows: 1002 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): msoa21cd
dbl (2): eur21, pop21

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# inspect
head(msoa21)
Simple feature collection with 6 features and 8 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 530966.7 ymin: 180512.6 xmax: 551943.8 ymax: 191139
Projected CRS: OSGB36 / British National Grid
   msoa21cd                 msoa21nm msoa21nmw  bng_e  bng_n      lat      long
1 E02000001       City of London 001           532384 181355 51.51562 -0.093490
2 E02000002 Barking and Dagenham 001           548267 189685 51.58652  0.138756
3 E02000003 Barking and Dagenham 002           548259 188520 51.57606  0.138149
4 E02000004 Barking and Dagenham 003           551004 186412 51.55639  0.176828
5 E02000005 Barking and Dagenham 004           548733 186824 51.56069  0.144267
6 E02000007 Barking and Dagenham 006           549698 186609 51.55851  0.158087
                                globalid                           geom
1 {71249043-B176-4306-BA6C-D1A993B1B741} MULTIPOLYGON (((532135.1 18...
2 {997A80A8-0EBE-461C-91EB-3E4122571A6E} MULTIPOLYGON (((548881.6 19...
3 {62DED9D9-F53A-454D-AF35-04404D9DBE9B} MULTIPOLYGON (((549102.4 18...
4 {511181CD-E71F-4C63-81EE-E8E76744A627} MULTIPOLYGON (((551550.1 18...
5 {B0C823EB-69E0-4AE7-9E1C-37715CF3FE87} MULTIPOLYGON (((549099.6 18...
6 {A33C6ADD-D70A-4737-ADE5-3460D7016CA1} MULTIPOLYGON (((549819.9 18...
# inspect
head(msoa_eur)
# A tibble: 6 × 3
  msoa21cd  eur21 pop21
  <chr>     <dbl> <dbl>
1 E02000001  1926  8582
2 E02000002  1102  8280
3 E02000003  1930 11542
4 E02000004   808  6640
5 E02000005  1541 11082
6 E02000007  1365 10159

You can further inspect both objects using the View() function.

1.3.1 Exploring spatial data

The first thing we want to do when we load spatial data is to plot the data to check whether everything is in order. To do this, we can simply use the base R plot() function

R code
# plot data
plot(msoa21, max.plot = 1, main = "")

Figure 1: Quick plot to inspect the MSOA spatial data.

You should see your msoa21 plot appear in your Plots window.

The plot() function should not to be used to make publishable maps but can be used as a quick way of inspecting your spatial data.

Just as with a tabular dataframe, we can inspect the attributes of the spatial data frame:

R code
# inspect columns
ncol(msoa21)
[1] 9
# inspect rows
nrow(msoa21)
[1] 1002
# inspect data
head(msoa21)
Simple feature collection with 6 features and 8 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 530966.7 ymin: 180512.6 xmax: 551943.8 ymax: 191139
Projected CRS: OSGB36 / British National Grid
   msoa21cd                 msoa21nm msoa21nmw  bng_e  bng_n      lat      long
1 E02000001       City of London 001           532384 181355 51.51562 -0.093490
2 E02000002 Barking and Dagenham 001           548267 189685 51.58652  0.138756
3 E02000003 Barking and Dagenham 002           548259 188520 51.57606  0.138149
4 E02000004 Barking and Dagenham 003           551004 186412 51.55639  0.176828
5 E02000005 Barking and Dagenham 004           548733 186824 51.56069  0.144267
6 E02000007 Barking and Dagenham 006           549698 186609 51.55851  0.158087
                                globalid                           geom
1 {71249043-B176-4306-BA6C-D1A993B1B741} MULTIPOLYGON (((532135.1 18...
2 {997A80A8-0EBE-461C-91EB-3E4122571A6E} MULTIPOLYGON (((548881.6 19...
3 {62DED9D9-F53A-454D-AF35-04404D9DBE9B} MULTIPOLYGON (((549102.4 18...
4 {511181CD-E71F-4C63-81EE-E8E76744A627} MULTIPOLYGON (((551550.1 18...
5 {B0C823EB-69E0-4AE7-9E1C-37715CF3FE87} MULTIPOLYGON (((549099.6 18...
6 {A33C6ADD-D70A-4737-ADE5-3460D7016CA1} MULTIPOLYGON (((549819.9 18...
# inspect column names
names(msoa21)
[1] "msoa21cd"  "msoa21nm"  "msoa21nmw" "bng_e"     "bng_n"     "lat"      
[7] "long"      "globalid"  "geom"     

We can further establish the class of our data:

R code
# inspect
class(msoa21)
[1] "sf"         "data.frame"

We should see our data is an sf dataframe, which is what we want.

1.3.2 Joining attribute data

Now we have our dataset containing London’s European born population and the MSOA spatial boundaries loaded, we can join these together using an Attribute Join. Before proceeding with the join, we need to verify that a matching unique identifier exists in both datasets. Let’s look at the column names in our datasets again:

R code
# inspect column names
names(msoa21)
[1] "msoa21cd"  "msoa21nm"  "msoa21nmw" "bng_e"     "bng_n"     "lat"      
[7] "long"      "globalid"  "geom"     
# inspect column names
names(msoa_eur)
[1] "msoa21cd" "eur21"    "pop21"   

The msoa21cd columns looks promising as it features in both datasets. We can quickly sort both columns and have a peek at the data:

R code
# inspect spatial dataset
head(sort(msoa21$msoa21cd))
[1] "E02000001" "E02000002" "E02000003" "E02000004" "E02000005" "E02000007"
# inspect attribute dataset
head(sort(msoa_eur$msoa21cd))
[1] "E02000001" "E02000002" "E02000003" "E02000004" "E02000005" "E02000007"

They seem to contain similar values, so that is promising. Let us try to join the attribute data onto the spatial data:

R code
# join attribute data onto spatial data
msoa21 <- msoa21 |> 
  left_join(msoa_eur, by = c('msoa21cd' = 'msoa21cd'))

The code above uses a pipe function: |>. The pipe operator allows you to pass the output of one function directly into the next, streamlining your code. While it might be a bit confusing at first, you will find that it makes your code faster to write and easier to read. More importantly, it reduces the need to create multiple intermediate variables to store outputs.

We can explore the joined data in usual fashion:

R code
# inspect columns
ncol(msoa21)
[1] 11
# inspect rows
nrow(msoa21)
[1] 1002
# inspect data
head(msoa21)
Simple feature collection with 6 features and 10 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 530966.7 ymin: 180512.6 xmax: 551943.8 ymax: 191139
Projected CRS: OSGB36 / British National Grid
   msoa21cd                 msoa21nm msoa21nmw  bng_e  bng_n      lat      long
1 E02000001       City of London 001           532384 181355 51.51562 -0.093490
2 E02000002 Barking and Dagenham 001           548267 189685 51.58652  0.138756
3 E02000003 Barking and Dagenham 002           548259 188520 51.57606  0.138149
4 E02000004 Barking and Dagenham 003           551004 186412 51.55639  0.176828
5 E02000005 Barking and Dagenham 004           548733 186824 51.56069  0.144267
6 E02000007 Barking and Dagenham 006           549698 186609 51.55851  0.158087
                                globalid eur21 pop21
1 {71249043-B176-4306-BA6C-D1A993B1B741}  1926  8582
2 {997A80A8-0EBE-461C-91EB-3E4122571A6E}  1102  8280
3 {62DED9D9-F53A-454D-AF35-04404D9DBE9B}  1930 11542
4 {511181CD-E71F-4C63-81EE-E8E76744A627}   808  6640
5 {B0C823EB-69E0-4AE7-9E1C-37715CF3FE87}  1541 11082
6 {A33C6ADD-D70A-4737-ADE5-3460D7016CA1}  1365 10159
                            geom
1 MULTIPOLYGON (((532135.1 18...
2 MULTIPOLYGON (((548881.6 19...
3 MULTIPOLYGON (((549102.4 18...
4 MULTIPOLYGON (((551550.1 18...
5 MULTIPOLYGON (((549099.6 18...
6 MULTIPOLYGON (((549819.9 18...
# inspect column names
names(msoa21)
 [1] "msoa21cd"  "msoa21nm"  "msoa21nmw" "bng_e"     "bng_n"     "lat"      
 [7] "long"      "globalid"  "eur21"     "pop21"     "geom"     

Always inspect your join to ensure everything looks as expected. A good way to do this is by using the View() function to check for any unexpected missing values, which are marked as NA.

We can also compare the total number of rows in the spatial dataset with the total number of non-NA values in the joined columns:

R code
# inspect
nrow(msoa21)
[1] 1002
# check for missing values
sum(!is.na(msoa21$eur21))
[1] 1002
# check for missing values
sum(!is.na(msoa21$pop21))
[1] 1002

No missing values. In this case we did not expect any missing values, so this confirms that all our full attribute dataset has been linked to the spatial dataset.

We are almost ready to map the data. Only thing that is left is for us to calculate the share of European-born immigrants within each MSOA:

R code
# calculate proportion
msoa21 <- msoa21 |>
    mutate(prop_eur21 = eur21/pop21)

1.3.3 Mapping spatial data

For our map-making, we will use one of the two primary visualisation libraries for spatial data: tmap. tmap offers a flexible, layer-based approach that makes it easy to create various types of thematic maps, such as choropleths and proportional symbol maps. One of the standout features of tmap is its quick plotting function, qtm(), which allows you to generate basic maps with minimal effort.

R code
# quick thematic map
qtm(msoa21, fill = "prop_eur21")

Figure 2: Quick thematic map.

In this case, the fill() argument in tmap is how we instruct the library to create a choropleth map based on the values in the specified column. If we set fill() to NULL, only the borders of our polygons will be drawn, without any colour fill. The qtm() function in tmap is versatile, allowing us to pass various parameters to customise the aesthetics of our map. By checking the function’s documentation, you can explore the full list of available parameters. For instance, to set the MSOA borders to white, we can use the borders parameter:

R code
# quick thematic map
qtm(msoa21, fill = "prop_eur21", borders = "white")

Figure 3: Quick thematic map with white borders.

The map does not look quite right yet. While we can continue tweaking parameters in the qtm() function to improve it, qtm() is somewhat limited in its functionality and is primarily intended for quickly inspecting your data and creating basic maps. For more complex and refined map-making with the tmap library, it is better to use the main plotting method that starts with the tm_shape() function.

The primary approach to creating maps in tmap involves using a layered grammar of graphics to build up your map, starting with the tm_shape() function. This function, when provided with a spatial dataframe, captures the spatial information of your data, including its projection and geometry, and creates a spatial object. While you can override certain aspects of the spatial data (such as its projection) using the function’s parameters, the essential role of tm_shape() is to instruct R to “use this object as the basis for drawing the shapes.”

To actually render the shapes, you need to add a layer that specifies the type of shape you want R to draw from this spatial information—such as polygons for our data. This layer function tells R to “draw my spatial object as X”, where X represents the type of shape. Within this layer, you can also provide additional details to control how R draws your shapes. Further, you can add more layers to include other spatial objects and their corresponding shapes on your map. Finally, layout options can be specified through a layout layer, allowing you to customise the overall appearance and arrangement of your map.

Let us build a map using tmap:

R code
# shape, polygons
tm_shape(msoa21) + tm_polygons()

Figure 4: Building up a map layer by layer.

As you can now see, we have mapped the spatial polygons of our msoa21 spatial dataframe. However, this is not quite the map we want; we need a choropleth map where the polygons are coloured based on the proportion of European immigrants. To achieve this, we use the col parameter within the tm_polygons() function.

The col parameter within tm_polygons() allows you to fill polygons with colours based on:

  • A single colour value (e.g. red or #fc9272).
  • The name of a data variable within the spatial data file. This variable can either contain specific colour values or numeric/categorical values that will be mapped to a colour palette.

Let us go ahead and pass our prop_eur21 variable within the col() parameter and see what we get:

R code
# shape, polygons
tm_shape(msoa21) +
  # specify column
  tm_polygons(
    col = "prop_eur21"
  )

Figure 5: Building up a map layer by layer.

We are making progress, but there are two immediate issues with our map. First, the classification breaks do not adequately reflect the variation in our dataset. By default, tmap uses pretty breaks, which may not be the most effective for our data. An alternative, such as natural breaks (or jenks), might better reveal the data’s variation.

To customise the classification breaks, refer to the tm_polygons() documentation. The following parameters are relevant:

Parameter Description
n Specifies the number of classification breaks.
style Defines the method for classification breaks, such as fixed, standard deviation, equal, or quantile.
breaks Allows you to set specific numeric breaks when using the fixed style.

For example, if we want to adjust our choropleth map to use five classes determined by the natural breaks method, we need to add the n and style parameters to our tm_polygons() layer:

R code
# shape, polygons
tm_shape(msoa21) +
  # specify column, classes
  tm_polygons(
    col = "prop_eur21",
    n = 5,
    style = "jenks"
  )

Figure 6: Building up a map layer by layer.

1.4 Styling spatial data

Styling a map in tmap requires a deeper understanding and familiarity with the library, which is something you will develop best through hands-on practice. Here are the key functions to be aware of:

Function Description
tm_layout() Customise titles, fonts, legends, and other layout elements.
tm_compass() Add and style a North arrow or compass.
tm_scale_bar() Add and style a scale bar.

To begin styling your map, explore each of these functions and their parameters. Through trial and error, you can tweak and refine the map until you achieve the desired look:

R code
# shape, polygons
tm_shape(msoa21) +

  # specify column, classes, labels, title
  tm_polygons(
    col = "prop_eur21", n = 5, style = "jenks",
    border.col = "#ffffff",
    border.alpha = 0.3,
    palette = c("#feebe2", "#fbb4b9", "#f768a1", "#c51b8a", "#7a0177"),
    labels = c("Smallest share", "2nd smallest", "3rd smallest", "4th smallest", "Largest share"),
    title = "Share of population",
    textNA = "No population"
  ) +

  # set layout
  tm_layout(
    main.title = "Share of population born in Europe",
    main.title.size = 0.9,
    main.title.position = c("right", "top"),
    legend.outside = FALSE,
    legend.position = c("right", "top"),
    legend.title.size = 0.7,
    legend.title.fontface = "bold",
    legend.text.size = 0.5,
    frame = FALSE,
    inner.margins = c(0.05, 0.05, 0.05, 0.05),
    fontfamily = "Helvetica"
  ) +

  # add North arrow
  tm_compass(
    type = "arrow",
    position = c("left", "top"),
    size = 1,
    text.size = 0.7
  ) +

  # add scale bar
  tm_scale_bar(
    breaks = c(0, 5, 10, 15, 20),
    position = c("right", "bottom"),
    text.size = 0.4
  )

Figure 7: Building up a map layer by layer.

We can also have some map labels, if we want, by extracting centroids from selected polygons and adding these as separate map layer:

R code
# map labels
lab <- msoa21 |>
  filter(msoa21cd == "E02000642" | msoa21cd == "E02000180") |>
  st_centroid()
Warning: st_centroid assumes attributes are constant over geometries
# map object
lon_eurpop <-
  # shape, polygons
  tm_shape(msoa21) +

  # specify column, classes, labels, title
  tm_polygons(
    col = "prop_eur21", n = 5, style = "jenks",
    border.col = "#ffffff",
    border.alpha = 0.3,
    palette = c("#feebe2", "#fbb4b9", "#f768a1", "#c51b8a", "#7a0177"),
    labels = c("Smallest share", "2nd smallest", "3rd smallest", "4th smallest", "Largest share"),
    title = "Share of population",
    textNA = "No population"
  ) +

  # label centroids
  tm_shape(lab) +

  # add points
  tm_dots(size = 0.4, col = "#000000") +

  # add labels
  tm_text(text = "msoa21nm", xmod = 0, ymod = -0.6, col = "#000000", size = 0.8) +

  # set layout
  tm_layout(
    main.title = "Share of population born in Europe",
    main.title.size = 0.9,
    main.title.position = c("right", "top"),
    legend.outside = FALSE,
    legend.position = c("right", "top"),
    legend.title.size = 0.7,
    legend.title.fontface = "bold",
    legend.text.size = 0.5,
    frame = FALSE,
    inner.margins = c(0.05, 0.05, 0.05, 0.05),
    fontfamily = "Helvetica"
  ) +

  # add North arrow
  tm_compass(
    type = "arrow",
    position = c("left", "top"),
    size = 1,
    text.size = 0.7
  ) +

  # add scale bar
  tm_scale_bar(
    breaks = c(0, 5, 10, 15, 20),
    position = c("right", "bottom"),
    text.size = 0.4
  ) +

  # add credits
  tm_credits("Data source: Census 2021, Office for National Statistics",
    fontface = "italic",
    position = c("left", "bottom"),
    size = 0.4
  )

# plot
lon_eurpop

Figure 8: Building up a map layer by layer.

In the code above, we stored the full map definition as an object. This makes it easy to export the map and save it as a .jpg, .png or .pdf file:

R code
# write map
tmap_save(tm = lon_eurpop, filename = "london-european-population.jpg", width = 15,
    height = 15, units = c("cm"))

1.5 Assignment

Now that we have prepared our dataset and created our initial maps in R, we can also try and map the distribution of the proportion of European immigrants across Wales and experiment with different mapping parameters. Follow these steps:

  1. Download the two datasets provided below and save them in the appropriate subfolder within your data directory. The datasets include:
    • A csv file containing the number of people residing in Wales that are born in a European country, as recorded in the 2021 Census for England and Wales, aggregated at the MSOA level.
    • A GeoPackage file containing the 2021 MSOA spatial boundaries for England and Wales.
  2. Load both datasets and merge them together. Make sure to only retain those MSOAs that belong to Wales.
  3. Create a map that shows the proportion of the population residing in Wales that is born in Europe.
  4. Experiment by adjusting various map parameters, such as the colour scheme, map labels, and data classification method.
File Type Link
Wales MSOA Census 2021 European Population csv Download
England and Wales MSOA 2021 Spatial Boundaries GeoPackage Download

1.6 Before you leave

And that is how you use R as a GIS in its most basic form. More RGIS in the coming weeks, but this concludes the tutorial for this week.