4 Spatial Autocorrelation

This week, we will explore the concept of spatial dependence: the idea that the value of a variable at one location is influenced by the value of the same variable at nearby locations. This dependence can be statistically measured by assessing spatial autocorrelation, which refers to the degree of similarity between values of a variable at different locations or between multiple variables at the same location.

4.1 Lecture slides

You can download the slides of this week’s lecture here: [Link].

4.2 Reading list

Essential readings

  • Griffith, D. 2017. Spatial Autocorrelation. The Geographic Information Science & Technology Body of Knowledge. [Link]
  • Gimond, M. 2023. Intro to GIS and spatial analysis. Chapter 13: Spatial autocorrelation. [Link]
  • Livings, M. and Wu, A-M. 2020. Local Measures of Spatial Association. The Geographic Information Science & Technology Body of Knowledge. [Link]
  • Su, R. Newsham, N., and Dodge, S. 2024. Spatiotemporal dynamics of ethnoracial diversity and segregation in Los Angeles County: Insights from mobile phone data. Computers, Environment and Urban Systems 114: 102203. [Link]

Suggested readings

  • Lee, S. 2019. Uncertainty in the effects of the modifiable areal unit problem under different levels of spatial autocorrelation: a simulation study. International Journal of Geographical Information Science 33: 1135-1154. [Link]
  • Harris, R. 2020. Exploring the neighbourhood-level correlates of Covid-19 deaths in London using a difference across spatial boundaries method. Health & Place 66: 102446. [Link]

4.3 Population groups in London

This week, we will investigate to what extent people in London who self-identified as Asian-Bangladeshi in the 2021 Census are clustered in London at the LSOA-level. The data covers all usual residents, as recorded in the 2021 Census for England and Wales, aggregated at the Lower Super Output Area (LSOA) level.

An LSOA is a geographic unit used in the UK for statistical analysis. It typically represents small areas with populations of around 1,000 to 3,000 people and is designed to ensure consistent data reporting. LSOAs are commonly used to report on census data, deprivation indices, and other socio-economic statistics.

The data has been extracted using the Custom Dataset Tool and subsequently processed to include only the proportion of individuals who self-identify as belonging to one of the Asian groups defined in the Census. Along with this dataset, we also have access to a GeoPackage that contains the LSOA boundaries.

You can download both files below and save them in your project folder under data/attributes and data/spatial, respectively.

File Type Link
London LSOA Census 2021 Asian Population csv Download
London LSOA 2021 Spatial Boundaries GeoPackage Download

Open a new script within your GEOG0030 project and save this as w04-spatial-autocorrelation.r.

Begin by loading the necessary libraries:

R code
# load libraries
library(tidyverse)
library(sf)
library(tmap)
library(spdep)

You may have to install some of these libraries if you have not used these before.

Once downloaded, we can load both files into memory:

R code
# read spatial dataset
lsoa21 <- st_read("data/spatial/London-LSOA-2021.gpkg")
Reading layer `London-LSOA-2021' from data source 
  `/Users/justinvandijk/Library/CloudStorage/Dropbox/UCL/Web/jtvandijk.github.io/GEOG0030/data/spatial/London-LSOA-2021.gpkg' 
  using driver `GPKG'
Simple feature collection with 4994 features and 8 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 503574.2 ymin: 155850.8 xmax: 561956.7 ymax: 200933.6
Projected CRS: OSGB36 / British National Grid
# load ethnicity data
lsoa_eth <- read_csv("data/attributes/London-LSOA-Asian.csv")
Rows: 4994 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): lower_layer_super_output_areas_code
dbl (5): asian_asian_british_or_asian_welsh_bangladeshi, asian_asian_british...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# inspect
head(lsoa21)
Simple feature collection with 6 features and 8 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 531948.3 ymin: 180733.9 xmax: 545296.2 ymax: 184700.6
Projected CRS: OSGB36 / British National Grid
   lsoa21cd                  lsoa21nm  bng_e  bng_n      long      lat
1 E01000001       City of London 001A 532123 181632 -0.097140 51.51816
2 E01000002       City of London 001B 532480 181715 -0.091970 51.51882
3 E01000003       City of London 001C 532239 182033 -0.095320 51.52174
4 E01000005       City of London 001E 533581 181283 -0.076270 51.51468
5 E01000006 Barking and Dagenham 016A 544994 184274  0.089317 51.53875
6 E01000007 Barking and Dagenham 015A 544187 184455  0.077763 51.54058
                                globalid pop2021                           geom
1 {1A259A13-A525-4858-9CB0-E4952BA01AF6}    1473 MULTIPOLYGON (((532105.3 18...
2 {1233E433-0B0D-4807-8117-17A83C23960D}    1384 MULTIPOLYGON (((532634.5 18...
3 {5163B7CB-4FFE-4F41-95B9-AA6CFC0508A3}    1613 MULTIPOLYGON (((532135.1 18...
4 {2AF8015E-386E-456D-A45A-D0A223C340DF}    1101 MULTIPOLYGON (((533808 1807...
5 {B492B45E-175E-4E77-B0B5-5B2FD6993EF4}    1842 MULTIPOLYGON (((545122 1843...
6 {4A374975-B1D0-40CE-BF6E-6305623E5F7E}    2904 MULTIPOLYGON (((544180.3 18...
# inspect
head(lsoa_eth)
# A tibble: 6 × 6
  lower_layer_super_output_areas…¹ asian_asian_british_…² asian_asian_british_…³
  <chr>                                             <dbl>                  <dbl>
1 E01000001                                       0.00271                0.0448 
2 E01000002                                       0.00505                0.0736 
3 E01000003                                       0.00682                0.0323 
4 E01000005                                       0.239                  0.0318 
5 E01000006                                       0.116                  0.00596
6 E01000007                                       0.113                  0.0148 
# ℹ abbreviated names: ¹​lower_layer_super_output_areas_code,
#   ²​asian_asian_british_or_asian_welsh_bangladeshi,
#   ³​asian_asian_british_or_asian_welsh_chinese
# ℹ 3 more variables: asian_asian_british_or_asian_welsh_indian <dbl>,
#   asian_asian_british_or_asian_welsh_pakistani <dbl>,
#   asian_asian_british_or_asian_welsh_other_asian <dbl>

You can inspect both objects using the View() function.

You will notice is that the column names are rather long, so let us rename the columns for easier reference.

R code
# rename columns
names(lsoa_eth) <- c("lsoa21cd", "asian_bangladeshi", "asian_chinese", "asian_indian",
    "asian_pakistani", "asian_other")

4.3.1 Spatial dependency

As you should know by now, the first step when working with spatial data is to create a map:

R code
# join attribute data onto spatial data
lsoa21 <- lsoa21 |>
  left_join(lsoa_eth, by = c("lsoa21cd" = "lsoa21cd"))

# shape, polygons
tm_shape(lsoa21) +

  # specify column, colours
  tm_polygons(
    col = "asian_bangladeshi",
    palette = c("#f0f9e8", "#bae4bc", "#7bccc4", "#43a2ca", "#0868ac"),
    border.col = "#ffffff",
    border.alpha = 0.1,
    title = "Proportion"
  ) +

  # set layout
  tm_layout(
    legend.outside = FALSE,
    legend.position = c("left", "bottom"),
    frame = FALSE
  )

Figure 1: Proportions of people that self-identify as Asian-Bangladeshi.

Looking at the map, the geographical patterning of the percentage of the population that self-identify as Asian-Bangladeshi appears to be neither random nor uniform, with a tendency for similar values to be found in some neighbourhoods in East London. Let us compare our map to a map with the same values which have been randomly permutated:

R code
# seed for reproducibility of random permutation
set.seed(99)

# random permutation
lsoa21 <- lsoa21 |>
  mutate(asian_bangladeshi_random = sample(lsoa21$asian_bangladeshi, replace = FALSE))

# shape, polygons
tm_shape(lsoa21) +

  # specify column, colours
  tm_polygons(
    col = "asian_bangladeshi_random",
    palette = c("#f0f9e8", "#bae4bc", "#7bccc4", "#43a2ca", "#0868ac"),
    border.col = "#ffffff",
    border.alpha = 0.1,
    title = "Proportion"
  ) +

  # set layout
  tm_layout(
    legend.outside = FALSE,
    legend.position = c("left", "bottom"),
    frame = FALSE
  )

Figure 2: Proportions of people that self-identify as Asian-Bangladeshi with randomly permutated values.

Looking at Figure 2, even with the values being randomly permuted, certain patterns seem to emerge. This observation raises an important question: to what extent are the patterns that we see in the actual data actually present? A widely used method to quantify the similarity between neighbouring locations is by calculating Moran’s I statistic. This measure assesses spatial autocorrelation, indicating the degree to which values of a variable cluster spatially — either through similar (positive spatial autocorrelation) or contrasting values (negative spatial autocorrelation).

Underlying our Moran’s I test is the concept of a spatial lag. A spatial lag refers to a concept in spatial analysis where the value of a variable at a given location is influenced by the values of the same variable at neighboring locations. Essentially, it captures the idea that observations in close proximity are likely to be correlated, meaning that what happens in one area can ‘lag’ into or affect nearby areas. The Moran’s I statistic tries to capture the relationship between a value and its spatial lag. An Ordinary Least Squares (OLS) regression is applied, after both variables have been transformed to z-scores, to fit the data and produce a slope, which determines the Moran’s I statistic.

Figure 3: Scatter plot of spatially lagged income (neighboring income) versus each areas income. Source: Manuel Gimond.

Moran’s I values typically range from \(-1\) to \(1\):

  • +1: Indicates perfect positive spatial autocorrelation. High values cluster near other high values, and low values near other low values.
  • 0: Suggests no spatial autocorrelation, meaning the spatial distribution of the variable is random.
  • -1: Indicates perfect negative spatial autocorrelation. High values cluster near low values, and vice versa (a checkerboard pattern).

There are two approaches to estimating the significance of the Moran’s I statistic: an analytical method and a computational method. The analytical method relies on assumptions about the data, such as normality, which can sometimes limit its reliability. In contrast, the computational method, which is preferred here, does not make such assumptions and offers a more flexible and robust evaluation of significance.

The computational approach is based on a repeated random permutation of the observed values. The Moran’s I statistic is then calculated for each of these randomly reshuffled data sets, generating a reference distribution. By comparing the observed Moran’s I value to this reference distribution, we can assess whether our observed statistic is typical or an outlier and calculate a psuedo \(p\)-value (see Figure 4). If the observed Moran’s I value is an outlier, meaning it falls outside the range expected from random data distribution, it suggests a significant degree of clustering in the data.

Figure 4: Determining significance using a Monte Carlo simulation. Source: Manuel Gimond.

We can derive a pseudo-\(p\) value from these simulation results as follows:

\[ \frac{N_{extreme} + 1}{N + 1} \]

where \({N_{extreme}}\) is the number of simulated Moran’s I values that were more extreme than our observed statistic and \({N}\) is the total number of simulations. In the example shown in Figure 4, only 1 out the 199 simulations was more extreme than the observed local Moran’s I statistic. Therefore \({N_{extreme}}\) = 1 , so \(p\) is equal to \((1+1) / (199 + 1) = 0.01\). This means that there is a one percent probability that we would be wrong in rejecting the null hypothesis of spatial randomness.

4.3.2 Defining neighbours

If the purpose of a Moran’s I test is to quantify how similar places are to their neighbours, the first step is to define what constitutes a neighbour. This definition is not necessarily straightforward, because ‘neighbouring’ observations can be determined in various ways, based on either geometry or proximity. The most common methods include:

Type Description
Contiguity Spatial units are considered neighbours if their polygon boundaries touch.
Fixed Distance Spatial units are considered neighbours if they fall within a specified distance.
Nearest Neighbours Spatial units are considered neighbours if they are among the closest neighbours.

To capture this information, we need to formalise the spatial relationships within our data by constructing a spatial weights matrix (\(W_{ij}\)). This matrix defines which units are neighbours based on our chosen criteria.

In the following example, neighbours are defined as places that share a border (i.e., they are contiguous). Currently, it is sufficient for them to meet at a single point — so if two places are triangular, touching corners would count them as neighbours. If, however, you require them to share an edge, rather than just a corner, you can modify the default argument by setting queen = FALSE.

R code
# create neighbour list
lsoa21_nb <- poly2nb(lsoa21, queen = TRUE)

# inspect
summary(lsoa21_nb)
Neighbour list object:
Number of regions: 4994 
Number of nonzero links: 29472 
Percentage nonzero weights: 0.1181714 
Average number of links: 5.901482 
Link number distribution:

   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
   3   19  185  670 1244 1331  841  404  185   70   21   13    3    1    2    1 
  20 
   1 
3 least connected regions:
2329 3915 4440 with 1 link
1 most connected region:
4990 with 20 links

The neighbour list object is a sparse matrix that lists the neighboring polygons for each LSOA. This matrix represents the spatial relationships between LSOAs, where each entry indicates which polygons share boundaries. These neighborhood relationships can be visualised as a graph by extracting the coordinate points of the centroids of the polygons representing each LSOA:

Regardless of the neighborhood definition you choose, it is important to verify the results, particularly when using contiguity-based approaches. If your spatial file has issues such as polygons that appear adjacent but do not actually share a border, your results may be inaccurate. You could increase the default value of the snap distance parameter in the poly2nb() function to include these polygons only separated by small gaps.

R code
# extract centroids from polygons
lsoa21_cent <- st_centroid(lsoa21, of_largest_polygon = TRUE)
Warning: st_centroid assumes attributes are constant over geometries
# plot graph
par(mai = c(0, 0, 0, 0))
plot(st_geometry(lsoa21), border = "#cccccc")
plot(lsoa21_nb, st_geometry(lsoa21_cent), add = TRUE)

Figure 5: Neighbourhood graph using queen contiguity.

With nearly 5,000 LSOAs, the neighbourhood graph appears quite crowded. However, it seems acceptable, with no noticeable gaps and a dense network of neighbours in Central London, where many smaller LSOAs are located.

4.3.3 Defining weights

The neighbourhood list simply identifies which areas (polygons) are neighbours, but spatial weights take this a step further by assigning a weight to each neighbourhood connection. This is important because not all polygons have the same number of neighbours. To ensure that our spatially lagged values are comparable across neighbourhoods of different sizes, standardisation is required. The code below uses style = 'W' to row-standardise the values: if an LSOA has five neighbours, the value of the spatially lagged variable will be the average of that variable across those five neighbours, with each neighbour receiving equal weight.

R code
# create spatial weights matrix
lsoa21_nb_weights <- lsoa21_nb |>
    nb2listw(style = "W")

# inspect - neigbhours of polygon '10'
lsoa21_nb_weights$neighbours[[10]]
[1]    6    7    9 3557 3559 3560 3561
# inspect - weights of neighbours of polygon '10'
lsoa21_nb_weights$weights[[10]]
[1] 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571

Not all places have neighbours. Islands, by definition, will not be considered as neighbours using a contiguity approach. If you attempt to create spatial weights using the nb2listw() function with a neighbourhood list that includes places without neighbours, you will encounter an error message. Potential solutions include using a different neighbourhood definition (e.g. \(k\)-nearest neighbours) or manually editing the neighbourhood file if you wish to include these polygons. Alternatively, you can leave it as is but then you must specify the argument zero.policy = TRUE in nb2listw() to allow for empty sets.

4.3.4 Global Moran’s I

Now that everything is in place, we can begin by plotting the proportion of people without schooling against the spatially lagged values:

R code
# moran's plot
moran.plot(lsoa21$asian_bangladeshi, listw = lsoa21_nb_weights, xlab = "Variable: Asian-Bangladeshi",
    ylab = "Spatially Lagged Variable: Asian Bangladeshi")

Figure 6: Plot of lagged values versus polygon values.

We observe a positive relationship between our asian_bangladeshi variable and the spatially lagged values, suggesting that our global Moran’s I test will likely yield a statistic reflective of the slope visible in the scatter plot.

R code
# moran's test
moran <- moran.mc(lsoa21$asian_bangladeshi, listw = lsoa21_nb_weights, nsim = 999)

# results
moran

    Monte-Carlo simulation of Moran I

data:  lsoa21$asian_bangladeshi 
weights: lsoa21_nb_weights  
number of simulations + 1: 1000 

statistic = 0.84992, observed rank = 1000, p-value = 0.001
alternative hypothesis: greater

The results of the Monte Carlo simulation, visualised in Figure 7, suggest that there is statistically significant positive autocorrelation in our variable. This indicates that LSOAs with higher percentages of people that self-identify as Asian-Bangladeshi tend to be surrounded by other LSOAS with similarly high percentages. Likewise, LSOAs with lower percentages of people that self-identify as Asian Bangladeshi are generally surrounded by LSOAs with similarly low values.

R code
# permutation distribution
plot(moran, main = "", xlab = "Variable: Asian-Bangladeshi")

Figure 7: Density plot of permutation outcomes.

4.3.5 Local Moran’s I

Although we have established that there is positive spatial autocorrelation in our data, we still need to identify the specific spatial patterns. Looking back at Figure 3, you will notice that the plot is divided into four quadrants.

Quadrant Description
Top-right quadrant This area represents LSOAs that have a higher-than-average share of the population without schooling and are surrounded by other LSOAs with similarly high shares of the population without schooling. These are known as high-high clusters.
Bottom-left quadrant This area represents LSOAs with a lower-than-average share of the population without schooling, surrounded by other LSOAs with similarly low shares. These are low-low clusters.
Top-left quadrant LSOAs with a higher-than-average share of the population without schooling surrounded by LSOAs with a lower-than-average share. These are high-low clusters.
Bottom-right quadrant LSOAs with a lower-than-average share of the population without schooling surrounded by LSOAs with a higher-than-average share. These are low-high clusters.

We can show these area on a map by deconstructing the Moran’s I into a series of local Moran values, each measuring how similar each place is (individually) to its neighbours.

R code
# local moran's test
lmoran <- localmoran_perm(lsoa21$asian_bangladeshi, listw = lsoa21_nb_weights, nsim = 999)

# results
head(lmoran)
          Ii          E.Ii     Var.Ii      Z.Ii Pr(z != E(Ii))
1 0.15069391 -0.0032726893 0.03484206 0.8248493   4.094572e-01
2 0.13225460 -0.0016613502 0.03507478 0.7150473   4.745798e-01
3 0.09935974 -0.0009791708 0.02573120 0.6255173   5.316316e-01
4 4.20123420 -0.0098905991 1.02079945 4.1680018   3.072815e-05
5 2.21814457  0.0064109676 0.29187729 4.0938570   4.242561e-05
6 1.07427412  0.0125602786 0.17292266 2.5531805   1.067442e-02
  Pr(z != E(Ii)) Sim Pr(folded) Sim  Skewness Kurtosis
1              0.130          0.065 -1.903886 3.970373
2              0.346          0.173 -1.949852 4.028935
3              0.650          0.325 -1.874507 4.017294
4              0.010          0.005  1.618687 2.813289
5              0.010          0.005  2.054221 4.456165
6              0.060          0.030  1.563633 2.379143

We are not given a single statistic as we did with our global Moran’s I, but rather we get a table of different statistics that are all related back to each of the LSOAs in our dataset. If we refer to the help page for the localmoran() function, we can find detailed explanations of these statistics. The most relevant ones include:

Name Description
Ii Local Moran’s I statistic.
E.Ii Expectation (mean) of the local Moran’s I statistic.
Var.Ii Variance of local Moran’s I statistic
Z.Ii Standard deviation (z-score) of the local Moran’s I statistic.
Pr() Pseudo \(p\)-value of local Moran’s I statistic based on standard deviations and means from the permutation sample.
Pr() Sim Pseudo \(p\)-value of local Moran’s I statistic based on the rank within the permutation sample, assuming a uniform distribution.
Pr(Folded) Sim Pseudo \(p\)-value of local Moran’s I statistic based on the rank within the permutation sample using a one-sided test, assuming a uniform distribution.

We can further extract the quadrants to which of all these polygons have been assigned:

R code
# extract quadrants
lmoran_quadrants <- attr(lmoran, "quadr")

# inspect
head(lmoran_quadrants)
       mean    median     pysal
1   Low-Low   Low-Low   Low-Low
2   Low-Low   Low-Low   Low-Low
3   Low-Low  Low-High   Low-Low
4 High-High High-High High-High
5 High-High High-High High-High
6 High-High High-High High-High

We can now link these values back to our spatial dataframe and make a map using the tmap library:

R code
# replace names
names(lmoran_quadrants) <- c("lmoran_mean", "lmoran_median", "lmoran_pysal")

# bind results
lsoa21 <- lsoa21 |>
  cbind(lmoran_quadrants)

# shape, polygons
tm_shape(lsoa21) +

  # specify column, colours
  tm_polygons(
    col = "lmoran_mean",
    border.col = "#ffffff",
    border.alpha = 0.3,
    palette = c(
      "Low-Low" = "#0571b0",
      "Low-High" = "#92c5de",
      "High-Low" = "#f4a582",
      "High-High" = "#ca0020"
    ),
    title = "Cluster type",
  ) +

  # set layout
  tm_layout(
    legend.outside = FALSE,
    legend.position = c("left", "bottom"),
    frame = FALSE
  )

Figure 8: Mapping the Local Moran’s I clusters.

This type of map is called a LISA map and is a great way of showing how a variable is actually clustering over space. However, we can improve on this further by only mapping the statistically significant clusters:

R code
# replace values if not significant
lmoran_quadrants[lmoran[, 6] > 0.05, ] <- NA

# replace names
names(lmoran_quadrants) <- c("lmoran_mean_sig", "lmoran_median_sig", "lmoran_pysal_sig")

# bind results
lsoa21 <- lsoa21 |>
  cbind(lmoran_quadrants)

# shape, polygons
tm_shape(lsoa21) +

  # specify column, colours
  tm_polygons(
    col = "lmoran_mean_sig",
    border.col = "#ffffff",
    border.alpha = 0.3,
    palette = c(
      "Low-Low" = "#0571b0",
      "Low-High" = "#92c5de",
      "High-Low" = "#f4a582",
      "High-High" = "#ca0020"
    ),
    title = "Cluster type"
  ) +

  # set layout
  tm_layout(
    legend.outside = FALSE,
    legend.position = c("left", "bottom"),
    frame = FALSE
  )

Figure 9: Mapping the significant Local Moran’s I clusters.

This new map may still not fully address the issue of statistical significance due to repeated testing, and some values may appear significant purely by chance. To correct for this, you can adjust the \(p\)-values using R’s p.adjust() function. For further details, refer to Manual Gimond’s explanation of the multiple comparison problem in the context of the pseudo-\(p\) values.

4.4 Assignment

Any statistic that includes spatial weights is dependent upon how those weights are defined. We have so far used first order contiguity, i.e. polygons that share a boundary, but there is no particular reason why we should not include second order contiguity polygons (i.e. neighbours of neighbours), use a fixed distance neighbours definitions, or adopt a \(k\) nearest neighbours definition. Try to do the following:

  1. Extract the centroids from the lsoa21 file.
  2. Identify the 5 nearest neighbours for each LSOA, using the knearneigh() function.
  3. Create a neigbhour list of these nearest neighbours, using the knn2nb() function.
  4. Compute the Global Moran’s I of the asian_indian variable using this new neighbourhood definition.
  5. Map the statistically significant clusters of Local Moran’s I based on this new neighbourhood definition.
  6. Compare these results to the output of our asian_bangladeshi values. Do both variables exhibit clustering? Are the clusters located in similar areas?

4.5 Before you leave

And that is how you can measure spatial dependence in your dataset through different spatial autocorrelation measures. Next week we will focus on the last topic within our set of core spatial analysis methods and techniques, but this week we have covered enough! Probably time to get back to that pesky reading list.