5 Spatial Models

Last week, we explored spatial dependency and methods for measuring it. This week, we will examine how spatial autocorrelation can impact our analyses, especially in models where independence assumptions are crucial. To address these dependencies, we will use the GWmodel library to calculate local summary statistics for geographical areas and to fit basic spatially explicit models to our data.

5.1 Lecture slides

You can download the slides of this week’s lecture here: [Link].

5.2 Reading list

Essential readings

  • Harris, R. 2019. Chapter 8: Not just nuisance: Spatialising social statistics. In: Whitworth, A. (ed). Towards a Spatial Social Policy: Bridging the Gap Between Geography and Social Policy. Bristol: Policy Press. [Link]
  • Franklin, R. 2022. Quantitative methods I: Reckoning with uncertainty. Progress in Human Geography 46(2): 689-697. [Link]
  • Franklin, R. 2023. Quantitative methods II: Big theory. Progress in Human Geography 47(1): 178-186. [Link]

Suggested readings

  • Brunsdon, C., Fotheringham, M., and Charlton, M. 2002. Geographically Weighted Regression. Journal of the Royal Statistical Society: Series D (The Statistician) 47(3): 431-443. [Link]
  • Comber, A., Brunsdon, C., Charlton, M. et al.. 2022. A route map for successful applications of Geographically Weighted Regression. Geographical Analysis 55(1): 155-178. [Link]
  • Hochstenbach, C., Howard, A., and Arundel, R. 2024. Tijdschrift voor Economische en Sociale Geografie. [Link]

5.3 Elections results in England and Wales

This week we will investigate the political geography of England and Wales, focusing on the results of the July 2024 General Election, which was won by the Labour Party led by Keir Starmer. You will work with data extracted from two data sources: the constituency results from the election and socio-demographic information relating to age groups, economic status, and ethnic background from the 2021 Census, extracted using the Custom Dataset Tool. These datasets have been prepared and merged. Along with this dataset, we also have access to a GeoPackage that contains the boundaries of the parliamentary constituencies.

You can download both files below and save them in your project folder under data/attributes and data/spatial, respectively.

File Type Link
England and Wales Parliamentary Constituencies GE2024 csv Download
England and Wales Parliamentary Constituencies Boundaries GeoPackage Download

Open a new script within your GEOG0030 project and save this as w05-election-analysis.r.

Begin by loading the necessary libraries:

R code
# load libraries
library(tidyverse)
library(sf)
library(tmap)
library(GWmodel)
library(easystats)
library(spdep)

You may have to install some of these libraries if you have not used these before.

Once downloaded, we can load both files into memory:

R code
# load election dataset
elec_24 <- read_csv("data/attributes/England-Wales-GE2024-Constituency-Vars.csv")
Rows: 575 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): constituency_code, constituency_name, region_name, winning_party, ...
dbl (22): eligible_voters, valid_votes, conservative_votes, labour_votes, li...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# load constituency boundaries
cons_24 <- st_read("data/spatial/England-Wales-GE2024-Boundaries.gpkg")
Reading layer `England-Wales-GE2024-Boundaries' from data source 
  `/Users/justinvandijk/Library/CloudStorage/Dropbox/UCL/Web/jtvandijk.github.io/GEOG0030/data/spatial/England-Wales-GE2024-Boundaries.gpkg' 
  using driver `GPKG'
Simple feature collection with 575 features and 8 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 85713.71 ymin: 7054.1 xmax: 655644.8 ymax: 657536.5
Projected CRS: OSGB36 / British National Grid
# inspect
head(elec_24)
# A tibble: 6 × 28
  constituency_code constituency_name  region_name winning_party eligible_voters
  <chr>             <chr>              <chr>       <chr>                   <dbl>
1 W07000081         Aberafan Maesteg   Wales       Lab                     72580
2 E14001063         Aldershot          South East  Lab                     78553
3 E14001064         Aldridge-Brownhil… West Midla… Con                     70268
4 E14001065         Altrincham and Sa… North West  Lab                     74025
5 W07000082         Alyn and Deeside   Wales       Lab                     75790
6 E14001066         Amber Valley       East Midla… Lab                     71546
# ℹ 23 more variables: valid_votes <dbl>, conservative_votes <dbl>,
#   labour_votes <dbl>, libdem_votes <dbl>, conservative_vote_share <dbl>,
#   labour_vote_share <dbl>, libdem_vote_share <dbl>,
#   aged_15_years_and_under <dbl>, aged_16_to_24_years <dbl>,
#   aged_25_to_34_years <dbl>, aged_35_to_49_years <dbl>,
#   aged_50_years_and_over <dbl>, eco_not_applicable <dbl>,
#   eco_active_employed <dbl>, eco_active_unemployed <dbl>, …
# inspect
head(cons_24)
Simple feature collection with 6 features and 8 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 368283 ymin: 100922.9 xmax: 524969.7 ymax: 393552.1
Projected CRS: OSGB36 / British National Grid
   pcon24cd                 pcon24nm pcon24nmw  bng_e  bng_n     long      lat
1 E14001063                Aldershot           484716 155270 -0.78648 51.29027
2 E14001064      Aldridge-Brownhills           404720 301030 -1.93172 52.60704
3 E14001065 Altrincham and Sale West           374132 389051 -2.39049 53.39766
4 E14001066             Amber Valley           440478 349674 -1.39771 53.04282
5 E14001067  Arundel and South Downs           497309 118530 -0.61584 50.95798
6 E14001068                 Ashfield           450035 356564 -1.25410 53.10395
                                globalid                           geom
1 {3F742659-208E-4859-8905-1CE5E2B87AAC} MULTIPOLYGON (((485406.9 15...
2 {4668E55F-5248-45E4-9927-30562A6812D5} MULTIPOLYGON (((406519.1 30...
3 {56FDBCDF-25DA-4EE5-AB66-6FC9C71D2927} MULTIPOLYGON (((379104.1 39...
4 {9F9742E9-AD8C-4878-9841-1DF40BBA2619} MULTIPOLYGON (((444868.4 35...
5 {4AFD2380-5BB7-4A01-8712-8637117E96D4} MULTIPOLYGON (((523813.2 11...
6 {E0D1319E-FD1E-43AF-B809-79127FB22B22} MULTIPOLYGON (((455809 3579...

You can inspect both objects using the View() function.

5.3.1 Geographically weighted correlation

In GEOG0018 Research Methods in Human Geography, we already worked with this electoral dataset and we hypothesised that predominantly older voters tend to support the Conservative Party. To explore this, we examined the relationship between the proportion of individuals over 50 years old in each parliamentary constituency (aged_50_years_and_over) and the proportion of votes cast for the Conservative Party (conservative_vote_share). A scatterplot and Pearson’s correlation revealed a moderate association between the two variables, suggesting a possible link between age demographics and Conservative voting patterns:

R code
# correlation
cor.test(elec_24$aged_50_years_and_over, elec_24$conservative_vote_share, method = "pearson")

    Pearson's product-moment correlation

data:  elec_24$aged_50_years_and_over and elec_24$conservative_vote_share
t = 16.472, df = 573, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5086971 0.6199173
sample estimates:
      cor 
0.5668849 
# scatterplot
plot(elec_24$aged_50_years_and_over, elec_24$conservative_vote_share, xlab = "Proportion of population over 50 years old",
    ylab = "Proportion of votes for the Conservative party")

Figure 1: Quick scatterplot

However, as you should know by now, the first step when working with spatial data is to map your variables:

R code
# join attribute data onto spatial data
cons_24 <- cons_24 |>
  left_join(elec_24, by = c("pcon24cd" = "constituency_code"))

# shape, polygons
tm_shape(cons_24) +

  # specify column, colours
  tm_fill(
    col = "conservative_vote_share",
    style = "jenks",
    n = "5",
    palette = c("#f1eef6", "#bdc9e1", "#74a9cf", "#2b8cbe", "#045a8d"),
    title = "Conservative vote share"
  ) +

  # set layout
  tm_layout(
    legend.outside = FALSE,
    legend.position = c("left", "top"),
    frame = FALSE
  )

Figure 2: Proportions of votes cast for the Conservative party in the 2024 elections.
R code
# shape, polygons
tm_shape(cons_24) +

  # specify column, colours
  tm_fill(
    col = "aged_50_years_and_over",
    style = "jenks",
    n = "5",
    palette = c("#f0f0f0", "#d9d9d9", "#bdbdbd", "#969696", "#737373"),
    title = "50+ population share"
  ) +

  # set layout
  tm_layout(
    legend.outside = FALSE,
    legend.position = c("left", "top"),
    frame = FALSE
  )

Figure 3: Proportions of population over 50 years old.

Looking at the maps, the spatial distribution of Conservative vote share and the proportion of the population over 50 appears neither random nor uniform. This raises the question to what extent the observed correlation in Figure 1 holds consistently across different constituencies? To answer this we can run a Geographically Weighted Correlation using the GWmodel library.

Geographically Weighted Correlation (GWC) allows us to investigate whether the strength and direction of the association between variables vary across space. By applying a localised correlation technique, GWC calculates the correlation coefficient within specified spatial windows or kernels across the study area. This can reveal geographic areas where the relationship is stronger or weaker, providing more insight than a single global correlation and helping us to understand spatial heterogeneity in the data.

The GWmodel library allows us to calculate local statistics, such as means, standard deviations, variances, and correlations. But as with spatial autocorrelation, we must define what local means. One approach is to use a kernel function to select which values contribute to each local estimate. Kernels operate on point locations (e.g. polygon centroids) and apply a window with a specified shape and bandwidth. The bandwidth, which defines the kernel’s size, can be set in absolute terms (e.g. within 10 km) or in relative terms (e.g. the 10 nearest centroids), the latter known as an adaptive kernel.

The GWmodel library uses the older sp data format for handling spatial data. We therefore need to convert our current sf object to sp before continuing:

R code
# to sp
cons_24_sp <- as_Spatial(cons_24)

The sf package is now preferred over sp in R for its modern, efficient handling of spatial data. By using simple features, a standardised format for spatial geometries, sf is more compatible with other geospatial tools and integrates smoothly with the tidyverse, simplifying data manipulation and analysis. However, some libraries still rely on sp and have not yet transitioned to sf.

We can now calculate a geographically weighted correlation. We will use an adaptive bandwidth that estimates the local correlation using the 25 nearest constituencies:

R code
# geographically weighted correlation
cons_24_cor <- gwss(cons_24_sp, vars = c("conservative_vote_share", "aged_50_years_and_over"),
    bw = 25, adaptive = TRUE)

We can now extract the values and bind these back to our original sf object:

R code
# extract correlation
cons_24 <- cons_24 |>
    mutate(cons_age_cor = cons_24_cor$SDF$Corr_conservative_vote_share.aged_50_years_and_over)

# inspect
summary(cons_24$cons_age_cor)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.02121  0.57869  0.72341  0.68576  0.83984  0.95214 

The results of the outcomes of gwss function can be accessed through the $SDF data frame.

The summary shows that there indeed seems to be variation of this relationship across the country, with the local Pearson correlation ranging from very weak to very strong. Of course, we can map this in the usual way:

R code
# shape, polygons
tm_shape(cons_24) +

  # specify column, colours
  tm_fill(
    col = "cons_age_cor",
    style = "jenks",
    n = "5",
    palette = c("#fee5d9", "#fcae91", "#fb6a4a", "#de2d26", "#a50f15"),
    title = "Local correlation"
  ) +

  # set layout
  tm_layout(
    legend.outside = FALSE,
    legend.position = c("left", "top"),
    frame = FALSE
  )

Figure 4: Local correlation values conservative_vote_share and aged_50_years_and_over variables.

While the map shows correlation patterns, it doesn not indicate statistical significance. To evaluate this, we can use a Monte Carlo simulation with the gwss.montecarlo() function. However, for a large dataset like ours, this process is computationally intensive and time-consuming, especially with more than the default 99 simulations.

Although we will not run a Monte Carlo simulation here, the code below demonstrates how to run this and link the results back to the cons_24 spatial dataframe

R code
# geographically weighted correlation, monte carlo simulation
cons_24_cor_sig <- gwss.montecarlo(cons_24_sp, vars = c("conservative_vote_share",
    "aged_50_years_and_over"), bw = 25, adaptive = TRUE, nsim = 99) |>
    as_tibble() |>
    select(Corr_conservative_vote_share.aged_50_years_and_over)

# replace names
names(cons_24_cor_sig) <- "cons_age_cor_p"

# bind results
cons_24 <- cons_24 |>
    cbind(cons_24_cor_sig) |>
    mutate(cons_24_cor = if_else(cons_age_cor_p < 0.025, cons_age_cor_p, if_else(cons_age_cor_p >
        0.975, cons_age_cor_p, NA)))

It took more than 2 minutes to run the Monte Carlo simulation on an Apple MacBook Pro M1 (16GB RAM) with the default of 99 simulations.

5.3.2 Geographically weighted regression

A correlation describes the strength and direction of a linear relationship, but if we want to quantify the change in a dependent variable (y) for a one-unit change in the independent variable(s) (x) we need to run a regression. Geographically weighted regression (GWR) is used when the relationship between a dependent and set of independent variables is not constant across space, meaning the model coefficients vary by location. This is useful when you suspect that the relationship between variables may change depending on the geographic context. GWR provides a localised understanding of the relationships by allowing each observation to have its own set of regression coefficients, which can provide insights into how relationships differ across the study area.

GWR fits a separate regression equation at each location in the study area, weighting nearby observations more heavily than those farther away. Again, the weighting is typically based on a kernel function. The basic GWR equation is:

\[ y_{i} = \beta_{0}(\upsilon_{i}, v_{i}) + \sum_{k=1}^{p}\beta_{k}(\upsilon_{i}, v_{i})x_{ik} + \epsilon_{i} \]

where \((\upsilon_{i}, v_{i})\) are the coordinates of location \(i\) and \(\beta_{k}(\upsilon_{i}, v_{i})\) are the location-specific coefficients.

5.3.2.1 Multivariate regression

To predict the Conservative voter share, we hypothesise that certain socio-demographic variables may play a significant role. These independent variables include:

Variable Description
eco_active_employed Proportion of economically active employed individuals.
eco_active_unemployed Proportion of economically active unemployed individuals.
eth_white Proportion of the population identifying as white.
eth_black Proportion of the population identifying as black.
aged_25_to_34_years Proportion of the population between ages 25 and 34.
aged_50_years_and_over Proportion of the population over age 50.

We can run the multivariate regression as follows:

R code
# regression
lm_voteshare <- lm(conservative_vote_share ~ eco_active_employed + eco_active_unemployed +
    eth_white + eth_black + aged_25_to_34_years + aged_50_years_and_over, data = cons_24)

# summary
summary(lm_voteshare)

Call:
lm(formula = conservative_vote_share ~ eco_active_employed + 
    eco_active_unemployed + eth_white + eth_black + aged_25_to_34_years + 
    aged_50_years_and_over, data = cons_24)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.286730 -0.040843  0.001182  0.044496  0.220419 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            -0.16210    0.08779  -1.846  0.06535 .  
eco_active_employed     1.36140    0.13208  10.308  < 2e-16 ***
eco_active_unemployed  -0.32339    0.91429  -0.354  0.72369    
eth_white              -0.28028    0.03567  -7.857 1.97e-14 ***
eth_black              -0.29751    0.10190  -2.920  0.00364 ** 
aged_25_to_34_years    -1.52695    0.20954  -7.287 1.07e-12 ***
aged_50_years_and_over  0.59711    0.09827   6.076 2.26e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06797 on 568 degrees of freedom
Multiple R-squared:  0.5537,    Adjusted R-squared:  0.5489 
F-statistic: 117.4 on 6 and 568 DF,  p-value: < 2.2e-16

The model summary indicates that most variables are statistically significant predictors of Conservative vote share. The results suggest that older and employed populations are linked to higher Conservative support, while younger age groups and certain ethnic proportions show a negative relationship.

We can now run the diagnostics to see whether the regression assumptions are met:

R code
# run diagnostics
performance::check_model(lm_voteshare, check = c("qq", "normality"))

Figure 5: Check for normality of residuals.
R code
# run diagnostics
performance::check_model(lm_voteshare, check = c("linearity", "homogeneity"))

Figure 6: Check for linearity and homogeneity
R code
# run diagnostics
performance::check_model(lm_voteshare, check = c("vif", "outliers"))

Figure 7: Check for influential observations and multicollinearity.

5.3.2.2 Geographically weighted regression

The model diagnostics seem generally fine, perhaps except for the linearity check. This could indicate a curved relationship or, alternatively, suggest spatial dependence in the relationship. If spatial dependence is present, it may imply that the observations are not independent. We can examine this by assessing whether the model residuals display spatial dependence. Let us begin with a map:

R code
# join model residuals onto spatial data
cons_24 <- cons_24 |>
  mutate(lm_residuals = lm_voteshare$residuals)

# shape, polygons
tm_shape(cons_24) +

  # specify column, colours
  tm_fill(
    col = "lm_residuals",
    style = "jenks",
    n = "5",
    palette = c("#ffffb2", "#fecc5c", "#fd8d3c", "#f03b20", "#f03b20"),
    title = "Regression residuals"
  ) +

  # set layout
  tm_layout(
    legend.outside = FALSE,
    legend.position = c("left", "top"),
    frame = FALSE
  )

Figure 8: Regression model residuals.

Looking at Figure 8 there appears to be some spatial structure in the model residuals. We can test this more formally using Moran’s I test to assess spatial autocorrelation:

R code
# create neighbour list
cons_24_nb <- poly2nb(cons_24, queen = TRUE)
Warning in poly2nb(cons_24, queen = TRUE): some observations have no neighbours;
if this seems unexpected, try increasing the snap argument.
Warning in poly2nb(cons_24, queen = TRUE): neighbour object has 3 sub-graphs;
if this sub-graph count seems unexpected, try increasing the snap argument.
# create spatial weights matrix
cons_24_nb_weights <- cons_24_nb |>
    nb2listw(style = "W", zero.policy = TRUE)

# moran's test
moran <- moran.mc(cons_24$lm_residuals, listw = cons_24_nb_weights, nsim = 999)

# results
moran

    Monte-Carlo simulation of Moran I

data:  cons_24$lm_residuals 
weights: cons_24_nb_weights  
number of simulations + 1: 1000 

statistic = 0.4155, observed rank = 1000, p-value = 0.001
alternative hypothesis: greater

The Moran’s I test indicates significant spatial autocorrelation in our model residuals, suggesting that the observations are not independent. As a result, we can proceed with a Geographically Weighted Regression to account for spatial variation in the relationship between the variables.

We will first estimate the ‘optimal’ bandwidth using an automated bandwidth selection procedure:

R code
# bandwidth selection
cons_24_bw <- bw.gwr(conservative_vote_share ~ eco_active_employed + eco_active_unemployed +
    eth_white + eth_black + aged_25_to_34_years + aged_50_years_and_over, data = cons_24,
    adaptive = TRUE)
Adaptive bandwidth: 363 CV score: 2.208986 
Adaptive bandwidth: 232 CV score: 2.105893 
Adaptive bandwidth: 151 CV score: 1.973309 
Adaptive bandwidth: 100 CV score: 1.940199 
Adaptive bandwidth: 70 CV score: 2.008461 
Adaptive bandwidth: 120 CV score: 1.945603 
Adaptive bandwidth: 89 CV score: 1.942555 
Adaptive bandwidth: 108 CV score: 1.938444 
Adaptive bandwidth: 112 CV score: 1.942082 
Adaptive bandwidth: 104 CV score: 1.938601 
Adaptive bandwidth: 109 CV score: 1.940144 
Adaptive bandwidth: 106 CV score: 1.939139 
Adaptive bandwidth: 108 CV score: 1.938444 

The model can then be fitted.

R code
# run gwr
cons_24_gwr <- gwr.basic(conservative_vote_share ~ eco_active_employed + eco_active_unemployed +
    eth_white + eth_black + aged_25_to_34_years + aged_50_years_and_over, data = cons_24,
    adaptive = TRUE, bw = cons_24_bw)

# inspect
cons_24_gwr
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2025-02-11 09:27:19.19713 
   Call:
   gwr.basic(formula = conservative_vote_share ~ eco_active_employed + 
    eco_active_unemployed + eth_white + eth_black + aged_25_to_34_years + 
    aged_50_years_and_over, data = cons_24, bw = cons_24_bw, 
    adaptive = TRUE)

   Dependent (y) variable:  conservative_vote_share
   Independent variables:  eco_active_employed eco_active_unemployed eth_white eth_black aged_25_to_34_years aged_50_years_and_over
   Number of data points: 575
   ***********************************************************************
   *                    Results of Global Regression                     *
   ***********************************************************************

   Call:
    lm(formula = formula, data = data)

   Residuals:
      Min        1Q    Median        3Q       Max 
-0.286730 -0.040843  0.001182  0.044496  0.220419 

   Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
   (Intercept)            -0.16210    0.08779  -1.846  0.06535 .  
   eco_active_employed     1.36140    0.13208  10.308  < 2e-16 ***
   eco_active_unemployed  -0.32339    0.91429  -0.354  0.72369    
   eth_white              -0.28028    0.03567  -7.857 1.97e-14 ***
   eth_black              -0.29751    0.10190  -2.920  0.00364 ** 
   aged_25_to_34_years    -1.52695    0.20954  -7.287 1.07e-12 ***
   aged_50_years_and_over  0.59711    0.09827   6.076 2.26e-09 ***

   ---Significance stars
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
   Residual standard error: 0.06797 on 568 degrees of freedom
   Multiple R-squared: 0.5537
   Adjusted R-squared: 0.5489 
   F-statistic: 117.4 on 6 and 568 DF,  p-value: < 2.2e-16 
   ***Extra Diagnostic information
   Residual sum of squares: 2.624389
   Sigma(hat): 0.06767633
   AIC:  -1451.196
   AICc:  -1450.941
   BIC:  -1940.526
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: bisquare 
   Adaptive bandwidth: 108 (number of nearest neighbours)
   Regression points: the same locations as observations are used.
   Distance metric: Euclidean distance metric is used.

   ****************Summary of GWR coefficient estimates:******************
                                Min.    1st Qu.     Median    3rd Qu.   Max.
   Intercept               -1.089291  -0.441131  -0.160803   0.209568 1.1375
   eco_active_employed     -1.045077   0.151632   0.930912   1.498222 2.8325
   eco_active_unemployed  -10.747683  -4.837597   0.830042   3.163451 9.8531
   eth_white               -1.411777  -0.365168  -0.230988  -0.109584 0.1738
   eth_black               -3.276551  -0.308415  -0.121533   0.071703 1.1200
   aged_25_to_34_years     -3.659487  -1.575036  -1.288467  -0.342851 1.9350
   aged_50_years_and_over  -0.287576   0.503695   1.075497   1.383006 2.1875
   ************************Diagnostic information*************************
   Number of data points: 575 
   Effective number of parameters (2trace(S) - trace(S'S)): 103.6449 
   Effective degrees of freedom (n-2trace(S) + trace(S'S)): 471.3551 
   AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): -1622.807 
   AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): -1730.89 
   BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): -1880.236 
   Residual sum of squares: 1.444834 
   R-square value:  0.7542715 
   Adjusted R-square value:  0.7001241 

   ***********************************************************************
   Program stops at: 2025-02-11 09:27:19.294264 

The global regression results are derived from the standard OLS model, representing a single regression for the entire study area without accounting for spatial variation. In contrast, the geographically weighted regression (GWR) results stem from 575 separate but spatially overlapping regression models, each fitted with geographic weighting to different spatial subsets of the data, allowing for local variation in the regression estimates. We can also see that the model fit (adjusted \(R^2\)) has increased to 70%.

All the local estimates are contained in the spatial data frame cons_24_gwr$SDF. You can inspect the results using the View() function.

We can now also map all the local \(R^2\) values to get an idea of how well the model fit varies spatially:

R code
# join local r squared values onto spatial data
cons_24 <- cons_24 |>
  mutate(gwr_r2 = cons_24_gwr$SDF$Local_R2)

# shape, polygons
tm_shape(cons_24) +

  # specify column, colours
  tm_fill(
    col = "gwr_r2",
    style = "jenks",
    n = "5",
    palette = c("#fee5d9", "#fcae91", "#fb6a4a", "#de2d26", "#a50f15"),
    title = "Local R-squared"
  ) +

  # set layout
  tm_layout(
    legend.outside = FALSE,
    legend.position = c("left", "top"),
    frame = FALSE
  )

Figure 9: Model performance of the Geographically Weighted Regression.

In similar fashion, and with some data wrangling, we can map the individual local coefficients of our independent variables:

R code
# join local coefficient values onto spatial data, select, pivot
cons_24_coef <- cons_24 |>
  mutate(
    eco_active_employed_coef = cons_24_gwr$SDF$eco_active_employed,
    eco_active_unemployed_coef = cons_24_gwr$SDF$eco_active_unemployed,
    eth_white_coef = cons_24_gwr$SDF$eth_white,
    eth_black_coef = cons_24_gwr$SDF$eth_black,
    aged_25_to_34_years_coef = cons_24_gwr$SDF$aged_25_to_34_years,
    aged_50_years_and_over_coef = cons_24_gwr$SDF$aged_50_years_and_over
  ) |>
  select(40:45) |>
  pivot_longer(cols = !geom, names_to = "variable", values_to = "coefficient")

# shape, polygons
tm_shape(cons_24_coef) +

  # specify column, colours
  tm_fill(
    col = "coefficient",
    style = "jenks",
    n = 9,
    palette = c("#e66101", "#fdb863", "#cccccc", "#b2abd2", "#5e3c99"),
    title = "Local GWR coefficient"
  ) +

  # facet
  tm_facets(
    by = "variable",
    ncol = 2
  ) +

  # set layout
  tm_layout(
    legend.show = TRUE,
    legend.outside = FALSE,
    frame = FALSE,
    panel.labels = c("Age: 25-34", "Age: >50", "Employed", "Unemployed", "Ethnicity: Black", "Ethnicity: White"),
  )
Use "fisher" instead of "jenks" for larger data sets
Variable(s) "coefficient" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Figure 10: Local coefficients of the Geographically Weighted Regression.

Mapping these local estimates reveals an interesting geography, where, for example, the proportion of the population over 50 years old is a better predictor for the Conservative vote share in regions like the South West, Greater London, and parts of the North West and Yorkshire and the Humber.

Not all of the local regression estimates may be statistically significant. We can use use the estimated \(t\)-values, as indicated by the _TV suffix in the SDF data frame of the GWR results, to filter out those are not. For example, to filter at 95% confidence, remove \(t\)-values outside the range of -1.96 to +1.96.

For a more comprehensive dive into Geographically Weighted Regression, the Spatial Modelling for Data Scientists course by Liverpool-based Professors Francisco Rowe and Dani Arribas-Bel provides an excellent introduction.

5.3.3 More spatial models

In this tutorial, we have looked at geographically weighted correlation and regression. However, whilst out of the scope of this module, there are many other approaches to account for space in statistical models. Examples of such models are the spatial error model and the spatial lag model.

5.3.3.1 Spatial error model

The spatial error model is used when the error terms in a regression model exhibit spatial autocorrelation, meaning the error terms are not independent across space. This can happen due to omitted variables that have a spatial pattern or unmeasured factors that affect the dependent variable similarly across nearby locations.

The model adjusts for spatial autocorrelation by adding a spatially lagged error term (a weighted sum of the errors from neighbouring locations) to the regression equation:

\[ y = X\beta + \upsilon, \upsilon = \lambda W \upsilon + \epsilon \]

where \(X\beta\) represents the standard regression components, \(\lambda\) is a spatial autoregressive parameter, \(W\) is a spatial weights matrix, and \(\upsilon\) is a vector of spatially autocorrelated errors.

Spatial error models can be fitted using R’s spatialreg package.

5.3.3.2 Spatial lag model

The spatial lag model is appropriate when the dependent variable itself exhibits spatial dependence. This means the value of the dependent variable in one location depends on the values in neighbouring locations. This model is used to capture the spillover effects or diffusion processes, where the outcome in one area is influenced by outcomes in nearby areas (e.g. house prices, crime rates).

The model incorporates a spatially lagged dependent variable, which is the weighted sum of the dependent variable values in neighbouring locations, into the regression equation:

\[ y = \rho Wy + X\beta + \epsilon \]

where \(\rho\) is the spatial autoregressive coefficient, \(Wy\) represents the spatially lagged dependent variable, and \(X\beta\) represents the standard regression components.

Spatial lag models can be fitted using R’s spatialreg package.

5.3.3.3 To note

The spatial error model adjusts for spatial autocorrelation in the error terms, whereas the spatial lag model adjusts for spatial dependence in the dependent variable itself. The spatial error model does not alter the interpretation of the coefficients of the independent variables, while the spatial lag model introduces a feedback loop where changes in one area can influence neighbouring areas.

Both the spatial error and spatial lag models assume that the relationships between variables are the same across the study area, with adjustments made only for spatial dependencies. GWR, on the other hand, allows the relationships themselves to vary across space. GWR is more flexible but also more complex and computationally intensive, providing local instead of global estimates of coefficients.

5.4 Assignment

The Geographically Weighted Regression (GWR) reveals some varying patterns in the sign and directiion of the Conservative voter share, but it is likely that other factors are at play. One could hypothesise that another potential predictor of the Conservative vote share is the proportion of the population employed in a particular industry. Try to do the following:

  1. Download the dataset provided below and save it to the appropriate subfolder within your data directory. The csv, extracted using the Custom Dataset Tool, contains the number of employed individuals as recorded in the 2021 Census by their Standard Industrial Classification (SIC) code.
  2. Formulate a hypothesis regarding which of the nine main industries presented in the dataset might help predict the Conservative vote share.
  3. For the chosen industry variable:
  4. What do the results suggest? Is this variable associated with any changes in Conservative vote share?
  5. Create a map of the local coefficients for this variable.
File Type Link
England and Wales Parliamentary Constituencies Employed by Industry csv Download

5.5 Before you leave

This week, we explored decomposing global measures to account for non-stationarity in spatial variables. Geographically Weighted Regression (GWR) is particularly useful when the relationship between variables varies across space, as it allows for localised regression coefficients and helps identify spatial heterogeneity that might be overlooked in global models. This was a rather heavy one, time for some light reading?