Geodemographic Classification

This week we will turn to geodemographic classification. Geodemographic classification is a method used to categorise geographic areas and the people living in them based on demographic, socioeconomic, and sometimes lifestyle characteristics. This approach combines geographic information with demographic data to create profiles of different neighbourhoods.

Lecture slides

You can download the slides of this week’s lecture here: [Link].

Reading list

Essential readings

  • Longley, P. A. 2012. Geodemographics and the practices of geographic information science. International Journal of Geographical Information Science 26(12): 2227-2237. [Link]
  • Singleton, A. and Longley, P. A. 2024. Classifying and mapping residential structure through the London Output Area Classification. Environment and Planning B: Urban Analytics and City Science 51(5): 1153-1164. [Link]
  • Wyszomierski, J., Longley, P. A., Singleton, A., et al. 2024. A neighbourhood Output Area Classification from the 2021 and 2022 UK censuses. The Geographical Journal 190(2): e12550. [Link]

Suggested readings

  • Dalton, C. M. and Thatcher, J. 2015. Inflated granularity: Spatial “Big Data” and geodemographics. Big Data & Society 2(2): 1-15. [Link]
  • Fränti, P. and Sieranoja, S. 2019. How much can k-means be improved by using better initialization and repeats? Pattern Recognition 93: 95-112. [Link]
  • Singleton, A. and Spielman, S. 2014. The past, present, and future of geodemographic research in the United States and United Kingdom. The Professional Geographer 66(4): 558-567. [Link]

Classifying London

Today, we will create our own geodemographic classification to examine demographic clusters across London, drawing inspiration from the London Output Area Classification. Specifically, we will try to identify clusters based on age group, self-identified ethnicity, country of birth, and first or preferred language.

The data covers all usual residents, as recorded in the 2021 Census for England and Wales, aggregated at the Lower Super Output Area (LSOA) level. These datasets have been extracted using the Custom Dataset Tool, and you can download each file via the links provided below. A copy of the 2021 London LSOAs spatial boundaries is also available. Save these files in your project folder under data.

File Type Link
London LSOA Census 2021 Age Groups csv Download
London LSOA Census 2021 Country of Birth csv Download
London LSOA Census 2021 Ethnicity csv Download
London LSOA Census 2021 Main Language csv Download
London LSOA 2021 Spatial Boundaries GeoPackage Download

To download a csv file that is hosted on GitHub, click on the Download raw file button at the top right of the page and the file should download directly to your computer.

For the spatial boundaries of the London LSOAs, you may have noticed that, instead of providing a collection of files known as a shapefile, we have supplied a GeoPackage. While shapefiles remain in use, GeoPackage is a more modern and portable file format. Have a look at this article on towardsdatascience.com for an excellent explanation of why one should use GeoPackage files instead of shapefiles where possible: [Link]

Open a new script and save this as w07-geodemographic-analysis.r.

Begin by loading the necessary libraries:

R code
# load libraries
library(tidyverse)
library(janitor)
library(ggcorrplot)
library(cluster)
library(factoextra)
library(sf)
library(tmap)

You may have to install some of these libraries if you have not used them before.
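
If any are missing, you can install them all in one go:

R code
# install the required packages
install.packages(c("tidyverse", "janitor", "ggcorrplot", "cluster", "factoextra",
    "sf", "tmap"))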

Next, we can load the individual csv files that we downloaded into R.

R code
# load age data
lsoa_age <- read_csv("data/London-LSOA-AgeGroup.csv")
Rows: 24970 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Lower layer Super Output Areas Code, Lower layer Super Output Areas...
dbl (2): Age (5 categories) Code, Observation

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# load country of birth data
lsoa_cob <- read_csv("data/London-LSOA-Country-of-Birth.csv")
Rows: 39952 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Lower layer Super Output Areas Code, Lower layer Super Output Areas...
dbl (2): Country of birth (8 categories) Code, Observation

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# load ethnicity data
lsoa_eth <- read_csv("data/London-LSOA-Ethnicity.csv")
Rows: 99880 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Lower layer Super Output Areas Code, Lower layer Super Output Areas...
dbl (2): Ethnic group (20 categories) Code, Observation

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# load language data
lsoa_lan <- read_csv("data/London-LSOA-MainLanguage.csv")
Rows: 54934 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Lower layer Super Output Areas Code, Lower layer Super Output Areas...
dbl (2): Main language (11 categories) Code, Observation

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Note that R accepts forward slashes (/) in file paths on all operating systems, including Windows. If you copy a file path from Windows Explorer, however, it will contain single backslashes (\), which you will need to replace with either forward slashes (/) or double backslashes (\\).
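
Alternatively, you can let R construct the path for you with file.path(), which builds a path that works on every operating system:

R code
# build a platform-independent file path
lsoa_age <- read_csv(file.path("data", "London-LSOA-AgeGroup.csv"))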

Now, carefully examine each individual dataframe to understand how the data is structured and what information it contains.

R code
# inspect age data
head(lsoa_age)
# A tibble: 6 × 5
  Lower layer Super Output Areas…¹ Lower layer Super Ou…² Age (5 categories) C…³
  <chr>                            <chr>                                   <dbl>
1 E01000001                        City of London 001A                         1
2 E01000001                        City of London 001A                         2
3 E01000001                        City of London 001A                         3
4 E01000001                        City of London 001A                         4
5 E01000001                        City of London 001A                         5
6 E01000002                        City of London 001B                         1
# ℹ abbreviated names: ¹​`Lower layer Super Output Areas Code`,
#   ²​`Lower layer Super Output Areas`, ³​`Age (5 categories) Code`
# ℹ 2 more variables: `Age (5 categories)` <chr>, Observation <dbl>
# inspect country of birth data
head(lsoa_cob)
# A tibble: 6 × 5
  Lower layer Super Output Areas…¹ Lower layer Super Ou…² Country of birth (8 …³
  <chr>                            <chr>                                   <dbl>
1 E01000001                        City of London 001A                        -8
2 E01000001                        City of London 001A                         1
3 E01000001                        City of London 001A                         2
4 E01000001                        City of London 001A                         3
5 E01000001                        City of London 001A                         4
6 E01000001                        City of London 001A                         5
# ℹ abbreviated names: ¹​`Lower layer Super Output Areas Code`,
#   ²​`Lower layer Super Output Areas`, ³​`Country of birth (8 categories) Code`
# ℹ 2 more variables: `Country of birth (8 categories)` <chr>,
#   Observation <dbl>
# inspect ethnicity data
head(lsoa_eth)
# A tibble: 6 × 5
  Lower layer Super Output Areas…¹ Lower layer Super Ou…² Ethnic group (20 cat…³
  <chr>                            <chr>                                   <dbl>
1 E01000001                        City of London 001A                        -8
2 E01000001                        City of London 001A                         1
3 E01000001                        City of London 001A                         2
4 E01000001                        City of London 001A                         3
5 E01000001                        City of London 001A                         4
6 E01000001                        City of London 001A                         5
# ℹ abbreviated names: ¹​`Lower layer Super Output Areas Code`,
#   ²​`Lower layer Super Output Areas`, ³​`Ethnic group (20 categories) Code`
# ℹ 2 more variables: `Ethnic group (20 categories)` <chr>, Observation <dbl>
# inspect language data
head(lsoa_lan)
# A tibble: 6 × 5
  Lower layer Super Output Areas…¹ Lower layer Super Ou…² Main language (11 ca…³
  <chr>                            <chr>                                   <dbl>
1 E01000001                        City of London 001A                        -8
2 E01000001                        City of London 001A                         1
3 E01000001                        City of London 001A                         2
4 E01000001                        City of London 001A                         3
5 E01000001                        City of London 001A                         4
6 E01000001                        City of London 001A                         5
# ℹ abbreviated names: ¹​`Lower layer Super Output Areas Code`,
#   ²​`Lower layer Super Output Areas`, ³​`Main language (11 categories) Code`
# ℹ 2 more variables: `Main language (11 categories)` <chr>, Observation <dbl>

You can further inspect the results using the View() function.

Variable preparation

To identify geodemographic clusters in our dataset, we will use a technique called \(k\)-means. \(k\)-means aims to partition a set of standardised observations into a specified number of clusters (\(k\)). To do this we first need to prepare the individual datasets, as well as transform and standardise the input variables.

\(k\)-means clustering is an unsupervised machine learning algorithm used to group data into a predefined number of clusters, based on similarities between data points. It works by initially assigning \(k\) random centroids, then iteratively updating them by assigning each data point to the nearest centroid and recalculating the centroid’s position based on the mean of the points in each cluster. The process continues until the centroids stabilise, meaning they no longer change significantly. \(k\)-means is often used for tasks such as data segmentation, image compression, or anomaly detection. It is simple but may not work well with non-spherical or overlapping clusters.
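
To make the algorithm concrete, below is a stylised sketch of the assign-update loop on toy data, ignoring convergence checks and edge cases such as empty clusters; for the actual analysis we will rely on R's built-in kmeans() function, which handles all of this for us.

R code
# a stylised sketch of the k-means assign-update loop on toy data
set.seed(42)
pts <- matrix(rnorm(200), ncol = 2)  # 100 observations, 2 variables
k <- 3

# step 1: pick k random observations as the initial centroids
centroids <- pts[sample(nrow(pts), k), ]

# steps 2 and 3: iterate assignment and centroid updates
for (i in 1:10) {
    # distance from every observation to every centroid
    d <- as.matrix(dist(rbind(centroids, pts)))[-(1:k), 1:k]
    # assign each observation to its nearest centroid
    assignment <- apply(d, 1, which.min)
    # move each centroid to the mean of its assigned observations
    centroids <- apply(pts, 2, function(x) tapply(x, assignment, mean))
}

# cluster sizes after ten iterations
table(assignment)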

Because all the data are stored in long format, each London LSOA appears on multiple rows: one row per category for age group, ethnicity, country of birth, and first or preferred language. We therefore need to transform the data into a wide format, so that instead of having multiple rows for an LSOA showing counts for different age groups, all the information for each LSOA is consolidated into a single row. Additionally, we will clean up the column names to follow standard R naming conventions and make the data easier to work with. We can automate this process using the janitor package.

We will begin with the age dataframe:

R code
# clean names
lsoa_age <- lsoa_age |>
    clean_names()

# pivot
lsoa_age <- lsoa_age |>
    pivot_wider(id_cols = "lower_layer_super_output_areas_code", names_from = "age_5_categories",
        values_from = "observation")

# clean names
lsoa_age <- lsoa_age |>
    clean_names()

The code above uses the pipe operator: |>. The pipe operator allows you to pass the output of one function directly into the next, streamlining your code. While it might be a bit confusing at first, you will find that it makes your code faster to write and easier to read. More importantly, it reduces the need to create multiple intermediate variables to store outputs.
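
To see the difference, compare a nested call with its piped equivalent; both do exactly the same thing:

R code
# nested: read from the inside out
head(clean_names(lsoa_age))

# piped: read from left to right
lsoa_age |>
    clean_names() |>
    head()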

To account for the non-uniformity of the areal units, we further need to convert the observations to proportions and only retain those columns that are likely to be meaningful in the context of the classification:

R code
# total observations
lsoa_age <- lsoa_age |>
    rowwise() |>
    mutate(age_pop = sum(across(2:6)))

# total proportions, select columns
lsoa_age <- lsoa_age |>
    mutate(across(2:6, ~./age_pop)) |>
    select(1:6)

# inspect
head(lsoa_age)
# A tibble: 6 × 6
# Rowwise: 
  lower_layer_super_output_areas_code aged_15_years_and_un…¹ aged_16_to_24_years
  <chr>                                                <dbl>               <dbl>
1 E01000001                                           0.0846              0.0744
2 E01000002                                           0.0621              0.0889
3 E01000003                                           0.0682              0.0706
4 E01000005                                           0.127               0.178 
5 E01000006                                           0.224               0.120 
6 E01000007                                           0.257               0.103 
# ℹ abbreviated name: ¹​aged_15_years_and_under
# ℹ 3 more variables: aged_25_to_34_years <dbl>, aged_35_to_49_years <dbl>,
#   aged_50_years_and_over <dbl>

This looks much better. We can do the same for the country of birth data:

R code
# prepare country of birth data
lsoa_cob <- lsoa_cob |>
    clean_names() |>
    pivot_wider(id_cols = "lower_layer_super_output_areas_code", names_from = "country_of_birth_8_categories",
        values_from = "observation") |>
    clean_names()

# proportions, select columns
lsoa_cob <- lsoa_cob |>
    rowwise() |>
    mutate(cob_pop = sum(across(2:9))) |>
    mutate(across(2:9, ~./cob_pop)) |>
    select(-2, -10)

And we can do the same for the ethnicity and language datasets:

R code
# prepare ethnicity data
lsoa_eth <- lsoa_eth |>
    clean_names() |>
    pivot_wider(id_cols = "lower_layer_super_output_areas_code", names_from = "ethnic_group_20_categories",
        values_from = "observation") |>
    clean_names()

# proportions, select columns
lsoa_eth <- lsoa_eth |>
    rowwise() |>
    mutate(eth_pop = sum(across(2:21))) |>
    mutate(across(2:21, ~./eth_pop)) |>
    select(-2, -22)

# prepare language data
lsoa_lan <- lsoa_lan |>
    clean_names() |>
    pivot_wider(id_cols = "lower_layer_super_output_areas_code", names_from = "main_language_11_categories",
        values_from = "observation") |>
    clean_names()

# proportions, select columns
lsoa_lan <- lsoa_lan |>
    rowwise() |>
    mutate(lan_pop = sum(across(2:12))) |>
    mutate(across(2:12, ~./lan_pop)) |>
    select(-2, -11, -13)

We now have four separate datasets, each containing the proportions of usual residents classified into different groups based on age, country of birth, ethnicity, and language.
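
Before moving on, it is worth verifying that the four dataframes line up; each should contain exactly one row per London LSOA (4,994 in total):

R code
# each dataframe should have the same number of rows: one per LSOA
sapply(list(age = lsoa_age, cob = lsoa_cob, eth = lsoa_eth, lan = lsoa_lan), nrow)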

Variable selection

While we initially selected variables from different demographic domains, not all variables may be suitable for inclusion. Firstly, the variables need to exhibit sufficient heterogeneity to ensure they capture meaningful differences between observations. Secondly, variables should not be highly correlated with one another, as this redundancy can skew the clustering results. Keeping the correlation between variables low helps maintain the diversity of information and improves the robustness of the clustering outcome.

Variable selection is often a time-consuming process that requires a combination of domain knowledge and more extensive exploratory analysis than is covered in this practical.

A straightforward yet effective method to examine the distribution of our variables is to create boxplots for each variable. This can be efficiently achieved by using facet_wrap() from the ggplot2 library to generate a matrix of panels, allowing us to visualise all variables in a single view.

ggplot2 is a popular data visualisation package in R, designed for creating complex plots. It uses the Grammar of Graphics to build layered, customisable graphics by mapping data to visual elements like colour, size, and shape. You can refer to the ggplot2 documentation for more details.

R code
# wide to long
lsoa_age_long <- lsoa_age |>
    pivot_longer(cols = c(2:6), names_to = "agegroup", values_to = "count")

# facet age
ggplot(lsoa_age_long, aes(y = count)) + geom_boxplot() + facet_wrap(~agegroup, ncol = 2) +
    theme_minimal() + ylab("")

Figure 1: Boxplots of the distribution of the age dataset.

When repeating this process for the birth, ethnicity, and language variables, you will notice that some variables have a very limited distribution. Specifically, some variables may have a value of 0 for the majority of London LSOAs. As a rule of thumb, we will retain only those variables where at least 75% of the LSOAs have values different from 0.

This threshold of 75% is arbitrary, and in practice, more thorough consideration should be given when deciding whether to include or exclude a variable.

R code
# join
lsoa_df <- lsoa_age |>
    left_join(lsoa_cob, by = "lower_layer_super_output_areas_code") |>
    left_join(lsoa_eth, by = "lower_layer_super_output_areas_code") |>
    left_join(lsoa_lan, by = "lower_layer_super_output_areas_code")

# calculate proportion of zeroes
zero_prop <- sapply(lsoa_df[2:41], function(x) {
    mean(x == 0)
})

# extract variables with high proportion zeroes
idx <- which(zero_prop > 0.25)

# inspect
idx
   white_gypsy_or_irish_traveller            any_other_uk_languages 
                               27                                33 
  oceanic_or_australian_languages north_or_south_american_languages 
                               37                                38 
# remove variables with high proportion zeroes
lsoa_df <- lsoa_df |>
    select(-white_gypsy_or_irish_traveller, -any_other_uk_languages, -oceanic_or_australian_languages,
        -north_or_south_american_languages)

The code above makes use of Boolean logic to calculate the proportion of zeroes within each variable. The x == 0 part checks each value in column x to see if it is equal to 0, returning TRUE or FALSE for each element. The mean() function then averages this logical vector: since TRUE is treated as 1 and FALSE as 0, the result is the proportion of values in the column that are equal to zero.
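
A quick toy example makes this clear:

R code
# two of the four values are zero, so this returns 0.5
mean(c(0, 3, 0, 7) == 0)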

We can subsequently check for multicollinearity of the remaining variables. The easiest way to check the correlations between all variables is probably by visualising a correlation matrix:

R code
# inspect variable names
names(lsoa_df)
 [1] "lower_layer_super_output_areas_code"                                  
 [2] "aged_15_years_and_under"                                              
 [3] "aged_16_to_24_years"                                                  
 [4] "aged_25_to_34_years"                                                  
 [5] "aged_35_to_49_years"                                                  
 [6] "aged_50_years_and_over"                                               
 [7] "europe_united_kingdom"                                                
 [8] "europe_ireland"                                                       
 [9] "europe_other_europe"                                                  
[10] "africa"                                                               
[11] "middle_east_and_asia"                                                 
[12] "the_americas_and_the_caribbean"                                       
[13] "antarctica_and_oceania_including_australasia_and_other"               
[14] "asian_asian_british_or_asian_welsh_bangladeshi"                       
[15] "asian_asian_british_or_asian_welsh_chinese"                           
[16] "asian_asian_british_or_asian_welsh_indian"                            
[17] "asian_asian_british_or_asian_welsh_pakistani"                         
[18] "asian_asian_british_or_asian_welsh_other_asian"                       
[19] "black_black_british_black_welsh_caribbean_or_african_african"         
[20] "black_black_british_black_welsh_caribbean_or_african_caribbean"       
[21] "black_black_british_black_welsh_caribbean_or_african_other_black"     
[22] "mixed_or_multiple_ethnic_groups_white_and_asian"                      
[23] "mixed_or_multiple_ethnic_groups_white_and_black_african"              
[24] "mixed_or_multiple_ethnic_groups_white_and_black_caribbean"            
[25] "mixed_or_multiple_ethnic_groups_other_mixed_or_multiple_ethnic_groups"
[26] "white_english_welsh_scottish_northern_irish_or_british"               
[27] "white_irish"                                                          
[28] "white_roma"                                                           
[29] "white_other_white"                                                    
[30] "other_ethnic_group_arab"                                              
[31] "other_ethnic_group_any_other_ethnic_group"                            
[32] "english_or_welsh"                                                     
[33] "european_languages_eu"                                                
[34] "other_european_languages_non_eu"                                      
[35] "asian_languages"                                                      
[36] "african_languages"                                                    
[37] "any_other_languages"                                                  
# change variable names to index to improve visualisation
lsoa_df_vis <- lsoa_df
names(lsoa_df_vis)[2:37] <- paste0("v", sprintf("%02d", 1:36))

# correlation matrix
cor_mat <- cor(lsoa_df_vis[, -1])

# correlation plot
ggcorrplot(cor_mat, outline.col = "#ffffff", tl.cex = 8, legend.title = "Correlation")

Figure 2: Correlation plot of classification variables.

Following the approach from Wyszomierski et al. (2024), we can define a weak correlation as lying between 0 and 0.40, moderate as between 0.41 and 0.65, strong as between 0.66 and 0.80, and very strong as between 0.81 and 1.

A few strong and very strong correlations can be observed, and the variables involved could potentially be removed; however, to maintain representation of all demographic domains, we decide to retain all variables here.
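
If you prefer numbers over eyeballing the plot, you can tabulate the absolute pairwise correlations against these bands using the correlation matrix we created above:

R code
# tabulate absolute pairwise correlations by strength band
cor_vals <- abs(cor_mat[upper.tri(cor_mat)])
table(cut(cor_vals, breaks = c(0, 0.4, 0.65, 0.8, 1), labels = c("weak", "moderate",
    "strong", "very strong"), include.lowest = TRUE))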

Variable standardisation

If the input data are heavily skewed or contain outliers, \(k\)-means may produce less meaningful clusters. While normality is not required per se, it is common to transform skewed variables nonetheless; below we apply an inverse hyperbolic sine transformation, which behaves like a logarithm for large values but is also defined at zero. More important is to standardise the input variables, especially when they are measured on different scales, to ensure that each variable contributes equally to the clustering process; here we rescale every variable to the range 0 to 1.

R code
# inverse hyperbolic sine
lsoa_df_vis[, -1] <- sapply(lsoa_df_vis[-1], asinh)

# range standardise
lsoa_df_vis[, -1] <- sapply(lsoa_df_vis[-1], function(x) {
    (x - min(x))/(max(x) - min(x))
})

Selecting the number of clusters

Now our data are prepared, we will start by creating an elbow plot. The elbow method is a visual tool that helps determine the optimal number of clusters in a dataset. This is important because with \(k\)-means clustering you need to specify the number of clusters a priori. The elbow method involves running the clustering algorithm with varying numbers of clusters (\(k\)) and plotting the total within-cluster variation (known as the Within Sum of Squares) against the number of clusters. The goal is to identify the ‘elbow’ point on the curve, where the rate of decrease in the Within Sum of Squares starts to slow. This point suggests that adding more clusters yields diminishing returns in terms of cluster compactness.

R code
# elbow plot
fviz_nbclust(lsoa_df_vis[, -1], kmeans, nstart = 100, iter.max = 100, method = "wss")

Figure 3: Elbow plot with ‘Within Sum of Squares’ against number of clusters.

Based on the elbow plot, we can now choose the number of clusters; it looks like 6 clusters would be a reasonable choice.

The interpretation of an elbow plot can be quite subjective, and multiple options for the optimal number of clusters might be justified; for instance, 4, 5, or even 7 clusters could be reasonable choices. In addition to the elbow method, other techniques can aid in determining the optimal number of clusters, such as silhouette scores and the gap statistic. An alternative and helpful approach is to use a clustergram, which is a two-dimensional plot that visualises the flows of observations between clusters as more clusters are added. This method illustrates how your data reshuffle with each additional cluster and provides insights into the quality of the splits. Clustergrams can be created in R, but are currently easier to implement in Python.
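
For instance, the average silhouette width can be computed with the same helper function we used for the elbow plot; the number of clusters that maximises the average silhouette width is generally preferred:

R code
# silhouette plot
fviz_nbclust(lsoa_df_vis[, -1], kmeans, nstart = 100, iter.max = 100, method = "silhouette")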

\(k\)-means clustering

Now we have decided on the number of clusters, we can run our \(k\)-means analysis.

R code
# set seed for reproducibility
set.seed(999)

# k-means
lsoa_clus <- kmeans(lsoa_df_vis[, -1], centers = 6, nstart = 100, iter.max = 100)

We can inspect the object to get some information about our clusters:

R code
# inspect
lsoa_clus
K-means clustering with 6 clusters of sizes 796, 1097, 771, 1011, 851, 468

Cluster means:
        v01       v02       v03       v04       v05       v06       v07
1 0.4816225 0.1632210 0.2425566 0.4838983 0.4169123 0.5410477 0.1337158
        v08       v09       v10        v11        v12        v13        v14
1 0.3007540 0.2480613 0.2859754 0.08913663 0.05177222 0.14603013 0.06993627
         v15        v16        v17        v18        v19        v20        v21
1 0.12176548 0.10935022 0.20109979 0.18150482 0.11605092 0.12757934 0.09288236
         v22        v23       v24       v25       v26        v27       v28
1 0.07473711 0.14662903 0.1842217 0.3522638 0.1490577 0.02997040 0.2423784
         v29       v30       v31       v32        v33        v34        v35
1 0.07148504 0.2193009 0.5870244 0.2411272 0.07541661 0.23467242 0.10174187
         v36
1 0.10216507
 [ reached getOption("max.print") -- omitted 5 rows ]

Clustering vector:
 [1] 4 4 4 1 1 5 5 6 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 5 1 1 2 2 1 1 2 2 2 1 1 1
[39] 1 1 1 1 5 1 5 1 1 1 1 1
 [ reached getOption("max.print") -- omitted 4944 entries ]

Within cluster sum of squares by cluster:
[1] 259.0272 177.7951 288.8625 232.7770 298.9145 160.1702
 (between_SS / total_SS =  48.5 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

Visualising clusters

We now need to perform some post-processing to extract useful summary data for each cluster. To characterise the clusters, we can compare the global mean values of each variable with the mean values specific to each cluster.

R code
# global means
glob_means <- colMeans(lsoa_df_vis[, -1])

# add clusters to input data
lsoa_df_vis <- cbind(lsoa_df_vis, cluster = lsoa_clus$cluster)

# cluster means
cluster_means <- lsoa_df_vis |>
    group_by(cluster) |>
    summarise(across(2:37, mean))

# difference
cluster_diffs <- cluster_means |>
    mutate(across(2:37, ~. - glob_means[cur_column()]))

These comparisons can then be visualised using, for instance, a radial bar plot:

R code
# to long format
cluster_diffs_long <- cluster_diffs |>
    pivot_longer(!cluster, names_to = "vars", values_to = "score")

# facet clusters
ggplot(cluster_diffs_long, aes(x = factor(vars), y = score)) + geom_bar(stat = "identity") +
    coord_radial(expand = FALSE) + facet_wrap(~cluster, ncol = 3) + theme_minimal() +
    theme(axis.text.x = element_text(size = 7)) + xlab("") + ylab("")

Figure 4: Radial barplots of cluster means for each input variable.

These plots can serve as a foundation for creating pen portraits by closely examining which variables drive each cluster.

For easier interpretation, these values can be transformed into index scores, allowing us to assess which variables are under- or overrepresented within each cluster group.
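
A sketch of one common convention: divide each cluster mean by the corresponding global mean and multiply by 100, so that scores above 100 indicate overrepresentation and scores below 100 indicate underrepresentation of a variable within a cluster:

R code
# index scores: cluster means relative to the global means (100 = global average)
cluster_idx <- cluster_means |>
    mutate(across(2:37, ~./glob_means[cur_column()] * 100))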

Of course, we can also map the results:

R code
# read spatial dataset
lsoa21 <- st_read("data/London-LSOA-2021.gpkg")
Reading layer `London-LSOA-2021' from data source 
  `/Users/justinvandijk/Library/CloudStorage/Dropbox/UCL/Web/jtvandijk.github.io/GEOG0114/data/London-LSOA-2021.gpkg' 
  using driver `GPKG'
Simple feature collection with 4994 features and 8 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 503574.2 ymin: 155850.8 xmax: 561956.7 ymax: 200933.6
Projected CRS: OSGB36 / British National Grid
# join clusters as a categorical variable
lsoa21 <- cbind(lsoa21, cluster = as.factor(lsoa_clus$cluster))

# shape, polygon
tm_shape(lsoa21) +

  # specify column, colours
  tm_polygons(
    col = "cluster",
    palette = c("#feebe2", "#fcc5c0", "#fa9fb5", "#f768a1", "#c51b8a", "#7a0177"),
    border.col = "#ffffff",
    border.alpha = 0.1,
    title = "Cluster number"
  ) +

  # set layout
  tm_layout(
    legend.outside = FALSE,
    legend.position = c("right", "bottom"),
    frame = FALSE
  )

Figure 5: Classification of London LSOAs based on several demographic variables.
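
If you want to inspect individual LSOAs more closely, you can also render the map interactively by switching tmap to view mode; a minimal version of the map above:

R code
# switch to interactive viewing
tmap_mode("view")

# redraw the map
tm_shape(lsoa21) +
    tm_polygons(col = "cluster", border.alpha = 0.1, title = "Cluster number")

# switch back to static plotting
tmap_mode("plot")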

Assignment

The creation of a geodemographic classification is an iterative process. This typically includes adding or removing variables, adjusting the number of clusters, and grouping data in different ways to achieve the most meaningful segmentation. Try to do the following:

  1. Download the two datasets provided below and save them to your data folder. The datasets include:
    • A csv file containing the number of people aged 16 years and older by occupational category, as defined by the Standard Occupational Classification 2020, aggregated by 2021 LSOAs.
    • A csv file containing the number of people aged 16 years and older by their highest level of qualification, also aggregated to the 2021 LSOA level.
  2. Prepare these two datasets and retain only those variables that are potentially meaningful. Filter out any variables with a high proportion of zero values.
  3. Merge the education and occupation datasets with the dataset used to generate the initial geodemographic classification. Check for multicollinearity and consider removing any variables that are highly correlated.
  4. Perform \(k\)-means clustering on your extended dataset. Make sure to select an appropriate number of clusters for your analysis.
  5. Interpret the individual clusters in terms of the variables that are under- and overrepresented.
File Type Link
London LSOA Census 2021 Occupation csv Download
London LSOA Census 2021 Education csv Download

Before you leave

Having finished this tutorial, you should now understand the basics of a geodemographic classification. That is all for this week!