9 Geodemographics
9.1 Introduction
After last week’s focus on using spatial analysis for the creation and analysis of novel datasets within data scarce environments, this week we will look at data rich environments and how we can use socio-demographic and socio-economic data to characterise neighbourhoods using geodemographics.
Geodemographics is the ‘analysis of people by where they live’ (Harris et al. 2005) and as such entails representing the individual and collective identities that are manifest in observable neighbourhood structure (Longley 2012)
We will look at geodemographics by focusing on a geodemographic classification known as the Internet User Classification.
This week is structured around two short lecture videos, two assignments (one on data visualisation and one on automation that you need to do in preparation for Friday’s seminar, and the practical material. As always, this week’s reading list is available on the UCL library reading list page for the course. Good luck!
9.1.1 Video: Overview
[Lecture slides] [Watch on MS stream]9.2 Geodemographics and the Internet User Classification
The CDRC Internet User Classification (IUC) is a bespoke geodemographic classification that describes how people residing in different parts of Great Britain interact with the Internet. For every Lower Super Output Area (LSOA) in England and Wales and Data Zone (DZ) (2011 Census Geographies), the IUC provides aggregate population estimates of Internet use (Singleton et al. 2020) and provides insights into the geography of the digital divide in the United Kingdom.
“Digital inequality is observable where access to online resources and those opportunities that these create are non-egalitarian. As a result of variable rates of access and use of the Internet between social and spatial groups (..), this leads to digital differentiation, which entrenches difference and reciprocates digital inequality over time (Singleton et al. 2020).”
9.2.1 Video: Geodemographics and the Internet User Classification
[Lecture slides] [Watch on MS stream]9.2.2 Example: Geodemographics and the Internet User Classification
For the first part of this week’s practical material, we will be looking at the Internet User Classification (IUC) for Great Britain in more detail by mapping it.
Our first step is to download the IUC dataset:
- Open a web browser and go to the data portal of the CDRC.
- Register if you need to, or if you are already registered, make sure you are logged in.
- Search for Internet User Classification.
- Scroll down and choose the download option for the IUC 2018 (CSV).
- Save the
iuc_gb_2018.csv
file in an appropriate folder.
Start by inspecting the dataset in MS Excel, or any other spreadsheet software such as Apache OpenOffice Calc or Numbers. Also, have a look at the IUC 2018 User Guide that provides the pen portraits of every cluster, including plots of cluster centres and a brief summary of the methodology.
Note
It is always a good idea to inspect your data prior to analysis to find out how your data look like. Of course, depending on the type of data, you can choose any tool you like to do this inspection (ArcGIS, R, QGIS, Microsoft Excel, etc.).
# load libraries
library(tidyverse)
library(tmap)
# load data
read_csv('raw/index/iuc_gb_2018.csv')
iuc <-
# inspect
iuc
## # A tibble: 41,729 x 5
## SHP_ID LSOA11_CD LSOA11_NM GRP_CD GRP_LABEL
## <dbl> <chr> <chr> <dbl> <chr>
## 1 1 E01020179 South Hams 012C 5 e-Rational Utilitarians
## 2 2 E01033289 Cornwall 007E 9 Settled Offline Communities
## 3 3 W01000189 Conwy 015F 5 e-Rational Utilitarians
## 4 4 W01001022 Bridgend 014B 7 Passive and Uncommitted Users
## 5 5 W01000532 Ceredigion 007B 9 Settled Offline Communities
## 6 6 E01018888 Cornwall 071G 9 Settled Offline Communities
## 7 7 E01018766 Cornwall 028D 9 Settled Offline Communities
## 8 8 E01019948 East Devon 010C 9 Settled Offline Communities
## 9 9 W01000539 Ceredigion 005D 5 e-Rational Utilitarians
## 10 10 E01019171 Barrow-in-Furness 005E 6 e-Mainstream
## # … with 41,719 more rows
# inspect data types
str(iuc)
## tibble [41,729 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ SHP_ID : num [1:41729] 1 2 3 4 5 6 7 8 9 10 ...
## $ LSOA11_CD: chr [1:41729] "E01020179" "E01033289" "W01000189" "W01001022" ...
## $ LSOA11_NM: chr [1:41729] "South Hams 012C" "Cornwall 007E" "Conwy 015F" "Bridgend 014B" ...
## $ GRP_CD : num [1:41729] 5 9 5 7 9 9 9 9 5 6 ...
## $ GRP_LABEL: chr [1:41729] "e-Rational Utilitarians" "Settled Offline Communities" "e-Rational Utilitarians" "Passive and Uncommitted Users" ...
## - attr(*, "spec")=
## .. cols(
## .. SHP_ID = col_double(),
## .. LSOA11_CD = col_character(),
## .. LSOA11_NM = col_character(),
## .. GRP_CD = col_double(),
## .. GRP_LABEL = col_character()
## .. )
Questions
- Create a histogram to inspect the distribution of the data. Looking at the histogram: is this what you expected? Why (not)?
Now the data are loaded we can move to acquiring our spatial data. As the IUC is created at the level of the Lower layer Super Output Area Census geography, we need to download its administrative borders. As the dataset for the entire country is quite large, we will focus on Liverpool for a change.
- Go to the UK Data Service Census support portal and select Boundary Data Selector.
- Set Country to England, Geography to Statistical Building Block, dates to 2011 and later, and click Find.
- Select English Lower Layer Super Output Areas, 2011 and click List Areas.
- Select Liverpool from the list and click Extract Boundary Data.
- Wait until loaded and download the
BoundaryData.zip
file. - Unzip and save the file in the usual fashion.
Note
You could also have downloaded the shapefile with the data already joined to the LSOA boundaries directly from the CDRC data portal, but this is the national data set and is quite large (75MB). Also, as we will be looking at Liverpool today we do not need all LSOAs in Great Britain. Of course, if you prefer filtering Liverpool’s LSOAs out of the GB dataset please go ahead!
Now we got the administrative boundary data, we can prepare the IUC map by joining our csv
file with the IUC classification to the shapefile.
# load libraries
library(sf)
library(tmap)
# load spatial data
st_read('raw/boundaries/liverpool_lsoa_2011/england_lsoa_2011.shp') liverpool <-
## Reading layer `england_lsoa_2011' from data source `/Users/Tycho/Dropbox/UCL/Web/jtvandijk.github.io/GEOG0114/raw/boundaries/liverpool_lsoa_2011/england_lsoa_2011.shp' using driver `ESRI Shapefile'
## Simple feature collection with 298 features and 3 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 332390.2 ymin: 379748.5 xmax: 345636 ymax: 397980.1
## CRS: 27700
# inspect
plot(liverpool$geometry)
# join data
merge(liverpool, iuc, by.x='code', by.y='LSOA11_CD')
liv_iuc <-
# inspect
liv_iuc
## Simple feature collection with 298 features and 7 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 332390.2 ymin: 379748.5 xmax: 345636 ymax: 397980.1
## CRS: 27700
## First 10 features:
## code label name SHP_ID LSOA11_NM
## 1 E01006512 E08000012E02001377E01006512 Liverpool 031A 26586 Liverpool 031A
## 2 E01006513 E08000012E02006932E01006513 Liverpool 060A 24660 Liverpool 060A
## 3 E01006514 E08000012E02001383E01006514 Liverpool 037A 27675 Liverpool 037A
## GRP_CD GRP_LABEL geometry
## 1 1 e-Cultural Creators POLYGON ((336203 390010, 33...
## 2 1 e-Cultural Creators POLYGON ((335402.8 390317.5...
## 3 1 e-Cultural Creators POLYGON ((335651.3 389926.8...
## [ reached 'max' / getOption("max.print") -- omitted 7 rows ]
# inspect
tmap_mode('view')
tm_shape(liv_iuc) +
tm_fill(col='GRP_LABEL') +
tm_layout(legend.outside=TRUE)
Let’s use the same colours as used on CDRC mapmaker by specifying the hex colour codes for each of our groups. Note the order of the colours is important: the colour for group 1 is first, group 2 second and so on.
# define palette
c('#dd7cdc','#ea4d78','#d2d1ab','#f36d5a','#a5cfbc','#e4a5d0','#8470ff','#79cdcd','#808fee','#ffd39b')
iuc_colours <-
# plot pretty
tm_shape(liv_iuc) +
tm_fill(col='GRP_LABEL', palette=iuc_colours) +
tm_layout(legend.outside=TRUE)
Questions
- Looking at the interactive map of the 2018 IUC in Liverpool: are their any obvious patterns?
- How does the 2018 IUC pattern in Liverpool relate to a ‘typical’ organisation of a city? Why?
9.2.3 Assignment
Now we have these cluster classifications, how can we link them to people? Try using the Mid-Year Population Estimates 2019 that you can download below to:
- calculate the total number of people associated with each cluster group for Great Britain as a whole (not just Liverpool!); and
- create a pretty data visualisation showing the results (no map!).
No further instructions are provided and you will have to figure out on your own how to do this! We will discuss your choices and results in Friday’s seminar, so make sure to ‘bring’ your results along.
9.3 K-means clustering
In several cases, including the 2011 residential-based area classifications and the Internet User Classification, a technique called k-means clustering is used in the creation of a geodemographic classification. K-means clustering aims to partition a set of observations into a number of clusters (k), in which each observation will be assigned to the cluster with the neareast mean. As such, a cluster refers to a collection of data points aggregated together because of certain similarities (i.e. standardised scores of your input data). In order to run a k-means clustering, you first define a target number k of clusters that you want. The k-means algorithm subsequently assigns every observation to one of the clusters by finding the solution that minimises the total within-cluster variance.
9.3.2 Example: K-means clustering
For the second part of this week’s practical material, we will be replicating part of the Internet User Classification for Great Britain. For this we will be using an MSOA-level input dataset containing various socio-demographic and socio-economic variables that you can download below together with the MSOA administrative boundaries.
The dataset contains the following variables:
Variable | Definition |
---|---|
msoa11cd |
MSOA Code |
age_total , age0to4pc , age5to14pc , age16to24pc , age25to44pc , age45to64pc , age75pluspc |
Percentage of people in various age groups |
nssec_total , 1_higher_managerial , 2_lower_managerial , 3_intermediate_occupations , 4_employers_small_org , 5_lower_supervisory , 6_semi_routine , 7_routine , 8_unemployed |
Percentage of people in selected operational categories and sub-categories classes drawn from the National Statistics Socio-economic Classification (NS-SEC) |
avg_dwn_speed , avb_superfast , no_decent_bband , bband_speed_under2mbs , bband_speed_under10mbs , bband_speed_over30mbs |
Measures of broadband use and internet availability |
File download
File | Type | Link |
---|---|---|
MSOA-level input variables for IUC | csv |
Download |
Middle-layer Super Output Areas Great Britain 2011 | shp |
Download |
# load data
read_csv('raw/index/msoa_iuc_input.csv')
iuc_input <-
# inspect
head(iuc_input)
## # A tibble: 6 x 23
## msoa11cd age_total age0to4pc age5to14pc age16to24pc age25to44pc age45to64pc
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 E020000… 7375 0.032 0.0388 0.0961 0.407 0.273
## 2 E020000… 6775 0.0927 0.122 0.113 0.280 0.186
## 3 E020000… 10045 0.0829 0.102 0.118 0.306 0.225
## 4 E020000… 6182 0.0590 0.102 0.139 0.254 0.250
## 5 E020000… 8562 0.0930 0.119 0.119 0.299 0.214
## 6 E020000… 8791 0.103 0.125 0.129 0.285 0.197
## # … with 16 more variables: age75pluspc <dbl>, nssec_total <dbl>,
## # `1_higher_managerial` <dbl>, `2_lower_managerial` <dbl>,
## # `3_intermediate_occupations` <dbl>, `4_employers_small_org` <dbl>,
## # `5_lower_supervisory` <dbl>, `6_semi_routine` <dbl>, `7_routine` <dbl>,
## # `8_unemployed` <dbl>, avg_dwn_speed <dbl>, avb_superfast <dbl>,
## # no_decent_bband <dbl>, bband_speed_under2mbs <dbl>,
## # bband_speed_under10mbs <dbl>, bband_speed_over30mbs <dbl>
Before running our k-means clustering algorithm, we need to extract the data which we want to use; i.e. we need to remove all the columns with data that we do not want to include in the clustering process.
# column names
names(iuc_input)
## [1] "msoa11cd" "age_total"
## [3] "age0to4pc" "age5to14pc"
## [5] "age16to24pc" "age25to44pc"
## [7] "age45to64pc" "age75pluspc"
## [9] "nssec_total" "1_higher_managerial"
## [11] "2_lower_managerial" "3_intermediate_occupations"
## [13] "4_employers_small_org" "5_lower_supervisory"
## [15] "6_semi_routine" "7_routine"
## [17] "8_unemployed" "avg_dwn_speed"
## [19] "avb_superfast" "no_decent_bband"
## [21] "bband_speed_under2mbs" "bband_speed_under10mbs"
## [23] "bband_speed_over30mbs"
# extract columns by index
iuc_input[,c(3:8,10:17,18:20)]
cluster_data <-
# inspect
head(cluster_data)
## # A tibble: 6 x 17
## age0to4pc age5to14pc age16to24pc age25to44pc age45to64pc age75pluspc
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.032 0.0388 0.0961 0.407 0.273 0.0607
## 2 0.0927 0.122 0.113 0.280 0.186 0.0980
## 3 0.0829 0.102 0.118 0.306 0.225 0.0646
## 4 0.0590 0.102 0.139 0.254 0.250 0.0886
## 5 0.0930 0.119 0.119 0.299 0.214 0.0501
## 6 0.103 0.125 0.129 0.285 0.197 0.0688
## # … with 11 more variables: `1_higher_managerial` <dbl>,
## # `2_lower_managerial` <dbl>, `3_intermediate_occupations` <dbl>,
## # `4_employers_small_org` <dbl>, `5_lower_supervisory` <dbl>,
## # `6_semi_routine` <dbl>, `7_routine` <dbl>, `8_unemployed` <dbl>,
## # avg_dwn_speed <dbl>, avb_superfast <dbl>, no_decent_bband <dbl>
We also need to rescale the data so all input data are presented on a comparable scale: the average download speed data (i.e. avg_dwn_speed
) is very different to the other data that, for instance, represent the percentage of the population by different age groups.
# rescale
scale(cluster_data)
cluster_data <-
# inspect
head(cluster_data)
## age0to4pc age5to14pc age16to24pc age25to44pc age45to64pc age75pluspc
## [1,] -1.7579913 -2.8680309 -0.36823669 2.1768529 0.2561260 -0.6039803
## 1_higher_managerial 2_lower_managerial 3_intermediate_occupations
## [1,] 4.4185012 1.8391416 -2.2291104
## 4_employers_small_org 5_lower_supervisory 6_semi_routine 7_routine
## [1,] -1.02298662 -2.53130431 -2.49888739 -1.81145433
## 8_unemployed avg_dwn_speed avb_superfast no_decent_bband
## [1,] -0.5139854 -1.6561653 -5.2186970 -0.3797659
## [ reached getOption("max.print") -- omitted 5 rows ]
Questions
- What does the function
scale()
do exactly?
Now our data are all on the same scale, we will start by creating an elbow plot. The elbow method is a visual aid that can help in determining the number of clusters in a dataset. Remember: this is important because with a k-means clustering you need to specify the numbers of clusters a priori!
The elbow method can help as it plots the total explained variation (‘Within Sum of Squares’) in your data as a function of the number of cluster. The idea is that you pick the number of clusters at the ‘elbow’ of the curve as this is the point in which the additional variation that would be explained by an additional cluster is decreasing. Effectively this means you are actually running the k-means clustering multiple times before running the actual k-means clustering algorithm.
# create empty list to store the within sum of square values
list()
wss_values <-
# execute a k-means clustering for k=1, k=2, ..., k=15
for (i in 1:15) {
sum(kmeans(cluster_data,centers=i)$withinss)
wss_values[i] <- }
## Warning: did not converge in 10 iterations
## Warning: did not converge in 10 iterations
# inspect
wss_values
## [[1]]
## [1] 144143
##
## [[2]]
## [1] 110934.1
##
## [[3]]
## [1] 94944.45
##
## [[4]]
## [1] 82701.63
##
## [[5]]
## [1] 76194.75
##
## [[6]]
## [1] 67726.74
##
## [[7]]
## [1] 64459.49
##
## [[8]]
## [1] 58581.9
##
## [[9]]
## [1] 55914.71
##
## [[10]]
## [1] 53310.72
##
## [[11]]
## [1] 51384.7
##
## [[12]]
## [1] 49448.68
##
## [[13]]
## [1] 48258.77
##
## [[14]]
## [1] 46981.07
##
## [[15]]
## [1] 45948.9
# vector to dataframe
as.data.frame(wss_values)
wss_values <-
# transpose
as.data.frame(t(wss_values))
wss_values <-
# add cluster numbers
$cluster <- seq.int(nrow(wss_values))
wss_valuesnames(wss_values) <- c('wss','cluster')
# plot
ggplot(data=wss_values, aes(x=cluster,y=wss)) +
geom_point() +
geom_path() +
scale_x_continuous(breaks=seq(1,15)) +
xlab('number of clusters') +
ylab('within sum of squares')
Based on the elbow plot, we can now choose the number of clusters and it looks like 7 clusters would be a reasonable choice.
Note
The interpretation of an elbow plot can be quite subjective and often mutlltle options would be justified: 6, 8, and perhaps 9 clusters also do not look unreasonable. You would need to try the different options and see what output you get to determine the ‘optimal’ solution. However, at very least, the elbow plot does give you an idea of what would potentially be an adequate number of clusters.
Now we have decided on the number of clusters (i.e. 7 clusters), we can now run our cluster analysis. We will be running this analysis 10 times because there is an element of randomness within the clustering, and we want to make sure we get the best clustering output.
# create empty list to store the results of the clustering
list()
clusters <-
# create empty variable to store fit
NA
fit <-
# run the k-means 10 times
for (i in 1:10){
# keep track of the runs
print(paste0('starting run: ', i))
# run the k-means clustering algorithm to extract 7 clusters
kmeans(x=cluster_data, centers=7, iter.max=1000000, nstart=1)
clust7 <-
# get the total within sum of squares for the run and
clust7$tot.withinss
fit[i] <-
# update the results of the clustering if the total within sum of squares for the run
# is lower than any of the runs that have been executed so far
if (fit[i] < min(fit[1:(i-1)])){
clust7}
clusters <- }
## [1] "starting run: 1"
## [1] "starting run: 2"
## [1] "starting run: 3"
## [1] "starting run: 4"
## [1] "starting run: 5"
## [1] "starting run: 6"
## [1] "starting run: 7"
## [1] "starting run: 8"
## [1] "starting run: 9"
## [1] "starting run: 10"
# inspect
clusters
## K-means clustering with 7 clusters of sizes 1814, 2039, 2053, 186, 1107, 541, 740
##
## Cluster means:
## age0to4pc age5to14pc age16to24pc age25to44pc age45to64pc age75pluspc
## 1_higher_managerial 2_lower_managerial 3_intermediate_occupations
## 4_employers_small_org 5_lower_supervisory 6_semi_routine 7_routine
## 8_unemployed avg_dwn_speed avb_superfast no_decent_bband
## [ reached getOption("max.print") -- omitted 7 rows ]
##
## Clustering vector:
## [1] 7 5 2 2 5 5 5 5 5 2 2 5 5 5 5
## [ reached getOption("max.print") -- omitted 8465 entries ]
##
## Within cluster sum of squares by cluster:
## [1] 12871.008 11471.872 11736.830 3401.821 9429.647 6265.986 7786.692
## (between_SS / total_SS = 56.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# inspect
fit
## [1] 63126.42 64459.54 64459.46 64456.73 62963.86 64456.74 64456.74 62963.86
## [9] 64459.54 64456.73
Questions
- Do you understand what is happening in each step of this code, especially within the loop?
- Do you understand what the different parameters in the
kmeans()
function mean and what they do?
We now have to execute a bit of post-processing to extract some useful summary data for each cluster: the cluster size (size
) and mean values for each cluster.
# assign to new variable for clarity
clusters
kfit <-
# cluster sizes
kfit$size
kfit_size <-
# inspect
kfit_size
## [1] 1814 2039 2053 186 1107 541 740
# mean values for each variable in each cluster
as_tibble(aggregate(cluster_data,by=list(kfit$cluster),FUN=mean))
kfit_mean<-names(kfit_mean)[1] <- 'cluster'
# inspect
kfit_mean
## # A tibble: 7 x 18
## cluster age0to4pc age5to14pc age16to24pc age25to44pc age45to64pc age75pluspc
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 -0.679 -0.0412 -0.465 -0.773 0.765 0.856
## 2 2 0.0162 0.118 -0.172 0.0189 0.202 -0.103
## 3 3 -0.00815 -0.0824 -0.0526 -0.158 0.0919 0.164
## 4 4 -1.52 -2.68 5.42 0.113 -2.62 -1.28
## 5 5 1.56 1.14 0.381 0.626 -1.13 -0.954
## 6 6 -0.880 -0.148 -0.523 -0.998 1.22 0.554
## 7 7 0.326 -0.923 0.209 2.05 -1.23 -0.927
## # … with 11 more variables: `1_higher_managerial` <dbl>,
## # `2_lower_managerial` <dbl>, `3_intermediate_occupations` <dbl>,
## # `4_employers_small_org` <dbl>, `5_lower_supervisory` <dbl>,
## # `6_semi_routine` <dbl>, `7_routine` <dbl>, `8_unemployed` <dbl>,
## # avg_dwn_speed <dbl>, avb_superfast <dbl>, no_decent_bband <dbl>
# transform shape to tidy format
pivot_longer(kfit_mean, cols=(-cluster))
kfit_mean_long <-
# plot data
ggplot(kfit_mean_long, aes(x=cluster, y=value, colour=name)) +
geom_line () +
scale_x_continuous(breaks=seq(1,7,by=1)) +
theme_minimal() +
theme(legend.title = element_blank())
Looking at the table with the mean values for each cluster and the graph, we can see, for instance, that only cluster 2 shows a clear pattern with Internet usage. Your values may be slightly different because there is this element of randomness within the clustering. The graph is a little busy, so you might want to look at the cluster groups or variables individually to get a better picture of each cluster.
Questions
- Which clusters are primarily driven by age?
- Which clusters are primarily driven by the National Statistics Socio-economic Classification?
# read shape
st_read('raw/boundaries/gb_msoa_2011/gb_msoa11_sim.shp') msoa <-
## Reading layer `gb_msoa11_sim' from data source `/Users/Tycho/Dropbox/UCL/Web/jtvandijk.github.io/GEOG0114/raw/boundaries/gb_msoa_2011/gb_msoa11_sim.shp' using driver `ESRI Shapefile'
## Simple feature collection with 8480 features and 3 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 5513 ymin: 5342.7 xmax: 655604.7 ymax: 1220302
## CRS: 4326
# set projection
st_crs(msoa) = 27700
## Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
## that
# simplify for speedier plotting
st_simplify(msoa, dTolerance = 100)
msoa <-
# join
cbind(iuc_input,kfit$cluster)
cluster_data <- left_join(msoa,cluster_data,by=c('geo_code'='msoa11cd'))
msoa <-
# plot
tmap_mode('view')
tm_shape(msoa) +
tm_fill(col='kfit$cluster')