9 Geodemographics

9.1 Introduction

After last week’s focus on using spatial analysis for the creation and analysis of novel datasets within data scarce environments, this week we will look at data rich environments and how we can use socio-demographic and socio-economic data to characterise neighbourhoods using geodemographics.

Geodemographics is the ‘analysis of people by where they live’ (Harris et al. 2005) and as such entails representing the individual and collective identities that are manifest in observable neighbourhood structure (Longley 2012)

We will look at geodemographics by focusing on a geodemographic classification known as the Internet User Classification.

This week is structured around two short lecture videos, two assignments (one on data visualisation and one on automation that you need to do in preparation for Friday’s seminar, and the practical material. As always, this week’s reading list is available on the UCL library reading list page for the course. Good luck!

9.1.1 Video: Overview

[Lecture slides] [Watch on MS stream]

9.2 Geodemographics and the Internet User Classification

The CDRC Internet User Classification (IUC) is a bespoke geodemographic classification that describes how people residing in different parts of Great Britain interact with the Internet. For every Lower Super Output Area (LSOA) in England and Wales and Data Zone (DZ) (2011 Census Geographies), the IUC provides aggregate population estimates of Internet use (Singleton et al. 2020) and provides insights into the geography of the digital divide in the United Kingdom.

“Digital inequality is observable where access to online resources and those opportunities that these create are non-egalitarian. As a result of variable rates of access and use of the Internet between social and spatial groups (..), this leads to digital differentiation, which entrenches difference and reciprocates digital inequality over time (Singleton et al. 2020).”

9.2.1 Video: Geodemographics and the Internet User Classification

[Lecture slides] [Watch on MS stream]

9.2.2 Example: Geodemographics and the Internet User Classification

For the first part of this week’s practical material, we will be looking at the Internet User Classification (IUC) for Great Britain in more detail by mapping it.

Our first step is to download the IUC dataset:

Open a web browser and go to the data portal of the CDRC.
Register if you need to, or if you are already registered, make sure you are logged in.
Search for Internet User Classification.
Scroll down and choose the download option for the IUC 2018 (CSV).
Save the iuc_gb_2018.csv file in an appropriate folder.

Figure 9.1: Download the GB IUC 2018.

Start by inspecting the dataset in MS Excel, or any other spreadsheet software such as Apache OpenOffice Calc or Numbers. Also, have a look at the IUC 2018 User Guide that provides the pen portraits of every cluster, including plots of cluster centres and a brief summary of the methodology.

Note
It is always a good idea to inspect your data prior to analysis to find out how your data look like. Of course, depending on the type of data, you can choose any tool you like to do this inspection (ArcGIS, R, QGIS, Microsoft Excel, etc.).

Figure 9.2: GB IUC 2018 in Excel.

# load libraries
library(tidyverse)
library(tmap)

# load data
iuc <- read_csv('raw/index/iuc_gb_2018.csv')

# inspect
iuc

## # A tibble: 41,729 x 5
##    SHP_ID LSOA11_CD LSOA11_NM              GRP_CD GRP_LABEL                    
##     <dbl> <chr>     <chr>                   <dbl> <chr>                        
##  1      1 E01020179 South Hams 012C             5 e-Rational Utilitarians      
##  2      2 E01033289 Cornwall 007E               9 Settled Offline Communities  
##  3      3 W01000189 Conwy 015F                  5 e-Rational Utilitarians      
##  4      4 W01001022 Bridgend 014B               7 Passive and Uncommitted Users
##  5      5 W01000532 Ceredigion 007B             9 Settled Offline Communities  
##  6      6 E01018888 Cornwall 071G               9 Settled Offline Communities  
##  7      7 E01018766 Cornwall 028D               9 Settled Offline Communities  
##  8      8 E01019948 East Devon 010C             9 Settled Offline Communities  
##  9      9 W01000539 Ceredigion 005D             5 e-Rational Utilitarians      
## 10     10 E01019171 Barrow-in-Furness 005E      6 e-Mainstream                 
## # … with 41,719 more rows

# inspect data types
str(iuc)

## tibble [41,729 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ SHP_ID   : num [1:41729] 1 2 3 4 5 6 7 8 9 10 ...
##  $ LSOA11_CD: chr [1:41729] "E01020179" "E01033289" "W01000189" "W01001022" ...
##  $ LSOA11_NM: chr [1:41729] "South Hams 012C" "Cornwall 007E" "Conwy 015F" "Bridgend 014B" ...
##  $ GRP_CD   : num [1:41729] 5 9 5 7 9 9 9 9 5 6 ...
##  $ GRP_LABEL: chr [1:41729] "e-Rational Utilitarians" "Settled Offline Communities" "e-Rational Utilitarians" "Passive and Uncommitted Users" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   SHP_ID = col_double(),
##   ..   LSOA11_CD = col_character(),
##   ..   LSOA11_NM = col_character(),
##   ..   GRP_CD = col_double(),
##   ..   GRP_LABEL = col_character()
##   .. )

Questions

Create a histogram to inspect the distribution of the data. Looking at the histogram: is this what you expected? Why (not)?

Now the data are loaded we can move to acquiring our spatial data. As the IUC is created at the level of the Lower layer Super Output Area Census geography, we need to download its administrative borders. As the dataset for the entire country is quite large, we will focus on Liverpool for a change.

Go to the UK Data Service Census support portal and select Boundary Data Selector.
Set Country to England, Geography to Statistical Building Block, dates to 2011 and later, and click Find.
Select English Lower Layer Super Output Areas, 2011 and click List Areas.
Select Liverpool from the list and click Extract Boundary Data.
Wait until loaded and download the BoundaryData.zip file.
Unzip and save the file in the usual fashion.

Note
You could also have downloaded the shapefile with the data already joined to the LSOA boundaries directly from the CDRC data portal, but this is the national data set and is quite large (75MB). Also, as we will be looking at Liverpool today we do not need all LSOAs in Great Britain. Of course, if you prefer filtering Liverpool’s LSOAs out of the GB dataset please go ahead!

Now we got the administrative boundary data, we can prepare the IUC map by joining our csv file with the IUC classification to the shapefile.

# load libraries
library(sf)
library(tmap)

# load spatial data
liverpool <- st_read('raw/boundaries/liverpool_lsoa_2011/england_lsoa_2011.shp')

## Reading layer `england_lsoa_2011' from data source `/Users/Tycho/Dropbox/UCL/Web/jtvandijk.github.io/GEOG0114/raw/boundaries/liverpool_lsoa_2011/england_lsoa_2011.shp' using driver `ESRI Shapefile'
## Simple feature collection with 298 features and 3 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 332390.2 ymin: 379748.5 xmax: 345636 ymax: 397980.1
## CRS:            27700

# inspect
plot(liverpool$geometry)

# join data
liv_iuc <- merge(liverpool, iuc, by.x='code', by.y='LSOA11_CD')

# inspect
liv_iuc

## Simple feature collection with 298 features and 7 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 332390.2 ymin: 379748.5 xmax: 345636 ymax: 397980.1
## CRS:            27700
## First 10 features:
##        code                       label           name SHP_ID      LSOA11_NM
## 1 E01006512 E08000012E02001377E01006512 Liverpool 031A  26586 Liverpool 031A
## 2 E01006513 E08000012E02006932E01006513 Liverpool 060A  24660 Liverpool 060A
## 3 E01006514 E08000012E02001383E01006514 Liverpool 037A  27675 Liverpool 037A
##   GRP_CD           GRP_LABEL                       geometry
## 1      1 e-Cultural Creators POLYGON ((336203 390010, 33...
## 2      1 e-Cultural Creators POLYGON ((335402.8 390317.5...
## 3      1 e-Cultural Creators POLYGON ((335651.3 389926.8...
##  [ reached 'max' / getOption("max.print") -- omitted 7 rows ]

# inspect
tmap_mode('view')
tm_shape(liv_iuc) +
  tm_fill(col='GRP_LABEL') +
  tm_layout(legend.outside=TRUE)

Let’s use the same colours as used on CDRC mapmaker by specifying the hex colour codes for each of our groups. Note the order of the colours is important: the colour for group 1 is first, group 2 second and so on.

# define palette
iuc_colours <- c('#dd7cdc','#ea4d78','#d2d1ab','#f36d5a','#a5cfbc','#e4a5d0','#8470ff','#79cdcd','#808fee','#ffd39b')

# plot pretty
tm_shape(liv_iuc) +
  tm_fill(col='GRP_LABEL', palette=iuc_colours) +
  tm_layout(legend.outside=TRUE)

Questions

Looking at the interactive map of the 2018 IUC in Liverpool: are their any obvious patterns?
How does the 2018 IUC pattern in Liverpool relate to a ‘typical’ organisation of a city? Why?

9.2.3 Assignment

Now we have these cluster classifications, how can we link them to people? Try using the Mid-Year Population Estimates 2019 that you can download below to:

calculate the total number of people associated with each cluster group for Great Britain as a whole (not just Liverpool!); and
create a pretty data visualisation showing the results (no map!).

No further instructions are provided and you will have to figure out on your own how to do this! We will discuss your choices and results in Friday’s seminar, so make sure to ‘bring’ your results along.

File download

File	Type	Link
LSOA-level Mid-Year Population Estimates 2019	`csv`	Download
Lower-layer Super Output Areas Great Britain 2011	`shp`	Download

9.3 K-means clustering

In several cases, including the 2011 residential-based area classifications and the Internet User Classification, a technique called k-means clustering is used in the creation of a geodemographic classification. K-means clustering aims to partition a set of observations into a number of clusters (k), in which each observation will be assigned to the cluster with the neareast mean. As such, a cluster refers to a collection of data points aggregated together because of certain similarities (i.e. standardised scores of your input data). In order to run a k-means clustering, you first define a target number k of clusters that you want. The k-means algorithm subsequently assigns every observation to one of the clusters by finding the solution that minimises the total within-cluster variance.

9.3.1 Video: K-means clustering

[Lecture slides] [Watch on MS stream]

9.3.2 Example: K-means clustering

For the second part of this week’s practical material, we will be replicating part of the Internet User Classification for Great Britain. For this we will be using an MSOA-level input dataset containing various socio-demographic and socio-economic variables that you can download below together with the MSOA administrative boundaries.

The dataset contains the following variables:

Variable	Definition
`msoa11cd`	MSOA Code
`age_total`, `age0to4pc`, `age5to14pc`, `age16to24pc`, `age25to44pc`, `age45to64pc`, `age75pluspc`	Percentage of people in various age groups
`nssec_total`, `1_higher_managerial`, `2_lower_managerial`, `3_intermediate_occupations`, `4_employers_small_org`, `5_lower_supervisory`, `6_semi_routine`, `7_routine`, `8_unemployed`	Percentage of people in selected operational categories and sub-categories classes drawn from the National Statistics Socio-economic Classification (NS-SEC)
`avg_dwn_speed`, `avb_superfast`, `no_decent_bband`, `bband_speed_under2mbs`, `bband_speed_under10mbs`, `bband_speed_over30mbs`	Measures of broadband use and internet availability

File download

File	Type	Link
MSOA-level input variables for IUC	`csv`	Download
Middle-layer Super Output Areas Great Britain 2011	`shp`	Download

# load data
iuc_input <- read_csv('raw/index/msoa_iuc_input.csv')

# inspect
head(iuc_input)

## # A tibble: 6 x 23
##   msoa11cd age_total age0to4pc age5to14pc age16to24pc age25to44pc age45to64pc
##   <chr>        <dbl>     <dbl>      <dbl>       <dbl>       <dbl>       <dbl>
## 1 E020000…      7375    0.032      0.0388      0.0961       0.407       0.273
## 2 E020000…      6775    0.0927     0.122       0.113        0.280       0.186
## 3 E020000…     10045    0.0829     0.102       0.118        0.306       0.225
## 4 E020000…      6182    0.0590     0.102       0.139        0.254       0.250
## 5 E020000…      8562    0.0930     0.119       0.119        0.299       0.214
## 6 E020000…      8791    0.103      0.125       0.129        0.285       0.197
## # … with 16 more variables: age75pluspc <dbl>, nssec_total <dbl>,
## #   `1_higher_managerial` <dbl>, `2_lower_managerial` <dbl>,
## #   `3_intermediate_occupations` <dbl>, `4_employers_small_org` <dbl>,
## #   `5_lower_supervisory` <dbl>, `6_semi_routine` <dbl>, `7_routine` <dbl>,
## #   `8_unemployed` <dbl>, avg_dwn_speed <dbl>, avb_superfast <dbl>,
## #   no_decent_bband <dbl>, bband_speed_under2mbs <dbl>,
## #   bband_speed_under10mbs <dbl>, bband_speed_over30mbs <dbl>

Before running our k-means clustering algorithm, we need to extract the data which we want to use; i.e. we need to remove all the columns with data that we do not want to include in the clustering process.

# column names
names(iuc_input)

##  [1] "msoa11cd"                   "age_total"                 
##  [3] "age0to4pc"                  "age5to14pc"                
##  [5] "age16to24pc"                "age25to44pc"               
##  [7] "age45to64pc"                "age75pluspc"               
##  [9] "nssec_total"                "1_higher_managerial"       
## [11] "2_lower_managerial"         "3_intermediate_occupations"
## [13] "4_employers_small_org"      "5_lower_supervisory"       
## [15] "6_semi_routine"             "7_routine"                 
## [17] "8_unemployed"               "avg_dwn_speed"             
## [19] "avb_superfast"              "no_decent_bband"           
## [21] "bband_speed_under2mbs"      "bband_speed_under10mbs"    
## [23] "bband_speed_over30mbs"

# extract columns by index
cluster_data <- iuc_input[,c(3:8,10:17,18:20)]

# inspect
head(cluster_data)

## # A tibble: 6 x 17
##   age0to4pc age5to14pc age16to24pc age25to44pc age45to64pc age75pluspc
##       <dbl>      <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
## 1    0.032      0.0388      0.0961       0.407       0.273      0.0607
## 2    0.0927     0.122       0.113        0.280       0.186      0.0980
## 3    0.0829     0.102       0.118        0.306       0.225      0.0646
## 4    0.0590     0.102       0.139        0.254       0.250      0.0886
## 5    0.0930     0.119       0.119        0.299       0.214      0.0501
## 6    0.103      0.125       0.129        0.285       0.197      0.0688
## # … with 11 more variables: `1_higher_managerial` <dbl>,
## #   `2_lower_managerial` <dbl>, `3_intermediate_occupations` <dbl>,
## #   `4_employers_small_org` <dbl>, `5_lower_supervisory` <dbl>,
## #   `6_semi_routine` <dbl>, `7_routine` <dbl>, `8_unemployed` <dbl>,
## #   avg_dwn_speed <dbl>, avb_superfast <dbl>, no_decent_bband <dbl>

We also need to rescale the data so all input data are presented on a comparable scale: the average download speed data (i.e. avg_dwn_speed) is very different to the other data that, for instance, represent the percentage of the population by different age groups.

# rescale
cluster_data <- scale(cluster_data) 

# inspect
head(cluster_data)

##       age0to4pc age5to14pc age16to24pc age25to44pc age45to64pc age75pluspc
## [1,] -1.7579913 -2.8680309 -0.36823669   2.1768529   0.2561260  -0.6039803
##      1_higher_managerial 2_lower_managerial 3_intermediate_occupations
## [1,]           4.4185012          1.8391416                 -2.2291104
##      4_employers_small_org 5_lower_supervisory 6_semi_routine   7_routine
## [1,]           -1.02298662         -2.53130431    -2.49888739 -1.81145433
##      8_unemployed avg_dwn_speed avb_superfast no_decent_bband
## [1,]   -0.5139854    -1.6561653    -5.2186970      -0.3797659
##  [ reached getOption("max.print") -- omitted 5 rows ]

Questions

What does the function scale() do exactly?

Now our data are all on the same scale, we will start by creating an elbow plot. The elbow method is a visual aid that can help in determining the number of clusters in a dataset. Remember: this is important because with a k-means clustering you need to specify the numbers of clusters a priori!

The elbow method can help as it plots the total explained variation (‘Within Sum of Squares’) in your data as a function of the number of cluster. The idea is that you pick the number of clusters at the ‘elbow’ of the curve as this is the point in which the additional variation that would be explained by an additional cluster is decreasing. Effectively this means you are actually running the k-means clustering multiple times before running the actual k-means clustering algorithm.

# create empty list to store the within sum of square values 
wss_values <- list()

# execute a k-means clustering for k=1, k=2, ..., k=15
for (i in 1:15) {
  wss_values[i] <- sum(kmeans(cluster_data,centers=i)$withinss)
}

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

# inspect
wss_values

## [[1]]
## [1] 144143
## 
## [[2]]
## [1] 110934.1
## 
## [[3]]
## [1] 94944.45
## 
## [[4]]
## [1] 82701.63
## 
## [[5]]
## [1] 76194.75
## 
## [[6]]
## [1] 67726.74
## 
## [[7]]
## [1] 64459.49
## 
## [[8]]
## [1] 58581.9
## 
## [[9]]
## [1] 55914.71
## 
## [[10]]
## [1] 53310.72
## 
## [[11]]
## [1] 51384.7
## 
## [[12]]
## [1] 49448.68
## 
## [[13]]
## [1] 48258.77
## 
## [[14]]
## [1] 46981.07
## 
## [[15]]
## [1] 45948.9

# vector to dataframe
wss_values <- as.data.frame(wss_values)

# transpose
wss_values <- as.data.frame(t(wss_values))

# add cluster numbers
wss_values$cluster <- seq.int(nrow(wss_values))
names(wss_values) <- c('wss','cluster')

# plot
ggplot(data=wss_values, aes(x=cluster,y=wss)) +
  geom_point() +
  geom_path() + 
  scale_x_continuous(breaks=seq(1,15)) +
  xlab('number of clusters') +
  ylab('within sum of squares')

Based on the elbow plot, we can now choose the number of clusters and it looks like 7 clusters would be a reasonable choice.

Note
The interpretation of an elbow plot can be quite subjective and often mutlltle options would be justified: 6, 8, and perhaps 9 clusters also do not look unreasonable. You would need to try the different options and see what output you get to determine the ‘optimal’ solution. However, at very least, the elbow plot does give you an idea of what would potentially be an adequate number of clusters.

Now we have decided on the number of clusters (i.e. 7 clusters), we can now run our cluster analysis. We will be running this analysis 10 times because there is an element of randomness within the clustering, and we want to make sure we get the best clustering output.

# create empty list to store the results of the clustering
clusters <- list()

# create empty variable to store fit
fit <- NA
  
# run the k-means 10 times
for (i in 1:10){
  
  # keep track of the runs
  print(paste0('starting run: ', i))
  
  # run the k-means clustering algorithm to extract 7 clusters
  clust7 <- kmeans(x=cluster_data, centers=7, iter.max=1000000, nstart=1)
  
  # get the total within sum of squares for the run and 
  fit[i] <- clust7$tot.withinss
  
  # update the results of the clustering if the total within sum of squares for the run
  # is lower than any of the runs that have been executed so far 
  if (fit[i] < min(fit[1:(i-1)])){
    clusters <- clust7}
}

## [1] "starting run: 1"
## [1] "starting run: 2"
## [1] "starting run: 3"
## [1] "starting run: 4"
## [1] "starting run: 5"
## [1] "starting run: 6"
## [1] "starting run: 7"
## [1] "starting run: 8"
## [1] "starting run: 9"
## [1] "starting run: 10"

# inspect
clusters

## K-means clustering with 7 clusters of sizes 1814, 2039, 2053, 186, 1107, 541, 740
## 
## Cluster means:
##      age0to4pc  age5to14pc age16to24pc age25to44pc age45to64pc age75pluspc
##   1_higher_managerial 2_lower_managerial 3_intermediate_occupations
##   4_employers_small_org 5_lower_supervisory 6_semi_routine  7_routine
##   8_unemployed avg_dwn_speed avb_superfast no_decent_bband
##  [ reached getOption("max.print") -- omitted 7 rows ]
## 
## Clustering vector:
##  [1] 7 5 2 2 5 5 5 5 5 2 2 5 5 5 5
##  [ reached getOption("max.print") -- omitted 8465 entries ]
## 
## Within cluster sum of squares by cluster:
## [1] 12871.008 11471.872 11736.830  3401.821  9429.647  6265.986  7786.692
##  (between_SS / total_SS =  56.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

# inspect
fit

##  [1] 63126.42 64459.54 64459.46 64456.73 62963.86 64456.74 64456.74 62963.86
##  [9] 64459.54 64456.73

Questions

Do you understand what is happening in each step of this code, especially within the loop?
Do you understand what the different parameters in the kmeans() function mean and what they do?

We now have to execute a bit of post-processing to extract some useful summary data for each cluster: the cluster size (size) and mean values for each cluster.

# assign to new variable for clarity
kfit <- clusters

# cluster sizes
kfit_size <- kfit$size

# inspect
kfit_size

## [1] 1814 2039 2053  186 1107  541  740

# mean values for each variable in each cluster
kfit_mean<- as_tibble(aggregate(cluster_data,by=list(kfit$cluster),FUN=mean))
names(kfit_mean)[1] <- 'cluster'

# inspect
kfit_mean

## # A tibble: 7 x 18
##   cluster age0to4pc age5to14pc age16to24pc age25to44pc age45to64pc age75pluspc
##     <int>     <dbl>      <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
## 1       1  -0.679      -0.0412     -0.465      -0.773       0.765        0.856
## 2       2   0.0162      0.118      -0.172       0.0189      0.202       -0.103
## 3       3  -0.00815    -0.0824     -0.0526     -0.158       0.0919       0.164
## 4       4  -1.52       -2.68        5.42        0.113      -2.62        -1.28 
## 5       5   1.56        1.14        0.381       0.626      -1.13        -0.954
## 6       6  -0.880      -0.148      -0.523      -0.998       1.22         0.554
## 7       7   0.326      -0.923       0.209       2.05       -1.23        -0.927
## # … with 11 more variables: `1_higher_managerial` <dbl>,
## #   `2_lower_managerial` <dbl>, `3_intermediate_occupations` <dbl>,
## #   `4_employers_small_org` <dbl>, `5_lower_supervisory` <dbl>,
## #   `6_semi_routine` <dbl>, `7_routine` <dbl>, `8_unemployed` <dbl>,
## #   avg_dwn_speed <dbl>, avb_superfast <dbl>, no_decent_bband <dbl>

# transform shape to tidy format
kfit_mean_long <- pivot_longer(kfit_mean, cols=(-cluster))

# plot data  
ggplot(kfit_mean_long, aes(x=cluster, y=value, colour=name)) + 
  geom_line () +
  scale_x_continuous(breaks=seq(1,7,by=1)) +
  theme_minimal() +
  theme(legend.title = element_blank())

Looking at the table with the mean values for each cluster and the graph, we can see, for instance, that only cluster 2 shows a clear pattern with Internet usage. Your values may be slightly different because there is this element of randomness within the clustering. The graph is a little busy, so you might want to look at the cluster groups or variables individually to get a better picture of each cluster.

Questions

Which clusters are primarily driven by age?
Which clusters are primarily driven by the National Statistics Socio-economic Classification?

# read shape
msoa <- st_read('raw/boundaries/gb_msoa_2011/gb_msoa11_sim.shp')

## Reading layer `gb_msoa11_sim' from data source `/Users/Tycho/Dropbox/UCL/Web/jtvandijk.github.io/GEOG0114/raw/boundaries/gb_msoa_2011/gb_msoa11_sim.shp' using driver `ESRI Shapefile'
## Simple feature collection with 8480 features and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 5513 ymin: 5342.7 xmax: 655604.7 ymax: 1220302
## CRS:            4326

# set projection
st_crs(msoa) = 27700

## Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
## that

# simplify for speedier plotting
msoa <- st_simplify(msoa, dTolerance = 100) 

# join
cluster_data <- cbind(iuc_input,kfit$cluster)
msoa <- left_join(msoa,cluster_data,by=c('geo_code'='msoa11cd'))

# plot
tmap_mode('view')
tm_shape(msoa) +
  tm_fill(col='kfit$cluster')

Questions

Why can we just do a cbind() to join the cluster solution to the cluster data?

9.3.3 Assignment

The creation of a geodemographic classification is an iterative process of trial and error that involves the addition and removal of variables as well as experimentation with different numbers of clusters. It also might be, for instances, that some clusters are very focused on one group of data (e.g. age) and it could be an idea to group some ages together. If you want to make changes to the clustering solution, you can simply re-run the analysis with a different set of variables or with a different number of clusters by updating the code. However, it would be even simpler if you could automate some of the processs.

Try to create a R function that:

takes at least the following three arguments: a data frame that contains your input data, the number of clusters that you want, and a list of input variables;
executes a k-means clustering on the input variables and the specified number of clusters; and,
produces a csv file that contains the table of means of the solution.

Tips

The list of input variables does not have to be a list of names, but can also be a list containing the index values of the columns.
Google is still your friend.
Your function should look something like: automate_k_means(dataframe,number_of_clusters,input_variables)
Feel free to add optional variables that you need to specify if that makes it easier.
Have a look at Hadley Wickhams explanation of functions in R.

Coding tutorial vs trying it yourself

Following a coding tutorialpic.twitter.com/dXUfp0WWv7
— Tawanda Nyahuye👨 💻 (@towernter) November 28, 2020

Optional challenges

Include maps or graphs in the code that get automatically saved.
Ensure that the csv outcome does not get overwritten every time you run you function.
Include optional arguments in your function with default values if not specified.

No further instructions are provided and you will have to figure out on your own how to do this! We will discuss your automation solution in Friday’s seminar, so make sure to ‘bring’ your results along.

9.4 Take home message

Just like last week, this week we looked at how you can use spatial analysis for different types of applications within geographic and social data science-oriented research. Geodemographic classification are not only useful for descriptives, but can also be used as part of an analysis, for instance, by encoding geodemographic clusters as a dummy as part of a regression. A great example is the paper by Martin et al. 2017, part of this week’s reading list, that provides new representations of the structure of travel to work between residential and workplace area types. Similarly, the paper by Goodman et al. 2011 examines traffic-related air pollution in London in relation to area- and individual-level socio-economic position. Their analysis includes the Index of Multiple Deprivation as well as the commercially created postcode-level ACORN geodemographic classification. In these examples the geodemographic classifications are not an end product but very much an integral part of the analysis. That is it for this second to last week of term!

9.5 Attributions

This week’s content and practical uses content and inspiration from:

Bearman, Nick. 2020. Internet User Classification (IUC) and K-Means Clustering. Consumer Data Research Centre
Garbade, Michael. 2018. Understanding K-means clustering in machine learning. https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1

9.6 Feedback

Please take a moment to give us some feedback on this week’s content.