2 Statistical Analysis I

Geographers are often interested not only in a specific variable within an area but also in how that variable varies between different areas. Building on last week’s refresher on using R and RStudio for quantitative data analysis, this week we will focus on making inferences about different groups.

2.1 Lecture slides

You can download the slides of this week’s lecture here: [Link].

2.2 Reading list

Essential readings

  • Field, A. Discovering Statistics using R, Chapter 9: Comparing two means, pp. 359-397. [Link]
  • Field, A. Discovering Statistics using R, Chapter 10: Comparing several means: ANOVA (GLM1), pp. 398-461. [Link]

Suggested readings

  • Field, A. Discovering Statistics using R, Chapter 15: Non-parametric tests, pp. 653-695. [Link]

2.3 Unemployment in London

In this week’s tutorial, we will explore unemployment rates in London as reported in the 2021 Census. In addition, we will use of last week’s dataset on age groups in London for the homework task. You can download both files using the links provided below. Make sure to save the files in your project folder in the data directory.

File Type Link
London LSOA Census 2021 Age Groups csv Download
London LSOA Census 2021 Economic Status csv Download

To download a csv file that is hosted on GitHub, click on the Download raw file button on the top right of your screen and it should download directly to your computer.

To get started, open your GEOG0018 R Project and create a new script: File -> New File -> R Script. Save your script as w07-employment-age-analysis.r.

We will start by loading the libraries that we will need:

R code
# load libraries
library(tidyverse)
library(janitor)

2.3.1 Data loading

Next, we can load both the London-LSOA-AgeGroup.csv and London-LSOA-EconomicStatus.csv file into R.

R code
# load age data
lsoa_age <- read_csv('data/London-LSOA-AgeGroup.csv')
Rows: 24970 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Lower layer Super Output Areas Code, Lower layer Super Output Areas...
dbl (2): Age (5 categories) Code, Observation

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# load eployment data
lsoa_emp <- read_csv('data/London-LSOA-EconomicStatus.csv')
Rows: 19976 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Lower layer Super Output Areas Code, Lower layer Super Output Areas...
dbl (2): Economic activity status (4 categories) Code, Observation

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Because we already worked with the lsoa_age dataset last week, we will shift our focus to the lsoa_emp dataset. Now we have loaded the data into R, we will start by inspecting the dataset to understand its structure and the variables it contains:

R code
# inspect
head(lsoa_emp)
# A tibble: 6 × 5
  Lower layer Super Output Areas…¹ Lower layer Super Ou…² Economic activity st…³
  <chr>                            <chr>                                   <dbl>
1 E01000001                        City of London 001A                        -8
2 E01000001                        City of London 001A                         1
3 E01000001                        City of London 001A                         2
4 E01000001                        City of London 001A                         3
5 E01000002                        City of London 001B                        -8
6 E01000002                        City of London 001B                         1
# ℹ abbreviated names: ¹​`Lower layer Super Output Areas Code`,
#   ²​`Lower layer Super Output Areas`,
#   ³​`Economic activity status (4 categories) Code`
# ℹ 2 more variables: `Economic activity status (4 categories)` <chr>,
#   Observation <dbl>
# inspect column names
names(lsoa_emp)
[1] "Lower layer Super Output Areas Code"         
[2] "Lower layer Super Output Areas"              
[3] "Economic activity status (4 categories) Code"
[4] "Economic activity status (4 categories)"     
[5] "Observation"                                 
# inspect economic status categories
unique(lsoa_emp$`Economic activity status (4 categories)`)
[1] "Does not apply"                                                   
[2] "Economically active: In employment (including full-time students)"
[3] "Economically active: Unemployed (including full-time students)"   
[4] "Economically inactive"                                            

The economic status categories in the lsoa_emp dataset represent the number of individuals and their economic activity status as reported in the 2021 Census. The Office for National Statistics (ONS) provides definitions for each of these statuses in their Census 2021 data dictionary.

You can further inspect the dataset using the View() function.

2.3.2 Data manipulation

Because the lsoa_emp dataset was extracted using the Custom Dataset Tool, we need to convert the data from long to wide format. Like last week, we can do this by pivoting the table:

R code
# clean names 
lsoa_emp <- lsoa_emp |>
  clean_names()

# pivot wider
lsoa_emp <- lsoa_emp |>
  pivot_wider(id_cols = c('lower_layer_super_output_areas_code', 'lower_layer_super_output_areas'),
              names_from = 'economic_activity_status_4_categories',
              values_from = 'observation') 

# clean names
lsoa_emp <- lsoa_emp |>
  clean_names()

If your clean_names() function returns an error, it is likely due to a conflict with another library that also includes a clean_names() function. In such cases, R cannot determine which one to use. To resolve this, you can specify the library explicitly by using janitor::clean_names().

The new column names of the lsoa_emp data are quite lengthy, so let us manually assign more concise column names for ease of use. This will help simplify our code and make it easier to reference the columns in subsequent analyses. Here is how we can do that:

R code
# names 
names(lsoa_emp)[3:6] <- c('not_applicable', 'active_employed', 'active_unemployed', 'inactive')

To account for the non-uniformity of the areal units, we further need to convert the observations to proportions again:

R code
# calculate total population
lsoa_emp <- lsoa_emp |>
  rowwise() |>
  mutate(lsoa_pop = sum(across(3:6)))

# calculate proportions
lsoa_emp_prop <- lsoa_emp |>
  mutate(across(3:6, ~./lsoa_pop))

You can further inspect the results using the View() function.

2.3.3 Data comparison

Now that we have our data prepared, we can begin comparing different groups within it. Various statistical methods exist for this purpose, depending on the nature of the data and the number of groups involved.

For comparing two groups with continuous data, we can use the t-test, which assesses whether the means of the two groups are significantly different from each other. If the data does not meet the assumptions of normality, the Mann-Whitney U test serves as a non-parametric alternative, comparing the ranks of the values instead.

Parametric tests assume that the data follows a specific distribution, usually normal, and rely on parameters such as mean and variance. In contrast, non-parametric tests do not assume a particular distribution and are used for ordinal data or when assumptions of parametric tests are violated.

When we want to compare more than two groups, we can use the Analysis of Variance (ANOVA) test. An ANOVA evaluates whether there are any statistically significant differences between the means of multiple groups. If the assumptions of ANOVA are violated, we can use the Kruskal-Wallis test, which is a non-parametric method that compares the ranks across the groups.

2.3.3.1 Comparing two groups

Two compare two groups with continuous data, we can run an independent samples t-test, which allows us to statistically assess whether the means of the two groups differ significantly. For instance, we might be interested in whether the proportion of the population that is unemployed in Lower Super Output Areas (LSOAs) in the London Borough of Camden is different, on average, from the proportion of the population that is unemployed in the London Borough of Sutton.

Let us begin by creating two separate datasets to facilitate our comparison: one containing the LSOAs for Camden and containing the LSOAs for Sutton. We can do this by filtering the lsoa_emp dataset:

R code
# filter out camden lsoas
lsoa_emp_camden <- lsoa_emp_prop |>
  filter(str_detect(lower_layer_super_output_areas, 'Camden'))

# filter out sutton lsoas
lsoa_emp_sutton <- lsoa_emp_prop |>
  filter(str_detect(lower_layer_super_output_areas, 'Sutton'))

We can use the t.test() function in R to conduct a two-tailed test, assessing whether the mean proportion of unemployed people in Camden differs from that in Sutton The null hypothesis for this test states that there is no difference in the mean proportions of unemployed individuals between the two areas. We can run the test as follows:

R code
# run t-test
t.test(lsoa_emp_camden$active_unemployed, lsoa_emp_sutton$active_unemployed)

    Welch Two Sample t-test

data:  lsoa_emp_camden$active_unemployed and lsoa_emp_sutton$active_unemployed
t = 11.279, df = 188.26, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.01303670 0.01856346
sample estimates:
 mean of x  mean of y 
0.04220920 0.02640912 

We can see that the mean in Sutton LSOAs (2.64%) is lower than in Camden LSOAs (4.22%). Since the \(p\)-value is < 0.05 we can conclude reject the null hypothesis that means for Sutton and Camden are the same, and accept the alternative hypothesis.

The non-parametric alternative to the t-test is the Mann-Whitney U test (also known as the Wilcoxon rank-sum test). This test is used to compare differences between two independent groups when the assumptions of the t-test (such as normality of the data) are not met.

The Mann-Whitney U test evaluates whether the distributions of the two groups differ by ranking all the data points and then comparing the sums of these ranks, making it suitable for ordinal data or non-normally distributed continuous data.

If we think our data violates any of the assumption underlying the t-test, we can run the Mann-Whitney U (Wilcoxon rank-sum) test instead:

R code
# run mann-whitney
wilcox.test(lsoa_emp_camden$active_unemployed, lsoa_emp_sutton$active_unemployed)

    Wilcoxon rank sum test with continuity correction

data:  lsoa_emp_camden$active_unemployed and lsoa_emp_sutton$active_unemployed
W = 13295, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

Also the results of the Mann-Whitney U test suggest that the difference in means in these two London boroughs is statistically significant.

Since the \(p\)-value is < 0.05 we can reject the null hypothesis that means for Sutton and Camden are the same, and accept the alternative hypothesis.

To determine if a variable is normally distributed, visual inspection can be helpful, but the Shapiro-Wilk test (shapiro.test()) offers a more formal statistical assessment. This test compares the distribution of your variable to a normal distribution, with the null hypothesis stating that they are the same. If the test returns a small \(p\)-value (typically < 0.05), you can conclude that the variable is not normally distributed. However, it is important to note that the test is significantly influenced by sample size; in large datasets, even minor deviations from normality can yield a low \(p\)-value. The Shapiro-Wilk test should therefore not be considered definitive and should be used alongside visual assessment of the data.

2.3.3.2 Comparing more than two groups

If we want to compare more than two groups, we can use an Analysis of Variance (ANOVA) test. An ANOVA is a statistical method used to compare the means of three or more groups to determine if at least one group mean significantly differs from the others. For instance, if we extend our analysis from Camden and Sutton to include the London Borough of Hammersmith and Fulham, we can test for differences in means among these three areas.

The null hypothesis of the ANOVA is that all group means are equal, while the alternative hypothesis posits that at least one group mean is different. ANOVA assesses the variance within groups and between groups to calculate the F-statistic, which is the ratio of the variance between the groups to the variance within the groups. A higher F-statistic indicates that group means are not all the same. The \(p\)-value derived from the F-statistic tells us whether we can reject the null hypothesis.

To run an ANOVA in R, we first need to ensure that all the groups we are comparing are contained within the same dataframe. We need a dataframe where one column represents the dependent variable (e.g. unemployment rates) and another column represents the independent grouping variable (e.g. borough names).

We can do this by filtering our LSOA dataframe to include only those LSOAs that fall within Camden, Sutton, and Hammersmith and Fulham. To create the grouping variable, we need to extract the borough name from the lower_layer_super_output_areas column. We can do this by creating a new variable that excludes the last five characters of the lower_layer_super_output_areas variable.

R code
# filter 
lsoa_subset <- lsoa_emp_prop |>
  filter(str_detect(lower_layer_super_output_areas, 'Camden') |
         str_detect(lower_layer_super_output_areas, 'Sutton') |
         str_detect(lower_layer_super_output_areas, 'Hammersmith and Fulham'))

# extract borough
lsoa_subset <- lsoa_subset |>
  mutate(borough_name = substr(lower_layer_super_output_areas, 1, nchar(lower_layer_super_output_areas) - 5))

You can further inspect the results using the View() function.

We can now run the ANOVA:

R code
# run anova
anova_result <- aov(active_unemployed ~ borough_name, data = lsoa_subset)

# summary
summary(anova_result)
              Df  Sum Sq  Mean Sq F value Pr(>F)    
borough_name   2 0.01968 0.009842   75.31 <2e-16 ***
Residuals    365 0.04770 0.000131                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA output shows the F-statistic with a value 75.31, with a \(p\)-value < 0.001. We can conclude that there are significant differences in unemployment rates between the boroughs. This means we reject the null hypothesis that all group means are equal.

What this does not tell us is which group are significantly different from another. We can conduct a post hoc Tukey test to identify which specific groups differ from another:

R code
# run post hoc tukey test
tukey_result <- TukeyHSD(anova_result)

# summary
tukey_result
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = active_unemployed ~ borough_name, data = lsoa_subset)

$borough_name
                                       diff          lwr          upr     p adj
Hammersmith and Fulham-Camden -0.0006520551 -0.004096128  0.002792018 0.8964024
Sutton-Camden                 -0.0158000776 -0.019184199 -0.012415956 0.0000000
Sutton-Hammersmith and Fulham -0.0151480225 -0.018637794 -0.011658251 0.0000000

The output presents pairwise comparisons between the groups, with the null hypothesis stating that the group means are equal. Our analysis indicates significant differences in unemployment rates between Sutton and Camden, as well as between Sutton and Hammersmith and Fulham. However, there is no evidence of a difference in unemployment rates between Camden and Hammersmith and Fulham.

If we think our data violates any of the assumption underlying the ANOVA, we can run the non-parametric Kruskal-Wallis test instead.

The Kruskal-Wallis test is a non-parametric statistical method used to compare the medians of three or more independent groups. It is an alternative to one-way ANOVA when the assumptions of normality and homogeneity of variances are not met. The test ranks all the data points and evaluates whether the rank sums of the groups differ significantly.

R code
# run kruskal-wallis
kruskal.test(active_unemployed ~ borough_name, data = lsoa_subset)

    Kruskal-Wallis rank sum test

data:  active_unemployed by borough_name
Kruskal-Wallis chi-squared = 117.37, df = 2, p-value < 2.2e-16

The results of the Kruskal-Wallis test suggest that the differences in medians among these three London boroughs are statistically significant.

To further investigate which specific groups differ, we can conduct a series of pairwise Mann-Whitney U tests for comparisons between each pair of boroughs.

2.4 Homework task

This concludes this week’s tutorial. Now complete the following homework tasks making use of both the lsoa_emp and lsoa_age datasets:

  1. Use an appropriate statistical test to determine whether the average proportion of the employed population differs between LSOAs in the London Borough of Bexley and the London Borough of Harrow.
  2. Apply the relevant test to compare the proportion of people aged 50 years and over across the London Boroughs of Lambeth, Southwark, and Westminster.

Paste the test outputs in the appendix of your assignment, include a few sentences interpreting the results.

2.5 Before you leave

This week, we explored how to statistically compare different groups to assess whether they are different, using both parametric and non-parametric methods. Next week, we will shift our focus from group comparisons to analysing associations and relationships between variables. Looking forward? Good. For now, take some time to unwind and get ready for what is next!