Question

Is air pollution different in Southern Ontario compared to Northern Ontario?

A more specific question might be …

Are hourly PM-2.5 levels higher in Toronto compared to Thunder Bay in 2016?

EDA

Load the libraries used in the EDA.

The data source is MOE Ontario.

library(tidyverse)
library(stringr)
to_file <- "~/Dropbox/Docs/STA2453H/4002/Sta4002-2017/class2/airpollutantdata-ontario/torontopm25_clean.csv"

toronto_dat <- read_csv(to_file, col_types = cols(Date = col_datetime(format = "%Y-%m-%d"), `Station ID` = col_factor(levels = c("31103"))))

nrow(toronto_dat)
## [1] 366
ncol(toronto_dat)
## [1] 27
tb_file <-"~/Dropbox/Docs/STA2453H/4002/Sta4002-2017/class2/airpollutantdata-ontario/thunderbaypm25_clean.csv"  

thunderbay_dat <- read_csv(tb_file, col_types = cols(Date = col_datetime(format = "%Y-%m-%d"), `Station ID` = col_factor(levels = c("63203"))))

nrow(thunderbay_dat)
## [1] 365
ncol(thunderbay_dat)
## [1] 27

More days recorded in Toronto versus Thuder Bay. Seems strange. How can we investigate?

A plot of the number of rows per date for each city.

thunderbay_dat %>% group_by(Date) %>% summarise(n=n()) %>% ggplot(aes(x=Date,y=n))+geom_count()

toronto_dat %>% group_by(Date) %>% summarise(n=n()) %>% ggplot(aes(x=Date,y=n))+geom_count()

Thunder Bay has six readings for one day in June.

Average over multiple readings for the same date.

tbdat_c <- thunderbay_dat %>% group_by(Date) %>% summarise_at(vars(starts_with("H")),mean,na.rm=T)

nrow(tbdat_c)
## [1] 360
ncol(tbdat_c)
## [1] 25
todat_c <- toronto_dat %>% group_by(Date) %>% summarise_at(vars(starts_with("H")),mean,na.rm=T)
nrow(todat_c)
## [1] 366
ncol(todat_c)
## [1] 25

Plot the data for Toronto and Thunder Bay.

todat_long <- todat_c %>% gather(hour,pm,H01:H24) %>% mutate(hour1=as.numeric(str_extract(hour,"\\d{2}")))

todat_long %>% ggplot(aes(x=Date,y=pm,group=hour,colour=hour))+geom_line()

todat_long %>% group_by(Date) %>% summarise(median_pm=median(pm)) %>% ggplot(aes(x=Date,y=median_pm))+geom_line()+labs(title="Toronto PM2.5")

tbdat_long <- tbdat_c %>% gather(hour,pm,H01:H24) %>% mutate(hour1=as.numeric(str_extract(hour,"\\d{2}")))

ggplot(tbdat_long,aes(x=Date,y=pm,group=hour,colour=hour))+geom_line()

tbdat_long %>% group_by(Date) %>% summarise(median_pm=median(pm)) %>% ggplot(aes(x=Date,y=median_pm))+geom_line()+labs(title="Thunder Bay PM2.5")

Toronto seems to have higher values, but it’s hard to tell from separate plots.

todat <- todat_long %>% mutate(city="Toronto")
tbdat <- tbdat_long %>% mutate(city="Thunder Bay")
alldat <- rbind(todat,tbdat)

alldat %>% group_by(Date,city) %>% summarise(median_pm=median(pm)) %>% ggplot(aes(x=Date,y=median_pm,colour=city))+geom_line()+labs(title="Thunder Bay and Toronto PM2.5")

Easy Solution

The original question was: Are air pollution levels higher in Southern Ontario compared to Northern Ontario?

Let’s try to an analysis that yields a simple answer.

alldat %>% group_by(Date,city) %>% summarise(median_pm=median(pm)) %>% ggplot(aes(city,median_pm))+geom_boxplot()
## Warning: Removed 37 rows containing non-finite values (stat_boxplot).

daily_pm <- alldat %>% group_by(Date,city) %>% summarise(median_pm=median(pm))
t.test(median_pm~city,daily_pm)
## 
##  Welch Two Sample t-test
## 
## data:  median_pm by city
## t = -8.251, df = 576.1, p-value = 1.073e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.369843 -1.458528
## sample estimates:
## mean in group Thunder Bay     mean in group Toronto 
##                  4.530259                  6.444444

Is a difference in PM2.5 of 1.91 meaningful?

Limitations to the Easy Solution

Comparing boxplots and the t-test gives limited information. The current standard for 24-hour PM2.5 is below \(28 \mu \text g / m^3\). The spaghetti plots indicate that all days are below this level for both cities. Thunder Bay did not experience a day in 2016 with a 24-hour PM2.5 above \(15 \mu \text g / m^3\), but Toronto had two days above \(20 \mu \text g / m^3\), but below \(28 \mu \text g / m^3\).

Does the simple solution really indicate that Toronto has worse air pollution compared to Thunder Bay?