Is air pollution different in Southern Ontario compared to Northern Ontario?
A more specific question might be …
Are hourly PM-2.5 levels higher in Toronto compared to Thunder Bay in 2016?
Load the libraries used in the EDA.
The data source is MOE Ontario.
library(tidyverse)
library(stringr)
to_file <- "~/Dropbox/Docs/STA2453H/4002/Sta4002-2017/class2/airpollutantdata-ontario/torontopm25_clean.csv"
toronto_dat <- read_csv(to_file, col_types = cols(Date = col_datetime(format = "%Y-%m-%d"), `Station ID` = col_factor(levels = c("31103"))))
nrow(toronto_dat)
## [1] 366
ncol(toronto_dat)
## [1] 27
tb_file <-"~/Dropbox/Docs/STA2453H/4002/Sta4002-2017/class2/airpollutantdata-ontario/thunderbaypm25_clean.csv"
thunderbay_dat <- read_csv(tb_file, col_types = cols(Date = col_datetime(format = "%Y-%m-%d"), `Station ID` = col_factor(levels = c("63203"))))
nrow(thunderbay_dat)
## [1] 365
ncol(thunderbay_dat)
## [1] 27
More days recorded in Toronto versus Thuder Bay. Seems strange. How can we investigate?
A plot of the number of rows per date for each city.
thunderbay_dat %>% group_by(Date) %>% summarise(n=n()) %>% ggplot(aes(x=Date,y=n))+geom_count()
toronto_dat %>% group_by(Date) %>% summarise(n=n()) %>% ggplot(aes(x=Date,y=n))+geom_count()
Thunder Bay has six readings for one day in June.
Average over multiple readings for the same date.
tbdat_c <- thunderbay_dat %>% group_by(Date) %>% summarise_at(vars(starts_with("H")),mean,na.rm=T)
nrow(tbdat_c)
## [1] 360
ncol(tbdat_c)
## [1] 25
todat_c <- toronto_dat %>% group_by(Date) %>% summarise_at(vars(starts_with("H")),mean,na.rm=T)
nrow(todat_c)
## [1] 366
ncol(todat_c)
## [1] 25
Plot the data for Toronto and Thunder Bay.
todat_long <- todat_c %>% gather(hour,pm,H01:H24) %>% mutate(hour1=as.numeric(str_extract(hour,"\\d{2}")))
todat_long %>% ggplot(aes(x=Date,y=pm,group=hour,colour=hour))+geom_line()
todat_long %>% group_by(Date) %>% summarise(median_pm=median(pm)) %>% ggplot(aes(x=Date,y=median_pm))+geom_line()+labs(title="Toronto PM2.5")
tbdat_long <- tbdat_c %>% gather(hour,pm,H01:H24) %>% mutate(hour1=as.numeric(str_extract(hour,"\\d{2}")))
ggplot(tbdat_long,aes(x=Date,y=pm,group=hour,colour=hour))+geom_line()
tbdat_long %>% group_by(Date) %>% summarise(median_pm=median(pm)) %>% ggplot(aes(x=Date,y=median_pm))+geom_line()+labs(title="Thunder Bay PM2.5")
Toronto seems to have higher values, but it’s hard to tell from separate plots.
todat <- todat_long %>% mutate(city="Toronto")
tbdat <- tbdat_long %>% mutate(city="Thunder Bay")
alldat <- rbind(todat,tbdat)
alldat %>% group_by(Date,city) %>% summarise(median_pm=median(pm)) %>% ggplot(aes(x=Date,y=median_pm,colour=city))+geom_line()+labs(title="Thunder Bay and Toronto PM2.5")
The original question was: Are air pollution levels higher in Southern Ontario compared to Northern Ontario?
Let’s try to an analysis that yields a simple answer.
alldat %>% group_by(Date,city) %>% summarise(median_pm=median(pm)) %>% ggplot(aes(city,median_pm))+geom_boxplot()
## Warning: Removed 37 rows containing non-finite values (stat_boxplot).
daily_pm <- alldat %>% group_by(Date,city) %>% summarise(median_pm=median(pm))
t.test(median_pm~city,daily_pm)
##
## Welch Two Sample t-test
##
## data: median_pm by city
## t = -8.251, df = 576.1, p-value = 1.073e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.369843 -1.458528
## sample estimates:
## mean in group Thunder Bay mean in group Toronto
## 4.530259 6.444444
Is a difference in PM2.5 of 1.91 meaningful?
Comparing boxplots and the t-test gives limited information. The current standard for 24-hour PM2.5 is below \(28 \mu \text g / m^3\). The spaghetti plots indicate that all days are below this level for both cities. Thunder Bay did not experience a day in 2016 with a 24-hour PM2.5 above \(15 \mu \text g / m^3\), but Toronto had two days above \(20 \mu \text g / m^3\), but below \(28 \mu \text g / m^3\).
Does the simple solution really indicate that Toronto has worse air pollution compared to Thunder Bay?