Course Overview

The primary learning objectives are:

  1. to gain experience using data analysis to extract information.
  2. to gain experience communicating information that arises from a data analysis.

What is data analysis?

These quotes are from Tukey’s 1962 paper, The Future of Data Analysis.

… data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.

Data analysis … take on the characteristics of a science rather than those of mathematics …

… give general advice about the use of techniques as soon as there is reasonable ground to think the advice is sound; be prepared for a reasonable fraction (not too large) of cases of such advice to be generally wrong.

Pure mathematics differs from most human endeavor in that assumptions are not criticized because of their relation to something outside, though they are … often criticized as unaesthetic or as unnecessarily strong …

In data analysis we must look to a very heavy emphasis on judgement …

(a1) judgement based upon the experience of the particular field of subject matter from which the data come,

(a2) judgement based upon a broad experience with how particular techniques of data analysis have worked in a variety of fields of application,

(a3) judgement based upon abstract results about the properties of particular techniques, whether obtained by mathematical proofs or empirical sampling.

If one were to write down the steps in a data analysis, one might come up with something like the following list:

  • Defining the question
  • Defining the ideal dataset
  • Determining what data you can access
  • Obtaining the data
  • Cleaning the data
  • Exploratory data analysis
  • Statistical prediction/modeling
  • Interpretation of results
  • Challenging of results
  • Synthesis and write up
  • Creating reproducible code

Computing

Data acquisition

Data cleaning and wrangling

library(tidyverse)

# Simulated age data with one implausible value (120) to illustrate cleaning.
agedat <- tibble(age = c(25, 20, 21, 120, 19, 31, 90, 17),
                 sex = c(rep("Male", 4), rep("Female", 4)))
ggplot(agedat, aes(sex, age)) + geom_boxplot()

agedat_clean <- filter(agedat, age <= 100) # remove implausible ages > 100
ggplot(agedat_clean, aes(sex, age)) + geom_boxplot()

# Standardize age within each sex: (age - group mean) / group sd
agedat_std <- agedat_clean %>%
  group_by(sex) %>%
  summarise(mean = mean(age), sd = sd(age)) %>%
  full_join(agedat_clean, by = "sex") %>%
  mutate(std_age = (age - mean) / sd)
ggplot(agedat_std, aes(sex, std_age)) + geom_boxplot()
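To see what the group-wise standardization does, here is a minimal base-R sketch on the same cleaned ages (with the 120 entry removed): within each sex, the standardized values have mean 0 and standard deviation 1.

```r
# Base-R sketch of the group-wise standardization above,
# using the cleaned data (implausible age of 120 removed).
age <- c(25, 20, 21, 19, 31, 90, 17)
sex <- c(rep("Male", 3), rep("Female", 4))

# ave() applies the function within each level of sex.
std_age <- ave(age, sex, FUN = function(x) (x - mean(x)) / sd(x))

tapply(std_age, sex, mean)  # each ~0
tapply(std_age, sex, sd)    # each 1
```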

Methods

We won’t be learning about any particular methods in this course.

I expect that you:

  1. Are familiar with basic concepts in inference such as confidence intervals, p-values, and prediction.

  2. Can apply basic statistical methods for inference (e.g., general linear models) and prediction (e.g., linear and logistic regression) using a programming language such as R or Python.

  3. Are open to learning new methods and techniques with minimal guidance.

Data analysis - Case study

Introduction

The UofT administration is interested in learning about what people are saying about UofT on social media platforms. What type of image does the University of Toronto (UofT) have on social media?

Data Collection - Twitter

Twitter posts (tweets) were searched for the hashtag “#UofT”. The tweets were restricted to the time period 2017-09-05 through 2017-09-11 and to users located within 50km of the University of Toronto St. George campus.

# Set Twitter API credentials (stored in the *_nt variables, not shown).
library(twitteR)
library(tidytext)
consumer_key <- consumer_key_nt
consumer_secret <- consumer_secret_nt
access_token <- access_token_nt
access_secret <- access_secret_nt

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
## [1] "Using direct authentication"

# Search Twitter for #UofT tweets in the study window, within 50km of campus.
fn_twitter <- searchTwitter("#UofT", n = 1000, lang = "en",
                            since = '2017-09-05', until = '2017-09-11',
                            geocode = '43.662977,-79.395739,50km')

fn_twitter_df <- twListToDF(fn_twitter) # Convert list of tweets to a data frame

Data Cleaning and Wrangling - Twitter

The words within each tweet were tokenized. The 50 most frequent words were plotted to check for uncommon words that might not be in the stop-word list within the tidytext package. Stop words (common words in a language, such as “the” and “of”) were then removed.

library(tidytext)

tweet_words <- fn_twitter_df %>%
  select(id, text) %>%
  unnest_tokens(word, text)

# Plot the 50 most frequent words, ordered by decreasing count.
tweet_words %>%
  count(word, sort = TRUE) %>%
  slice(1:50) %>%
  ggplot(aes(x = reorder(word, -n), y = n)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  xlab("")