Course Description

Students will gain experience with the data science process by working on projects. The process includes data collection, stating the question, data wrangling, data analysis, data interpretation, and communication. The projects will involve data collected by a collaborator (e.g., an organization or scientist), published data, or data scraped from web pages. All projects will involve some type of collaboration or communication. Students are expected to be familiar with the application of basic statistical methods used for inference (e.g., general linear models) and prediction (e.g., linear and logistic regression), and to be comfortable with basic data analysis using a programming language such as R or Python. Students will be expected to adopt a reproducible research workflow using tools such as R Markdown, R Notebook, or Jupyter Notebook.

Class time will be a mixture of informal lectures, class discussions, meetings with collaborators, and student presentations.

This course will not cover specific “methods”; nevertheless, it is important that students be able to independently learn and apply unfamiliar methods.

Evaluation

All work will be graded on a scale from 1 to 4 (sometimes with pluses and minuses) where:

| Grade value | Description |
|---|---|
| 1 | Work does not meet expectations. |
| 2 | Work meets expectations minimally, possibly missing some. |
| 3 | Good work; meets all or most expectations. |
| 4 | Excellent work; exceeds expectations. |

Grades will almost always be 2 or 3 (1’s and 4’s are rare). Generally speaking, a 2 is a B, a 3 is an A, and a 4 is an A+.

| Project | Item | Value |
|---|---|---|
| Project #1 | Proposal | 5% |
| | Draft report | 5% |
| | Final report | 10% |
| | Presentation on project #1 | 20% |
| Project #2 | Proposal | 5% |
| | Draft report | 5% |
| | Final report | 10% |
| | Presentation on project #2 | 20% |
| Participation | Attendance, participation in discussions, preparation for class | 20% |
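
The syllabus does not state how the item grades are combined into a course grade. As an illustration only, the sketch below assumes a simple weighted average of the 1 to 4 grades using the percentages listed above; the item keys and the function name `weighted_score` are hypothetical and not part of the course materials.

```python
# Weights (percent of course grade) copied from the evaluation breakdown.
# The dictionary keys are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "p1_proposal": 5,
    "p1_draft": 5,
    "p1_final": 10,
    "p1_presentation": 20,
    "p2_proposal": 5,
    "p2_draft": 5,
    "p2_final": 10,
    "p2_presentation": 20,
    "participation": 20,
}


def weighted_score(grades):
    """Return the weighted mean of item grades on the 1-4 scale.

    Assumption: the course grade is a plain weighted average, which the
    syllabus does not confirm. Every item in WEIGHTS must have a grade.
    """
    missing = set(WEIGHTS) - set(grades)
    if missing:
        raise ValueError(f"missing grades for: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[item] * grades[item] for item in WEIGHTS) / total_weight


# Example: a 3 on every item except a 4 on both presentations.
example = {item: 3 for item in WEIGHTS}
example["p1_presentation"] = 4
example["p2_presentation"] = 4
print(round(weighted_score(example), 2))  # prints 3.4
```

Under this assumed scheme the two presentations together carry 40% of the weight, the same as all six written deliverables combined.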

Tentative Course Schedule

| Class | Date | Description | Reading | Due |
|---|---|---|---|---|
| 1 | 09-12 | Introduction, data analysis case study | | |
| 2 | 09-19 | Data analysis, questions, web scraping, discuss ideas for project #1 | R. D. Peng and Matsui (2015) (1-3), Leek and Peng (2015) | |
| 3 | 09-26 | Exploratory data analysis, discuss ideas for project #1 | R. D. Peng and Matsui (2015) (4), Donoho (2015) | |
| 4 | 10-03 | Models | R. D. Peng and Matsui (2015) (5,6,7) | Project #1 proposal |
| 5 | 10-10 | Inference vs. prediction, discuss project #2 | R. D. Peng and Matsui (2015) (8), Breiman (2001) | Project #1 draft report |
| 6 | 10-17 | Project #1 presentations | | Project #1 report due, meet with collaborators by 10-27 |
| 7 | 10-24 | Project #1 presentations | | Project #2 proposals due |
| 8 | 10-31 | Project #2 check-in | | |
| - | 11-07 | No class - Fall reading week | | |
| 9 | 11-14 | Project #2 check-in | Lazer et al. (2014) | |
| 10 | 11-21 | Project #2 check-in and quick presentations | | Project #2 draft report |
| 11 | 11-28 | Project #2 presentations | | |
| 12 | 12-05 | Project #2 presentations | | Project #2 report due |

Reading References

Breiman, Leo. 2001. “Statistical Modeling: The Two Cultures (with Comments and a Rejoinder by the Author).” Statistical Science 16 (3). Institute of Mathematical Statistics: 199–231.

Donoho, David. 2015. “50 Years of Data Science.” http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf.

Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “Google Flu Trends Still Appears Sick: An Evaluation of the 2013-2014 Flu Season.”

Leek, Jeffery T., and Roger D. Peng. 2015. “What Is the Question?” Science 347 (6228). American Association for the Advancement of Science: 1314–5. doi:10.1126/science.aaa6146.

Peng, Roger D., and Elizabeth Matsui. 2015. “The Art of Data Science: A Guide for Anyone Who Works with Data.” Skybrude Consulting. https://bookdown.org/rdpeng/artofdatascience/.