STA 2201S: Applied Statistics II Spring 2015
Final Project due April 15 11.59pm
The project report should be between three and five pages, and be a non-technical summary of your analysis. This will not include any code, but it may include tables and plots. You should make sure to have an introduction, to provide a detailed reference for the source of data, to state the scientific problem(s) of interest and your conclusions.
In a statistical appendix, describe the main statistical methods used and give a summary of the statistical results, including what models were considered, what models formed the basis for the report above, and why. In this appendix you can include code excerpts, additional plots, and tables, as needed.
Finally, an executable file (an R script, an R Markdown file, or a knitr file) is required that will enable me to reproduce the results used in your report. This file should include the data frame that you constructed from your dataset, so that I don't need to use read.table or read.csv.
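One way to make the script self-contained is to paste the output of dput() into it, so that the data frame is rebuilt by R code rather than read from a file. The following is only a minimal sketch: the object name `project_data` and its columns are made up for illustration and are not part of any course dataset.

```r
# Hypothetical example data; in practice this would be your cleaned project data.
project_data <- data.frame(
  dose   = c(0, 1, 2, 4),
  events = c(3L, 5L, 9L, 14L)
)

# dput() prints R code that rebuilds the object.
# Paste that printed output into your script in place of read.csv().
dput(project_data)

# Round trip: deparse the data frame to text, parse it back,
# and check that the rebuilt object matches the original.
rebuilt <- eval(parse(text = paste(deparse(project_data), collapse = "\n")))
stopifnot(isTRUE(all.equal(rebuilt, project_data)))
```

For a large dataset the pasted dput() output can be long, so it may be tidier to keep it in a clearly marked block at the top of the script.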
Homework 3
Due April 1, 11.59 pm on Blackboard. On the Blackboard web page you can find the assignment under "Course Materials".
- Questions (updated Mar 20 to correct typos in Q4)
- Latex source
- Paper for Q3
Homework 2
Due March 6, 11.59 pm on Blackboard. On the Blackboard web page you can find the assignment under "Course Materials".
- Here is a paper Archer found that discusses choosing between quasi-Poisson and negative binomial. If you use the ideas in this paper for your homework, be sure to include a reference.
- Q2(d). Q: can we choose between quasi-Poisson and negative binomial using AIC? A: I don't think you can use AIC for the quasi-Poisson, because there is not a genuine log-likelihood. I would rely on plots and on a study of the mean-variance relationship.
Q: If (ii) indicates that there is an association in one city but not in another, why would we be interested in (iii)? A: I *think* it could be the case in principle that you could have enough noise in the data that (iii) and (ii) could be compatible.
Archer found this resource, which is very clear. In particular, you might find it easier to think about the answers to the 3 parts by fitting sequences of Poisson GLMs of the form (D = disease; B = blood group; C = city):
D + B + C, DC + B, DB + C, D + BC, DB + BC, etc.
and figuring out how these sub-models link with the 3 parts of the question.
- Q2(a): You will want to refer to the AOAS paper for answering this question. It is not a standard generalized linear model of the type I described in class, unless \(\nu\) is considered fixed. So you can assume this for putting it in the GLM form. It is however a two-parameter exponential family, so if you interpret \(\theta = (\log\lambda, \nu)\), then the question can be answered as stated. Either version is fine.
- Q2(d): Thanks to Alex-Antoine for pointing out that the CMP model cannot be estimated using the Galapagos Island data.
I've revised the question, suggesting that you try the negative binomial model instead (which can be fit).
It's possible that a rate model is better for this data, if we think that the number of species might be proportional to the area of the island. Bonus marks for exploring this.
- In Q1, the notation \(\underline y\) means the vector of all the observations \((y_{111}, \dots, y_{JKL})\)
- Homework Questions (Feb 18: Q2(d) changed; Feb 13: typos corrected)
- Latex source
- Jager & Leek, for Q3
- Sellers & Shmueli, for Q2. This paper on generalized linear models with the Conway-Maxwell-Poisson distribution appeared in the Annals of Applied Statistics in 2010.
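The suggestion above to fit a sequence of Poisson GLMs in D, B and C can be sketched as follows. This is an illustration under stated assumptions, not the homework solution: the counts are invented, and "DC" is read here as the D-by-C interaction together with its main effects (`D * C` in R formula notation).

```r
# Hypothetical counts, for illustration only (not the homework data).
tab <- expand.grid(
  D = c("case", "control"),   # disease status
  B = c("A", "O"),            # blood group
  C = c("city1", "city2")     # city
)
tab$count <- c(30, 50, 20, 60, 45, 40, 35, 55)

# The sequence of log-linear sub-models, from mutual independence upwards.
fits <- list(
  "D + B + C"       = glm(count ~ D + B + C,           family = poisson, data = tab),
  "DC + B"          = glm(count ~ D * C + B,           family = poisson, data = tab),
  "DB + C"          = glm(count ~ D * B + C,           family = poisson, data = tab),
  "D + BC"          = glm(count ~ D + B * C,           family = poisson, data = tab),
  "DB + BC"         = glm(count ~ D * B + B * C,       family = poisson, data = tab),
  "DB + DC + BC"    = glm(count ~ D * B + D * C + B * C, family = poisson, data = tab)
)

# Residual deviance and degrees of freedom for each sub-model.
sapply(fits, function(f) c(deviance = deviance(f), df = df.residual(f)))

# Example of a nested comparison: does adding the D-by-B association help?
anova(fits[["D + B + C"]], fits[["DB + C"]], test = "Chisq")
```

Comparing residual deviances across nested pairs shows which associations the data support; mapping each sub-model to the three parts of the question is left as in the note above.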
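The remark on Q2(d) that AIC is not available for the quasi-Poisson fit can be checked directly in R. The sketch below uses simulated overdispersed counts (not the homework data), and the grouped mean-variance plot is only one possible way to study the mean-variance relationship; the number of groups is an arbitrary choice.

```r
library(MASS)  # provides glm.nb(); MASS is a recommended package shipped with R

set.seed(1)
# Simulated overdispersed counts, for illustration only.
x <- runif(200)
y <- rnbinom(200, mu = exp(1 + x), size = 2)

fit_qp <- glm(y ~ x, family = quasipoisson)
fit_nb <- glm.nb(y ~ x)

AIC(fit_qp)  # NA: the quasi-Poisson fit has no genuine log-likelihood
AIC(fit_nb)  # finite: based on the negative binomial likelihood

# Mean-variance check: bin observations by fitted mean, then compare
# the within-bin variance of y with the within-bin mean of y.
grp <- cut(fitted(fit_nb), quantile(fitted(fit_nb), 0:5 / 5), include.lowest = TRUE)
m <- tapply(y, grp, mean)
v <- tapply(y, grp, var)

plot(m, v, xlab = "group mean", ylab = "group variance")
abline(0, summary(fit_qp)$dispersion, lty = 2)      # quasi-Poisson: var = phi * mu
curve(x + x^2 / fit_nb$theta, add = TRUE, lty = 3)  # neg. binomial: var = mu + mu^2 / theta
```

A roughly linear pattern in the plot is more consistent with the quasi-Poisson variance function, while a curve that bends upward favours the negative binomial.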
Homework 1
- Marking Scheme
- Corrections and clarifications:
- On Jan.27, Q2 (b) and (d) were updated.
- Q3: Several students have asked: "what is the meaning of the main analysis of this endpoint?"
A: "Main analysis" means "what statistical analysis did they use to study this response". Often there is more than one, but one in particular leads to the result emphasized in the abstract and conclusions. If there is more than one, just say so.
- Q2(d): The HW sheet was changed on Jan. 27. As of today (Feb 3) you ONLY NEED to show the first part (with p's all equal).
- Q2(b): Use the result \(\sum y_i = \sum n_i \hat p_i\), which is true as long as the design matrix has a column of 1's.
- Homework Questions Updated Jan 27
- Latex for Homework Questions
- Reference paper for Q1
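The identity quoted for Q2(b) follows from the score equations. As a brief sketch, assuming a binomial model with the logit (canonical) link, \(\mathrm{logit}(p_i) = x_i^T\beta\), and writing \(x_{ij}\) for the entries of the design matrix:

\[
\ell(\beta) = \sum_i \bigl\{ y_i \log p_i + (n_i - y_i)\log(1 - p_i) \bigr\} + \text{const},
\qquad
\frac{\partial \ell}{\partial \beta_j} = \sum_i x_{ij}\,(y_i - n_i p_i).
\]

At the maximum likelihood estimate every component of the score is zero. If the design matrix has a column of 1's, say \(x_{i1} = 1\) for all \(i\), the corresponding equation is

\[
\sum_i (y_i - n_i \hat p_i) = 0
\quad\Longleftrightarrow\quad
\sum_i y_i = \sum_i n_i \hat p_i .
\]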
April 1
- Slides
- Leslie Beck on Vitamin D, Globe & Mail March 29
- Institute of Medicine's "explanation" of how the RDA for Vitamin D was determined
- André Picard, Globe & Mail
March 25
March 18
March 11
- Slides
- R script
- Jenny Bryan, again, this time with a Shiny App illustrating a catalogue of graphics and the R code to draw them
March 4
- Slides Part 1
- Slides Part 2
- RMarkdown file for Part 2
- Jenny Bryan's code to search CRAN for examples -- terrific!
- Unreliable research picture from the Economist
- the Cochrane Collaboration publishes reviews of the literature in health care and health policy
February 25
- Slides
- Just discovered these RStudio Cheatsheets -- Brilliant!
February 11
- Slides (updated Feb 16, using photos of blackboard)
- Measles web pages
- Royal Statistical Society's Significance Magazine
- National Health Service, UK, with links to published research
- Baird, et al. (2008) Case-control study finds no evidence of MMR vaccination link to autism
- The Lancet retracted the Wakefield 1998 paper in 2010.
February 4
- Slides
- iPad version
- Data Scientist "the sexiest job of the 21st century" Harvard Business Review
- Yihui Xie's web page for knitr
- More or Less podcasts on the BBC. "WS Global Wealth 24 Jan 15" discusses the Oxfam report. "WS Bad Luck and Cancer 10 Jan 15" reviews the Science article.
January 28
- Slides, which include links to many data sources
- Data Science and R
January 21
- Slides
- iPad annotations
- Economist news article on sea-level rise
- Nature paper referred to in the article
- R-Bloggers
January 14
- Slides
- Paper on teaching evaluations
- Cancer risk: Links to the articles in the NY Times, Economist, and Science, are given in the slides, but other posts include
- Science reporter's reflection on the original item in Science News
- David Spiegelhalter's explanation at the Understanding Uncertainty web page
- The Guardian's post led with "Please journalists, get a clue before you write about science."
- Similarly, this criticism is short and clear.
- A collection of links on this story has been published here.
- Of those, Thomas Lumley's is very clear, and helps sort out the log scale.
January 7
- Slides
- iPad slides annotated
- Buzzfeed article
- Information about knitr and Sweave
- R code to reproduce analysis in slides
- Report of the Presidential Commission on the Space Shuttle Challenger Accident. The O-ring data is about 1/3 of the way down this page.
Course Information
Text
Extending the Linear Model with R by J. Faraway.
Recommended
Statistical Models by A.C. Davison.
Principles of Applied Statistics by D.R. Cox and C.A. Donnelly.
Computing
You are welcome to use the statistical computing package of your choice, but I will refer exclusively to the R computing package. Some online resources that I've found helpful are:
- RStudio, an IDE for R that has many useful features
- Information about knitr and Sweave
- Some tips and tricks for RStudio by Paul Chang
- The official introduction, from CRAN
- R Reference card
- The R Cookbook
- John Verzani's online book simpleR.
- If you already know SAS/SPSS/Stata, you may find this Quick-R guide helpful.
- Thomas Lumley's R course notes are often recommended on the LinkedIn R Project
- Revolution Analytics has a list of several more R resources, including wikis and free online books.