STA 441s20 Final Exam

On Thursday April 16th around 3 pm, I posted some practice questions on the multinomial logit model, with answers.

As of Friday morning, the link to Quiz 9 is working.

On Friday afternoon, the answer to Q7 of the 2018 final was posted.

Summary

The exam will be Saturday April 18th, 1-4 pm.
There are 9 questions, on 10 pages including the cover page.
The exam will be online, in the form of a Quercus assignment. It will be accessible at 1 pm and due at 4 pm. You will download the exam and my SAS printouts in PDF format, and upload a single PDF with your answers to the Quercus website.
It would be nice if you could write your answers on the question paper somehow, but this is not required.
Questions on the exam will be like the quiz questions, and as you know, the quiz questions are like the homework.
The exam is open book and open notes. All materials on the course website are allowed. During the exam, you are requested not to consult with anyone, and not to access any websites other than the course website and SAS OnDemand.
You will be required to use SAS, but only to select sample size. You may use my code. You are advised to have SAS OnDemand up and ready to go at the start of the exam. You will append your log file(s) and results file(s) to the end of the exam file you upload.
Some practice sample size questions are given below.
There will be some other SAS questions on the exam, as well as non-SAS questions. The other SAS questions will be based on my input and output. They are worth 32 marks out of 100.
The SAS questions (apart from the sample size questions) will be based on some (but not all) of the data sets described below. To prepare, become familiar with the data sets, try some analyses, and understand the results. You will not use your output on the exam; questions will be based on my output. Every analysis I do will be fairly obvious, and similar to lecture and homework.
A formula sheet is now available on the course website.
I will be available in the Quercus Bb Collaborate course room during the exam to answer any questions and help straighten out technical problems.
Office hours will be 10:30-11:30 a.m., Wednesday April 15th through Friday April 17th, in the Quercus Bb Collaborate course room. More details are given below.

Office Hours

Office hours will be in the Quercus course room.

Thursday April 9th through Monday April 13th: 11 am - 12 pm. This goes through the weekend, and overlaps with various religious and semi-religious holidays. I hope nobody is offended.
No office hours Tuesday April 14th.
Wednesday April 15th through Friday April 17th: 10:30 am - 11:30 am.

Past Exams

2016
- Exam Question 7 Solution
- Printouts
2018
- Exam
- Printouts

Quiz solutions

Multinomial Logit Practice

Homework

This course is all about the homework. The homework tells you what I want you to be able to do. Lecture material is only useful to the extent that it helps you do the homework. The text may help too. It is less focused on what we are doing this time, but it is more detailed.

To study for the final, I recommend that you

Re-do the non-SAS parts of the homework.
1. For each assignment, locate the corresponding lecture slides. They are pretty much in chronological order (order of time). If this is a difficult task, you are not familiar enough with the course material.
2. Look at the lecture slides and the homework problems together. Observe how most of the homework problems are asking you to use some concept or method from the lecture. Of course sometimes I just want you to think about something, but most questions have a lesson.
3. Re-do the problems, referring to your earlier answers.
4. If you do not get what a problem means or what it is asking you to do, this means you should find out. You are missing something, and it could be on the final exam.
Using SAS, do something reasonable with the final data sets described below. What's reasonable? In my opinion, more or less what you did on the SAS part of the homework. However, there is more than one "right answer." The important thing is to become familiar with the data sets, try some analyses, and understand the results. You will not bring your output to the exam. Questions will be based on my output.

Practice Sample Size Questions

In a double-blind drug trial with an experimental group and a control group, we want to be 90% sure of detecting an effect on blood pressure if the true mean response to the drug is a quarter of a standard deviation above the control group. The plan is to use a 2-tailed t-test with the usual 0.05 significance level. What is the smallest sample size that will get this job done? (Using matpow1.sas, I got n = 675 which I increased to 676 to maintain equal sample sizes.)
In another version of the first question, suppose we want a power of 0.90 if the drug explains 1% of the population variation in blood pressure. What sample size is needed? Note that "remaining variation" in this case is just variation, because we are not controlling for anything. The same formulas apply. (Using popvar.sas, which writes on the log file, I got n = 1,045. There was no assumption of equal sample sizes.)
In yet another version of the 2-sample problem, what sample size is required if we want the difference to be significant provided it explains 1% of the sample variation? That's an R² of 0.01, which is tiny. (Using sampvar.sas, I get n = 385.)
Baby chickens are to be randomly assigned to one of four feed formulas, and then weighed after 6 weeks. Suppose the true (population) treatment means in grams are μ₁ = 220, μ₂ = 275, μ₃ = 250 and μ₄ = 330, with a common variance of σ² = 2,800. Based on an F-test with the usual α = 0.05 significance level, what total sample size is needed to have a power of 0.80? Assume equal sample sizes, so that your answer will be a multiple of 4. (Using matpow2.sas, I get n = 23 increased to n = 24 for equal sample sizes.)
Pigs are routinely given large doses of antibiotics even when they show no signs of illness, to protect their health under unsanitary conditions. Pigs will be randomly assigned to one of three antibiotics. The response variable will be dressed weight (weight of the pig after slaughter and removal of head, intestines and skin). Mother's and father's live adult weight will be used as covariates. Suppose that antibiotic explains 12% of the remaining variation in weight after taking parents' weight into account. What sample size is required for an F-test to detect this with probability 0.80? Equal sample sizes are not required. (Using popvar.sas, I get n = 79.)
In another version of the pig weight question above, what sample size is required for the F-test for antibiotic to be significant, provided that antibiotic explains at least 12% of the remaining sample variation? (Using sampvar.sas, I get n = 52.)
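
The two kinds of sample size question above (power for a given population effect, and significance for a given proportion of sample variation) can be sketched in a short data step. This is only an illustration under stated assumptions, not a copy of matpow1.sas or sampvar.sas: the data set names powersearch and sigsearch and all the variable names are made up for the example. It assumes the standard non-central F approach, with non-centrality parameter n × (1/2) × (1/2) × d² for two equal groups, and the identity F = ((n-p)/r) × a/(1-a), where a is the proportion of remaining sample variation explained, r is the number of betas tested and p is the total number of betas.

```sas
/* Sketch only: not the author's matpow1.sas or sampvar.sas.         */

/* (a) Power-based search: two-sample t-test with equal group sizes. */
/*     d is the difference between means in standard deviation units. */
data powersearch;
     alpha = 0.05; wantpow = 0.90; d = 0.25;
     do n = 4 to 100000;                         /* n is the total sample size     */
          ncp   = n * 0.5 * 0.5 * d**2;          /* Non-centrality parameter       */
          fcrit = finv(1-alpha, 1, n-2);         /* Critical value of F            */
          power = 1 - probf(fcrit, 1, n-2, ncp); /* Upper tail of non-central F    */
          if power >= wantpow then leave;        /* Stop at the first n that works */
     end;
     keep n ncp fcrit power;
run;

proc print data=powersearch;
run;

/* (b) Significance-based search: smallest n for which explaining a  */
/*     proportion a of remaining sample variation gives F >= fcrit.  */
data sigsearch;
     alpha = 0.05;
     a = 0.01;   /* Proportion of remaining sample variation explained */
     r = 1;      /* Number of betas being tested                       */
     p = 2;      /* Total number of betas, including the intercept     */
     do n = p+1 to 100000;
          fstat = (n-p)/r * a/(1-a);      /* F statistic implied by a */
          fcrit = finv(1-alpha, r, n-p);  /* Critical value of F      */
          if fstat >= fcrit then leave;   /* Stop at the first n that works */
     end;
     keep n fstat fcrit;
run;

proc print data=sigsearch;
run;
```

If these formulas match what the course programs do, part (a) should land close to the total n quoted for the first drug trial question, and part (b) close to the answer quoted for the third. For the second pig question, one would presumably set a = 0.12, r = 2 and p = 5 (intercept, two dummy variables for antibiotic, two covariates).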

Data Sets

Exam questions worth 32 points out of 100 will be based on my SAS output for at least two and at most four of the following data sets. Try some analyses. Look up any terminology that is unfamiliar, or you can ask in office hours (but why wait?). Understand what the variables are, because I will not be answering questions about the data sets during the exam. What I will do with the data is very predictable.

The Stat Class data: Gender, Race, Quiz average, Computer assignment average, Midterm score and Final Exam score from a statistics class, long ago.
The UCLA Grad School data: A researcher is interested in how variables such as GRE (Graduate Record Exam scores), GPA (grade point average) and prestige of the undergraduate institution affect admission into graduate school. The response variable, admit/don't admit, is a binary variable. The variables are admit, gre, gpa and rank. I think that rank must be prestige of the school.
The Program Choice data: Incoming high school students choose their programs of study. Variables are
- Gender: 0=Male, 1=Female
- Socioeconomic status: 1, 2, 3
- Math score
- Reading score
- Science score
- Social studies score
- Writing score
- Program choice: 1=general, 2=academic, 3=vocational
I will use the variable names given in the first line of the data file.
The Chicken Weight data: These data are from an experiment on the effect of diet on early growth of chickens. The body weights of the chickens were measured at birth and every second day thereafter until day 20. They were also measured on day 21. There were four groups of chicks on different protein diets. The variables are
- weight: Body weight in grams
- Time: Days since birth
- Chick: Identification code
- Diet: 1, 2, 3 or 4
I will use the variable names given in the first line of the data file.

The Self-Esteem data: Self-esteem is how good you feel about yourself. Data are the self-esteem scores of 12 individuals enrolled in two successive short-term (4-week) trials: a control (placebo) trial and a special diet trial. The self-esteem score was recorded at three time points: at the beginning (t1), midway (t2) and at the end (t3) of each trial. The same 12 participants were enrolled in both trials, with enough time between trials. This means that every subject was in both treatment conditions at all 3 time points.

Reading the data in a good way is a bit tricky. Here is my code, with some proc means and proc print statements that should help you see what's going on.

data psych; /* Multivariate data read */ 
     infile '/home/brunner0/441s20/selfesteem3.data.txt' firstobs=2;
     input   line1 id1 treatment1 $ Control1 Control2 Control3
             line2 id2 treatment2 $ Diet1    Diet2    Diet3;

proc means data = psych;
     var Diet1-Diet3 Control1-Control3;

data utility; /* Univariate data read */ 
     infile '/home/brunner0/441s20/selfesteem3.data.txt' firstobs=2;
     input line id Treatment $ t1 t2 t3; 
     /* But t1-t3 are still on the same line. */

proc means mean data=utility;
     class treatment; 
     var t1-t3;

data psych0;
     set utility;
     se = t1; Time = 1; output;
     se = t2; Time = 2; output;
     se = t3; Time = 3; output;
     label se = 'Self-Esteem';
     keep Treatment Time id se;

proc print data=utility;  /* One line per subject per treatment */
proc print data=psych0;   /* One line per subject per treatment per time */
run;
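
One quick way to check that the stacked data set psych0 was built correctly is to compare its cell means with the earlier proc means on utility. The sketch below is just a suggested check, not part of the original code. It assumes the data steps above have already been run, so that psych0 exists with the variables Treatment, Time and se.

```sas
/* Sketch: check the long-format data set psych0 against the wide one. */
proc means n mean data=psych0;
     class Treatment Time;   /* One cell for each treatment by time combination */
     var se;                 /* Self-esteem score */
run;
```

Because psych0 only stacks t1, t2 and t3 into the single variable se, the mean of se for each Treatment at Time = 1, 2 and 3 should equal the means of t1, t2 and t3 for that treatment from the proc means on utility. With 12 participants and no missing values, each cell should have n = 12.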

The Basketball Data: Right-handed basketball players take right and left-handed hook shots from three spots on the floor (left baseline, right baseline and middle), for a total of 6 shots. Hit or miss is recorded for each shot. I will use the variable names given in the first line of the data file.
The Ozone Data: Ozone is great when it's in the upper atmosphere protecting us from the sun's radiation. At ground level, it's air pollution. There is usually a lot of air pollution in New York City. Readings of the following variables were taken from May 1, 1973 to September 30, 1973.
- Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island
- SolarRad: Solar radiation in Langleys in the frequency band 4000-7700 Angstroms from 0800 to 1200 hours at Central Park
- Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport
- Temp: Maximum daily temperature in degrees Fahrenheit at LaGuardia Airport

Further Comments

Ordinarily, I'd say bring a calculator. It might be better to have R ready to go; to me, R is the ultimate calculator.
Please remember to answer exactly the question that is asked. If you give the answer to a related homework problem, I will conclude that you do not know what's going on, and mark accordingly.
PDF is strongly preferred, but I will take anything legible.

This document is licensed under a Creative Commons Attribution-ShareAlike 3.0 (or later) Unported License. The basketball data are protected by the Creative Commons license too.