\documentclass[12pt]{article} %\usepackage{amsbsy} % for \boldsymbol and \pmb \usepackage{graphicx} % To include pdf files! \usepackage{amsmath} \usepackage{amsbsy} \usepackage{amsfonts} \usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links \usepackage{fullpage} %\pagestyle{empty} % No page numbers \begin{document} %\enlargethispage*{1000 pt} \begin{center} {\Large \textbf{STA 2101/442f12 Assignment Nine}}\footnote{Copyright information is at the end of the last page.} \vspace{1 mm} \end{center} \noindent Please bring your R printout for Question~\ref{awards} to the quiz, and also your SAS log file and list file from the Question~\ref{chicks}. It has to be the log file, not just a listing of the SAS program. The log file and the list file \emph{must be from the same run of SAS.} Question~\ref{workout} and the non-computer parts of the other questions are just practice for the quiz, and are not to be handed in. Any necessary formulas will be provided. % Poisson regression with R % Multiple comparisons with SAS % Interactions taken from 2011A7 \begin{enumerate} \item \label{workout} In a study comparing the effectiveness of different exercise programmes, volunteers were randomly assigned to one of three exercise programmes ($A$, $B$, $C$) or put on a waiting list and told to work out on their own. Aerobic capacity is the body's ability to process oxygen. Aerobic capacity was measured before and after 6 months of participation in the program (or 6 months of being on the waiting list). The response variable was improvement in aerobic capacity. The explanatory variables were age (a covariate) and treatment group. \begin{enumerate} \item First consider a regression model with an intercept, and no interaction between age and treatment group. \begin{enumerate} \item Make a table showing how you would set up indicator dummy variables for treatment group. Make Waiting List the reference category \item Write the regression model. Please use $x$ for age, and make its regression coefficient $\beta_1$. \item In terms of $\beta$ values, what null hypothesis would you test to find out whether, allowing for age, the three exercise programmes differ in their effectiveness? \item Write the null hypothesis for the preceding question as $H_0: \mathbf{L}\boldsymbol{\beta}=\mathbf{0}$. Just give the $\mathbf{L}$ matrix. \item In terms of $\beta$ values, what null hypothesis would you test to find out whether Programme $B$ was better than the waiting list? \item In terms of $\beta$ values, what null hypothesis would you test to find out whether Programmes $A$ and $B$ differ in their effectiveness? \item Suppose you wanted to estimate the difference in average benefit between programmes $A$ and $C$ for a 27 year old participant. Give your answer in terms of $\widehat{\beta}$ values. \item Is it safe to assume that age is independent of the other explanatory variables? Answer Yes or No and briefly explain. \end{enumerate} \newpage \item Now consider a regression model with an intercept and the interaction (actually a set of interactions) between age and treatment. \begin{enumerate} \item Write the regression model. Make it an extension of your earlier model. \item Suppose you wanted to know whether the slopes of the 4 regression lines were equal. In terms of $\beta$ values, what null hypothesis would you test? \item Suppose you wanted to know whether any differences among mean improvement in the four treatment conditions depends on the participant's age. In terms of $\beta$ values, what null hypothesis would you test? \item Write the null hypothesis for the preceding question as $H_0: \mathbf{L}\boldsymbol{\beta}=\mathbf{0}$. Just give the $\mathbf{L}$ matrix. It is $r \times p$. What is $r$? What is $p$? \item Suppose you wanted to know whether the difference in effectiveness between Programme $A$ and the Waiting List depends on the participant's age. In terms of $\beta$ values, what null hypothesis would you test? \item Suppose you wanted to estimate the difference in average benefit between programmes $A$ and $C$ for a 27 year old participant. Give your answer in terms of $\widehat{\beta}$ values. \end{enumerate} \item Now consider a regression model \emph{without} an intercept, but \emph{with} possibly unequal slopes. Make a table to show how the dummy variables could be set up, and write the regression model. Again, please use $x$ for age and make its regression coefficient $\beta_1$. For each treatment condition, what is the conditional expected value of $Y$? The answer is in terms of $x$ and the $\beta$ values. Please put these values as the last column of your table. \begin{enumerate} \item Suppose you wanted to know whether the slopes of the 4 regression lines were equal. In terms of $\beta$ values, what null hypothesis would you test? \item Suppose you wanted to know whether any differences among mean improvement in the four treatment conditions depends on the participant's age. In terms of $\beta$ values, what null hypothesis would you test? \item Write the null hypothesis for the preceding question as $H_0: \mathbf{L}\boldsymbol{\beta}=\mathbf{0}$. Just give the $\mathbf{L}$ matrix. It is $r \times p$. What is $r$? What is $p$? \item Suppose you wanted to know whether the difference in effectiveness between Programme $A$ and the Waiting List depends on the participant's age. In terms of $\beta$ values, what null hypothesis would you test? \item Suppose you wanted to estimate the difference in average benefit between programmes $A$ and $C$ for a 27 year old participant. Give your answer in terms of $\widehat{\beta}$ values. \end{enumerate} \end{enumerate} \newpage \item \label{awards} Awards received by students at a particular high school are thought to occur according to a Poisson process. That is, the numbers of awards received by students in one year are independent Poisson random variables, with mean $\lambda$ that may depend on characteristics of the student. From the \href{http://www.utstat.toronto.edu/~brunner/appliedf12/data/}{Data Sets} link on the course home page, you can find the Awards data. The variables are Sutudent identification code, Number of awards, Program (1=General, 2=Academic, 3=Vocational), and Score on a test of general academic knowledge. If you use \texttt{labels = c("General", "Academic", "Vocational")} in your \texttt{factor} statement, you will get nicer output. \begin{enumerate} \item Using \texttt{table}, make frequency table of number of awards. Does it look roughly normal? \item Consider a Poisson regression model, without actually fitting it yet. \begin{enumerate} \item What is the linear predictor? There should be no product terms (yet). \item What is the link function? \item Make a table with 3 rows, one for each academic program. Make columns showing how R will define the dummy variables for the variable academic program. If you're not sure, you can check your answer with R. \item Add another column to your table, showing the expected number of awards given score on the math test, for each academic program. \item The expected number of awards for a student in the Vocational program is \underline{\hspace{15mm}} times as great as the expected number of awards for a student in the General program with the same score on the general knowledge test. % e^beta2 \item The expected number of awards for a student in the Academic program is \underline{\hspace{15mm}} times as great as the expected number of awards for a student in the General program with the same score on the general knowledge test. % e^beta3 \item The expected number of awards for a student in the Academic program is \underline{\hspace{15mm}} times as great as the expected number of awards for a student in the Vocational program with the same score on the general knowledge test. % e^(beta2-beta3) %\newpage \item Explain why this model could be called a ``proportional means" model. \item Suppose we wanted to test the proportional means assumption (and it is an assumption). \begin{enumerate} \item Write a linear model for the log of the mean for the full model you would use. \item State the null hypothesis. It is a statement about the $\beta$ values in the full model. \item What is the reduced model? \item What are the degrees of freedom of this test? \end{enumerate} \end{enumerate} \item Now fit the proportional means Poisson regression model to the awards data. For each question below, state the null hypothesis, give the value of the test statistic ($Z$ or $\chi^2$), the $p$-value, and be able to state the conclusion in plain language. Give a \emph{directional} conclusion if possible, even though the test is non-directional. \begin{enumerate} \item Controlling for academic program, is score on the test of general knowledge related to the expected number of awards? \item Controlling for score on the test of general knowledge, do students in the Academic program get more awards on average than students in the General program? \item Controlling for score on the test of general knowledge, do students in the Vocational program get more awards on average than students in the General program? \item Do any of the explanatory variables matter? You could do this with a calculator from the default output if necessary, but do it with R and get the $p$-value. \item Controlling for score on the test of general knowledge, do students in the Vocational program get the same number of awards on average as students in the Academic program? I can't get this from the default output. \item The expected number of awards for a student in the Vocational program is estimated to be \underline{\hspace{15mm}} times as great as the expected number of awards for a student in the General program with the same score on the general knowledge test. % e^beta2 \item The expected number of awards for a student in the Academic program is estimated to be \underline{\hspace{15mm}} times as great as the expected number of awards for a student in the General program with the same score on the general knowledge test. % e^beta3 \item The expected number of awards for a student in the Academic program is estimated to be \underline{\hspace{15mm}} times as great as the expected number of awards for a student in the Vocational program with the same score on the general knowledge test. % e^(beta2-beta3) \end{enumerate} \end{enumerate} % As an experiment \item \label{chicks} The Chick Weights Data was originally an R dataset called \texttt{chickwts}. In this question, we will analyze it with SAS. There is a link to the data file from the \href{http://www.utstat.toronto.edu/~brunner/appliedf12/data/}{Data Sets} link on the course home page. In this study, newly hatched chickens were randomly assigned to one of six different feed supplements, and their weight in grams after 6 weeks was recorded. I did not show you how to do this in SAS, but you can read character-valued variables by ptting a \$ sign after the variable name in your input statement. \begin{enumerate} \item Make sure a table of means, standard deviations and sample sizes for the 6 feed types is part of your output. \item Test whether the six mean weights are different. Get the $F$ statistic, degrees of freedom, $p$-value and proportion of explained variation. \item You want to know which means are different from which other means. Carry out the multiple comparison procedure likely to be the most powerful in this situation. Base your conclusions on the usual $\alpha=0.05$ \emph{joint} significance level for the family of tests. Of course when you state your conclusions in plain language, you would not mention the significance level or joint significance level. But to be honest, stating the conclusions in plain language isn't easy. The pattern is complicated. \item Test for differences among mean weights for the five feed types \emph{excluding} horsebean. \begin{enumerate} \item First, write the null hypothesis in terms of $\mu$ values. \item Now obtain the $F$ statistic, degrees of freedom and $p$-value. Do you reject $H_0$ at $\alpha=0.05$? \end{enumerate} \item Obtain a 95\% confidence interval for the difference between the expected weight for chicks fed horsebean, versus the average of the other expected values. Your answer is a pair of numbers. \item Is the test of that last contrast different from zero as a Scheff\'e follow-up to the over-all one-factor analysis of variance? \item State the conclusion from that Scheff\'e in plain, non-statistical language. \item Would you advise a chicken farmer to purchase the Horsebean feed supplement if she wanted big fat chickens? \end{enumerate} \end{enumerate} \vspace{70mm} %\newpage \noindent \begin{center}\begin{tabular}{l} \hspace{6.5in} \\ \hline \end{tabular}\end{center} This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistics, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US} {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/appliedf12} {\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/appliedf12}} \end{document} # Awards Work # Took data from http://www.ats.ucla.edu/stat/r/dae/poissonreg.htm # Changed the example a little: score = math+20 rm(list=ls()) prizes = read.table("http://www.utstat.toronto.edu/~brunner/appliedf12/data/awards.data") attach(prizes); head(prizes) program = factor(prog, labels = c("General", "Academic", "Vocational")) model1 = glm(awards ~ score + program, family=poisson); summary(model1) anova(model1,test='Chisq') # Switch dummy var coding to test Academic vs Vocational kon2 = contr.treatment(3,base=2) # Academic is ref plabs = c("General", "Academic", "Vocational") colnames(kon2) <- plabs[c(1,3)] # Giving it col names works! program2 = factor(prog); contrasts(program2) = kon2 model1b = glm(awards ~ score + program2, family=poisson); summary(model1b) model2 = glm(awards ~ score + program + score:program, family=poisson); summary(model2) anova(model1,model2,test='Chisq') # Chick weights with R, but I need to do it in SAS rm(list=ls()) attach(chickwts) # Means and SDs msd <- function(fac,variab,sorted=F) { # Table of means, Ns and standard deviations if(prod(!is.na(fac))* prod(!is.na(variab))==0) {stop("Function msd does not work when there are NAs.")} if(!is.factor(fac)) Stop("First argument must be a factor.") name = names(table(fac)); k = length(name) msd = numeric(3*k); dim(msd) = c(k,3) rownames(msd) = name colnames(msd) = c("Mean","St.Dev.","N") for(i in 1:k) { msd[i,1] = mean(variab[fac==name[i]],na.rm=T) msd[i,2] = sqrt( var(variab[fac==name[i]],na.rm=T)) msd[i,3] = length(variab[fac==name[i]]) } if(sorted) msd = msd[order(msd[,1]),] # Rows sorted by mean msd # Return the value } # End of function msd msd(feed,weight,sorted=T) # Boxplots plot(feed,weight) # Test equality of 6 means aov1 = aov(weight~feed); summary(aov1) lm1 = lm(weight~feed); summary(lm1) # Residuals n = length(weight); n # 71 critval = qt(1-(0.05/n)/2,n-6-1); critval sort(rstudent(aov1)) # 3.560597 No problem # Tukey TukeyHSD(aov1,ordered=T) msd(feed,weight,sorted=T) # glh.test(aov1,c(1,-1,0,0,0,0)) # Package manager select gmodels # That was H0: beta0=beta1. It uses the lm model! # Compare means excluding horsebean L = rbind(c(0,0,1,0,0,0), c(0,0,0,1,0,0), c(0,0,0,0,1,0), c(0,0,0,0,0,1)) # Package manager select gmodels glh.test(aov1,L) # F = 9.3182, yes /********************* 2101f12HW9chickwts.sas ***********************/ options linesize=79 noovp formdlim='_' nodate; title 'STA2101f12 HW9 Check: Chick Weights'; data cluck; infile 'chickweights.data' firstobs=2; /* Skip the header */ input id weight feed $; /*$*/ label weight = 'Weight in grams at 6 weeks'; proc freq; tables feed; proc glm; title2 'One-Factor ANOVA: Just the defaults'; class feed; model weight=feed / clparm; means feed; lsmeans feed / pdiff adjust=tukey; contrast 'AllButHorsebean' feed 1 0 -1 0 0 0, feed 0 0 1 -1 0 0, feed 0 0 0 1 -1 0, feed 0 0 0 0 1 -1; contrast 'HorseVsOthers' feed 1 -5 1 1 1 1; estimate 'HorseVsOthers' feed -1 5 -1 -1 -1 -1 / divisor=5; /* Get Scheffe critical value from proc iml */ proc iml; title2 'Scheffe critical value for all possible contrasts'; numdf = 5; /* Numerator degrees of freedom for initial test */ dendf = 65; /* Denominator degrees of freedom for initial test */ alpha = 0.05; critval = finv(1-alpha,numdf,dendf); scrit = critval * numdf; print "Initial test has" numdf " and " dendf "degrees of freedom." "----------------------------------------------------------" "Using significance level alpha = " alpha "------------------------------------------------" "Critical value for the initial test is " critval "------------------------------------------------" "Critical value for Scheffe tests is " scrit "------------------------------------------------";