\documentclass[11pt]{article} %\usepackage{amsbsy} % for \boldsymbol and \pmb \usepackage{graphicx} % To include pdf files! \usepackage{amsmath} \usepackage{amsbsy} \usepackage{amsfonts} \usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links \usepackage{fullpage} \topmargin=-.3in \textheight=9.4in %\pagestyle{empty} % No page numbers \begin{document} %\enlargethispage*{1000 pt} \begin{center} {\Large \textbf{STA 2101/442 Assignment Ten}}\footnote{Copyright information is at the end of the last page.} \vspace{1 mm} \end{center} \vspace{3mm} \begin{enumerate} \item \label{awards} Awards received by students at a particular high school are thought to occur according to a Poisson process. That is, the numbers of awards received by students in one year are independent Poisson random variables, with mean $\lambda$ that may depend on characteristics of the student. Data are given in the file \href{http://www.utstat.toronto.edu/~brunner/appliedf14/code_n_data/hw/awards.data}{\texttt{awards.data}}. There is a link from the course home page in case the one in this document does not work. The variables are Sutudent identification code, Number of awards, Program (1=General, 2=Academic, 3=Vocational), and Score on a test of general academic knowledge. If you use \texttt{labels = c("General", "Academic", "Vocational")} in your \texttt{factor} statement, you will get nicer output. \begin{enumerate} \item Using \texttt{table}, make frequency table of number of awards. Does it look roughly normal? \item Consider a Poisson regression model, without actually fitting it yet. Your model has no product terms, for now. \begin{enumerate} \item Make a table with 3 rows, one for each academic program. Make columns showing how R will define the dummy variables for the variable academic program. If you're not sure, you can check your answer with R. \item Add another column to your table, showing the expected number of awards given score on the academic knowledge test, for each academic program. \item The expected number of awards for a student in the Vocational program is \underline{\hspace{15mm}} times as great as the expected number of awards for a student in the General program with the same score on the general knowledge test. Give your answer in terms of model parameters ($\beta$ quantities). % e^beta2 \item The expected number of awards for a student in the Academic program is \underline{\hspace{15mm}} times as great as the expected number of awards for a student in the General program with the same score on the general knowledge test. Give your answer in terms of model parameters ($\beta$ quantities).% e^beta3 \item The expected number of awards for a student in the Academic program is \underline{\hspace{15mm}} times as great as the expected number of awards for a student in the Vocational program with the same score on the general knowledge test. Give your answer in terms of model parameters ($\beta$ quantities). % e^(beta2-beta3) %\newpage \item Explain why this model could be called a ``proportional means" model. \item Suppose we wanted to test the proportional means assumption (and it is an assumption). \begin{enumerate} \item Write a linear model for the log of the mean for the full model you would use. \item State the null hypothesis. It is a statement about the $\beta$ values in the full model. \item What is the reduced model? \item What are the degrees of freedom of this test? \end{enumerate} \end{enumerate} \vspace{20mm} \pagebreak \item Now fit the proportional means Poisson regression model to the awards data. Some of the questions below ask for estimation, while others ask for hypothesis tests. For the estimation questions, give numbers. For the hypothesis test questions, state the null hypothesis, give the value of the test statistic ($Z$ or $\chi^2$), the $p$-value, and be able to state the conclusion in plain language. Give a \emph{directional} conclusion if possible, even though the test is non-directional. \begin{enumerate} \item Controlling for academic program, is score on the test of general knowledge related to the expected number of awards? \item Controlling for score on the test of general knowledge, do students in the Academic program get more awards on average than students in the General program? \item Controlling for score on the test of general knowledge, do students in the Vocational program get more awards on average than students in the General program? \item Do any of the explanatory variables matter? You could do this with a calculator from the default output if necessary, but do it with R and get the $p$-value. \item Controlling for score on the test of general knowledge, do students in the Vocational program get the same number of awards on average as students in the Academic program? I can't get this from the \texttt{summary} output. \item The expected number of awards for a student in the Vocational program is estimated to be \underline{\hspace{15mm}} times as great as the expected number of awards for a student in the General program with the same score on the general knowledge test. % e^beta2 \item The expected number of awards for a student in the Academic program is estimated to be \underline{\hspace{15mm}} times as great as the expected number of awards for a student in the General program with the same score on the general knowledge test. % e^beta3 \item The expected number of awards for a student in the Academic program is estimated to be \underline{\hspace{15mm}} times as great as the expected number of awards for a student in the Vocational program with the same score on the general knowledge test. % e^(beta2-beta3) \item Give an estimate and an approximate (large-sample) 95\% confidence interval for the expected number of awards won by students in the Academic programme with a score of 80 on the knowledge test. Please do \emph{not} use the \texttt{predict} function to get the standard error, though you can use it to check your prediction. Your answer is a set of three numbers. You may need to think about this one a bit. \end{enumerate} \end{enumerate} \pagebreak \item Arsenic is a powerful poison, which is why it has been used on farms for many years to kill insects. Even in very small amounts, arsenic can cause cancer in humans, and recently it has been found that rice and foods made from rice tend to be very high in arsenic. Brown rice is worse, by the way. In a controlled experiment, pots of rice were prepared by either washing the rice first or not, and by cooking the rice in either a low, a medium or a high amount of water. The response variable is amount of arsenic in the cooked rice. \begin{enumerate} \item Use a regression model with \emph{cell means coding}. That's the model with no intercept, and one indicator dummy variable for each treatment combination. You don't have to say how the dummy variables are defined. That will become clear in the next part. Just give the regression equation. \item Write the expected amounts of arsenic in the table below, in terms of the $\beta$ parameters of your model. \begin{center} \begin{tabular}{|l|c|c|c|} \hline & \multicolumn{3}{|c|}{Amount of Water} \\ \hline & Low & Medium & High \\ \hline Washed & ~~~~~~~~~~ & ~~~~~~~~~~ & ~~~~~~~~~~ \\ \hline Unwashed & ~~~~~~~~~~ & ~~~~~~~~~~ & ~~~~~~~~~~ \\ \hline \end{tabular}\end{center} \item If you wanted to test whether the effect of washing the rice depended on how much water you cook it in, what is the null hypothesis? Give your answer in terms of the $\beta$ values in your model. \item If you wanted to test whether washing the rice before cooking has any effect if the rice is cooked in a lot of water, what is the null hypothesis? Give your answer in terms of $\beta$ values. \item Suppose you want to test whether the amount of water used to cook the rice makes any difference if the rice has been washed. What is the null hypothesis? Give your answer in terms of $\beta$ values. \item Averaging across different amounts of water used to cook the rice, does pre-washing affect the amount of arsenic in the rice. What null hypothesis would you test to answer this question? Give your answer in terms of $\beta$ values. \item If you wanted to test whether the effect of the amount of water used to cook the rice depends on whether you wash it first, what is the null hypothesis? Give your answer in terms of $\beta$ values. \end{enumerate} % Specifying that all the sample sizes are equal and asking for a non-centrality parameter makes a nice last part to this question, but it's too time-consuming for the final. \item Consider a two-factor analysis of variance in which each factor has two levels. Use this regression model for the problem: \begin{displaymath} Y_i = \beta_0 + \beta_1 d_{i,1} + \beta_2 d_{i,2} + \beta_3 d_{i,1}d_{i,2} + \epsilon_i, \end{displaymath} where $d_{i,1}$ and $d_{i,2}$ are dummy variables. \begin{enumerate} \item Make a two-by-two table showing the four treatment means in terms of $\beta$ values. Use \emph{effect coding}. In terms of the $\beta$ values, state the null hypothesis you would use to test for \begin{enumerate} \item Main effect of the first factor \item Main effect of the second factor \item Interaction \end{enumerate} \item Make a two-by-two table showing the four treatment means in terms of $\beta$ values. Use \emph{indicator dummy variables} (zeros and ones). In terms of the $\beta$ values, state the null hypothesis you would use to test for \begin{enumerate} \item Main effect of the first factor \item Main effect of the second factor \item Interaction \end{enumerate} \item Which dummy variable scheme do you like more? \end{enumerate} \item In a study of math education in elementary school, equal numbers of boys and girls were randomly assigned to one of three training programmes designed to improve spatial reasoning. After five school days of training, the students were given a standardized test of spatial reasoning. Score on the spatial reasoning test is the response variable. You will define a regression model for this factorial analysis of variance. Don't write the model yet. \begin{enumerate} \item In the table below, show how your dummy variables are defined. \emph{Use effect coding.} That's the scheme with an intercept and minus ones. Write the name of each dummy variable at the head of its column. \begin{center} \begin{tabular}{|l|c|} \hline Girls, Programme 1 & ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \\ \hline Girls, Programme 2 & \\ \hline Girls, Programme 3 & \\ \hline Boys, Programme 1 & \\ \hline Boys, Programme 2 & \\ \hline Boys, Programme 3 & \\ \hline \end{tabular} \end{center} \item Give $E[Y_i|\mathbf{X}_i=\mathbf{x}_i]$ for the full model. Include the interaction terms. Notice you are \emph{not} being asked to write expected values in the table. They are too messy. \item Suppose you want to test whether, averaging across training programmes, there is a difference between girls and boys in their average performance on the spatial reasoning test. State the null hypothesis in terms of the $\beta$ values in your model. \item Suppose you want to test whether, averaging across boys and girls, there is a difference between training programmes in average performance on the spatial reasoning test. State the null hypothesis in terms of the $\beta$ values in your model. \item Suppose you want to test whether the sex difference in average performance depends on which training programme the children are in. State the null hypothesis in terms of the $\beta$ values in your model. \end{enumerate} \end{enumerate} \vspace{30mm} \noindent \begin{center}\begin{tabular}{l} \hspace{6.5in} \\ \hline \end{tabular}\end{center} This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistics, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US} {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/appliedf14} {\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/appliedf14}} \end{document} # Awards work rm(list=ls()) prizes = read.table("awards.data") attach(prizes); head(prizes) table(awards) program = factor(prog, labels = c("General", "Academic", "Vocational")) table(prog,program) model1 = glm(awards ~ score + program, family=poisson); summary(model1) anova(model1,test='Chisq') # Do a Wald test for comparing Academic and vocational. # Now the confidence interval V = vcov(model1); betahat = model1$coefficients #$ x = c(1,80,1,0) logEhat = sum(x*betahat) # Log of estimated expected value Ehat = exp(logEhat); Ehat # Estimated expected value = 1.046947 se = sqrt( t(x) %*% V %*% x); se # Standard error of logEhat = 0.09955895 climits = c(logEhat-1.96*se, logEhat+1.96*se); climits # CI for logEhat # climits = -0.1492569 0.2410141 CI = exp(climits); CI # CI for Ehat: (0.8613478, 1.2725390) # Checking with predict. good = data.frame(score=80,program="Academic") predict(model1,newdata=good,se=T) #'The default is on the scale of the linear predictors; the alternative "response" is on the scale of the response variable.' So this gives logEhat = 0.0458786. se is right too. # Try direct prediction of the expected value. predict(model1,newdata=good,se=T,type="response") # Get estimate of 1.046947, right, but se=0.104233 surely based on delta method c(1.046947-1.96*0.104233, .046947+1.96*0.104233) # 0.8426503 0.2512437 close. # Awards Work from 2012 # Took data from http://www.ats.ucla.edu/stat/r/dae/poissonreg.htm # Changed the example a little: score = math+20 rm(list=ls()) prizes = read.table("http://www.utstat.toronto.edu/~brunner/appliedf12/data/awards.data") attach(prizes); head(prizes) program = factor(prog, labels = c("General", "Academic", "Vocational")) model1 = glm(awards ~ score + program, family=poisson); summary(model1) anova(model1,test='Chisq') # Switch dummy var coding to test Academic vs Vocational kon2 = contr.treatment(3,base=2) # Academic is ref plabs = c("General", "Academic", "Vocational") colnames(kon2) <- plabs[c(1,3)] # Giving it col names works! program2 = factor(prog); contrasts(program2) = kon2 model1b = glm(awards ~ score + program2, family=poisson); summary(model1b) model2 = glm(awards ~ score + program + score:program, family=poisson); summary(model2) anova(model1,model2,test='Chisq')