\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
%\usepackage{fullpage}
%\pagestyle{empty} % No page numbers
\oddsidemargin=0in % Good for US Letter paper
\evensidemargin=0in
\textwidth=6.5in
\topmargin=-0.8in
\headheight=0in
\headsep=0.5in
\textheight=9.4in

\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 2101/442 Assignment 9}}\footnote{This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistics, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US} {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/appliedf17} {\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/appliedf17}}}
\vspace{1 mm}
\end{center}

\noindent
Questions \ref{awards}, \ref{heart} and \ref{sales} use R. Please bring your printouts to the quiz on Friday November 17th. The non-computer questions on this assignment are practice for the quiz, and are not to be handed in. Please do the problems using the formula sheet as necessary. A copy of the formula sheet will be distributed with the quiz if necessary. As usual, you may use anything on the formula sheet unless you are directly asked to prove it.

\begin{enumerate}

%%%%%%%%%% Poisson regression %%%%%%%%%%

\item \label{awards} Awards received by students at a particular high school are thought to occur according to a Poisson process.
That is, the numbers of awards received by students in one year are independent Poisson random variables, with a mean $\lambda_i$ that may depend on characteristics of student $i$. We will adopt a Poisson regression model with a linear model for the natural log of $\lambda_i$. Data are given in the file \\
\href{http://www.utstat.toronto.edu/~brunner/data/legal/awards.data.txt}
{\texttt{http://www.utstat.toronto.edu/$\sim$brunner/data/legal/awards.data.txt}}.
The variables are Student identification code, Number of awards, Program (1=General, 2=Academic, 3=Vocational), and Score on a test of general academic knowledge. If you use \texttt{labels = c("General", "Academic", "Vocational")} in your \texttt{factor} statement, you will get nicer output.
\begin{enumerate}
\item Using \texttt{table}, make a frequency table of number of awards. Does it look roughly normal?
\item Consider a Poisson regression model, without actually fitting it yet. Your model has no product terms, for now.
\begin{enumerate}
\item Make a table with 3 rows, one for each academic program. Make columns showing how R will define the dummy variables for the variable academic program. If you're not sure, you can check your answer with \texttt{contrasts}.
\item Add another column to your table, showing the expected number of awards given score on the academic knowledge test, for each academic program.
\item The expected number of awards for a student in the Vocational program is \underline{\hspace{15mm}} times as great as the expected number of awards for a student in the General program with the same score on the general knowledge test. Give your answer in terms of model parameters ($\beta$ quantities). % e^beta2
\item The expected number of awards for a student in the Academic program is \underline{\hspace{15mm}} times as great as the expected number of awards for a student in the General program with the same score on the general knowledge test.
Give your answer in terms of model parameters ($\beta$ quantities).% e^beta3
\item The expected number of awards for a student in the Academic program is \underline{\hspace{15mm}} times as great as the expected number of awards for a student in the Vocational program with the same score on the general knowledge test. Give your answer in terms of model parameters ($\beta$ quantities). % e^(beta2-beta3)
%\newpage
\item This model could be called a ``proportional means'' model, because for each fixed $x$, the expected values of $y$ for any two categories are in the same proportion. For example, the expected number of awards for a student in the academic program might be twice the expected number of awards of a student in the vocational program with the same general knowledge score. Assuming the model is correct, if you plotted the three curves relating academic knowledge score to expected number of awards, would the curves be parallel?
\item Suppose we wanted to test the proportional means assumption (and it is an assumption).
\begin{enumerate}
\item Write a linear model for the log of the mean for the full model you would use.
\item State the null hypothesis. It is a statement about the $\beta$ values in the full model.
\item What is the reduced model?
\item What are the degrees of freedom of this test?
\end{enumerate}
\end{enumerate}
\item Now fit the proportional means Poisson regression model to the awards data. Some of the questions below ask for estimation, while others ask for hypothesis tests. For the estimation questions, give numbers. For the hypothesis test questions, state the null hypothesis, give the value of the test statistic ($Z$ or $\chi^2$), the $p$-value, and be able to state the conclusion in plain language. Give a \emph{directional} conclusion if possible, even though the test is non-directional.
\begin{enumerate}
\item Controlling for academic program, is score on the test of general knowledge related to the expected number of awards?
\item Controlling for score on the test of general knowledge, do students in the Academic program get more awards on average than students in the General program?
\item Controlling for score on the test of general knowledge, do students in the Vocational program get more awards on average than students in the General program?
\item Do any of the explanatory variables matter? You could do this with a calculator from the default output if necessary, but do it with R and get the $p$-value.
\item Controlling for score on the test of general knowledge, do students in the Vocational program get the same number of awards on average as students in the Academic program? I can't get this from the \texttt{summary} output.
\item The expected number of awards for a student in the Vocational program is estimated to be \underline{\hspace{15mm}} times as great as the expected number of awards for a student in the General program with the same score on the general knowledge test. % e^beta2
\item The expected number of awards for a student in the Academic program is estimated to be \underline{\hspace{15mm}} times as great as the expected number of awards for a student in the General program with the same score on the general knowledge test. % e^beta3
\item The expected number of awards for a student in the Academic program is estimated to be \underline{\hspace{15mm}} times as great as the expected number of awards for a student in the Vocational program with the same score on the general knowledge test. % e^(beta2-beta3)
\item Give an estimate and an approximate (large-sample) 95\% confidence interval for the expected number of awards won by students in the Academic programme with a score of 80 on the knowledge test. Please do \emph{not} use the \texttt{predict} function to get the standard error, though you can use it to check your work. Your answer is a set of three numbers.
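
As a hedged reminder of one standard large-sample approach (the symbols $\mathbf{x}$ and $\widehat{\mathbf{V}}$ are notation assumed here, not defined elsewhere in this assignment): let $\mathbf{x}$ be the vector of explanatory variable values for such a student, including a 1 for the intercept, and let $\widehat{\mathbf{V}}$ be the estimated asymptotic covariance matrix of $\widehat{\boldsymbol{\beta}}$. One way to proceed is to build the interval on the log scale and then exponentiate the endpoints:
\begin{displaymath}
\widehat{\lambda} = e^{\mathbf{x}^\top \widehat{\boldsymbol{\beta}}}, \qquad
\exp\left\{ \mathbf{x}^\top \widehat{\boldsymbol{\beta}} \pm 1.96 \sqrt{\mathbf{x}^\top \widehat{\mathbf{V}} \mathbf{x}} \right\}.
\end{displaymath}
Applying the delta method directly to $\widehat{\lambda}$ is another reasonable choice, and it gives a slightly different interval.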
\end{enumerate}
\end{enumerate}

% \newpage
%%%%%%%%%% Multinomial Logit %%%%%%%%%%

\item In the \emph{Heart attack data} (which you will analyze later), a sample of middle-aged men who had heart attacks were classified into three groups. Either they died of the first heart attack, or they died during the next 10 years, or they were still alive 10 years after the first attack. This is the response variable. Potential explanatory variables include age, blood pressure, and family history of heart disease (Yes-No). Let's just consider these for now. For interpretability, make the probability of being alive 10 years later the denominator in each generalized logit.
\begin{enumerate}
\item Write the multinomial logit model for these data. How many generalized logits do you have? Of course you must have a regression equation for each one.
\item Solve for the probabilities in terms of the $\beta$ values in your model. Show your work.
\item Make a table with two rows, one for Family history = Yes, and one for Family history = No. In each row, write \emph{two} probability ratios. Let's call them ``relative risks.'' (The relative risk of dying in a particular way is the probability of dying that way divided by the probability of living.)
\item Controlling for age and blood pressure, the relative risk of dying in the first heart attack is \underline{\hspace{15mm}} times as great for those with a family history of coronary heart disease.
\item Controlling for age and blood pressure, the relative risk of dying in the next 10 years after the first heart attack is \underline{\hspace{15mm}} times as great for those with a family history of coronary heart disease.
\end{enumerate}

\pagebreak
% This one is more or less a dud.
\item \label{heart} The file
\href{http://www.utstat.toronto.edu/~brunner/data/illegal/attack.data.txt}
{\texttt{http://www.utstat.toronto.edu/$\sim$brunner/data/illegal/attack.data.txt}}
contains the \emph{Heart attack data}, in which a sample of middle-aged men who had heart attacks were classified into three groups. Either they died of the first heart attack, or they died during the next 10 years, or they were still alive 10 years after the first attack. This is the response variable. Please make the probability of being alive 10 years later the denominator in your generalized logits. % This will happen by default.
The variables are
\begin{itemize}
\item AGE AT ENTRY TO STUDY
\item AVERAGE DIASTOLIC BLOOD PRESSURE
\item SERUM CHOLESTEROL
\item NUMBER OF CIGARETTES PER DAY (Self report)
\item HEIGHT IN INCHES
\item WEIGHT IN POUNDS
\item FAMILY HISTORY OF CORONARY HEART DISEASE
\item EDUCATION %: Grade school, High school, or Post-secondary (College or university)
\item OUTCOME
\end{itemize}
Instead of height and weight, let's use \href{http://en.wikipedia.org/wiki/Body_Mass_Index} {Body Mass Index} (BMI), defined as
\begin{displaymath}
\mbox{BMI} = 703 \times \frac{\mbox{weight}}{\mbox{height}^2}.
\end{displaymath}
A BMI under 18.5 suggests that the person is underweight, while a value over 25 may indicate that the person is overweight. The first full model (the biggest one) will include all available explanatory variables, except that height and weight will be replaced by BMI.
\begin{enumerate}
\item Fit the model, meaning estimate the parameters.
\begin{enumerate}
\item Test whether \emph{any} of the explanatory variables are useful in predicting the response variable. This is one big test. Give the value of the test statistic, the degrees of freedom, and the $p$-value. The test statistic and $p$-value are on your printout, but the degrees of freedom are not. In plain language, what do you conclude?
\item We should probably just give up, but let's proceed anyway for practice. If there is any hope, it looks like a model with just age, cholesterol level, and family history of heart disease.
So carry out a simultaneous test of all the other explanatory variables. What is your full model? What is your reduced model? Give the value of the test statistic, the degrees of freedom, and the $p$-value. In plain language, what do you conclude?
\end{enumerate}
\item Based on the results of the last test, I am willing to consider the model with just age, cholesterol level, and family history of heart disease. For that model, is it possible to reject the null hypothesis that the regression coefficients for all the explanatory variables equal zero? What is your full model? What is your reduced model? Give the value of the test statistic, the degrees of freedom, and the $p$-value. In plain language, what do you conclude?
\item Now for this model with three explanatory variables, test each of the explanatory variables controlling for the other two. That's three tests. For each one, what is your reduced model? Give the value of the test statistic, the degrees of freedom, and the $p$-value. In plain language, what do you conclude?
% Could estimate a probability here, but I'm tired.
\item Overall, what is your assessment of this analysis?
\end{enumerate}

%%%%%%%%%% Normal regression %%%%%%%%%%

\item \label{workout} In a study comparing the effectiveness of different exercise programmes, volunteers were randomly assigned to one of three exercise programmes ($A$, $B$, $C$) or put on a waiting list and told to work out on their own. Aerobic capacity is the body's ability to process oxygen. Aerobic capacity was measured before and after 6 months of participation in the program (or 6 months of being on the waiting list). The response variable was improvement in aerobic capacity. The explanatory variables were age (a covariate) and treatment group. \emph{Treatment group includes the waiting list control condition}.
\begin{enumerate}
\item First consider a regression model with an intercept, and no interaction between age and treatment group.
\begin{enumerate}
\item Make a table showing how you would set up indicator dummy variables for treatment group. Make Waiting List the reference category.
\item Write the regression model. Please use $x$ for age, and make its regression coefficient $\beta_1$.
\item In terms of $\beta$ values, what null hypothesis would you test to find out whether, allowing for age, the three exercise programmes differ in their effectiveness?
\item Write the null hypothesis for the preceding question as $H_0: \mathbf{L}\boldsymbol{\beta}=\mathbf{0}$. Just give the $\mathbf{L}$ matrix.
\item In terms of $\beta$ values, what null hypothesis would you test to find out whether Programme $B$ was better than the waiting list?
\item In terms of $\beta$ values, what null hypothesis would you test to find out whether Programmes $A$ and $B$ differ in their effectiveness?
\item Suppose you wanted to estimate the difference in average benefit between programmes $A$ and $C$ for a 27 year old participant. Give your answer in terms of $\widehat{\beta}$ values.
\item Is it safe to assume that age is independent of the other explanatory variables? Answer Yes or No and briefly explain.
\end{enumerate}
% \newpage
\item \label{interac} Now consider a regression model with an intercept and the interaction (actually a set of interactions) between age and treatment.
\begin{enumerate}
\item Write the regression model. Make it an extension of your earlier model.
\item Suppose you wanted to know whether the slopes of the 4 regression lines were equal. In terms of $\beta$ values, what null hypothesis would you test?
\item Suppose you wanted to know whether any differences among mean improvement in the four treatment conditions depend on the participant's age. In terms of $\beta$ values, what null hypothesis would you test?
\item Write the null hypothesis for the preceding question as $H_0: \mathbf{L}\boldsymbol{\beta}=\mathbf{0}$. Just give the $\mathbf{L}$ matrix. It is $r \times p$. What is $r$?
What is $p$? \item Suppose you wanted to know whether the difference in effectiveness between Programme $A$ and the Waiting List depends on the participant's age. In terms of $\beta$ values, what null hypothesis would you test? \item Suppose you wanted to \emph{estimate} the difference in average benefit between programmes $A$ and $C$ for a 27 year old participant. Give your answer in terms of $\widehat{\beta}$ values. \end{enumerate} \end{enumerate} \item \label{sales} Telephone sales representatives use computer software to help them locate potential customers, answer questions, take credit card information and place orders. Twelve sales representatives were randomly assigned to each of three new software packages the company was thinking of purchasing. The data for each sales representative include the software package (1, 2 or 3), sales last quarter with the old software, and sales this quarter with one of the new software packages. Sales are in number of units sold. The data are in the file \\ \href{http://www.utstat.toronto.edu/~brunner/data/legal/sales.data.txt} {\texttt{http://www.utstat.toronto.edu/$\sim$brunner/data/legal/sales.data.txt}}. \\ The explanatory and response variables are what you would think. \begin{enumerate} \item Fit a full model in which the slopes and intercepts of the regression lines relating sales last quarter to sales this quarter might depend on the kind of software the sales representatives are using. \item Carry out an ordinary $F$-test to determine whether the effect of software type on sales depends on the representative's performance last quarter. Be able to state your conclusion in plain, non-statistical language. \item Estimate the slopes of the three regression lines. Base the estimates on numbers from your printout. I don't see how you can do this without making a table. \item Carry out tests to answer these questions. If they are already on the output of \texttt{summary}, use that. 
\begin{enumerate}
\item Are the slopes for Software 1 and 2 different?
\item Are the slopes for Software 1 and 3 different?
\item Are the slopes for Software 2 and 3 different?
\end{enumerate}
Protecting the three tests with a Bonferroni correction at the joint 0.05 significance level, what do you conclude? Plain language is not necessary, but you should say what happened.
\item \label{diffatmean} The average (sample mean) performance last quarter was 76.56 (please use exactly this number). We are interested in whether the three software packages differ in their effectiveness for sales representatives with average performance last quarter.
\begin{enumerate}
\item Estimate expected performance this quarter for sales representatives with average performance last quarter. These three numbers should appear on your printout.
\item State the null hypothesis in symbols.
\item Carry out the $F$-test. % p = 0.5488
\item In plain language, what do you conclude?
\end{enumerate}
\end{enumerate} % ?

\end{enumerate}

\end{document}