\documentclass[12pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers
\begin{document}
%\enlargethispage*{1000 pt}

\begin{center}
{\Large \textbf{STA 312f12 Assignment Seven}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent
Please bring your R printout for the last question to the quiz. The non-computer questions are practice for the quiz on Friday Nov. 2nd, and are not to be handed in.

\begin{enumerate}

\item The Wisconsin Power and Light Company studied the effectiveness of two devices for improving the efficiency of gas home-heating systems. The electric vent damper (EVD) reduces heat loss through the chimney when the furnace is in the off cycle by closing off the vent. It is controlled electrically. The thermally activated vent damper (TVD) is the same as the EVD except it is controlled by the thermal properties of a set of bimetal fins set in the vent. Ninety test houses were randomly assigned to have a free vent damper installed; 40 received EVDs and 50 received TVDs. For each house, energy consumption was measured for a period of several weeks with the vent damper active (``vent damper in'') and for an equal period with the vent damper not active (``vent damper out'').
Here are the variables:
\begin{itemize}
\item[] House Identification Number
\item[] Type of furnace (1=Forced air 2=Gravity 3=Forced water 4=Steam)
\item[] Chimney area
\item[] Chimney shape (1=Round 2=Square 3=Rectangular)
\item[] Chimney height in feet
\item[] Type of chimney liner (0=Unlined 1=Tile 2=Metal)
\item[] Type of house (1=Ranch 2=Two-story 3=Tri-level 4=Bi-level 5=One and a half stories)
\item[] House age in years
\item[] Type of damper (1=EVD 0=TVD)
\item[] Energy consumption with damper active (in)
\item[] Energy consumption with damper inactive (out)
\end{itemize}
Consider a model in which the response variable ($Y$) is the average of energy consumption with the vent damper in and with the vent damper out, and the explanatory variables are age of house ($X_1$), chimney area ($X_2$) and furnace type (4 categories).
\begin{enumerate}
\item You want to test whether, controlling for age of house and chimney area, average energy consumption depends on furnace type.
    \begin{enumerate}
    \item Give the null hypothesis in terms of the $\beta$s.
    \item Give $E[Y|\mathbf{X}]$ for the reduced model.
    \end{enumerate}
\newpage
\item You want to test whether, controlling for age of house and chimney area, average energy consumption is different for Forced air furnaces and Gravity furnaces.
    \begin{enumerate}
    \item Give the null hypothesis in terms of the $\beta$s.
    \item Give $E[Y|\mathbf{X}]$ for the reduced model.
    \end{enumerate}
\item You want to test whether, controlling for age of house and chimney area, average energy consumption is different for Forced air and Forced water furnaces.
    \begin{enumerate}
    \item Give the null hypothesis in terms of the $\beta$s.
    \item Give $E[Y|\mathbf{X}]$ for the reduced model.
    \end{enumerate}
\item You want to test whether, controlling for age of house and chimney area, average energy consumption for Steam furnaces is different from the average of Forced air and Forced water furnaces. (You are comparing an expected value with the mean of two expected values.)
    \begin{enumerate}
    \item Give the null hypothesis in terms of the $\beta$s.
    \item Give $E[Y|\mathbf{X}]$ for the reduced model.
    \end{enumerate}
\end{enumerate}

\item High School History classes from across Ontario are randomly assigned to either a discovery-oriented or a memory-oriented curriculum in Canadian history. At the end of the year, the students are given a standardized test and the median score of each class is recorded. Please consider a regression model with these variables:
\begin{itemize}
\item[$X_1$] Equals 1 if the class uses the discovery-oriented curriculum, and equals 0 if the class uses the memory-oriented curriculum.
\item[$X_2$] Average parents' education for the classroom
\item[$X_3$] Average parents' income for the classroom
\item[$X_4$] Number of university History courses taken by the teacher
\item[$X_5$] Teacher's final cumulative university grade point average
\item[$Y~$] Class median score on the standardized history test.
\end{itemize}
The full regression model has $E[Y|\mathbf{X}] = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \beta_5x_5$. Give $E[Y|\mathbf{X}]$ for the reduced model you would use to answer each of the following questions. Don't re-number the variables. Also, for each question please give the null hypothesis in terms of $\beta$ values.
\begin{enumerate}
\item If you control for parents' education and income and for teacher's university background, does curriculum type affect test scores? (And why is it okay to use the word ``affect''?)
\item Controlling for parents' education and income and for curriculum type, is teacher's university background (two variables) related to their students' test performance?
\item Controlling for teacher's university background and for curriculum type, are parents' education and income (considered simultaneously) related to students' test performance?
\item Controlling for curriculum type, teacher's university background and parents' education, is parents' income related to students' test performance?
\end{enumerate}

\item If two events have equal probability, the odds ratio equals \underline{~~~~~~}.

\item For a multiple logistic regression model, if the value of the $k$th independent variable is increased by $c$ units and everything else remains the same, the odds of $Y=1$ are \underline{~~~~~~} times as great. Prove your answer.

\item For a multiple logistic regression model, let $P(Y_i=1| x_{i,1}, \ldots, x_{i,p-1}) = \pi(\mathbf{x}_i)$. Show that a linear model for the log odds is equivalent to
\begin{displaymath}
\pi(\mathbf{x}_i) = \frac{e^{\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_{p-1} x_{i,p-1}}}
                         {1+e^{\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_{p-1} x_{i,p-1}}}
                  = \frac{e^{\mathbf{x}_i^\prime\boldsymbol{\beta}}}
                         {1+e^{\mathbf{x}_i^\prime\boldsymbol{\beta}}}
\end{displaymath}

\item Write the log likelihood for the last question, and simplify it as much as possible.

\item A logistic regression model with no independent variables has just one parameter, $\beta_0$. It also has the same probability $\pi = P(Y=1)$ for each case.
\begin{enumerate}
\item Write $\pi$ as a function of $\beta_0$; show your work.
\item The \emph{invariance principle} of maximum likelihood estimation says the MLE of a function of the parameter is that function of the MLE. It is very handy. Now, still considering a logistic regression model with no independent variables,
    \begin{enumerate}
    \item Suppose $\overline{y}$ (the sample proportion of $Y=1$ cases) is 0.57. What is $\widehat{\beta}_0$? Your answer is a number.
    % 0.2818512
    \item Suppose $\widehat{\beta}_0=-0.79$. What is $\overline{y}$? Your answer is a number.
    % 0.3121687
    \end{enumerate}
\end{enumerate}

\newpage
\item Consider a logistic regression in which the cases are newly married couples with both people from the same religion, the independent variable is religion (A, B, C and None -- let's call ``None'' a religion), and the dependent variable is whether the marriage lasted 5 years (1=Yes, 0=No).
\begin{enumerate}
\item Make a table with four rows, showing how you would set up indicator dummy variables for Religion, with None as the reference category.
\item Add a column showing the odds of the marriage lasting 5 years. The \emph{symbols} for your dummy variables should not appear in your answer, because they are zeros and ones, and different for each row. But of course your answer contains $\beta$ values.
\item What is the ratio of the odds of a marriage lasting 5 years or more for Religion C to the odds of lasting 5 years or more for No Religion? Answer in terms of the $\beta$ symbols of your model.
\item What is the ratio of the odds of lasting 5 years or more for Religion A to the odds of lasting 5 years or more for Religion B? Answer in terms of the $\beta$ symbols of your model.
\item You want to test whether Religion is related to whether the marriage lasts 5 years. State the null hypothesis in terms of one or more $\beta$ values.
\item You want to know whether marriages from Religion A are more likely to last 5 years than marriages from Religion C. State the null hypothesis in terms of one or more $\beta$ values.
\item You want to test whether marriages between people of No Religion have a 50-50 chance of lasting 5 years. State the null hypothesis in terms of one or more $\beta$ values.
\end{enumerate}

\item This question uses an R data set called \texttt{birthwt}. In R's Packages and Data menu, select Package Manager, and make sure MASS is checked. Then, \texttt{help(birthwt)} will tell you about the data set. It's often used for logistic regression, but this time we're just going to do ordinary regression.
The response variable will be the child's birth weight and the explanatory variables will be Mother's age, Mother's weight, and Race.
\begin{enumerate}
\item For each of the following questions, be able to give the null hypothesis in symbols. Give the value of the test statistic ($t$ or $F$), the $p$-value, and whether you reject $H_0$ at $\alpha=0.05$. How would you state the conclusion in plain language, with \emph{no} statistical terminology? (You can say ``allowing for'' instead of ``controlling for.'')
    \begin{enumerate}
    \item Controlling for mother's age and weight, do White and Black mothers differ in the mean weight of their babies? If one is more (meaning $H_0$ is rejected), which race has heavier babies on average and how can you tell?
    \item Controlling for mother's age and weight, do White and Other mothers differ in the mean weight of their babies? If one is more (meaning $H_0$ is rejected), which race has heavier babies on average and how can you tell?
    \item Controlling for mother's age and weight, do Black and Other mothers differ in the mean weight of their babies? If one is more (meaning $H_0$ is rejected), which race has heavier babies on average and how can you tell?
    \item Controlling for mother's weight and race, is the mother's age related to her baby's weight?
    \item Controlling for mother's age and race, is the mother's weight related to her baby's weight?
    \item Controlling for mother's age and weight, is the mother's race related to her baby's weight? This is one test.
    \item Controlling for mother's race, are the mother's age and/or weight related to her baby's weight? This is one test.
    \item Are any of the explanatory variables related to baby's birth weight? This is one test.
    \end{enumerate}
\item It's helpful to be able to do a general linear test of $H_0:\mathbf{L}\boldsymbol{\beta}=\mathbf{h}$ directly in R. To do it, you need to download a package. There is more than one possibility, but the \texttt{car} package is okay.
In R's Packages and Data menu, select Package Installer. Click on Get List. With Install Dependencies checked, select \texttt{car} and click Install Selected. I had to quit and restart R, then again in the Package Manager I had to \emph{select} \texttt{car}. Then \texttt{help(linearHypothesis)}.
    \begin{enumerate}
    \item Repeat the test of race controlling for age and weight, just to verify that you can get the same test statistic.
    \item With White as the reference category, repeat the test comparing Black to Other controlling for age and weight.
    \end{enumerate}
\end{enumerate}

\end{enumerate}

\vspace{60mm}
%\newpage

\noindent
\begin{center}\begin{tabular}{l} \hspace{6in} \\ \hline
\end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistics, University of Toronto. It is licensed under a
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/312f12}
{\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/312f12}}

\end{document}

library(MASS); library(car)
attach(birthwt)
race <- factor(race, labels = c("white", "black", "other"))
fullmod = lm(bwt ~ age + lwt + race); summary(fullmod)

# Switch contrasts to test Black vs other
race2 = race
contrasts(race2) = contr.treatment(3,base=3) # Other will be ref
summary(lm(bwt ~ age + lwt + race2))

# Race controlling for age and weight
red1 = lm(bwt ~ age + lwt)
anova(red1,fullmod)

# Age and weight controlling for race
red2 = lm(bwt ~ race)
anova(red2,fullmod)

# Race controlling for age and weight with a general linear test.
# Need car. Compare F = 4.7799
L = rbind(c(0,0,0,1,0),
          c(0,0,0,0,1) )
linearHypothesis(fullmod,L)

# With White as the reference category, repeat the test comparing
# Black to Other controlling for age and weight.
# Compare F = (-1.222)^2 = 1.493284
L = rbind(c(0,0,0,1,-1))
linearHypothesis(fullmod,L)
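
# A minimal sketch, not part of the assignment as posted: a numerical check of
# the two intercept-only logistic regression answers (the invariance principle
# question). It assumes the MLE of pi is ybar, so that by invariance
# beta0-hat = log(ybar/(1-ybar)) and ybar = exp(beta0-hat)/(1+exp(beta0-hat)).
# qlogis() and plogis() are base R's logit and inverse logit functions;
# the variable names ybar and beta0hat are chosen here for illustration.
ybar = 0.57
qlogis(ybar)      # Same as log(ybar/(1-ybar)). Compare 0.2818512
beta0hat = -0.79
plogis(beta0hat)  # Same as exp(beta0hat)/(1+exp(beta0hat)). Compare 0.3121687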