\documentclass[11pt]{article} %\usepackage{amsbsy} % for \boldsymbol and \pmb \usepackage{graphicx} % To include pdf files! \usepackage{amsmath} \usepackage{amsbsy} \usepackage{amsfonts} \usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links \usepackage{fullpage} \topmargin=-.3in \textheight=9.4in %\pagestyle{empty} % No page numbers \begin{document} %\enlargethispage*{1000 pt} \begin{center} {\Large \textbf{STA 2101/442 Assignment Eleven}}\footnote{Copyright information is at the end of the last page.} \vspace{1 mm} \end{center} \vspace{3mm} \noindent Please bring printouts of your complete SAS log and list files for Question~\ref{chicks}, and your R printout for Question~\ref{birds} to the quiz; PDF output counts as a list file. Note that the log and list files \emph{must be from the same run of SAS}. The non-computer questions are just practice for the quiz, and are not to be handed in. \begin{enumerate} \item \label{chicks} In the Chick Weights study, newly hatched chickens were randomly assigned to one of six different feed supplements, and their weight in grams after 6 weeks was recorded. The Chick Weights Data was originally an R dataset called \texttt{chickwts}. It is in the file \href{http://www.utstat.toronto.edu/~brunner/appliedf13/code_n_data/hw/chickweights.data} {\texttt{chickweights.data}}. There is a link from the course home page in case the one in this document does not work. In this question, you will analyze the data with SAS. \begin{enumerate} \item Make sure a table of means, standard deviations and sample sizes for the 6 feed types is part of your output. \item Test whether the six mean weights are different. Get the $F$ statistic, degrees of freedom, $p$-value and proportion of explained variation. \item You want to know which means are different from which other means. Carry out the multiple comparison procedure likely to be the most powerful in this situation. Base your conclusions on the usual $\alpha=0.05$ \emph{joint} significance level for the family of tests. Of course when you state your conclusions in plain language, you would not mention the significance level or joint significance level. But to be honest, stating the conclusions in plain language isn't easy. The pattern is complicated. \item Test for differences among mean weights for the five feed types \emph{excluding} horsebean. \begin{enumerate} \item First, write the null hypothesis in terms of $\mu$ values. \item Now obtain the $F$ statistic, degrees of freedom and $p$-value. Do you reject $H_0$ at $\alpha=0.05$? \end{enumerate} \item Obtain a 95\% confidence interval for the difference between the expected weight for chicks fed horsebean, versus the average of the other expected values. Your answer is a pair of numbers. \item Would you advise a chicken farmer to purchase the Horsebean feed supplement if she wanted big fat chickens? \end{enumerate} \item If two events have equal probability, the odds ratio equals \underline{~~~~~~}. \item For a multiple logistic regression model, if the value of the kth explanatory variable is increased by c units and everything else remains the same, the odds of Y=1 are \underline{~~~~~~} times as great. Prove your answer. \item For a multiple logistic regression model, let $P(Y_i=1| x_{i,1}, \ldots, x_{i,p-1}) = \pi(\mathbf{x}_i)$. Show that a linear model for the log odds is equivalent to \begin{displaymath} \pi(\mathbf{x}_i) = \frac{e^{\beta_0 + \beta_1 x_1 + \ldots + \beta_{p-1} x_{p-1}}} {1+e^{\beta_0 + \beta_1 x_1 + \ldots + \beta_{p-1} x_{p-1}}} = \frac{e^{\mathbf{x}_i^\prime\boldsymbol{\beta}}} {1+e^{\mathbf{x}_i^\prime\boldsymbol{\beta}}} \end{displaymath} \item Write the log likelihood for a general logistic regression model, and simplify it as much as possible. Of course use the result of the last question. \item \label{delta} In the Logistic regression with R slide show, I reproduced the standard error for an estimated probability of passing the course. How did I do it? Show your work. \item A logistic regression model with no explanatory variables has just one parameter, $\beta_0$. It also the same probability $\pi = P(Y=1)$ for each case. \begin{enumerate} \item Write $\pi$ as a function of $\beta_0$; show your work. \item The \emph{invariance principle} of maximum likelihood estimation says the MLE of a function of the parameter is that function of the MLE. It is very handy. Now, still considering a logistic regression model with no explanatory variables, \begin{enumerate} \item Suppose $\overline{y}$ (the sample proportion of $Y=1$ cases) is 0.57. What is $\widehat{\beta}_0$? Your answer is a number. % 0.2818512 \item Suppose $\widehat{\beta}_0=-0.79$. What is $\overline{y}$? Your answer is a number. % 0.3121687 \end{enumerate} \end{enumerate} \item Consider a logistic regression in which the cases are newly married couples with both people from the same religion, the explanatory variable is religion (A, B, C and None -- let's call ``None" a religion), and the response variable is whether the marriage lasted 5 years (1=Yes, 0=No). \begin{enumerate} \item Make a table with four rows, showing how you would set up indicator dummy variables for Religion, with None as the reference category. \item Add a column showing the odds of the marriage lasting 5 years. The \emph{symbols} for your dummy variables should not appear in your answer, because they are zeros and ones, and different for each row. But of course your answer contains $\beta$ values. \item What is the ratio of the odds of a marriage lasting 5 years or more for Religion C to the odds of lasting 5 years or more for No Religion? Answer in terms of the $\beta$ symbols of your model. \item What is the ratio of the odds of lasting 5 years or more for religion A to the odds of lasting 5 years or more for Religion B? Answer in terms of the $\beta$ symbols of your model. \item You want to test whether Religion is related to whether the marriage lasts 5 years. State the null hypothesis in terms of one or more $\beta$ values. \item You want to know whether marriages from Religion A are more likely to last 5 years than marriages from Religion C. State the null hypothesis in terms of one or more $\beta$ values. \item You want to test whether marriages between people of No Religion have a 50-50 chance of lasting 5 years. State the null hypothesis in terms of one or more $\beta$ values. \end{enumerate} %\newpage \item \label{birds} People who raise large numbers of birds inhale potentially dangerous material, especially tiny fragments of feathers. Can this be a risk factor for lung cancer, controlling for other possible risk factors? The data are available in the file \href{http://www.utstat.toronto.edu/~brunner/appliedf13/code_n_data/hw/birdlung.data} {\texttt{birdlung.data}}. There is a link from the course home page in case the one in this document does not work. In this question, you will analyze the data with R. For a sample of birdkeepers and non-birdkeepers, the data file has whether they got lung cancer (1=Yes, 0=No), Gender (0=M, 1=F), Socioeconomic Status (0=Low, 1=High), Whether they are birdkeepers (1=Yes, 0=No) Age, How many years they have been smoking (including zero), and Cigarettes per day. If you look at \texttt{help(colnames)}, you can see how to add variable names to a data frame. First, make tables of the binary variables using \texttt{table}, Use \texttt{prop.table} to find out the percentages. What proportion of the sample had cancer. Any comments? There is one primary issue in this study: Controlling for all other variables, is birdkeeping significantly related to the chance of getting lung cancer? Perform a likelihood ratio test to answer the question. \begin{enumerate} \item In symbols, what is the null hypothesis? \item What is the value of the likelihood ratio test statistic $G^2$? The answer is a number. \item What are the degrees of freedom for the test? The answer is a number. \item What is the $p$-value? The answer is a number. \item What do you conclude? Presence of a relationship is not enough. Say what happened. \item For a non-smoking, bird-keeping woman of average age and low socioeconomic status, what is the estimated probability of lung cancer? The answer (a single number) should be based on the full model. \item For a non-smoking, non-bird-keeping woman of average age and low socioeconomic status, what is the estimated probability of lung cancer? The answer (a single number) should be based on the full model. \item Obtain a 95\% confidence interval for that last probability, based on the delta method. Do it the easiest way you can. Your answer is a pair of numbers. % \item Your answer to the last question made you uncomfortable. Why? Another approach is to start with a confidence interval for the log odds, and then use the fact that the function $p(x) = \frac{e^x}{1+e^x}$ is strictly increasing in $x$. Get the confidence interval this way. Again, your answer is a pair of numbers. Which confidence interval do you like more? \item Naturally, you should be able to interpret all the $Z$-tests too. Which one is comparable to the main likelihood ratio test you have just done? \item Also, are \emph{any} of the explanatory variables related to getting lung cancer? Carry out a single likelihood ratio test. You could do it from the default outut with a calculator, but use R. Get the $p$-value, too. \item Now please do the same as the last item, but with a Wald test. \end{enumerate} \item Finally and just for practice, fit a simple logistic regression model in which the single explanatory variable is number of cigarettes per day. \begin{enumerate} \item When a person from this population smokes ten more cigarettes per day, the odds of lung cancer are multiplied by $r$ (odds ratio). Give a point estimate of $r$. Your answer is a number. \item Using the \texttt{vcov} function and the delta method, give an estimate of the asymptotic variance of $r$. Your answer is a number. \end{enumerate} \end{enumerate} \vspace{10mm} \noindent \begin{center}\begin{tabular}{l} \hspace{6.5in} \\ \hline \end{tabular}\end{center} This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistics, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US} {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/appliedf13} {\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/appliedf13}} \end{document} # The Byrds, missing the grad level stuff. birds = read.table("http://www.utstat.toronto.edu/~brunner/appliedf13/code_n_data/hw/birdlung.data") colnames(birds) = c("cancer", "sex", "ses", "birdkeep", "age", "yrsmoke", "cigday") head(birds) attach(birds) table(cancer); prop.table(table(cancer)) full = glm(cancer ~ sex+ses+birdkeep+age+yrsmoke+cigday,family=binomial) summary(full) red = update(full, . ~ .-birdkeep) anova(red,full,test='Chisq') # Simple simple = glm(cancer~cigday); summary(simple) betahat = simple$coefficients; betahat #$ rhat = exp(10*betahat[2]); rhat v = vcov(simple); v vrhat = (10*exp(10*betahat[2]))^2*v[2,2]; vrhat