\documentclass[12pt]{article} %\usepackage{amsbsy} % for \boldsymbol and \pmb \usepackage{graphicx} % To include pdf files! \usepackage{amsmath} \usepackage{amsbsy} \usepackage{amsfonts} \usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links \usepackage{fullpage} %\pagestyle{empty} % No page numbers \begin{document} %\enlargethispage*{1000 pt} \begin{center} {\Large \textbf{STA 2101/442f12 Assignment Eight}}\footnote{Copyright information is at the end of the last page.} \vspace{1 mm} \end{center} \noindent Please bring your R printout for Question~\ref{birds} to the quiz, and also your SAS log file and list file from the Question~\ref{SAT}. I really mean the log file, not just a listing of the SAS program. The log file and the list file \emph{must be from the same run of SAS.} It is very common to print the good list file, but bring an older log file that still has errors. This could lose you substantial marks. The other questions are just practice for the quiz. Any necessary formulas will be provided. \begin{enumerate} \item Show that for Poisson regression, the natural link is the log link. What is the dispersion parameter? What is the variance function? \item Consider a generalized linear model in which the response variable has a gamma distribution with unknown parameter $\beta$, and known $\alpha=1$. What is the natural link function? Why might using the natural link function present problems in a generalized linear model? \item \label{delta} In the Logistic regression with R slide show, I reproduced the standard error for an estimated probability of passing the course. How did I do it? Show your work. \item \label{birds} People who raise large numbers of birds inhale potentially dangerous material, especially tiny fragments of feathers. Can this be a risk factor for lung cancer, controlling for other possible risk factors? From the \href{http://www.utstat.toronto.edu/~brunner/appliedf12/data/}{Data Sets} link on the course home page, you can find the Bird Lung Cancer data. For a sample of birdkeepers and non-birdkeepers, it has Whether they got lung cancer (1=Yes, 0=No), Gender (0=M, 1=F), Socioeconomic Status (0=Low, 1=High), Whether they are birdkeepers (1=Yes, 0=No) Age, How many years they have been smoking (including zero), and Cigarettes per day. \texttt{help(colnames)} may be useful. First, make tables of the binary variables using \texttt{table}, Use \texttt{prop.table} to find out the percentages. What proportion of the sample had cancer. Any comments? There is one primary issue in this study: Controlling for all other variables, is birdkeeping significantly related to the chance of getting lung cancer? Perform a likelihood ratio test to answer the question. \begin{enumerate} \item In symbols, what is the null hypothesis? \item What is the value of the likelihood ratio test statistic $G^2$? The answer is a number. \item What are the degrees of freedom for the test? The answer is a number. \item What is the $p$-value? The answer is a number. \item What do you conclude? Presence of a relationship is not enough. Say what happened. \item For a non-smoking, bird-keeping woman of average age and low socioeconomic status, what is the estimated probability of lung cancer? The answer (a single number) should be based on the full model. \item For a non-smoking, non-bird-keeping woman of average age and low socioeconomic status, what is the estimated probability of lung cancer? The answer (a single number) should be based on the full model. \item Obtain a 95\% confidence interval for that last probability. Please don't use the delta method, or anyway don't calculate g-dot yourself; it's too much work! Your answer is a pair of numbers. \item Naturally, you should be able to interpret all the $Z$-tests too. Which one is comparable to the main likelihood ratio test you have just done? \item Also, are \emph{any} of the explanatory variables related to getting lung cancer? Carry out a single likelihood ratio test. You could do it from the default outut with a calculator, but use R. Get the $p$-value, too. \item Now please do the same as the last item, but with a Wald test. \end{enumerate} \item \label{SAT} This question uses the \href{http://www.utstat.toronto.edu/~brunner/appliedf12/data/sat.data} {\texttt{sat.data}} file you first saw in Assignment 5. There is a link on the course web page in case the one in this document does not work. This is just to get your feet wet with SAS and unix, in case you have not used these tools before. Using SAS, get sample sizes, means and standard deviations for all three variables. That's it. Bring your log file and your list file to the quiz. \end{enumerate} \vspace{70mm} %\newpage \noindent \begin{center}\begin{tabular}{l} \hspace{6.5in} \\ \hline \end{tabular}\end{center} This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistics, University of Toronto. It is licensed under a \href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US} {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website: \href{http://www.utstat.toronto.edu/~brunner/oldclass/appliedf12} {\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/appliedf12}} \end{document} # The Byrds, missing the grad level stuff. birds = read.table("http://www.utstat.toronto.edu/~brunner/appliedf12/data/birdlung.data") colnames(birds) = c("cancer", "sex", "ses", "birdkeep", "age", "yrsmoke", "cigday") head(birds) attach(birds) table(cancer); prop.table(table(cancer)) full = glm(cancer ~ sex+ses+birdkeep+age+yrsmoke+cigday,family=binomial) summary(full) red = update(full, . ~ .-birdkeep) anova(red,full,test='Chisq') ============= Output ================ > birds = read.table("http://www.utstat.toronto.edu/~brunner/312f12/code_n_data/birdlung.data") > colnames(birds) = c("cancer", "sex", "ses", "birdkeep", "age", "yrsmoke", "cigday") > head(birds) cancer sex ses birdkeep age yrsmoke cigday 1 1 0 0 1 37 19 12 2 1 0 0 1 41 22 15 3 1 0 1 0 43 19 15 4 1 0 0 1 46 24 15 5 1 0 0 1 49 31 20 6 1 0 1 0 51 24 15 > attach(birds) > table(cancer); prop.table(table(cancer)) cancer 0 1 98 49 cancer 0 1 0.6666667 0.3333333 > full = glm(cancer ~ sex+ses+birdkeep+age+yrsmoke+cigday,family=binomial) > summary(full) Call: glm(formula = cancer ~ sex + ses + birdkeep + age + yrsmoke + cigday, family = binomial) Deviance Residuals: Min 1Q Median 3Q Max -1.5642 -0.8333 -0.4605 0.9808 2.2460 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.93736 1.80425 -1.074 0.282924 sex 0.56127 0.53116 1.057 0.290653 ses 0.10545 0.46885 0.225 0.822050 birdkeep 1.36259 0.41128 3.313 0.000923 *** age -0.03976 0.03548 -1.120 0.262503 yrsmoke 0.07287 0.02649 2.751 0.005940 ** cigday 0.02602 0.02552 1.019 0.308055 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 187.14 on 146 degrees of freedom Residual deviance: 154.20 on 140 degrees of freedom AIC: 168.2 Number of Fisher Scoring iterations: 5 > red = update(full, . ~ .-birdkeep) > anova(red,full,test='Chisq') Analysis of Deviance Table Model 1: cancer ~ sex + ses + age + yrsmoke + cigday Model 2: cancer ~ sex + ses + birdkeep + age + yrsmoke + cigday Resid. Df Resid. Dev Df Deviance Pr(>Chi) 1 141 165.87 2 140 154.20 1 11.67 0.0006352 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 >