\documentclass[11pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers

\begin{document}
%\enlargethispage*{1000 pt}

\begin{center}
{\Large \textbf{STA 312f22 Assignment Seven}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent
These questions are practice for the quiz on Friday Nov. 11th, and are not to be handed in. The R part is deferred until next week.

\begin{enumerate}

\item Prove that the greater the log odds, the greater the probability.

\item If two events have equal probability, the odds ratio equals \underline{~~~~~~}.

\item For a multiple logistic regression model, if the value of the $j$th explanatory variable is increased by $c$ units and everything else remains the same, the odds of $Y=1$ are \underline{~~~~~~} times as great. Show the calculation.

\item For a multiple logistic regression model, let $P(Y_i=1| x_{i,1}, \ldots, x_{i,k}) = \pi(\mathbf{x}_i)$. Show that a linear model for the log odds is equivalent to
\begin{displaymath}
\pi(\mathbf{x}_i) = \frac{e^{\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_k x_{i,k}}}
                         {1+e^{\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_k x_{i,k}}}
                  = \frac{e^{\mathbf{x}_i^\prime\boldsymbol{\beta}}}
                         {1+e^{\mathbf{x}_i^\prime\boldsymbol{\beta}}}
\end{displaymath}

\item Write the log likelihood for the model in the last question, and simplify it as much as possible.

\item In logistic regression, the \emph{null model} is a model with no explanatory variables. That is, $\beta_1 = \beta_2 = \cdots = \beta_k = 0$. There is just one unknown parameter, $\beta_0$. For this special case, a closed-form expression for the MLE is available. Derive it.

\item That last question was an example of the \emph{invariance principle} of maximum likelihood estimation, which says the MLE of a function of the parameter is that function of the MLE. It is very handy. Now, still considering a logistic regression model with no explanatory variables,
    \begin{enumerate}
    \item Suppose $p$ (the sample proportion of $Y=1$ cases) is 0.57. What is $\widehat{\beta}_0$? Your answer is a number.
    % 0.2818512
    \item Suppose $\widehat{\beta}_0=-0.79$. What is $p$? Your answer is a number.
    % 0.3121687
    \end{enumerate}

\item It is natural to estimate $\mathbf{x}^\prime_i\boldsymbol{\beta}$ with $\mathbf{x}^\prime_i\widehat{\boldsymbol{\beta}}_n$. What is the asymptotic (approximate large-sample) distribution of $\mathbf{x}^\prime_i\widehat{\boldsymbol{\beta}}_n$? Use the formula sheet and show your work. Give the asymptotic expected value and variance.

\item We are working toward the Wald test of $H_0: \mathbf{L}\boldsymbol{\beta} = \mathbf{h}$.
    \begin{enumerate}
    \item What is the asymptotic distribution of $\mathbf{L}\widehat{\boldsymbol{\beta}}_n$? What facts from the formula sheet are you using? Give the expected value and covariance matrix.
    \item The asymptotic distribution of $W_n$ on the formula sheet uses another fact from the formula sheet. Which one?
    \item Why does the Wald statistic have $r$ degrees of freedom?
    \end{enumerate}

\item In logistic regression, the $z$-test of $H_0: \beta_j = 0$ uses the test statistic $z = \frac{\widehat{\beta}_j}{se_{\widehat{\beta}_j}}$. Show that for the Wald test of this null hypothesis, $W_n=z^2$.

\newpage

\item Consider a logistic regression in which the cases are newly married couples with both people from the same religion, the explanatory variable is religion (A, B, C and None -- let's call ``None'' a religion), and the response variable is whether the marriage lasted 5 years (1=Yes, 0=No).
    \begin{enumerate}
    \item Make a table with four rows, showing how you would set up indicator dummy variables for Religion, with None as the reference category.
    \item Add a column showing the odds of the marriage lasting 5 years. The \emph{symbols} for your dummy variables should not appear in your answer, because they are zeros and ones, and different for each row. But of course your answer contains $\beta$ values.
    \item What is the ratio of the odds of a marriage lasting 5 years or more for Religion C to the odds of lasting 5 years or more for No Religion? Answer in terms of the $\beta$ symbols of your model.
    \item What is the ratio of the odds of lasting 5 years or more for Religion A to the odds of lasting 5 years or more for Religion B? Answer in terms of the $\beta$ symbols of your model.
    \item You want to test whether Religion is related to whether the marriage lasts 5 years. State the null hypothesis in terms of one or more $\beta$ values.
    \item You want to know whether marriages from Religion A are more likely to last 5 years than marriages from Religion C. State the null hypothesis in terms of one or more $\beta$ values.
    \item You want to test whether marriages between people of No Religion have a 50-50 chance of lasting 5 years. State the null hypothesis in terms of one or more $\beta$ values.
    \end{enumerate}

\end{enumerate} % End of all the questions

\vspace{70mm}
%\newpage

\noindent
\begin{center}\begin{tabular}{l} \hspace{6in} \\ \hline \end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistics, University of Toronto. It is licensed under a
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely.
The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/312f22}
{\texttt{http://www.utstat.toronto.edu/\textasciitilde brunner/oldclass/312f22}}

\end{document}

\item The \emph{birth weight data} is a dataset containing information about a sample of new mothers. The response variable is a variable called \texttt{low}, an indicator of a baby with dangerously low weight at birth. To get the data and more information, type \texttt{library(MASS)}; \texttt{help(birthwt)} at the R prompt. Fit a logistic regression model with just the following explanatory variables: age, mother's weight, race, smoking status during pregnancy, and an indicator for any first-trimester visits (1 = one or more, 0 = none). Make race a factor, and keep the default dummy variable setup. Look at \texttt{help(factor)} if you need to. I used the optional labels to make my output more readable.
% I forgot family=binomial at first. Ouch.

Not all the explanatory variables are significantly related to low birth weight when you control for the others. The non-significant variables could be removed from the model, but this is a matter of taste. We'll leave them in this time.

Just to verify that we have the same model, my standard error for $\widehat{\beta}_0$ is 1.110641. If we don't agree, you might be using a different reference category for race, or you might have forgotten \texttt{family=binomial}.

    \begin{enumerate}
    \item Reproduce the standard errors from \texttt{summary} using the \texttt{vcov} function.
    \item Controlling for all the other variables, is mother's weight at last period related to low birth weight at the 0.05 significance level? In plain, non-statistical language, what do you conclude?
    %              Estimate Std. Error t value Pr(>|t|)
    % lwt         -0.012507   0.006387  -1.958  0.05021 .
    \item Allowing for all the other variables, is smoking during pregnancy related to low birth weight at the 0.05 significance level?
    In plain, non-statistical language, what do you conclude?
    %              Estimate Std. Error t value Pr(>|t|)
    % smoke        1.024100   0.387048   2.646  0.00815 **
    \item Correcting for all the other variables, the odds of a low birth weight baby are an estimated \underline{\hspace{10mm}} times as great for a mother who smokes during pregnancy.
    \item We need to test for race, controlling for all other variables in the model.
        \begin{enumerate}
        \item Do a likelihood ratio test. Give the value of $G^2$, the degrees of freedom, and the $p$-value. In plain, non-statistical language, what do you conclude?
        % Gsq = 7.79, df=2, p = 0.02
        \item Do a Wald test. Give the value of $W_n$, the degrees of freedom, and the $p$-value. In plain, non-statistical language, what do you conclude?
        % Wn = 7.41, df = 2, p = 0.025
        \end{enumerate}
    \item Allowing for all the other variables, is there a difference between Black and White mothers in their chances of having a low birth weight baby? In plain, non-statistical language, what do you conclude? Use the 0.05 significance level.
    %              Estimate Std. Error t value Pr(>|t|)
    % raceBlack    1.224274   0.517471   2.366  0.01799 *
    \item Correcting for all the other variables, the odds of a low birth weight baby are an estimated \underline{\hspace{10mm}} times as great for a Black mother, compared to a White mother.
    \item Give a 95\% confidence interval for that last number.
    \item This last question is about comparing Black and Other mothers in the chances of a low birth weight baby, controlling for all other variables.
        \begin{enumerate}
        \item Carry out a Wald test. What do you conclude?
        \item Correcting for all the other variables, the odds of a low birth weight baby are an estimated \underline{\hspace{10mm}} times as great for a Black mother, compared to an Other mother.
        \item Give a 95\% confidence interval for the odds ratio. This question might require a bit of thought.
To help you along, what is the approximate large-sample distribution of the difference between the two $\widehat{\beta}$ values? The hypothesis matrix $\mathbf{L}$ from the Wald test should be useful. \end{enumerate} \end{enumerate} % End of the R question.
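
\noindent
As a starting point for the R question, the setup described in its first paragraph might be sketched as follows. This is only a sketch under stated assumptions, not necessarily the code used to prepare the question: the names \texttt{bw}, \texttt{visit} and \texttt{fit} are made up for illustration, and the race labels assume the coding documented in \texttt{help(birthwt)} (1 = white, 2 = black, 3 = other).

\begin{verbatim}
library(MASS)   # Supplies the birthwt data frame

# Work on a copy: add a labelled race factor and a visit indicator.
# bw and visit are illustrative names, not part of the MASS package.
bw <- within(birthwt, {
    race  <- factor(race, levels = 1:3,
                    labels = c("White", "Black", "Other"))
    visit <- as.numeric(ftv > 0)   # 1 = one or more visits, 0 = none
})

# family = binomial is what makes this a logistic regression;
# without it, glm fits an ordinary normal linear model.
fit <- glm(low ~ age + lwt + race + smoke + visit,
           family = binomial, data = bw)
summary(fit)
\end{verbatim}

\noindent
With the labels in this order, R's default dummy variable setup takes the first level (White) as the reference category. If the standard error for $\widehat{\beta}_0$ does not match the value quoted in the question, the level order and the \texttt{family} argument are the first things to check.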