\documentclass[11pt]{article}
%\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts} % for \mathbb{R} The set of reals
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers
\begin{document}
%\enlargethispage*{1000 pt}
\begin{center}
{\Large \textbf{STA 302f13 Assignment Seven}}\footnote{Copyright information is at the end of the last page.}
\vspace{1 mm}
\end{center}

\noindent
This assignment uses the data file
\href{http://www.utstat.toronto.edu/~brunner/302f13/code_n_data/hw/CensusTract.data}
{\texttt{CensusTract.data}}, given in \emph{Applied Linear Statistical Models} (1996), by Neter et al. The data are used here without permission. There is a link on the course home page in case the one in this document does not work. The cases (there are $n$ cases) are a sample of census tracts in the United States. For each census tract, the following variables are recorded.

\vspace{5mm}

\begin{tabular}{ll}
\texttt{area}   & Land area in square miles \\
\texttt{pop}    & Population in thousands \\
\texttt{urban}  & Percent of population in cities \\
\texttt{old}    & Percent of population 65 or older \\
\texttt{docs}   & Number of active physicians \\
\texttt{beds}   & Number of hospital beds \\
\texttt{hs}     & Percent of population 25 or older completing 12+ years of school \\
\texttt{labor}  & Number of persons 16+ employed or looking for work \\
\texttt{income} & Total before-tax income in millions of dollars \\
\texttt{crimes} & Total number of serious crimes reported by police \\
\texttt{region} & Region of the country: 1=NE, 2=NC, 3=S, 4=W \\
\end{tabular}

% \vspace{5mm}
% \noindent
% In the analyses you do, the dependent variable is \texttt{crimes}, and \texttt{region} is excluded for now. That means there are $k=9$ potential independent variables.
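
\vspace{5mm}
\noindent
As a starting point, and assuming the link above is still live, the data might be read into R along the following lines. The object names \texttt{base} and \texttt{census} are only illustrative, and this is a sketch rather than a required approach.
\begin{verbatim}
# Split the address in two so the line fits on the page
base = "http://www.utstat.toronto.edu/~brunner/302f13/code_n_data/hw/"
census = read.table(paste0(base, "CensusTract.data"))
head(census)  # Column names should match the variable list above
\end{verbatim}
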
\begin{enumerate}

\item First, fit a regression model with \texttt{crimes} as the dependent variable and just one independent variable: \texttt{pop}.
    \begin{enumerate}
    \item In plain, non-statistical language, what do you conclude from this analysis? The answer is something about population size and number of crimes.
    \item What proportion of the variation in number of crimes is explained by population size? The answer is a number between zero and 1.
    \end{enumerate}
\textbf{Bring your printout to the quiz.}

\item Based on that last analysis, we will create a new dependent variable called crime \emph{rate}, defined as number of crimes divided by population size. Now fit\footnote{To ``fit'' a model means to estimate the parameters.} a new regression model in which crime rate is a function of \texttt{area}, \texttt{urban}, \texttt{old}, \texttt{docs}, \texttt{beds}, \texttt{hs}, \texttt{labor} and \texttt{income}.
% Don't hesitate to type \texttt{help(lm)} to find out more, and do not hesitate to do additional
Based on this model,
    \begin{enumerate}
    \item What is $k$? The answer is a number. % 8
    \item What is $\widehat{\beta}_4$? The answer is a number. % 0.0042640
    \item Give the test statistic, the degrees of freedom and the $p$-value for each of the following null hypotheses. The answers are numbers from your printout.
        \begin{enumerate}
        \item $H_0: \beta_1=\beta_2= \cdots = \beta_8 = 0$
        \item $H_0: \beta_6=0$
        \item $H_0: \beta_0=0$
        \end{enumerate}
    \item What proportion of the variation in crime rate is explained by the independent variables in this model? The answer is a number. % 0.3214
    \item What is the smallest value of $\widehat{\epsilon}_i$? The answer is a number. %
    \item What is the largest value of $\widehat{\epsilon}_i$? The answer is a number. %
    \item Look at the output of \texttt{summary}. For the first entry under ``\texttt{t value}'' (that's \texttt{2.057}), what is the null hypothesis? The answer is a symbolic statement involving one or more Greek letters.
    % H0: beta0=0
    \item Look at the $F$ test at the end of the \texttt{summary} output. What is the null hypothesis? The answer is a symbolic statement involving one or more Greek letters.
    % H0: beta1 = ... = beta8 = 0
    \item Controlling for all the other variables in the model, is the number of hospital beds related to crime rate?
        \begin{enumerate}
        \item Give the null hypothesis in symbols.
        \item Give the value of the test statistic. The answer is a number from your printout.
        \item Give the $p$-value. The answer is a number from your printout.
        \item Do you reject the null hypothesis at $\alpha = 0.05$? Answer Yes or No.
        \item Allowing for other variables, census tracts with more hospital beds tend to have \underline{~~~~~~~} crime rates. % lower
        \end{enumerate}
    \item Controlling for all the other variables in the model, is the number of physicians related to crime rate?
        \begin{enumerate}
        \item Give the null hypothesis in symbols.
        \item Give the value of the test statistic. The answer is a number from your printout.
        \item Give the $p$-value. The answer is a number from your printout.
        \item Do you reject the null hypothesis at $\alpha = 0.05$? Answer Yes or No.
        \item Allowing for other variables, census tracts with more physicians tend to have \underline{~~~~~~~} crime rates. % higher
        \end{enumerate}
    \item Predict the crime rate for a new census tract with an area of 2,500 square miles, 50 percent urban, 10 percent senior citizens, 2,000 doctors, 6,000 hospital beds, 50 percent finished high school, a labour force of 450 thousand, and a total income of 6,500 million dollars. Give both a predicted value (a single number) and a 95\% prediction interval.
    \end{enumerate}
\textbf{Bring your printout to the quiz.}

\item The general linear model with normal error terms is $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$, where the columns of $\mathbf{X}$ are linearly independent and $\boldsymbol{\epsilon} \sim N_n(\mathbf{0},\sigma^2\mathbf{I}_n)$.
You know that
    \begin{itemize}
    \item $\widehat{\boldsymbol{\beta}} = (\mathbf{X}^\prime \mathbf{X})^{-1} \mathbf{X}^\prime \mathbf{Y} \sim N_{k+1}\left(\boldsymbol{\beta}, \sigma^2 (\mathbf{X}^\prime \mathbf{X})^{-1}\right)$
    \item $SSE/\sigma^2 = \widehat{\boldsymbol{\epsilon}}^\prime \widehat{\boldsymbol{\epsilon}}/\sigma^2 \sim \chi^2(n-k-1)$, independent of $\widehat{\boldsymbol{\beta}}$.
    \end{itemize}
Derive the $(1-\alpha)\times 100\%$ prediction interval for a new observation from this population.

\end{enumerate}

% \vspace{20mm}

\noindent
\begin{center}\begin{tabular}{l} \hspace{6in} \\ \hline \end{tabular}\end{center}
This assignment was prepared by \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner}, Department of Statistical Sciences, University of Toronto. It is licensed under a
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
{Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/302f13}
{\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/302f13}}

\end{document}

% This is the only question that does not require R. You may use the fact

``Census Tract data,''

\vspace{5mm}
\noindent
These problems are preparation for the quiz in tutorial on Friday November 1st, and are not to be handed in.

# R work
rm(list=ls())
census = read.table("http://www.utstat.toronto.edu/~brunner/302f13/code_n_data/hw/CensusTract.data")
attach(census)

# Q1
summary(lm(crimes~pop,data=census)) # 95 percent

# Q2
crimerate = crimes/pop
mod = lm(crimerate ~ area + urban + old + docs + beds + hs + labor + income)
summary(mod)
newregion = data.frame(area=2500, urban=50, old=10, docs=2000, beds=6000, hs=50, labor=450, income=6500)
predict(mod,newdata=newregion,interval='prediction')


==== Session =====

> census = read.table("http://www.utstat.toronto.edu/~brunner/302f13/code_n_data/hw/CensusTract.data")
> attach(census)
The following object(s) are masked from 'census (position 3)':

    area, beds, crimes, docs, hs, income, labor, old, pop, region, urban
The following object(s) are masked from 'census (position 4)':

    area, beds, crimes, docs, hs, income, labor, old, pop, region, urban
The following object(s) are masked from 'census (position 5)':

    area, beds, crimes, docs, hs, income, labor, old, pop, region, urban
>
> # Q1
> summary(lm(crimes~pop,data=census)) # 95 percent

Call:
lm(formula = crimes ~ pop, data = census)

Residuals:
    Min      1Q  Median      3Q     Max
-114137   -3969     977    5443   91703

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -6413.391   1956.590  -3.278  0.00132 **
pop            66.469      1.229  54.077  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 18730 on 139 degrees of freedom
Multiple R-squared: 0.9546,	Adjusted R-squared: 0.9543
F-statistic:  2924 on 1 and 139 DF,  p-value: < 2.2e-16

>
> # Q2
> crimerate = crimes/pop
> mod = lm(crimerate ~ area + urban + old + docs + beds + hs + labor + income)
> summary(mod)

Call:
lm(formula = crimerate ~ area + urban + old + docs + beds + hs +
    labor + income)

Residuals:
     Min       1Q   Median       3Q      Max
-28.1128  -8.3957  -0.4209   7.1998  31.1864

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.0000936 10.2108838   2.057 0.041691 *
area         0.0014182  0.0003977   3.566 0.000506 ***
urban        0.1489428  0.0638183   2.334 0.021114 *
old          0.0858062  0.4465427   0.192 0.847915
docs         0.0042640  0.0019497   2.187 0.030502 *
beds        -0.0015261  0.0006059  -2.519 0.012972 *
hs           0.4475895  0.1415152   3.163 0.001939 **
labor        0.0019947  0.0238075   0.084 0.933354
income       0.0001003  0.0016995   0.059 0.953037
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.24 on 132 degrees of freedom
Multiple R-squared: 0.3214,	Adjusted R-squared: 0.2803
F-statistic: 7.815 on 8 and 132 DF,  p-value: 1.472e-08

> newregion = data.frame(area=2500, urban=50, old=10, docs=2000, beds=6000, hs=50, labor=450, income=6500)
> predict(mod,newdata=newregion,interval='prediction')
       fit      lwr      upr
1 56.15093 31.75172 80.55013
>
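
# Hand check of the prediction interval: a sketch added for illustration, not
# part of the original session. It assumes mod and newregion from above are
# still in the workspace. The names X, x0, MSE, h0, yhat, tcrit and me are
# introduced here. The formula used is the usual one for the normal linear model:
#   x0'b  +/-  t_{alpha/2}(n-k-1) * sqrt( MSE * (1 + x0'(X'X)^{-1} x0) )
X = model.matrix(mod)                            # n by (k+1) design matrix
x0 = c(1, unlist(newregion))                     # New x values, with a leading 1 for the intercept
MSE = sum(residuals(mod)^2) / df.residual(mod)   # SSE/(n-k-1)
h0 = drop(t(x0) %*% solve(t(X) %*% X) %*% x0)    # x0'(X'X)^{-1} x0
yhat = sum(x0 * coef(mod))                       # Point prediction x0'b
tcrit = qt(0.975, df.residual(mod))              # Critical value for a 95% interval
me = tcrit * sqrt(MSE * (1 + h0))                # Margin of error
c(fit = yhat, lwr = yhat - me, upr = yhat + me)  # Should agree with predict() above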