%                                   302f20Assignment8.tex
\documentclass[11pt]{article} 
%\usepackage{amsbsy} % for \boldsymbol and \pmb 
\usepackage{graphicx} % To include pdf files!
\usepackage{amsmath}
\usepackage{amsbsy}
\usepackage{amsfonts}
\usepackage{comment}
\usepackage{euscript} % for \EuScript
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % For links
\usepackage{fullpage}
%\pagestyle{empty} % No page numbers


\begin{document}
%\enlargethispage*{1000 pt} 

\begin{center}   
{\Large \textbf{STA 302f20 Assignment Eight}\footnote{This assignment was prepared by  \href{http://www.utstat.toronto.edu/~brunner}{Jerry Brunner},
Department of Statistical Sciences, University of Toronto. It is licensed under a 
\href{http://creativecommons.org/licenses/by-sa/3.0/deed.en_US}
     {Creative Commons Attribution - ShareAlike 3.0 Unported License}. Use any part of it as you like and share the result freely. The \LaTeX~source code is available from the course website:
\href{http://www.utstat.toronto.edu/~brunner/oldclass/302f20} {\small\texttt{http://www.utstat.toronto.edu/$^\sim$brunner/oldclass/302f20}}} }
\vspace{1 mm}
\end{center}

\noindent
The following problems are not to be handed in. They are preparation for the quiz in tutorial and the final exam. Please try them before looking at the answers. Use the formula sheet.  Please remember that the R part (Question~\ref{computer}) is \emph{not a group project}. You may compare numerical answers, but do not show anyone your code or look at anyone else's.

\begin{enumerate} 

\item \label{talk} Data from a STA302 class many years ago consist of quiz average, computer assignment average, midterm score and Final Exam score, all in percent. We seek to predict final exam score from the term work. 
    \begin{enumerate}
        \item \label{model} Write the regression equation in scalar form using $x_{i,j}$ and $y_i$ variables. Assume the order of predictor variables given above. 
        \item What is the expected final exam score for a student with a 70\% average on the quizzes, 85\% on the computer assignments, and 65\% on the midterm? Answer in terms of $\beta_j$ values.
        \item For any fixed quiz average and computer average, a score one point higher on the midterm yields an expected mark on the Final Exam that is \underline{\hspace{10mm}} higher.

        \item We want a hypothesis test to answer this question: Are any of the term work variables useful in predicting final exam score? This is one test.
                \begin{enumerate}
                    \item State the null hypothesis in terms of scalar $\beta_j$ values. 
                    \item The null hypothesis could be written $H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{t}$. Give the $\mathbf{C}$ and $\mathbf{t}$ matrices.
                    \item The null hypothesis could be tested using the full-reduced model approach. Give the regression equation for the reduced model. % Do not renumber the explanatory variables or regression coefficients. 
                \end{enumerate}

        \item Controlling for computer assignment average and midterm score, is quiz average related to Final Exam score?
                \begin{enumerate}
                    \item State the null hypothesis in scalar form. 
                    \item The null hypothesis could be written $H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{t}$. Give the $\mathbf{C}$ and $\mathbf{t}$ matrices.
                    \item The null hypothesis could be tested using the full-reduced model approach. Give the regression equation for the reduced model. Do not renumber the explanatory variables or regression coefficients. 
                \end{enumerate}

% \pagebreak

        \item Allowing for quiz average and computer assignment average, is midterm score a predictor of Final Exam score?
                \begin{enumerate}
                    \item State the null hypothesis in terms of scalar $\beta_j$ values. 
                    \item The null hypothesis could be written $H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{t}$. Give the $\mathbf{C}$ and $\mathbf{t}$ matrices.
                    \item The null hypothesis could be tested using the full-reduced model approach. Give the regression equation for the reduced model. Do not renumber the explanatory variables or regression coefficients. 
                \end{enumerate}

        \item Holding quiz average and midterm score fixed, is computer assignment average connected to Final Exam score?
                \begin{enumerate}
                    \item State the null hypothesis in terms of scalar $\beta_j$ values. 
                    \item The null hypothesis could be written $H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{t}$. Give the $\mathbf{C}$ and $\mathbf{t}$ matrices.
                    \item The null hypothesis could be tested using the full-reduced model approach. Give the regression equation for the reduced model. Do not renumber the explanatory variables or regression coefficients. 
                \end{enumerate}

        \item Controlling for computer assignment average, is quiz average or midterm score (or both) related to Final Exam score? This is one test.
                \begin{enumerate}
                    \item State the null hypothesis in terms of scalar $\beta_j$ values. 
                    \item The null hypothesis could be written $H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{t}$. Give the $\mathbf{C}$ and $\mathbf{t}$ matrices.
                    \item The null hypothesis could be tested using the full-reduced model approach. Give the regression equation for the reduced model. Do not renumber the explanatory variables or regression coefficients. 
                \end{enumerate}

        \item The professor thinks that the quizzes and midterm should have equal weight, and should be worth twice as much as the computer assignments. If this idea is correct, it should be reflected in the relationship of the term marks to the final exam. Also, it makes sense that if a student got zero on all three components of the term mark, he or she should also expect a zero on the final exam --- even though this extreme case is outside the range of the data. Taken together, these ideas represent an unusual but testable null hypothesis. If it is rejected, we could say that the professor's ideas are not supported by the data. 
                \begin{enumerate}
                    \item State the null hypothesis in terms of scalar $\beta_j$ values. 
                    \item The null hypothesis could be written $H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{t}$. Give the $\mathbf{C}$ and $\mathbf{t}$ matrices.
                    \item The null hypothesis could be tested using the full-reduced model approach. Give the regression equation for the reduced model. Do not renumber the explanatory variables or regression coefficients. 
                \end{enumerate}

    \end{enumerate} % End of the question that is just talk.
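
\noindent
As an illustration of the notation only (a hypothetical example, not one of the questions above), consider a smaller model with two explanatory variables, $y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \epsilon_i$. It is assumed here that $\boldsymbol{\beta}$ has the intercept $\beta_0$ as its first element; check this against the formula sheet. The null hypothesis $H_0: \beta_1 = \beta_2$ says the same thing as $\beta_1 - \beta_2 = 0$, so one way to write it in the form $H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{t}$ is
\begin{displaymath}
\begin{pmatrix} 0 & 1 & -1 \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{pmatrix}
= \begin{pmatrix} 0 \end{pmatrix}.
\end{displaymath}
In this sketch, $\mathbf{C}$ has one row for each equality in $H_0$ and one column for each element of $\boldsymbol{\beta}$. Setting $\beta_2 = \beta_1$ in the regression equation, without renumbering anything, gives the reduced model $y_i = \beta_0 + \beta_1 (x_{i,1} + x_{i,2}) + \epsilon_i$.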

\pagebreak %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\item \label{computer} Data from the example of Question~\ref{talk} are available 
\href{http://www.utstat.utoronto.ca/~brunner/data/legal/LittleStatclassdata.txt}{here}. The URL is \\
\verb!http://www.utstat.utoronto.ca/~brunner/data/legal/LittleStatclassdata.txt! .
In the data file, the quiz averages and computer averages are out of ten, but you should rescale them (multiply by 10) so they are out of 100. Prepare one pdf document showing your input and output for the questions below. You might be asked to attach it to the quiz. 
    \begin{enumerate}
        \item Start with a \texttt{summary} of the data frame, a correlation matrix, and a matrix of scatterplots using \texttt{pairs}, just so you have a general idea of what is going on. 
                \begin{enumerate}
                    \item What is the median score on the computer assignments? 
                          The answer is a number.
                    \item What is the correlation between the computer average 
                          and the final exam score?  The answer is a number.
                    \item You multiplied computer average and quiz average by 10 to convert to percent. Does this affect their correlations with other variables? Answer Yes or No and prove your answer.
                    \item Suppose you were to fit a simple regression model with quiz average as the single explanatory variable and final exam score as the response variable. Without actually fitting the model, what would $R^2$ be? The answer is a number. % 0.39597772^2 = 0.157 or so.
                \end{enumerate}

        \item Fit the full model (your answer to Question~\ref{model}) and display \texttt{summary} on it. The answers to many of the questions below are in the output of \texttt{summary}, or can be obtained from the \texttt{summary} output by a quick calculation.
        \item What is $\widehat{\beta}_2$? The answer is a number.
        
        \item What is $n$ for this problem? 
        \item What is $k$ for this problem? 
        \item What are the dimensions of the $\mathbf{X}$ matrix? The answer is a pair of numbers, number of rows and number of columns. 
        \item What are the dimensions of $\widehat{\boldsymbol{\beta}}$? The answer is a pair of numbers, number of rows and number of columns. 
        \item What are the dimensions of $\widehat{\boldsymbol{\epsilon}}$? The answer is a pair of numbers, number of rows and number of columns. 
        \item  What are the dimensions of 
$\widehat{\boldsymbol{\epsilon}}^{\,\prime\,} \widehat{\boldsymbol{\epsilon}}$?
        \item What are the dimensions of the $\widehat{\mathbf{y}}$ matrix? The answer is a pair of numbers,  number of rows and number of columns.
        \item What are the dimensions of the hat matrix $\mathbf{H}$? The answer is a pair of numbers, number of rows and number of columns. 
        \item What is $\widehat{\boldsymbol{\epsilon}}^{\,\prime\,} \widehat{\boldsymbol{\epsilon}}$? You can calculate this number from the output of \texttt{summary} employing R as a calculator, using the fact that \texttt{Residual standard error} from your printout is the square root of \emph{MSE}. The answer is subject to a bit of rounding error, but it's okay. Get a more accurate answer, too.
        \item What is $SST$? The answer is a single number. First, obtain the number from the output of \texttt{summary}, using R as a calculator. There is some algebra; show your work. Then, check the result by a more direct calculation. I used the \texttt{var} function. The second way is more accurate, but the first one is more interesting.

        
        \item What is the predicted final exam score for a student with a 70\% average on the quizzes, 85\% on the computer assignments, and 65\% on the midterm?  The answer is a number.
        \item For any fixed quiz average and computer average, a score one point higher on the midterm yields a predicted mark on the Final Exam that is \underline{\hspace{10mm}} higher. The answer is a number. % Betahat3 = 
        \item What is the largest $\widehat{\epsilon}_i$ in absolute value? 
        \item For each of the following null hypotheses, give the value of the test statistic and the $p$-value. The answers are numbers that appear in the output from \texttt{summary}. Also state whether you reject $H_0$ at $\alpha = 0.05$. 
\begin{center}
\begin{tabular}{|c|c|c|c|} \hline 
$H_0$ & Test Statistic & $p$-value & Reject $H_0$?  \\ \hline
$\beta_1 = \beta_2 = \beta_3 = 0$ &  &  & \\ \hline
$\beta_0 = 0$ &  &  & \\ \hline
$\beta_1 = 0$ &  &  & \\ \hline
$\beta_2 = 0$ &  &  & \\ \hline
$\beta_3 = 0$ &  &  & \\ \hline
\end{tabular}
\end{center}
        \item What proportion of the variation (sum of squares) in final exam mark is explained by the term work? The answer is a number. % R^2 = 0.2662

        \item There is a hypothesis test to answer the question: Controlling for computer assignment average and midterm score, is quiz average related to Final Exam score?
                \begin{enumerate}
                    \item State the null hypothesis in symbols. 
                    \item A nice two-sided $t$-test is part of the default output. Give the value of the $t$ statistic, the degrees of freedom, and the $p$-value.
                    \item Do you reject the null hypothesis at $\alpha=0.05$? Answer Yes or No.
                    \item Are the results statistically significant at the $\alpha=0.05$ level? Answer Yes or No.
                    \item In plain, non-statistical language, what do you conclude?
                    \item You can test this same null hypothesis in the form $H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{t}$, using the general linear $F$-test. Do it using my \texttt{ftest} function. Does $F=t^2$? Compare the $p$-values. 
                    \item Carry out the same test using the full-reduced model approach.
                    \item Once you have allowed for computer assignment average and midterm score, what proportion of the remaining variation does quiz average explain? The answer is a number between zero and one.
                \end{enumerate}

        \item Give a 95\% confidence interval for $\beta_1$. Why does this confidence interval provide one more way of testing $H_0: \beta_1=0$?

        \item Consider a student who is ``average'' on all three explanatory variables. The expected final exam score for such a student would be 
\begin{displaymath}
     E(y|x_1 = \overline{x}_1, x_2 = \overline{x}_2, x_3 = \overline{x}_3) = \beta_0 + \beta_1\overline{x}_1 + \beta_2\overline{x}_2 + \beta_3\overline{x}_3 ,
\end{displaymath}
where the $\overline{x}_j$ are the sample means of quiz average, computer average and midterm test. 
                \begin{enumerate}
                    \item Give a point estimate of the expected value. The answer is a number.
                    \item You knew it was equal to $\overline{y}$ all along. Why? % A4Q3 
                    \item Give a 95\% confidence interval for the expected value. The answer is a pair of numbers, a lower confidence limit and an upper confidence limit.  
                    \item Use the \texttt{t.test} function (see \texttt{help(t.test)}) to get the usual 95\% confidence interval around $\overline{y}$. The two confidence intervals are a little bit different. Why? 
                \end{enumerate}        

        \item \label{plaintalk} For each of the following questions, give the null hypothesis you tested to answer the question, and also a conclusion expressed in plain, non-statistical language. Remember the rules: No statistical terminology, draw a directional conclusion if you can, be guided by $\alpha=0.05$ but never mention it, and don't accept $H_0$. All the information you need is in the output of \texttt{summary} from the full model.
        \begin{enumerate}
            \item Controlling for quiz average and computer average, is mark on the midterm test related to mark on the final exam?
            \item Allowing for mark on the midterm test and quiz average, is computer average a useful predictor of mark on the final exam?
            \item Taking into account mark on the midterm test and computer average, is quiz average connected to mark on the final exam?
            \item Are any of the predictor variables useful?
        \end{enumerate}

        \item For Question~\ref{plaintalk}, suppose we treated the last test (Are any of the predictor variables useful?) as an overall test, and treated the other tests as follow-ups with a Bonferroni correction. What is the conclusion now? Only mention findings that are statistically significant with the Bonferroni correction.

    \item Controlling for mark on the midterm test, are the other two variables (either or both) related to mark on the Final exam?
     \begin{enumerate}
            \item State the null hypothesis in terms of scalar $\beta_j$ values.
            \item State the null hypothesis in matrix terms. That is, give the matrices $\mathbf{C}$, $\boldsymbol{\beta}$ and $\mathbf{t}$ in $H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{t}$.
            \item Write the reduced model. Please do not re-number the variables or the $\beta_j$ parameters.
            \item Obtain the $F^*$ test statistic using my \texttt{ftest} function.
                    \item Carry out the same test using the full-reduced model approach.
            \item Give the $p$-value. The answer is a number. 
            \item Do you reject $H_0$ at $\alpha = 0.05$? Answer Yes or No.
            \item Are the results statistically significant at the $\alpha = 0.05$ level? Answer Yes or No.
            \item Allowing for mark on the midterm test, what proportion of the remaining variation in final exam score is explained by computer average and quiz average? 
            \item State your conclusions (if any) in plain, non-statistical language.
        \end{enumerate}
    \end{enumerate} % End of the computer question

\end{enumerate} % End of all the questions
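
\vspace{3mm}
\noindent
The following identities may help in checking the R work for Question~\ref{computer}. They are a sketch in notation that is assumed, not confirmed, to agree with the formula sheet: $n$ is the sample size, $k$ is the number of explanatory variables in the full model, $q$ is the number of rows of $\mathbf{C}$, $SSE_F$ and $SSE_R$ are the error sums of squares of the full and reduced models, $s^2_y$ is the sample variance of the response variable (with $n-1$ in the denominator), and $MSE$ and $R^2$ come from the full model, which includes an intercept.
\begin{align*}
SSE_F & = \widehat{\boldsymbol{\epsilon}}^{\,\prime\,} \widehat{\boldsymbol{\epsilon}} = (n-k-1) \, MSE \\
SST   & = \sum_{i=1}^n (y_i-\overline{y})^2 = (n-1) \, s^2_y = \frac{SSE_F}{1-R^2} \\
F^*   & = \frac{(SSE_R-SSE_F)/q}{MSE} \\
\frac{SSE_R-SSE_F}{SSE_R} & = \frac{qF^*}{qF^* + n-k-1}
\end{align*}
The last line expresses the proportion of remaining variation explained by the variables being tested, first in terms of sums of squares and then in terms of the test statistic.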

%\vspace{60mm}

\end{document}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


% \vspace{30mm} \hrule \vspace{30mm} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \vspace{30mm}\hrule\vspace{30mm}

% $E(y|x_1 = \overline{x}_1, x_2 = \overline{x}_2, x_3 = \overline{x}_3)$






% Order is      QuizAve CompAve MidTerm          FinalExam