STA429/1007 Assignment 4

Quiz on Thursday Oct. 18th


This assignment is based on material in Chapters 1 through 3 of the online text, and associated lecture material.

  1. Explain the difference between random assignment and random sampling. Check Chapter One if you need to. Give examples of studies where you have each one without the other, both, and neither (4 examples).
  2. Explain the difference between an observational study and an experimental study. Check Chapter One if you need to. Give an example of each one. For clarity, your examples should have only one independent variable.
  3. High school history classes from across Ontario are randomly assigned to either a discovery-oriented or a memory-oriented curriculum in Canadian history. At the end of the year, the students are given a standardized test and the median score of each class is recorded. Please consider a regression model with these variables:

    The full regression model is E[Y|X] = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + β₄X₄ + β₅X₅.

    Give the reduced model you would use to answer each of the following questions.

    1. If you control for parents' education and income and for teacher's university background, does curriculum type affect test scores? (And why is it okay to use the word "affect"?)
    2. Controlling for parents' education and income and for curriculum type, is teacher's university background (two variables) related to their students' test performance?
    3. Controlling for teacher's university background and for curriculum type, are parents' education and income (considered simultaneously) related to students' test performance?
    4. Controlling for curriculum type, teacher's university background and parents' education, is parents' income related to students' test performance?
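
    As a reminder of how a full and a reduced model are compared, here is the general F statistic in generic notation. The symbols n (number of classes), p (number of β parameters in the full model, here 6) and s (number of β parameters set to zero by the null hypothesis) are notation introduced for this sketch and may differ from the text's.

    ```latex
    F = \frac{(SSR_{F} - SSR_{R})/s}{MSE_{F}}
      = \frac{(SSE_{R} - SSE_{F})/s}{SSE_{F}/(n-p)},
    \qquad df = (s,\; n-p)
    ```

    Purely as an illustration of the notation (nothing is assumed here about which variable carries which subscript): a null hypothesis β₄ = β₅ = 0 has s = 2 and corresponds to the reduced model E[Y|X] = β₀ + β₁X₁ + β₂X₂ + β₃X₃.
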
  4. For the Trees data, you did a regression predicting volume from two independent variables: height and diameter. Continuing with this, please use proc glm's estimate statement to predict the volume of lumber obtained from a tree 10 inches in diameter and 50 feet tall. Your answer is a single number.
  5. Predict the volume of lumber obtained from a hypothetical tree whose height and diameter are exactly equal to the respective sample means. Compare your answer to the sample mean of the dependent variable.
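
    A minimal sketch of how an estimate statement can produce a prediction like the ones in items 4 and 5. The data set name trees and the variable names volume and height are assumptions (only diameter appears as a variable name in this handout); substitute whatever your earlier program used.

    ```sas
    /* Assumption: the data set trees already exists from your earlier
       program, with variables volume, diameter and height. */
    proc glm data=trees;
        model volume = diameter height;
        /* The number after each term multiplies its coefficient, so this
           estimates b0*1 + b1*10 + b2*50. */
        estimate 'Tree with diameter 10, height 50'
                 intercept 1 diameter 10 height 50;
    run;
    quit;
    ```

    For a tree at the sample means, the same statement should work with 10 and 50 replaced by the means of diameter and height, which proc means will report.
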
  6. Run another regression in which you save the residuals. Now use proc plot to make a scatter diagram with residual on the Y axis and diameter on the X axis. Using a pencil and perhaps a bit of imagination, draw a U-shaped curve through the points. It looks like the residuals (what's left over from the regression) have a curvy relationship to diameter, suggesting that the straight-line model for diameter may not be quite right. This is how you can use residual plots to detect curvilinear relationships when there are too many independent variables to look at a full scatterplot.
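
    The residual plot just described can be produced along these lines. This is a sketch under assumptions: the data set name trees, the output data set name resdata, and the variable names volume, height and resid are hypothetical choices, not names given in this handout.

    ```sas
    /* Assumption: trees holds volume, diameter and height. */
    proc glm data=trees;
        model volume = diameter height;
        /* Save the residuals in a new data set alongside the original variables. */
        output out=resdata residual=resid;
    run;
    quit;

    proc plot data=resdata;
        /* plot vertical*horizontal: residual on the Y axis, diameter on the X axis. */
        plot resid * diameter;
    run;
    ```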

    That U-shaped curve could be a parabola, opening upwards. A parabola is Y = X², so maybe a diameter-squared (polynomial) term in the regression would be helpful. Actually, there is a physical basis for this. If those trees were cylinders, their volume would be π r² × height. Since the radius r is half the diameter, this would make the volume proportional to diameter-squared. I know Black Cherry Trees are not cylinders, but still it's enough motivation to consider a quadratic term.
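
    The cylinder argument can be written out in one line. Here V is volume, h is height, r is radius and d is diameter; the second line is a side note that assumes diameter is recorded in inches and height in feet, as the wording of item 4 suggests, with volume in cubic feet.

    ```latex
    V = \pi r^{2} h = \pi \left(\frac{d}{2}\right)^{2} h = \frac{\pi}{4}\, d^{2} h
    \\[4pt]
    V_{\text{ft}^3} = \frac{\pi}{4} \left(\frac{d_{\text{in}}}{12}\right)^{2} h_{\text{ft}}
                    = \frac{\pi}{576}\, d_{\text{in}}^{2}\, h_{\text{ft}}
    ```

    Either way, for a fixed height the volume of a cylinder is a constant times d², which is the motivation for the quadratic term.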

  7. So, go back to the data step and calculate the square of diameter. Feel free to use my program statement: d2 = diameter**2; Now run another regression that predicts volume from height, diameter and d2.
    1. When you control for height and diameter, does diameter-squared make a statistically significant contribution to the prediction of volume?
    2. When you control for diameter and diameter-squared, does height make a statistically significant contribution to the prediction of volume?
    3. What about diameter when you control for height and diameter-squared?
    4. What proportion of the remaining variation is explained by diameter when you control for the other two variables? The answer is a number. Please do it with proc iml.
    5. What proportion of the remaining variation is explained by diameter-squared when you control for the other two variables? My answer is 0.56. It's huge! (Relative to what was still left to be explained.) Please do it with proc iml.
    6. What proportion of the remaining variation is explained by height when you control for the other two variables? The answer is a number. Please do it with proc iml.
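
One way item 7 might be organized is sketched below. Everything here is hedged: the data set names trees and trees2 and the variable names volume and height are assumptions, the values of n and tstat are placeholders to be replaced with numbers from your own output, and the formula a = sF/(n - p + sF) is a standard algebraic identity for the proportion of remaining variation, which may be written differently in the text. In it, p is the number of β parameters in the full model (4 here, counting the intercept) and s is the number being tested (1 for a single variable, in which case F is the square of the t statistic).

```sas
/* Assumption: trees holds volume, diameter and height. The handout suggests
   adding the d2 statement to your original data step; a second data step
   that reads the existing data set is used here only to keep the sketch short. */
data trees2;
    set trees;
    d2 = diameter**2;          /* the handout's program statement */
run;

proc glm data=trees2;
    model volume = height diameter d2;
run;
quit;

proc iml;
    n     = 31;                /* PLACEHOLDER: use the number of observations in your output */
    p     = 4;                 /* betas in the full model: intercept, height, diameter, d2 */
    s     = 1;                 /* one variable tested at a time */
    tstat = 2.5;               /* PLACEHOLDER: t statistic for the variable of interest */
    F     = tstat##2;          /* for a single coefficient, F equals t squared */
    a     = s*F / (n - p + s*F);   /* proportion of remaining variation explained */
    print a;
quit;
```

Repeating the proc iml step with the t statistic for each of the three variables in turn gives the three proportions asked for.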

Please bring your log file and list file to the quiz.