Assignment 3

Statistical Consulting Assignment Three

Quiz at 10:10 a.m. on Thursday Oct. 9

The file mcars2.data has four variables: country of origin, fuel efficiency in kilometers per litre, length in millimeters, and weight in kilograms. Your client gives you these data and asks two questions.

Which country produces the most fuel efficient cars?
Once you allow for size of car, which country produces the most fuel efficient cars?

The typical request from a client is much less clear than this, so please think a bit about what you'd do. Then read the next few paragraphs for some guidance you'd never get in practice.

Of course you're going to give the client information about differences in average (mean) fuel efficiency, even though that was not requested explicitly. Furthermore, though an unsophisticated client might be content with sample means for the first question, you're going to give her conclusions based on hypothesis tests -- because there's random noise in any data set, and hypothesis tests help to filter it out. A general rule is to use the simplest and best-known technique that's appropriate, so the tool will be some kind of regression or analysis of variance.

You'll test for differences among the three country means, and then if that test is statistically significant at α = 0.05, follow up with Bonferroni-corrected pairwise comparisons. The Bonferroni correction is not the only good one, but it's simple and I want us all to be doing the same thing so I can check your answers.
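In case the arithmetic is useful, here is how the Bonferroni correction works out for three groups. The symbols k (number of pairwise comparisons) and p_j (unadjusted p-value of comparison j) are notation introduced just for this note.

```latex
k = \binom{3}{2} = 3, \qquad
p_j^{\mathrm{Bonf}} = \min(1,\; k\,p_j), \qquad
p_j^{\mathrm{Bonf}} < 0.05 \iff p_j < \frac{0.05}{3} \approx 0.0167
```

So multiplying each raw p-value by 3 and comparing to 0.05 gives the same decision as comparing the raw p-value to 0.0167.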

For the second question, you'll use some kind of regression or analysis of covariance to control for weight and length. Again, if the initial test is significant you'll follow up with Bonferroni-corrected pairwise comparisons using a joint 0.05 significance level.

Naturally, before you do all this you'll run some basic descriptive statistics and look at them! This will help ensure that your main analyses are not an example of "garbage in, garbage out." I don't want to see this part; don't hand it in. I suggest writing a separate SAS program to do this.

You'll notice that the data file has a first line that tells you what the variables are. This is handy, but SAS can't use it (R can). You must either delete the line or skip it with the firstobs option on the infile statement. My infile statement was infile 'mcars2.dat' firstobs=2; Also, to read a character-valued variable, follow the variable name with a dollar sign ($) in the input statement.
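Put together, a data step along these lines might look like the sketch below. The data set name cars and the variable names kpl, lenmm and wtkg are made up for illustration, and the sketch assumes the columns appear in the file in the order described above; check your copy of the file before relying on it.

```sas
/* Sketch only: the names cars, kpl, lenmm and wtkg are illustrative,   */
/* and the column order is assumed to match the description of the file. */
data cars;
    infile 'mcars2.dat' firstobs=2;   /* begin at line 2, skipping the header line */
    input country $ kpl lenmm wtkg;   /* the $ marks country as character-valued   */
run;

/* Print the first few cases to confirm the data were read as intended. */
proc print data=cars (obs=5);
run;
```

After running something like this, the log reports how many records were read and how many observations and variables the data set has, which is a quick check that the header line was skipped and nothing else was lost. With simple list input, a character variable gets a default length of 8, so longer country labels would be truncated.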

Please bring both your log file and your list file to the quiz. You might be asked to hand one or both of them in. Please make sure that the log and list files are from the same run, so if you have errors I can track them down. It is surprisingly easy to violate this rule, but please don't.

What might the quiz questions be like?

Numerical values of the test statistics (F and t values), p-values, Bonferroni-adjusted p-values, sample means, etc. These numbers will be on your list file.
Brief answers to the client's questions, in language a non-statistician might understand. You will base conclusions only on hypothesis tests in which the null hypothesis is rejected at α = 0.05, but you will avoid formal statistical terminology at all costs. I am serious about this. I will deduct marks for the use of statistical terms that someone with a high school education would not understand.
Name the independent variable(s) and dependent variable(s). Is each variable quantitative or categorical? Is this an experimental study, or is it observational?

There's a lot more you could do for the client, like residual plots, tests for parallel slopes, curvilinear relationships and the like. But we'll stick to the basics for now.

Warning: This is not a group project. You are expected to do most or all of the work independently. It is okay to discuss general principles with each other, but you should not look at the SAS code of any other student, or allow any other student to look at your code. It is okay to compare numerical answers, and compare SAS output. It is not okay to copy, or to allow your work to be copied.