STA441 Assignment 4
Quiz on Friday Feb. 9th in tutorial
This assignment is on permutation and randomization tests, discussed in Chapter 11 of the online text. For each question below, please bring a hard copy of your log file (not your program file) and your results file to the quiz. That's two log files and two results files.
You may be asked to answer questions about them, and you may be asked to hand them in with the quiz.
Note that each log file and corresponding results file must be produced by the same run of SAS. If they are not, you will get zero for all questions about the data set involved.
- The Diet Data are from a study of people trying to lose weight. The data are available in an Excel Spreadsheet. Variables are
- Person: Identification code
- gender: F or M
- Age: In years
- Height: In cm.
- pre.weight: Weight in kg. before starting the diet.
- Diet: 1, 2, or 3, randomly assigned
- weight6weeks: Weight in kg. after 6 weeks on the diet
The diet data are from a University of Sheffield website, and I assume they were intended to be shared.
- Using SAS (of course), calculate a new variable representing weight loss. Produce frequency distributions for the categorical variables, and display the mean, standard deviation, minimum and maximum for the quantitative variables.
- Run proc means again, just on your weight loss variable. This time request t and probt, as you did with the furnace data. Is there evidence of average weight loss? Are you comfortable with a causal explanation? What are some potential confounding variables?
- By including the line class diet; you can get separate tests for each diet. What do you conclude? Use plain language.
- This assignment is about permutation and randomization tests, but for comparison, carry out a standard one-way ANOVA in which the explanatory variable is diet, and the response variable is weight loss. Follow it up with pairwise comparisons of means. Since the sample sizes are nearly equal, Tukey comparisons are likely to be most powerful, so do that. What do you conclude? Use plain language.
- proc npar1way with the wilcoxon option will give you a classical Kruskal-Wallis non-parametric one-way analysis of variance by ranks. Go ahead and give it a try. Did diet have an effect? Why can I say "effect?"
- Follow up that last analysis with Bonferroni-protected pairwise comparisons. You will be doing two-sample rank tests, in which the Kruskal-Wallis test becomes a Mann-Whitney U test. Note that this is not the way proc npar1way does pairwise comparisons by default. The advantage of our way is that you know what you're doing. For each comparison, you'll get a Wilcoxon Two-Sample Test as well as the Kruskal-Wallis test. Pay attention to the Kruskal-Wallis. Convert p-values to Bonferroni-corrected p-values. On the quiz, a calculator could be helpful if you have to do this. A hint, if you need one, is the word where.
- The following is not a SAS question, for a change. Those pairwise tests are based on a large-sample chi-square approximation. The exact sub-command without mc would yield exact permutation tests. Genuine permutation tests are attractive, but you have to watch out. Consider the comparison of Diet 1 and Diet 2. Twenty-four participants were assigned to Diet 1 and 27 were assigned to Diet 2.
- So, the number of ways to allocate the weight loss numbers to Diet 1 and Diet 2 is 51-choose-24. What is this number? You don't have to use SAS.
- You would need to calculate a t-statistic for each re-arrangement of the data. Suppose you could compute 100 t-statistics per second. How many years would it take?
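Since the two counting questions above do not have to be done in SAS, the arithmetic can be sketched in a few lines of Python. This is only an illustration of the calculation: the rate of 100 t-statistics per second comes from the question, while the 365.25-day year and all variable names are assumptions made here.

```python
import math

# Number of ways to choose which 24 of the 51 weight-loss values
# are assigned to Diet 1 (the remaining 27 go to Diet 2).
n_arrangements = math.comb(51, 24)

# Rate given in the question: 100 t-statistics per second.
stats_per_second = 100

# Assumption: one year is 365.25 days.
seconds_per_year = 365.25 * 24 * 60 * 60

years = n_arrangements / stats_per_second / seconds_per_year

print(f"51-choose-24 = {n_arrangements:,}")
print(f"About {years:,.0f} years at {stats_per_second} statistics per second")
```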
- Finally we get to the main non-parametric analysis of these data. Using proc npar1way, do a randomization test using the raw data rather than ranks. The null hypothesis is that Diet had no effect.
- Using proc multtest, follow up the last test with multiple comparisons, using a permutation adjustment of the p-values. In plain, non-statistical language, what do you conclude? Are your conclusions different from what you obtained from the usual one-way ANOVA?
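The logic of a randomization test on raw data, as in the proc npar1way question above, can be sketched outside SAS. The Python below is a minimal illustration, not what SAS does internally: it shuffles the pooled values among groups of the original sizes, recomputes the one-way F statistic each time, and estimates the p-value as the proportion of shuffles with F at least as large as the observed one. The function names, the number of shuffles, the seed, the "add one" p-value convention, and the example numbers are all assumptions; the numbers are made up and are not the diet data.

```python
import random

def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of raw values."""
    all_values = [v for g in groups for v in g]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    if ss_within == 0:
        return float("inf")  # groups perfectly separated, no within-group spread
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def randomization_test(groups, n_sim=2000, seed=441):
    """Monte Carlo randomization test of 'group membership has no effect'."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    count = 0
    for _ in range(n_sim):
        rng.shuffle(pooled)
        # Deal the shuffled values back into groups of the original sizes.
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed - 1e-12:
            count += 1
    # One common convention: count the observed arrangement as one of the shuffles.
    return observed, (count + 1) / (n_sim + 1)

# Made-up illustrative values for three groups (NOT the diet data).
example_groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
f_obs, p_value = randomization_test(example_groups)
print(f"F = {f_obs:.2f}, estimated p = {p_value:.4f}")
```

Because the p-value is estimated from random shuffles, it changes slightly with the seed and the number of shuffles.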
- This question is based on the Furnace data from Assignment 2. Please do it with a separate SAS program -- that is, not just a continuation of the program for Question 1.
- Create a new variable in which square and rectangular chimneys are combined into a single category. Make a cross-tabulation of this variable with house type (all 5 categories). In the cells of the table, display the observed frequencies, expected frequencies, and either row or column percentages. You should be able to directly read off the percentage of round chimneys for each house type. Do a traditional chi-squared test of independence. My lowest expected frequency is 1.3146, so the traditional test might be okay despite the warnings from SAS. Does there appear to be a relationship between house type and chimney shape? Answer Yes or No.
- There is actually some doubt about the traditional test, because it is based on a large-sample chi-squared distribution, and some of those observed frequencies are pretty small. So we will go to exact tests. Do the exact test, not a randomization test. There should be no problem with computation time. What do you conclude?
- That last test can be understood as a test for differences among 5 percentages -- say, percentage of houses with round chimneys. We want to determine where the departure from equality comes from, using pairwise comparisons of the percentages. Do 10 two-sided Fisher's exact tests and apply a Bonferroni correction. Use where. It's not pretty, but you can copy-paste. What do you conclude? Use plain, non-statistical language.
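To make the arithmetic behind the pairwise comparisons above concrete, here is a rough Python sketch of a two-sided Fisher's exact test for a single 2x2 table, followed by a Bonferroni correction for the 10 comparisons among 5 house types. It uses the "sum the probabilities of all tables no more likely than the observed one" definition of the two-sided p-value. The function names and the example table are assumptions made here; the table is made up and is not from the furnace data.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Adds up the hypergeometric probabilities of every table with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, given the margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    total = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)  # small tolerance for rounding
    )
    return min(1.0, total)

def bonferroni(p, n_tests):
    """Bonferroni-corrected p-value: multiply by the number of tests, cap at 1."""
    return min(1.0, p * n_tests)

n_pairs = comb(5, 2)  # 10 pairwise comparisons among 5 house types

# Made-up 2x2 table (NOT the furnace data).
p_raw = fisher_exact_two_sided(3, 1, 1, 3)
print(f"raw p = {p_raw:.4f}, Bonferroni-corrected p = {bonferroni(p_raw, n_pairs):.4f}")
```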
- In Assignment 2, you created a difference variable, described as "the difference between energy consumption with vent damper in and vent damper out." Actually, we want this variable to represent energy savings associated with using the damper. The damper is active when it's "in," and inactive when it's "out." Energy consumption should generally be less when the damper is active. Anyway, make sure your difference variable represents energy savings.
Let's check one correlation, the correlation between house age and energy savings from the damper.
- Using proc univariate (see SAS Example Two), test whether house age and your energy savings variable have (univariate) normal distributions. Are you comfortable with the assumption of normality? If not, the usual p-value for the Pearson correlation may be questionable.
- Using proc corr, calculate the Spearman rank correlation and the usual test. What can you conclude?
- Weirdly, the p-value is based on the assumption that the data have a bivariate normal distribution before ranking. If this were true, why convert the data to ranks? If it's false ... well, let's do a permutation test. More exactly, please do a randomization test of the Spearman correlation. In plain, non-statistical language, what do you conclude?
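The idea of a randomization test for the Spearman correlation can be sketched in Python as follows. This is a hedged illustration rather than the SAS solution: rank both variables, compute the correlation of the ranks, then repeatedly shuffle one set of ranks and see how often the shuffled correlation is at least as large in absolute value as the observed one. The function names, the number of shuffles, the seed, and the example data are assumptions; the data are made up and are not the furnace data. The sketch also assumes neither variable is constant.

```python
import random

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman_randomization_test(x, y, n_sim=2000, seed=441):
    """Two-sided Monte Carlo randomization test of the Spearman correlation."""
    rng = random.Random(seed)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    observed = pearson(rank_x, rank_y)
    shuffled = rank_y[:]
    count = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)  # break any link between the two variables
        if abs(pearson(rank_x, shuffled)) >= abs(observed) - 1e-12:
            count += 1
    return observed, (count + 1) / (n_sim + 1)

# Made-up illustrative data (NOT the furnace data).
x = list(range(1, 11))
y = [v ** 2 for v in x]
rho, p_value = spearman_randomization_test(x, y)
print(f"Spearman rho = {rho:.3f}, estimated p = {p_value:.4f}")
```

Unlike the usual test, this p-value does not rely on a bivariate normal assumption; it only uses the fact that, if the variables are unrelated, every pairing of the ranks is equally likely.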
This assignment was prepared by Jerry Brunner, Department of Mathematical and Computational Sciences, University of Toronto Mississauga. It is licensed under a Creative Commons Attribution-ShareAlike 3.0 (or later) Unported License. Use and share it freely. The data are also open.