Assignment 7

Assignment Seven: Quiz on Friday March 8th in tutorial

This assignment is on regression with normal error terms. It is based on Lecture slide sets 9 and 10, and material in Chapter 5 of the online text. The following formulas will be provided with the quiz (and the final exam), whether they are needed or not:

Telephone sales representatives use computer software to help them locate potential customers, answer questions, take credit card information and place orders. Twelve sales representatives were randomly assigned to each of three new software packages the company was thinking of purchasing. The data for each sales representative include the software package (1, 2 or 3), sales last quarter with the old software, and sales this quarter with one of the new software packages. Sales are in number of units sold. The data are available in an Excel spreadsheet at
http://www.utstat.toronto.edu/brunner/data/legal/sales.data.xlsx
1. Using proc means, obtain the three sample means of sales this quarter.
2. Using proc reg and indicators with no intercept (cell means coding), fit a model in which sales last quarter is ignored. This is very different from controlling for it. We want to know whether software package has any effect on sales. Why is it okay to use the word "effect?"
  1. Write E(y|x) in Greek letters.
  2. Make a table showing how your dummy variables are set up, or at least be able to make one on the quiz if asked.
  3. True or False: The estimated regression coefficients b_j are just sample means.
  4. What is the null hypothesis for testing whether software package has any effect on sales? Give the answer in terms of Greek letters from the regression model.
  5. Give the test statistic. The answer is a number from your printout.
  6. Give the p-value. The answer is a number from your printout. The p-value is not the same as the test statistic.
  7. Do you reject H₀? Answer Yes or No.
  8. Are the results statistically significant at the 0.05 level? Answer Yes or No.
  9. Give the p-value for each pairwise comparison of software packages: That's 1 vs. 2, 1 vs. 3 and 2 vs. 3. You don't need to bother with any kind of correction for multiple testing this time.
  10. In plain, non-statistical language, what do you conclude from this analysis?
3. Now, still using cell means coding (indicators with no intercept), fit a model with software package and sales last quarter as the explanatory variables, and sales this quarter as the response variable. There are no interaction terms yet.
  1. Write E(y|x) in Greek letters. Make sure the variables are in the same order here and in your SAS program.
  2. Make a table showing how your dummy variables are set up. There should be one row for each software package, and a column for each dummy variable. Add a wider column on the right, in which you show E(y|x) for each software package. The symbols for your dummy variables will not appear in this column.
  3. What is the null hypothesis for testing whether software package has any effect on sales this quarter once you control for sales last quarter? Give the answer in terms of Greek letters from the regression model.
  4. Give the test statistic. The answer is a number from your printout.
  5. Give the p-value. The answer is a number from your printout.
  6. Do you reject H₀? Answer Yes or No.
  7. Are the results statistically significant at the 0.05 level? Answer Yes or No.
  8. Give the p-value for each pairwise comparison of software packages. You don't need to bother with any correction for multiple testing.
  9. What proportion of the remaining variation in sales this quarter is explained by software package once you allow for sales last quarter? The answer is a number that you calculate from the numbers on your printout. Bring a calculator to the quiz.
  10. In plain, non-statistical language, what do you conclude from this analysis?
4. The next task is to fit a model in which the slopes as well as the intercepts might be different for the three software packages. Before doing that, use proc sgplot to produce a scatterplot with the three regression lines. There is an example in the analysis of the Cars data (SAS Example 5).
5. Please fit the unequal slopes model in two ways: (1) With an intercept and using two indicator dummy variables, and (2) With no intercept and using three indicator dummy variables (cell means coding). The test statistics and p-values for the two models will be identical, verifying that you are doing everything right.
  1. Write E(y|x) in Greek letters. Make sure the variables are in the same order here and in your SAS program.
  2. Make a table showing how your dummy variables are set up. There should be one row for each software package, and a column for each dummy variable. Add a wider column on the right, in which you show E(y|x) for each software package. The symbols for your dummy variables will not appear in this column.
  3. What is the null hypothesis for testing whether the three slopes are equal? Give the answer in terms of Greek letters from the regression model.
  4. What is the null hypothesis for testing whether the effect of software package on sales this quarter depends on sales last quarter? Give the answer in terms of Greek letters from the regression model.
  5. Give the test statistic. The answer is a number from your printout.
  6. Give the p-value. The answer is a number from your printout.
  7. Do you reject H₀? Answer Yes or No.
  8. Are the results statistically significant at the 0.05 level? Answer Yes or No.
  9. What proportion of the remaining variation in sales this quarter is explained by unequal slopes once you allow for the other explanatory variables in the model? The answer is a number that you calculate from the numbers on your printout.
  10. Estimate the slope of the line relating sales last quarter to sales this quarter, for software package 1. The answer may be directly on your printout, or it may be a number that you calculate from the numbers on your printout.
  11. Estimate the slope of the line relating sales last quarter to sales this quarter, for software package 2. The answer may be directly on your printout, or it may be a number that you calculate from the numbers on your printout.
  12. Estimate the slope of the line relating sales last quarter to sales this quarter, for software package 3. The answer may be directly on your printout, or it may be a number that you calculate from the numbers on your printout.
  13. Is the slope of the line for software package 2 significantly different from zero at the 0.05 level? The test you want may be directly on your printout, or you may have to carry out a custom test.
  14. Which of the three estimated slopes are significantly different from each other at the 0.05 level? You are summarizing the results of three tests. Use a Bonferroni correction for three tests.
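Several questions above ask for the proportion of remaining variation explained by a set of variables, computed by calculator from the printout. The sketch below shows that arithmetic, assuming the usual identity a = rF / (n - p + rF), where F is the test statistic, r is its numerator degrees of freedom, and n - p is the error degrees of freedom of the full model. The function names and the value F = 4.0 are hypothetical, chosen only for illustration. The error degrees of freedom of 32 correspond to 36 sales representatives and four regression coefficients in the equal slopes model.

```python
def proportion_remaining_variation(f_stat, num_df, error_df):
    """Proportion of remaining variation explained by the tested variables.

    Uses a = r*F / (error_df + r*F), where r = num_df is the number of
    restrictions tested and error_df = n - p comes from the full model.
    """
    if f_stat < 0 or num_df <= 0 or error_df <= 0:
        raise ValueError("F must be non-negative and both df must be positive")
    return num_df * f_stat / (error_df + num_df * f_stat)


def proportion_from_t(t_stat, error_df):
    """Same quantity for a single-coefficient t-test, using F = t squared."""
    return proportion_remaining_variation(t_stat ** 2, 1, error_df)


# Hypothetical printout values: F = 4.0 on 2 and 32 degrees of freedom.
a = proportion_remaining_variation(f_stat=4.0, num_df=2, error_df=32)
print(round(a, 4))  # 8 / (32 + 8) = 0.2

# Hypothetical t statistic of 2.0 with 32 error degrees of freedom.
print(round(proportion_from_t(2.0, 32), 4))  # 4 / 36 = 0.1111
```

The same calculation applies to any F-test or t-test on the printout; only the three inputs change.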
In the job satisfaction and diversity study, employees at Canadian corporations filled out questionnaires about their jobs. Questionnaires employed 5-point scales, where 5 indicates the highest level of the trait or opinion being assessed (like job satisfaction) and 1 indicates the lowest level. The wording of the questions was varied so that sometimes a 1 indicated higher satisfaction (for example, strong disagreement with "I hate my job."), but the numbers were switched around so that in the data file, larger numbers always indicate more. Data consist of answers to
- Ten questions about commitment (loyalty) to the organization, with higher numbers indicating more commitment.
- Five questions about relations with colleagues at work, with higher numbers indicating better relations.
- Twelve questions about relations with management, in particular the respondent's immediate boss. Higher numbers indicate better relations.
- Six questions about fair opportunities for advancement, with higher numbers indicating more fairness.
- Four questions about job satisfaction, with higher numbers indicating more satisfaction.
- Three questions about senior management's commitment to diversity, with higher numbers indicating more commitment.
- Gender: 0=Male, 1=Female
- Visible Minority status: 0=No, 1=Yes. This term is ambiguous, especially in Toronto. I believe it is intended to be a polite way to say whether you look European or not.
- Education level, numbered 1-7. The exact meanings of the numbers are unknown, but surely higher numbers must indicate more education, at least mostly.
- Marital status: 1=never married, 2=married, 3=divorced or separated, 4=widowed. This is a guess, but I'm fairly confident.
- Age in years
- Born outside Canada: 0=No, 1=Yes
There are two data sets, an exploratory sample and a replication sample. The exploratory data are available in the Excel spreadsheet DiversityExplore.xlsx. The replication data are in DiversityReplic.xlsx. These are real data. They are professionally cleaned, but don't take anything for granted. Open the spreadsheet and look at it in Excel (or in Open Office, if you refuse to use Microsoft products). Do you see anything unusual? We will get back to this later.
1. Read the exploratory data and take a look at frequency distributions of everything. I see something unexpected about the questions on senior management's commitment to diversity; do you? Please do not include these frequency distributions in the results file you bring to the quiz. My proc freq is commented out.
2. Create the following variables by adding the scores on individual questionnaire items. We will treat these as quantitative.
  - Commitment to the organization
  - Relations with colleagues at work
  - Relations with management
  - Fair opportunities for advancement
  - Job satisfaction
  - Senior management's commitment to diversity
  Some handy SAS syntax for this is sum(of com1-com10).
3. We are also going to treat age (and even education, in some analyses) as quantitative. The problem is that, as you noticed when you looked at the Excel spreadsheet, missing values were coded as blank. This is quite natural. But it causes Excel to make any column with missing values character-valued rather than numeric, and then in proc import, SAS believes Excel. This is also quite natural, but if age is character-valued, you can't do numerical calculations on it. You need to convert age (and maybe other variables) to numeric. I didn't remember how to do this, so I searched for "sas convert character to numeric". This was pretty easy, and good to know.
4. Make a new variable, an indicator for whether the person is married or not. Not married includes separated, because we have no choice. Of course it should be numeric so you can use it in a regression. Be careful about missing values.
5. Make frequency distributions of the categorical variables, not including the individual questionnaire items you added up. Also, use proc means to get basic statistics on the quantitative variables. The quantitative variables are the ones you created by adding up questionnaire items, plus age and education. Education is both categorical and quantitative. You should be able to answer questions like
  - How many respondents were never married?
  - What percentage were born outside Canada?
  - What is mean job satisfaction? (My answer is 14.788)
  - How many missing values were there for education?
  Variable labels and proc format are not required, but they are a good idea. If you cannot remember what the variables are or what the variable values mean, nobody is going to tell you during the quiz.
6. Out of the many possible analyses involving the categorical variables, test the association between sex and whether or not the person is married. More than one test statistic is produced by default. We'll take the first one, the common Pearson chi-squared test of independence; the formula was given in the first slide show, but you won't need the formula. If these results are significant, you will be able to draw a directional conclusion by comparing the percent married for men and women.
7. Use proc corr to obtain a correlation matrix of the quantitative variables, and be able to interpret any correlation that is significant. If you have not seen proc corr yet in lecture, there is an example in Chapter Two or you can look it up online.
8. Using proc means, obtain mean education for minority and non-minority respondents. Test for significance using either proc glm or proc reg (or both if you really feel like it), ignoring all other potential explanatory variables. What proportion of the variation in education is explained by visible minority status? Describe the results in plain, non-statistical language.
9. Now run a big regression in which job satisfaction is the response variable, and the explanatory variables are Relations with colleagues at work, Relations with management, Fair opportunities for advancement, Senior management's commitment to diversity, Sex, Visible minority status, Education (treated as quantitative), Whether or not the respondent is married, Age, and Whether or not the respondent was born outside Canada. After looking at the t statistics, carry out a single test of the variables that were non-significant. This means run the regression once, and then add a test statement to the code and run it again.
10. What proportion of the remaining variation is explained by the non-significant variables? You can do it with proc iml or with a calculator, but in any case it's something you should be able to do with a calculator for any F-test or t-test on your printout.
11. Now fit a model with just the four significant explanatory variables. Be ready to interpret the results of each test, calculate proportions of remaining variation, and so on.
12. Why is the following conclusion unjustified? Allowing for relations with management, perceived fairness in opportunities for advancement and visible minority status, good relations with colleagues at work lead to more job satisfaction.
13. I also tried diagnostics on the diversity data, but the results were not very clear and in the end I deleted the homework questions.
14. The delicious phrase "Bonferroni-corrected cross-validation" means that after an exploratory analysis has identified a collection of conclusions backed up by significant results, you take a second, independent sample from the same population and test only the hypotheses that supported your final conclusions in the exploratory analysis. You protect this replication set at the joint 0.05 significance level with a Bonferroni correction. That is, you will say that a result is replicated (so you actually believe it) if p < 0.05/k on the replication data, where k is the number of hypotheses tested in the replication.
  Using DiversityReplic.xlsx, carry out this procedure on the tests for the four variables that worked in the exploration phase. Do this in the same SAS program as the rest of the assignment. What are your final conclusions?
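The replication rule in the last question is a simple comparison of each p-value with 0.05/k. As a minimal sketch, assuming k = 4 tests (one for each variable kept after the exploratory phase), the check could look like this. The function name, the dictionary keys and all four p-values are hypothetical placeholders, not results from the diversity data.

```python
def bonferroni_replicated(p_values, joint_alpha=0.05):
    """Map each test name to True if its p-value is below joint_alpha / k.

    p_values: dict of test name -> p-value from the replication sample.
    k is the number of tests carried out on the replication data.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("at least one p-value is required")
    cutoff = joint_alpha / k
    return {name: p < cutoff for name, p in p_values.items()}


# Hypothetical replication p-values, for illustration only.
replication_p = {
    "relations with colleagues": 0.0004,
    "relations with management": 0.0090,
    "fair opportunities": 0.0300,
    "visible minority status": 0.2000,
}

for name, replicated in bonferroni_replicated(replication_p).items():
    print(name, "replicated" if replicated else "not replicated")
```

With four tests the cutoff is 0.05/4 = 0.0125, so in this made-up example a p-value of 0.03 would not count as replicated even though it is below 0.05.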

Please bring your log and results files from both questions to the quiz. Do not write anything on the printouts in advance except your name and student number. You may be asked to hand them in. The log and results files must be generated by the same SAS program. There must be no errors or warnings in your log files. Bring a calculator to the quiz. As usual, you are not allowed to write conclusions on your printouts in advance, or otherwise cause them to appear on your printouts.

This assignment is licensed under a Creative Commons Attribution-ShareAlike 3.0 (or later) Unported License. Use and share it freely.