STA429/1007 F 2004 Handout 5

Regression with Categorical Independent Variables (Shorter version)

First, a piece of mathread.sas showing the format for apparent nationality.

------------------------------------------------------------------------------------------
value natfmt
      1 = 'Chinese'
      2 = 'Japanese'
      3 = 'Korean'
      4 = 'Vietnamese'
      5 = 'Other Asian'
      6 = 'Eastern European'
      7 = 'Hispanic'
      8 = 'English-speaking'
      9 = 'French'
     10 = 'Italian'
     11 = 'Greek'
     12 = 'Germanic'
     13 = 'Other European'
     14 = 'Middle-Eastern'
     15 = 'Pakistani'
     16 = 'East Indian'
     17 = 'Sub-Saharan'
     18 = 'OTHER' ;
------------------------------------------------------------------------------------------

/********************** mathcat1.sas **********************/
title 'Regression with Categorical IV: Math data';
options linesize=79 pagesize=2000 noovp formdlim='_';
libname math '/homes/students/u0/stats/brunner/mathlib';
     /* Full path to permanent SAS datasets */
libname library '/homes/students/u0/stats/brunner/mathlib';
     /* SAS will search for permanently stored formats ONLY in a place
        called "library." */

proc freq data=math.explore;
     tables nation1 nation2;

proc glm data=math.explore;
     class nation2;
     model grade=nation2;
     means nation2;

proc format;
     value ethfmt
          1 = 'Chinese'
          2 = 'Other Asian'
          3 = 'Eastern European'
          4 = 'English-speaking'
          5 = 'Other European'
          6 = 'Middle Eastern'
          7 = 'East Indian'
          8 = 'Other';

data ecat;
     set math.explore;
     n1 = nation1;
     if n1=. then ethnic1=.;
        else if n1=1 then ethnic1 = 1;                  /* Chinese */
        else if 2 <= n1 <= 5 then ethnic1 = 2;          /* Other Asian */
        else if n1=6 then ethnic1 = 3;                  /* Eastern European */
        else if n1=8 then ethnic1 = 4;                  /* English-speaking */
        else if n1=7 or 9 <= n1 <= 13 then ethnic1 = 5; /* Other European */
        else if n1=14 then ethnic1 = 6;                 /* Middle Eastern */
        else if n1=16 then ethnic1 = 7;                 /* East Indian */
        else ethnic1=8;                                 /* Other */
     label ethnic1 = 'Ethnicity of Name Acc to Rater1';
     n2 = nation2;
     if n2=. then ethnic2=.;
        else if n2=1 then ethnic2 = 1;                  /* Chinese */
        else if 2 <= n2 <= 5 then ethnic2 = 2;          /* Other Asian */
        else if n2=6 then ethnic2 = 3;                  /* Eastern European */
        else if n2=8 then ethnic2 = 4;                  /* English-speaking */
        else if n2=7 or 9 <= n2 <= 13 then ethnic2 = 5; /* Other European */
        else if n2=14 then ethnic2 = 6;                 /* Middle Eastern */
        else if n2=16 then ethnic2 = 7;                 /* East Indian */
        else ethnic2=8;                                 /* Other */
     label ethnic2 = 'Ethnicity of Name Acc to Rater2';
     format ethnic1 ethnic2 ethfmt.;
     /* I like rater1, but I like rater2 even more */
     /* Make all 8 dummy vars, but use only 7 if model has intercept. */
     if ethnic2=. then e1=.; else if ethnic2 = 1 then e1=1; else e1=0;
     if ethnic2=. then e2=.; else if ethnic2 = 2 then e2=1; else e2=0;
     if ethnic2=. then e3=.; else if ethnic2 = 3 then e3=1; else e3=0;
     if ethnic2=. then e4=.; else if ethnic2 = 4 then e4=1; else e4=0;
     if ethnic2=. then e5=.; else if ethnic2 = 5 then e5=1; else e5=0;
     if ethnic2=. then e6=.; else if ethnic2 = 6 then e6=1; else e6=0;
     if ethnic2=. then e7=.; else if ethnic2 = 7 then e7=1; else e7=0;
     if ethnic2=. then e8=.; else if ethnic2 = 8 then e8=1; else e8=0;

proc freq;
     tables nation1*ethnic1 nation2*ethnic2 / norow nocol nopercent missing;
     tables ethnic1*ethnic2;
     tables ethnic2*(e1-e8) / norow nocol nopercent missing;
     tables sex;

proc glm;
     class ethnic2;
     model grade=ethnic2;

proc reg;
     model grade=e1-e7;
     allsame: test e1=e2=e3=e4=e5=e6=e7=0;

proc reg;
     model grade=e1-e8 / noint;
     sameall: test e1=e2=e3=e4=e5=e6=e7=e8;

proc reg;
     title3 'Include Sex and Ethnicity in the regression';
     model grade = gpa hscalc precalc calc sex e2-e8;
     hschool: test gpa=hscalc=0;
     dtest:   test precalc=calc=0;
     compare: test precalc=calc;
     EthBack: test e2=e3=e4=e5=e6=e7=e8=0;

proc glm;
     class sex ethnic2;
     model grade = sex ethnic2 sex*ethnic2;
     means sex ethnic2 sex*ethnic2;

_______________________________________________________________________________

                    Nationality of name acc to rater1

                                               Cumulative    Cumulative
nation1             Frequency     Percent       Frequency       Percent
---------------------------------------------------------------------
Chinese                    73       13.30              73         13.30
Japanese                    2        0.36              75         13.66
Korean                     12        2.19              87         15.85
Vietnamese                 16        2.91             103         18.76
Other Asian                23        4.19             126         22.95
Eastern European           61       11.11             187         34.06
Hispanic                   35        6.38             222         40.44
English-speaking           96       17.49             318         57.92
French                      7        1.28             325         59.20
Italian                    22        4.01             347         63.21
Greek                       8        1.46             355         64.66
Germanic                   11        2.00             366         66.67
Other European             25        4.55             391         71.22
Middle-Eastern             64       11.66             455         82.88
Pakistani                   2        0.36             457         83.24
East Indian                72       13.11             529         96.36
Sub-Saharan                 2        0.36             531         96.72
OTHER                      18        3.28             549        100.00

                        Frequency Missing = 30

We did the proc glm just to get the means in this compact format.
The GLM Procedure

Level of                  ------------grade------------
nation2              N            Mean          Std Dev

Chinese             66      61.2727273       21.1823914
East Indian         57      64.3157895       19.7259419
Eastern European    48      56.0833333       19.9625536
English-speaking    73      56.6301370       17.2107096
French               9      52.0000000       20.4205779
Germanic            10      57.7000000       22.6374812
Greek                6      60.5000000        9.1815031
Hispanic            17      60.5294118       16.9893998
Italian             21      56.0000000       17.7341479
Japanese             1      36.0000000         .
Korean               4      71.7500000       17.5570499
Middle-Eastern      40      59.6750000       19.5505754
OTHER                5      56.4000000       11.6318528
Other Asian          6      45.6666667       12.5485723
Other European       7      47.4285714       26.2923128
Pakistani            1      50.0000000         .
Sub-Saharan          4      50.7500000        6.3966137
Vietnamese          10      58.4000000       21.8794678

Skipping some checks ...

                      Table of ethnic1 by ethnic2

ethnic1(Ethnicity of Name Acc to Rater1)
          ethnic2(Ethnicity of Name Acc to Rater2)

Frequency        |
Percent          |
Row Pct          |
Col Pct          |Chinese |Other   |Eastern |English-|  Total
                 |        |Asian   |European|speaking|
-----------------+--------+--------+--------+--------+
Chinese          |     71 |      2 |      0 |      0 |     73
                 |  12.93 |   0.36 |   0.00 |   0.00 |  13.30
                 |  97.26 |   2.74 |   0.00 |   0.00 |
                 |  73.96 |   5.71 |   0.00 |   0.00 |
-----------------+--------+--------+--------+--------+
Other Asian      |     15 |     25 |      2 |      1 |     53
                 |   2.73 |   4.55 |   0.36 |   0.18 |   9.65
                 |  28.30 |  47.17 |   3.77 |   1.89 |
                 |  15.63 |  71.43 |   3.08 |   1.05 |
-----------------+--------+--------+--------+--------+
Eastern European |      0 |      1 |     50 |      0 |     61
                 |   0.00 |   0.18 |   9.11 |   0.00 |  11.11
                 |   0.00 |   1.64 |  81.97 |   0.00 |
                 |   0.00 |   2.86 |  76.92 |   0.00 |
-----------------+--------+--------+--------+--------+
English-speaking |      4 |      4 |      1 |     79 |     96
                 |   0.73 |   0.73 |   0.18 |  14.39 |  17.49
                 |   4.17 |   4.17 |   1.04 |  82.29 |
                 |   4.17 |  11.43 |   1.54 |  83.16 |
-----------------+--------+--------+--------+--------+
Other European   |      0 |      1 |      7 |     14 |    108
                 |   0.00 |   0.18 |   1.28 |   2.55 |  19.67
                 |   0.00 |   0.93 |   6.48 |  12.96 |
                 |   0.00 |   2.86 |  10.77 |  14.74 |
-----------------+--------+--------+--------+--------+
Middle Eastern   |      0 |      0 |      2 |      0 |     64
                 |   0.00 |   0.00 |   0.36 |   0.00 |  11.66
                 |   0.00 |   0.00 |   3.13 |   0.00 |
                 |   0.00 |   0.00 |   3.08 |   0.00 |
-----------------+--------+--------+--------+--------+
East Indian      |      0 |      0 |      1 |      1 |     72
                 |   0.00 |   0.00 |   0.18 |   0.18 |  13.11
                 |   0.00 |   0.00 |   1.39 |   1.39 |
                 |   0.00 |   0.00 |   1.54 |   1.05 |
-----------------+--------+--------+--------+--------+
Other            |      6 |      2 |      2 |      0 |     22
                 |   1.09 |   0.36 |   0.36 |   0.00 |   4.01
                 |  27.27 |   9.09 |   9.09 |   0.00 |
                 |   6.25 |   5.71 |   3.08 |   0.00 |
-----------------+--------+--------+--------+--------+
Total                  96       35       65       95      549
                    17.49     6.38    11.84    17.30   100.00
(Continued)

                      Table of ethnic1 by ethnic2

ethnic1(Ethnicity of Name Acc to Rater1)
          ethnic2(Ethnicity of Name Acc to Rater2)

Frequency        |
Percent          |
Row Pct          |
Col Pct          |Other E |Middle  |East    |Other   |  Total
                 |uropean |Eastern |Indian  |        |
-----------------+--------+--------+--------+--------+
Chinese          |      0 |      0 |      0 |      0 |     73
                 |   0.00 |   0.00 |   0.00 |   0.00 |  13.30
                 |   0.00 |   0.00 |   0.00 |   0.00 |
                 |   0.00 |   0.00 |   0.00 |   0.00 |
-----------------+--------+--------+--------+--------+
Other Asian      |      1 |      2 |      5 |      2 |     53
                 |   0.18 |   0.36 |   0.91 |   0.36 |   9.65
                 |   1.89 |   3.77 |   9.43 |   3.77 |
                 |   0.99 |   3.51 |   5.81 |  14.29 |
-----------------+--------+--------+--------+--------+
Eastern European |      4 |      0 |      3 |      3 |     61
                 |   0.73 |   0.00 |   0.55 |   0.55 |  11.11
                 |   6.56 |   0.00 |   4.92 |   4.92 |
                 |   3.96 |   0.00 |   3.49 |  21.43 |
-----------------+--------+--------+--------+--------+
English-speaking |      7 |      0 |      0 |      1 |     96
                 |   1.28 |   0.00 |   0.00 |   0.18 |  17.49
                 |   7.29 |   0.00 |   0.00 |   1.04 |
                 |   6.93 |   0.00 |   0.00 |   7.14 |
-----------------+--------+--------+--------+--------+
Other European   |     83 |      1 |      1 |      1 |    108
                 |  15.12 |   0.18 |   0.18 |   0.18 |  19.67
                 |  76.85 |   0.93 |   0.93 |   0.93 |
                 |  82.18 |   1.75 |   1.16 |   7.14 |
-----------------+--------+--------+--------+--------+
Middle Eastern   |      1 |     50 |      8 |      3 |     64
                 |   0.18 |   9.11 |   1.46 |   0.55 |  11.66
                 |   1.56 |  78.13 |  12.50 |   4.69 |
                 |   0.99 |  87.72 |   9.30 |  21.43 |
-----------------+--------+--------+--------+--------+
East Indian      |      1 |      1 |     67 |      1 |     72
                 |   0.18 |   0.18 |  12.20 |   0.18 |  13.11
                 |   1.39 |   1.39 |  93.06 |   1.39 |
                 |   0.99 |   1.75 |  77.91 |   7.14 |
-----------------+--------+--------+--------+--------+
Other            |      4 |      3 |      2 |      3 |     22
                 |   0.73 |   0.55 |   0.36 |   0.55 |   4.01
                 |  18.18 |  13.64 |   9.09 |  13.64 |
                 |   3.96 |   5.26 |   2.33 |  21.43 |
-----------------+--------+--------+--------+--------+
Total                 101       57       86       14      549
                    18.40    10.38    15.66     2.55   100.00

                        Frequency Missing = 30

Skipping checks of e1-e8 ...

                                     Cumulative    Cumulative
sex       Frequency     Percent       Frequency       Percent
-----------------------------------------------------------
Male            285       51.72             285         51.72
Female          266       48.28             551        100.00

                        Frequency Missing = 28

_______________________________________________________________________________

                Regression with Categorical IV: Math data                     6
                                              16:43 Wednesday, October 13, 2004

The GLM Procedure

                         Class Level Information

Class        Levels    Values

ethnic2           8    Chinese East Indian Eastern European
                       English-speaking Middle Eastern Other Other Asian
                       Other European

                     Number of observations    579

NOTE: Due to missing values, only 385 observations can be used in this
      analysis.
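
A side note, offered as a sketch rather than as part of mathcat1.sas: when a variable appears in the class statement, proc glm builds its own indicator variables, one per level, in the sorted order of the formatted values shown in the Class Level Information above. Adding the solution option to the model statement asks proc glm to print the resulting parameter estimates. With the default ordering, the last level listed (Other European) should act as the reference category, with its coefficient set to zero. The data set name ecat is the one created in mathcat1.sas; the run and quit statements are additions for interactive use.

```sas
/* Sketch only: same one-way model as in mathcat1.sas, but ask proc glm */
/* to print estimates for the indicator variables it creates itself.    */
proc glm data=ecat;
     class ethnic2;
     model grade = ethnic2 / solution;  /* solution prints the estimates */
run;
quit;
```

The F test for ethnic2 from this run should match the one reported below, since only the choice of reference category differs from the hand-made dummy variables e1-e8.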
_______________________________________________________________________________

The GLM Procedure

Dependent Variable: grade   Final mark (if any)

                                  Sum of
Source                 DF        Squares    Mean Square   F Value   Pr > F
Model                   7      3695.1830       527.8833      1.43   0.1925
Error                 377    139351.2430       369.6319
Corrected Total       384    143046.4260

R-Square     Coeff Var      Root MSE    grade Mean
0.025832      32.77514      19.22581      58.65974

Source                 DF      Type I SS    Mean Square   F Value   Pr > F
ethnic2                 7    3695.182958     527.883280      1.43   0.1925

Source                 DF    Type III SS    Mean Square   F Value   Pr > F
ethnic2                 7    3695.182958     527.883280      1.43   0.1925

_______________________________________________________________________________

The REG Procedure
Model: MODEL1
Dependent Variable: grade Final mark (if any)

                          Analysis of Variance

                                 Sum of          Mean
Source                 DF       Squares        Square   F Value   Pr > F
Model                   7    3695.18296     527.88328      1.43   0.1925
Error                 377        139351     369.63194
Corrected Total       384        143046

Root MSE             19.22581    R-Square     0.0258
Dependent Mean       58.65974    Adj R-Sq     0.0077
Coeff Var            32.77514

                          Parameter Estimates

                               Parameter     Standard
Variable    Label        DF     Estimate        Error   t Value   Pr > |t|
Intercept   Intercept     1     53.50000      6.07974      8.80     <.0001
e1                        1      7.77273      6.52408      1.19     0.2343
e2                        1      2.73810      7.38679      0.37     0.7111
e3                        1      2.58333      6.68310      0.39     0.6993
e4                        1      3.13014      6.48280      0.48     0.6295
e5                        1      2.85714      6.49951      0.44     0.6605
e6                        1      6.17500      6.79735      0.91     0.3642
e7                        1     10.81579      6.59151      1.64     0.1017

_______________________________________________________________________________

The REG Procedure
Model: MODEL1

Test allsame Results for Dependent Variable grade

                             Mean
Source             DF      Square   F Value   Pr > F
Numerator           7   527.88328      1.43   0.1925
Denominator       377   369.63194

_______________________________________________________________________________

The REG Procedure
Model: MODEL1
Dependent Variable: grade Final mark (if any)

NOTE: No intercept in model. R-Square is redefined.

                          Analysis of Variance

                                 Sum of          Mean
Source                 DF       Squares        Square   F Value   Pr > F
Model                   8       1328467        166058    449.25   <.0001
Error                 377        139351     369.63194
Uncorrected Total     385       1467818

Root MSE             19.22581    R-Square     0.9051
Dependent Mean       58.65974    Adj R-Sq     0.9030
Coeff Var            32.77514

                          Parameter Estimates

                               Parameter     Standard
Variable    Label        DF     Estimate        Error   t Value   Pr > |t|
e1                        1     61.27273      2.36653     25.89     <.0001
e2                        1     56.23810      4.19542     13.40     <.0001
e3                        1     56.08333      2.77501     20.21     <.0001
e4                        1     56.63014      2.25021     25.17     <.0001
e5                        1     56.35714      2.29792     24.53     <.0001
e6                        1     59.67500      3.03987     19.63     <.0001
e7                        1     64.31579      2.54652     25.26     <.0001
e8                        1     53.50000      6.07974      8.80     <.0001

_______________________________________________________________________________

Test sameall Results for Dependent Variable grade

                             Mean
Source             DF      Square   F Value   Pr > F
Numerator           7   527.88328      1.43   0.1925
Denominator       377   369.63194

_______________________________________________________________________________

Include Sex and Ethnicity in the regression

The REG Procedure
Model: MODEL1
Dependent Variable: grade Final mark (if any)

                          Analysis of Variance

                                 Sum of          Mean
Source                 DF       Squares        Square   F Value   Pr > F
Model                  12         43590    3632.50289     18.77   <.0001
Error                 276         53413     193.52377
Corrected Total       288         97003

Root MSE             13.91128    R-Square     0.4494
Dependent Mean       60.61246    Adj R-Sq     0.4254
Coeff Var            22.95119

                          Parameter Estimates

                                               Parameter    Standard
Variable   Label                        DF      Estimate       Error  t Value
Intercept  Intercept                     1     -72.84682    11.68505    -6.23
gpa        High School GPA               1       1.18286     0.18253     6.48
hscalc     HS Calculus                   1       0.33347     0.10016     3.33
precalc    Number precalculus correct    1       1.73408     0.58004     2.99
calc       Number calculus correct       1       0.69419     0.39547     1.76
sex                                      1       1.07063     1.72557     0.62
e2                                       1      -0.13357     3.85211    -0.03
e3                                       1      -0.59464     3.23489    -0.18
e4                                       1      -1.77769     2.87413    -0.62
e5                                       1      -0.06677     2.94415    -0.02
e6                                       1      -2.17740     3.45108    -0.63
e7                                       1       2.88897     2.99501     0.96
e8                                       1       1.64621     5.74186     0.29

                          Parameter Estimates

Variable   Label                        DF    Pr > |t|
Intercept  Intercept                     1      <.0001
gpa        High School GPA               1      <.0001
hscalc     HS Calculus                   1      0.0010
precalc    Number precalculus correct    1      0.0030
calc       Number calculus correct       1      0.0803
sex                                      1      0.5355
e2                                       1      0.9724
e3                                       1      0.8543
e4                                       1      0.5367
e5                                       1      0.9819
e6                                       1      0.5286
e7                                       1      0.3356
e8                                       1      0.7746

_______________________________________________________________________________

Test hschool Results for Dependent Variable grade

                             Mean
Source             DF      Square   F Value   Pr > F
Numerator           2       11413     58.98   <.0001
Denominator       276   193.52377

_______________________________________________________________________________

Test dtest Results for Dependent Variable grade

                             Mean
Source             DF      Square   F Value   Pr > F
Numerator           2  1711.22289      8.84   0.0002
Denominator       276   193.52377

_______________________________________________________________________________

Test compare Results for Dependent Variable grade

                             Mean
Source             DF      Square   F Value   Pr > F
Numerator           1   322.00651      1.66   0.1982
Denominator       276   193.52377

_______________________________________________________________________________

Test EthBack Results for Dependent Variable grade

                             Mean
Source             DF      Square   F Value   Pr > F
Numerator           7    98.29303      0.51   0.8283
Denominator       276   193.52377

_______________________________________________________________________________

Sex by ethnicity 2-way ANOVA

The GLM Procedure

                         Class Level Information

Class        Levels    Values

sex               2    Female Male

ethnic2           8    Chinese East Indian Eastern European
                       English-speaking Middle Eastern Other Other Asian
                       Other European

                     Number of observations    579

NOTE: Due to missing values, only 383 observations can be used in this
      analysis.

_______________________________________________________________________________

The GLM Procedure

Dependent Variable: grade   Final mark (if any)

                                  Sum of
Source                 DF        Squares    Mean Square   F Value   Pr > F
Model                  15      6847.6569       456.5105      1.23   0.2454
Error                 367    136069.8940       370.7627
Corrected Total       382    142917.5509

R-Square     Coeff Var      Root MSE    grade Mean
0.047913      32.82331      19.25520      58.66319

Source                 DF      Type I SS    Mean Square   F Value   Pr > F
sex                     1     110.794519     110.794519      0.30   0.5849
ethnic2                 7    3504.025570     500.575081      1.35   0.2256
sex*ethnic2             7    3232.836800     461.833829      1.25   0.2768

Source                 DF    Type III SS    Mean Square   F Value   Pr > F
sex                     1     316.814991     316.814991      0.85   0.3559
ethnic2                 7    3542.751451     506.107350      1.37   0.2189
sex*ethnic2             7    3232.836800     461.833829      1.25   0.2768

_______________________________________________________________________________

The GLM Procedure

Level of            ------------grade------------
sex            N            Mean          Std Dev

Female       193      58.1295337       18.4319923
Male         190      59.2052632       20.2598196

Level of                  ------------grade------------
ethnic2              N            Mean          Std Dev

Chinese             66      61.2727273       21.1823914
East Indian         56      64.2857143       19.9031421
Eastern European    48      56.0833333       19.9625536
English-speaking    72      56.7222222       17.3133688
Middle Eastern      40      59.6750000       19.5505754
Other               10      53.5000000        9.1195760
Other Asian         21      56.2380952       20.1367941
Other European      70      56.3571429       18.8718299

Level of   Level of                  ------------grade------------
sex        ethnic2              N            Mean          Std Dev

Female     Chinese             24      62.5416667       17.2475056
Female     East Indian         24      63.2083333       18.9965195
Female     Eastern European    32      57.5312500       17.4946708
Female     English-speaking    40      57.5750000       18.0851086
Female     Middle Eastern      17      63.7647059       14.9102708
Female     Other                6      49.6666667        5.4650404
Female     Other Asian         12      47.6666667       19.5696116
Female     Other European      38      55.3421053       20.9481681
Male       Chinese             42      60.5476190       23.3020489
Male       East Indian         32      65.0937500       20.8208323
Male       Eastern European    16      53.1875000       24.5498642
Male       English-speaking    32      55.6562500       16.5209727
Male       Middle Eastern      23      56.6521739       22.2130361
Male       Other                4      59.2500000       11.2361025
Male       Other Asian          9      67.6666667       15.1657509
Male       Other European      32      57.5625000       16.3153154
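
A short worked check, added as a sketch, of how the two proc reg runs on ethnicity alone relate to each other. The notation is an assumption introduced here and is not in the handout: mu_j is the expected grade in ethnic2 category j, and the betas are the coefficients of the model with an intercept and e1-e7, so category 8 (Other) is the reference. All numbers are copied from the output above.

```latex
\begin{align*}
% Reference-category coding: intercept plus e1-e7
E[\text{grade}] &= \beta_0 + \beta_1 e_1 + \cdots + \beta_7 e_7,
  & \mu_8 &= \beta_0, \quad \mu_j = \beta_0 + \beta_j \ (j = 1,\dots,7) \\
% Cell-means coding: no intercept, e1-e8
E[\text{grade}] &= \mu_1 e_1 + \cdots + \mu_8 e_8 \\[4pt]
% The estimates from the two runs agree
\hat\beta_0 + \hat\beta_1 &= 53.50000 + 7.77273 = 61.27273 = \hat\mu_1
  & &\text{(Chinese)} \\
\hat\beta_0 + \hat\beta_7 &= 53.50000 + 10.81579 = 64.31579 = \hat\mu_7
  & &\text{(East Indian)} \\[4pt]
% allsame and sameall state the same null hypothesis
H_0:\ \beta_1 = \cdots = \beta_7 = 0
  &\iff \mu_1 = \cdots = \mu_8,
  & F &= \frac{527.88328}{369.63194} \approx 1.43 \ \text{on } (7,\,377)\ \text{df}
\end{align*}
```

This is why the tests labelled allsame and sameall, and the one-way proc glm, all report F = 1.43 with p = 0.1925: they are three parameterizations of the same comparison of eight category means.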