/************************ tubeclean.sas *******************************/
/* Data cleaning for TUBES Data: There is no point in doing the right */
/* statistical analysis on data that are full of errors.              */
/**********************************************************************/

title2 'Data cleaning for tubes data';
%include 'tuberead.sas';
/* The data step begun in tuberead.sas continues here */

/* Error check variables (internal consistency) */
/* MCG must be the same on each line (the first copy is named mcg, not mcg1) */
if mcg ne mcg2 then mcger1=line1;
if mcg2 ne mcg3 then mcger2=line2;
if mcg3 ne mcg4 then mcger3=line3;
if mcg4 ne mcg5 then mcger4=line4;
if mcg5 ne mcg6 then mcger5=line5;
if mcg6 ne mcg7 then mcger6=line6;
if mcg7 ne mcg8 then mcger7=line7;
if mcg8 ne mcg9 then mcger8=line8;
if mcg9 ne mcg10 then mcger9=line9;
if mcg10 ne mcg11 then mcger10=line10;
if mcg11 ne mcg12 then mcger11=line11;
if mcg12 ne mcg13 then mcger12=line12;
if mcg13 ne mcg14 then mcger13=line13;

/* REPLIC must be the same on each line */
if replic1 ne replic2 then replicer1=line1;
if replic2 ne replic3 then replicer2=line2;
if replic3 ne replic4 then replicer3=line3;
if replic4 ne replic5 then replicer4=line4;
if replic5 ne replic6 then replicer5=line5;
if replic6 ne replic7 then replicer6=line6;
if replic7 ne replic8 then replicer7=line7;
if replic8 ne replic9 then replicer8=line8;
if replic9 ne replic10 then replicer9=line9;
if replic10 ne replic11 then replicer10=line10;
if replic11 ne replic12 then replicer11=line11;
if replic12 ne replic13 then replicer12=line12;
if replic13 ne replic14 then replicer13=line13;

/* Increase in length and number of sclerotia from am to pm */
array diff{28} ldiff1-ldiff14 sdiff1-sdiff14; 
do i=1 to 28;
   diff{i}=pm{i}-am{i}; /* am and pm are defined in tuberead */
end;


proc freq; 
     title3 'Frequency distributions';
     tables line1-line14 mcg mcg2-mcg14 replic1-replic14 day1-day14 
            mcger1-mcger13 replicer1-replicer13;

proc means n mean min max;
     title3 'Means of quantitative variables';
     var amlng1-amlng14 pmlng1-pmlng14 length1-length14 
         amscl1-amscl14 pmscl1-pmscl14  sclr1-sclr14 
         amslope pmslope rate weight;

proc freq;
     title3 'Look at am to pm change each day';
     tables ldiff1-ldiff14 sdiff1-sdiff14; 

/* At this point it looks like length10 is the primary DV. The data set is
small, so print the whole thing, sorted by mcg and length10. */

proc sort;
     by mcg length10;

proc print;
     var line1 mcg length10 sclr10 weight rate;

/* Notes: Commented out

Freq dist for line1: entries must all be odd, but I see a line 238. Checking
the raw data file, I see (aha!) two line 227s. But the rest of the data look
okay. This is just cosmetic. Forget it.
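
A quick way to list any even line numbers, not run here. This is only a
sketch, and it assumes line1 is numeric.

proc print;
     where mod(line1,2)=0;
     var line1 mcg;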

Looking at proc means: The last day with no missing observations for am length
is day 11, but its max is 31.2, and pmlng11 has a max of 32.8. These are 30 cm
race tubes, so by day 11 the growth has gone beyond the end of the tube. Stick
with day 10.
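
To see which tubes ran past the end on day 11, something like this would
do (a sketch, not run; the 30 is the tube length in cm from above):

proc print;
     where amlng11 > 30 or pmlng11 > 30;
     var line1 mcg amlng11 pmlng11;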

It looks like they recorded missing values instead of zero for sclerotia. I
could fix this, but I'm not sure I need to.

if amscl1=. then amscl1=0; if pmscl1=. then pmscl1=0;
if amscl2=. then amscl2=0; if pmscl2=. then pmscl2=0;
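
A sketch of the same recode for all 14 days, not run. It would have to go
inside the data step. It assumes the variables are amscl1-amscl14 and
pmscl1-pmscl14 as in the proc means above. The array names ascl and pscl
and the index k are made up here. Note that it would also turn any truly
missing counts into zeros.

array ascl{14} amscl1-amscl14;
array pscl{14} pmscl1-pmscl14;
do k=1 to 14;
   if ascl{k}=. then ascl{k}=0;
   if pscl{k}=. then pscl{k}=0;
end;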

Looking at the difference variables: This is a careful lab study, but still
there is measurement error. Especially look at the -17 for sdiff12. We could
track it down and hide it, which is something a biologist might do. But to a
statistician, everything has a piece of random error attached. It's a fact of
life. So we model it or live with it. In this case, we live with it.
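
To pull up the cases with a drop in sclerotia count on day 12 (a sketch, not
run; missing values compare below zero in SAS, hence the first condition):

proc print;
     where sdiff12 ne . and sdiff12 < 0;
     var line1 mcg amscl12 pmscl12 sdiff12;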

At this point I'm looking at length10 as my primary dependent variable. There
is a good justification, but I don't want to type it now. Other good DVs are
sclr10, weight and rate.  

Linda wanted:

data fixed;
     set mould;
     if line1 ne 113;