What every graduate student should know especially if they wonder what might be on the Ph.D. Comprehensive Examination

I am preparing questions, some of which I will submit for the next Ph.D. comprehensive examination. I decided to post these here to guarantee that everyone can pass.

I will try to stop at 100 problems (the page will be updated from time to time). So you have to think about only 1 per week -- if you start now.

Don't Panic.

You aren't expected to know the answers.  Just how to do them.
The problems require only a good understanding of undergraduate material.
Some however are difficult.  Some are impossible -- and identifying these is the problem.  Some are just fun.

I will be happy to discuss these from September to March. The problems are presented here in pseudo-LaTeX for ease of typing and distribution.
 

Have fun.

 

X, based on measurements,  is assumed to have some distribution.

Give an example where you will confidently assume a Gaussian distribution.

Give an example where you will confidently assume a binomial distribution.
 

X_1, ...X_n are used to compute a t-test for mu_X = 3.

Of course no random variables are independent and therefore these are correlated.
Which hurts more, positive or negative correlation, and why?
Hint: Consider really extreme values of correlation.
 

X_1, ..., X_n are converted to ranks before analysis.

What is Cor(X_i, X_j)?

What is the RELATIVE bias of s^2 in this case?  Evaluate when n = 100.
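
A guess at the rank correlation is easy to check by brute force for small n. The sketch below assumes no ties, so that every ordering of the ranks 1..n is equally likely; the function name is mine. It enumerates all orderings and returns the exact correlation between the first two ranks.

```python
from fractions import Fraction
from itertools import permutations


def rank_correlation(n):
    """Exact Cor(R_1, R_2) when (R_1, ..., R_n) is a uniformly random ordering of 1..n."""
    perms = list(permutations(range(1, n + 1)))
    m = len(perms)
    # By symmetry R_1 and R_2 share the same mean and variance.
    mean = Fraction(sum(p[0] for p in perms), m)
    var = sum((p[0] - mean) ** 2 for p in perms) / m
    cov = sum((p[0] - mean) * (p[1] - mean) for p in perms) / m
    return cov / var


if __name__ == "__main__":
    for n in range(3, 8):
        print(n, rank_correlation(n))
```

Spotting the pattern in n is then up to you, as is the same enumeration for s^2.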

X_1, ...X_n and Y_1, ...Y_n are two samples of binary data.

Compute a confidence interval for P_X - P_Y.

Compute the standard interval, based on a t-statistic,  for mu_X - mu_Y.

Compare these numerically when n = 100.

Which test would you use if the p's might be different? Which test is presented in most undergraduate texts?
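
For the numerical comparison, here is a minimal sketch under my own assumptions: equal sample sizes, the unpooled (Wald) standard error for the difference of proportions, and the pooled-variance standard error for the two-sample t interval. The success counts in the example are made up. It relies on the fact that for 0/1 data s^2 = n phat (1 - phat) / (n - 1).

```python
import math


def wald_se(x_successes, y_successes, n):
    """Unpooled standard error of phat_X - phat_Y with n observations per sample."""
    px, py = x_successes / n, y_successes / n
    return math.sqrt(px * (1 - px) / n + py * (1 - py) / n)


def pooled_t_se(x_successes, y_successes, n):
    """Pooled-variance standard error used by the two-sample t interval."""
    px, py = x_successes / n, y_successes / n
    # Sample variance of 0/1 data: n * phat * (1 - phat) / (n - 1).
    s2x = n * px * (1 - px) / (n - 1)
    s2y = n * py * (1 - py) / (n - 1)
    pooled = (s2x + s2y) / 2  # equal sample sizes, so a plain average
    return math.sqrt(pooled * 2 / n)


if __name__ == "__main__":
    n = 100
    x, y = 40, 55  # hypothetical success counts
    print(wald_se(x, y, n), pooled_t_se(x, y, n))
    print(pooled_t_se(x, y, n) / wald_se(x, y, n), math.sqrt(n / (n - 1)))
```

The two multipliers (Gaussian versus t quantile) are left out on purpose; the point is to see how close the standard errors are before you worry about the quantile.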
 
 

X_1, ..., X_n are a random sample from f(x).


What is the asymptotic variance of the median, med(X)?

What is the relative efficiency of the median compared with the mean if the data are Gaussian?

What if the data came from the contaminated Gaussian where each X_i comes from Gau(0,1)
with probability (1-alpha) and from Gau(0,9) with probability (alpha)?

For what value of alpha are the variances the same?

Suppose 1 statistician summarized N batches of such data with means,
and another statistician summarized another N batches with medians.
Your job is to identify the statistician with the less variable summaries.
Using terms like size and power estimate the sample size required to do this if alpha = 0.
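
The crossover value of alpha can be found numerically. This sketch assumes that Gau(0,9) means variance 9 (standard deviation 3) and uses the usual asymptotic variance of the median, 1 / (4 n f(0)^2), for a density symmetric about 0. Function names and the search interval [0, 0.5] are my choices.

```python
import math


def n_var_mean(alpha):
    """n * Var(mean) for the mixture (1 - alpha) Gau(0,1) + alpha Gau(0,9)."""
    return (1 - alpha) * 1.0 + alpha * 9.0


def n_var_median(alpha):
    """n * asymptotic Var(median) = 1 / (4 f(0)^2), with f the mixture density."""
    f0 = ((1 - alpha) + alpha / 3.0) / math.sqrt(2 * math.pi)
    return 1.0 / (4.0 * f0 ** 2)


def crossover_alpha(lo=0.0, hi=0.5, tol=1e-10):
    """Bisection for the alpha in [lo, hi] where the two variances agree."""
    def g(a):
        return n_var_mean(a) - n_var_median(a)
    assert g(lo) < 0 < g(hi)  # the mean wins at lo, the median wins at hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    print("efficiency of median at alpha = 0:", n_var_mean(0) / n_var_median(0))
    print("crossover alpha:", crossover_alpha())
```

If you read Gau(0,9) as standard deviation 9 instead, change the 9.0 and 3.0 accordingly.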
 

X_1, X_2, ... are independent variables in a least-squares regression with independent dependent variables.

Show that the variance of beta_1 increases without limit as the number of independent variables increases except under a very unrealistic condition.  State the condition.

Find bounds for var(yhat_i) and av(var(yhat_i)).

Infant mortality under 1 kg is 30%. A new treatment is expected to cut this in half.

If a simple clinical trial is run to establish the effectiveness of the new treatment, what sample size is required?
Justify your answer.
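
One common starting point is the textbook two-proportion formula. The sketch below assumes a two-arm trial with equal allocation, a two-sided 5% test and 80% power; none of these choices is given in the problem, and justifying or replacing them is part of the answer.

```python
import math
from statistics import NormalDist


def n_per_arm(p_control, p_treated, alpha=0.05, power=0.80):
    """Per-arm sample size for comparing two proportions (normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided size (assumed)
    z_b = NormalDist().inv_cdf(power)          # power (assumed)
    p_bar = (p_control + p_treated) / 2
    numerator = (
        z_a * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_b * math.sqrt(p_control * (1 - p_control) + p_treated * (1 - p_treated))
    ) ** 2
    return math.ceil(numerator / (p_control - p_treated) ** 2)


if __name__ == "__main__":
    print(n_per_arm(0.30, 0.15))
    print(n_per_arm(0.30, 0.15, power=0.90))
```

Try a few values of power and see how fast the answer moves.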
 

Math scores in grade 7 were obtained as follows.

25 schools randomly selected from the public school board.
25 schools randomly selected from the separate school board.
2 teachers randomly chosen in each school.
20 students randomly chosen from each teacher.

A standard analysis of variance table leads to a test of the hypothesis
that there is no difference in the scores of students in public and separate schools.

What are its degrees of freedom?
 

X_1, X_2, ... are blood pressures of high-risk heart patients measured from heart attack till death or 5 years, whichever comes first.

Half of the patients are randomly selected and placed on a low-salt diet.
The experimenters want to know if the diet reduces blood pressure.
Draft the section on analysis that they must include in their grant proposal.
 

The standard deviation of weights of faculty on the 6th floor is 50 Kg.

Comment on the above statement.
 

Assuming the best of all possible worlds -- that the data actually are independent Gaussian,

Calculate the sample size to estimate the variance with a confidence interval with width 10% of the true value.
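
A rough first pass uses the large-sample approximation Var(s^2) = 2 sigma^4 / (n - 1). The sketch assumes that "width" means the total width of a 95% interval (both the reading and the confidence level are my assumptions) and solves 2 z sqrt(2 / (n - 1)) = 0.10 for n.

```python
import math
from statistics import NormalDist


def n_for_variance_ci(relative_width=0.10, confidence=0.95):
    """Smallest n with 2 * z * sqrt(2 / (n - 1)) <= relative_width."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(1 + 2 * (2 * z / relative_width) ** 2)


if __name__ == "__main__":
    print(n_for_variance_ci())
    print(n_for_variance_ci(relative_width=0.20))
```

An exact answer would use chi-squared quantiles instead; at this sample size the difference should be small, but check.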
 

An election poll of 1500 indicates that the PQ are at 47% and the Lib at 46%.

Compute the variance of the estimated % voting PQ.

Compute the MSE of the same.

Compute a confidence interval for %PQ - %Lib.
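
For the difference, the two percentages come from the same respondents, so they are negatively correlated. A sketch assuming simple random sampling (multinomial counts) and a 95% level, both my assumptions: Var(p1hat - p2hat) = [p1 + p2 - (p1 - p2)^2] / n.

```python
import math
from statistics import NormalDist


def diff_ci(p1, p2, n, confidence=0.95):
    """Normal-approximation interval for p1 - p2 from a single multinomial sample."""
    # Var = [p1(1-p1) + p2(1-p2) + 2 p1 p2] / n, since Cov(p1hat, p2hat) = -p1 p2 / n.
    var = (p1 + p2 - (p1 - p2) ** 2) / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * math.sqrt(var)
    return (p1 - p2 - half_width, p1 - p2 + half_width)


if __name__ == "__main__":
    print(diff_ci(0.47, 0.46, 1500))
```

Compare the half-width with the "margin of error" that the newspapers quote for a single percentage.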
 

X_1, ...,X_200 are independent Gaussian

Compute an approximation of the variance of log(s^2).

Compute an approximation of the variance of the usual t-statistic for testing mu_X = 0.
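
Whatever approximation you derive, a small Monte Carlo run is a cheap sanity check. This sketch simulates Var(log s^2) for n = 200 standard Gaussian observations and prints it next to the delta-method value 2 / (n - 1); the number of replications and the seed are arbitrary choices of mine.

```python
import math
import random


def sample_variance(values):
    """Usual unbiased sample variance (divisor n - 1)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def simulate_var_log_s2(n=200, reps=2000, seed=1):
    """Monte Carlo estimate of Var(log s^2) for n independent Gau(0,1) observations."""
    rng = random.Random(seed)
    logs = []
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        logs.append(math.log(sample_variance(x)))
    return sample_variance(logs)


if __name__ == "__main__":
    n = 200
    print("simulated:   ", simulate_var_log_s2(n))
    print("delta method:", 2 / (n - 1))
```

And yes, this is an experiment, so think about how many replications you need before believing the second decimal.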
 

X is Poisson with lambda > 10.

Compute an approximation of Var(X).
 

X is n^(-1) Binomial(n, p) with n p > 10.

Compute the transformation t(X) with variance approximately independent of p.
 

X is sigma^2 Chi squared with mean M and variance V.

Compute the degrees of freedom in terms of M and V.
 

Y_1, ..., Y_n are independent with mean 0 and variance sigma^2.

Calculate E[sum (X_i Y_i)^2] and Var[sum (X_i Y_i)^2].  Using the above problem, estimate the effective degrees of freedom
of the n random variables (X_i Y_i).

If the independent variables of a straight-line regression are Gaussian, what is the effective degrees of freedom?

X_1, ..., X_n are independent Bernoulli(p_i).

Compute the likelihood ratio statistic for testing p_i = p.
Compute its distribution.
Compute the mean and variance of this statistic and compare these values with those of
the usual chi-squared approximation when p = 0.01 and n = 100,000.
Why is the approximation so bad? (Note that n * p is huge, so phat is approximately Gaussian.)

X_1, ... X_n are independent Gaussian

Let mhat(k) be the usual estimate of the k^th central moment.
Compute approximations of E[mhat(k)] and Var[mhat(k)].
Compute n(k) such that the standard deviation of mhat(k) is 10% of E[mhat(k)].
Tabulate n(k) for k = 2, ..., 6.
What does this table tell you?

Repeat for X_i exponential.
Compare the table with the one above.
 

X_i,j, 1 <= i <= k, 1 <= j <= n_i, are independent with mean mu and variances sigma^2_i.

Find the minimum variance linear unbiased estimate of mu.

If the sigma^2_i are not known, estimate mu.
 

I forgot to mention


in the above that n_3 = 2 and s^2_3 = 0. Now what is muhat?
 
 

Experimental design.

84% of recent theses include some Monte Carlo calculations.  These are experiments.
And every PhD in statistics is expected to know the basics.
So go to the library and pick a thesis.
From the thesis complete the following:

Purpose ____ was the purpose of the experiment clearly stated?
Design ____ what terms were used to describe the design, e.g. factorial, randomized block -- or does it look like somebody just dreaming up things to do?
Analysis of results ____ how was the form of analysis described, e.g. paired t-test, analysis of variance ...?
Significance levels ____ were the claimed differences supported by any tests of significance?
Conclusions ____ are the conclusions supported by the experiment -- are they related to the stated purpose?
Mark ____ assuming that the student was taking an undergraduate course in experimental design, assign a mark for the
assignment for a) the design and b) the analysis.

This is really a question for your final oral.  Two black balls can fail you. Don't let this be one. Be warned.
 

X_1, ..., X_n are Bernoulli variables indicating whether a child with a severe chronic condition is above or below a threshold.

The nature of the disease is that patients vary up and down from day to day, and without treatment X_i is Bernoulli with p = 1/2.
An investigator designs a study for a treatment that is supposed to slowly increase p.
20 patients are to be followed for 1 year.
What is the critical value for a 5% test of p = 1/2?
Is this for a one- or two-sided test?

One day, about 3 months into the study, the investigator notes that only 6 of the 20 are above the threshold and wonders if she should stop the trial because it looks like the treatment is harmful.

Prepare a brief report, based on calculations, on the statistical considerations involved.
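
The calculations can start from exact binomial tail probabilities. The sketch below assumes the test statistic is the number of the 20 patients above the threshold on a single day, which is only one possible reading of the design; the function names are mine.

```python
from math import comb


def upper_tail(c, n=20, p=0.5):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(c, n + 1))


def lower_tail(c, n=20, p=0.5):
    """P(X <= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(0, c + 1))


def upper_critical_value(n=20, p=0.5, alpha=0.05):
    """Smallest c with P(X >= c) <= alpha (one-sided, upper tail)."""
    for c in range(n + 2):
        if upper_tail(c, n, p) <= alpha:
            return c


if __name__ == "__main__":
    print("one-sided 5% critical value:", upper_critical_value())
    print("P(X >= 15):", upper_tail(15))
    print("P(X <= 6): ", lower_tail(6))
```

The harder part of the report is what these numbers do not cover: an unplanned look at the data, a treatment that is supposed to act slowly, and a day chosen because it looked bad.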