What every graduate student should know especially if they wonder what
might be on the Ph.D. Comprehensive Examination
I am preparing questions some of which I will submit for the next Ph.D.
comprehensive examination. I decided to post these here to guarantee that
everyone can pass.
I will try to stop at 100 problems (the page will be updated from
time to time). So you have to think about only 1 per week --
if you start now.
Don't Panic.
You aren't expected to know the answers. Just how to do them.
The problems require only a good understanding of undergraduate material.
Some however are difficult. Some are impossible -- and identifying
these is the problem. Some are just fun.
I will be happy to discuss these from September to March. The problems are
presented here in pseudo-LaTeX for ease of typing and distribution.
Have fun.
X, based on measurements, is assumed to have some distribution.
Give an example where you will confidently assume a Gaussian distribution.
Give an example where you will confidently assume a binomial distribution.
X_1, ...X_n are used to compute a t-test for mu_X = 3.
Of course no random variables are independent and therefore these are correlated.
What hurts more, positive or negative correlation, and why?
Hint: Consider really extreme values of correlation.
X_1, ..., X_n are converted to ranks before analysis.
What is Cor(X_i, X_j)?
What is the RELATIVE bias of s^2 in this case? Evaluate when n
= 100.
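One way to check a hand derivation for the rank problem is exact enumeration for a small n. The sketch below assumes no ties, so the ranks are a uniformly random permutation of 1..n; the function names are made up for this illustration and it is a numerical check only, not a derivation.

```python
from fractions import Fraction
from itertools import permutations

def rank_moments(n):
    """Exact Var(R_1) and Cov(R_1, R_2) when the ranks are a uniformly
    random permutation of 1..n (assumption: no ties)."""
    perms = list(permutations(range(1, n + 1)))
    m = len(perms)
    e1 = Fraction(sum(p[0] for p in perms), m)
    e2 = Fraction(sum(p[1] for p in perms), m)
    e11 = Fraction(sum(p[0] ** 2 for p in perms), m)
    e12 = Fraction(sum(p[0] * p[1] for p in perms), m)
    return e11 - e1 ** 2, e12 - e1 * e2

def sample_variance(values):
    """Usual s^2 with divisor n - 1, in exact arithmetic."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((v - mean) ** 2 for v in values) / (n - 1)

n = 5  # small enough to enumerate all n! permutations
var_rank, cov_rank = rank_moments(n)
cor_rank = cov_rank / var_rank

# s^2 of a set of ranks is the same for every permutation, so it equals E[s^2].
s2 = sample_variance(range(1, n + 1))
rel_bias = s2 / var_rank - 1

print("Cor(R_i, R_j) =", cor_rank)
print("relative bias of s^2 =", rel_bias)
```

Trying n = 4, 5, 6, 7 and looking for the pattern is a reasonable route to the general formula, which can then be evaluated at n = 100.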
X_1, ...X_n and Y_1, ...Y_n are two samples of binary data.
Compute a confidence interval for P_X - P_Y.
Compute the standard interval, based on a t-statistic, for mu_X
- mu_Y.
Compare these numerically when n = 100.
Which test would you use if the p's might be different? Which test is
presented in most undergraduate texts?
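For the numerical comparison at n = 100, a sketch like the following may help. It compares only the standard errors of the two intervals, assumes equal sample sizes, and uses hypothetical values for the two sample proportions; the quantile (normal versus t on 2n - 2 degrees of freedom) is a separate, smaller difference.

```python
import math

def se_wald(px, py, n):
    """Unpooled standard error of phat_X - phat_Y for two samples of size n."""
    return math.sqrt(px * (1 - px) / n + py * (1 - py) / n)

def se_pooled_t(px, py, n):
    """Standard error used by the pooled two-sample t interval on 0/1 data."""
    # For n zeros and ones with mean p, the sample variance is n/(n-1) * p(1-p).
    s2x = n / (n - 1) * px * (1 - px)
    s2y = n / (n - 1) * py * (1 - py)
    s2_pooled = (s2x + s2y) / 2  # equal n, so pooling is a plain average
    return math.sqrt(s2_pooled * (2 / n))

n = 100
px, py = 0.30, 0.45  # hypothetical sample proportions, not from the problem
ratio = se_pooled_t(px, py, n) / se_wald(px, py, n)
print("ratio of standard errors:", ratio)
```

With unequal sample sizes the pooled and unpooled standard errors no longer differ by a constant factor, which is worth exploring before answering the last two questions.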
X_1, ..., X_n are a random sample from f(x).
What is the asymptotic variance of the median, med(X)?
What is the relative efficiency of the median compared with the mean
if the data are Gaussian?
What if the data came from the contaminated Gaussian where each X_i
comes from Gau(0,1)
with probability (1-alpha) and from Gau(0,9) with probability (alpha)?
For what value of alpha are the variances the same?
Suppose 1 statistician summarized N batches of such data with means,
and another statistician summarized another N batches with medians.
Your job is to identify the statistician with the less variable summaries.
Using terms like size and power estimate the sample size required to
do this if alpha = 0.
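The efficiency questions above can be explored numerically once the asymptotic variance of the median is in hand. The sketch below uses the standard large-sample result Var(med) ~ 1 / (4 n f(m)^2) and assumes that Gau(0,9) means variance 9 (standard deviation 3); if 9 is meant as the standard deviation, change `contam_var`. The function names are illustrative only.

```python
import math

def var_ratio(alpha, contam_var=9.0):
    """Asymptotic Var(median) / Var(mean) for the contaminated Gaussian.
    Assumption: the contaminating component has variance contam_var."""
    var_mean = (1 - alpha) * 1.0 + alpha * contam_var
    # Mixture density at the median, which is 0 by symmetry.
    f0 = ((1 - alpha) + alpha / math.sqrt(contam_var)) / math.sqrt(2 * math.pi)
    var_median = 1.0 / (4.0 * f0 ** 2)
    return var_median / var_mean

def crossover(lo=0.0, hi=0.4, tol=1e-10):
    """Bisection for the alpha at which the two variances are equal.
    The ratio is above 1 at lo and below 1 at hi for contam_var = 9."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if var_ratio(mid) > 1:
            lo = mid  # median still worse: need more contamination
        else:
            hi = mid
    return (lo + hi) / 2

are_gaussian = 1 / var_ratio(0.0)  # efficiency of the median at alpha = 0
alpha_star = crossover()
print("efficiency of median, Gaussian data:", are_gaussian)
print("alpha where variances match:", alpha_star)
```

The last part of the problem (telling the two statisticians apart) then amounts to comparing two variances whose ratio is `var_ratio(0.0)`, so the sample size is a power calculation for a variance-ratio test.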
X_1, X_2, ... are independent variables in a least-squares regression with
independent observations of the dependent variable.
Show that the variance of beta_1 increases without limit as the number
of independent variables increases except under a very unrealistic condition.
State the condition.
Find bounds for var(yhat_i) and av(var(yhat_i)).
Infant mortality under 1 kg is 30%. A new treatment is expected to cut
this in half.
If a simple clinical trial is run to establish the effectiveness of the
new treatment, what sample size is required?
Justify your answer.
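A starting point for the justification is the usual normal-approximation sample size for comparing two proportions. The sketch below assumes a two-arm trial with equal allocation, a two-sided 5% test, and 80% power; none of these are stated in the problem, and choosing and defending them is part of the answer.

```python
import math
from statistics import NormalDist

def n_per_arm(p_control, p_treat, alpha=0.05, power=0.80):
    """Approximate patients per arm for a two-sided test of p_control = p_treat.
    Assumptions: equal allocation, normal approximation, no dropout."""
    z = NormalDist().inv_cdf
    z_a = z(1 - alpha / 2)
    z_b = z(power)
    p_bar = (p_control + p_treat) / 2
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p_control * (1 - p_control)
                             + p_treat * (1 - p_treat)))
    return math.ceil((num / (p_control - p_treat)) ** 2)

n80 = n_per_arm(0.30, 0.15)              # mortality 30% cut in half
n90 = n_per_arm(0.30, 0.15, power=0.90)  # same design with higher power
print("per arm at 80% power:", n80)
print("per arm at 90% power:", n90)
```

Rerunning with a less optimistic treatment effect (say 30% versus 20%) shows how sensitive the answer is to the phrase "expected to cut this in half".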
Math scores in grade 7 were obtained as follows.
25 schools randomly selected from the public school board.
25 schools randomly selected from the separate school board.
2 teachers randomly chosen in each school.
20 students randomly chosen from each teacher.
A standard analysis of variance table leads to a test of the hypothesis
that there is no difference in the scores of students in public and
separate schools.
What are its degrees of freedom?
X_1, X_2, ... are blood pressures of high-risk heart patients measured
from heart attack till death or 5 years, whichever comes first.
Half of the patients are randomly selected and placed on a low-salt diet.
The experimenter wants to know if the diet reduces blood pressure.
Draft the section on analysis that they must include in their grant
proposal.
The standard deviation of weights of faculty on the 6th floor is 50 kg.
Comment on the above statement.
Assuming the best of all possible worlds -- that the data actually are
independent Gaussian --
calculate the sample size to estimate the variance with a confidence interval
with width 10% of the true value.
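For the sample-size part, a large-sample approximation is enough to get the order of magnitude. The sketch below assumes a 95% interval, reads "width" as the full width of the interval, and uses the fact that s^2 / sigma^2 has standard deviation about sqrt(2 / (n - 1)) for Gaussian data; an exact chi-squared interval would give a slightly different number.

```python
import math
from statistics import NormalDist

def n_for_variance_ci(rel_width, conf=0.95):
    """Approximate n so that a CI for sigma^2 has full width rel_width * sigma^2.
    Assumptions: independent Gaussian data, normal approximation to s^2."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    # Full width ~ 2 * z * sqrt(2 / (n - 1)) * sigma^2; solve for n - 1.
    df = 2 * (2 * z / rel_width) ** 2
    return math.ceil(df) + 1

n_needed = n_for_variance_ci(0.10)
print("approximate sample size:", n_needed)
```

Comparing the result with the number of faculty on one floor is one way into the "comment" half of the problem.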
An election poll of 1500 indicates that the PQ are at 47% and the
Lib at 46%.
Compute the variance of the estimate % to vote PQ.
Compute the MSE of the same.
Compute a confidence interval for %PQ - %Lib.
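The variance and interval calculations can be sketched as follows, assuming simple random sampling and a multinomial model, with the observed percentages plugged in for the unknown proportions. Both shares come from the same 1500 respondents, so their estimates are negatively correlated. The MSE question is about bias, which no formula here supplies.

```python
import math
from statistics import NormalDist

def poll_summary(p1, p2, n, conf=0.95):
    """Variance of p1hat, variance of p1hat - p2hat, and a normal-theory CI.
    Assumption: one multinomial sample of size n (simple random sampling)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    var_p1 = p1 * (1 - p1) / n
    # Multinomial covariance: Cov(p1hat, p2hat) = -p1 * p2 / n.
    var_diff = (p1 * (1 - p1) + p2 * (1 - p2) + 2 * p1 * p2) / n
    half = z * math.sqrt(var_diff)
    return var_p1, var_diff, (p1 - p2 - half, p1 - p2 + half)

var_pq, var_diff, (lo, hi) = poll_summary(0.47, 0.46, 1500)
print("Var(PQ share):", var_pq)
print("Var(PQ - Lib):", var_diff)
print("95% CI for PQ - Lib:", (lo, hi))
```

Note that the variance of the difference is larger than the sum of the two separate variances, which is the opposite of what the independent two-sample formula would give.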
X_1, ..., X_200 are independent Gaussian.
Compute an approximation of the variance of log(s^2).
Compute an approximation of the variance of the usual t-statistic for
testing mu_X = 0.
X is Poisson with lambda > 10.
Compute an approximation of Var(X).
X is n^(-1) Binomial(n, p) with n p > 10.
Compute the transformation t(X) with variance approximately independent
of p.
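Once a candidate transformation has been derived by the delta method, it can be checked exactly by summing over the binomial probabilities. The sketch below tries the arcsine square-root transformation, which is the textbook candidate and is an assumption here rather than something the problem states; n = 200 and the three values of p are arbitrary choices.

```python
import math

def var_of_transform(n, p, t):
    """Exact Var[t(K / n)] for K ~ Binomial(n, p), by summing over the pmf."""
    pmf = [math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    vals = [t(k / n) for k in range(n + 1)]
    mean = sum(w * v for w, v in zip(pmf, vals))
    return sum(w * (v - mean) ** 2 for w, v in zip(pmf, vals))

def arcsine_sqrt(x):
    """Candidate variance-stabilizing transformation (an assumption)."""
    return math.asin(math.sqrt(x))

n = 200
probs = (0.2, 0.5, 0.8)
# n * variance, so the values are comparable across n.
scaled = {p: n * var_of_transform(n, p, arcsine_sqrt) for p in probs}
raw = {p: n * var_of_transform(n, p, lambda x: x) for p in probs}
for p in probs:
    print(p, "untransformed:", raw[p], "transformed:", scaled[p])
```

The untransformed column moves with p while the transformed column should stay nearly flat; pushing p toward 0 or 1 until n p drops below 10 shows where the approximation starts to fail.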
X is sigma^2 Chi squared with mean M and variance V.
Compute the degrees of freedom in terms of M and V.
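The moment matching for the scaled chi-squared can be written down in two lines and checked numerically. This is a sketch using the standard facts that a chi-squared variable on nu degrees of freedom has mean nu and variance 2 nu; the function name and the test values are made up for illustration.

```python
def scaled_chi2_params(mean, var):
    """Match X = sigma^2 * chi^2_nu to a given mean M and variance V.
    Uses E[X] = sigma^2 * nu and Var[X] = 2 * sigma^4 * nu."""
    nu = 2 * mean ** 2 / var
    sigma2 = var / (2 * mean)
    return nu, sigma2

# Round trip with hypothetical values sigma^2 = 3 and nu = 12.
true_sigma2, true_nu = 3.0, 12
M = true_sigma2 * true_nu
V = 2 * true_sigma2 ** 2 * true_nu
nu, sigma2 = scaled_chi2_params(M, V)
print("recovered nu:", nu, "recovered sigma^2:", sigma2)
```

The same matching applied to a statistic that is only approximately chi-squared gives its effective degrees of freedom.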
Y_1, ... Y_n are independent with mean 0 and variance sigma^2.
Calculate E[sum (X_i Y_i)^2] and Var[sum (X_i Y_i)^2]. Using the
above problem, estimate the effective degrees of freedom
of the n random variables (X_i Y_i).
If the independent variables of a straight line regression are Gaussian,
what is the effective degrees of freedom?
X_1, ... X_n are independent Bernoulli(p_i).
Compute the likelihood ratio statistic for testing p_i = p.
Compute its distribution.
Compute the mean and variance of this statistic and compare these values
with those of
the usual chi-squared approximation when p = 0.01 and n = 100,000.
Why is the approximation so bad? (Note n * p is huge so phat is approx.
Gaussian.)
X_1, ... X_n are independent Gaussian.
Let mhat(k) be the usual estimate of the k^th central moment.
Compute approximations of E[mhat(k)] and Var[mhat(k)].
Compute n(k) such that the standard deviation of mhat(k) is 10% of
E[mhat(k)].
Tabulate n(k) for k = 2, ..., 6.
What does this table tell you?
Repeat for X_i exponential.
Compare the table with the one above.
X_i,j, 1 <= i <= k, 1 <= j <= n_i, are independent with
means mu and variances sigma^2_i.
Find the minimum variance linear unbiased estimate of mu.
If the sigma^2_i are not known, estimate mu.
I forgot to mention in the above that n_3 = 2 and s^2_3 = 0.
Now what is muhat?
Experimental design.
84% of recent theses include some Monte Carlo calculations. These
are experiments.
And every PhD in statistics is expected to know the basics.
So go to the library and pick a thesis.
From the thesis complete the following:
Purpose _________ was the purpose of the experiment clearly stated?
Design __________ what terms were used to describe the design, e.g.
factorial, randomized block -- or does it look like somebody was just dreaming
up things to do?
Analysis of results ________ how was the form of analysis described,
e.g. paired t-test, analysis of variance ...?
Significance levels ________ were the claimed differences supported
by any tests of significance?
Conclusions ____ are the conclusions supported by the experiment --
are they related to the stated purpose?
Mark ____ assuming that the student was taking an undergraduate
course in experimental design, assign a mark for the
assignment for a) the design and b) the analysis.
This is really a question for your final oral. Two black balls
can fail you. Don't let this be one. Be warned.
X_1, ...X_n are Bernoulli variables indicating whether a child with a severe
chronic condition is above or below a threshold.
The nature of the disease is that patients vary up and down
from day to day, and without treatment X_i is Bernoulli with p = 1/2.
An investigator designs a study for a treatment that is supposed to
slowly increase p.
20 patients are to be followed for 1 year.
What is the critical value for a 5% test of p = 1/2?
Is this for a one-sided or two-sided test?
One day, about 3 months into the study, the investigator notes that
only 6 of the 20 are above the threshold and wonders if she should stop
the trial because it looks like the treatment is harmful.
Prepare a brief report based on calculations on the statistical considerations
involved.
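The calculations for the report can start from exact binomial tail probabilities. The sketch below is deliberately naive: it treats one day's 20 indicators as independent Bernoulli(1/2) under the null, and it ignores both the repeated daily measurements on each patient and the fact that the investigator is looking at the data part-way through. Those omissions are exactly the statistical considerations the report needs to discuss.

```python
from math import comb

def upper_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))

def lower_tail(n, k, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(0, k + 1))

def one_sided_critical_value(n, level=0.05, p=0.5):
    """Smallest c with P(X >= c) <= level (upper-tailed test of p = 1/2)."""
    for c in range(n + 2):
        if upper_tail(n, c, p) <= level:
            return c

c = one_sided_critical_value(20)
p_six_or_fewer = lower_tail(20, 6)
print("one-sided critical value:", c, "with size", upper_tail(20, c))
print("P(X <= 6) under p = 1/2:", p_six_or_fewer)
```

A useful follow-up is to ask how often at least one day in a three-month stretch would show 6 or fewer of 20 above the threshold even if the treatment did nothing.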