STA 438: Cloud-Based Research Computing For Statistics

Research computing in Statistics tends to be cyclic. In the typical project, there is a brief period of heavy use in which the researcher is running simulations. At this point almost no hardware is fast enough or has enough processors. Once the results are in, there is a period of analysis and writing, during which there is almost no need for computational power. Meanwhile, the hardware is sitting idle and becoming obsolete. Then a mistake is discovered or the next project comes along, and the cycle repeats. It's wasteful.

There are several solutions to this problem. One is to have a fairly large number of researchers sharing the same hardware, and contributing grant money to maintenance, the purchase of new equipment, and the salary of the system administrator who has to babysit the machine. If their projects are somewhat out of phase, the demand on the machine will be relatively even. This is the current model in the Department of Statistical Sciences at the University of Toronto (it's better in the summer). Another promising model, often overlooked, is to harness the unused capacity of the computers in student labs. Our project aims to explore a third option: commercial cloud computing.

The idea is to do the computing on a virtual machine that can be scaled up for the heavy computing phase, and then scaled back down immediately once the results are in. Depending on the cost of a minimal machine and the cost of scaling up, it might be possible for a researcher with a grant to pay a few hundred dollars for the brief use of something equivalent to a supercomputer. Then, while the researcher was writing about the results, there would be no need for a room full of expensive equipment with exotic air-conditioning requirements.

Here are some details. Research computing in Statistics consists mostly of simulations, and most people use R. In the typical simulation, the same thing is done over and over with varying inputs, perhaps with a different randomly generated set of data each time. In a unix/linux environment, if you run several R jobs at the same time, the operating system is smart enough to send each of them to a different processor. This makes it rather easy to split a big job into several almost identical parts, and do each part on a separate processor. Each part of the job writes an output file, and a shell script collects the output into one big data file for tabulation and analysis. The more processors there are, the faster the job gets done. There is no need for special multi-processor versions of the software. The usual R installation is good enough. A small sketch of this split-and-collect pattern appears at the end of this document.

Our plan is to pay a small amount of money (using Jerry's UTFA funds) for a virtual linux machine on a service like Amazon Web Services. Then we will install R, get the code ready, scale up to a large number of processors, run a simulation study, and scale back down again. The objective is to see whether we can do this, and to see how much it will cost. The main product will be a set of detailed instructions and an example.

We want the simulation project to be meaningful to both of us. So, since we were student and teacher in a course on categorical data analysis, we will examine the consequences of using ordinary linear regression with normal error terms when the response variable is binary and a logistic regression would be much more appropriate. In particular, we wonder how bad the p-values really are. Maybe they will not be as bad as one might expect. We shall see.
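To make the idea concrete, here is a rough R sketch of the kind of comparison we have in mind. It is not the actual design, which Abhirag will work out in the first phase; the sample size, the normal predictor, the true coefficients and the number of simulated data sets are all placeholder choices.

    # A minimal sketch: simulate one binary-response data set, analyze it two
    # ways, and return the two p-values for the slope. With beta1 = 0 the true
    # slope is zero, so rejection rates estimate the Type I error rate.
    one_run <- function(n = 100, beta0 = 0, beta1 = 0) {
      x <- rnorm(n)                                      # placeholder predictor
      p <- exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x))
      y <- rbinom(n, size = 1, prob = p)                 # binary response
      ols   <- summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]  # wrong model
      logit <- summary(glm(y ~ x, family = binomial))$coefficients["x", "Pr(>|z|)"]
      c(ols = ols, logit = logit)
    }

    set.seed(438)
    pvals <- t(replicate(1000, one_run()))   # 1000 simulated data sets
    colMeans(pvals < 0.05)                   # estimated Type I error rates

The real study would vary the sample size, the true coefficients and the distribution of the predictor, which is exactly the kind of repetitive work that parallel runs on a cloud machine should speed up.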
Consequently, in the first phase of the project, Abhirag will design the simulation study with a lot of input from Jerry, and then he will write the code -- meanwhile learning some linux and investigating Amazon Web Services.
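The code itself could follow the split-and-collect pattern described above: identical copies of one small script, each doing a slice of the simulation and writing its own output file, launched once per processor from the shell. The sketch below only illustrates the pattern; the file names, the chunk size and the --args convention are placeholder choices, and one_run() is the function from the earlier sketch, assumed to be saved in a file of its own.

    # A hypothetical chunk script, say "chunk.R". In practice something like
    #   nohup R CMD BATCH '--args 3' chunk.R chunk3.Rout &
    # would be launched once per processor, so each copy gets its own chunk number.
    source("one_run.R")                          # one_run() from the earlier sketch
    chunk <- as.integer(commandArgs(trailingOnly = TRUE)[1])
    set.seed(1000 + chunk)                       # different random numbers per chunk
    results <- t(replicate(250, one_run()))      # this chunk's slice of the job
    write.csv(results, file = paste0("results_chunk", chunk, ".csv"),
              row.names = FALSE)

    # Afterwards, a few lines of R (or a shell script) glue the pieces together.
    files <- list.files(pattern = "^results_chunk.*\\.csv$")
    all_results <- do.call(rbind, lapply(files, read.csv))
    colMeans(all_results < 0.05)                 # combined rejection rates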