Introduction to S and R


History and Terminology

Most major statistical packages are computer programs that have their own control language. The syntax goes with one computer program and just one. The SAS language controls the SAS software, and that's it. Minitab syntax controls Minitab, and that's it. S is a little different. Originally, S was both a program and a language; they were developed together at the former AT\&T Bell Labs starting in the late 1970's. Like the unix operating system (also developed around the same time at Bell Labs, among other places), S was open-source and in the public domain. ``Open-source" means that the actual program code (initially in Fortran, later in C) was public. It was free to anyone with the expertise to compile and install it.

Later, S was spun off into a private company that is now called Insightful Corporation. They incorporated both the S syntax and the core of the S software into a commercial product called S-Plus. S-Plus is \emph{not} open-source. The ``Plus" part of S-Plus is definitely proprietary. S-Plus uses the S language, but the S language is not owned by Insightful Corporation. It's in the public domain.

R also uses the S language. This is a unix joke. You know, like how the unix \texttt{less} command is an improved version of \texttt{more}. Get it? R is produced by a team of volunteer programmers and statisticians, under the auspices of the Free Software Foundation. It is an official GNU project. What is GNU? GNU stands for ``GNU's Not Unix." The recursive nature of this answer is a unix joke. Get it?

The GNU project was started by a group of programmers (led by the great Richard Stallman, author of \texttt{emacs}) who believed that software should be open-source and free for anyone to use, copy or modify. They were irritated by the fact that corporations could take unix, enhance it in a few minor (or major) ways, and copyright the result. Solaris, the version of unix used on many Sun workstations, is an example. An even more extreme example is Macintosh OS X, which is just a very elaborate graphical shell running on top of Berkeley Standard Distribution unix.

The GNU operating system was to look and act like unix, but to be rewritten from the ground up, and legally protected in such a way that it could not be incorporated into any piece of software that was proprietary. Anybody would be able to modify it and even sell the modified version -- or the original. But any modified version, like the original, would have to be open-source, with no restrictions on copying or use of the software. The main GNU project has been successful; the result is called linux.

R is another successful GNU project. The R development team rewrote the S software from scratch without using any of the original code. It runs under the unix, linux, MS Windows and Macintosh operating systems. It is free, and easy to install. Go to http://www.R-project.org to obtain a copy of the software or more information. There are also links on the course home page .

While they were redoing S, the R development team quietly fixed an extremely serious problem. While the S language provides a beautiful environment for simulation and customized computer-intensive statistical methods, the S software did the job in a terribly inefficient way. The result was that big simulations ran very slowly, and long-running jobs often aborted or crashed the system unless special and very unpleasant measures were taken. S-Plus, because it is based on the original S code, inherits these problems. R is immune to them. Anyway, S is a language, and R is a piece of software that is controlled by the S language. The discussion that follows will usually refer to S, but all the examples will use the R implementation of S --- specifically, R version 1.3.0 running under unix on tuzo. Mostly, what we do here will also work in S-Plus. Why would you ever want to use S-Plus? Well, it does have some capabilities that R does not have (yet), particularly in the areas of survival analysis and spatial statistics.

R as a Calculator

To start R, type ``R" and return at the unix prompt. Like this:

tuzo.erin > R

R : Copyright 2001, The R Development Core Team
Version 1.4.0  (2001-12-19)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type `license()' or `licence()' for distribution details.

R is a collaborative project with many contributors.
Type `contributors()' for more information.

Type `demo()' for some demos, `help()' for on-line help, or
`help.start()' for a HTML browser interface to help.
Type `q()' to quit R.

>

The S language is built around functions. As you can see above, even asking for help and quitting are functions (with no arguments).

The primary mode of operation of S is line oriented and interactive. It is quite unix-like, with all the good and evil that implies. R gives you a prompt that looks like a "greater than" sign. You type a command, press Return (Enter), and the program does what you say. Its default behaviour is to return the value of what you type, often a numerical value. In the first example, we receive the ">" prompt, type "1+1" and then press the Enter key. R tells us that the answer is 2. Then we obtain 23=8$.

> 1+1
[1] 2
> 2^3 # Two to the power 3
[1] 8

What is this [1] business? It's clarified when we ask for the numbers from 1 to 30.

> 1:30
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30

R will give you an array of numbers in compact form, using a number in brackets to indicate the ordinal position of the first item on each line. When it answered 1+1 with [1] 2, it was telling us that the first item in this array (of one item) was 2.

S has an amazing variety of mathematical and statistical functions. For example, the gamma function: Note that everything to the left of a # is a comment.

> gamma(.5)^2      # Gamma(1/2) = Sqrt(Pi)
[1] 3.141593
> gamma(500)       # Gamma(n) = (n-1)! Gamma of 500 = 499! is a big number, 
                   #                   but not infinity
[1] Inf
> lgamma(500)      # Log of the gamma function, surprisingly useful
[1] 2605.116

Assignment of values is carried out by a "less than" sign followed immediately by a minus sign; it looks like an arrow pointing to the left. The command x <- 1 would be read "x gets 1."

> x <- 1           # Assigns the value 1 to x
>  y <- 2
> x+y
[1] 3
> z <- x+y
> z
[1] 3
> x <- c(1,2,3,4,5,6)    #  Collect these numbers; x is now a vector

Originally, x was a single number. Now it's a vector (array) of 6 numbers. S operates naturally on vectors.

> y <- 1 + 2*x
> cbind(x,y)
     x  y
[1,] 1  3
[2,] 2  5
[3,] 3  7
[4,] 4  9
[5,] 5 11
[6,] 6 13

The cbind command binds the vectors x and y into columns. The result is a matrix whose value is returned (displayed on the screen), since it is not assigned to anything.

The bracket (subscript) notation for selecting elements of an array is very powerful. The following is just a simple example.

> z <- y[x>4]           # z gets y such that x > 4
> z
[1] 11 13

If you put an array of integers inside the brackets, you get those elements, in the order indicated.

> y[c(6,5,4,3,2,1)] # y in opposite order
[1] 13 11  9  7  5  3
> y[c(2,2,2,3,4)] # Repeats are okay
[1] 5 5 5 7 9
> y[7] # There is no seventh element.  NA is a missing value code
[1] NA

Most operations on arrays are performed element by element. If you take a function of an array, S applies the function to each element of the array and returns an array of function values.

> z <- x/y       # Most operations are performed element by element
> cbind(x,y,z)
     x  y         z
[1,] 1  3 0.3333333
[2,] 2  5 0.4000000
[3,] 3  7 0.4285714
[4,] 4  9 0.4444444
[5,] 5 11 0.4545455
[6,] 6 13 0.4615385

Traditional matrix operations are available too. First enter some matrices.


> A <- rbind( c(1,7,2,4),
+             c(5,9,3,1),
+             c(2,1,8,6) )
> a     # S is case sensitive
Error: Object "a" not found

> A
     [,1] [,2] [,3] [,4]
[1,]    1    7    2    4
[2,]    5    9    3    1
[3,]    2    1    8    6

> b <- rbind(c(1,1,1,1)) ; b
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1

> Ab <- A%*%b ; Ab      # Can't multiply 3x4 by a 1x4
Error in A %*% b : non-conformable arguments

> b <- cbind(c(1,1,1,1))
> Ab <- A%*%b ; Ab
     [,1]
[1,]   14
[2,]   18
[3,]   17

> t(b)%*%t(A)  # Transpose of a matrix product is matrix product of transposes
     [,1] [,2] [,3]
[1,]   14   18   17
>
> diag(4)
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    1    0    0
[3,]    0    0    1    0
[4,]    0    0    0    1
> I <- diag(4)
> A%*%I
     [,1] [,2] [,3] [,4]
[1,]    1    7    2    4
[2,]    5    9    3    1
[3,]    2    1    8    6

S can also solve systems of linear equations. For example, say we wanted to solve x1+x2=2 and x1-x2=0 simultaneously for x1 and x2. In matrix form, we want to solve Ax=b for x.

> help(solve)
solve                  package:base                  R Documentation

Solve a System of Equations

Description:

     This generic function solves the equation `a %*% x = b' for `x',
     where `b' can be either a vector or a matrix.

Usage:

     solve(a, b, tol = 1e-7, ...)

Arguments:

       a: a numeric matrix containing the coefficients of the linear
          system.

       b: a numeric vector or matrix giving the right-hand side(s) of
          the linear system.  If omitted, `b' is taken to be an
          identity matrix and `solve' will return the inverse of `a'.

     tol: the tolerance for detecting linear dependencies in the

> A <- rbind( c(1, 1),
+             c(1, -1) )
> b <- cbind(c(2,0))
> solve(A,b)
     [,1]
[1,]    1
[2,]    1

The solution of Ax=b is x = A-1b. If you omit the second argument of the solve function, you get A-1.

> solve(A)
     [,1] [,2]
[1,]  0.5  0.5
[2,]  0.5 -0.5
> > A%*%t(A)
     [,1] [,2] [,3]
[1,]   70   78   49
[2,]   78  116   49
[3,]   49   49  105
> AtA <- A%*%t(A)
> AtAinv <- solve(AtA)
> AtA %*% AtAinv
              [,1]         [,2]         [,3]
[1,]  1.000000e+00 1.582068e-15 6.661338e-16
[2,] -1.110223e-15 1.000000e+00 1.110223e-16
[3,]  0.000000e+00 1.665335e-16 1.000000e+00
> > det(A)
[1] -2
> det(AtA)
[1] 141750

> eigen(AtA)
$values
[1] 216.63767  64.16487  10.19745

$vectors
          [,1]       [,2]       [,3]
[1,] 0.5297662 -0.1858793  0.8275244
[2,] 0.6661834 -0.5126705 -0.5416352
[3,] 0.5249261  0.8382230 -0.1477658

> prod(eigen(AtA)$values)
[1] 141750

Unfortunately, R cannot solve systems of nonlinear equations.

S is a great environment for producing high-quality graphics, though we won't use it much for that. Here's just one example. We activate the pdf graphics device, so that all subsequent graphics in the session are written to a file that can be viewed with Adobe's Acrobat Reader. We then make a line plot of the standard nomal density and quit. First we define a function for the standard normal density.

Actually, graphics are a good reason to download and install R on your desktop or laptop computer. By default, you'll see nice graphics output on your screen. Under unix, it's a bit of a pain unless you're in an X-window environment (and we're assuming that you are not). You have to transfer that pdf file somewhere and view it with Acrobat or Acrobat Reader.

Continuing the session, a couple of interesting things happen when we quit. First, we are asked if we want to save the "workspace image." The responses are Yes, No and Cancel (don't quit yet). If you say Yes, R will write a file containing all the objects (variables, matrices, etc.) that have been created in the session. Next time you start R, your work will be "restored" to where it was when you quit.

> pdf("testor.pdf")
> ndens <- function(x) # Standard Normal Density
+   {
+   ndens <- 1/sqrt(2*3.14159) * exp(-x^2/2)
+   ndens
+   }
> ndens(0)
[1] 0.3989424
> cbind(-3:3,ndens(-3:3))
     [,1]       [,2]
[1,]   -3 0.00443185
[2,]   -2 0.05399099
[3,]   -1 0.24197083
[4,]    0 0.39894245
[5,]    1 0.24197083
[6,]    2 0.05399099
[7,]    3 0.00443185
> X <- seq(from=-3,to=3,by=.1)
> fX <- ndens(X)
> pdf("normaldens.pdf")
> plot(X,fX,type="l") # That's a lower case L
> q()
Save workspace image? [y/n/c]: yes
tuzo.erin > ls
testor.pdf

Notice that when we type ls to list the files, we see only testor.pdf, the pdf file containing the plot. Where is the workspace image? It's an invisible file; type ls -a to see all the files.

tuzo.erin > ls -a
./              ../             .RData          testor.pdf

tuzo.erin > R

R : Copyright 2001, The R Development Core Team
Version 1.4.0  (2001-12-19)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type `license()' or `licence()' for distribution details.

R is a collaborative project with many contributors.
Type `contributors()' for more information.

Type `demo()' for some demos, `help()' for on-line help, or
`help.start()' for a HTML browser interface to help.
Type `q()' to quit R.

[Previously saved workspace restored]

>

All the examples so far are interactive, but for serious work, it's better to work with a command file. Put your commands in a text (ASCII) file and execute them all at once. This way, if (when) you discover a little error, you can just edit your command file and re-run it.

Suppose your commands are in a file called commands.R. At the S prompt, you'd execute them with source("commands.R"). When you do it this way, you can't display objects by just typing their names. You need to use the print or cat function. For example, print(A).

Then you can copy and paste the output (all the output, please!) into a word processing document. Please use a fixed-width font like courier. Your printout will look much better.

Maximum Likelihood

Here are a couple of command files that do numerical maximum likelihood. They do the following:
  1. Define the function that computes the likelihood, or log likelihood, or minus log likelihood or whatever. I always do -2* log likelihood, because the minimum value of this function can be directly used in likelihood ratio tests.
  2. How we got the data into R -- usually a scan statement.
  3. Listing of the data
  4. The nlm statement and resulting output.
  5. Calculation of the MLE using an explicit formula, if this is possible.
# ex1.R    Bernoulli MLE (Example 1): true theta = 1/10
# execute with source("ex1.R")   at the R prompt          
cat("\n")
cat("1. Define the function BernLL \n")
BernLL <- function(theta,x) # -2 Log likelihood for Bernoulli distribution
                            # Data are in x
   {
   n <- length(x) ; sumx <- sum(x)
   BernLL <- -2*sumx*log(theta) - 2*(n-sumx)*log(1-theta)
   BernLL
   } # end of function BernLL
print(BernLL)
cat("\n")
cat("2. Read the data  \n")
ex1dat <- scan("ex1.dat")
cat("\n")
cat("3. Print the data \n")
print(ex1dat)
cat("\n")
cat("4. Minimize the function with nlm \n")
print(nlm(BernLL,.5,x=ex1dat))
print(warnings())
cat("\n")
cat("5. Compare exact MLE \n")
print(mean(ex1dat))
# Another way to do it. The output of nlm (like most R output)
# is a list of matrices, and this list caqn be assigned toa variable
# so you can get at it easliy. This is very unlike most stats packages.
# Bernsearch <- nlm(BernLL,.5,x=ex1dat)
# print(Bernsearch)
# thetahat <- Bernsearch$estimate ; print(thetahat)
##################### End of ex1.r ##############################


# ex2.R       Gamma MLE (Example 2)
# Run from the unix prompt with   R --vanilla < ex2.R > ex2.out
#
# 1. Define the function
GamLL <- function(theta,x) # -2 Log likelihood for Gamma distribution
                           # Data are in x
   {
   alpha <- theta[1] ; beta <- theta[2]
   n <- length(x) ; sumx <- sum(x) ; sumlogx <- sum(log(x))
   GamLL <- 2*n*alpha*log(beta) + 2*n*lgamma(alpha) + 2*sumx/beta -
            2*(alpha-1)*sumlogx
   GamLL
   } # end of function GamLL
#
# 2. Read the data
gamdat <- scan("ex2.dat")
# 3. Print the data
print(gamdat)
#
# 4. Minimize the function with nlm. Except, watch out!
#    The starting value can matter.
#
print(nlm(GamLL,c(1,1),x=gamdat)[1:2])
print(nlm(GamLL,c(10,10),x=gamdat)[1:2])
print(nlm(GamLL,c(100,100),x=gamdat)[1:2])
# Need good starting values - Try method of moments.
xbar <- mean(gamdat) ; s2 <- var(gamdat)
astart <- xbar^2/s2 ; bstart <- s2/xbar
print(nlm(GamLL,c(astart,bstart),x=gamdat)[1:2])
#
# 5. There is no explicit formula for the MLE, but we can look at
#    the Method of Moments estimates astart and bstart.
astart ; bstart # Invisible with source function.
##################### End of ex2.R ##############################