3. Some Notes to Help Get You Started

These are some notes that I put together for my class, "Human Population Biology," offered in the Department of Anthropological Sciences at Stanford. It has a decidedly evolutionary anthropology flavor, but it should help get you started thinking about doing demographic work in R.

R Objects

Some discussion of objects. Then we talk about vectors and matrices and eventually lists and data frames.

Manipulating Lists: Vectors and Matrices

A vector is simply a list of numbers arranged either as a row or a column. Say that we perform a census on a community of chimpanzees. Nishida et al. (1990) suggest that the chimpanzee life cycle can be meaningfully structured according to the following categories: (1) infant, (2) older infant, (3) juvenile, (4) adolescent, (5) older adolescent, (6) prime adult, and (7) old adult. In 1980, Nishida et al.'s census of the Mahale Mountain M-group revealed that the community is composed of 7 infants, 13 older infants, 8 juveniles, 13 adolescents, 5 older adolescents, 35 prime adults, and 9 old adults. We could represent this information in a compact manner by defining a column vector cx:

cx.jpg

To define this vector in R, simply type:
> cx <- c(7, 13, 8, 13, 5, 35, 9)
Typing cx will cause R to echo the new variable:
> cx
[1] 7 13 8 13 5 35 9

Nearly everything that you type into R will be in the form of lhs ← rhs. What this means is that you name something on the left hand side (lhs) of the assignment operator (i.e., "<-") and enter the value for that thing on the right hand side (rhs). In this chimpanzee example, the lhs is cx, and the rhs is the vector of stage abundances. You can use any string (i.e., combination of letters and numbers and underscores) for a name as long as it begins with a letter. While you can give a variable any name, it is good practice to choose meaningful names that will help you remember what they stand for. If you have a vector of reproductive values, rather than calling it stuff, instead call it something like vx. R variables and functions are case sensitive, so the variable c is distinct from the variable C.
Vectors are formed by concatenating a list of numbers using the R command c. The elements of the list are separated by commas.
> cx <- c(7, 13, 8, 13, 5, 35, 9)
The default for R is not to echo input. To view your input simply type, cx at the command prompt:
> cx
[1] 7 13 8 13 5 35 9

Nishida et al. (1990) actually provide a time series of population counts. It is often convenient to represent such time series in a single array (e.g., if you wanted to plot changes in demographic composition over time). Using the Mahale census data from 1988, we can construct a matrix of population counts composed of two censuses separated by eight years. The matrix will contain seven rows corresponding to the seven stages of the chimpanzee life cycle. It will also contain two columns corresponding to the two different years, 1980 and 1988. To construct this matrix, we type:
> cx1980 <- c(7, 13, 8, 13, 5, 35, 9)
> cx1988 <- c(9, 11, 15, 8, 9, 38, 0)
> C <- cbind(cx1980, cx1988)
The output of these commands is:


> C
cx1980 cx1988
[1,] 7 9
[2,] 13 11
[3,] 8 15
[4,] 13 8
[5,] 5 9
[6,] 35 38
[7,] 9 0

We used the command cbind to bind two vectors together in an array. Note that R is agnostic as to whether a vector of numbers that you enter is a row vector or column vector. By using the command cbind, we forced them to be read as column vectors.

What would happen is you typed in C <- c(cx1980, cx1988) ?
Now what would happen if you typed
> C <- t(cbind(cx1980, cx1988))?

The R command t is the transpose operator, which exchanges the rows and columns of a matrix.
When combining vectors and matrices in R, it is very important that you keep track of the sizes of your arrays. For example, Boesch and Boesch-Achermann (2000) categorize the chimpanzee life cycle in a different way from Nishida and colleagues. Instead of the seven stages, Boesch & Boesch-Achermann employ just four: (1) infant, (2) juvenile, (3) adolescent, and (4) adult. Their census data for 1982 is entered below:
> cx.boesch <- c(18,10,15,30)
> cx.boesch
[1] 18 10 15 30

What happens if we try to combine the Nishida and Boesch data?
> cbind(cx.boesch,C)


cx.boesch cx1980 cx1988
[1,] 18 7 9
[2,] 10 13 11
[3,] 15 8 15
[4,] 30 13 8
[5,] 18 5 9
[6,] 10 35 38
[7,] 15 9 0

Warning message:
number of rows of result is not a multiple of vector length (arg 1) in: cbind(cx.boesch, C)

We get a warning message telling us that we tried to combine two things of different lengths, but we still get an array of numbers for output. R uses what is known as a recycling rule to combine arrays of differing lengths. This means that the elements of the shorter structure get re-used, starting over with the first, until it is the length of the longer structure. Note that the warning message is only given if the longer structure is not a multiple of the shorter one:

> a <- c(2,4)
> b <- c(1,3,1,3)
> cbind(a,b) a b
[1,] 2 1
[2,] 4 3
[3,] 2 1
[4,] 4 3

The recycling rule can be a powerful programming tool, but it can also cause serious (and potentially comical) problems. Beware.

Accessing Elements in Arrays

Arrays are indexed by their row and column addresses. Wood and Smouse (1982) conducted a demographic study of the Gainj of the New Guinea highlands. From their data, we can construct a Leslie matrix with five 10-year age classes. Remember that a Leslie matrix contains age-specific fertilities along the first row and age-specific survival probabilities along the subdiagonal. The matrix for the Gainj is:

gainj.jpg

We can enter these data into R as follows:

> gainj <- c(0, 0, 1.0791, 1.1676, 0.25,0.939, 0, 0, 0, 0, 0, 0.995, 0,
+ 0, 0,0, 0, 0.981, 0, 0,0, 0, 0, 0.973, 0)
> gainj.leslie <- matrix(data=gainj,nrow=5,ncol=5,byrow=TRUE,dimnames=NULL)
> gainj.leslie


[,1] [,2] [,3] [,4] [,5]
[1,] 0.000 0.000 1.0791 1.1676 0.25
[2,] 0.939 0.000 0.0000 0.0000 0.00
[3,] 0.000 0.995 0.0000 0.0000 0.00
[4,] 0.000 0.000 0.9810 0.0000 0.00
[5,] 0.000 0.000 0.0000 0.9730 0.00

What did we just do? First, we entered the Leslie matrix data. We entered these data row-wise, meaning that our vector gainj has as it entries the entire first row, followed by the second, and so on. Note that by hitting before closing the parentheses, R suspends input and allows you to continue typing your input on the next line. We then used the command matrix to re-shape the 25-element vector into a 5 × 5 matrix.
(As an aside, what would we have done if we could not quite remember the syntax for forming a matrix from a data vector? Probably the easiest way to check on the syntax of an R function is by using the command line help facility. For our case, type ?matrix. Depending on your system, you will either get a new window with the R Help Information of the matrix command or the information will appear on your command line. )
Is is frequently useful to extract a specific element of the matrix we have constructed. R provides a variety of facilities for extracting elements of arrays (R Development Core Team 2005b; Venables and Ripley 1999). Probably the simplest is to use square brackets with the matrix index. Say we are interested in the age-specific fertility value of women between ages 20-30. This demographic rate is represented by the third element of the first row. To extract this, type:

> gainj.leslie[1,3]
[1] 1.0791

Vectors are indexed by a single number. Say we want to get the number of chimpanzee juveniles from the 1980 M-group at Mahale. Juveniles are the third stage of the developmental sequence, consequently, the number of juveniles will be the third element of the vector.
> cx[3]
[1] 8

Frequently, we want to extract more than a just a single element from an array we have made. For example, we might want all of the age-specific fertility values from a Leslie matrix. One way to do this is as follows:
> gainj.Fx <- gainj.leslie[1,]

Table1:
operator function
== equals
<= less than or equal
>= greater than or equal
! not
& and
| or

> gainj.Fx
[1] 0.0000 0.0000 1.0791 1.1676 0.2500

By not specifying a column index following the comma, we instruct R to extract all columns corresponding to the specified row index.
To extract multiple rows or columns, we us a colon to specify the range. If we wanted, for example, the 4×4 submatrix of the Gainj Leslie matrix that had the survival probabilities along the diagonal, we would do the following:

> gainj.leslie[2:5,1:4]


[,1] [,2] [,3] [,4]
[1,] 0.939 0.000 0.000 0.000
[2,] 0.000 0.995 0.000 0.000
[3,] 0.000 0.000 0.981 0.000
[4,] 0.000 0.000 0.000 0.973

R allows the use of negative subscripts as well. This can be useful if, for example, we want all of a matrix but one column. To get only the first 4 columns of the Gainj Leslie matrix using negative indices:
> gainj.leslie[,-5]

[,1] [,2] [,3] [,4]
[1,] 0.000 0.000 1.0791 1.1676
[2,] 0.939 0.000 0.0000 0.0000
[3,] 0.000 0.995 0.0000 0.0000
[4,] 0.000 0.000 0.9810 0.0000
[5,] 0.000 0.000 0.0000 0.9730

Another powerful way to access matrix elements is by using a logical vector. Say that we wanted to access only the non-zero elements of the Gainj Leslie matrix. We can do this efficiently with a logical vector. The commonly used logical operators are given in table 1.
Logical vectors underlie the concept of conditional execution. That is, an operation is only executed if the value of the logical is true.

> q <- gainj.leslie != 0
> gainj.leslie[q]
[1] 0.9390 0.9950 1.0791 0.9810 1.1676 0.9730 0.2500

What does the structure q look like?
Logical vectors and matrices are powerful tools for speeding up calculations. We can frequently use a logical matrix in place of a control loop when programming, saving valuable computational time. Note that for the previous example, the traditional way to assemble a vector of the nonzero elements of the Leslie matrix would be to use two nested for loops with a logical evaluation of the form if the ijth number is not equal to zero, then put it in the vector, otherwise move on.
Now that we have a projection matrix, we can project the population forward in time. Assume we start with a population structure of a single person in each of the five age classes. We therefore need to create an initial population vector which we will call no. We also need to create a matrix to hold the projected population. We will call that N. The length of our projection will be ten years, so with our initial population, we will need a matrix which contains 11 columns.

> no <- rep(1,5)
> N <- matrix(0,nrow=5,ncol=11)
> N[,1] <-no > for (i in 2:11){
> N[,i] <- gainj.leslie%*%N[,i-1]
> }

To plot the change in the total population size, we can sum the columns of our population and then use the R function plot. To sum along the columns of N, we use the R function apply.

> pop <- apply(N,2,sum)
> t <- 0:10 > plot(t,pop)
> lines(t,pop)

Basic Operations

Typically, the reason we want to construct vectors and matrices or extract elements from them is so we can perform operations on them. If we know the age-specific survival probability of 0-5 year-olds, and we have a census count of the number of newborns in a year, we might want to make a prediction about the number of these newborns who will still be alive at the end of five years. To do this, we multiply the px value for the 0-5 year period by the appropriate entry from the cxvector. If p5= .78 and c1= 123, then the predicted number of survivors in five years is:

gainjProj.jpg

Figure 3: 100-year (10 interval) projection of an initial population of five females using the Gainj Leslie Matrix (Equation 2).

> n <- .78 * 123
> n
[1] 95.94

The operator (*) means multiplication. The other basic arithmetic operators are (+) for addition, (-) for subtraction, (/) for division, and (^) for power.

Plots and Graphs

One of the coolest things about R is its ability to make awesome graphics. We'll be keeping it simple for the time-being. A graph is made taking two (usually) vectors of the same length and plotting the elements of one against the elements of the other. The command for this is plot(), which takes three arguments, the third being optional.
Binford and Chasko (1976) present vital statistic data for the Nunamiut of Alaska, USA 1935-1968. Entering the time series of population counts given in their table 10, we get.

> nun <- c(66, 66, 65, 67, 68, 70, 69, 71, 71, 74, 74, 73, 74, 74, 76,
+ 79, 82, 84, 87, 88, 91, 96, 99, 104, 107, 114, 117, 121, 127, 128,
+ 131, 134, 136, 136)
>
> year <- 1935:1968

nunamiut.jpg
Figure 4: Changes in Total Population Size of the Nunamiut, 1935-1968.

To plot this time-series of population counts against the year, we simply type the following command:

> plot(year,nun)
> lines(year,nun)

This yields the plot presented in figure 4.
The plot in figure 4 allows us to visualize the time series of population counts well, but it is very basic. To improve the impact of the plot, we might want to add color to it. We should also give it more descriptive axis labels and possibly add a title.
R provides an incredible array of tools for creating professional graphics. We will focus on only a few graphics commands.
Let's re-plot the data with the points in color, the line a different color, and add both axis labels and a title to the plot. We can control the plotting symbol using the argument pch=n, where n is a number between 0 and 18. We define the plot as a scatter plot of points using type="p". Other possible types of plots include "l" for lines, "b" for both, as well as a variety of others. We will use type "p" to facilitate making the points and lines different colors.

> plot(year,nun,pch=16,col="red",type="p",
+ xlab="Year", ylab="Population Size")
> lines(year,nun,col="blue")

nunamiut_red.jpg
Figure 5: Changes in Total Population Size of the Nunamiut, 1935-1968, re-plotted.

> title(main="Nunanimut Population Dynamics, 1935-1968")

The results are plotted in figure 5.
A graphical means for diagnosing exponential population growth is to see if the plot of the logarithm of population size against time produces a straight line. We can specify logarithmic axes using the argument log. This argument can take the value of "x", "y" or "xy" depending on which axis we want to be a logarithmic scale. Going back to the Gainj population projection, we can plot the data on semilogarithmic axes as follows:

> plot(t,pop,pch=16,col="red",type="p",log="y",
+ xlab="Year", ylab="Population Size")
> lines(t,pop,col="blue")

Figure 6 shows that the Gainj population projection shows exponential growth. The plot is approximately linear on the semilogarithmic axes.

gainjSemilog.jpg
Figure 6: 100-year population projection of the Gainj, semilogarithmic axes.

Multiple Plots

We frequently want to plot multiple things on the same axes. For example, we may wish to simultaneously plot empirical observations or the results of a simulation along with the predictions of a theoretical model. R makes this very easy. In this example, we will generate a stochastic growth process and compare the results from the expected deterministic results. We begin by generating random normal variates with mean 1.02 and standard deviation 0.02.

> x <- rep(0,100)
> x[1] <- 20

> # generate normal random numbers for rate of increase
> w <- rnorm(100,mean=1.02,sd=0.02)

> # growth with noise
> for (i in 2:100) x[i] <- x[i-1]*w[i]
> plot(t,x,pch=3,col="red")

> #compare to plot without noise
> xx <- rep(0,100) > xx[1] <- 20
> R <- mean(w)
> for (i in 2:100) xx[i] <- R*xx[i-1]
> lines(t,xx,col="blue")

Note that we use a # to indicate a comment which R will ignore. Comments are essential when you write complex scripts, but they can also be handy when entered at the command line, particularly if you save your work history for future reference (a highly recommended practice!).
Not surprisingly, the model fits very well with the simulation (figure 7).

growthComp.jpg
Figure 7: Comparison of simulated growth data with white noise and the deterministic model

Other Plots

Histograms, barplots, etc.

Lists & Data Frames

So far, we have talked about data structures such as vectors and matrices that can be considered discrete units (i.e., we perform operations on them as complete structures even though we may wish to extract information from them). However, we also frequently want to work with data that are related, but will not necessarily be used together.

Lists A list is simply a collection of different items. Say that we were collecting data on the kinship networks of children. We interview heads of households about all the children currently residing in their home, and determine whether these children are natal or fostered, whether their mother and father are alive and the current ages of the child's parents or, if the child is fostered, their parents' ages at death. A list which would hold this information for a particular child might look like this:

> child1 <- list(name="mary", child.age=6, status="foster", + mother.alive=F, father.alive=T, parents.ages=c(24,35))

This list contains character data, numeric data, and logical data. There are a number of options for accessing the components of a list. Probably the easiest is to use the component names. To extract components using the component names, we use the $ operator. For example, if we wanted to determine if the child's mother were alive, we could type:

>child1$mother.alive
[1] FALSE
$

We can add components to a list using the concatenate function

> child1 <- c(child1, no.siblings=2, sib.ages=c(1,3))

Data Frames A data frame is an R object that holds a data matrix. It is essentially a list of variables, all with the same length.
Data frames can be entered at the R command line, but the most common means of creating data frames is to read them in from delimited text files. R also has a special library for reading in common binary formats (R Development Core Team 2005a).
We will use the read.table function to create a data frame holding life table information for the country of Venezuela in 1963 (Keyfitz and Flieger 1990). The columns of the data file are: age, number of women in age class (nKx), number of deaths (nMx), and the number of births (nBx). Put the name of the text file you are reading in double quotes. If the file is not in your working directory, you need to specify the path. The path specification will vary by the machine that you use. The R documentation strongly recommends that you specify a header. It does seem to make the whole process go more smoothly, so let's do that.

> ven <- read.table("/home/jhj1/Teaching/summercourse/venezuela.deaths.dat",
+ header=TRUE)
>
> ven

 
  age    nKx  nMx    nBx 
1   0 151377 7670      0 
2   1 569468 3640      0 
3   5 589891  926      0 
4  10 489546  299    796 
5  15 392537  334  49162 
6  20 324897  565 104087 
7  25 281655  572  88138 
8  30 253215  665  59394
9  35 214278  637  38580 
10 40 170131  801  10870 
11 45 146577  820   2251 
12 50 121309  998    268 
13 55  90784 1132      0 
14 60  74128 1811      0 
15 65  52962 1107      0 
16 70  36029 1504      0 
17 75  25374 1055      0 
18 80  15262 1345      0 
19 85   6757 1412      0 
s with lists, we can access individual columns of the data frame by using the $ operator:

> ven$age
[1] 0 1 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85
$

When we are working intensively with one data frame, it frequently becomes inconvenient to have to type full variable names with the $ operator. To overcome this, we can use the R function attach, which will make the columns of the data frame visible as variables themselves.

> attach(ven)
> age
[1] 0 1 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85

>
> detach(ven)

Getting Help

When you need help with, say, the syntax of an R function, probably the simplest way to get help is with the command line help facility. To create the stochastic growth model above, we needed to generate random normal variates. If we wanted to get help about the function rnorm, we could type at the command line ?rnorm. Depending on the system, we would either get on the command line or in a newly spawned window a help file. Here is the beginning of the help file for rnorm:

Normal package:base R Documentation
The Normal Distribution
Description:


Usage:

Arguments: