4. Making a Period Life Table
# Making a life table provides lots of R lessons. Here, I step
# through the logic of constructing a period life table taken from the
# Human Mortality Database.
# Learn to Love Your Text Editor
# Here are the first couple lines of a file downloaded from the HMD
# website which I have named "usa_deaths_1x1.txt". There is a comment
# in the first line.
# The United States of America, Deaths (5x1) Last modified: 27-Aug-2004
Year Age Female Male Total
1959 0 47902.43 64566.45 112468.88
1959 1-4 7683.28 9492.76 17176.04
1959 5-9 3714.58 5349.69 9064.27
# Reading Data Tables
#It is always a good idea to include a header row when you read in a
#ASCII table of data
usa.deaths <- read.table("usa_deaths_1x1.txt", header=TRUE, skip=1)
# Here's what the first 5 rows look like:
usa.deaths[1:5,]
Year Age Female Male Total
1 1959 0 47902.43 64566.45 112468.88
2 1959 1 3248.39 3836.93 7085.32
3 1959 2 1856.79 2423.21 4280.00
4 1959 3 1381.59 1750.88 3132.47
5 1959 4 1196.51 1481.74 2678.25
my.data <- read.table("data.txt", header=TRUE)
# Frequently, there will be text before the data begin describing the
# data set, telling you when it was last updated, etc. Here, the
# skip=n option is useful. The file I read had a one-line comment at
# the beginning, so I passed a skip=1 argument to read.table().
# This tells R to ignore the first line of the file and to treat the
# second as a header indicating the column names.
# Make sure that the data file contains only data after the skipped
# portion. For example, in mortality data, the last open age interval
# will often be denoted with something like 85+ or 110+ (as in the
# HMD). You need to use your text editor to remove such
# marks. Otherwise, R will not recognize the contents of the file as
# values. Instead it will treat them as factors. This can lead to all
# sorts of maddening problems.
# Subsetting Data Using Logicals
# Say we wanted to construct a life table for 1959, the first year in
# the file. We need to extract the number of recorded female deaths
# nDx, needed to construct a life table.
nKx <- usa.deaths$Female[usa.deaths$Year == 1959]
# We would use an analogous procedure to get the exposures, nKx, from
# the relevant data frame. Say we had done that and now had three
# vectors: "Age", "nKx", "nDx".
# We can put these into a data frame:
usa.data <- data.frame("Age"=Age, "nKx"=nKx, "nDx"=nDx)
# This will create a data frame with three columns and appropriate
# column names (the things in quotes -- they need not be the same as
# the variable names, of course).
# Vectorized calculations
# The first thing we need to calculate a period life table is the
# central death rate. The calculation is:
nMx <- usa.data$nDx/usa.data$nKx
# If the two vectors are the same size R, will divide the one by the
# other elementwise, no problem. What happens if they're different
# sizes?
# Note that we have to make the variables "nDx" and "nKx" visible to
# R. We do this by using the data.frame name "usa.data" and indicate
# which columns of this data frame to use with the dollar sign. If
# you plan to use the columns of a data frame frequently for a
# session, you can attach the data frame, making all its columns
# visible to R.
attach(usa.data)
# Now to calculate the central death rates, we simply type
nMx <- nDx/nKx
# In order to construct a life table, we need to convert the central
# death rates to age-specific probabilities of death, nqx. The
# Greville equation shows that we can do this with the only additional
# information of the average age of individuals dying in the interval
# x to x+n, nax.
# For a life table with 5 year age classes for age=5,10,..., it is
# usually a reasonable assumption that nax=2.5. That is, deaths are
# uniformly distributed through the interval. Its the first two nax
# values we need to worry about. We can easily do this in R:
nax <- rep(2.5,length(Age))
# Now, we just need to substitute the correct values of nax for nax[1]
# and nax[2].
# Assuming that you have been given values of b0 and b1 (e.g., from
# Keyfitz, or Coale-Demeny), substitute for nax[1] and, depending on the
# model, also substitute an appropriate value for nax[2].
nax[1] <- b0 + b1 *nMx[1]
nax[2] <- 1.5
# The average of of death in an open age interval is probably not
# x+n/2. A standard assumption is that the nax value for the last age
# class is the inverse of the observed nMx value.
nmax <- length(nMx) # create a variable to hold this as it will be useful again
nax[nmax] <- 1/nMx[nmax]
# You need to specify the widths of the age classes. R is very
# flexible with the way it handles concatenated lists. Here's a way
# to do it for the standard age classes of a human life table:
n <- c(1,4, rep(5, nmax - 3),999)
# The first age class is one year wide, the second is four, the next
# nmax-3 is five, and the last is some arbitrarily large number.
# Apply the Greville equation to convert nMx -> nqx:
nqx <- (n*nMx) / (1 + (n-nax)*nMx)
nqx[nmax] <- 1.0 # by definition
# Here's a nifty vectorized trick to calculate lx from nqx. No for loops!
lx <- c(1,1-qx)
lx <- cumprod(lx)
# (we could actually do this in one line: lx <- cumprod(c(1,1-qx)) )
# Vectorized calculation of ndx. This is just lx-lx+n. Do this in
# one line without a loop:
dx <- -diff(lx)
# In order to calculate nLx from lx, we need two things lx and dx.
# Unfortunately, they appear with different indices, which makes for a
# tricky for-loop. We need the lx value for age x but a ndx for age
# x+n. Here's trick that allows us to avoid complicated indexing:
lxpn <- lx[-1]
# If you want to remove a single element i from a list, use a negative
# index. This only works for removing a single element. If, say, we
# wanted to remove the first three elements of nMx, we would use
# nMx[4:nmax].
# Having removed the first element from lx and called it lxpn, we have
# solved the staggered-index problem without a for-loop:
nLx <- n*lxpn + ndx*nax
# Tx is the person years of remaining life in a cohort. To get Tx, we
# need to sum nLx from each age x forward to the end of the life
# table. Using a nifty vectorized trick, quite specific to R, we can
# simply calculate this:
Tx <- rev(cumsum(rev(nLx))
# We reverse nLx, calculate the cumulative sum and then reverse it
# again (thereby making it the right direction).
# ex is Tx divided by lx. lx is one element too long (remember, we
# prepend it with a 1 (by definition)):
ex <- Tx/lx[1:nmax]
# Now all we need to do is put all the columns in a structure. It's
# nice to round the numbers to make it more readable:
lt <- cbind("age"=Age,
"nax" = round(nax,4),
"nMx" = round(nMx,4),
"nqx" = round(nqx[1:nmax],4),
"lx" = round(lx[1:nmax],4),
"ndx" = round(ndx,4),
"nLx" = round(nLx,4),
"Tx" = round(Tx,2),
"ex" = round(ex,2) )
# Note that in order to put a bunch of columns together we don't use
# c(). Instead, we use cbind(). If we wanted rows, we'd have used
# rbind().
Posted by jhj1 at August 18, 2005 11:21 AM
Comments
Post a comment
Thanks for signing in, . Now you can comment. (sign out)
(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)