Intercensal Population Estimates

A common pitfall when using small area population data spanning decades is a potential series break around census enumeration years, which results from how the Census Bureau estimates population. The structural break at issue stems from the Bureau’s switch from the 2010 decennial census to a 2020 “blended base” as the starting point for its estimates. This is typically a non-issue for data users because the Census Bureau eventually releases “intercensal” population series, which are linked over time to reconcile the estimates based on the previous census with the actual census enumeration. However, given the COVID-19 pandemic’s effects on 2020 census data collection and tabulation, the Census Bureau has not yet fully incorporated the 2020 enumeration into its population estimates. Instead, it relies on a “blended base” that combines data from the 2020 enumeration, the 2020 Demographic Analysis, and the Vintage 2020 population estimates. While the blended base may be phased out in the future, the Census Bureau has yet to release an intercensal series for 2010 through 2020. Consequently, an empirical analysis that uses a series with a known break will produce biased estimates. In this post, I demonstrate how to compute an intercensal series.

Terminology

Reader beware—there is a lot of jargon in the following discussion. To help keep it all straight (for my sake as well!), I will list the key concepts up front. Here I refer to an intercensal series as a linked time series of population data. Linking the data over time means tethering the population estimates to two census enumerations to form a consistent series that starts with one census and ends with the next. To form an intercensal series, we need a postcensal series and a census. A postcensal series consists of population estimates carried forward from the previous census enumeration by adding or subtracting the demographic components of change, i.e., births, deaths, and net migration.

Background

The Census Bureau uses a modification of the standard cohort component method to produce population estimates each year. Their approach, known as the administrative record method, uses administrative data on births, deaths, and migration to produce midyear population estimates by demographic detail, i.e., age, sex, race, and ethnicity.

The standard cohort component framework for each area \(i\) follows the balancing equation

\[ P_{i, t+1} = P_{i, t} + B_{i, t} - D_{i,t } + NM_{i,t}, \]

where

  • \(P_{i,t}\) is the midyear population in period \(t\)
  • \(B_{i,t}\) is the number of births over \(t\) to \(t+1\)
  • \(D_{i,t}\) is the number of deaths over \(t\) to \(t+1\)
  • \(NM_{i,t}\) is the level of net migration (in- minus out-migration) over \(t\) to \(t+1\)
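As a quick illustration, the balancing equation amounts to simple bookkeeping. Here is a minimal R sketch with made-up numbers for a single area:

```r
# One-year cohort component update for a single area (illustrative numbers)
pop_t   <- 100000  # midyear population in year t
births  <- 1200    # births from t to t+1
deaths  <- 900     # deaths from t to t+1
net_mig <- -150    # net in- minus out-migration from t to t+1

# Balancing equation: P_{t+1} = P_t + B - D + NM
pop_t1 <- pop_t + births - deaths + net_mig
pop_t1  # 100150
```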

Each of the underlying components relies on administrative records acquired from various agencies, including birth and death certificates from the National Center for Health Statistics (NCHS), Form 1040 filings from the Internal Revenue Service (IRS), Medicare enrollment from the Centers for Medicare and Medicaid Services (CMS), and other records held by the U.S. Census Bureau. The Census Bureau updates these component datasets each year, including revisions to previous years, to produce its midyear estimates.

The Das Gupta Method

The standard method to construct intercensal population estimates is the Das Gupta (1981) 6-factor method. The method was introduced in an internal Census Bureau technical memorandum in the 1980s and remains the standard approach for forming intercensal population time series today. The approach reconciles the error of closure between the postcensal estimates based on the previous census and the count from the most recent enumeration: Das Gupta takes the ratio of the enumeration to the postcensal estimate and distributes it geometrically across the preceding decade.

The general formula for the Das Gupta method is

\[ P_{i,t} = Q_{i,t} \Bigg( \frac{P_{i,t_1}}{Q_{i,t_1}} \Bigg)^\frac{t - t_0}{t_1 - t_0}, \]

where

  • \(P_{i,t}\) is the intercensal population estimate (so \(P_{i,t_1}\) is the census count at \(t_1\))
  • \(Q_{i,t}\) is the postcensal population estimate
  • \(t_1\) is the date of the most recent enumeration
  • \(t_0\) is the date of the previous enumeration
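In R, the formula boils down to a few lines. The sketch below mirrors the `das_gupta()` helper called in the adjustment step later; the argument names and the default enumeration dates are my own assumptions:

```r
# Das Gupta (1981) geometric allocation of the error of closure.
# `pop`     : postcensal estimates Q_{i,t} to adjust
# `date`    : the estimate date t (Date class)
# `closing` : error of closure P_{i,t1}/Q_{i,t1}
das_gupta <- function(pop, date, closing,
                      t0 = as.Date("2010-04-01"),
                      t1 = as.Date("2020-04-01")) {
  # Fraction of the decade elapsed at each estimate date, in days
  frac <- as.numeric(date - t0) / as.numeric(t1 - t0)
  # Distribute the closure ratio geometrically across the decade
  pop * closing^frac
}
```

At \(t_0\) the exponent is zero, so no adjustment is applied; at \(t_1\) the full closure ratio applies; dates in between receive a geometric fraction of it.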

Data

My primary data source is the Population Estimates Program (PEP) from the U.S. Census Bureau. The population estimates provide the total population for various age groups for the intercensal period of April 2010 through July 2021. Specifically, I draw on two sets of postcensal estimates—the Vintage 2020 evaluation estimates and the Vintage 2021 postcensal estimates. The Vintage 2020 evaluation estimates are based on the 2010 census and provide postcensal population estimates from April 1st, 2010 through July 1st, 2020. The subsequent Vintage 2021 postcensal estimates are based on a “blended base” instead of the complete 2020 census and span April 1st, 2020 through July 1st, 2021.

The first step is to clean up and assemble the PEP data.

# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# Prep a list of county geographies ----
# 
# Here I use an external list of counties (and equivalents) based on the 
# 2020 definitions.
#
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

# Libraries used throughout
library(tidyverse)
library(glue)

# Census geographies
geo_link <- "https://www2.census.gov/geo/docs/reference/codes2020/national_county2020.txt"

# Load the geography list
df_geos <- read_delim(file      = geo_link,
                      delim     = "|",
                      col_types = cols(.default = "c")) %>% 
  
  # Make GEOID
  mutate(GEOID = paste0(STATEFP, COUNTYFP))


# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# DL the PEP data (if needed) ----
#
# Collect the population by age and sex data, downloading the data if not
# already saved.
#
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

# PEP data links
pep_links <- paste(
  "https://www2.census.gov/programs-surveys/popest/datasets", 
  c("2010-2020/counties/asrh/CC-EST2020-AGESEX-ALL.csv",
    "2020-2021/counties/asrh/cc-est2021-agesex-all.csv"),
  sep = "/"
)

# Get the PEP datasets
dir.create("./data-raw", showWarnings = FALSE)
lapply(pep_links, function(file){
  # Save with a lowercase name so the read step below finds both vintages
  # on case-sensitive filesystems
  local_file <- file.path("./data-raw", tolower(basename(file)))
  if(!file.exists(local_file)) {
    download.file(file, local_file)
  }
})
## [[1]]
## NULL
## 
## [[2]]
## NULL
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# Clean up the PEP data ----
#
# Append the Vintage 2020 and 2021 `AGESEX` data, sub-setting to include 
# only total population and ages 16 plus, and generating the 5-digit FIPS
#
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

# Vintages to read
vntg_list <- c(2020, 2021)

# Compile the data
df_popest <- map_dfr(vntg_list, function(vntg){
  
  # Read in the data
  read_csv(file       = glue("./data-raw/cc-est{vntg}-agesex-all.csv"),
           na         = c("", "X"),
           col_select = c("SUMLEV",
                          "STATE",
                          "COUNTY",
                          "YEAR",
                          "POPESTIMATE",
                          "AGE16PLUS_TOT"),
           col_types = "dddddd") %>% 
  
  # Assemble variables
  mutate(GEOID = sprintf("%02.0f%03.0f", STATE, COUNTY),
         YEAR  = case_when(vntg == 2021 ~ YEAR + 14,
                           TRUE ~ YEAR))
  
  }) %>% 
  
  # Sort by area and year
  arrange(GEOID, YEAR)

Closing Errors

Next I compute the closing error for each county (and equivalent). The closing error is simply the ratio of the April 1st, 2020 blended base to the April 2020 postcensal estimate. This measure captures the extent to which the cumulative components of change applied to the 2010 census are consistent with the most recent 2020 population base used to construct the next set of postcensal projections. Where the error of closure is greater than one, the components of change tend to undershoot an area’s population. Conversely, a closure error of less than one shows that the cohort component method overshoots the 2020 blended base population. In either case, observed differences between the 2010 postcensal estimates and the 2020 blended base may result from measurement error in the underlying demographic and vital statistics.
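For intuition, here is a toy example (numbers invented): suppose an area’s April 2020 postcensal estimate is 100,000 but its blended base is 103,000.

```r
# Toy closing error: blended base over postcensal estimate, April 1, 2020
postcensal_apr20 <- 100000  # Vintage 2020 postcensal estimate
blended_apr20    <- 103000  # Vintage 2021 blended base

closing <- blended_apr20 / postcensal_apr20
closing  # 1.03: the components of change undershot by about 3%
```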

# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# Compute the closing error ----
#
# Subset the Vintages to the April 2020 postcensal estimate (Vintage 2020)
# and the April 2020 blended base (Vintage 2021) and compute the error of 
# closure between the two vintages.
#
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


df_closing <- left_join(
  
  # Vintage 2020
  df_popest %>% 
    filter(YEAR == 13) %>% 
    select(GEOID, POP20_TOTAL = POPESTIMATE, POP20_16PLUS = AGE16PLUS_TOT),
  
  # Vintage 2021
  df_popest %>% 
    filter(YEAR == 15) %>% 
    select(GEOID, POP21_TOTAL = POPESTIMATE, POP21_16PLUS = AGE16PLUS_TOT),
  
  # Join on `GEOID`
  by = "GEOID"
  
  ) %>% 
  
  # Closing errors
  mutate(CLOSING_TOT = POP21_TOTAL/POP20_TOTAL,
         CLOSING_16P = POP21_16PLUS/POP20_16PLUS) %>% 
  
  # Clean up
  select(GEOID, CLOSING_TOT, CLOSING_16P)

Das Gupta Adjustment

With the closure errors in hand for each county (and equivalent), I can now allocate them using the Das Gupta (1981) method (Census Bureau, 2021). Here I subset the Vintage 2020 evaluation estimates to the data from April 2010 through July 2019 and append the Vintage 2021 data from April 2020 through July 2021 (recall, I am adjusting the Vintage 2020 series prior to April 2020 to match the Vintage 2021 blended base). I then create a DATE column to reflect the appropriate date for each estimate and allocate the closing error to each Vintage 2020 postcensal data point based on the number of days elapsed since April 1st, 2010.

# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
# Das Gupta (1981) adjustment ----
#
# Use the previously computed errors of closure to adjust the Vintage 2020
# population series for each area.
#
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


df_adjusted <- bind_rows(
  
  # Vintage 2020
  df_popest %>% 
    filter(YEAR %in% 2:12),
  
  # Vintage 2021
  df_popest %>% 
    filter(YEAR %in% 15:17)
  
  ) %>% 
  
  # Sort by county and year
  arrange(GEOID, YEAR) %>% 
  
  # Recode the PEP `YEAR` codes to calendar dates
  mutate(DATE = case_when(
           YEAR == 1  ~ "2010/04/01",
           YEAR == 2  ~ "2010/04/01",
           YEAR == 3  ~ "2010/07/01",
           YEAR == 4  ~ "2011/07/01",
           YEAR == 5  ~ "2012/07/01",
           YEAR == 6  ~ "2013/07/01",
           YEAR == 7  ~ "2014/07/01",
           YEAR == 8  ~ "2015/07/01",
           YEAR == 9  ~ "2016/07/01",
           YEAR == 10 ~ "2017/07/01",
           YEAR == 11 ~ "2018/07/01",
           YEAR == 12 ~ "2019/07/01",
           YEAR == 13 ~ "2020/04/01",
           YEAR == 14 ~ "2020/07/01",
           YEAR == 15 ~ "2020/04/01",
           YEAR == 16 ~ "2020/07/01",
           YEAR == 17 ~ "2021/07/01"),
         DATE = as.Date(DATE)
         ) %>% 
  
  # Join the closing errors
  left_join(df_closing, by = "GEOID") %>% 
  
  # Das Gupta adjustment
  mutate(POPTOT_ADJ = ifelse(YEAR < 13,
                             das_gupta(POPESTIMATE, DATE, CLOSING_TOT),
                             POPESTIMATE),
         
         POP16_ADJ  = ifelse(YEAR < 13,
                             das_gupta(AGE16PLUS_TOT, DATE, CLOSING_16P),
                             AGE16PLUS_TOT)
         ) %>% 
  
  # Drop unnecessary cols
  select(-STATE, -COUNTY) %>% 
  
  # Add the area metadata
  left_join(df_geos, by = "GEOID") %>% 
  
  # Arrange cols
  relocate(SUMLEV, GEOID, STATEFP, COUNTYFP, COUNTYNAME, DATE, YEAR)

Results

After applying the Das Gupta (1981) method, we now have a complete set of linked population time series for each county (and equivalent) in the U.S. Having read this far, one might wonder: is the change in vintage really that big of an issue? My obvious economist’s answer is: “it depends.”

First, I plot the distribution of closing errors for the total county (and equivalent) population. I draw a vertical dashed line at one, where the postcensal (Vintage 2020) and blended base (Vintage 2021) estimates coincide, i.e., where the 2010-based cohort component model perfectly reproduces the new 2020-based population base. My computed errors of closure show that most areas require an upward historical adjustment, since the blended base population is often larger than the postcensal population computed from the 2010 census and cumulative vital events and migration. The differences in the unadjusted series may therefore introduce bias into subsequent empirical work.

Figure 1: Distribution of Closing Errors Between Vintages 2020 and 2021

Below is a table that shows the five largest and smallest errors of closure. Generally, counties (and equivalents) with the largest differences between the Vintage 2020 and 2021 estimates are relatively small areas.

Table 1: Top and Bottom 5 Errors of Closure

FIPS    Area Name             Closing Error    April 2020 Population
48301   Loving County         0.3720930        64
48229   Hudspeth County       0.6514751        3,202
48137   Edwards County        0.7379346        1,422
13191   McIntosh County       0.7634252        10,975
02068   Denali Borough        0.7713197        1,619
25007   Dukes County          1.1788944        20,600
02185   North Slope Borough   1.1854917        11,031
25019   Nantucket County      1.2556153        14,255
16065   Madison County        1.3140863        52,913
06105   Trinity County        1.3150506        16,112

Why might it be that pronounced differences exist between the two sources? One might expect that differences would be small for areas where the demographic and vital statistics are relatively complete and well measured. Unfortunately, the answer is not so clear cut. For example, let’s take a relatively large metropolitan county — Cook County, IL.

Figure 2: Population Estimates for Cook County, IL

In the above figure, the red line indicates the Vintage 2020 postcensal estimates, which show a noticeable downward trend since 2014. The more recent blended base data that incorporate the 2020 census results reveal a sizable jump of around 3% of Cook County’s population, or roughly 154,000 people, simply from moving from Vintage 2020 to Vintage 2021. While 3% may not appear large compared to Cook County’s immense population of over 5 million, the Chicago area alone contains over 100 individual municipalities. The closure error alone amounts to almost nine of my hometowns (somewhere in the Chicago area) combined!

Even in larger areas like Cook County, IL, the error of closure may be small percentage-wise, but can represent a noticeable structural break in the county’s population time series. This is especially true when looking at economic data represented as rates. For example, using the PEP data to construct crime rates for Cook County (be careful when looking at crime rates for counties, see Maltz and Targonski, 2004), we might observe a statistically significant decline in the crime rate simply due to the discontinuity in the crime rate denominator. The issue naturally compounds for areas with larger errors of closure as well.
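To see the rate mechanics concretely, here is a hypothetical example (all numbers invented) of how a break in the denominator alone shifts a crime rate per 100,000:

```r
# A constant crime count with a 3% upward break in the population
# denominator produces a spurious drop in the measured crime rate
crimes   <- 25000
pop_pre  <- 5.0e6           # population just before the vintage change
pop_post <- pop_pre * 1.03  # population just after (3% closure error)

rate_pre  <- crimes / pop_pre  * 1e5  # 500 per 100,000
rate_post <- crimes / pop_post * 1e5  # about 485 per 100,000
rate_pre - rate_post                  # roughly 15 points of pure artifact
```

Nothing about the underlying crime process changed; the entire "decline" comes from the denominator discontinuity.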

As shown in Table 1, the closure errors tend to be largest for smaller counties (and equivalents). For example, the closing error for the smallest county, Loving County, TX, is around 0.37, meaning the blended base is only about 37% of the Vintage 2020 postcensal estimate. In this case, the cohort component method substantially overshot the new population base derived from the 2020 census. The following figure shows the original and adjusted population series for Loving County, TX.

Case in point, discontinuities introduced by the change in vintage are an empirical concern.

Future Work

This post presents a simple approach to construct intercensal population estimates for counties and equivalents spanning 2010 through 2021. These series are linked by adjusting for errors of closure and therefore allow for consistent comparisons of area populations across time. They are particularly useful when constructing county-level panel data for recent decades. The correction is imperfect and serves as a stopgap until the Census Bureau releases the official intercensal series for 2010 through 2020. In the meantime, the proposed method links the Vintage 2020 and 2021 time series and allows for comparisons over time. Other federal statistical agencies, such as the Bureau of Economic Analysis, use similar ad hoc bridging to link the population series until the Census Bureau releases the intercensal data.

Data Availability

All code is available on GitHub, and I will eventually publish the resulting data by age/sex and by race/ethnicity on openICPSR.

Andrew C. Forrester
Economist and statistician in Washington, DC working on economic statistics, labor and financial economics, time series and seasonal adjustment, and quantitative demography. All views are my own.