Creating Time Series Data from the American Community Survey (ACS)

One of the most important decisions in my Covid Demographics Explorer app was how to communicate the demographic changes that occurred between 2019 and 2021. I started by computing the raw population changes and saw that San Francisco County’s population fell by 66,348. But that number, by itself, is hard to reason about. To interpret it, you need to understand how the population normally fluctuates. That led me to graph the population estimates (and percent changes) over the course of the entire American Community Survey (ACS) like this:

These charts let you easily conclude that the drop that occurred during Covid was unusually large.

The only problem with this graph is that it is computationally expensive to generate. The Census Bureau API limits you to requesting information about a single year during each API call. That means that the above graph took 17 API calls to make. For a single, ad hoc analysis this is fine. But for a web application this can lead to a serious performance problem.
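To make the cost concrete, here is a minimal sketch that builds one request URL per year. The URL pattern follows the Census Bureau's public API for ACS 1-Year data. The function name, the choice of B01003_001E (total population) as the variable, and the decision to skip 2020 (which is also absent from the year lists in the script's output) are my assumptions for illustration; this is not code taken from gen_county_data.py.

```python
# Years of ACS 1-Year data assumed here: 2005 through 2022, skipping 2020.
ACS1_YEARS = [year for year in range(2005, 2023) if year != 2020]


def build_acs1_urls(variable="B01003_001E", years=ACS1_YEARS):
    """Return one county-level request URL per year (hypothetical helper).

    Each URL covers exactly one year, so a chart spanning N years
    needs N separate requests.
    """
    urls = []
    for year in years:
        base = f"https://api.census.gov/data/{year}/acs/acs1"
        urls.append(f"{base}?get=NAME,{variable}&for=county:*")
    return urls


if __name__ == "__main__":
    urls = build_acs1_urls()
    print(f"{len(urls)} requests needed")
    print(urls[0])
```

Nothing here touches the network; it only shows that the number of requests grows with the number of years on the chart.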

My solution was to write a standalone script, gen_county_data.py, which generates all the data the app will ever need ahead of time. The script takes about 90 seconds to run and ends by writing the data to the file county_data.csv, which I check in to GitHub. The app reads this file into a dataframe on startup, and uses it to generate the graphs.

By and large this approach works well. If you need to create a similar dataset, I recommend using gen_county_data.py as a starting point. But you should also be aware that there are issues with treating the ACS as a time series. The script documents these issues, but I feel that they are significant and surprising enough to warrant documenting here as well.

Variables Change Meaning

Variables in the ACS have two components: a “Name” and a “Label”. As an example, in the 2022 1-Year ACS the variable for “Total Worked from Home” has “Name” B08006_017E and “Label” Estimate!!Total:!!Worked from home. This led me to write a script that gets the value for B08006_017E over the entire ACS and creates graphs like this:

And yet there’s clearly a problem. The 400% increase between 2019 and 2021 makes sense due to Covid. But what can explain the 1,000+% increase from 2005 to 2006?

The answer shocked me: in 2005 the variable B08006_017E has label Estimate!!Total!!Motorcycle! That is, it was used for something completely different than what it was used for in subsequent years. If you want the number of people who worked at home in 2005 you need to use B08006_021E.
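One way to guard against this is to look up the variable name by year instead of hard-coding a single name. The sketch below is illustrative only: the function and constant names are hypothetical, and the only facts it encodes are the ones stated above (B08006_021E in 2005, B08006_017E from 2006 through 2022).

```python
# Year-specific overrides for the "worked at home" count.
# Only the 2005 exception is known from the discussion above.
WORKED_AT_HOME_OVERRIDES = {2005: "B08006_021E"}
WORKED_AT_HOME_DEFAULT = "B08006_017E"

FIRST_YEAR, LAST_YEAR = 2005, 2022


def worked_at_home_variable(year):
    """Return the ACS variable name to request for a given year."""
    if not FIRST_YEAR <= year <= LAST_YEAR:
        # Labels outside this range have not been checked, so refuse to guess.
        raise ValueError(f"Variable labels have not been checked for {year}")
    return WORKED_AT_HOME_OVERRIDES.get(year, WORKED_AT_HOME_DEFAULT)


if __name__ == "__main__":
    for year in (2005, 2006, 2022):
        print(year, worked_at_home_variable(year))
```

Raising an error for unchecked years is a deliberate choice: a silent default is exactly how the 2005 bug slipped through.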

Handling Variable Changes

The most dangerous part of these variable changes is that they can go undetected. After encountering the above bug I took two steps to reduce the chance of them happening again.

Safeguard #1: New App: “ACS-Var-Checker”

I was surprised that there appears to be no easy way to check all labels a variable has had over the course of the ACS. Yesterday afternoon I published an app that I think would have helped me debug this issue faster. I call the app “ACS-Var-Checker” and you can view it here. You simply give the app a variable name and ACS Span and it tells you all labels that variable has had over time. Here is a screenshot of the app:

Safeguard #2: Output in gen_county_data.py

After generating the data, gen_county_data.py calculates the unique labels each variable has had over the years. It outputs the result like this:

B08006_017E has the following labels:
    'Estimate!!Total!!Worked at home' in years [2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018]
    'Estimate!!Total:!!Worked from home' in years [2019, 2021, 2022]

As you can see, the variable still has two different labels, even after removing 2005 from the dataset. However, this label change is completely innocuous. If you wind up modifying the script to add new variables, I highly recommend that you carefully read this output!
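The grouping behind that output can be sketched in a few lines. This is not the code from gen_county_data.py; it assumes you already have a mapping from year to label for one variable (how you fetch the labels is left out), and the function names are hypothetical.

```python
from collections import defaultdict


def group_years_by_label(label_by_year):
    """Map each distinct label to the sorted list of years that used it."""
    years_by_label = defaultdict(list)
    for year in sorted(label_by_year):
        years_by_label[label_by_year[year]].append(year)
    return dict(years_by_label)


def format_label_report(variable, label_by_year):
    """Build a report in the same shape as the output shown above."""
    lines = [f"{variable} has the following labels:"]
    for label, years in group_years_by_label(label_by_year).items():
        lines.append(f"    {label!r} in years {years}")
    return "\n".join(lines)


if __name__ == "__main__":
    # A small hand-written sample, not a full download.
    sample = {
        2005: "Estimate!!Total!!Motorcycle",
        2006: "Estimate!!Total!!Worked at home",
        2018: "Estimate!!Total!!Worked at home",
        2019: "Estimate!!Total:!!Worked from home",
    }
    print(format_label_report("B08006_017E", sample))
```

A variable with more than one label is not necessarily a problem, but each extra label is a prompt to check whether the meaning changed or only the wording.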

Counties Come and Go

Another difficulty with creating time series data from the ACS is that geographies change over time. I was aware that ZIP Code Tabulation Areas (ZCTAs) change frequently, but I thought that this would not be an issue if I used Counties. I was wrong.

The authoritative document in this area is the Census Bureau’s Substantial Changes to Counties and County Equivalent Entities: 1970-Present. One example: in 2022 Connecticut replaced its 8 counties with 9 “Planning Regions”! The Planning Regions have different boundaries than the Counties, so there is no obvious way to do a time series analysis over a period that includes this change.

Another issue is that I am using the ACS 1-Year Estimates. These estimates are only published for counties that have a population of 65,000 or more. Naturally some counties oscillate above and below this threshold, and so pop in and out of the dataset.

Handling Geography Changes

There are multiple ways you can handle these geography changes. In my case, the key point of the dataset is to compare changes between 2019 and 2021. So I only include counties in the final dataset if they appear in both those years.

The script first loops through each year, getting data on all counties which have data that year. After that dataset is generated, it creates a list of the counties which are present in both 2019 and 2021. It then filters the original dataframe to just those counties. In the run shown here, this removes 40 counties from the original dataset (858 before filtering, 818 after). The script documents this transformation like this:

Generating all historic data took 104.7 seconds.
The resulting dataframe has 13,782 rows with 858 unique counties.
818 counties appear in both 2019 and 2021.
After filtering df_county_data to only those counties, the resulting dataframe has 13,585 rows with 818 unique counties.
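The filtering step can be sketched in plain Python. The real script works on a dataframe (df_county_data); this version uses a list of dictionaries so it runs without extra libraries, and the "county" and "year" keys and both function names are assumptions made for the example.

```python
def counties_in_all_years(rows, required_years=(2019, 2021)):
    """Return the set of counties that have a row in every required year."""
    years_by_county = {}
    for row in rows:
        years_by_county.setdefault(row["county"], set()).add(row["year"])
    required = set(required_years)
    return {
        county
        for county, years in years_by_county.items()
        if required <= years  # subset test: all required years are present
    }


def filter_to_counties(rows, keep):
    """Keep every row (in any year) that belongs to one of the kept counties."""
    return [row for row in rows if row["county"] in keep]


if __name__ == "__main__":
    rows = [
        {"county": "A", "year": 2018},
        {"county": "A", "year": 2019},
        {"county": "A", "year": 2021},
        {"county": "B", "year": 2019},  # missing 2021, so B is dropped
        {"county": "C", "year": 2021},  # missing 2019, so C is dropped
    ]
    keep = counties_in_all_years(rows)
    filtered = filter_to_counties(rows, keep)
    print(f"{len(keep)} counties appear in both years; {len(filtered)} rows remain.")
```

Note that a kept county retains all of its rows, including years other than 2019 and 2021, so the historical charts still work for it.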

Conclusion

It seems natural to view the ACS as a time series. But two technical issues (variable changes and geography changes) make this more complicated than it sounds. The script I created for this project, gen_county_data.py, can serve as a blueprint for others who need to create similar datasets. And the app I created (acs-var-checker) can help detect when variables change meaning over the years.

 

Ari Lamstein

I’m a software engineer who focuses on data projects.

I most recently worked as a Staff Data Science Engineer at a marketing analytics consultancy. While there I developed internal tools for our data scientists, ran workshops on data science and mentored data scientists on software engineering.
