New Python Functions for Working with Multi-Year ACS Data

As part of my hometown_analysis project I’ve written some new functions for working with multi-year data from the American Community Survey (ACS). The functions are download_multiyear, graph_multiyear and pct_change_multiyear. This code is open source and released under the MIT License. It currently lives in utils.py in the hometown_analysis repo. While I don’t think the functions warrant a package on PyPI, I do think that they can help others doing similar projects. This post demonstrates how to use them.

The central problem these functions address is that the ACS is not designed as a time-series. So every time you analyze a table over multiple years you need to write certain boilerplate code. These functions handle a lot of those tasks for you. The intention is that these functions speed up exploratory data analysis.

This post walks through an example of using these functions to examine table B05012 (Nativity in the United States) for Great Neck Union Free School District in New York over time. The first ACS 5-year estimates are the 2009 vintage, which covers 2005 through 2009. Because each 5-year estimate pools five years of survey responses, you are not supposed to compare vintages whose survey periods overlap. This leaves us with 3 vintages to compare: 2009, 2014 and 2019. The 2024 estimates are scheduled to be released later this year.
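To make the vintage selection concrete, here is a minimal sketch of how the non-overlapping years can be listed. The helper name non_overlapping_vintages and its parameters are my own illustration and are not part of utils.py or censusdis.

```python
def non_overlapping_vintages(first=2009, last=2019, span=5):
    """Return the end years of 5-year estimates whose survey periods do not overlap.

    Each vintage covers `span` years ending in that year, so stepping by
    `span` guarantees that consecutive vintages share no survey years.
    """
    return list(range(first, last + 1, span))


print(non_overlapping_vintages())           # [2009, 2014, 2019]
print(non_overlapping_vintages(last=2024))  # [2009, 2014, 2019, 2024]
```

Once the 2024 estimates are available, passing last=2024 would add a fourth comparable vintage.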

Downloading Multiple Years of Data

As a reminder, here is how we can use ced.download in the censusdis package to download data for a single year:
import censusdis.data as ced

from censusdis.datasets import ACS5
from censusdis.states import NY

df = ced.download(
    dataset=ACS5,
    vintage=2009,
    group="B05012",
    state=NY,
    school_district_unified="12510",
)
print(df)
STATE SCHOOL_DISTRICT_UNIFIED  B05012_001E  B05012_002E  B05012_003E  \
0    36                   12510        44953        31623        13330   

             GEO_ID                                             NAME  
0  9700000US3612510  Great Neck Union Free School District, New York
Some things to note:
  • The vintage parameter to ced.download takes a single year.
  • The data we get back uses “variables” as column names (e.g. “B05012_001E”). We need to do some work to convert them to “labels” (such as “Total”).
  • The dataset has some columns (such as STATE) which feel a bit redundant given that we know the state we requested data about.

The API for download_multiyear is meant to mirror that of ced.download. The primary difference is that the new vintages parameter takes a list of years:

from utils import download_multiyear

df = download_multiyear(
    dataset=ACS5,
    vintages=[2009, 2014, 2019],
    group="B05012",
    state=NY,
    school_district_unified="12510"
)
df
   Total  Native  Foreign-Born  Year
0  44953   31623         13330  2009
0  45249   30096         15153  2014
0  45044   30755         14289  2019

Under the hood, download_multiyear calls ced.download for each year and combines the results into a single dataframe. It also adds a Year column. download_multiyear has 3 additional parameters which have default values:
  • rename_vars=True. If True then rename the columns from variables to labels. The labels from the last year are used. Only the last portion of the label (!! is a separator) is used and any trailing : is dropped.
  • drop_cols=True. If True then drop columns which do not contain survey data. These tend to be geographic metadata.
  • prompt=True. download_multiyear emits a warning if a variable’s label changed during the selected years. If prompt is True then users are also prompted to confirm that they want to continue with the download despite the label mismatch. In order to reduce false positives : is removed when doing the comparison (e.g. “Total:” and “Total” are considered identical).
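
The "download each year, then combine" step can be sketched roughly as follows. This is not the actual code in utils.py: combine_years and fake_download are hypothetical names, and fake_download stands in for ced.download so the example runs without network access. The totals are the B05012_001E values shown earlier in this post.

```python
import pandas as pd


def combine_years(fetch_one_year, vintages):
    """Call fetch_one_year(vintage) for each year and stack the results."""
    frames = []
    for vintage in vintages:
        df_year = fetch_one_year(vintage).copy()
        df_year["Year"] = vintage  # record which vintage each row came from
        frames.append(df_year)
    return pd.concat(frames)


def fake_download(vintage):
    """Hypothetical stand-in for ced.download; returns one row per call."""
    totals = {2009: 44953, 2014: 45249, 2019: 45044}
    return pd.DataFrame({"B05012_001E": [totals[vintage]]})


combined = combine_years(fake_download, [2009, 2014, 2019])
print(combined)
```

Because each single-year result has one row with index 0, the combined dataframe repeats the index 0 for every year, which matches the output shown above.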
I first learned that ACS variables can change meaning when doing my Covid Demographics Explorer project. As I described in this blog post, the variable B08006_017E changed from “Estimate!!Total!!Motorcycle” in 2005 to “Estimate!!Total!!Worked at home” in 2006.
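
The label handling described above can be sketched in a few lines of plain Python. The function names simplify_label and labels_match are my own for illustration, and the real implementation in utils.py may differ in its details.

```python
def simplify_label(label):
    """Keep the last '!!'-separated part of an ACS label and drop a trailing ':'."""
    return label.split("!!")[-1].rstrip(":")


def labels_match(label_a, label_b):
    """Compare two labels while ignoring ':' characters."""
    return label_a.replace(":", "") == label_b.replace(":", "")


print(simplify_label("Estimate!!Total!!Worked at home"))  # Worked at home
print(labels_match("Total:", "Total"))                    # True
print(labels_match("Estimate!!Total!!Motorcycle",
                   "Estimate!!Total!!Worked at home"))    # False
```

The last call reproduces the B08006_017E case: the two labels differ, so a check like this would trigger the warning.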

Graphing Multiple Years of ACS Data

Putting the data in such a simple form makes it easy to write a function to graph it. That’s what graph_multiyear does:

from utils import graph_multiyear

graph_multiyear(
    df=df,
    title="Population by Nativity in Great Neck School District",
    yaxis_title="Population"
)

These graphs will render interactively on your local machine. However, I could not figure out how to make them interactive in WordPress.

The optional parameter y_cols allows you to render only a subset of the columns:
graph_multiyear(
    df=df,
    title="Population by Nativity in Great Neck School District",
    yaxis_title="Population",
    y_cols=["Native", "Foreign-Born"]
)

 

Graphing Percent Change

While Pandas has a function pct_change, it is difficult to use on our dataset because it works on all columns (including the “Year” column). Since I anticipate doing this operation multiple times in this analysis (including boilerplate code like rounding the result), I wrote the function pct_change_multiyear:

from utils import pct_change_multiyear

df = pct_change_multiyear(df)
print(df)

graph_multiyear(
    df=df,
    title="Percent Change in Population by Nativity in Great Neck School District",
    yaxis_title="Percent Change"
)
   Total  Native  Foreign-Born  Year
0    NaN     NaN           NaN  2009
0    0.7    -4.8          13.7  2014
0   -0.5     2.2          -5.7  2019
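
For readers curious about the approach, here is a minimal sketch of a percent change calculation that leaves the Year column alone. It is an approximation of the idea, not the code in utils.py; the name pct_change_sketch and its parameters are assumptions of mine. The input values are the ones downloaded earlier in this post.

```python
import pandas as pd


def pct_change_sketch(df, year_col="Year", decimals=1):
    """Percent change between consecutive rows for every column except year_col."""
    data_cols = [col for col in df.columns if col != year_col]
    # fill_method=None: do not forward-fill missing values before differencing.
    changes = df[data_cols].pct_change(fill_method=None) * 100
    result = changes.round(decimals)
    result[year_col] = df[year_col]  # re-attach the untouched Year column
    return result


df = pd.DataFrame({
    "Total": [44953, 45249, 45044],
    "Native": [31623, 30096, 30755],
    "Foreign-Born": [13330, 15153, 14289],
    "Year": [2009, 2014, 2019],
})
print(pct_change_sketch(df))
```

The first row is NaN because there is no earlier vintage to compare against.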


Conclusion

The functions download_multiyear, graph_multiyear and pct_change_multiyear have sped up my ability to do exploratory data analysis of ACS data that involves multiple years of the same table. The code is open source and available for others to use as well. Please contact me if you have any questions.

Ari Lamstein

I’m a software engineer who focuses on data projects.

I most recently worked as a Staff Data Science Engineer at a marketing analytics consultancy. While there I developed internal tools for our data scientists, ran workshops on data science and mentored data scientists on software engineering.
