As part of my hometown_analysis project I’ve written some new functions for working with multi-year data from the American Community Survey (ACS). The functions are download_multiyear, graph_multiyear and pct_change_multiyear. This code is open source and released under the MIT License. It currently lives in utils.py in the hometown_analysis repo. While I don’t think the functions warrant a package on PyPI, I do think that they can help others doing similar projects. This post demonstrates how to use them.
The central problem these functions address is that the ACS is not designed as a time series, so every time you analyze a table over multiple years you need to write the same boilerplate code. These functions handle many of those tasks for you, with the goal of speeding up exploratory data analysis.
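To make the boilerplate concrete, here is a minimal sketch of the loop you would otherwise write by hand: fetch each year, tag it with the year, and stack the results. This is not the code in utils.py. The helper names fetch_one_year and fetch_many_years are hypothetical, and the stub returns the B05012 numbers from the example output in this post instead of calling the Census API.

```python
import pandas as pd

# Stub data standing in for a real single-year download (e.g. ced.download).
# The values are the B05012 estimates shown in this post's example output.
_SAMPLE = {
    2009: {"B05012_001E": 44953, "B05012_002E": 31623, "B05012_003E": 13330},
    2014: {"B05012_001E": 45249, "B05012_002E": 30096, "B05012_003E": 15153},
    2019: {"B05012_001E": 45044, "B05012_002E": 30755, "B05012_003E": 14289},
}


def fetch_one_year(vintage: int) -> pd.DataFrame:
    """Hypothetical helper: return a one-row DataFrame for a single vintage."""
    return pd.DataFrame([_SAMPLE[vintage]])


def fetch_many_years(vintages: list[int]) -> pd.DataFrame:
    """Download each vintage, tag it with its year, and stack the rows."""
    frames = []
    for vintage in vintages:
        df_year = fetch_one_year(vintage)
        df_year["Year"] = vintage  # record which vintage each row came from
        frames.append(df_year)
    return pd.concat(frames)


combined = fetch_many_years([2009, 2014, 2019])
print(combined)
```

Even this small version leaves out renaming variables to labels, dropping metadata columns, and checking that labels are stable across years, which is where most of the repeated work comes from.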
Downloading Multiple Years of Data
You can use ced.download in the censusdis package to download data for a single year:

    import censusdis.data as ced
    from censusdis.datasets import ACS5
    from censusdis.states import NY

    df = ced.download(
        dataset=ACS5,
        vintage=2009,
        group="B05012",
        state=NY,
        school_district_unified="12510",
    )
    print(df)
      STATE SCHOOL_DISTRICT_UNIFIED  B05012_001E  B05012_002E  B05012_003E  \
    0    36                   12510        44953        31623        13330

                 GEO_ID                                             NAME
    0  9700000US3612510  Great Neck Union Free School District, New York
A few things to note:
- The vintage parameter to ced.download takes a single year.
- The data we get back has “variables” for column names (e.g. “B05012_001E”). We need to do some work to convert them to “labels” (such as “Total”).
- The dataset has some columns (such as STATE) which feel a bit redundant given that we know the state we requested data about.
The API for download_multiyear is meant to mirror that of ced.download. The primary difference is that the new vintages parameter takes a list of years:
    from utils import download_multiyear

    df = download_multiyear(
        dataset=ACS5,
        vintages=[2009, 2014, 2019],
        group="B05012",
        state=NY,
        school_district_unified="12510",
    )
    df
       Total  Native  Foreign-Born  Year
    0  44953   31623         13330  2009
    0  45249   30096         15153  2014
    0  45044   30755         14289  2019
Note the new Year column. download_multiyear has 3 additional parameters which have default values:
- rename_vars=True. If True then rename the columns from variables to labels. The labels from the last year are used. Only the last portion of the label (“!!” is a separator) is used and any trailing “:” is dropped.
- drop_cols=True. If True then drop columns which do not contain survey data. These tend to be geographic metadata.
- prompt=True. download_multiyear emits a warning if a variable’s label changed during the selected years. If prompt is True then users are also prompted to confirm that they want to continue with the download despite the label mismatch. In order to reduce false positives, “:” is removed when doing the comparison (e.g. “Total:” and “Total” are considered identical).
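The label handling described above can be sketched in a few lines. This is only an illustration of the two rules (keep the last “!!” segment and drop a trailing “:”, and ignore “:” when comparing labels across years), not the code in utils.py. The function names clean_label and labels_match are hypothetical, the sample label strings are illustrative, and removing every “:” during comparison is my assumption about how the check behaves.

```python
def clean_label(label: str) -> str:
    """Keep the last "!!"-separated segment and drop any trailing ":"."""
    return label.split("!!")[-1].rstrip(":")


def labels_match(label_a: str, label_b: str) -> bool:
    """Compare two labels while ignoring ":" characters (an assumption)."""
    return label_a.replace(":", "") == label_b.replace(":", "")


# Illustrative label strings in the "!!"-separated style.
print(clean_label("Estimate!!Total:"))                # Total
print(clean_label("Estimate!!Total:!!Foreign-Born"))  # Foreign-Born

# A label that only gained a ":" is not treated as a real change.
print(labels_match("Estimate!!Total:", "Estimate!!Total"))  # True

# A label whose wording changed is flagged as a mismatch.
print(labels_match("Estimate!!Total!!Native", "Estimate!!Total!!Foreign-Born"))  # False
```

Comparing the full label rather than only its last segment keeps the check strict: two variables can share a final segment while sitting under different parents.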
Graphing Multiple Years of ACS Data
Putting the data in such a simple form makes it easy to write a function to graph it. That’s what graph_multiyear does:
    from utils import graph_multiyear

    graph_multiyear(
        df=df,
        title="Population by Nativity in Great Neck School District",
        yaxis_title="Population",
    )
These graphs will render interactively on your local machine. However, I could not figure out how to make them interactive in WordPress.
The y_cols parameter allows you to render only a subset of the columns:

    graph_multiyear(
        df=df,
        title="Population by Nativity in Great Neck School District",
        yaxis_title="Population",
        y_cols=["Native", "Foreign-Born"],
    )
Graphing Percent Change
While Pandas has a function pct_change, it is difficult to use on our dataset because it works on all columns (including the “Year” column). Since I anticipate doing this operation multiple times in this analysis (including boilerplate code like rounding the result), I wrote the function pct_change_multiyear:
    from utils import pct_change_multiyear

    df = pct_change_multiyear(df)
    print(df)

    graph_multiyear(
        df=df,
        title="Percent Change in Population by Nativity in Great Neck School District",
        yaxis_title="Percent Change",
    )
       Total  Native  Foreign-Born  Year
    0    NaN     NaN           NaN  2009
    0    0.7    -4.8          13.7  2014
    0   -0.5     2.2          -5.7  2019
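For readers who want to see the idea in plain pandas, here is a minimal sketch of computing percent change on every column except “Year” and rounding the result. It is not the implementation in utils.py. The function name pct_change_excluding_year and its parameters are hypothetical, and the input is rebuilt by hand from the numbers shown earlier in this post (with a default index rather than the repeated 0 index).

```python
import pandas as pd

# Rebuilt by hand from the download_multiyear output shown in this post.
population = pd.DataFrame(
    {
        "Total": [44953, 45249, 45044],
        "Native": [31623, 30096, 30755],
        "Foreign-Born": [13330, 15153, 14289],
        "Year": [2009, 2014, 2019],
    }
)


def pct_change_excluding_year(
    df: pd.DataFrame, year_col: str = "Year", decimals: int = 1
) -> pd.DataFrame:
    """Percent change between consecutive rows, leaving year_col untouched."""
    data_cols = [col for col in df.columns if col != year_col]
    # pct_change returns a fraction, so scale to a percentage before rounding.
    result = (df[data_cols].pct_change() * 100).round(decimals)
    result[year_col] = df[year_col]  # re-attach the unchanged year column
    return result


changes = pct_change_excluding_year(population)
print(changes)
```

The first row is NaN because there is no earlier year to compare against, and the remaining rows reproduce the percentages in the output above.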
Conclusion
The functions download_multiyear, graph_multiyear and pct_change_multiyear have sped up my ability to do exploratory data analysis of ACS data that involves multiple years of the same table. The code is open source and available for others to use as well. Please contact me if you have any questions.