New Parameter to `download_multiyear`

In a previous post I demonstrated how to use three Python functions I wrote to work with multi-year data from the American Community Survey (ACS): download_multiyear, graph_multiyear and pct_change_multiyear. A limitation of download_multiyear was that it could only download entire tables, not individual variables. I recently encountered a situation where this was problematic. As a result I just added support for downloading individual variables. This post demonstrates both the initial problem and the new solution.

To run the code in this post, clone the hometown_analysis repo.

Starting Point: A Single Year

The table we will be working with is B05006: Place of Birth for the Foreign-born Population in the United States. Here is how to download the entire table from 2019:

from censusdis.datasets import ACS5
from censusdis.states import NY

from utils import download_multiyear

df = download_multiyear(
    dataset=ACS5,
    vintages=[2019],
    group="B05006",
    state=NY,
    school_district_unified="12510",
)
df.T.sort_values(by=0, ascending=False).head(11)

The output is:

0
Total                                  14289
Asia                                   11013
Eastern Asia                            5047
South Central Asia                      4701
China                                   3768
Iran                                    3723
China, excluding Hong Kong and Taiwan   2726
Year                                    2019
Europe                                  1821
Americas                                1270
Korea                                   1179

The dataframe has 169 columns. Most of the largest values are aggregates (like “Total” or “Asia”). Only 3 countries appear in the top 11 rows (China, Iran and Korea), and the population counts drop off quickly after that.

The Problem: Changing Number of Variables

Since I am interested in trends over time, I tried to download the same table for additional years. However, this does not work:

df = download_multiyear(
    dataset=ACS5,
    vintages=[2009, 2014, 2019],
    group="B05006",
    state=NY,
    school_district_unified="12510",
    prompt=False,
)
print(df)

The code generates a lot of warnings, and then raises this Exception:

CensusApiException: Census API request to https://api.census.gov/data/2009/acs/acs5/variables/B05006_162E.json

The issue is that the number of variables in table B05006 has changed over the years (7 variables were added in 2019), which is why the request for B05006_162E in the 2009 data fails. While I could update the code to handle this case, it’s not obvious to me that downloading an entire table across years makes sense when its variables change. A better approach is to find the particular variables you care about and select them individually.
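One way to spot this kind of mismatch is to compare the variable names available in each year. The following is a minimal sketch, not part of the repo's utils module: the function name variables_not_in_every_year is hypothetical, and the tiny variable sets are made up to mirror the failure above rather than taken from the Census API.

```python
def variables_not_in_every_year(variables_by_year):
    """For each year, list the variables that exist in some other year but not this one."""
    all_variables = set().union(*variables_by_year.values())
    return {
        year: sorted(all_variables - set(variables))
        for year, variables in variables_by_year.items()
        if all_variables - set(variables)
    }


# Hypothetical, heavily abbreviated variable lists (not real API metadata).
variables_by_year = {
    2009: {"B05006_001E", "B05006_049E"},
    2014: {"B05006_001E", "B05006_049E"},
    2019: {"B05006_001E", "B05006_049E", "B05006_162E"},
}

missing = variables_not_in_every_year(variables_by_year)
print(missing)
```

Years that already contain every variable are left out of the result, so an empty dictionary means the table has the same variables in every year.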

Solution: download_variables

The particular variables I was interested in are for China, Iran and Korea. By visiting this page you can see that they are B05006_049E, B05006_060E and B05006_054E. We can access them over multiple years by feeding them to the new parameter download_variables:

df = download_multiyear(
    dataset=ACS5,
    vintages=[2009, 2014, 2019],
    download_variables=["B05006_049E", "B05006_060E", "B05006_054E"],
    state=NY,
    school_district_unified="12510",
)
print(df)

The output is:

...   China  Iran  Korea  Year
0   1543  4424    857  2009
0   2593  5168    918  2014
0   3768  3723   1179  2019

As a reminder, if the label of any of these variables had changed over time, then download_multiyear would have emitted a warning.
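To make the idea of a label-change warning concrete, here is a minimal sketch of such a check. It is an illustration under assumptions, not the code that download_multiyear actually runs: the function name warn_on_label_changes, the variable names and the labels are all hypothetical.

```python
import warnings


def warn_on_label_changes(labels_by_year):
    """Warn when a variable's label differs between years; return the affected variables."""
    labels_seen = {}
    for year in sorted(labels_by_year):
        for variable, label in labels_by_year[year].items():
            # Record which years used which label for this variable.
            labels_seen.setdefault(variable, {}).setdefault(label, []).append(year)
    changed = sorted(v for v, labels in labels_seen.items() if len(labels) > 1)
    for variable in changed:
        warnings.warn(f"{variable} has more than one label: {labels_seen[variable]}")
    return changed


# Hypothetical variables and labels, for illustration only.
labels_by_year = {
    2009: {"X_001E": "Total", "X_002E": "Asia"},
    2019: {"X_001E": "Total", "X_002E": "Eastern Asia"},
}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    changed = warn_on_label_changes(labels_by_year)

print(changed)
print(len(caught))
```

A warning, rather than an exception, lets the download continue while still telling you that a column may not mean the same thing in every year.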

Graphing the Results

As a reminder, we can graph the results of download_multiyear with graph_multiyear:

from utils import graph_multiyear

graph_multiyear(
    df,
    "Place of Birth for the Foreign-Born Population<br>Great Neck Unified School District",
    "Population",
)

Percent Change

We can convert the results of download_multiyear to show percent change with the function pct_change_multiyear:

from utils import pct_change_multiyear

df_percent_change = pct_change_multiyear(df)

graph_multiyear(
    df_percent_change,
    "Place of Birth for the Foreign-Born Population<br>Great Neck Unified School District",
    "Percent Change",
)

The resulting dataframe, df_percent_change, is:

   China  Iran  Korea  Year
0    NaN   NaN    NaN  2009
0   68.0  16.8    7.1  2014
0   45.3 -28.0   28.4  2019
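The numbers in this table can be checked by hand: each value is the change from the previous row, divided by the previous row, times 100. The sketch below reproduces them in plain Python from the counts shown earlier. It is not the implementation of pct_change_multiyear, and the helper name pct_change_by_year is hypothetical. With pandas, DataFrame.pct_change performs the same row-over-row calculation.

```python
def pct_change_by_year(rows, year_key="Year"):
    """Percent change of each column relative to the previous row, rounded to 1 decimal."""
    result = []
    previous = None
    for row in rows:
        changed = {year_key: row[year_key]}
        for column, value in row.items():
            if column == year_key:
                continue
            if previous is None:
                changed[column] = None  # no earlier year to compare against
            else:
                change = (value - previous[column]) / previous[column] * 100
                changed[column] = round(change, 1)
        result.append(changed)
        previous = row
    return result


# Counts copied from the download_variables output earlier in the post.
rows = [
    {"China": 1543, "Iran": 4424, "Korea": 857, "Year": 2009},
    {"China": 2593, "Iran": 5168, "Korea": 918, "Year": 2014},
    {"China": 3768, "Iran": 3723, "Korea": 1179, "Year": 2019},
]

percent_change = pct_change_by_year(rows)
for row in percent_change:
    print(row)
```

For example, China in 2014 is (2593 - 1543) / 1543 * 100, which rounds to 68.0.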

Conclusion

These functions were created to help me explore (in Python) how the demographics of my hometown have changed over the course of my life. To learn more about this project, visit the hometown_analysis repository on GitHub. There you can find background on the project as well as analyses using these functions.

In addition to the analysis being interesting in itself, my hope is that the repository will inspire others to undertake similar projects. I often describe Census data as a national treasure that is underappreciated by the data science community. The code is released under the MIT license, which means that you can freely use it as a starting point for your own projects.

 

Ari Lamstein

I’m a software engineer who focuses on data projects.

I most recently worked as a Staff Data Science Engineer at a marketing analytics consultancy. While there I developed internal tools for our data scientists, ran workshops on data science and mentored data scientists on software engineering.
