Category: Open Data (page 1 of 2)

“A Guide to Working With Census Data in R” is now Complete!

Two weeks ago I mentioned that I was clearing my calendar until I finished writing A Guide to Working with Census Data in R. Today I’m happy to announce that the Guide is complete!

I even took out a special domain for the project: RCensusGuide.info.

The Guide is designed to address two problems that people who work with R and US Census Data face:

  1. Understanding what data the Census Bureau publishes.
  2. Understanding what CRAN packages are available to help with their project.

The Guide details the five most popular datasets that the Census Bureau publishes. It also describes the seven most popular R packages for working with Census data.

The best part? The Guide is free, and lists ways for you to continue learning about both the datasets and R packages that it mentions.

A special thank you to the R Consortium for funding this project!

I hope that you enjoy reading the Guide!

Read the Guide

Update on R Consortium Census Guide

As I mentioned in July, my proposal to the R Consortium to create a Guide to Working with Census Data in R was accepted. Today I’d like to share an update on the project.

The proposal breaks the creation of the guide into four milestones. So far Logan Powell and I have completed two of those milestones:

  • Milestone 1: Select publication technology. We have decided to use Github Pages to publish the Guide.
  • Milestone 2: Assemble list of resources to include. We have assembled three types of resources to include:
    • A list of R packages that work with Census Data.
    • A description of the five most popular datasets which the Census Bureau publishes.
    • A description of various ways that the Census Bureau provides training and support.

The next Milestone is to convert this list of resources into a single, coherent document. Before doing this, though, we want to give the broader R and Census communities a chance to review the resources we plan to include. We’re particularly interested to know if we’ve missed anything important, or if we’re planning to include anything that is broadly considered to be superfluous.

If you have a minute, please review what we plan to include. If you have feedback, you can provide it by filing an issue in the project’s github repository.

Review the Guide

Notes from the 2018 APDU Conference

I recently attended the 2018 Association of Public Data Users (APDU) conference. This was my second time attending the conference, and I enjoyed learning more about how other people use federal statistics. Some of my favorite presentations were:

  • Julia Lane on her experience teaching Big Data to social scientists and government workers.
  • The Bureau of Economic Analysis (BEA) on their R package.
  • Aaron Klein on the value of policymakers getting realtime data.

Julia Lane and the Coleridge Initiative

As an “R guy” I’m normally the odd man out at APDU. While many attendees do work with data, they tend to use GUI-based tools such as Excel and Tableau. I’ve often wondered if any of the attendees would benefit from learning programming language-based data tools such as R or Python.

It turns out that Julia is the author of an entire book on this topic: Big Data and Social Science! She is also a director of the Coleridge Initiative, which provides Python-based data analysis training to government workers and social scientists.

Julia spoke about her experience with these projects, and the results seemed very positive!

The Bureau of Economic Analysis (BEA) has an R package

While I mostly blog about data that the Census Bureau publishes, APDU is a great reminder of how many other statistical agencies there are. An example is the Bureau of Economic Analysis (BEA) which, among other things, calculates Gross Domestic Product (GDP) statistics.

BEA was a sponsor of the conference, and I got to chat with one of the people running their booth. I was surprised to learn that BEA has published their own R package: bea.R. This is the first time that I have heard of a government agency developing their own R package!

The person I spoke with mentioned that BEA’s Chief Innovation Officer, Jeff Chen, is a big proponent of R. You can learn more about the BEA here.
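For the curious, here is a minimal sketch of what a bea.R request looks like. The payload fields follow the package README as I remember it, so treat the exact parameter names as assumptions to check against the current documentation; you will also need your own (free) BEA API key.

```r
library(bea.R)

# Sketch of a bea.R request for quarterly GDP growth (NIPA table T10101).
# "YOUR_BEA_KEY" is a placeholder; register at bea.gov to get a real key.
bea_key <- "YOUR_BEA_KEY"
bea_spec <- list(
  "UserID"      = bea_key,
  "Method"      = "GetData",
  "datasetname" = "NIPA",
  "TableName"   = "T10101",
  "Frequency"   = "Q",
  "Year"        = "X"   # "X" requests all available years
)
gdp <- beaGet(bea_spec, asWide = FALSE)
head(gdp)
```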

I think that it would be interesting to extend Choroplethr to work with data from the BEA.

Aaron Klein on Policymakers Getting Realtime Data

Aaron Klein, a former Treasury official, spoke about the value of policy makers getting realtime data. Aaron worked in Treasury during the foreclosure crisis, and spoke about the challenges policymakers faced in responding to it. One issue was quantifying the impact that foreclosures and abandoned homes have on broader communities.

He recently wrote a research paper that attempted to answer this question: Understanding the True Costs of Abandoned Properties: How Maintenance Can Make a Difference. One statistic from the talk left a big impression on me: vacant residential buildings account for one of every 14 residential building fires in America. When you consider that only a small portion of residential homes are vacant, this statistic is truly startling.

Having data like this at the start of the foreclosure crisis might have improved how policymakers responded to it.

Census Academy Update!

Last year I announced a collaboration with the Census Bureau to help with their new online training platform. This website, called “Census Academy”, is designed to help the public learn how to work with data that the Census Bureau publishes.

Census Academy recently had a soft launch, and there is already a lot of content on it. I’ve marked up an image of the homepage below to draw attention to three sections: Data Gems, Courses and “Join the Community”.

Data Gems

Data Gems are short instructional videos created by experts at the Census Academy.

A recent Data Gem, for example, explained how ZIP Code Tabulation Areas (a geography published by the Census Bureau) are different from ZIP Codes (which are published by the Postal Service). This is a question that I am frequently asked, and I found that the Data Gem gave a great answer!
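The distinction matters in code, too. As a hedged sketch, Kyle Walker's tigris package (discussed elsewhere on this page) can download the ZCTA boundaries that the Data Gem describes; the call below assumes an internet connection and that the `cb` and `starts_with` arguments still behave as they do in the versions I have used.

```r
library(tigris)

# Download Census Bureau ZCTA boundaries. Note there is no equivalent
# call for USPS ZIP Codes -- the Postal Service publishes no polygon
# files, which is exactly the point the Data Gem makes.
# starts_with limits the result to San Francisco-area ZCTAs;
# cb = TRUE requests the smaller cartographic boundary file.
sf_zctas <- zctas(cb = TRUE, starts_with = "941")
```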

Courses

My biggest contribution to Census Academy has been the creation of a course on visualizing data from the American Community Survey (ACS) in R. This course, which will be published soon, can be thought of as an in-depth extension of my existing course Learn to Map Census Data in R. I will make an announcement here when the course becomes available!

Join the Community

Census Academy just started, and new training material is being added all the time. I think that any R programmer with an interest in demographic data can benefit from Census Academy!

To get notified when Census Academy adds new training material, click the “Join the Community” button on the Census Academy homepage.

Visit Census Academy

My Proposal to the R Consortium

I recently submitted a Proposal to the R Consortium. I decided to share this proposal on my blog for two reasons:

  1. To raise awareness of an opportunity I see in the R ecosystem.
  2. To raise awareness of the R Consortium as a funding source for R projects.

I first learned about the R Consortium when I attended the Boston EARL Conference in 2015. When I attended the San Francisco EARL Conference last year there was a session dedicated to projects the Consortium had funded. I was very impressed, and this year decided to submit a proposal myself!

I have no idea if this proposal will be accepted. Nonetheless, I hope that publishing it raises awareness of this relatively new funding source, and my own thoughts on how the R ecosystem can improve.

Below is a verbatim copy of the proposal.


Proposal to Create an R Consortium Working Group Focused on US Census Data

The Problem

R users who wish to work with US Census Data face two significant problems: Package Selection and Data Navigation.

Package Selection

An R user looking for packages to help with a project on US Census Data would likely start by going to CRAN’s list of contributed packages and doing a search for “Census”. This process is non-optimal for a number of reasons:

  1. It yields 41 hits, which is a large number to sort through.
  2. Many package titles appear to be duplicative (e.g. censusr’s title “Collect Data from the Census API” is very similar to censusapi’s title of “Retrieve Data from the Census APIs”).
  3. Some packages that deal with US Census Data do not have the word “Census” in their name or title (e.g. the choroplethr package).

For these reasons, even an experienced R user might find it challenging to determine which R package, if any, can help them with their Census-related project.

Note that these issues might also lead to package duplication on CRAN. That is, a developer might create a package to solve a problem which an existing package already solves.
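The search described above can be reproduced in a few lines of base R. This is just a sketch of the process (the hit count will drift over time as CRAN changes):

```r
# Mimic searching CRAN for Census-related packages by matching the
# word "census" against package names. Requires an internet connection.
pkgs <- available.packages(repos = "https://cloud.r-project.org")
hits <- grep("census", rownames(pkgs), ignore.case = TRUE, value = TRUE)
length(hits)  # dozens of matches, with no obvious way to rank them
hits
```

Note that this only matches package names, not titles or descriptions; as point 3 above observes, packages such as choroplethr would not appear at all.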

Data Navigation

People often turn to the Census Bureau for answers to questions such as “How many people live in a certain region?” They are often surprised that the answer depends on which data product they query. That is, many R users begin their interaction with Census Data not fully understanding the array of data available to them, or when it is better to use one data product over another.

The Plan

We would like to create an R Consortium Working Group to foster greater cooperation between the US Census Bureau and the R community. We hope that the working group will become a place where statisticians working with census data can cooperate under the guidance of the Census Bureau.

The first project we would like the Working Group to take on is the creation of a guide to getting started with Census data in R (“The Guide”). The Guide would address the problems listed above by:

  1. Listing, describing and linking to packages currently available via CRAN that work with US Census Data.
  2. Listing, describing and linking to reference materials for better understanding and working with Census data.

The most likely failure mode for the Package Selection section is not including all the relevant packages, or not communicating the differences between the packages in a way that helps users decide which package is most appropriate for their use. At this point we do not know whether the Guide should simply copy the CRAN package description, or also include additional information. (And if it should include additional information, what information should that be?) We plan to address this risk by publishing and publicizing drafts of our work online, and incorporating feedback from the R community.

The most likely failure mode for the Data Navigation section is not providing resources which are useful or relevant to the actual needs of the R community. In the same way that CRAN has a wealth of packages that can be difficult to navigate, the Census Bureau also has a wealth of training material that can be difficult to navigate. We plan to address this by publishing and publicizing drafts of our work online, and incorporating feedback from the R community.

Another failure mode which we do not address in this proposal is maintenance of the Guide. While the Guide might be completely accurate at the time of publication, it will naturally become less accurate over time. At this point it is not clear what the best way to maintain the Guide is.

The Team

Ari Lamstein. Ari is an R Trainer and Consultant who is currently working as a Data Science Instructor at the US Census Bureau. He has written several R packages that deal with US Census Data. Ari is planning to focus on the Package Selection portion of the Guide.

Logan T Powell. Logan is the Developer Experience and Engagement Lead at the US Census Bureau. Logan is planning to work on the Data Navigation portion of the Guide.

Zach Whitman, PhD. Zach is the Chief Data Officer at the US Census Bureau. Zach is planning to work on future projects related to changes in the Census API.

Kyle Walker, PhD. Kyle is Associate Professor of Geography at Texas Christian University. Kyle is the primary author of the tigris, tidycensus, and idbr R packages for working with Census Bureau spatial and tabular data in R.

Project Milestones

Milestone 1: Select Publication Technology (1 Month, $500)

Our first task will be selecting technology to use to publish the Guide. We would like the technology to be free and easy to access, as well as free to host and easy to update. We are planning to start this phase by evaluating Github, Github Pages and WordPress.

This milestone will be completed when there is a “Hello World” version of the Guide published, and both authors understand the workflow for editing it.

Milestone 2: Assemble list of resources to include (1 Month, $1,000)

The initial package and resource lists will be based on our personal knowledge and experience. The lists will be stored in github so that other people can contribute to the lists. That github repository will then be publicized via our blogs and social media accounts.

This milestone will be completed when we have the final list of Package and Training resources that we plan to include in the guide.

Milestone 3: Complete Publication of Guide (1 Month, $2,000)

After we have the final list of resources we plan to include in the Guide, we will need to write up the results.

This milestone will be completed once the final draft of the Guide is published.

We plan to announce completion of this milestone on our blogs and social media accounts.

Milestone 4: Complete Dissemination (1 Month, $500)

Once the Guide is completed, we will focus on disseminating it to the largest possible audience.

We will start by simply announcing the Guide’s completion on our blogs and social media accounts.

We will also reach out to the Census Bureau, which has already indicated an interest in linking to it from their website.

We also believe that CRAN might want to link to the Guide on the Official Statistics & Survey Methodology Task View. However, we have not yet spoken to the Task View Maintainer about this. If the maintainer thinks that this project is not a perfect fit, then we are open to creating a separate task view dedicated to US Census Statistics.

How Can The ISC Help

We are seeking to create an ISC Working Group to promote using R to analyze US Census Data. The individual who will be committed to leading and managing the group’s activities is Ari Lamstein.

We are requesting a $4,000 grant to fund completion of the Working Group’s first project: a Guide to getting started with Census Data in R (“The Guide”). The Project Milestones section contains a breakdown of funding by milestone.

The Census Bureau is currently planning changes to its API. We hope that a future project for the Working Group will be to get feedback on the API from the R Community.

In addition to creating a Working Group and financially supporting the creation of the Guide, we believe that the R Consortium can help this project to succeed by facilitating cooperation among stakeholders and disseminating and promoting the Guide on the R Consortium website.

Dissemination

We plan to publish the Guide under the Creative Commons Attribution 4.0 License.

As indicated in the Project Milestones Section, we plan to include the community in the drafting of the Guide, and we plan to publicize project milestones on our blogs and social media accounts.

The Census Bureau has also indicated an interest in linking to the Guide on its website once it is completed.

We also hope to speak with the maintainer of CRAN’s Official Statistics & Survey Methodology Task View about including a link to the Guide. If the maintainer thinks that this project is not a perfect fit, then we are open to creating a separate task view dedicated to US Census Statistics.

Update on My Collaboration with the Census Bureau

In November I announced a new collaboration with the Census Bureau around data science training. Today I’d like to provide an update on this project.

I recently presented a draft of my first course for this project internally and it was well received. The working title of this course is Mapping State ACS Data in R with Choroplethr. We are using github to store the course materials, and you can view them here.

If you are interested in this project then I recommend starring and following the repository on github. Why? Because that’s the best way to learn about new training materials! Also, Steven Johnson and Maggie Cawley from Boomerang Geospatial are currently working on a course for this project that relates to OpenStreetMap and Census Geography. If you follow this repository then you will know as soon as that course is available.

Comparison with Learn to Map Census Data in R

Mapping State ACS Data in R with Choroplethr can be thought of as a super-charged version of my email course Learn to Map Census Data in R. The “final project” in both courses is the same: the creation of this map, which shows the percent change in per capita income between 2010 and 2015.
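While the map image itself isn't reproduced here, the final plot boils down to a single choroplethr call once the data is in hand. Here is a minimal sketch, with placeholder values standing in for the real ACS per capita income estimates that the course fetches (choroplethr will warn about the states omitted here):

```r
library(choroplethr)

# Placeholder data: the course derives these values from ACS per capita
# income estimates for 2010 and 2015; the numbers below are illustrative only.
df <- data.frame(
  region      = c("california", "texas", "new york"),
  income_2010 = c(29000, 25000, 31000),
  income_2015 = c(31500, 27800, 34200)
)
df$value <- 100 * (df$income_2015 - df$income_2010) / df$income_2010

# choroplethr expects columns named "region" and "value"
state_choropleth(df[, c("region", "value")],
                 title = "Percent Change in Per Capita Income, 2010-2015")
```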

The main differences between the courses relate to content, medium and community.

Content

Learn to Map Census Data in R is a short course that aims to give people a quick win in writing R code that maps data from the US Census Bureau. Students are expected to know at least a bit of R beforehand.

Mapping State ACS Data in R with Choroplethr is a longer course that has no prerequisites. In fact, the first module provides information on installing R and getting set up with RStudio. The longer format also allows me to provide more information about the American Community Survey and how it differs from the Decennial Census. This is material that I had to skip in Learn to Map Census Data in R.

Medium

Mapping State ACS Data in R with Choroplethr will be an asynchronous video course. All the lessons will have video that you can watch on demand, as well as downloadable code examples. This will allow learners to easily skip over content that they already know or rewatch lessons that they found confusing the first time through.

By contrast, all the content in Learn to Map Census Data in R is delivered via email. That makes it great for a short course. But it limits the amount of content that I can deliver, and it also makes it hard for students to review past lessons.

Community

In Learn to Map Census Data in R the only interaction that the learner has is with me, via email. While the details haven’t been finalized yet, we are hoping to create an online community that will allow students to interact not only with each other, but also potentially the Census Bureau’s Data Dissemination Specialists.

Next Steps

The final version of Mapping State ACS Data in R with Choroplethr will be submitted by the end of March. The best way to know about changes to the curriculum is to follow the github repository, which you can do here.

R Programmers: What is your biggest problem when working with Census data?

A few weeks ago I announced my latest project: Data Science instruction at the Census Bureau.

In addition to announcing the project, I also snuck a bit of market research into the post. I asked people the types of analyses they do when working with Census data. I also asked what packages they use when solving those problems.

23 people left comments, and they have been very helpful in shaping the curriculum of the course. Thank you to everyone who left a comment!

That was such an effective way to learn about the community of R Census users that I’ve decided to do it again. If you are an R programmer who has worked with Census data, please leave a comment with an answer to this question:

What is your biggest problem when working with Census data in R?

Understanding the obstacles people face has the potential to help us design better courses.

Leave your answer as a comment below!

New Project: Data Science Instruction at the US Census Bureau!

Today I am delighted to announce an exciting new collaboration. I will be working with the US Census Bureau as a Data Science Instructor!

Over the next six months I will be helping Census develop courses on using R to work with Census Data. These courses will be free and open to the public. People familiar with my open source work will realize that this project is right up my alley!

As a start to this project I am trying to gather two pieces of information:

  1. Which packages do R programmers typically use when working with Census data?
  2. What types of analyses do R programmers typically do with Census data?

If you use R to work with Census data, please leave an answer below!

Free Software Foundation “Social Benefit” Award Nominations

Ezra Haber Glenn, the author of the acs package in R, recently posted about the Free Software Foundation’s “Social Benefit” Award on the acs mailing list:

acs.R Community:

The Free Software Foundation is now accepting nominations for the 2017
“Project of Social Benefit Award,” presented to the project or team
responsible for applying free software, or the ideas of the free
software movement, in a project that intentionally and significantly
benefits society in other aspects of life.

If anyone is willing to nominate the acs package, the recognition
would be much appreciated — the package has been generously supported
by MIT and the Puget Sound Regional Council, as well as a great deal
of user-feedback and creative development on the part of the
ACS/Census/R community.

The nomination form is quick and easy — see
https://my.fsf.org/projects-of-social-benefit-award-nomination.
Deadline 11/5.

More info at https://www.fsf.org/awards/sb-award/.

Thanks!

I’m reposting this here for a few reasons.

The first is that I only learned about this award from Ezra’s post, and I think that it’s worth raising awareness of the award itself.

The second is that, in my opinion, the acs package does “intentionally and significantly benefit society.” I have used the acs package over several years to learn more about US demographics. Choroplethr, my R package for creating statistical maps, also uses the acs package to retrieve data from the Census Bureau. Several thousand people have taken my free course on Choroplethr, and each of those people has benefitted from the acs package as well.
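To make that dependency concrete, here is a hedged sketch of how a workflow like choroplethr's can use the acs package to pull a table. It assumes a Census API key has been installed, and that `geo.make` and `acs.fetch` still carry the signatures I remember; check the package documentation for the current interface.

```r
library(acs)

# One-time setup: install a (free) Census API key.
# api.key.install("YOUR_CENSUS_KEY")

# Fetch total population (table B01003) for every county in California
# from the 5-year ACS ending in 2015.
geo <- geo.make(state = "CA", county = "*")
pop <- acs.fetch(endyear = 2015, span = 5,
                 geography = geo, table.number = "B01003")
estimate(pop)  # point estimates; standard.error(pop) gives the errors
```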

Finally, I’m mentioning this award to point out that R package developers receive compensation in different ways. None of us receive monetary compensation when people use our packages. However, Ezra has indicated that getting nominated for this award would be useful to him.

For all these reasons, I was happy to nominate the acs package for the Free Software Foundation’s “Social Benefit” Award. It took me less than 5 minutes to fill out the form. If you are a user of choroplethr, and you enjoy its integration with US Census Data, then I encourage you to nominate the acs package as well. You can do so here.

Meeting Titans of Open Data

The recent Association of Public Data Users (APDU) Conference gave me the opportunity to meet some people who have made tremendous contributions to the world of Open Data.

Jon Sperling

The author with Jon Sperling

One of the most popular R packages I’ve published is choroplethrZip. This package contains the US Census Bureau’s Zip Code Tabulation Area (ZCTA) map, as well as metadata and visualization functions. It literally never occurred to me to think about who created the first ZCTA map, the methodology for creating it, and so on.

It turns out that one of the people who created the ZCTA map – Jon Sperling – was at the conference. We had a fascinating conversation, and I even got to take a selfie with him! You can learn more about Jon’s role in the creation of the TIGER database in his 1992 paper Development and Maintenance of the TIGER Database: Experiences in Spatial Data Sharing at the U.S. Bureau of the Census.

Jon currently works at the Department of Housing and Urban Development (HUD). You can learn more about their data here.

Nancy Potok

Nancy Potok is the Chief Statistician of the United States, and she gave a fascinating keynote. Before her talk I did not know that the country even had a Chief Statistician. Her talk taught me about the Federal Statistical System as well as the Interagency Council on Statistical Policy.

The Q&A portion of her talk was also interesting. A significant portion of the audience worked at federal agencies. I believe that it was during this session when someone said “I can’t tell how much of my time should be dedicated to supporting the data and analyses which I publish. In a very real sense, my job ends when I publish it.” This question helped me understand why it is sometimes difficult to get help understanding how to use government datasets: there simply aren’t incentives for data publishers to support users of that data.

Andrew Reamer

Sometimes at a conference you can tell that there’s a celebrity there just by how people act towards them.

In this case it seemed that everyone except me knew who Andrew Reamer was. During the Q&A portion of talks, Andrew would raise his hand and speakers would say “Hi Andrew! What’s your question?”. Or they would get a question from someone else and say “I’m actually not sure what the answer is. Maybe Andrew knows. Andrew?”

Note that Andrew didn’t actually have a speaking slot at the event. Yet he still did a lot of speaking!

During lunch I worked up the courage to go up to Andrew and ask him about his work. It turns out that he is a Research Professor at George Washington University’s Institute of Public Policy. He is a well-known social scientist. As I’m quickly learning, social scientists tend to be major users of public datasets.

I previously published a quote from the Census Bureau that the American Community Survey (a major survey they run) impacts how over $400 billion is allocated. However, I was never able to get any more granularity on that. If you’re interested in the answer to that, it turns out that Andrew wrote a piece on it: Counting for Dollars 2020: The Role of the Decennial Census in the Geographic Distribution of Federal Funds.

Sarah Cohen

Sarah Cohen is a Pulitzer Prize winning journalist who is now the Assistant Editor for Computer-Assisted Reporting at the New York Times. Before Sarah’s talk I only had passing familiarity with data journalism, and had a very narrow view of what it entailed. Sarah’s keynote gave examples of different ways that public data is shaping journalism. She also discussed common problems that arise when journalists communicate with data publishers.

Key Takeaway

The 2017 APDU Conference was a great chance for me (an R trainer and consultant with a strong interest in open data) to meet people who have made major contributions to the field of public data. If you are interested in learning more about the Association of Public Data Users, I recommend visiting their website here.


© 2019 AriLamstein.com
