
“A Guide to Working With Census Data in R” is now Complete!

Two weeks ago I mentioned that I was clearing my calendar until I finished writing A Guide to Working with Census Data in R. Today I’m happy to announce that the Guide is complete!

I even registered a dedicated domain for the project: RCensusGuide.info.

The Guide is designed to address two problems that people who work with R and US Census Data face:

  1. Understanding what data the Census Bureau publishes.
  2. Understanding what CRAN packages are available to help with their project.

The Guide details the five most popular datasets that the Census Bureau publishes. It also describes the seven most popular R packages for working with Census data.

The best part? The Guide is free, and lists ways for you to continue learning about both the datasets and R packages that it mentions.

A special thank you to the R Consortium for funding this project!

I hope that you enjoy reading the Guide!

Read the Guide

Update on the R Consortium Census Working Group

It’s been a while since I shared any information about the R Consortium’s Census Working Group.

Today I’d like to share an update on three projects which the group is involved in.

The project you might already be familiar with is the “Guide to Working with Census Data in R”. This is my survey document covering the most popular datasets that the Census Bureau publishes, the R packages that facilitate working with Census data, and resources for learning more about both. I am planning to complete this document in the next two weeks.

The project which launched this working group is my video course on using Choroplethr to visualize data from the American Community Survey (ACS). The course will be free and hosted on the Census Bureau’s website. It has already been completed, and we’re in the final stages of getting the Bureau’s approval to publish it on its website. I expect this to be done by the end of the year.

A new project which we are considering taking on is a webinar series highlighting authors of R packages that work with Census data. This would give us a chance to create a library of educational material about using R to work with Census data.

If you have any questions or feedback about this, please contact me.

Update on R Consortium Census Guide

As I mentioned in July, my proposal to the R Consortium to create a Guide to Working with Census Data in R was accepted. Today I’d like to share an update on the project.

The proposal breaks the creation of the guide into four milestones. So far Logan Powell and I have completed two of those milestones:

  • Milestone 1: Select publication technology. We have decided to use Github Pages to publish the Guide.
  • Milestone 2: Assemble list of resources to include. We have assembled three types of resources to include:
    • A list of R packages that work with Census Data.
    • A description of the five most popular datasets which the Census Bureau publishes.
    • A description of various ways that the Census Bureau provides training and support.

The next milestone is to convert this list of resources into a single, coherent document. Before doing this, though, we want to give the broader R and Census communities a chance to review the resources we plan to include. We’re particularly interested to know if we’ve missed anything important, or if we’re planning to include anything that is broadly considered to be superfluous.

If you have a minute, please review what we plan to include. If you have feedback, you can provide it by filing an issue in the project’s github repository.

Review the Guide

Notes from the 2018 APDU Conference

I recently attended the 2018 Association of Public Data Users (APDU) conference. This was my second time attending the conference, and I enjoyed learning more about how other people use federal statistics. Some of my favorite presentations were:

  • Julia Lane on her experience teaching Big Data to social scientists and government workers.
  • The Bureau of Economic Analysis (BEA) on their R package.
  • Aaron Klein on the value of policymakers getting realtime data.

Julia Lane and the Coleridge Initiative

As an “R guy” I’m normally the odd man out at APDU. While many attendees do work with data, they tend to use GUI-based tools such as Excel and Tableau. I’ve often wondered if any of the attendees would benefit from learning programming language-based data tools such as R or Python.

It turns out that Julia is the author of an entire book on this topic: Big Data and Social Science! She is also a director of the Coleridge Initiative, which provides Python-based data analysis training to government workers and social scientists.

Julia spoke about her experience with these projects, and the results seemed very positive!

The Bureau of Economic Analysis (BEA) has an R package

While I mostly blog about data that the Census Bureau publishes, APDU is a great reminder of how many other statistical agencies there are. An example is the Bureau of Economic Analysis (BEA) which, among other things, calculates Gross Domestic Product (GDP) statistics.

BEA was a sponsor of the conference, and I got to chat with one of the people running their booth. I was surprised to learn that BEA has published their own R package: bea.R. This is the first time that I have heard of a government agency developing their own R package!

The person I spoke with mentioned that BEA’s Chief Innovation Officer, Jeff Chen, is a big proponent of R. You can learn more about the BEA here.

I think that it would be interesting to extend Choroplethr to work with data from the BEA.

Aaron Klein on Policymakers Getting Realtime Data

Aaron Klein, a former Treasury official, spoke about the value of policymakers getting realtime data. Aaron worked at Treasury during the foreclosure crisis, and described the challenges policymakers faced in responding to it. One issue was quantifying the impact that foreclosures and abandoned homes have on broader communities.

He recently wrote a research paper that attempted to answer this question: Understanding the True Costs of Abandoned Properties: How Maintenance Can Make a Difference. One statistic from the talk left a big impression on me: vacant residential buildings account for one of every 14 residential building fires in America. When you consider that only a small portion of residential homes are vacant, this statistic is truly startling.

Having data like this at the start of the foreclosure crisis might have improved how policymakers responded to it.

R Consortium Proposal Accepted!

Today I am happy to announce that my proposal to the R Consortium was accepted!

I first announced that I was submitting a proposal back in March. As a reminder, the proposal has two distinct parts:

  1. Creating a guide to working with US Census data in R.
  2. Creating an R Consortium Working Group focused on US Census Data.

If you’d like to read the proposal in its entirety, you can do so here.

I am currently planning to develop the “Census Guide” in public using github. If you’d like to follow along with the development, you can do so by visiting the github repository and clicking the “watch” button:

View the Github Repository 

New Version of ggplot2

I just received a note from Hadley Wickham that a new version of ggplot2 is scheduled to be submitted to CRAN on June 25. Here’s what choroplethr users need to know about this new version of ggplot2.

Choroplethr Update Required

The new version of ggplot2 introduces bugs into choroplethr. In particular, choroplethr does not pass R CMD check when the new version of ggplot2 is loaded. I am planning to submit a new version of choroplethr to CRAN that addresses this issue prior to June 25.

This is the third or fourth time in recent years that I’ve had to update choroplethr as a result of changes to ggplot2. The experience reminds me of one of the first lessons I learned as a software engineer: “More time is spent maintaining old software than writing new software.”

Simple Features Support

One of the most common questions I get about choroplethr is whether I intend to add support for interactive maps. My answer has always been “I can’t do that until ggplot2 adds support for Simple Features.” Thankfully, this new version of ggplot2 introduces that support!
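For readers who have not seen the new geometry yet, here is a minimal sketch of what Simple Features support looks like in ggplot2, using the example shapefile that ships with the sf package (the AREA column comes from that sample dataset):

    library(sf)
    library(ggplot2)

    # Read the North Carolina shapefile bundled with the sf package
    nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

    # geom_sf() plots sf objects directly -- no "fortifying" required
    ggplot(nc) +
      geom_sf(aes(fill = AREA)) +
      theme_void()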

Currently all maps in the choroplethr ecosystem are stored as ggplot2 “fortified” dataframes, a format unique to ggplot2. Storing the maps in this format lets them render as quickly as possible. The downsides are:

  1. ggplot2 does not support interactive graphics.
  2. The main interactive mapping library on CRAN (leaflet) cannot render fortified data frames. It can only render maps stored as Shapefiles or Simple Features.

Once ggplot2 adds support for Simple Features, I can begin work on adding interactive map support to choroplethr. The first steps will likely be:

  1. Updating choroplethr to be able to render maps stored in the Simple Features format.
  2. Migrating choroplethr’s existing maps to the Simple Features format.

After that, I can start experimenting with adding interactive graphics support to choroplethr.

Note that Simple Features is not without its drawbacks. In particular, many users are reporting performance problems when creating maps using Simple Features and ggplot2. I will likely not begin this project until these issues have been resolved.

Thoughts on the CRAN Ecosystem

This issue has caused me to reflect a bit about the stability of the CRAN ecosystem. 

ggplot2 is used by about 1,700 packages. It’s unclear to me how many of these packages will have similar problems as a result of this change to ggplot2. And of the impacted packages, how many have maintainers who will push out a new version before June 25?

And ggplot2, of course, is just one of many packages on CRAN. This issue has the potential to occur whenever any package on CRAN is updated.

This issue reminded me that CRAN has a complex web of dependencies, and that package maintainers are largely unpaid volunteers. It seems like a situation where bugs can easily creep into an end user’s code.

My Proposal to the R Consortium

I recently submitted a Proposal to the R Consortium. I decided to share this proposal on my blog for two reasons:

  1. To raise awareness of an opportunity I see in the R ecosystem.
  2. To raise awareness of the R Consortium as a funding source for R projects.

I first learned about the R Consortium when I attended the Boston EARL Conference in 2015. When I attended the San Francisco EARL Conference last year there was a session dedicated to projects the Consortium had funded. I was very impressed, and this year decided to submit a proposal myself!

I have no idea if this proposal will be accepted. Nonetheless, I hope that publishing it raises awareness of this relatively new funding source, and my own thoughts on how the R ecosystem can improve.

Below is a verbatim copy of the proposal.


Proposal to Create an R Consortium Working Group Focused on US Census Data

The Problem

R users who wish to work with US Census Data face two significant problems: Package Selection and Data Navigation.

Package Selection

An R user looking for packages to help with a project on US Census Data would likely start by going to CRAN’s list of contributed packages and doing a search for “Census”. This process is non-optimal for a number of reasons:

  1. It yields 41 hits, which is a large number to sort through.
  2. Many package titles appear to be duplicative (e.g. censusr’s title “Collect Data from the Census API” is very similar to censusapi’s title of “Retrieve Data from the Census APIs”).
  3. Some packages that deal with US Census Data do not have the word “Census” in their name or title (e.g. the choroplethr package).

For these reasons, even an experienced R user might find it challenging to determine which R package, if any, can help them with their Census-related project.

Note that these issues might also lead to package duplication on CRAN. That is, a developer might create a package to solve a problem which an existing package already solves.

Data Navigation

People often turn to the Census Bureau for answers to questions such as “How many people live in a certain region?” They are often surprised that the answer depends on which data product they query. That is, many R users begin their interaction with Census Data not fully understanding the array of data available to them, or when it is better to use one data product over another.

The Plan

We would like to create an R Consortium Working Group to foster greater cooperation between the US Census Bureau and the R community. We hope that the working group will become a place where statisticians working with census data can cooperate under the guidance of the Census Bureau.

The first project we would like the Working Group to take on is the creation of a guide to getting started with Census data in R (“The Guide”). The Guide would address the problems listed above by:

  1. Listing, describing and linking to packages currently available via CRAN that work with US Census Data.
  2. Listing, describing and linking to reference materials for better understanding and working with Census data.

The most likely failure mode for the Package Selection section is not including all the relevant packages, or not communicating the differences between the packages in a way that helps users decide which package is most appropriate for their use. At this point we do not know whether the Guide should simply copy the CRAN package description, or also include additional information. (And if it should include additional information, what information should that be?) We plan to address this risk by publishing and publicizing drafts of our work online, and incorporating feedback from the R community.

The most likely failure mode for the Data Navigation section is not providing resources which are useful or relevant to the actual needs of the R community. In the same way that CRAN has a wealth of packages that can be difficult to navigate, the Census Bureau also has a wealth of training material that can be difficult to navigate. We plan to address this by publishing and publicizing drafts of our work online, and incorporating feedback from the R community.

Another failure mode which we do not address in this proposal is maintenance of the Guide. While the Guide might be completely accurate at the time of publication, it will naturally become less accurate over time. At this point it is not clear what the best way to maintain the Guide is.

The Team

Ari Lamstein. Ari is an R Trainer and Consultant who is currently working as a Data Science Instructor at the US Census Bureau. He has written several R packages that deal with US Census Data. Ari is planning to focus on the Package Selection portion of the Guide.

Logan T Powell. Logan is the Developer Experience and Engagement Lead at the US Census Bureau. Logan is planning to work on the Data Navigation portion of the Guide.

Zach Whitman, PhD. Zach is the Chief Data Officer at the US Census Bureau. Zach is planning to work on future projects related to changes in the Census API.

Kyle Walker, PhD. Kyle is Associate Professor of Geography at Texas Christian University. Kyle is the primary author of the tigris, tidycensus, and idbr R packages for working with Census Bureau spatial and tabular data in R.

Project Milestones

Milestone 1: Select Publication Technology (1 Month, $500)

Our first task will be selecting technology to use to publish the Guide. We would like the technology to be free and easy to access, as well as free to host and easy to update. We are planning to start this phase by evaluating Github, Github Pages and WordPress.

This milestone will be completed when there is a “Hello World” version of the Guide published, and both authors understand the workflow for editing it.

Milestone 2: Assemble list of resources to include (1 Month, $1,000)

The initial package and resource lists will be based on our personal knowledge and experience. The lists will be stored in github so that other people can contribute to them. That github repository will then be publicized via our blogs and social media accounts.

This milestone will be completed when we have the final list of Package and Training resources that we plan to include in the guide.

Milestone 3: Complete Publication of Guide (1 Month, $2,000)

After we have the final list of resources we plan to include in the Guide, we will need to write up the results.

This milestone will be completed once the final draft of the Guide is published.

We plan to announce completion of this milestone on our blogs and social media accounts.

Milestone 4: Complete Dissemination (1 Month, $500)

Once the Guide is completed, we will focus on disseminating it to the largest possible audience.

We will start by simply announcing the Guide’s completion on our blogs and social media accounts.

We will also reach out to the Census Bureau, which has already indicated an interest in linking to it from their website.

We also believe that CRAN might want to link to the Guide on the Official Statistics & Survey Methodology Task View. However, we have not yet spoken to the Task View Maintainer about this. If the maintainer thinks that this project is not a perfect fit, then we are open to creating a separate task view dedicated to US Census Statistics.

How Can The ISC Help

We are seeking to create an ISC Working Group to promote using R to analyze US Census Data. The individual who will be committed to leading and managing the group’s activities is Ari Lamstein.

We are requesting a $4,000 grant to fund completion of the Working Group’s first project: a Guide to getting started with Census Data in R (“The Guide”). The Project Milestones section contains a breakdown of funding by milestone.

The Census Bureau is currently planning changes to its API. We hope that a future project for the Working Group will be to get feedback on the API from the R Community.

In addition to creating a Working Group and financially supporting the creation of the Guide, we believe that the R Consortium can help this project to succeed by facilitating cooperation among stakeholders and disseminating and promoting the Guide on the R Consortium website.

Dissemination

We plan to publish the Guide under the Creative Commons Attribution 4.0 License.

As indicated in the Project Milestones Section, we plan to include the community in the drafting of the Guide, and we plan to publicize project milestones on our blogs and social media accounts.

The Census Bureau has also indicated an interest in linking to the Guide on its website once it is completed.

We also hope to speak with the maintainer of CRAN’s Official Statistics & Survey Methodology Task View about including a link to the Guide. If the maintainer thinks that this project is not a perfect fit, then we are open to creating a separate task view dedicated to US Census Statistics.

Update: “Difficult to Reproduce Choroplethr Bug” Isolated and Fixed

Yesterday I wrote that I’ve received a number of choroplethr bug reports recently that I simply cannot reproduce. Thanks to the large number of people who replied with their system information, along with whether or not they were able to reproduce the bug, I was able to track the issue down. Thank you to everyone who helped!

It appears that the issue only occurs when choroplethr is used in conjunction with the development version of ggplot2. I personally never use the development version of ggplot2, which explains why I was never able to reproduce the bug.

Normally I would not update choroplethr to work better with the development version of another package. But I received a lot of help isolating this issue, so I decided to submit a fix to github. You can get it by typing the following:
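Assuming you have the devtools package installed, the command is along these lines:

    # Install the patched development version of choroplethr from github
    # (the repository path below is the usual one for choroplethr)
    devtools::install_github("arilamstein/choroplethr")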

If you were able to reproduce the bug before, I would appreciate it if you could retest with this version and verify that everything now works for you.

Technical Details

In order to create a map of the 50 US states, choroplethr first renders a map of the continental United States. It then individually renders Alaska and Hawaii, and affixes them to the continental US map as annotations. You can view the entire process here. A key step in that process is a call to ggplotGrob.
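The general pattern looks roughly like this sketch (a toy plot stands in for the real maps; this is not choroplethr’s exact code):

    library(ggplot2)

    # Toy stand-ins for the continental US map and an Alaska/Hawaii inset
    main_map  <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
    inset_map <- ggplot(mtcars, aes(wt, mpg)) + geom_point() + theme_void()

    # Convert the inset plot to a grob and affix it to the main plot as an
    # annotation. ggplotGrob() is the call that fails under the development
    # version of ggplot2.
    main_map + annotation_custom(
      grob = ggplotGrob(inset_map),
      xmin = 4, xmax = 5.5, ymin = 25, ymax = 34
    )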

With the development version of ggplot2, the call to ggplotGrob fails. It appears that the failure is due to changes in how the development version of ggplot2 handles themes. Choroplethr uses two custom themes:

  • theme_clean hides everything except the map and legend.
  • theme_inset acts like theme_clean but also hides the legend.

You can view the choroplethr theme code here.

I’m not an expert on these new changes to ggplot2, but it appears that ggplotGrob now wants a theme with all member variables specified. It also appears that there is a new built-in theme called theme_void that does everything my theme_clean does while also setting all of the theme member variables. (This means that calling ggplotGrob with theme_void does not lead to a crash). I used theme_void as the basis of my fix. You can see the specific code change here.
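To illustrate the idea behind the fix, here is a sketch (not the exact choroplethr code) of an inset theme built on top of theme_void:

    library(ggplot2)

    # Because theme_void() specifies every theme element, ggplotGrob() has
    # nothing left unset to trip over. The inset theme just hides the legend
    # on top of that.
    theme_inset <- function() {
      theme_void() +
        theme(legend.position = "none")
    }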

(As a note to myself, it now appears that using one of the theme_replace functions would simplify my new theme_inset code. I am intentionally holding off on this change, though, until I know whether or not this will even be an issue in the next CRAN version of ggplot2.)

Difficult to Reproduce Choroplethr Bug: Can You Help?

Since November 2017 I have received three bug reports from users who see this error when running any command in the choroplethr package:

Normally when someone reports a choroplethr bug that I cannot reproduce I recommend they do the following:

  1. Update to the latest version of R and RStudio.
  2. Type “update.packages(ask=FALSE)” to get the latest version of each package.
  3. Restart R and try again.

However, each of the three users has tried this and it has not solved their problem.

It appears that the bug is at least somewhat OS dependent. I am running the following version of R:

and I cannot reproduce the bug. The users who report the bug have these configurations:

Because I do not have access to a machine configuration that produces the bug, I am limited in my ability to debug this.

A Google search for this error shows that issues #179 (a Linux machine) and #185 (no OS specified) in the ggmap package both report the same error and were filed at roughly the same time that I started receiving these error reports. These issues seem to have been resolved by this change. However, since I am not able to reproduce the error on my own machine, I am reluctant to make any updates to the choroplethr code base based on what I see in that commit.

If you are able to shed some light on what is going on with this bug, then please leave a comment below.

The bug seems to be reproducible with essentially any choroplethr command.
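For example, a minimal call along these lines (a sketch, not the exact code from the bug reports) should show whether your setup is affected:

    library(choroplethr)

    # The simplest test I can think of: map a sample state-level population
    # dataset that ships with choroplethr
    data(df_pop_state)
    state_choropleth(df_pop_state)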

Update

Thank you to everyone who provided test results in the comments. I have been able to narrow the issue down to the development version of ggplot2. That is, the OS issue appears to be a red herring. I am now working on the issue.

Update on My Collaboration with the Census Bureau

In November I announced a new collaboration with the Census Bureau around data science training. Today I’d like to provide an update on this project.

I recently presented a draft of my first course for this project internally and it was well received. The working title of this course is Mapping State ACS Data in R with Choroplethr. We are using github to store the course materials, and you can view them here.

If you are interested in this project then I recommend starring and following the repository on github. Why? Because that’s the best way to learn about new training materials! Also, Steven Johnson and Maggie Cawley from Boomerang Geospatial are currently working on a course for this project that relates to OpenStreetMap and Census Geography. If you follow the repository then you will know as soon as that course is available.

Comparison with Learn to Map Census Data in R

Mapping State ACS Data in R with Choroplethr can be thought of as a super-charged version of my email course Learn to Map Census Data in R. The “final project” in both courses is the same: the creation of this map, which shows the percent change in per capita income between 2010 and 2015.
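For a sense of what goes into that final project, here is a rough sketch of how such a map can be built with choroplethr (this assumes choroplethr’s get_state_demographics() helper, which returns state-level ACS estimates including per capita income, and is not the exact course code):

    library(choroplethr)

    # Per capita income from the 5-year ACS ending in 2010 and in 2015.
    # Note: downloading ACS data requires a Census API key installed via the
    # acs package (api.key.install).
    acs_2010 <- get_state_demographics(endyear = 2010, span = 5)
    acs_2015 <- get_state_demographics(endyear = 2015, span = 5)

    # Percent change in per capita income, keyed by state name ("region")
    df <- merge(acs_2010[, c("region", "per_capita_income")],
                acs_2015[, c("region", "per_capita_income")],
                by = "region", suffixes = c("_2010", "_2015"))
    df$value <- 100 * (df$per_capita_income_2015 - df$per_capita_income_2010) /
      df$per_capita_income_2010

    state_choropleth(df[, c("region", "value")],
                     title = "Percent Change in Per Capita Income, 2010-2015")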

The main differences between the courses relate to content, medium and community.

Content

Learn to Map Census Data in R is a short course that aims to give people a quick win in writing R code that maps data from the US Census Bureau. Students are expected to know at least a bit of R beforehand.

Mapping State ACS Data in R with Choroplethr is a longer course that has no prerequisites. In fact, the first module provides information on installing R and getting set up with RStudio. The longer format also allows me to provide more information about the American Community Survey and how it differs from the Decennial Census. This is material that I had to skip in Learn to Map Census Data in R.

Medium

Mapping State ACS Data in R with Choroplethr will be an asynchronous video course. All the lessons will have video that you can watch on demand, as well as downloadable code examples. This will allow learners to easily skip over content that they already know or rewatch lessons that they found confusing the first time through.

By contrast, all the content in Learn to Map Census Data in R is delivered via email. That makes it great for a short course. But it limits the amount of content that I can deliver, and it also makes it hard for students to review past lessons.

Community

In Learn to Map Census Data in R the only interaction that the learner has is with me, via email. While the details haven’t been finalized yet, for the new course we are hoping to create an online community that will allow students to interact not only with each other, but also potentially with the Census Bureau’s Data Dissemination Specialists.

Next Steps

The final version of Mapping State ACS Data in R with Choroplethr will be submitted by the end of March. The best way to know about changes to the curriculum is to follow the github repository, which you can do here.
