Category: Open Source (page 1 of 2)

“A Guide to Working With Census Data in R” is now Complete!

Two weeks ago I mentioned that I was clearing my calendar until I finished writing A Guide to Working with Census Data in R. Today I’m happy to announce that the Guide is complete!

I even took out a special domain for the project: RCensusGuide.info.

The Guide is designed to address two problems that people who work with R and US Census Data face:

  1. Understanding what data the Census Bureau publishes.
  2. Understanding what CRAN packages are available to help with their project.

The Guide details the five most popular datasets that the Census Bureau publishes. It also describes the seven most popular R packages for working with Census data.

The best part? The Guide is free, and lists ways for you to continue learning about both the datasets and R packages that it mentions.

A special thank you to the R Consortium for funding this project!

I hope that you enjoy reading the Guide!

Read the Guide

New Version of Choroplethr.com!

Last year I decided to take out the domain Choroplethr.com. I used it to host information about Choroplethr, my suite of R packages for mapping demographic statistics. A year and a half later, I’ve made the first significant change to the site!

The old version of Choroplethr.com was structured as a series of essays that demonstrated how to use specific parts of each package. After collecting data on how people were interacting with the site, as well as talking to hundreds of Choroplethr users, I decided to update the site.

The new site has 4 sections:

  1. Free Course. The most popular introduction to Choroplethr continues to be Learn to Map Census Data in R, my free course on using Choroplethr.
  2. Packages and Documentation. This section is new: it contains a description of each of the five packages that make up Choroplethr, as well as documentation on each function and object in each package.
  3. Courses. This section describes the two advanced courses I created, and how they can help Choroplethr users. These courses are Mapmaking in R with Choroplethr and Shapefiles for R Programmers
  4. Support. This section provides information on how to get support when using Choroplethr.

I hope that you’ll take a moment to visit the new version of Choroplethr.com and let me know what you think!

Visit Choroplethr.com

Choroplethr v3.6.2 is now on CRAN

A new version of Choroplethr is now on CRAN! You can get it by typing the following from the R Console:

I encourage all Choroplethr users to install this new version!

Overview

A new version of ggplot2 will be released shortly. This new version of ggplot2 introduces many new features. These new features, however, come at a cost: some packages which use ggplot2 (such as Choroplethr) are not compatible with this new version.

In fact, if you try to draw a map with the new version of ggplot2 and the old version of Choroplethr, you might get an error. This new version of Chorplethr (v3.6.2) fixes those issues.

Technical Details

Choroplethr was one of (if not the) first R package to create maps of the US where Alaska and Hawaii are rendered as insets. Here’s an example map, along with code:

The new version of ggplot2 was causing Choroplethr to crash whenever it attempted to render the above map. In order to understand the underlying issue, it is helpful to first understand how Choroplethr creates the above map.

Insets and ?annotation_custom

The most important thing to notice about the above map is this: it is inaccurate.

Alaska and Hawaii are not really located where they appear. And they are not drawn to scale, either.

Here is how the 50 US states “really” look:

In order to render Alaska and Hawaii as insets, Choroplethr:

  1. First pre-renders the Continental US, Alaska and Hawaii as separate ggplot objects.
  2. Then affixes the Alaska and Hawaii ggplot objects to the continental US object using ggplot2’s ggplotGrob and annotation_custom functions.

Insets and Themes

The two images differ in more than just the placement of Alaska and Hawaii, though. The first map also has a clear background, whereas the second map has a grey background and tick marks for longitude and latitude. Also, Choroplethr uses a single legend for all three maps (the legends for Alaska and Hawaii are suppressed).

Traditionally Choroplethr has implemented these aesthetic tweaks in the functions theme_clean (for the Continental US) and theme_inset (for Alaska and Hawaii). The code for these functions originally came from Section 13.19 (“Making a Map with a Clean Background”) of Winston Chang’s R Graphics Cookbook. Here’s what that code used to look like:

The Bug

The bug was happening when Choroplethr was attempting to affix the map of Alaska, with theme_inset applied to it, to the map of the Continental US:

The last line in the above snippet was causing Choroplethr to crash. It appears that the intersection of theme_insetggplotGrob and annotation_custom was causing a problem.

The Fix

It appears that ggplot2 has a new function for creating a clear background on a map that does not cause these problems. That function is theme_void. Modifying theme_clean and theme_inset to use theme_void solved the problem. Here is the new version of those functions:

New Version of ggplot2

I just received a note from Hadley Wickham that a new version of ggplot2 is scheduled to be submitted to CRAN on June 25. Here’s what choroplethr users need to know about this new version of ggplot2.

Choroplethr Update Required

The new version of ggplot2 introduces bugs into choroplethr. In particular, choroplethr does not pass R cmd check when the new version of ggplot2 is loaded. I am planning to submit a new version of choroplethr to CRAN that addresses this issue prior to June 25.

This change is the third or fourth time I’ve had to update choroplethr in recent years as a result of changes to ggplot2. This experience reminds me of one of the first lessons I learned as a software engineer: “More time is spent maintaining old software than writing new software.”

Simple Features Support

One of the most common questions I get about choroplethr is whether I intend to add support for interactive maps. My answer has always been “I can’t do that until ggplot2 adds support for Simple Features.” Thankfully, this new version of ggplot2 introduces that support!

Currently all maps in the choroplethr ecosystem are stored as ggplot2 “fortified” dataframes. This is a format unique to ggplot2. Storing the maps in this format makes it possible to render the maps as quickly as possible. The downside is that:

  1. ggplot2 does not support interactive graphics.
  2. The main interactive mapping library in CRAN (leaflet) cannot render fortified data frames. It can only render maps stored as Shapefiles or Simple Features.

Once ggplot2 adds support for Simple Features, I can begin work on adding interactive map support to choroplethr. The first steps will likely be:

  1. Updating choroplethr to be able to render maps stored in the Simple Features format.
  2. Migrating choroplethr’s existing maps to the Simple Features format.

After that, I can start experimenting with adding interactive graphics support to choroplethr.

Note that Simple Features is not without its drawbacks. In particular, many users are reporting performance problems when creating maps using Simple Features and ggplot2. I will likely not begin this project until these issues have been resolved.

Thoughts on the CRAN Ecosystem

This issue has caused me to reflect a bit about the stability of the CRAN ecosystem. 

ggplot2 is used by about 1,700 packages. It’s unclear to me how many of these packages will have similar problems as a result of this change to ggplot2. And of the impacted packages, how many have maintainers who will push out a new version before June 25?

And ggplot2, of course, is just one of many packages on CRAN. This issue has the potential to occur whenever any package on CRAN is updated.

This issue reminded me that CRAN has a complex web of dependencies, and that package maintainers are largely unpaid volunteers. It seems like a situation where bugs can easily creep into an end user’s code.

My Proposal to the R Consortium

I recently submitted a Proposal to the R Consortium. I decided to share this proposal on my blog for two reasons:

  1. To raise awareness of an opportunity I see in the R ecosystem.
  2. To raise awareness of the R Consortium as a funding source for R projects.

I first learned about the R Consortium when I attended the Boston EARL Conference in 2015. When I attended the San Francisco EARL Conference last year there was a session dedicated to projects the Consortium had funded. I was very impressed, and this year decided to submit a proposal myself!

I have no idea if this proposal will be accepted. Nonetheless, I hope that publishing it raises awareness of this relatively new funding source, and my own thoughts on how the R ecosystem can improve.

Below is a verbatim copy of the proposal.


Proposal to Create an R Consortium Working Group Focused on US Census Data

The Problem

R users who wish to work with US Census Data face two significant problems: Package Selection and Data Navigation.

Package Selection

An R user looking for packages to help with a project on US Census Data would likely start by going to CRAN’s list of contributed packages and doing a search for “Census”. This process is non-optimal for a number of reasons:

  1. It yields 41 hits, which is a large number to sort through.
  2. Many package titles appear to be duplicative (e.g. censusr’s title “Collect Data from the Census API” is very similar to censusapi’s title of “Retrieve Data from the Census APIs”).
  3. Some packages that deal with US Census Data do not have the word “Census” in their name or title (e.g. the choroplethr package).

For these reasons, even an experienced R user might find it challenging to determine which R package, if any, can help them with their Census-related project.

Note that these issues might also lead to package duplication on CRAN. That is, a developer might create a package to solve a problem which an existing package already solves.

Data Navigation

People often turn to the Census Bureau for answers to questions such as “How many people live in a certain region?” They are often surprised that the answer depends on which data product they query. That is, many R users begin their interaction with Census Data not fully understanding the array of data available to them, or when it is better to use one data product over another.

The Plan

We would like to create an R Consortrium Working Group to foster greater cooperation between the US Census Bureau and R community. We hope that the working group will become a place where statisticians working with census data can cooperate under the guidance of the Census Bureau.

The first project we would like the Working Group to take on is the creation of a guide to getting started with Census data in R (“The Guide”). The Guide would address the problems listed above by:

  1. Listing, describing and linking to packages currently available via CRAN that work with US Census Data.
  2. Listing, describing and linking to reference materials for better understanding and working with Census data.

The most likely failure mode for the Package Selection section is not including all the relevant packages, or not communicating the differences between the packages in a way that helps users decide which package is most appropriate for their use. At this point we do not know whether the Guide should simply copy the CRAN package description, or also include additional information. (And if it should include additional information, what information should that be?) We plan to address this risk by publishing and publicizing drafts of our work online, and incorporating feedback from the R community.

The most likely failure mode for the Data Navigation section is not providing resources which are useful or relevant to the actual needs of the R community. In the same way that CRAN has a wealth of packages that can be difficult to navigate, the Census Bureau also has a wealth of training material that can be difficult to navigate. We plan to address this by publishing and publicizing drafts of our work online, and incorporating feedback from the R community.

Another failure mode which we do not address in this proposal is maintenance of the Guide. While the Guide might be completely accurate the time of publication, it will naturally become less accurate over time. At this point it is not clear what the best way to maintain the Guide is.

The Team

Ari Lamstein. Ari is an R Trainer and Consultant who is currently working as a Data Science Instructor at the US Census Bureau. He has written several R packages that deal with US Census Data. Ari is planning to focus on on the Package Selection portion of the Guide.

Logan T Powell. Logan is the Developer Experience and Engagement Lead at the US Census Bureau. Logan is planning to work on the Data Navigation portion of the Guide.

Zach Whitman, PhD. Zach is the Chief Data Officer at the US Census Bureau. Zach is planning to work on future projects related to changes in the Census API.

Kyle Walker, PhD. Kyle is Associate Professor of Geography at Texas Christian University. Kyle is the primary author of the tigris, tidycensus, and idbr R packages for working with Census Bureau spatial and tabular data in R.

Project Milestones

Milestone 1: Select Publication Technology (1 Month, $500)

Our first task will be selecting technology to use to publish the Guide. We would like the technology to be free and easy to access, as well as free to host and easy to update. We are planning to start this phase by evaluating Github, Github Pages and WordPress.

This milestone wil be completed when there is a “Hello World” version of the Guide published, and both authors understand the workflow for editing it.

Milestone 2: Assemble list of resources to include (1 Month, $1,000)

The initial package and resource lists will be based on our personal knowledge and experience. The lists will be stored in github so that other people can contribute to the lists. That github repository will then be publicized via our blogs and social media accounts.

This milestone will be completed when we have the final list of Package and Training resources that we plan to include in the guide.

Milestone 3: Complete Publication of Guide (1 Month, $2,000)

After we have the final list of resources we plan to include in the Guide, we will need to write up the results.

This milestone will be completed once the final draft of the Guide is published.

We plan to announce completion of this milestone on our blogs and social media accounts.

Milestone 4: Complete Dissemination (1 Month, $500)

Once the Guide is completed, we will focus on disseminating it to the largest possible audience.

We will start by simply announcing the Guide’s completion on our blogs and social media accounts.

We will also reach out to the Census Bureau, which has already indicated an interest in linking to it from their website.

We also believe that CRAN might want to link to the Guide on the Official Statistics & Survey Methodlogy Task View. However, we have not yet spoken to the Task View Maintainer about this. If the maintainer thinks that this project is not a perfect fit, then we are open to creating a separate task view dedicated to US Census Statistics.

How Can The ISC Help

We are seeking to create an ISC Working Group to promote using R to analyze US Census Data. The individual who will be committed to leading and managing the group’s activities is Ari Lamstein.

We are requesting a $4,000 grant to fund completion of the Working Group’s first project: a Guide to getting started with Census Data in R (“The Guide”). The Project Milestones sections contains a breakdown of funding by milestone.

The Census Bureau is currently planning changes to its API. We hope that a future project for the Working Group will be to get feedback on the API from the R Community.

In addition to creating a Working Group and financially supporting the creation of the Guide, we believe that the R Consortium can help this project to succeed by facilitating cooperation among stakeholders and disseminating and promoting the Guide on the R Consortium website.

Dissemination

We plan to publish the Guide under the Creative Commons Attribution 4.0 License.

As indicated in the Project Milestones Section, we plan to include the community in the drafting of the Guide, and we plan to publicize project milestones on our blogs and social media accounts.

The Census Bureau has also indicated an interest in linking to the Guide on its website once it is completed.

We also hope to speak with the maintainer of CRAN’s Official Statistics & Survey Methodology Task View about including a link to the Guide. If the maintainer thinks that this project is not a perfect fit, then we are open to creating a separate task view dedicated to US Census Statistics.

Update: “Difficult to Reproduce Choroplethr Bug” Isolated and Fixed

Yesterday I wrote that I’ve received a number of choroplethr bug reports recently that I simply cannot reproduce. Due to a large number of people replying with their system information, along with whether or not they were able to reproduce the bug, I was able to track it down. Thank you to everyone who helped!

It appears that the issue only occurs when choroplethr is used in conjunction with the development version of ggplot2. I personally never use the development version of ggplot2, which explains why I was never able to reproduce the bug.

Normally I would not update choroplethr to work better with the development version of another package. But I received a lot of help to isolate this issue, so I decided to submit a fix to github. You can get it by typing the following:

If you were able to reproduce the bug before, I would appreciate it if you could retest with this version and verify that everything now works for you.

Technical Details

In order to create a map of the 50 US States, choroplethr first renders a map of the continental United States. It then individually renders Alaska and Hawaii, and affixes them to the continental US map as annotations. You can view the entire process here. A key function in that process is this line:

With the development version of ggplot2, the call to ggplotGrob fails. It appears that the failure is due to changes in how the development version of ggplot2 handles themes. Choroplethr uses two custom themes:

  • theme_clean hides hides everything except the map and legend.
  • theme_inset acts like theme_clean but also hides the legend.

You can view the choroplethr theme code here.

I’m not an expert on these new changes to ggplot2, but it appears that ggplotGrob now wants a theme with all member variables specified. It also appears that there is a new built-in theme called theme_void that does everything my theme_clean does while also setting all of the theme member variables. (This means that calling ggplotGrob with theme_void does not lead to a crash). I used theme_void as the basis of my fix. You can see the specific code change here.

(As a note to myself, it now appears that using one of the theme_replace functions would simplify the code of my new theme_inset code. I am intentionally holding off on this change, though, until I know whether or not this will even be an issue in the next CRAN version of ggplot2).

Difficult to Reproduce Choroplethr Bug: Can You Help?

Since November 2017 I have received three bug reports from users who see this error when running any command in the choroplethr package:

Normally when someone reports a choroplethr bug that I cannot reproduce I recommend they do the following:

  1. Update to the latest version of R and RStudio.
  2. Type “update.packages(ask=FALSE)” to get the latest version of each package.
  3. Restart R and try again.

However, each of the three users has tried this it has not solved their problem.

It appears that the bug is at least somewhat OS dependent. I am running the following version of R:

and I cannot reproduce the bug. The users who report the bug have these configurations:

Because I do not have access to a machine configuration that produces the bug, I am limited in my ability to debug this.

A google search for this error shows that issues #179 (a linux machine) and #185 (no OS specified) in the ggmap package both report the same error and were filed at roughly the same time that I started receiving these error reports. These issues seem to have been resolved by this change. However, since I am not able to reproduce the error on my own machine, I am reluctant to make any updates to the choroplethr code base based on what I see in that commit.

If you are able to shed some light on what is going on with this bug, then please leave a comment below.

The following code seems to reproduce the bug:

Update

Thank you for everyone who provided test results in the comments. I have been able to narrow down the issue to occurring in the development version of ggplot2. That is, the OS issue appears to be a red herring. I am now working on the issue.

Free Software Foundation “Social Benefit” Award Nominations

Ezra Haber Glenn, the author of the acs package in R, recently posted about the Free Software Foundation’s “Social Benefit” Award on the acs mailing list:

acs.R Community:

The Free Software Foundation is now accepting nominations for the 2017
“Project of Social Benefit Award,” presented to the project or team
responsible for applying free software, or the ideas of the free
software movement, in a project that intentionally and significantly
benefits society in other aspects of life.

If anyone is willing to nominate the acs package, the recognition
would be much appreciated — the package has been generously supported
by MIT and the Puget Sound Regional Council, as well as a great deal
of user-feedback and creative development on the part of the
ACS/Census/R community.

The nomination form is quick and easy — see
https://my.fsf.org/projects-of-social-benefit-award-nomination.
Deadline 11/5.

More info at https://www.fsf.org/awards/sb-award/.

Thanks!

I’m reposting this here for a few reasons.

The first is that I only learned about this award from Ezra’s post, and I think that it’s worth raising awareness of the award itself.

The second is that, in my opinion, the acs package does “intentionally and significantly benefit society.” I have used the acs package over several years to learn more about US demographics. Choroplethr, my R package for creating statistical maps, also uses the acs package to retrieve data from the Census Bureau. Several thousand people have taken my free course on Choroplethr, and each of those people has benefitted from the acs package as well.

Finally, I’m mentioning this award to point out that R package developers receive compensation in different ways. None of us receive monetary compensation when people use our packages. However, Ezra has indicated that getting nominated for this award would be useful to him.

For all these reasons, I was happy to nominate the acs package for the Free software Foundation’s “Social Benefit” Award. It took me less than 5 minutes to fill out the form. If you are a user of choroplethr, and you enjoy its integration with US Census Data, then I encourage you to nominate the acs package as well. You can do so here.

Upcoming talk at the Association of Public Data Users (APDU) Conference

On Thursday, September 14th I will be giving a lightning talk at the Association of Public Data Users (APDU) Conference in Alexandria, Virginia. The talk will be on choroplethr, my suite of R packages for mapping open datasets.

This talk is part of my effort to bridge the worlds of free analytical software (e.g. R) and public datasets.

If you are attending the conference then I hope to see you there!

You can learn more about the APDU conference here.

 

Recording of my Talk at the ACS Data Users Conference

I recently had the honor of speaking at the 2017 American Community Survey (ACS) Data Users Conference. My talk focused on choroplethr, the suite of R packages I’ve written for mapping open datasets in R. The ACS is the one of the main surveys that the US Census Bureau runs, and I was honored to share my work there.

My talk was recorded, and you can view it below:

Attending the conference was part of my effort to bridge the gap between the world of free data analysis tools (i.e. the R community) and the world of government datasets (which I consider the most interesting datasets around).

The conference was great. I have been meaning to write a blog post summarizing what I learned, but I simply haven’t had the time. Instead, I will point you to a recording of all the conference talks.

 

« Older posts

© 2019 AriLamstein.com

Theme by Anders NorenUp ↑