
Update: “Difficult to Reproduce Choroplethr Bug” Isolated and Fixed

Yesterday I wrote that I’ve received a number of choroplethr bug reports recently that I simply cannot reproduce. Because so many people replied with their system information, along with whether or not they could reproduce the bug, I was able to track the issue down. Thank you to everyone who helped!

It appears that the issue only occurs when choroplethr is used in conjunction with the development version of ggplot2. I personally never use the development version of ggplot2, which explains why I was never able to reproduce the bug.

Normally I would not update choroplethr to work better with the development version of another package. But I received a lot of help to isolate this issue, so I decided to submit a fix to github. You can get it by typing the following:
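The install commands would look like this (a sketch: it assumes the package repository is `arilamstein/choroplethr` and that you have the devtools package installed):

```r
# Install the development version of choroplethr from github.
# install.packages("devtools")  # if you don't already have devtools
library(devtools)
install_github("arilamstein/choroplethr")
```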

If you were able to reproduce the bug before, I would appreciate it if you could retest with this version and verify that everything now works for you.

Technical Details

In order to create a map of the 50 US States, choroplethr first renders a map of the continental United States. It then individually renders Alaska and Hawaii, and affixes them to the continental US map as annotations. You can view the entire process here. A key function in that process is this line:
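The line in question looks roughly like the following (a paraphrase: the variable names here are illustrative, not copied from the choroplethr source):

```r
# Convert the rendered Alaska map into a grob (graphical object) so that
# it can be affixed to the continental US map via annotation_custom().
alaska.grob = ggplotGrob(alaska.ggplot)
```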

With the development version of ggplot2, the call to ggplotGrob fails. It appears that the failure is due to changes in how the development version of ggplot2 handles themes. Choroplethr uses two custom themes:

  • theme_clean hides everything except the map and legend.
  • theme_inset acts like theme_clean but also hides the legend.

You can view the choroplethr theme code here.

I’m not an expert on these new changes to ggplot2, but it appears that ggplotGrob now wants a theme with all member variables specified. It also appears that there is a new built-in theme called theme_void that does everything my theme_clean does while also setting all of the theme member variables. (This means that calling ggplotGrob with theme_void does not lead to a crash). I used theme_void as the basis of my fix. You can see the specific code change here.
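As a sketch of the approach (this is not the exact choroplethr code, just an illustration of building on theme_void), a theme_inset based on theme_void might look like:

```r
library(ggplot2)

# Sketch: start from theme_void(), which specifies every theme element,
# then additionally hide the legend for the inset maps.
theme_inset = function() {
  theme_void() +
    theme(legend.position = "none")
}
```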

(As a note to myself, it now appears that using one of the theme_replace functions would simplify the code of my new theme_inset code. I am intentionally holding off on this change, though, until I know whether or not this will even be an issue in the next CRAN version of ggplot2).

Difficult to Reproduce Choroplethr Bug: Can You Help?

Since November 2017 I have received three bug reports from users who see this error when running any command in the choroplethr package:

Normally when someone reports a choroplethr bug that I cannot reproduce I recommend they do the following:

  1. Update to the latest version of R and RStudio.
  2. Type “update.packages(ask=FALSE)” to get the latest version of each package.
  3. Restart R and try again.
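The second step can be run directly from the R console:

```r
# Update every installed package without prompting for confirmation
update.packages(ask = FALSE)
```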

However, each of the three users has tried this, and it has not solved their problem.

It appears that the bug is at least somewhat OS dependent. I am running the following version of R:

and I cannot reproduce the bug. The users who report the bug have these configurations:

Because I do not have access to a machine configuration that produces the bug, I am limited in my ability to debug this.

A Google search for this error shows that issues #179 (a linux machine) and #185 (no OS specified) in the ggmap package both report the same error, and were filed at roughly the same time that I started receiving these error reports. These issues seem to have been resolved by this change. However, since I am not able to reproduce the error on my own machine, I am reluctant to make any updates to the choroplethr code base based on what I see in that commit.

If you are able to shed some light on what is going on with this bug, then please leave a comment below.
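If you do leave a comment, these base R commands will print the relevant configuration details to include:

```r
# Print the R version and platform, plus attached package versions
R.version.string
sessionInfo()
```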

The following code seems to reproduce the bug:
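(The original snippet was not preserved in this copy of the post. A minimal choroplethr call such as the following would exercise the same code path; state_choropleth() and the df_pop_state dataset both ship with the package.)

```r
library(choroplethr)

# Render a state-level map; any 50-state map triggers the
# Alaska/Hawaii inset rendering where the failure occurs.
data(df_pop_state)
state_choropleth(df_pop_state)
```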


Thank you to everyone who provided test results in the comments. I have been able to narrow the issue down to the development version of ggplot2. That is, the OS issue appears to be a red herring. I am now working on a fix.

Update on My Collaboration with the Census Bureau

In November I announced a new collaboration with the Census Bureau around data science training. Today I’d like to provide an update on this project.

I recently presented a draft of my first course for this project internally and it was well received. The working title of this course is Mapping State ACS Data in R with Choroplethr. We are using github to store the course materials, and you can view them here.

If you are interested in this project then I recommend starring and following the repository on github. Why? Because that’s the best way to learn about new training materials! Also, Steven Johnson and Maggie Cawley from Boomerang Geospatial are currently working on a course for this project that relates to OpenStreetMap and Census Geography. If you follow this repository then you will learn as soon as that course is available.

Comparison with Learn to Map Census Data in R

Mapping State ACS Data in R with Choroplethr can be thought of as a super-charged version of my email course Learn to Map Census Data in R. The “final project” in both courses is the same: the creation of this map, which shows the percent change in per capita income between 2010 and 2015:
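The underlying calculation for a map like this is a simple percent change. As a sketch in base R (the function name is mine, not from either course):

```r
# Percent change from an old value to a new value, rounded to one decimal
percent_change = function(old, new) {
  round(100 * (new - old) / old, 1)
}

percent_change(50000, 55000)  # 10
```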

The main differences between the courses relate to content, medium and community.


Content

Learn to Map Census Data in R is a short course that aims to give people a quick win in writing R code that maps data from the US Census Bureau. Students are expected to know at least a bit of R beforehand.

Mapping State ACS Data in R with Choroplethr is a longer course that has no prerequisites. In fact, the first module provides information on installing R and getting set up with RStudio. The longer format also allows me to provide more information about the American Community Survey and how it differs from the Decennial Census. This is material that I had to skip in Learn to Map Census Data in R.


Medium

Mapping State ACS Data in R with Choroplethr will be an asynchronous video course. All the lessons will have video that you can watch on demand, as well as downloadable code examples. This will allow learners to easily skip over content that they already know or rewatch lessons that they found confusing the first time through.

By contrast, all the content in Learn to Map Census Data in R is delivered via email. That makes it great for a short course. But it limits the amount of content that I can deliver, and it also makes it hard for students to review past lessons.


Community

In Learn to Map Census Data in R, the only interaction that the learner has is with me, via email. While the details haven’t been finalized yet, we are hoping to create an online community that will allow students to interact not only with each other, but also potentially with the Census Bureau’s Data Dissemination Specialists.

Next Steps

The final version of Mapping State ACS Data in R with Choroplethr will be submitted by the end of March. The best way to know about changes to the curriculum is to follow the github repository, which you can do here.

R Programmers: What is your biggest problem when working with Census data?

A few weeks ago I announced my latest project: Data Science instruction at the Census Bureau.

In addition to announcing the project, I also snuck a bit of market research into the post. I asked people the types of analyses they do when working with Census data. I also asked what packages they use when solving those problems.

23 people left comments, and they have been very helpful in shaping the curriculum of the course. Thank you to everyone who left a comment!

That was such an effective way to learn about the community of R Census users that I’ve decided to do it again. If you are an R programmer who has worked with Census data, please leave a comment with an answer to this question:

What is your biggest problem when working with Census data in R?

Understanding the obstacles people face has the potential to help us design better courses.

Leave your answer as a comment below!

New Project: Data Science Instruction at the US Census Bureau!

Today I am delighted to announce an exciting new collaboration. I will be working with the US Census Bureau as a Data Science Instructor!

Over the next six months I will be helping Census develop courses on using R to work with Census Data. These courses will be free and open to the public. People familiar with my open source work will realize that this project is right up my alley!

As a start to this project I am trying to gather two pieces of information:

  1. Which packages do R programmers typically use when working with Census data?
  2. What types of analyses do R programmers typically do with Census data?

If you use R to work with Census data, please leave an answer below!

Free Software Foundation “Social Benefit” Award Nominations

Ezra Haber Glenn, the author of the acs package in R, recently posted about the Free Software Foundation’s “Social Benefit” Award on the acs mailing list:

acs.R Community:

The Free Software Foundation is now accepting nominations for the 2017
“Project of Social Benefit Award,” presented to the project or team
responsible for applying free software, or the ideas of the free
software movement, in a project that intentionally and significantly
benefits society in other aspects of life.

If anyone is willing to nominate the acs package, the recognition
would be much appreciated — the package has been generously supported
by MIT and the Puget Sound Regional Council, as well as a great deal
of user-feedback and creative development on the part of the
ACS/Census/R community.

The nomination form is quick and easy — see
Deadline 11/5.

More info at


I’m reposting this here for a few reasons.

The first is that I only learned about this award from Ezra’s post, and I think that it’s worth raising awareness of the award itself.

The second is that, in my opinion, the acs package does “intentionally and significantly benefit society.” I have used the acs package over several years to learn more about US demographics. Choroplethr, my R package for creating statistical maps, also uses the acs package to retrieve data from the Census Bureau. Several thousand people have taken my free course on choroplethr, and each of those people has benefited from the acs package as well.
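As an illustration of the kind of retrieval the acs package enables (a sketch: it assumes you have already installed a Census API key via api.key.install(), and the geography and table number are just examples):

```r
library(acs)

# Define a geography (all counties in New York) and fetch total
# population (ACS table B01003) from the 2011-2015 5-year estimates.
geo = geo.make(state = "NY", county = "*")
pop = acs.fetch(endyear = 2015, span = 5, geography = geo,
                table.number = "B01003")
```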

Finally, I’m mentioning this award to point out that R package developers receive compensation in different ways. None of us receive monetary compensation when people use our packages. However, Ezra has indicated that getting nominated for this award would be useful to him.

For all these reasons, I was happy to nominate the acs package for the Free Software Foundation’s “Social Benefit” Award. It took me less than 5 minutes to fill out the form. If you are a user of choroplethr, and you enjoy its integration with US Census data, then I encourage you to nominate the acs package as well. You can do so here.

Meeting Titans of Open Data

The recent Association of Public Data Users (APDU) Conference gave me the opportunity to meet some people who have made tremendous contributions to the world of Open Data.

Jon Sperling

The author with Jon Sperling

One of the most popular R packages I’ve published is choroplethrZip. This package contains the US Census Bureau’s Zip Code Tabulation Area (ZCTA) map, as well as metadata and visualization functions. It literally never occurred to me to think about who created the first ZCTA map, the methodology for creating it, and so on.

It turns out that one of the people who created the ZCTA map – Jon Sperling – was at the conference. We had a fascinating conversation, and I even got to take a selfie with him! You can learn more about Jon’s role in the creation of the TIGER database in his 1992 paper Development and Maintenance of the TIGER Database: Experiences in Spatial Data Sharing at the U.S. Bureau of the Census.

Jon currently works at the Department of Housing and Urban Development (HUD). You can learn more about their data here.

Nancy Potok

Nancy Potok is the Chief Statistician of the United States, and she gave a fascinating keynote. Before her talk I did not know that the country even had a Chief Statistician. Her talk taught me about the Federal Statistical System as well as the Interagency Council on Statistical Policy.

The Q&A portion of her talk was also interesting. A significant portion of the audience worked at federal agencies. I believe it was during this session that someone said “I can’t tell how much of my time should be dedicated to supporting the data and analyses which I publish. In a very real sense, my job ends when I publish it.” This comment helped me understand why it is sometimes difficult to get help understanding how to use government datasets: there simply aren’t incentives for data publishers to support users of that data.

Andrew Reamer

Sometimes at a conference you can tell that there’s a celebrity there just by how people act towards them.

In this case it seemed that everyone except me knew who Andrew Reamer was. During the Q&A portion of talks, Andrew would raise his hand and speakers would say “Hi Andrew! What’s your question?”. Or they would get a question from someone else and say “I’m actually not sure what the answer is. Maybe Andrew knows. Andrew?”

Note that Andrew didn’t actually have a speaking slot at the event. Yet he still did a lot of speaking!

During lunch I worked up the courage to go up to Andrew and ask him about his work. It turns out that he is a Research Professor at George Washington University’s Institute of Public Policy. He is a well-known social scientist. As I’m quickly learning, social scientists tend to be major users of public datasets.

I previously published a quote from the Census Bureau that the American Community Survey (a major survey they run) impacts how over $400 billion is allocated. However, I was never able to get any more granularity on that. If you’re interested in the answer to that, it turns out that Andrew wrote a piece on it: Counting for Dollars 2020: The Role of the Decennial Census in the Geographic Distribution of Federal Funds.

Sarah Cohen

Sarah Cohen is a Pulitzer Prize winning journalist who is now the Assistant Editor for Computer-Assisted Reporting at the New York Times. Before Sarah’s talk I only had passing familiarity with data journalism, and had a very narrow view of what it entailed. Sarah’s keynote gave examples of different ways that public data is shaping journalism. She also discussed common problems that arise when journalists communicate with data publishers.

Key Takeaway

The 2017 APDU Conference was a great chance for me (an R trainer and consultant with a strong interest in open data) to meet people who have made major contributions to the field of public data. If you are interested in learning more about the Association of Public Data Users, I recommend visiting their website here.

A “Pre-Training” R Survey

Recently I’ve been working with a client to help their analysts improve their proficiency with R. A major challenge in engagements like this is figuring out the needs of the analysts, as well as their general attitude to the training.

To address this I created a “Pre Training Survey” to use with my client. The results from the survey helped me shape the final curriculum. Below I share the actual survey I created in case it helps someone else:

  1. Does R ever frustrate you?
  2. Does R ever intimidate you (e.g. you think “I’ll just never get it”)?
  3. About how many hours a week do you spend using R?
  4. What sort of things do you currently use R for (it’s OK to say “nothing”)?
  5. What sort of things would you like to use R for (it’s OK to say “nothing”)?
  6. It is important to me that the training is beneficial to you. What is one result you would like to get from this training?
  7. During the training I may get a chance to “Pair Program” with some members of your team. Pair Programming simply means working together on a problem at the same computer. During this time you would get to ask me any questions that you like. Are you interested in pair programming during the training?

Below are a series of technical questions. They are not designed to trick you. Rather, the answers will help me know where to focus my efforts during the training. 

It is perfectly fine to leave these questions blank, or to simply say “I don’t know.” Please do not use google or R to answer any of these questions.

Explain the difference between these two lines of code:

[code lang=”r”]
x = 1
x = "1"

What do these commands do?

[code lang=”r”]

Consider this R code:

[code lang=”r”]
x = c(1,2,3)

What is the output of each of these commands?

[code lang=”r”]
x > 1

Consider this program:

[code lang=”r”]
x = c(1, 2, 3)
x = x * 2

What is the final value of x?

What is the output of this program?

[code lang=”r”]
x = c(1, 2, 3)
for (i in x) {
  if (x[i] > 1) {
    print(x[i])
  }
}

Consider this data frame:

[code lang=”r”]
a = 1:3
b = 1:3
df = data.frame(a, b)

What is the output of these three commands?

[code lang=”r”]

What does this program output to the console?

[code lang=”r”]
square = function(x) {
  x * x
}
square(2)


Datasets for Building a Data Analysis Portfolio

I recently had the pleasure of attending the 2017 Association of Public Data Users (APDU) Conference.

My favorite part of the conference was talking to people who work with federal data on a daily basis. Overall I found people to be passionate about their work and eager to share information about it.

I know many of my readers are looking for interesting datasets to use in their portfolios, so I decided to publish a list of some of the most interesting datasets I learned about.

IRS Statistics

One of the most enjoyable conversations I had was with Kevin Pierce, an economist with the IRS. When I first learned where Kevin worked, I wanted to run away. However, we wound up having a fascinating conversation about IRS data.

Kevin works in the IRS’s Statistics of Income (SOI) program. As far as I can tell, SOI data is the highest quality data on US income that’s available. This is simply because all Americans are required to accurately file their taxes every year.

I was surprised to learn that the SOI publishes a great deal of this data. As an example, here is their page dedicated to data from Form 1040. They also aggregate this data by State, County and ZIP Code, so it is possible to map the data.
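Because the data is aggregated by county and ZIP code, it can be fed straight into choroplethr once it is shaped into the two-column format the package expects (a sketch; df_soi here is a hypothetical data frame you would build from the SOI downloads):

```r
library(choroplethr)

# choroplethr expects a data.frame with a "region" column (county FIPS
# codes) and a "value" column (the statistic to map).
# df_soi is hypothetical - build it from the SOI county files.
county_choropleth(df_soi, title = "IRS SOI Income Data by County")
```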

Kevin works on the IRS’s migration reports. Because the IRS knows everyone’s address and income each year, they can analyze migration and the financial impact it has.

Any portfolio that focuses on income data from the IRS is sure to get a lot of attention!

Vital Statistics

Charles Rothwell, the Director of the National Center for Health Statistics, appeared on a panel titled “Federal Statistical Agency Leadership”. Charles is a gifted public speaker and I really enjoyed his presentation.

Charles works with “Vital Statistics”, which involves counting births and deaths. Normally I would shy away from a dataset like this. But as Charles pointed out, this data is necessary if you want to understand the opioid epidemic that the US is currently facing.

A portfolio that focused on using this dataset to explore the opioid epidemic would be fascinating to read.

Labor Statistics

Michael Dalton, a research economist at the Bureau of Labor Statistics (BLS), spoke on a panel about the Role of Commercial Firms in Public Data. I found his case studies to be very interesting, and after his talk we chatted for a bit. I asked him which BLS statistics he thought would be good for a data analysis student who is interested in employment data. He had several recommendations:

These statistics will tell you the types of jobs that people in the US have, as well as the amount that people in those occupations earn. If I had the time, I’d love to analyze the growth in the number of tech workers in the Bay Area over time.

Of course, BLS also releases statistics on Unemployment. (Note that I have already packaged up some of that data in the rUnemploymentData package (1, 2)).

Michael also recommended checking out IPUMS, which is a resource I also heard about at the ACS Data Users Conference earlier this year.

Energy Data

I also had the pleasure of meeting Chip Berry of the Energy Information Administration (EIA). Chip manages the Residential Energy Consumption Survey (RECS). I was not previously aware of the EIA, and it turns out that they have a ton of interesting data. For example, they have real-time information about energy supply and demand nationwide. They also know the location of each and every energy production facility in the US.

As I write this, much of Florida is still without power due to Hurricane Irma. If you were interested in researching this (or any other energy-related topic), this data would be a great place to start.

Closing Thoughts

In my experience, the more specialized a portfolio is, the easier it is for the portfolio to get traction. Each of the datasets I link to above could easily form the cornerstone of a successful data-related portfolio.

Upcoming talk at the Association of Public Data Users (APDU) Conference

On Thursday, September 14th I will be giving a lightning talk at the Association of Public Data Users (APDU) Conference in Alexandria, Virginia. The talk will be on choroplethr, my suite of R packages for mapping open datasets.

This talk is part of my effort to bridge the worlds of free analytical software (e.g. R) and public datasets.

If you are attending the conference then I hope to see you there!

You can learn more about the APDU conference here.

