Author: Ari Lamstein

“A Guide to Working With Census Data in R” is now Complete!

Two weeks ago I mentioned that I was clearing my calendar until I finished writing A Guide to Working with Census Data in R. Today I’m happy to announce that the Guide is complete!

I even registered a dedicated domain for the project: RCensusGuide.info.

The Guide is designed to address two problems that people who work with R and US Census Data face:

  1. Understanding what data the Census Bureau publishes.
  2. Understanding what CRAN packages are available to help with their project.

The Guide details the five most popular datasets that the Census Bureau publishes. It also describes the seven most popular R packages for working with Census data.

The best part? The Guide is free, and lists ways for you to continue learning about both the datasets and R packages that it mentions.

A special thank you to the R Consortium for funding this project!

I hope that you enjoy reading the Guide!

Read the Guide

Update on the R Consortium Census Working Group

It’s been a while since I shared any information about the R Consortium’s Census Working Group.

Today I’d like to share an update on three projects which the group is involved in.

The project which you might already be familiar with is the “Guide to Working with Census Data in R”. This is my survey paper covering the most popular datasets that the Census Bureau publishes, the R packages that facilitate working with Census data, and resources for learning more about both. I am planning to complete this document in the next two weeks.

The project which launched this working group is my video course on using Choroplethr to visualize data from the American Community Survey (ACS). The course will be free and hosted on the Census Bureau’s website. It is already complete, and we’re in the final stages of getting the Bureau’s approval to publish it there. I expect this to happen by the end of the year.

A new project which we are considering taking on is a webinar series highlighting authors of R packages that work with Census data. This would give us a chance to create a library of educational material about using R to work with Census data.

If you have any questions or feedback about this, please contact me.

Update on R Consortium Census Guide

As I mentioned in July, my proposal to the R Consortium to create a Guide to Working with Census Data in R was accepted. Today I’d like to share an update on the project.

The proposal breaks the creation of the guide into four milestones. So far Logan Powell and I have completed two of those milestones:

  • Milestone 1: Select publication technology. We have decided to use GitHub Pages to publish the Guide.
  • Milestone 2: Assemble list of resources to include. We have assembled three types of resources to include:
    • A list of R packages that work with Census Data.
    • A description of the five most popular datasets which the Census Bureau publishes.
    • A description of various ways that the Census Bureau provides training and support.

The next Milestone is to convert this list of resources into a single, coherent document. Before doing this, though, we want to give the broader R and Census communities a chance to review the resources we plan to include. We’re particularly interested to know if we’ve missed anything important, or if we’re planning to include anything that is broadly considered to be superfluous.

If you have a minute, please review what we plan to include. If you have feedback, you can provide it by filing an issue in the project’s GitHub repository.

Review the Guide

Notes from the 2018 APDU Conference

I recently attended the 2018 Association of Public Data Users (APDU) conference. This was my second time attending the conference, and I enjoyed learning more about how other people use federal statistics. Some of my favorite presentations were:

  • Julia Lane on her experience teaching Big Data to social scientists and government workers.
  • The Bureau of Economic Analysis (BEA) on their R package.
  • Aaron Klein on the value of policymakers getting realtime data.

Julia Lane and the Coleridge Initiative

As an “R guy” I’m normally the odd man out at APDU. While many attendees do work with data, they tend to use GUI-based tools such as Excel and Tableau. I’ve often wondered if any of the attendees would benefit from learning programming language-based data tools such as R or Python.

It turns out that Julia is the author of an entire book on this topic: Big Data and Social Science! She is also a director of the Coleridge Initiative, which provides Python-based data analysis training to government workers and social scientists.

Julia spoke about her experience with these projects, and the results seemed very positive!

The Bureau of Economic Analysis (BEA) has an R package

While I mostly blog about data that the Census Bureau publishes, APDU is a great reminder of how many other statistical agencies there are. An example is the Bureau of Economic Analysis (BEA) which, among other things, calculates Gross Domestic Product (GDP) statistics.

BEA was a sponsor of the conference, and I got to chat with one of the people running their booth. I was surprised to learn that BEA has published their own R package: bea.R. This is the first time that I have heard of a government agency developing their own R package!

The person I spoke with mentioned that BEA’s Chief Innovation Officer, Jeff Chen, is a big proponent of R. You can learn more about the BEA here.

I think that it would be interesting to extend Choroplethr to work with data from the BEA.

Aaron Klein on Policymakers Getting Realtime Data

Aaron Klein, a former Treasury official, spoke about the value of policy makers getting realtime data. Aaron worked in Treasury during the foreclosure crisis, and spoke about the challenges policymakers faced in responding to it. One issue was quantifying the impact that foreclosures and abandoned homes have on broader communities.

He recently wrote a research paper that attempted to answer this question: Understanding the True Costs of Abandoned Properties: How Maintenance Can Make a Difference. One statistic from the talk left a big impression on me: vacant residential buildings account for one of every 14 residential building fires in America. When you consider that only a small portion of residential homes are vacant, this statistic is truly startling.

Having data like this at the start of the foreclosure crisis might have improved how policymakers responded to it.

R Consortium Proposal Accepted!

Today I am happy to announce that my proposal to the R Consortium was accepted!

I first announced that I was submitting a proposal back in March. As a reminder, the proposal has two distinct parts:

  1. Creating a guide to working with US Census data in R.
  2. Creating an R Consortium Working Group focused on US Census Data.

If you’d like to read the proposal in its entirety, you can do so here.

I am currently planning to develop the “Census Guide” in public using GitHub. If you’d like to follow along with the development, you can do so by visiting the GitHub repository and clicking the “watch” button:

View the GitHub Repository

Census Academy Update!

Last year I announced a collaboration with the Census Bureau to help with their new online training platform. This website, called “Census Academy”, is designed to help the public learn how to work with data that the Census Bureau publishes.

Census Academy recently had a soft launch, and there is already a lot of content on it. I’ve marked up an image of the homepage below to draw attention to three sections: Data Gems, Courses and “Join the Community”.

Data Gems

Data Gems are short instructional videos created by experts at the Census Bureau.

A recent Data Gem, for example, explained how ZIP Code Tabulation Areas (a geography published by the Census Bureau) differ from ZIP Codes (which are published by the Postal Service). This is a question that I am frequently asked, and I found that the Data Gem gave a great answer!

Courses

My biggest contribution to Census Academy has been the creation of a course on visualizing data from the American Community Survey (ACS) in R. This course, which will be published soon, can be thought of as an in-depth extension of my existing course Learn to Map Census Data in R. I will make an announcement here when the course becomes available!

Join the Community

Census Academy just started, and new training material is being added all the time. I think that any R programmer with an interest in demographic data can benefit from Census Academy!

To get notified when Census Academy adds new training material, click the “Join the Community” button on the Census Academy homepage.

Visit Census Academy

New Version of Choroplethr.com!

Last year I decided to register the domain Choroplethr.com. I used it to host information about Choroplethr, my suite of R packages for mapping demographic statistics. A year and a half later, I’ve made the first significant change to the site!

The old version of Choroplethr.com was structured as a series of essays that demonstrated how to use specific parts of each package. After collecting data on how people were interacting with the site, as well as talking to hundreds of Choroplethr users, I decided to update the site.

The new site has four sections:

  1. Free Course. The most popular introduction to Choroplethr continues to be Learn to Map Census Data in R, my free course on using Choroplethr.
  2. Packages and Documentation. This section is new: it contains a description of each of the five packages that make up Choroplethr, as well as documentation on each function and object in each package.
  3. Courses. This section describes the two advanced courses I created, and how they can help Choroplethr users. These courses are Mapmaking in R with Choroplethr and Shapefiles for R Programmers.
  4. Support. This section provides information on how to get support when using Choroplethr.

I hope that you’ll take a moment to visit the new version of Choroplethr.com and let me know what you think!

Visit Choroplethr.com

Choroplethr v3.6.2 is now on CRAN

A new version of Choroplethr is now on CRAN! You can get it by typing the following from the R Console:
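    # Install the latest release of choroplethr from CRAN
    install.packages("choroplethr")

    # Verify that the installed version is 3.6.2 or later
    packageVersion("choroplethr")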

I encourage all Choroplethr users to install this new version!

Overview

A new version of ggplot2 will be released shortly. It introduces many new features, but these come at a cost: some packages which use ggplot2 (such as Choroplethr) are not compatible with the new version.

In fact, if you try to draw a map with the new version of ggplot2 and the old version of Choroplethr, you might get an error. This new version of Choroplethr (v3.6.2) fixes those issues.

Technical Details

Choroplethr was one of the first R packages (if not the first) to create maps of the US where Alaska and Hawaii are rendered as insets. Here’s an example map, along with code:
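    # A representative example (the sample data frame df_pop_state ships with
    # choroplethr; the original post may have used a different dataset):
    library(choroplethr)

    data(df_pop_state)
    state_choropleth(df_pop_state, title = "2012 State Population Estimates")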

The new version of ggplot2 was causing Choroplethr to crash whenever it attempted to render the above map. In order to understand the underlying issue, it is helpful to first understand how Choroplethr creates the above map.

Insets and ?annotation_custom

The most important thing to notice about the above map is this: it is inaccurate.

Alaska and Hawaii are not really located where they appear. And they are not drawn to scale, either.

Here is how the 50 US states “really” look:

In order to render Alaska and Hawaii as insets, Choroplethr:

  1. First pre-renders the Continental US, Alaska and Hawaii as separate ggplot objects.
  2. Then affixes the Alaska and Hawaii ggplot objects to the continental US object using ggplot2’s ggplotGrob and annotation_custom functions.

Insets and Themes

The two images differ in more than just the placement of Alaska and Hawaii, though. The first map also has a clear background, whereas the second map has a grey background and tick marks for longitude and latitude. Also, Choroplethr uses a single legend for all three maps (the legends for Alaska and Hawaii are suppressed).

Traditionally Choroplethr has implemented these aesthetic tweaks in the functions theme_clean (for the Continental US) and theme_inset (for Alaska and Hawaii). The code for these functions originally came from Section 13.19 (“Making a Map with a Clean Background”) of Winston Chang’s R Graphics Cookbook. Here’s what that code used to look like:
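    # Approximate reconstruction of the pre-3.6.2 functions, following the
    # R Graphics Cookbook recipe; the exact originals may differ slightly.
    library(ggplot2)
    library(grid)  # for unit()

    theme_clean = function(base_size = 12)
    {
      theme_grey(base_size) %+replace%
        theme(
          axis.title        = element_blank(),
          axis.text         = element_blank(),
          panel.background  = element_blank(),
          panel.grid        = element_blank(),
          axis.ticks.length = unit(0, "cm"),
          plot.margin       = unit(c(0, 0, 0, 0), "lines"),
          complete          = TRUE
        )
    }

    theme_inset = function(base_size = 12)
    {
      theme_clean(base_size) %+replace%
        theme(legend.position = "none")
    }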

The Bug

The bug was happening when Choroplethr was attempting to affix the map of Alaska, with theme_inset applied to it, to the map of the Continental US:
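    # Simplified sketch of the failing step. The data subsetting and inset
    # coordinates here are illustrative, not choroplethr's exact internals.
    library(ggplot2)
    library(choroplethrMaps)  # provides state.map, a fortified map of the 50 states
    data(state.map)

    continental = state.map[!state.map$region %in% c("alaska", "hawaii"), ]
    alaska      = state.map[state.map$region == "alaska", ]

    continental.ggplot = ggplot(continental, aes(long, lat, group = group)) +
      geom_polygon(fill = "grey70", color = "black") +
      theme_clean()

    alaska.ggplot = ggplot(alaska, aes(long, lat, group = group)) +
      geom_polygon(fill = "grey70", color = "black") +
      theme_inset()

    # Affix the Alaska inset to the continental map
    continental.ggplot +
      annotation_custom(ggplotGrob(alaska.ggplot),
                        xmin = -120, xmax = -110, ymin = 23, ymax = 30)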

The last line in the above snippet was causing Choroplethr to crash. It appears that the combination of theme_inset, ggplotGrob and annotation_custom was causing a problem.

The Fix

It appears that ggplot2 has a new function for creating a clear background on a map that does not cause these problems. That function is theme_void. Modifying theme_clean and theme_inset to use theme_void solved the problem. Here is the new version of those functions:
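    # The updated functions, rebuilt on top of theme_void() (close to, though
    # not necessarily identical to, the v3.6.2 source)
    theme_clean = function(base_size = 12)
    {
      theme_void(base_size)
    }

    theme_inset = function(base_size = 12)
    {
      theme_void(base_size) %+replace%
        theme(legend.position = "none")
    }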

New Version of ggplot2

I just received a note from Hadley Wickham that a new version of ggplot2 is scheduled to be submitted to CRAN on June 25. Here’s what choroplethr users need to know about this new version of ggplot2.

Choroplethr Update Required

The new version of ggplot2 introduces bugs into choroplethr. In particular, choroplethr does not pass R CMD check when the new version of ggplot2 is loaded. I am planning to submit a new version of choroplethr to CRAN that addresses this issue prior to June 25.

This change is the third or fourth time I’ve had to update choroplethr in recent years as a result of changes to ggplot2. This experience reminds me of one of the first lessons I learned as a software engineer: “More time is spent maintaining old software than writing new software.”

Simple Features Support

One of the most common questions I get about choroplethr is whether I intend to add support for interactive maps. My answer has always been “I can’t do that until ggplot2 adds support for Simple Features.” Thankfully, this new version of ggplot2 introduces that support!
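To give a sense of what that support looks like, here is a minimal sketch (generic ggplot2 and sf code, not choroplethr code) that plots a Simple Features object directly with the new geom_sf geom, using the North Carolina shapefile that ships with the sf package:

    library(sf)
    library(ggplot2)

    # Read a Simple Features object from the shapefile bundled with sf,
    # then plot it directly with the new geom_sf geom
    nc = st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

    ggplot(nc) +
      geom_sf(aes(fill = BIR74)) +  # BIR74 is a column in the sample data
      theme_void()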

Currently all maps in the choroplethr ecosystem are stored as ggplot2 “fortified” dataframes. This is a format unique to ggplot2. Storing the maps in this format allows them to be rendered as quickly as possible. The downsides are that:

  1. ggplot2 does not support interactive graphics.
  2. The main interactive mapping library on CRAN (leaflet) cannot render fortified data frames. It can only render maps stored as Shapefiles or Simple Features.

Once ggplot2 adds support for Simple Features, I can begin work on adding interactive map support to choroplethr. The first steps will likely be:

  1. Updating choroplethr to be able to render maps stored in the Simple Features format.
  2. Migrating choroplethr’s existing maps to the Simple Features format.

After that, I can start experimenting with adding interactive graphics support to choroplethr.

Note that Simple Features is not without its drawbacks. In particular, many users are reporting performance problems when creating maps using Simple Features and ggplot2. I will likely not begin this project until these issues have been resolved.

Thoughts on the CRAN Ecosystem

This issue has caused me to reflect a bit about the stability of the CRAN ecosystem. 

ggplot2 is used by about 1,700 packages. It’s unclear to me how many of these packages will have similar problems as a result of this change to ggplot2. And of the impacted packages, how many have maintainers who will push out a new version before June 25?

And ggplot2, of course, is just one of many packages on CRAN. This issue has the potential to occur whenever any package on CRAN is updated.

This issue reminded me that CRAN has a complex web of dependencies, and that package maintainers are largely unpaid volunteers. It seems like a situation where bugs can easily creep into an end user’s code.

My Proposal to the R Consortium

I recently submitted a Proposal to the R Consortium. I decided to share this proposal on my blog for two reasons:

  1. To raise awareness of an opportunity I see in the R ecosystem.
  2. To raise awareness of the R Consortium as a funding source for R projects.

I first learned about the R Consortium when I attended the Boston EARL Conference in 2015. When I attended the San Francisco EARL Conference last year there was a session dedicated to projects the Consortium had funded. I was very impressed, and this year decided to submit a proposal myself!

I have no idea if this proposal will be accepted. Nonetheless, I hope that publishing it raises awareness of this relatively new funding source, and my own thoughts on how the R ecosystem can improve.

Below is a verbatim copy of the proposal.


Proposal to Create an R Consortium Working Group Focused on US Census Data

The Problem

R users who wish to work with US Census Data face two significant problems: Package Selection and Data Navigation.

Package Selection

An R user looking for packages to help with a project on US Census Data would likely start by going to CRAN’s list of contributed packages and doing a search for “Census”. This process is non-optimal for a number of reasons:

  1. It yields 41 hits, which is a large number to sort through.
  2. Many package titles appear to be duplicative (e.g. censusr’s title “Collect Data from the Census API” is very similar to censusapi’s title of “Retrieve Data from the Census APIs”).
  3. Some packages that deal with US Census Data do not have the word “Census” in their name or title (e.g. the choroplethr package).

For these reasons, even an experienced R user might find it challenging to determine which R package, if any, can help them with their Census-related project.

Note that these issues might also lead to package duplication on CRAN. That is, a developer might create a package to solve a problem which an existing package already solves.

Data Navigation

People often turn to the Census Bureau for answers to questions such as “How many people live in a certain region?” They are often surprised that the answer depends on which data product they query. That is, many R users begin their interaction with Census Data not fully understanding the array of data available to them, or when it is better to use one data product over another.

The Plan

We would like to create an R Consortium Working Group to foster greater cooperation between the US Census Bureau and the R community. We hope that the working group will become a place where statisticians working with census data can cooperate under the guidance of the Census Bureau.

The first project we would like the Working Group to take on is the creation of a guide to getting started with Census data in R (“The Guide”). The Guide would address the problems listed above by:

  1. Listing, describing and linking to packages currently available via CRAN that work with US Census Data.
  2. Listing, describing and linking to reference materials for better understanding and working with Census data.

The most likely failure mode for the Package Selection section is not including all the relevant packages, or not communicating the differences between the packages in a way that helps users decide which package is most appropriate for their use. At this point we do not know whether the Guide should simply copy the CRAN package description, or also include additional information. (And if it should include additional information, what information should that be?) We plan to address this risk by publishing and publicizing drafts of our work online, and incorporating feedback from the R community.

The most likely failure mode for the Data Navigation section is not providing resources which are useful or relevant to the actual needs of the R community. In the same way that CRAN has a wealth of packages that can be difficult to navigate, the Census Bureau also has a wealth of training material that can be difficult to navigate. We plan to address this by publishing and publicizing drafts of our work online, and incorporating feedback from the R community.

Another failure mode which we do not address in this proposal is maintenance of the Guide. While the Guide might be completely accurate at the time of publication, it will naturally become less accurate over time. At this point it is not clear what the best way to maintain the Guide is.

The Team

Ari Lamstein. Ari is an R Trainer and Consultant who is currently working as a Data Science Instructor at the US Census Bureau. He has written several R packages that deal with US Census Data. Ari is planning to focus on the Package Selection portion of the Guide.

Logan T Powell. Logan is the Developer Experience and Engagement Lead at the US Census Bureau. Logan is planning to work on the Data Navigation portion of the Guide.

Zach Whitman, PhD. Zach is the Chief Data Officer at the US Census Bureau. Zach is planning to work on future projects related to changes in the Census API.

Kyle Walker, PhD. Kyle is Associate Professor of Geography at Texas Christian University. Kyle is the primary author of the tigris, tidycensus, and idbr R packages for working with Census Bureau spatial and tabular data in R.

Project Milestones

Milestone 1: Select Publication Technology (1 Month, $500)

Our first task will be selecting technology to use to publish the Guide. We would like the technology to be free and easy to access, as well as free to host and easy to update. We are planning to start this phase by evaluating GitHub, GitHub Pages and WordPress.

This milestone will be completed when there is a “Hello World” version of the Guide published, and both authors understand the workflow for editing it.

Milestone 2: Assemble list of resources to include (1 Month, $1,000)

The initial package and resource lists will be based on our personal knowledge and experience. The lists will be stored in GitHub so that other people can contribute to them. That GitHub repository will then be publicized via our blogs and social media accounts.

This milestone will be completed when we have the final list of Package and Training resources that we plan to include in the guide.

Milestone 3: Complete Publication of Guide (1 Month, $2,000)

After we have the final list of resources we plan to include in the Guide, we will need to write up the results.

This milestone will be completed once the final draft of the Guide is published.

We plan to announce completion of this milestone on our blogs and social media accounts.

Milestone 4: Complete Dissemination (1 Month, $500)

Once the Guide is completed, we will focus on disseminating it to the largest possible audience.

We will start by simply announcing the Guide’s completion on our blogs and social media accounts.

We will also reach out to the Census Bureau, which has already indicated an interest in linking to it from their website.

We also believe that CRAN might want to link to the Guide on the Official Statistics & Survey Methodology Task View. However, we have not yet spoken to the Task View maintainer about this. If the maintainer thinks that this project is not a perfect fit, then we are open to creating a separate task view dedicated to US Census Statistics.

How Can The ISC Help

We are seeking to create an ISC Working Group to promote using R to analyze US Census Data. The individual who will be committed to leading and managing the group’s activities is Ari Lamstein.

We are requesting a $4,000 grant to fund completion of the Working Group’s first project: a Guide to getting started with Census Data in R (“The Guide”). The Project Milestones section contains a breakdown of funding by milestone.

The Census Bureau is currently planning changes to its API. We hope that a future project for the Working Group will be to get feedback on the API from the R Community.

In addition to creating a Working Group and financially supporting the creation of the Guide, we believe that the R Consortium can help this project to succeed by facilitating cooperation among stakeholders and disseminating and promoting the Guide on the R Consortium website.

Dissemination

We plan to publish the Guide under the Creative Commons Attribution 4.0 License.

As indicated in the Project Milestones Section, we plan to include the community in the drafting of the Guide, and we plan to publicize project milestones on our blogs and social media accounts.

The Census Bureau has also indicated an interest in linking to the Guide on its website once it is completed.

We also hope to speak with the maintainer of CRAN’s Official Statistics & Survey Methodology Task View about including a link to the Guide. If the maintainer thinks that this project is not a perfect fit, then we are open to creating a separate task view dedicated to US Census Statistics.
