Reflections on Learning Pandas

One day last year I woke up with the thought: I want to know Python as well as I know R.

I like R, and happily use it every day at work. But I also have the impression that an entire ecosystem in the data world has sprung up around Python. It appears that this ecosystem is popular and somehow different from the ecosystem around R. I want to get hands-on experience with this ecosystem and draw my own conclusions about it.

In an earlier post I wrote about my first steps in this adventure. In short, because I’m friends with a professional Python trainer (Reuven Lerner), I simply took his courses on introductory Python. Importantly, I also signed up for his “Weekly Python Exercises” service, which gave me months of practice with basic Python.

However, none of that material dealt with analyzing data. I was learning how to do things akin to what I did when I worked as a software engineer: use basic data structures, create custom classes and write algorithms. But I never, say, read in a csv file and graphed the data.

It turns out that learning to do things like that in Python is a two-step process. First you have to learn the basics of Python, which I had just done. Then you need to learn a separate Python package called “Pandas”.

Learning Pandas

Panda-kun, from “Polar Bear Cafe”

Pandas is a Python package for working with tabular data. In effect, it lets you do R stuff in Python. The package has two main objects:

  • DataFrame. An object that acts like R’s data.frame class.
  • Series. An object that acts like R’s vector object.

In R, the columns of a data.frame are vectors. In Pandas, the columns of a DataFrame are Series.
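To make the parallel concrete, here is a minimal sketch. The city and temperature values are made up for illustration, and the variable names are my own. It builds a small DataFrame, pulls out one column, and checks that the column is a Series that supports vectorized arithmetic the way an R vector does.

```python
import pandas as pd

# Hypothetical data, invented purely for this illustration.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Cairo"],
    "temp_c": [4.5, 19.0, 28.5],
})

# Selecting a single column returns a Series,
# much like df$temp_c returns a vector in R.
temps = df["temp_c"]
print(type(df).__name__)     # DataFrame
print(type(temps).__name__)  # Series

# Like an R vector, a Series supports vectorized arithmetic:
# the conversion is applied to every element at once.
temps_f = temps * 9 / 5 + 32
print(temps_f.round(1).tolist())
```

One difference worth noting for R users: a Series carries an index (row labels) along with its values, which an R vector does not have in the same way.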

Thankfully Reuven has a course on Pandas. However, I found this course to be frustrating! After a few hours I realized that Reuven’s approach to teaching Pandas mirrored my initial approach to teaching R. Back when I was first learning R I was working as a software engineer and viewed R as “just another programming language”. So when I taught it to my coworkers I focused on concepts in the programming language like vectorization and indexing. Reuven’s Pandas course is structured much like that.

Years later, when I took a “tidyverse train the trainer” workshop with RStudio (now Posit) I was exposed to a completely different way to teach R. This approach focuses on data analysis and has students visualize datasets right from the start. This methodology underpins Wickham and Grolemund’s book R for Data Science and is, I believe, a major reason for the book’s success.

While I recommend Reuven’s Pandas course to anyone who wants to learn Pandas, I do think there is room in the Pandas world for a course structured like R for Data Science. If you are aware of a course like this then please contact me.

Weekly Pandas Practice

After finishing a course in Pandas, my next step was to start tackling real world problems with it. As I mentioned earlier, my goal is to know Python as well as I know R. And while it’s true that I had completed a course on Pandas, I’ve also been programming in R for 10 years, so there is still a discrepancy in skill level.

This, in my opinion, is where Reuven’s approach to technical training really shines. Last year he launched a service called Bamboo Weekly that emails subscribers one analytical problem each week. You’re asked to solve the problem using Pandas, and he sends out a solution the following day.

My favorite part of this service is that the problems are all tied to current events. Last week’s problem, for example, was about aviation accidents. And after Netflix released viewership data we analyzed that. In short, I think that this service captures the unique joy of knowing tools like R and Pandas today: we’re swimming in a sea of interesting data, and these tools let you explore the data yourself, without relying on someone else’s analysis.

I’ve been doing Bamboo Weekly for about two months now. And while I’m still far from fluent in Pandas, I’m confident that if I continue with it then it’s only a matter of time before I get there.