How Much Python Do Data Scientists Need To Know?

A few months ago I wrote a blog post about Polars, a new dataframe library in Python that is both incredibly fast and easy to use. While the post was well-received, I did get some push-back on it. The criticism normally looked like this:

“Even if everything you say about Polars is true, you still need to figure out a way to get fluent with Pandas. Because Pandas is everywhere in the Python world!”

I took this feedback to heart and found a course that helped me get more comfortable with Pandas. Taking this course had a surprising side effect: it caused me to change my opinion about how proficient Data Scientists need to be as Software Engineers.

Pandas in 30 Days

My new recommended resource for learning Pandas is Pandas in 30 Days from Data School. This free course is taught by Kevin Markham, who founded Data School. It is structured as 30 short videos, and I went through one video each day over the course of a month.

Kevin is a talented teacher and each lesson is concise. The material is presented in a logical order, and everything is relevant to doing your first analysis. Kevin uses a handful of fun datasets throughout the course. I followed along with the videos by typing the commands into a Jupyter notebook on my own machine, and recommend others do so too.

Each lesson ends with the same exercise: download a dataset from Kaggle and apply that day’s lesson to the dataset. Prior to this course I had heard about Kaggle but never used it myself. This exercise is pretty simple but I still found it helpful.

If I could change one thing about the course it would be to make the exercises more specific and complex. If instead of “download a dataset” the course pointed us to a specific dataset, then Kevin could also provide a solution that we could compare our work to. It would also allow the exercises to build on each other and become more complex as the course went on.

Python Prerequisites for Pandas in 30 Days

When I first started learning Python I asked several people “How much Python do I need to know before learning Pandas?” I never got a good answer. Since some people reading this likely have the same question, I will at least try to answer it here.

I previously thought that you should know at least the “basics” of Python before learning Pandas (e.g. lists, tuples, dicts and how to write functions). Now my answer is “Maybe none? I’m not sure.”

In order to complete Pandas in 30 Days you will need some basic proficiency with Jupyter Notebooks (which isn’t Python per se), import statements and variable assignment. I believe that the lesson on creating DataFrames used both a list and dict, so knowing what they are prior to the course might help. But given how limited their use was, you might be able to pick up what you need during the course.
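To give a sense of how small that prerequisite set is, here is a minimal sketch that uses only an import statement, variable assignment, a dict and a couple of lists. It is not taken from the course; the column names and values are made up for illustration.

```python
import pandas as pd  # import statement

# Variable assignment: a dict whose values are lists.
# The names and numbers here are invented for illustration.
data = {
    "name": ["a", "b", "c"],
    "score": [1, 2, 3],
}

# Build a DataFrame from the dict: keys become columns, list items become rows.
df = pd.DataFrame(data)

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first few rows
```

If you can read this snippet, you probably have most of the Python the paragraph above describes.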

Compare this to the syllabus of Intro Python: Fundamentals, the first Python course I took. I recommend this course to anyone who wants to start learning Python. But as I review the syllabus, and reflect on what I needed to complete Pandas in 30 Days, I just don’t see a lot of overlap.

Python for Junior Data Scientists

If you don’t need to know much Python to start learning Pandas, then when (if ever) do Data Scientists need to start learning more about the language and software development in general? While I don’t have a great answer to this question, I now believe that my previous answer (“Data Scientists should aim to know as much about software development as junior Software Engineers”) was wrong.

In my last job I helped build internal tools for a team of data scientists and also served as a technical trainer / mentor for the data scientists. Our junior data scientists generally did not have a background in computer science. Because of this they often wrote code that, while correct, was difficult to reuse and maintain.

I used to think this was a problem, and worked with many of them to improve the quality of their code. But I eventually realized that this was not as big a problem as I thought it was. Writing code was only one of their responsibilities, and the code they wrote often did not need to be reused. In short, they could excel as data scientists while not excelling at software development. At this company we used R, but I think the same would have held true if we had used Python instead.

Python for Senior Data Scientists

A few months ago I took an Advanced Python Objects course and had an epiphany: “Python’s object system is why it became more popular than R. It’s just so extensible. You can easily create your own objects and have them elegantly work with the core language.” I mentioned this to a senior data scientist I know and they (a) had no idea what I was talking about and (b) had no interest in learning more. They considered this information to be irrelevant to their career.

It might be true that most data scientists don’t need to know about Properties, the Context Manager Protocol and the Iterator Protocol. But these language features allow you to build really nice solutions to the problems that you face. For example, the Pandas codebase itself uses Properties, the Context Manager Protocol and the Iterator Protocol. If you’re willing to view Pandas as a project created by and for data scientists, then at least some Data Scientists enjoy taking their knowledge of Python to a high level and using it to build reusable software. And given how popular Pandas is (over 270 million downloads last month), clearly there’s some value in doing this.
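For readers who have not met these three features, here is a rough sketch of what they look like on a small custom object. The `Dataset` class is hypothetical and is not taken from the Pandas codebase; it only shows how your own objects can plug into `with` blocks, `for` loops and built-ins such as `sum`.

```python
class Dataset:
    """Hypothetical toy container, written only to illustrate the protocols."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.is_open = False

    @property
    def size(self):
        # Property: computed on access, but read like an attribute (ds.size).
        return len(self._rows)

    def __enter__(self):
        # Context Manager Protocol: runs when the `with` block starts.
        self.is_open = True
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs when the `with` block ends, even if an exception was raised.
        self.is_open = False
        return False  # do not suppress exceptions

    def __iter__(self):
        # Iterator Protocol: lets `for`, `sum`, `sorted`, etc. loop over the object.
        return iter(self._rows)


with Dataset([3, 1, 2]) as ds:
    inside = ds.is_open   # True while inside the block
    total = sum(ds)       # built-ins work because of __iter__
    ordered = sorted(ds)

print(ds.size, inside, ds.is_open, total, ordered)
```

The design point is that none of the calling code needs to know anything about `Dataset` internals; it just uses ordinary Python syntax.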

I am interested in hearing from more Data Scientists about the incentives they face around both building reusable software (as opposed to doing one-off analyses) and improving their technical skills. Feel free to contact me if you’d like to share your experience with this.

Conclusion

If you are looking to get started with Pandas then I recommend Data School’s course Pandas in 30 Days. While I can’t say how familiar you need to be with Python prior to taking the course, I don’t think that the bar is very high. For reference, I believe that the main book for getting started in R (R for Data Science) does not assume that you have any prior experience with either R or software development. The same might be true here.

The course caused me to reconsider how much Python and Software Engineering expertise Data Scientists need, or should aspire to. Perhaps because I entered the field from a background in software development, I used to think that the bar was quite high. This was reinforced by lots of data scientists asking me to help them improve their technical skills.

However, I now think that this might have been a sampling bias. Not only do I come from a software development background, I also enjoy teaching. This might have led me to meet the subset of data scientists who want to improve their technical skills. I’ve recently met many Data Scientists who excel at their job and have no need for (or interest in) building reusable software.

Comments?

While I’ve disabled comments on my blog, a primary reason I write is to connect with people who are interested in my articles. Feel free to contact me if you’d like to talk to me about this post.

Ari Lamstein

I’m a software engineer who focuses on data projects.

I most recently worked as a Staff Data Science Engineer at a marketing analytics consultancy. While there I developed internal tools for our data scientists, ran workshops on data science and mentored data scientists on software engineering.
