Make Pandas transformations readable

Published: 2021-02-15 | Updated: 2022-04-11

This whole little post may seem like a ridiculous nitpick, but I’ve read a lot of pandas code “in the wild”, and I’ve seen things ಠ_ಠ. Formatting is a tiny, easy way to make that code literally more readable. The actual advice is at the end: use a code formatter.


Pandas is a powerful tool for manipulating tabular data. With its concise syntax, we can build complex pipelines of transformations, often in a single line of code. But not everything that can be a one-liner should be a one-liner.

Suppose we’re working with the iris data set, and we’d like to calculate the average petal length within each species of flower. Our code might look something like this.

import pandas as pd
import seaborn as sns

iris = sns.load_dataset("iris")

mean_petal_lengths = iris.groupby("species").agg(mean_petal_lengths=("petal_length", "mean")).reset_index()

I much prefer to split such code over multiple lines. Here’s how I would format the above.

mean_petal_lengths = (
    iris
    .groupby("species")
    .agg(mean_petal_lengths=("petal_length", "mean"))
    .reset_index()
)

Each step of the transformation has its own line. This takes a little more space (hey, space in your source code costs nothing!), but makes the sequence of operations super easy to parse.

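  • Line 1 (the first line inside the parentheses), begin with the DataFrame iris.
  • Line 2, group that DataFrame by "species".
  • Line 3, aggregate the "petal_length" column by taking its mean within each group.
  • Line 4, reset the index so that "species" becomes an ordinary column again.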
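To see what each of those steps produces, here is a minimal sketch that runs them one at a time. It assumes a tiny hand-made DataFrame with made-up values standing in for the real iris data (so nothing needs to be downloaded), and the intermediate names grouped and aggregated are mine, purely for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the iris data: a few made-up rows with the
# same two columns the post uses.
iris = pd.DataFrame(
    {
        "species": ["setosa", "setosa", "virginica", "virginica"],
        "petal_length": [1.0, 2.0, 5.0, 6.0],
    }
)

# Begin with the DataFrame and group it by species.
grouped = iris.groupby("species")

# Named aggregation: the result has one row per species, and species
# is now the index rather than a column.
aggregated = grouped.agg(mean_petal_lengths=("petal_length", "mean"))

# Move species out of the index and back into a regular column.
mean_petal_lengths = aggregated.reset_index()

print(aggregated.index.name)             # species
print(list(mean_petal_lengths.columns))  # ['species', 'mean_petal_lengths']
```

Naming every intermediate result like this is another way to stay readable, but for a short pipeline I find the single chain easier to follow.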

The chain reads like a little recipe. Notice that wrapping it in parentheses means we don’t need to end each line with a backslash (\).
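For comparison, here is a rough sketch of the same chain written both ways. The small DataFrame is made up for the example, and the variable names with_backslashes and with_parentheses are just labels I picked.

```python
import pandas as pd

# Hypothetical stand-in for the iris data, with made-up values.
iris = pd.DataFrame(
    {
        "species": ["setosa", "setosa", "virginica", "virginica"],
        "petal_length": [1.0, 2.0, 5.0, 6.0],
    }
)

# Backslash continuation: every line except the last needs a trailing
# backslash, and nothing (not even a comment) may follow it.
with_backslashes = iris \
    .groupby("species") \
    .agg(mean_petal_lengths=("petal_length", "mean")) \
    .reset_index()

# Parentheses: no continuation characters, and comments are allowed.
with_parentheses = (
    iris
    .groupby("species")  # a comment here is fine
    .agg(mean_petal_lengths=("petal_length", "mean"))
    .reset_index()
)

# Both spellings build exactly the same DataFrame.
print(with_backslashes.equals(with_parentheses))  # True
```

The parenthesised form is also more forgiving: a stray space after a backslash is a syntax error, whereas whitespace inside parentheses is simply ignored.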

The example above is nice and short, and I still prefer the version that’s split across lines. Method chains can grow much larger than this, and making them easier on the eyes only becomes more worthwhile as they do.
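As an illustration of a longer chain, here is a sketch under some assumptions of my own: the data is a small made-up table, and the derived petal_area column, the 1.5 cutoff, and the output names are all hypothetical, chosen only to give the chain more steps.

```python
import pandas as pd

# Hypothetical stand-in for the iris data, with made-up values.
iris = pd.DataFrame(
    {
        "species": [
            "setosa", "setosa",
            "versicolor", "versicolor",
            "virginica", "virginica",
        ],
        "petal_length": [1.4, 1.6, 4.0, 4.6, 5.5, 6.1],
        "petal_width": [0.2, 0.4, 1.0, 1.4, 2.0, 2.4],
    }
)

petal_summary = (
    iris
    # Derive a rough petal area from length and width.
    .assign(petal_area=lambda df: df["petal_length"] * df["petal_width"])
    # Keep only the flowers with longer petals (arbitrary cutoff).
    .query("petal_length > 1.5")
    .groupby("species")
    .agg(
        mean_petal_area=("petal_area", "mean"),
        n_flowers=("petal_area", "count"),
    )
    # Largest average area first.
    .sort_values("mean_petal_area", ascending=False)
    .reset_index()
)

print(petal_summary)
```

Try reading that as a single line and the case for one step per line more or less makes itself.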


There’s a notable exception to hand-formatting for readability. If you’re using a formatter like black, you don’t have to think about this at all. In fact, you don’t get to. It will keep the code on a single line until the line length limit (88 characters by default) is reached, then reformat it into something similar to the second form (though putting a little more on each line). I’d prefer that it split lines sooner, but black is a special case: you give up all control over formatting in exchange for not having to think about it.