4 Pandas Anti-Patterns to Avoid and How to Fix Them
pandas is a powerful data analysis library with a rich API that offers multiple ways to perform any given data manipulation task. Some of these approaches are better than others, and pandas users often learn suboptimal coding practices that become their default workflows. This post highlights four common pandas anti-patterns and outlines a complementary set of techniques that you should use instead.*
Pandas Anti-Patterns
For illustrative examples of good and bad pandas patterns, I'm using this Netflix dataset from Kaggle, which characterises almost 6,000 Netflix shows and movies with respect to 15 features spanning various data types.
import pandas as pd
# https://www.kaggle.com/datasets/victorsoeiro/netflix-tv-shows-and-movies
df = pd.read_csv("titles.csv")
print(df.sample(5, random_state=0))
print(df.shape)
# returns:
# id title type description release_year age_certification runtime genres production_countries seasons imdb_id imdb_score imdb_votes tmdb_popularity tmdb_score
# 1519 ts38761 Another Miss Oh SHOW Two women working in the same industry with th... 2016 TV-MA 69 ['drama', 'fantasy', 'romance', 'comedy'] ['KR'] 1.0 tt5679572 7.9 1769.0 22.672 8.2
# 4942 ts225657 Halston SHOW American fashion designer Halston skyrockets t... 2021 TV-MA 47 ['drama'] ['US'] 1.0 tt10920514 7.5 14040.0 21.349 7.3
# 895 tm34646 Sisterakas MOVIE A man takes revenge on his sister by hiring he... 2012 NaN 110 ['drama', 'comedy'] ['PH'] NaN tt2590214 5.2 286.0 2.552 4.9
# 5426 ts301609 Love Is Blind: Brazil SHOW The dating experiment comes to Brazil as local... 2021 TV-MA 56 ['romance', 'reality'] ['BR'] 1.0 tt15018224 6.1 425.0 5.109 6.4
# 2033 ts56038 Dave Chappelle SHOW Comedy icon Dave Chappelle makes his triumphan... 2017 NaN 60 ['comedy', 'documentation'] ['US'] 1.0 tt6963504 8.7 2753.0 2.962 7.6
# (5806, 15)
The four pandas anti-patterns we'll cover are:
- Mutating instead of chaining
- Using for loops with pandas DataFrames
- Overusing .apply in place of np.select, np.where, and .isin
- Using incorrect data types
Anti-Pattern #1: Mutating instead of chaining
Most practitioners first learn data processing in pandas by sequentially mutating DataFrames as a series of distinct, line-by-line operations. There are a few reasons why excessive mutation of pandas DataFrames can cause problems:
- It wastes memory by creating global variables (this is especially true if you create a differently named DataFrame at each step)
- The code is cumbersome and difficult to read
- It's liable to bugs - particularly in notebooks where the ordering of data manipulation steps may not be strongly enforced
- It often produces the irritating and notoriously confusing SettingWithCopyWarning
And no, passing the inplace=True parameter to the operation does not help.
The example below contrasts a workflow based on mutation with an equivalent implementation that leverages chaining. Here we're performing various data transformation and column creation operations to explore the hypothesis that TV shows with more seasons have higher audience ratings than those with fewer seasons.
df = pd.read_csv("titles.csv")
# Mutation - DON'T DO THIS
df_bad = df.query("runtime > 30 & type == 'SHOW'")
df_bad["score"] = df_bad[["imdb_score", "tmdb_score"]].sum(axis=1)
df_bad = df_bad[["seasons", "score"]]
df_bad = df_bad.groupby("seasons").agg(["count", "mean"])
df_bad = df_bad.droplevel(axis=1, level=0)
df_bad = df_bad.query("count > 10")
# Chaining - DO THIS
df_good = (df
.query("runtime > 30 & type == 'SHOW'")
.assign(score=lambda df_: df_[["imdb_score", "tmdb_score"]].sum(axis=1))
[["seasons", "score"]]
.groupby("seasons")
.agg(["count", "mean"])
.droplevel(axis=1, level=0)
.query("count > 10")
)
# returns:
# count mean
# seasons
# 1.0 835 13.064671
# 2.0 189 14.109524
# 3.0 83 14.618072
# 4.0 41 14.887805
# 5.0 38 15.242105
# 6.0 16 15.962500
Chaining transforms a DataFrame according to a multi-step procedure all at once. This guarantees the full and proper application of each pandas method, mitigating the risk of bugs. The code is more readable, with each line cleanly representing a distinct operation (note: many Python code formatters will destroy this structure; wrap your pandas code blocks with '# fmt: off' and '# fmt: on' to prevent this). Chaining will feel natural to R users familiar with the magrittr %>% operator.
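For instance, here's a minimal sketch of wrapping a chained block in those formatter directives so that Black (or a similar formatter) leaves the line-per-operation layout intact (df_example is just an illustrative name):
# fmt: off
df_example = (df
    .query("type == 'SHOW'")
    .assign(score=lambda df_: df_[["imdb_score", "tmdb_score"]].sum(axis=1))
)
# fmt: on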
The pandas .pipe method
Occasionally, you'll need to perform complex data manipulation processes that can't be cleanly implemented using off-the-shelf chaining methods. This is where pandas' .pipe method can be used to abstract away complex DataFrame transformations into separately defined functions.
In the example below, we convert a string column containing lists of country codes into three separate columns for the first three individual country codes.
df = pd.read_csv("titles.csv")
def split_prod_countries(df_):
# split `production_countries` column (containing lists of country
# strings) into three individual columns of single country strings
dfc = pd.DataFrame(df_["production_countries"].apply(eval).to_list())
dfc = dfc.iloc[:, :3]
dfc.columns = ["prod_country1", "prod_country2", "prod_country3"]
return df_.drop("production_countries", axis=1).join(dfc)
df_pipe = df.pipe(split_prod_countries)
print(df["production_countries"].sample(5, random_state=14))
# returns:
# 3052 ['CA', 'JP', 'US']
# 1962 ['US']
# 2229 ['GB']
# 2151 ['KH', 'US']
# 3623 ['ES']
print(df_pipe.sample(5, random_state=14).iloc[:, -3:])
# returns:
# prod_country1 prod_country2 prod_country3
# 3052 CA JP US
# 1962 US None None
# 2229 GB None None
# 2151 KH US None
# 3623 ES None None
The .pipe method allows us to incorporate this complex conversion into our chaining workflow without complicating the pipeline code.
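For example, here's a minimal sketch of how the split_prod_countries function slots into a longer chain (the surrounding steps are purely illustrative):
df_chain = (pd.read_csv("titles.csv")
    .query("type == 'MOVIE'")
    .pipe(split_prod_countries)
    .assign(score=lambda df_: df_[["imdb_score", "tmdb_score"]].sum(axis=1))
)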
FAQ for method chaining and using .pipe
How can I debug chained pandas methods?
As a starting point, try simply commenting out selected lines (i.e. methods) from the chained code. For more sophisticated debugging techniques, Matt Harrison has written about this comprehensively in Chapter 35 of Effective Pandas.
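As a simple illustration, here's a sketch of the earlier chain with its later steps temporarily commented out so you can inspect an intermediate stage:
df_debug = (df
    .query("runtime > 30 & type == 'SHOW'")
    .assign(score=lambda df_: df_[["imdb_score", "tmdb_score"]].sum(axis=1))
    # [["seasons", "score"]]
    # .groupby("seasons")
    # .agg(["count", "mean"])
)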
What if I want to view a DataFrame midway through the chain?
You can insert this piped function that uses IPython.display.display to show the DataFrame at an intermediary stage without interrupting the rest of the chain: .pipe(lambda df_: display(df_) or df_)
What if I want to store a DataFrame from midway through the chain as its own variable?
You can insert a piped function that stores the intermediate DataFrame as a side effect, without interrupting the rest of the chain; see the sketch below.
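One minimal sketch, assuming you're happy to stash the intermediate result in a dictionary as a side effect (checkpoints is just an illustrative name):
checkpoints = {}
df_result = (df
    .query("runtime > 30 & type == 'SHOW'")
    .pipe(lambda df_: checkpoints.update(after_query=df_.copy()) or df_)
    .assign(score=lambda df_: df_[["imdb_score", "tmdb_score"]].sum(axis=1))
)
# checkpoints["after_query"] holds the DataFrame as it was after the .query step;
# dict.update returns None, so `or df_` passes the unchanged DataFrame onwards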
Anti-Pattern #2: Using for loops with pandas DataFrames
The use of for loops in pandas is a code smell that should always be eliminated. This includes pandas' built-in generator methods DataFrame.iterrows() and DataFrame.itertuples(). There are two reasons to avoid looping in pandas:
- It's overly verbose, cumbersome, and incompatible with chaining. Alternative approaches can achieve the same result more succinctly, whilst being chainable.
- Looping through rows individually is slow. Vectorised operations are more performant - particularly when operating on numeric type columns.
Suppose we wanted to use our newly created prod_country1 column to create another column indicating whether prod_country1 is in the top 3, 10, or 20 countries by count of occurrences. The code sample below shows a suboptimal for loop approach, along with a cleaner .apply implementation.
df = pd.read_csv("titles.csv").pipe(split_prod_countries)
# obtain country ranks
vcs = df["prod_country1"].value_counts()
top3 = vcs.index[:3]
top10 = vcs.index[:10]
top20 = vcs.index[:20]
# Looping - DON'T DO THIS
vals = []
for ind, row in df.iterrows():
country = row["prod_country1"]
if country in top3:
vals.append("top3")
elif country in top10:
vals.append("top10")
elif country in top20:
vals.append("top20")
else:
vals.append("other")
df["prod_country_rank"] = vals
# df[col].apply() - DO THIS
def get_prod_country_rank(country):
if country in top3:
return "top3"
elif country in top10:
return "top10"
elif country in top20:
return "top20"
else:
return "other"
df["prod_country_rank"] = df["prod_country1"].apply(get_prod_country_rank)
print(df.sample(5, random_state=14).iloc[:, -4:])
# returns:
# prod_country1 prod_country2 prod_country3 prod_country_rank
# 3052 CA JP US top10
# 1962 US None None top3
# 2229 GB None None top3
# 2151 KH US None other
# 3623 ES None None top10
This .apply pattern is a flexible approach that produces code that's easy to understand, and has the benefit of being chainable. Rather than iterating over a DataFrame with a for loop, .apply can achieve the same result more elegantly.
However, .apply is not a vectorised operation; under the hood, it still loops through the rows of the DataFrame. As such, unless you're manipulating object type (typically string) columns, there are more performant approaches that should be used instead. This leads us to anti-pattern #3...
Anti-Pattern #3: Overusing .apply in place of np.select, np.where, and .isin
Once pandas practitioners learn about .apply, they often end up applying it everywhere. This isn't always a problem, as the .apply approach produces coherent code and performs adequately on modestly-sized datasets.
With large datasets and compute-intensive calculations, however, .apply can be orders of magnitude slower than more efficient, vectorised workflows. This article nicely demonstrates the superiority of optimised approaches over .apply.
The np.select + .isin combination
One useful pattern to use instead is the combination of np.select and .isin. Here's how it looks for our country rank example.
import numpy as np
df = pd.read_csv("titles.csv").pipe(split_prod_countries)
def get_prod_country_rank(df_):
vcs = df_["prod_country1"].value_counts()
return np.select(
condlist=(
df_["prod_country1"].isin(vcs.index[:3]),
df_["prod_country1"].isin(vcs.index[:10]),
df_["prod_country1"].isin(vcs.index[:20]),
),
choicelist=("top3", "top10", "top20"),
default="other"
)
df = df.assign(prod_country_rank=lambda df_: get_prod_country_rank(df_))
The performance improvement isn't as dramatic as that achieved in the article above, as we're operating on object type columns, and much of the computation time is associated with the value counts calculation common to both approaches. Nonetheless, this is still ~2x faster than the .apply implementation. If the value counts calculation is placed inside the function used with .apply, it becomes 500x slower than piping np.select + .isin instead!
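If you want to verify comparisons like these on your own data, a rough benchmarking sketch using timeit might look like the following (rank_with_apply and rank_with_select are hypothetical names for the .apply-based and np.select-based implementations shown above; exact timings will vary by machine):
import timeit
# rank_with_apply / rank_with_select stand in for the two implementations defined earlier
t_apply = timeit.timeit(lambda: df["prod_country1"].apply(rank_with_apply), number=10)
t_select = timeit.timeit(lambda: rank_with_select(df), number=10)
print(f".apply: {t_apply:.3f}s, np.select + .isin: {t_select:.3f}s")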
In addition to the performance advantage, including the value counts calculation inside the piped np.select + .isin function enables this transformation to be performed as a chained operation with no global dependencies. Note that the np.select conditions list behaves equivalently to an if/elif/else block, in that it selects based on the first condition that returns True.
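A tiny illustration of that first-match behaviour, using made-up values:
x = pd.Series([1, 5, 15])
print(np.select(condlist=[x < 3, x < 10], choicelist=["small", "medium"], default="large"))
# returns: ['small' 'medium' 'large'] - the value 1 satisfies both conditions but
# takes the label of the first; 15 satisfies neither and falls through to the default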
The np.where option
An alternative option for cases where there are only two choices is np.where. For instance, suppose we knew there was a bug with IMDB's scores, and we needed to subtract one from all reported scores for shows/movies released after 2016. We could perform that as follows.
df = df.assign(
adjusted_score=lambda df_: np.where(
df_["release_year"] > 2016, df_["imdb_score"] - 1, df_["imdb_score"]
)
)
Unless you're manipulating object type columns, .apply will be less performant than other approaches. Instead, consider using np.select, np.where, and .isin to perform efficient, vectorised transformations.
Confusingly, pandas' own .where method (Series.where / DataFrame.where) behaves slightly differently to np.where, but can also be useful if you want to retain the values in a column subject to a certain condition but replace everything else (e.g. keep values if they're within the top 5 value counts and set everything else to 'other').
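A minimal sketch of that top-5 idea, using the prod_country1 column from earlier (the new column name is just illustrative):
df = df.assign(
    prod_country1_grouped=lambda df_: df_["prod_country1"].where(
        df_["prod_country1"].isin(df_["prod_country1"].value_counts().index[:5]),
        other="other",
    )
)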
Anti-Pattern #4: Using incorrect data types
Optimising the data types for each column in a pandas DataFrame will improve performance and memory usage. When working with large datasets, significant gains can be made by shrinking the default float64 and int64 data types to smaller equivalents, such as float16 and int8, for columns where this doesn't result in data loss.
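As a hedged sketch of what that shrinking might look like on this dataset, using pd.to_numeric's downcast option (the chosen columns are just examples; downcast picks the smallest dtype that can represent the values):
df = pd.read_csv("titles.csv")
df_small = df.assign(
    release_year=lambda df_: pd.to_numeric(df_["release_year"], downcast="integer"),
    imdb_votes=lambda df_: pd.to_numeric(df_["imdb_votes"], downcast="float"),
)
print(df_small[["release_year", "imdb_votes"]].dtypes)
# release_year typically shrinks to a small integer type and imdb_votes to float32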
However, the most egregious data type mismatch worth eliminating from your pandas code is using strings instead of categoricals. Converting a low cardinality column of categorical data from its default object type to a category type often achieves memory usage improvements of 100x and computation speed-ups of 10x. The code sample below demonstrates how this conversion can be performed within a chained workflow.
df = pd.read_csv("titles.csv")
df = df.assign(
age_certification=lambda df_: df_["age_certification"].astype("category")
)
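As a quick, hedged check of the saving on this column (df_raw is a fresh read of the CSV; exact numbers will vary):
df_raw = pd.read_csv("titles.csv")
as_object = df_raw["age_certification"].memory_usage(deep=True)
as_category = df_raw["age_certification"].astype("category").memory_usage(deep=True)
print(f"object: {as_object:,} bytes, category: {as_category:,} bytes")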
Convert low cardinality object types to category, where appropriate.
This presentation exemplifies some of the benefits of using optimised data types.
Conclusion
In this article, I've shown four pandas anti-patterns and the alternative approaches you should adopt instead. The code sample below illustrates how these best practices can be combined into a coherent workflow. This particular example calculates the mean adjusted score of shows, grouped by the prevalence rank of their first production country.
import pandas as pd
import numpy as np
df = pd.read_csv("titles.csv")
def split_prod_countries(df_):
dfc = pd.DataFrame(df_["production_countries"].apply(eval).to_list())
dfc = dfc.iloc[:, :3]
dfc.columns = ["prod_country1", "prod_country2", "prod_country3"]
return df_.drop("production_countries", axis=1).join(dfc)
def get_prod_country_rank(df_):
vcs = df_["prod_country1"].value_counts()
return np.select(
condlist=(
df_["prod_country1"].isin(vcs.index[:3]),
df_["prod_country1"].isin(vcs.index[:10]),
df_["prod_country1"].isin(vcs.index[:20]),
),
choicelist=("top3", "top10", "top20"),
default="other",
)
def get_adjusted_score(df_):
return np.where(
df_["release_year"] > 2016, df_["imdb_score"] - 1, df_["imdb_score"]
)
(df
.query("runtime > 30 & type == 'SHOW'")
.pipe(split_prod_countries)
.assign(
imdb_score=lambda df_: get_adjusted_score(df_),
score=lambda df_: df_[["imdb_score", "tmdb_score"]].sum(axis=1),
prod_country_rank=lambda df_: get_prod_country_rank(df_),
rank=lambda df_: df_["prod_country_rank"].astype("category")
)
[["rank", "score"]]
.groupby("rank")
.agg(["count", "mean"])
.droplevel(axis=1, level=0)
.sort_values("mean", ascending=False)
)
# returns:
# count mean
# rank
# top10 37 15.232432
# other 1104 12.824819
# top3 78 12.624359
# top20 20 10.775000
Adopting these practices allows for the complex data transformations and processing to all be conducted in a single chained statement. The code is performant, readable, and simple to maintain and extend. If you're not already coding pandas in this way, I recommend giving it a try!
* This post draws inspiration from various sources - particularly Matt Harrison's excellent Effective Pandas book. Other references are linked throughout the article.