Using the Pandas apply function to add columns to DataFrames

This post originally appeared at AccessibleAI.dev on November 12th, 2022.

Pandas is a wonderful library for manipulating tabular data with Python. Out of the box, Pandas offers many ways of adding, removing, and updating columns and rows, but sometimes you need a bit more power.

In this article we’ll explore the apply function and show how it can be used to run an operation against every row (or column) in your DataFrame - and why you might want to do that.

Why would you need Pandas apply?

In Pandas it’s fairly easy to add a new column to a DataFrame or update an existing one:

import pandas as pd

# Assumes df is an existing DataFrame with a release_date column
# Add a release_month column calculated from the existing release_date column
df['release_month'] = pd.DatetimeIndex(df['release_date']).month

Just by using the indexer on the DataFrame we can add or update a column to have a new value for every row in the DataFrame.
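To make that concrete, here is a minimal runnable sketch of the same assignment. The sample titles and dates are made up for illustration; only the release_date and release_month column names come from the snippet above.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the article
df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "release_date": ["2019-03-15", "2020-11-02", "2021-07-30"],
})

# Parse the strings as dates, then pull out the month for every row at once
df["release_month"] = pd.DatetimeIndex(df["release_date"]).month

print(df)
```

Every row gets its value from one expression, with no explicit loop.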

However, this approach works best when the new value can be written as a single expression on the right of the assignment operator. Logic that needs loops, branching, or parsing quickly becomes hard to express this way.

Thankfully, the apply function exists on Pandas DataFrames and lets us run custom functions for every row.

Applying Python Functions to DataFrame Rows using Apply

We can use the Pandas apply function to apply a single function to every row (or column) in a DataFrame. This allows us to run complex calculations and use those calculations to set column values.

For example, let’s say we had a DataFrame with a keyword_json column containing some JSON representing tags. We might want to parse this JSON and generate a comma separated value list of keywords. This list of keywords could then be set into a keywords column.

First, we declare an extract_keywords function that can be called for every row:

import json

def extract_keywords(row):
    """
    Takes in a row, reads the JSON describing keywords from its
    keyword_json column, and stores a comma-separated list of the
    keyword names in a new keywords column.
    """

    # Grab our JSON for the keywords we want to process
    data = row['keyword_json']
    # additional JSON cleaning logic omitted for brevity

    # Parse the JSON into a list of dictionaries, one per keyword
    loaded_keywords = json.loads(data)

    # Join the keyword names with commas (no trailing comma)
    keywords = ','.join(item['name'] for item in loaded_keywords)

    # Add the keywords column with the final calculated string
    # If keywords already existed, its value would be replaced
    row['keywords'] = keywords

    # Return the modified row
    return row

Next, we call apply on our Pandas DataFrame to invoke that function once per row.

Important Note: By default apply will operate on each column instead of each row, so we specify axis=1 to work with rows instead.
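A small sketch of the difference between the two axes, using a made-up numeric DataFrame (the column names a and b are placeholders, not part of the keyword example):

```python
import pandas as pd

# Hypothetical data used only to show what the function receives
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Default (axis=0): the function receives each column as a Series
column_totals = df.apply(lambda col: col.sum())

# axis=1: the function receives each row as a Series
row_totals = df.apply(lambda row: row.sum(), axis=1)

print(column_totals)
print(row_totals)
```

With the default axis we get one result per column; with axis=1 we get one result per row.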

df = df.apply(extract_keywords, axis=1)

This calls the function once per row and builds a new DataFrame from the rows the function returns.

Like most Pandas DataFrame operations, apply does not modify the original DataFrame. It returns a new one instead, which is why we assign the result back to df.
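Here is a self-contained sketch of the whole flow on a tiny, made-up DataFrame. The sample rows are hypothetical, and the function is a compact version of extract_keywords. The result is stored in a separate variable so we can see that the original DataFrame is left alone.

```python
import json
import pandas as pd

# Hypothetical sample data; only the column names follow the article
df = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "keyword_json": [
        '[{"name": "space"}, {"name": "robots"}]',
        '[{"name": "heist"}]',
    ],
})

def extract_keywords(row):
    # Compact version of the article's function
    names = [item["name"] for item in json.loads(row["keyword_json"])]
    row["keywords"] = ",".join(names)
    return row

# Keep the result separate so the original can be inspected afterwards
result = df.apply(extract_keywords, axis=1)

print(list(df.columns))             # the original has no keywords column
print(result["keywords"].tolist())  # the new DataFrame does
```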

Closing Thoughts

The apply function can be slow on large DataFrames because it calls a Python function once per row, but it gives you a lot of power to perform complex operations on your dataset.
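When the calculation only needs one column, one option worth considering is calling apply on that column alone and assigning the result. This is a variation on the approach above, not a replacement for it, and keywords_from_json is a hypothetical helper name. The function then receives a single cell value rather than a whole row, which typically avoids some per-row overhead.

```python
import json
import pandas as pd

# Hypothetical sample data; only the column names follow the article
df = pd.DataFrame({
    "keyword_json": [
        '[{"name": "space"}, {"name": "robots"}]',
        '[]',
    ],
})

def keywords_from_json(data):
    # Hypothetical helper: receives one cell value, not a whole row
    return ",".join(item["name"] for item in json.loads(data))

# Series.apply calls the helper once per value in the column
df["keywords"] = df["keyword_json"].apply(keywords_from_json)

print(df["keywords"].tolist())
```

It is still one Python call per row, so it is worth measuring on your own data before relying on it.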

Additionally, storing complex logic in functions instead of trying to do everything inline can improve the readability of your code. Improving readability usually improves maintainability, so this can be a very good thing.

While I always try to avoid apply if I can, the apply function can solve a large number of problems for you as you perform feature engineering and data wrangling in Python code using Pandas.

Author

Matt Eland

Microsoft MVP in AI, Author of "Refactoring with C#"

Matt Eland is a software engineering leader and data scientist who has served as a senior engineer, software engineering manager, professional programming instructor, and has helped build enterprise-level software at a variety of organizations before distinguishing himself as a Microsoft MVP in Artificial Intelligence by using technology to accomplish ridiculous things in the name of science and teaching others. Matt makes it his job to learn new things and share them with others through articles, videos, and talks at user groups and conferences covering a wide range of topics from software architecture to programming topics to artificial intelligence and data science. Matt is a current data analytics master's student, an AI Specialist at Leading EDJE, is the author of "Refactoring with C#" and is creating a LinkedIn course and book on Computer Vision on Azure. Matt occasionally sleeps as well.