
Using the Pandas apply function to add columns to DataFrames

Running Python functions on every row in a DataFrame

by Matt Eland

This post originally appeared at AccessibleAI.dev on November 12th, 2022.

Pandas is a wonderful library for manipulating tabular data with Python. Out of the box, Pandas offers many ways of adding, removing, and updating columns and rows, but sometimes you need a bit more power.

In this article we’ll explore the apply function and show how it can be used to run an operation against every row (or column) in your DataFrame - and why you might want to do that.

Why would you need Pandas apply?

In Pandas it’s fairly easy to add a new column to a DataFrame or update an existing one:

# Add a release_month column calculated from the existing release_date column
df['release_month'] = pd.DatetimeIndex(df['release_date']).month

Just by using the indexer on the DataFrame we can add or update a column to have a new value for every row in the DataFrame.
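To make this concrete, here is a small runnable sketch. The sample titles and dates are made up for illustration; only the release_date and release_month column names come from the snippet above.

```python
import pandas as pd

# Hypothetical sample data, invented purely for this illustration
df = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'release_date': ['2021-03-15', '2019-11-01', '2022-07-04'],
})

# Adding a column: release_month does not exist yet, so it gets created
df['release_month'] = pd.DatetimeIndex(df['release_date']).month

# Updating a column: title already exists, so its values are overwritten
df['title'] = df['title'].str.upper()

print(df)
```

Both assignments use the same indexer syntax. Whether a column is added or updated depends only on whether the name already exists.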

However, this approach only works well when the new value can be written as a single expression on the right of the assignment operator. Logic that needs loops, branching, or parsing quickly becomes awkward to express that way.

Thankfully, the apply function exists on Pandas DataFrames and lets us run custom functions for every row.

Applying Python Functions to DataFrame Rows using Apply

We can use the Pandas apply function to apply a single function to every row (or column) in a DataFrame. This allows us to run complex calculations and use those calculations to set column values.

For example, let’s say we had a DataFrame with a keyword_json column containing some JSON representing tags. We might want to parse this JSON and generate a comma separated value list of keywords. This list of keywords could then be set into a keywords column.

First, we declare an extract_keywords function that can be called for every row:

import json

def extract_keywords(row):
    """
    This function takes in a row, gets some JSON representing keywords out of
    its keyword_json column, and then builds a comma-separated list of values
    that gets set into a new keywords column.
    """

    # Grab our JSON for the keywords we want to process
    data = row['keyword_json']
    # additional JSON cleaning logic omitted for brevity

    # Parse the JSON and collect the name of each keyword
    loaded_keywords = json.loads(data)
    names = [item['name'] for item in loaded_keywords]

    # Add the keywords column with the comma-separated names
    # (join avoids the trailing comma that repeated concatenation would leave)
    # If keywords already existed, its value would be replaced
    row['keywords'] = ','.join(names)

    # Return the modified row
    return row

Next, we call apply on our Pandas DataFrame to invoke that function once per row.

Important Note: By default apply will operate on each column instead of each row, so we specify axis=1 to work with rows instead.
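The difference between the two axis values is easiest to see on a tiny DataFrame. This is a minimal sketch with made-up column names (a and b) that simply sums whatever Series the function receives:

```python
import pandas as pd

# Hypothetical two-column DataFrame used only to show the axis behavior
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 (the default): the function receives each column as a Series
column_totals = df.apply(lambda col: col.sum())

# axis=1: the function receives each row as a Series
row_totals = df.apply(lambda row: row.sum(), axis=1)

print(column_totals)  # one value per column, indexed by column name
print(row_totals)     # one value per row, indexed by row label
```

With the default axis the result has one entry per column, while axis=1 gives one entry per row. Forgetting axis=1 in a row-oriented function usually shows up as a KeyError, because the function tries to look up a column name inside a Series that represents a single column.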

df = df.apply(extract_keywords, axis=1)

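This calls the function once per row and builds a new DataFrame from the rows the function returns. Because every returned row now carries a keywords entry, the resulting DataFrame gains a keywords column.

Like most operations in Pandas, apply does not modify the original DataFrame but returns a new one instead, which is why we assign the result back to df.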
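Putting the pieces together, the whole flow might look like the sketch below. The sample titles and JSON strings are invented for illustration, and the function is a compact variant of extract_keywords that skips the omitted cleaning logic and joins names without a trailing comma.

```python
import json
import pandas as pd

def extract_keywords(row):
    """Compact variant for this example: parse keyword_json and set keywords."""
    loaded_keywords = json.loads(row['keyword_json'])
    row['keywords'] = ','.join(item['name'] for item in loaded_keywords)
    return row

# Hypothetical sample data; real keyword JSON may need cleaning first
original = pd.DataFrame({
    'title': ['Movie A', 'Movie B'],
    'keyword_json': [
        '[{"id": 1, "name": "space"}, {"id": 2, "name": "robots"}]',
        '[{"id": 3, "name": "heist"}]',
    ],
})

# apply returns a new DataFrame; original is left without a keywords column
result = original.apply(extract_keywords, axis=1)

print(result[['title', 'keywords']])
print('keywords' in original.columns)
```

Keeping the result in a separate variable here makes it easy to confirm that the original DataFrame is unchanged. In everyday code, reassigning to the same variable is usually fine.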

Closing Thoughts

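The apply function is fairly slow, since with axis=1 it calls your Python function once for every row instead of using Pandas' built-in column operations. On large DataFrames that cost adds up, but in exchange apply gives you a lot of power to do complex operations on your dataset.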
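When the calculation only needs a single column, one lighter-weight option is to call apply on that column (a Series) rather than on the whole DataFrame. The sketch below assumes a hypothetical keywords_from_json helper and made-up sample data. Actual timings depend on your data, so it is worth measuring before committing to either form.

```python
import json
import pandas as pd

def keywords_from_json(data):
    """Hypothetical helper: turn one JSON string into comma-separated names."""
    return ','.join(item['name'] for item in json.loads(data))

# Hypothetical sample data, invented for this illustration
df = pd.DataFrame({
    'title': ['Movie A', 'Movie B'],
    'keyword_json': [
        '[{"name": "space"}, {"name": "robots"}]',
        '[{"name": "heist"}]',
    ],
})

# Row-wise: Pandas builds a Series for each row before calling the function
row_wise = df.apply(lambda row: keywords_from_json(row['keyword_json']), axis=1)

# Column-only: the function receives each cell value directly
column_only = df['keyword_json'].apply(keywords_from_json)

# Both approaches produce the same values
df['keywords'] = column_only
print(df[['title', 'keywords']])
```

The row-wise form remains the right tool when the function genuinely needs several columns from the same row.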

Additionally, storing complex logic in functions instead of trying to do everything inline can improve the readability of your code. Improving readability usually improves maintainability, so this can be a very good thing.

While I always try to avoid apply if I can, the apply function can solve a large number of problems for you as you perform feature engineering and data wrangling in Python code using Pandas.

Author

  • Matt Eland
    Microsoft MVP in AI, Professional Programming Instructor

    After several decades as a software engineer and engineering manager, Matt now serves as a software engineering instructor and gets to raise up future developers and unleash them upon the world to build awesome things. Matt is a Microsoft MVP in Artificial Intelligence, runs several blogs and channels on data science and software engineering topics, is currently pursuing a master's degree in data analytics, and helps organize the Central Ohio .NET Developer Group while contributing to local and regional conferences. In his copious amounts of spare time, Matt continues to build nerdy things and looks for ways to share them with the larger community.
