Polars apply performance for custom functions

I've enjoyed with Polars significant speed-ups over Pandas, except one case. I'm newbie to Polars, so it could be just my wrong usage. Anyway here is the toy-example: on single column I need to apply custom function in my case it is parse from probablypeople library () but problem is generic.

Plain pandas apply has similar runtime like Polars, but pandas with parallel_apply from () gets speed-up proportional to number of cores.

It looks for me that Polars uses only single core for custom functions,or I miss something?

If I use Polars correctly, maybe there is a possibility to create tool like pandaralell for Polars?

!pip install probablepeople
!pip install pandarallel
import pandas as pd
import probablepeople as pp
import polars as pl
from pandarallel import pandarallel
AMOUNT = 1000_000
#Pandas:
df = pd.DataFrame({'a': ["Mr. Joe Smith"]})
df = df.loc[df.index.repeat(AMOUNT)].reset_index(drop=True)
df['b'] = df['a'].apply(pp.parse)
#Pandarallel:
pandarallel.initialize(progress_bar=True)
df['b_multi'] = df['a'].parallel_apply(pp.parse)
#Polars:
dfp = pl.DataFrame({'a': ["Mr. Joe Smith"]})
dfp = dfp.select(pl.all().repeat_by(AMOUNT).explode())
dfp = dfp.with_columns(pl.col('a').apply(pp.parse).alias('b'))

1 Answer

pandarallel uses multiprocessing.

You could also use multiprocessing with polars.

We're using multiprocessing.pool.Pool.imap with the default settings as an example.

multiprocessing: Understanding logic behind `chunksize`

import multiprocessing
import polars as pl
import probablepeople as pp
from pip._vendor.rich.progress import track
def parallel_apply(function, column): with multiprocessing.get_context("spawn").Pool() as pool: return pl.Series(pool.imap(function, track(column)))
if __name__ == "__main__": df = pl.DataFrame({ "name": ["Mr. Joe Smith", "Mrs. I & II Alice Random"] }) df = df.with_columns(pp = pl.col("name").map_batches(lambda col: parallel_apply(pp.parse, col)) ) print(df)

.map_batches() is used to pass the full column to a function.
The column is passed to parallel_apply along with the Python function to be executed in the Pool e.g. pp.parse
rich.progress.track() (which also comes bundled with pip) is used for a progress bar.

Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
shape: (2, 2)
┌──────────────────────────┬───────────────────────────────────┐
│ name ┆ pp │
│ --- ┆ --- │
│ str ┆ list[list[str]] │
╞══════════════════════════╪═══════════════════════════════════╡
│ Mr. Joe Smith ┆ [["Mr.", "PrefixMarital"], ["Joe… │
│ Mrs. I & II Alice Random ┆ [["Mrs.", "PrefixMarital"], ["I"… │
└──────────────────────────┴───────────────────────────────────┘

Performance

Just as a basic comparison, creating a 1_000_000 row dataframe - I get the following runtimes:

multiprocessing	duration
yes	1m23s
no	5m2s

Velvet Star Monitor

Polars apply performance for custom functions

1 Answer

Performance

Your Answer

Sign up or log in

Post as a guest

Similar Journal

What is the strongest fixed location equipment you can obtain at Level 1?

What's the best strategy to keep the chaos low?

Where do you find slime chunks?

How should I distribute my bonus points with my Paladin at level up?