pandas vs Polars vs DuckDB: A Data Scientist’s Guide to Choosing the Right Tool

Image byΒ author

Originally published on codecut.ai

Introduction

pandas has been the standard tool for working with tabular data in Python for over a decade. But as datasets grow larger and performance requirements increase, two modern alternatives have emerged: Polars, a DataFrame library written in Rust, and DuckDB, an embedded SQL database optimized for analytics.

Each tool excels in different scenarios:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Tool β”‚ Backend β”‚ Execution Model β”‚ Best For β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ pandas β”‚ C/Python β”‚ Eager, single-threaded β”‚ Small datasets, prototyping, ML integration β”‚
β”‚ Polars β”‚ Rust β”‚ Lazy/Eager, multi-threaded β”‚ Large-scale analytics, data pipelines β”‚
β”‚ DuckDB β”‚ C++ β”‚ SQL-first, multi-threaded β”‚ SQL workflows, embedded analytics, file queries β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

This guide compares all three tools with practical examples, helping you choose the right one for your workflow.

πŸ’» Get the Code: The complete source code and Jupyter notebook for this tutorial are available on GitHub. Clone it to followΒ along!

Tool Strengths at aΒ Glance

pandas

pandas is the original DataFrame library for Python that excels at interactive data exploration and integrates seamlessly with the ML ecosystem. Key capabilities include:

  • Direct compatibility with scikit-learn, statsmodels, and visualization libraries
  • Rich ecosystem of extensions (pandas-profiling, pandasql, etc.)
  • Mature time series functionality
  • Familiar syntax that most data scientists alreadyΒ know

Polars

Polars is a Rust-powered DataFrame library designed for speed that brings multi-threaded execution and query optimization to Python. Key capabilities include:

  • Speeds up operations by using all available CPU cores byΒ default
  • Builds a query plan first, then executes only what’sΒ needed
  • Streaming mode for processing datasets larger thanΒ RAM
  • Expressive method chaining with a pandas-like API

DuckDB

DuckDB is an embedded SQL database optimized for analytics that brings database-level query optimization to local files. Key capabilities include:

  • Native SQL syntax with full analytical queryΒ support
  • Queries CSV, Parquet, and JSON files directly withoutΒ loading
  • Uses disk storage automatically when data exceeds available memory
  • Zero-configuration embedded database requiring no serverΒ setup

Setup

Install all three libraries:

pip install pandas polars duckdb

Generate sample data for benchmarking:

import pandas as pd
import numpy as np
np.random.seed(42)
n_rows = 5_000_000
data = {
"category": np.random.choice(["Electronics", "Clothing", "Food", "Books"], size=n_rows),
"region": np.random.choice(["North", "South", "East", "West"], size=n_rows),
"amount": np.random.rand(n_rows) * 1000,
"quantity": np.random.randint(1, 100, size=n_rows),
}
df_pandas = pd.DataFrame(data)
df_pandas.to_csv("sales_data.csv", index=False)
print(f"Created sales_data.csv with {n_rows:,} rows")
Created sales_data.csv with 5,000,000 rows

Syntax Comparison

All three tools can perform the same operations with different syntax. Here’s a side-by-side comparison of commonΒ tasks.

Filtering Rows

pandas:
Uses bracket notation with boolean conditions, which requires repeating the DataFrame name for each condition:

import pandas as pd
df_pd = pd.read_csv("sales_data.csv")
result_bracket = df_pd[(df_pd["amount"] > 500) & (df_pd["category"] == "Electronics")]
result_bracket.head()
β”Œβ”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β”‚ category β”‚ region β”‚ amount β”‚ quantity β”‚
β”œβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 7 β”‚ Electronics β”‚ West β”‚ 662.803066 β”‚ 80 β”‚
β”‚ 15 β”‚ Electronics β”‚ North β”‚ 826.004963 β”‚ 25 β”‚
β”‚ 30 β”‚ Electronics β”‚ North β”‚ 766.081832 β”‚ 7 β”‚
β”‚ 31 β”‚ Electronics β”‚ West β”‚ 772.084261 β”‚ 36 β”‚
β”‚ 37 β”‚ Electronics β”‚ East β”‚ 527.967145 β”‚ 35 β”‚
β””β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Alternatively, you can use the query() method, which provides cleaner SQL-likeΒ syntax:

result_query = df_pd.query("amount > 500 and category == 'Electronics'")

However, since query() is string-based, there’s no IDE autocomplete. Complex operations like string methods still require brackets:

result_str = df_pd[df_pd["category"].str.startswith("Elec")]
result_str.head()
β”Œβ”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β”‚ category β”‚ region β”‚ amount β”‚ quantity β”‚
β”œβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 2 β”‚ Electronics β”‚ North β”‚ 450.941022 β”‚ 93 β”‚
β”‚ 6 β”‚ Electronics β”‚ West β”‚ 475.843957 β”‚ 61 β”‚
β”‚ 7 β”‚ Electronics β”‚ West β”‚ 662.803066 β”‚ 80 β”‚
β”‚ 15 β”‚ Electronics β”‚ North β”‚ 826.004963 β”‚ 25 β”‚
β”‚ 21 β”‚ Electronics β”‚ South β”‚ 292.399383 β”‚ 13 β”‚
β””β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Polars:
Unlike pandas, Polars uses one syntax for all filters. The pl.col() expressions are type-safe with IDE autocomplete, and handle both simple comparisons and complex operations like stringΒ methods:

import polars as pl
df_pl = pl.read_csv("sales_data.csv")
result_pl = df_pl.filter(
(pl.col("amount") > 500) & (pl.col("category") == "Electronics")
)
result_pl.head()
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ category β”‚ region β”‚ amount β”‚ quantity β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ str β”‚ str β”‚ f64 β”‚ i64 β”‚
β”‚ β€œElectronics” β”‚ β€œWest” β”‚ 662.803066 β”‚ 80 β”‚
β”‚ β€œElectronics” β”‚ β€œNorth” β”‚ 826.004963 β”‚ 25 β”‚
β”‚ β€œElectronics” β”‚ β€œNorth” β”‚ 766.081832 β”‚ 7 β”‚
β”‚ β€œElectronics” β”‚ β€œWest” β”‚ 772.084261 β”‚ 36 β”‚
β”‚ β€œElectronics” β”‚ β€œEast” β”‚ 527.967145 β”‚ 35 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

DuckDB:
Uses standard SQL with a WHERE clause, which is more readable by those who knowΒ SQL.

import duckdb
result_duckdb = duckdb.sql("""
SELECT * FROM 'sales_data.csv'
WHERE amount > 500 AND category = 'Electronics'
""").df()
result_duckdb.head()
β”Œβ”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β”‚ category β”‚ region β”‚ amount β”‚ quantity β”‚
β”œβ”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 0 β”‚ Electronics β”‚ West β”‚ 662.803066 β”‚ 80 β”‚
β”‚ 1 β”‚ Electronics β”‚ North β”‚ 826.004963 β”‚ 25 β”‚
β”‚ 2 β”‚ Electronics β”‚ North β”‚ 766.081832 β”‚ 7 β”‚
β”‚ 3 β”‚ Electronics β”‚ West β”‚ 772.084261 β”‚ 36 β”‚
β”‚ 4 β”‚ Electronics β”‚ East β”‚ 527.967145 β”‚ 35 β”‚
β””β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Selecting Columns

pandas:
Double brackets return a DataFrame with selectedΒ columns.

result_pd = df_pd[["category", "amount"]]
result_pd.head()
β”Œβ”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β”‚ category β”‚ amount β”‚
β”œβ”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 0 β”‚ Food β”‚ 516.653322 β”‚
β”‚ 1 β”‚ Books β”‚ 937.337226 β”‚
β”‚ 2 β”‚ Electronics β”‚ 450.941022 β”‚
β”‚ 3 β”‚ Food β”‚ 674.488081 β”‚
β”‚ 4 β”‚ Food β”‚ 188.847906 β”‚
β””β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Polars:
The select() method clearly communicates column selection intent.

result_pl = df_pl.select(["category", "amount"])
result_pl.head()
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ category β”‚ amount β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ str β”‚ f64 β”‚
β”‚ β€œFood” β”‚ 516.653322 β”‚
β”‚ β€œBooks” β”‚ 937.337226 β”‚
β”‚ β€œElectronics” β”‚ 450.941022 β”‚
β”‚ β€œFood” β”‚ 674.488081 β”‚
β”‚ β€œFood” β”‚ 188.847906 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

DuckDB:
SQL’s SELECT clause makes column selection intuitive for SQLΒ users.

result_duckdb = duckdb.sql("""
SELECT category, amount FROM 'sales_data.csv'
""").df()
result_duckdb.head()
β”Œβ”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β”‚ category β”‚ amount β”‚
β”œβ”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 0 β”‚ Food β”‚ 516.653322 β”‚
β”‚ 1 β”‚ Books β”‚ 937.337226 β”‚
β”‚ 2 β”‚ Electronics β”‚ 450.941022 β”‚
β”‚ 3 β”‚ Food β”‚ 674.488081 β”‚
β”‚ 4 β”‚ Food β”‚ 188.847906 β”‚
β””β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

GroupBy Aggregation

pandas:
Uses a dictionary to specify aggregations, but returns multi-level column headers that often require flattening before furtherΒ use.

result_pd = df_pd.groupby("category").agg({
"amount": ["sum", "mean"],
"quantity": "sum"
})
result_pd.head()
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β”‚ amount β”‚ β”‚ quantity β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ β”‚ sum β”‚ mean β”‚ sum β”‚
β”‚ Books β”‚ 6.247506e+08 β”‚ 499.998897 β”‚ 62463285 β”‚
β”‚ Clothing β”‚ 6.253924e+08 β”‚ 500.139837 β”‚ 62505224 β”‚
β”‚ Electronics β”‚ 6.244453e+08 β”‚ 499.938189 β”‚ 62484265 β”‚
β”‚ Food β”‚ 6.254034e+08 β”‚ 499.916417 β”‚ 62577943 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Polars:
Uses explicit alias() calls for each aggregation, producing flat column names directly without post-processing.

result_pl = df_pl.group_by("category").agg([
pl.col("amount").sum().alias("amount_sum"),
pl.col("amount").mean().alias("amount_mean"),
pl.col("quantity").sum().alias("quantity_sum"),
])
result_pl.head()
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€python───────┬────────────┬─────────────┬──────────────┐
β”‚ category β”‚ amount_sum β”‚ amount_mean β”‚ quantity_sum β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ str β”‚ f64 β”‚ f64 β”‚ i64 β”‚
β”‚ β€œClothing” β”‚ 6.2539e8 β”‚ 500.139837 β”‚ 62505224 β”‚
β”‚ β€œBooks” β”‚ 6.2475e8 β”‚ 499.998897 β”‚ 62463285 β”‚
β”‚ β€œElectronics” β”‚ 6.2445e8 β”‚ 499.938189 β”‚ 62484265 β”‚
β”‚ β€œFood” β”‚ 6.2540e8 β”‚ 499.916417 β”‚ 62577943 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

DuckDB:
Standard SQL aggregation with column aliases produces clean, flat output ready for downstream use.

result_duckdb = duckdb.sql("""
SELECT
category,
SUM(amount) as amount_sum,
AVG(amount) as amount_mean,
SUM(quantity) as quantity_sum
FROM 'sales_data.csv'
GROUP BY category
""").df()
result_duckdb.head()
β”Œβ”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β”‚ category β”‚ amount_sum β”‚ amount_mean β”‚ quantity_sum β”‚
β”œβ”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 0 β”‚ Food β”‚ 6.254034e+08 β”‚ 499.916417 β”‚ 62577943.0 β”‚
β”‚ 1 β”‚ Electronics β”‚ 6.244453e+08 β”‚ 499.938189 β”‚ 62484265.0 β”‚
β”‚ 2 β”‚ Clothing β”‚ 6.253924e+08 β”‚ 500.139837 β”‚ 62505224.0 β”‚
β”‚ 3 β”‚ Books β”‚ 6.247506e+08 β”‚ 499.998897 β”‚ 62463285.0 β”‚
β””β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Adding Columns

pandas:
The assign() method creates new columns with repeated DataFrame references like df_pd["amount"].

result_pd = df_pd.assign(
amount_with_tax=df_pd["amount"] * 1.1,
high_value=df_pd["amount"] > 500
)
result_pd.head()
β”Œβ”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β”‚ category β”‚ region β”‚ amount β”‚ quantity β”‚ amount_with_tax β”‚ high_value β”‚
β”œβ”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 0 β”‚ Food β”‚ South β”‚ 516.653322 β”‚ 40 β”‚ 568.318654 β”‚ True β”‚
β”‚ 1 β”‚ Books β”‚ East β”‚ 937.337226 β”‚ 45 β”‚ 1031.070948 β”‚ True β”‚
β”‚ 2 β”‚ Electronics β”‚ North β”‚ 450.941022 β”‚ 93 β”‚ 496.035124 β”‚ False β”‚
β”‚ 3 β”‚ Food β”‚ East β”‚ 674.488081 β”‚ 46 β”‚ 741.936889 β”‚ True β”‚
β”‚ 4 β”‚ Food β”‚ East β”‚ 188.847906 β”‚ 98 β”‚ 207.732697 β”‚ False β”‚
β””β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Polars:
The with_columns() method uses composable expressions that chain naturally without repeating the DataFrame name.

result_pl = df_pl.with_columns([
(pl.col("amount") * 1.1).alias("amount_with_tax"),
(pl.col("amount") > 500).alias("high_value")
])
result_pl.head()
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ category β”‚ region β”‚ amount β”‚ quantity β”‚ amount_with_tax β”‚ high_value β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ str β”‚ str β”‚ f64 β”‚ i64 β”‚ f64 β”‚ bool β”‚
β”‚ β€œFood” β”‚ β€œSouth” β”‚ 516.653322 β”‚ 40 β”‚ 568.318654 β”‚ true β”‚
β”‚ β€œBooks” β”‚ β€œEast” β”‚ 937.337226 β”‚ 45 β”‚ 1031.070948 β”‚ true β”‚
β”‚ β€œElectronics” β”‚ β€œNorth” β”‚ 450.941022 β”‚ 93 β”‚ 496.035124 β”‚ false β”‚
β”‚ β€œFood” β”‚ β€œEast” β”‚ 674.488081 β”‚ 46 β”‚ 741.936889 β”‚ true β”‚
β”‚ β€œFood” β”‚ β€œEast” β”‚ 188.847906 β”‚ 98 β”‚ 207.732697 β”‚ false β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

DuckDB:
SQL’s SELECT clause defines new columns directly in the query, keeping transformations readable.

result_duckdb = duckdb.sql("""
SELECT *,
amount * 1.1 as amount_with_tax,
amount > 500 as high_value
FROM df_pd
""").df()
result_duckdb.head()
β”Œβ”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β”‚ category β”‚ region β”‚ amount β”‚ quantity β”‚ amount_with_tax β”‚ high_value β”‚
β”œβ”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 0 β”‚ Food β”‚ South β”‚ 516.653322 β”‚ 40 β”‚ 568.318654 β”‚ True β”‚
β”‚ 1 β”‚ Books β”‚ East β”‚ 937.337226 β”‚ 45 β”‚ 1031.070948 β”‚ True β”‚
β”‚ 2 β”‚ Electronics β”‚ North β”‚ 450.941022 β”‚ 93 β”‚ 496.035124 β”‚ False β”‚
β”‚ 3 β”‚ Food β”‚ East β”‚ 674.488081 β”‚ 46 β”‚ 741.936889 β”‚ True β”‚
β”‚ 4 β”‚ Food β”‚ East β”‚ 188.847906 β”‚ 98 β”‚ 207.732697 β”‚ False β”‚
β””β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Conditional Logic

pandas:
Each additional condition in np.where() adds another nesting level. With three tiers, the final value is buried two levelsΒ deep:

import numpy as np
# Hard to read: "low" is nested inside two np.where() calls
result_np = df_pd.assign(
value_tier=np.where(
df_pd["amount"] > 700, "high",
np.where(df_pd["amount"] > 300, "medium", "low")
)
)
result_np[["category", "amount", "value_tier"]].head()
β”Œβ”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β”‚ category β”‚ amount β”‚ value_tier β”‚
β”œβ”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 0 β”‚ Food β”‚ 516.653322 β”‚ medium β”‚
β”‚ 1 β”‚ Books β”‚ 937.337226 β”‚ high β”‚
β”‚ 2 β”‚ Electronics β”‚ 450.941022 β”‚ medium β”‚
β”‚ 3 β”‚ Food β”‚ 674.488081 β”‚ medium β”‚
β”‚ 4 β”‚ Food β”‚ 188.847906 β”‚ low β”‚
β””β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

For numeric binning, pd.cut() isΒ cleaner:

result_pd = df_pd.assign(
value_tier=pd.cut(
df_pd["amount"],
bins=[-np.inf, 300, 700, np.inf],
labels=["low", "medium", "high"]
)
)
result_pd[["category", "amount", "value_tier"]].head()
β”Œβ”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β”‚ category β”‚ amount β”‚ value_tier β”‚
β”œβ”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 0 β”‚ Food β”‚ 516.653322 β”‚ medium β”‚
β”‚ 1 β”‚ Books β”‚ 937.337226 β”‚ high β”‚
β”‚ 2 β”‚ Electronics β”‚ 450.941022 β”‚ medium β”‚
β”‚ 3 β”‚ Food β”‚ 674.488081 β”‚ medium β”‚
β”‚ 4 β”‚ Food β”‚ 188.847906 β”‚ low β”‚
β””β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

However, pd.cut() has drawbacks:

  • Only works for numericΒ ranges
  • Requires thinking in boundaries ([-inf, 300, 700, inf]) instead of conditions (amount >Β 700)
  • Needs numpy for open-ended bins

For non-numeric or mixed conditions, you’re back to np.where():

# "premium" if Electronics AND amount > 500 - pd.cut() can't do this
result = df_pd.assign(
tier=np.where(
(df_pd["category"] == "Electronics") & (df_pd["amount"] > 500),
"premium", "standard"
)
)
result.head()
β”Œβ”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β”‚ category β”‚ region β”‚ amount β”‚ quantity β”‚ tier β”‚
β”œβ”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 0 β”‚ Food β”‚ South β”‚ 516.653322 β”‚ 40 β”‚ standard β”‚
β”‚ 1 β”‚ Books β”‚ East β”‚ 937.337226 β”‚ 45 β”‚ standard β”‚
β”‚ 2 β”‚ Electronics β”‚ North β”‚ 450.941022 β”‚ 93 β”‚ standard β”‚
β”‚ 3 β”‚ Food β”‚ East β”‚ 674.488081 β”‚ 46 β”‚ standard β”‚
β”‚ 4 β”‚ Food β”‚ East β”‚ 188.847906 β”‚ 98 β”‚ standard β”‚
β””β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Polars:
The when().then().otherwise() chain solves both pandas problems: no nesting like np.where(), and works for any condition (not just numeric ranges like pd.cut()). The same syntax handles simple binning and complex mixed conditions:

result_pl = df_pl.with_columns(
pl.when(pl.col("amount") > 700).then(pl.lit("high"))
.when(pl.col("amount") > 300).then(pl.lit("medium"))
.otherwise(pl.lit("low"))
.alias("value_tier")
)
result_pl.select(["category", "amount", "value_tier"]).head()
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ category β”‚ amount β”‚ value_tier β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ str β”‚ f64 β”‚ str β”‚
β”‚ β€œFood” β”‚ 516.653322 β”‚ β€œmedium” β”‚
β”‚ β€œBooks” β”‚ 937.337226 β”‚ β€œhigh” β”‚
β”‚ β€œElectronics” β”‚ 450.941022 β”‚ β€œmedium” β”‚
β”‚ β€œFood” β”‚ 674.488081 β”‚ β€œmedium” β”‚
β”‚ β€œFood” β”‚ 188.847906 β”‚ β€œlow” β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

DuckDB:
Standard SQL CASE WHEN syntax is more readable by those who knowΒ SQL.

result_duckdb = duckdb.sql("""
SELECT category, amount,
CASE
WHEN amount > 700 THEN 'high'
WHEN amount > 300 THEN 'medium'
ELSE 'low'
END as value_tier
FROM df_pd
""").df()
result_duckdb.head()
β”Œβ”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β”‚ category β”‚ amount β”‚ value_tier β”‚
β”œβ”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 0 β”‚ Food β”‚ 516.653322 β”‚ medium β”‚
β”‚ 1 β”‚ Books β”‚ 937.337226 β”‚ high β”‚
β”‚ 2 β”‚ Electronics β”‚ 450.941022 β”‚ medium β”‚
β”‚ 3 β”‚ Food β”‚ 674.488081 β”‚ medium β”‚
β”‚ 4 β”‚ Food β”‚ 188.847906 β”‚ low β”‚
β””β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Window Functions

pandas:
Uses groupby().transform() which requires repeating the groupby clause for each calculation.

result_pd = df_pd.assign(
category_avg=df_pd.groupby("category")["amount"].transform("mean"),
category_rank=df_pd.groupby("category")["amount"].rank(ascending=False)
)
result_pd[["category", "amount", "category_avg", "category_rank"]].head()
β”Œβ”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β”‚ category β”‚ amount β”‚ category_avg β”‚ category_rank β”‚
β”œβ”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 0 β”‚ Food β”‚ 516.653322 β”‚ 499.916417 β”‚ 604342.0 β”‚
β”‚ 1 β”‚ Books β”‚ 937.337226 β”‚ 499.998897 β”‚ 78423.0 β”‚
β”‚ 2 β”‚ Electronics β”‚ 450.941022 β”‚ 499.938189 β”‚ 685881.0 β”‚
β”‚ 3 β”‚ Food β”‚ 674.488081 β”‚ 499.916417 β”‚ 407088.0 β”‚
β”‚ 4 β”‚ Food β”‚ 188.847906 β”‚ 499.916417 β”‚ 1015211.0 β”‚
β””β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Polars:
The over() expression appends the partition to any calculation, avoiding repeated group definitions.

result_pl = df_pl.with_columns([
pl.col("amount").mean().over("category").alias("category_avg"),
pl.col("amount").rank(descending=True).over("category").alias("category_rank")
])
result_pl.select(["category", "amount", "category_avg", "category_rank"]).head()
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ category β”‚ amount β”‚ category_avg β”‚ category_rank β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ str β”‚ f64 β”‚ f64 β”‚ f64 β”‚
β”‚ β€œFood” β”‚ 516.653322 β”‚ 499.916417 β”‚ 604342.0 β”‚
β”‚ β€œBooks” β”‚ 937.337226 β”‚ 499.998897 β”‚ 78423.0 β”‚
β”‚ β€œElectronics” β”‚ 450.941022 β”‚ 499.938189 β”‚ 685881.0 β”‚
β”‚ β€œFood” β”‚ 674.488081 β”‚ 499.916417 β”‚ 407088.0 β”‚
β”‚ β€œFood” β”‚ 188.847906 β”‚ 499.916417 β”‚ 1015211.0 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

DuckDB:
SQL window functions with OVER (PARTITION BYΒ ...) are the industry standard for this type of calculation.

result_duckdb = duckdb.sql("""
SELECT category, amount,
AVG(amount) OVER (PARTITION BY category) as category_avg,
RANK() OVER (PARTITION BY category ORDER BY amount DESC) as category_rank
FROM df_pd
""").df()
result_duckdb.head()
β”Œβ”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ β”‚ category β”‚ amount β”‚ category_avg β”‚ category_rank β”‚
β”œβ”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 0 β”‚ Clothing β”‚ 513.807166 β”‚ 500.139837 β”‚ 608257 β”‚
β”‚ 1 β”‚ Clothing β”‚ 513.806596 β”‚ 500.139837 β”‚ 608258 β”‚
β”‚ 2 β”‚ Clothing β”‚ 513.806515 β”‚ 500.139837 β”‚ 608259 β”‚
β”‚ 3 β”‚ Clothing β”‚ 513.806063 β”‚ 500.139837 β”‚ 608260 β”‚
β”‚ 4 β”‚ Clothing β”‚ 513.806056 β”‚ 500.139837 β”‚ 608261 β”‚
β””β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Data Loading Performance

pandas reads CSV files on a single CPU core. Polars and DuckDB use multi-threaded execution, distributing the work across all available cores to read different parts of the file simultaneously.

pandas

Single-threaded CSV parsing loads data sequentially.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ CPU Core 1 β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ Chunk 1 β†’ Chunk 2 β†’ Chunk 3 β†’ ... β†’ End β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ CPU Core 2 [idle] β”‚
β”‚ CPU Core 3 [idle] β”‚
β”‚ CPU Core 4 [idle] β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
pandas_time = %timeit -o pd.read_csv("sales_data.csv")
1.05 s Β± 26.9 ms per loop (mean Β± std. dev. of 7 runs, 1 loop each)

Polars

Multi-threaded parsing distributes file reading across all available cores.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ CPU Core 1 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β”‚
β”‚ CPU Core 2 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β”‚
β”‚ CPU Core 3 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β”‚
β”‚ CPU Core 4 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
polars_time = %timeit -o pl.read_csv("sales_data.csv")
137 ms Β± 34 ms per loop (mean Β± std. dev. of 7 runs, 1 loop each)

DuckDB

Similar to Polars, file reading is distributed across all available cores.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ CPU Core 1 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β”‚
β”‚ CPU Core 2 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β”‚
β”‚ CPU Core 3 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β”‚
β”‚ CPU Core 4 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
duckdb_time = %timeit -o duckdb.sql("SELECT * FROM 'sales_data.csv'").df()
762 ms Β± 77.8 ms per loop (mean Β± std. dev. of 7 runs, 1 loop each)
print(f"Polars is {pandas_time.average / polars_time.average:.1f}Γ— faster than pandas")
print(f"DuckDB is {pandas_time.average / duckdb_time.average:.1f}Γ— faster than pandas")
Polars is 7.7Γ— faster than pandas
DuckDB is 1.4Γ— faster than pandas

While Polars leads with a 7.7Γ— speedup in CSV reading, DuckDB’s 1.4Γ— improvement shows parsing isn’t its focus. DuckDB shines when querying files directly or running complex analytical queries.

Query Optimization

pandas: No Optimization

pandas executes operations eagerly, creating intermediate DataFrames at each step. This wastes memory and prevents optimization.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Step 1: Load ALL rows β†’ 10M rows in memory β”‚
β”‚ Step 2: Filter (amount > 100) β†’ 5M rows in memory β”‚
β”‚ Step 3: GroupBy β†’ New DataFrame β”‚
β”‚ Step 4: Mean β†’ Final result β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Memory: β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ (high - stores all intermediates)
def pandas_query():
return (
pd.read_csv("sales_data.csv")
.query('amount > 100')
.groupby('category')['amount']
.mean()
)
pandas_opt_time = %timeit -o pandas_query()
1.46 s Β± 88.9 ms per loop (mean Β± std. dev. of 7 runs, 1 loop each)

This approach has three problems:

  • Full CSV load: All rows are read before filtering
  • No predicate pushdown: Rows are filtered after loading the entire file intoΒ memory
  • No projection pushdown: All columns are loaded, even unusedΒ ones

You can manually add usecols to load fewerΒ columns:

def pandas_query_optimized():
return (
pd.read_csv("sales_data.csv", usecols=["category", "amount"])
.query('amount > 100')
.groupby('category')['amount']
.mean()
)
pandas_usecols_time = %timeit -o pandas_query_optimized()
1.06 s Β± 48.2 ms per loop (mean Β± std. dev. of 7 runs, 1 loop each)

This is faster, but has two drawbacks:

  • Manual tracking: You must specify columns yourself; change the query, updateΒ usecols
  • No row filtering: All rows still load before the filterΒ applies

Polars and DuckDB handle both automatically by analyzing your query before execution.

Polars: Lazy Evaluation

Polars supports lazy evaluation, which builds a query plan and optimizes it before execution:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Query Plan Built: β”‚
β”‚ scan_csv β†’ filter β†’ group_by β†’ agg β”‚
β”‚ β”‚
β”‚ Optimizations Applied: β”‚
β”‚ β€’ Predicate pushdown (filter during scan) β”‚
β”‚ β€’ Projection pushdown (read only needed columns) β”‚
β”‚ β€’ Multi-threaded execution (parallel across CPU cores) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Memory: β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ (low - no intermediate DataFrames)
query_pl = (
pl.scan_csv("sales_data.csv")
.filter(pl.col("amount") > 100)
.group_by("category")
.agg(pl.col("amount").mean().alias("avg_amount"))
)
# View the optimized query plan
print(query_pl.explain())
AGGREGATE[maintain_order: false]
[col("amount").mean().alias("avg_amount")] BY [col("category")]
FROM
Csv SCAN [sales_data.csv] [id: 4687118704]
PROJECT 2/4 COLUMNS
SELECTION: [(col("amount")) > (100.0)]

The query plan shows these optimizations:

  • Predicate pushdown: SELECTION filters during scan, not afterΒ loading
  • Projection pushdown: PROJECT 2/4 COLUMNS reads only what’sΒ needed
  • Operation reordering: Aggregate runs on filtered data, not the fullΒ dataset

Execute the optimized query:

def polars_query():
return (
pl.scan_csv("sales_data.csv")
.filter(pl.col("amount") > 100)
.group_by("category")
.agg(pl.col("amount").mean().alias("avg_amount"))
.collect()
)
polars_opt_time = %timeit -o polars_query()
148 ms Β± 32.3 ms per loop (mean Β± std. dev. of 7 runs, 1 loop each)

DuckDB: SQL Optimizer

DuckDB’s SQL optimizer applies similar optimizations automatically:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Query Plan Built: β”‚
β”‚ SQL β†’ Parser β†’ Optimizer β†’ Execution Plan β”‚
β”‚ β”‚
β”‚ Optimizations Applied: β”‚
β”‚ β€’ Predicate pushdown (WHERE during scan) β”‚
β”‚ β€’ Projection pushdown (SELECT only needed columns) β”‚
β”‚ β€’ Vectorized execution (process 1024 rows per batch) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Memory: β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ (low - streaming execution)
def duckdb_query():
return duckdb.sql("""
SELECT category, AVG(amount) as avg_amount
FROM 'sales_data.csv'
WHERE amount > 100
GROUP BY category
""").df()
duckdb_opt_time = %timeit -o duckdb_query()
245 ms Β± 12.1 ms per loop (mean Β± std. dev. of 7 runs, 1 loop each)

Let’s compare the performance of the optimized queries:

print(f"Polars is {pandas_opt_time.average / polars_opt_time.average:.1f}Γ— faster than pandas")
print(f"DuckDB is {pandas_opt_time.average / duckdb_opt_time.average:.1f}Γ— faster than pandas")
Polars is 9.9Γ— faster than pandas
DuckDB is 6.0Γ— faster than pandas

Polars outperforms DuckDB (9.9Γ— vs 6.0Γ—) in this benchmark because its Rust-based engine handles the filter-then-aggregate pattern efficiently. DuckDB’s strength lies in complex SQL queries with joins and subqueries.

GroupBy Performance

Computing aggregates requires scanning every row, a workload that scales linearly with CPU cores. This makes groupby operations the clearest test of parallel execution.

Let’s load the data for the groupby benchmarks:

# Load data for fair comparison
df_pd = pd.read_csv("sales_data.csv")
df_pl = pl.read_csv("sales_data.csv")

pandas: Single-Threaded

pandas processes groupby operations on a single CPU core, which becomes a bottleneck on large datasets.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ CPU Core 1 β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ Group A β†’ Group B β†’ Group C β†’ Group D β†’ ... β†’ Aggregate β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ CPU Core 2 [idle] β”‚
β”‚ CPU Core 3 [idle] β”‚
β”‚ CPU Core 4 [idle] β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
def pandas_groupby():
return df_pd.groupby("category")["amount"].mean()
pandas_groupby_time = %timeit -o pandas_groupby()
271 ms Β± 135 ms per loop (mean Β± std. dev. of 7 runs, 1 loop each)

Polars: Multi-Threaded

Polars splits data across cores, computes partial aggregates in parallel, then merges theΒ results.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ CPU Core 1 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β†’ Partial Aggregate β”‚
β”‚ CPU Core 2 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β†’ Partial Aggregate β”‚
β”‚ CPU Core 3 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β†’ Partial Aggregate β”‚
β”‚ CPU Core 4 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β†’ Partial Aggregate β”‚
β”‚ ↓ β”‚
β”‚ Final Merge β†’ Result β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
def polars_groupby():
return df_pl.group_by("category").agg(pl.col("amount").mean())
polars_groupby_time = %timeit -o polars_groupby()
31.1 ms Β± 3.65 ms per loop (mean Β± std. dev. of 7 runs, 10 loops each)

DuckDB: Multi-Threaded

Similar to Polars, DuckDB splits data across cores, computes partial aggregates in parallel, then merges theΒ results.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ CPU Core 1 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β†’ Partial Aggregate β”‚
β”‚ CPU Core 2 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β†’ Partial Aggregate β”‚
β”‚ CPU Core 3 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β†’ Partial Aggregate β”‚
β”‚ CPU Core 4 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β†’ Partial Aggregate β”‚
β”‚ ↓ β”‚
β”‚ Final Merge β†’ Result β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
def duckdb_groupby():
return duckdb.sql("""
SELECT category, AVG(amount)
FROM df_pd
GROUP BY category
""").df()
duckdb_groupby_time = %timeit -o duckdb_groupby()
29 ms Β± 3.33 ms per loop (mean Β± std. dev. of 7 runs, 10 loops each)
print(f"Polars is {pandas_groupby_time.average / polars_groupby_time.average:.1f}Γ— faster than pandas")
print(f"DuckDB is {pandas_groupby_time.average / duckdb_groupby_time.average:.1f}Γ— faster than pandas")
Polars is 8.7Γ— faster than pandas
DuckDB is 9.4Γ— faster than pandas

DuckDB and Polars perform similarly (9.4Γ— vs 8.7Γ—), both leveraging parallel execution. DuckDB’s slight edge comes from late materialization and vector-at-a-time pipelined execution, which avoids creating intermediate results that Polars may still materialize for some operations.

Memory Efficiency

pandas: Full MemoryΒ Load

pandas loads the entire dataset intoΒ RAM:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ RAM β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ”‚ β”‚
β”‚ β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ ALL 10M ROWS β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ”‚ β”‚
β”‚ β”‚β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ Usage: 707,495 KB (entire dataset in memory) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
df_pd_mem = pd.read_csv("sales_data.csv")
pandas_mem = df_pd_mem.memory_usage(deep=True).sum() / 1e3
print(f"pandas memory usage: {pandas_mem:,.0f} KB")
pandas memory usage: 707,495 KB

For larger-than-RAM datasets, pandas throws an out-of-memory error.

Polars: Streaming Mode

Polars can process data in streaming mode, handling chunks without loading everything:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ RAM β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚β–ˆ β”‚ β”‚
β”‚ β”‚ (result only) β”‚ β”‚
β”‚ β”‚ β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ Usage: 0.06 KB (streams chunks, keeps only result) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
result_pl_stream = (
pl.scan_csv("sales_data.csv")
.group_by("category")
.agg(pl.col("amount").mean())
.collect(streaming=True)
)
polars_mem = result_pl_stream.estimated_size() / 1e3
print(f"Polars result memory: {polars_mem:.2f} KB")
Polars result memory: 0.06 KB

For larger-than-RAM files, use sink_parquet instead of collect(). It writes results directly to disk as chunks are processed, never holding the full dataset inΒ memory:

(
pl.scan_csv("sales_data.csv")
.filter(pl.col("amount") > 500)
.sink_parquet("filtered_sales.parquet")
)

DuckDB: Automatic Spill-to-Disk

DuckDB automatically writes intermediate results to temporary files when data exceeds available RAM:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ RAM Disk (if needed) β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚β–ˆ β”‚ β”‚β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘β”‚ β”‚
β”‚ β”‚ (up to 500MB) β”‚ β†’ β”‚ (overflow here) β”‚ β”‚
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ Usage: 0.42 KB (spills to disk when RAM full) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
# Configure memory limit and temp directory
duckdb.sql("SET memory_limit = '500MB'")
duckdb.sql("SET temp_directory = '/tmp/duckdb_temp'")
# DuckDB handles larger-than-RAM automatically
result_duckdb_mem = duckdb.sql("""
SELECT category, AVG(amount) as avg_amount
FROM 'sales_data.csv'
GROUP BY category
""").df()
duckdb_mem = result_duckdb_mem.memory_usage(deep=True).sum() / 1e3
print(f"DuckDB result memory: {duckdb_mem:.2f} KB")
DuckDB result memory: 0.42 KB

DuckDB’s out-of-core processing makes it ideal for embedded analytics where memory isΒ limited.

print(f"pandas: {pandas_mem:,.0f} KB (full dataset)")
print(f"Polars: {polars_mem:.2f} KB (result only)")
print(f"DuckDB: {duckdb_mem:.2f} KB (result only)")
print(f"\nPolars uses {pandas_mem / polars_mem:,.0f}Γ— less memory than pandas")
print(f"DuckDB uses {pandas_mem / duckdb_mem:,.0f}Γ— less memory than pandas")
pandas: 707,495 KB (full dataset)
Polars: 0.06 KB (result only)
DuckDB: 0.42 KB (result only)
Polars uses 11,791,583Γ— less memory than pandas
DuckDB uses 1,684,512Γ— less memory than pandas

The million-fold reduction comes from streaming: Polars and DuckDB process data in chunks and only keep the 4-row result in memory, while pandas must hold all 10 million rows to compute the same aggregation.

Join Operations

Joining tables is one of the most common operations in data analysis. Let’s compare how each tool handles a left join between 1 million orders and 100K customers.

Let’s create two tables for join benchmarking:

# Create orders table (1M rows)
orders_pd = pd.DataFrame({
"order_id": range(1_000_000),
"customer_id": np.random.randint(1, 100_000, size=1_000_000),
"amount": np.random.rand(1_000_000) * 500
})
# Create customers table (100K rows)
customers_pd = pd.DataFrame({
"customer_id": range(100_000),
"region": np.random.choice(["North", "South", "East", "West"], size=100_000)
})
# Convert to Polars
orders_pl = pl.from_pandas(orders_pd)
customers_pl = pl.from_pandas(customers_pd)

pandas: Single-Threaded

pandas processes the join on a single CPUΒ core.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ CPU Core 1 β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ Row 1 β†’ Row 2 β†’ Row 3 β†’ ... β†’ Row 1M β”‚ β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚ CPU Core 2 [idle] β”‚
β”‚ CPU Core 3 [idle] β”‚
β”‚ CPU Core 4 [idle] β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
def pandas_join():
return orders_pd.merge(customers_pd, on="customer_id", how="left")
pandas_join_time = %timeit -o pandas_join()
60.4 ms Β± 6.98 ms per loop (mean Β± std. dev. of 7 runs, 10 loops each)

Polars: Multi-Threaded

Polars distributes the join across all available CPUΒ cores.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ CPU Core 1 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β”‚
β”‚ CPU Core 2 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β”‚
β”‚ CPU Core 3 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β”‚
β”‚ CPU Core 4 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
def polars_join():
return orders_pl.join(customers_pl, on="customer_id", how="left")
polars_join_time = %timeit -o polars_join()
11.8 ms Β± 6.42 ms per loop (mean Β± std. dev. of 7 runs, 10 loops each)

DuckDB: Multi-Threaded

Similar to Polars, DuckDB distributes the join across all available CPUΒ cores.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ CPU Core 1 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β”‚
β”‚ CPU Core 2 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β”‚
β”‚ CPU Core 3 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β”‚
β”‚ CPU Core 4 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚ β”‚ β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ β”‚ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
def duckdb_join():
return duckdb.sql("""
SELECT o.*, c.region
FROM orders_pd o
LEFT JOIN customers_pd c ON o.customer_id = c.customer_id
""").df()
duckdb_join_time = %timeit -o duckdb_join()
55.7 ms Β± 1.14 ms per loop (mean Β± std. dev. of 7 runs, 10 loops each)

Let’s compare the performance of theΒ joins:

print(f"Polars is {pandas_join_time.average / polars_join_time.average:.1f}Γ— faster than pandas")
print(f"DuckDB is {pandas_join_time.average / duckdb_join_time.average:.1f}Γ— faster than pandas")
Polars is 5.1Γ— faster than pandas
DuckDB is 1.1Γ— faster than pandas

Polars delivers a 5.1Γ— speedup while DuckDB shows only 1.1Γ— improvement. Both tools use multi-threading, but Polars’ join algorithm and native DataFrame output avoid the conversion overhead that DuckDB incurs when returning results viaΒ .df().

Interoperability

All three tools work together seamlessly. Use each tool for what it does best in a single pipeline.

pandas DataFrame toΒ DuckDB

Query pandas DataFrames directly withΒ SQL:

df = pd.DataFrame({
"product": ["A", "B", "C"],
"sales": [100, 200, 150]
})
# DuckDB queries pandas DataFrames by variable name
result = duckdb.sql("SELECT * FROM df WHERE sales > 120").df()
print(result)
product  sales
0 B 200
1 C 150

Polars toΒ pandas

Convert Polars DataFrames when ML libraries requireΒ pandas:

df_polars = pl.DataFrame({
"feature1": [1, 2, 3],
"feature2": [4, 5, 6],
"target": [0, 1, 0]
})
# Convert to pandas for scikit-learn
df_pandas = df_polars.to_pandas()
print(type(df_pandas))
<class 'pandas.core.frame.DataFrame'>

DuckDB toΒ Polars

Get query results as Polars DataFrames:

result = duckdb.sql("""
SELECT category, SUM(amount) as total
FROM 'sales_data.csv'
GROUP BY category
""").pl()
print(type(result))
print(result)
<class 'polars.dataframe.frame.DataFrame'>
shape: (4, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ category ┆ total β”‚
β”‚ --- ┆ --- β”‚
β”‚ str ┆ f64 β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════║
β”‚ Electronics ┆ 6.2445e8 β”‚
β”‚ Food ┆ 6.2540e8 β”‚
β”‚ Clothing ┆ 6.2539e8 β”‚
β”‚ Books ┆ 6.2475e8 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Combined PipelineΒ Example

Each tool has a distinct strength: DuckDB optimizes SQL queries, Polars parallelizes transformations, and pandas integrates with ML libraries. Combine them in a single pipeline to leverage allΒ three:

# Step 1: DuckDB for initial SQL query
aggregated = duckdb.sql("""
SELECT category, region,
SUM(amount) as total_amount,
COUNT(*) as order_count
FROM 'sales_data.csv'
GROUP BY category, region
""").pl()
# Step 2: Polars for additional transformations
enriched = (
aggregated
.with_columns([
(pl.col("total_amount") / pl.col("order_count")).alias("avg_order_value"),
pl.col("category").str.to_uppercase().alias("category_upper")
])
.filter(pl.col("order_count") > 100000)
)
# Step 3: Convert to pandas for visualization or ML
final_df = enriched.to_pandas()
print(final_df.head())
category region  total_amount  order_count  avg_order_value category_upper
0 Food East 1.563586e+08 312918 499.679004 FOOD
1 Food North 1.563859e+08 312637 500.215456 FOOD
2 Clothing North 1.560532e+08 311891 500.345286 CLOTHING
3 Clothing East 1.565054e+08 312832 500.285907 CLOTHING
4 Food West 1.560994e+08 312662 499.259318 FOOD
πŸ“– Related: For writing functions that work across pandas, Polars, and PySpark without conversion, see Unified DataFrame Functions.

Decision Matrix

No single tool wins in every scenario. Use these tables to choose the right tool for your workflow.

Performance Summary

Benchmark results from 10 million rows on a singleΒ machine:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Operation β”‚ pandas β”‚ Polars β”‚ DuckDB β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ CSV Read (10M rows) β”‚ 1.05s β”‚ 137ms β”‚ 762ms β”‚
β”‚ GroupBy β”‚ 271ms β”‚ 31ms β”‚ 29ms β”‚
β”‚ Join (1M rows) β”‚ 60ms β”‚ 12ms β”‚ 56ms β”‚
β”‚ Memory Usage β”‚ 707 MB β”‚ 0.06 KB (streaming) β”‚ 0.42 KB (spill-to-disk) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Polars leads in CSV reading (7.7Γ— faster than pandas) and joins (5Γ— faster). DuckDB matches Polars in groupby performance and uses the least memory with automatic spill-to-disk.

Feature Comparison

Each tool makes different trade-offs between speed, memory, and ecosystem integration:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Feature β”‚ pandas β”‚ Polars β”‚ DuckDB β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Multi-threading β”‚ No β”‚ Yes β”‚ Yes β”‚
β”‚ Lazy evaluation β”‚ No β”‚ Yes β”‚ N/A (SQL) β”‚
β”‚ Query optimization β”‚ No β”‚ Yes β”‚ Yes β”‚
β”‚ Larger-than-RAM β”‚ No β”‚ Streaming β”‚ Spill-to-disk β”‚
β”‚ SQL interface β”‚ No β”‚ Limited β”‚ Native β”‚
β”‚ ML integration β”‚ Excellent β”‚ Good β”‚ Limited β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

pandas lacks the performance features that make Polars and DuckDB fast, but remains essential for ML workflows. Choose between Polars and DuckDB based on whether you prefer DataFrame chaining or SQLΒ syntax.

Recommendations

The best tool depends on your data size, workflow preferences, and constraints:

  • Small data (<1M rows): Use pandas for simplicity
  • Large data (1M-100M rows): Use Polars or DuckDB for 5–10Γ—Β speedup
  • SQL-preferred workflow: UseΒ DuckDB
  • DataFrame-preferred workflow: UseΒ Polars
  • Memory-constrained: Use Polars (streaming) or DuckDB (spill-to-disk)
  • ML pipeline integration: Use pandas (convert from Polars/DuckDB asΒ needed)
  • Production data pipelines: Use Polars (DataFrame) or DuckDB (SQL) based on team preference

Final Thoughts

If your code is all written in pandas, you don’t need to rewrite it all. You can migrate where itΒ matters:

  • Profile first: Find which pandas operations areΒ slow
  • Replace with Polars: CSV reads, groupbys, and joins see the biggestΒ gains
  • Add DuckDB: When SQL is cleaner than chained DataFrame operations

Keep pandas for final ML steps. Convert with df.to_pandas() whenΒ needed.

This article was originally published on CodeCut at https://codecut.ai/pandas-vs-polars-vs-duckdb-comparison/


pandas vs Polars vs DuckDB: A Data Scientist’s Guide to Choosing the Right Tool was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top