Python • Pandas
Save DataFrames to Parquet, an efficient columnar format, with compression and partitioning. Use these ready-to-run snippets and best practices to keep files small and queryable.
The simplest way is via pandas DataFrame.to_parquet, which uses PyArrow by default when available.
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "country": ["US", "CA"]})

# Default: Snappy compression when pyarrow is installed
df.to_parquet("output.parquet", index=False)

# Control the engine explicitly
df.to_parquet("output.parquet", index=False, engine="pyarrow")
Tip: Set index=False to avoid persisting the pandas index unless you need it for joins.
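To see the difference, the short sketch below round-trips a DataFrame with a string index (the file names are illustrative): with the default setting the index is stored and restored on read, while index=False drops it and readers get a fresh RangeIndex.

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "country": ["US", "CA"]}, index=["a", "b"])

# Default: the string index is persisted and restored on read
df.to_parquet("with-index.parquet")
print(pd.read_parquet("with-index.parquet").index.tolist())  # ['a', 'b']

# index=False: nothing is stored, so readers see a fresh RangeIndex
df.to_parquet("no-index.parquet", index=False)
print(pd.read_parquet("no-index.parquet").index.tolist())  # [0, 1]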
Choose a codec that balances speed and size. Snappy is fast; Gzip yields smaller files at the cost of slower writes and reads; Brotli offers a good ratio at moderate speed.
import pandas as pd

df = pd.read_parquet("input.parquet")

# Snappy (default, balanced)
df.to_parquet("out-snappy.parquet", compression="snappy", index=False)

# Gzip (smaller, slower)
df.to_parquet("out-gzip.parquet", compression="gzip", index=False)

# Brotli (good ratio, moderate speed)
df.to_parquet("out-brotli.parquet", compression="brotli", index=False)
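If you want to measure the trade-off on your own data rather than rely on rules of thumb, a rough single-run sketch like this (assuming the same input.parquet as above) prints write time and file size per codec:

import os
import time

import pandas as pd

df = pd.read_parquet("input.parquet")

# One write per codec; timings are indicative, not a rigorous benchmark
for codec in ["snappy", "gzip", "brotli"]:
    path = f"out-{codec}.parquet"
    start = time.perf_counter()
    df.to_parquet(path, compression=codec, index=False)
    elapsed = time.perf_counter() - start
    size_mb = os.path.getsize(path) / 1e6
    print(f"{codec:>7}: wrote in {elapsed:.2f}s, {size_mb:.1f} MB on disk")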
Tip: Stay consistent across a dataset; mixing codecs within the same table can surprise downstream readers.
Partitioning creates directory-level splits (e.g., country=US/) that speed up reads when filters match the partition keys. Avoid tiny files; for analytics engines, target roughly 100–500 MB per file.
import pandas as pd

df = pd.read_parquet("input.parquet")

# Write to a directory with partitions
df.to_parquet(
    "s3://bucket/analytics/users/",
    index=False,
    partition_cols=["country", "year"],
)
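The payoff comes at read time: passing filters on the partition columns lets the PyArrow engine skip directories whose keys do not match. A minimal sketch, assuming the same bucket layout as above (reading s3:// paths from pandas requires s3fs) and an illustrative year cutoff:

import pandas as pd

# Only country=US/ subdirectories with matching years are scanned
us_users = pd.read_parquet(
    "s3://bucket/analytics/users/",
    filters=[("country", "==", "US"), ("year", ">=", 2023)],
)
print(us_users["country"].unique())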
Tip: After partitioned writes, check partition balance and consider repartition or coalesce in Spark to reduce small files.
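One quick way to spot imbalance before reaching for Spark is to walk the output and tally file counts and sizes per partition. A minimal sketch, assuming a hypothetical local copy of the dataset under analytics/users/:

from collections import defaultdict
from pathlib import Path

root = Path("analytics/users")  # hypothetical local copy of the partitioned output

# Group data files by their partition directory, e.g. country=US/year=2024
sizes = defaultdict(list)
for f in root.rglob("*.parquet"):
    partition = str(f.parent.relative_to(root))
    sizes[partition].append(f.stat().st_size)

for partition, file_sizes in sorted(sizes.items()):
    total_mb = sum(file_sizes) / 1e6
    print(f"{partition}: {len(file_sizes)} file(s), {total_mb:.1f} MB total")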