Python • Pandas
Save DataFrames to Parquet, an efficient columnar format, with compression and partitioning. Use these ready-to-run snippets and best practices to keep files small and queryable.
The simplest way is via pandas DataFrame.to_parquet, which uses PyArrow by default when available.
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "country": ["US", "CA"]})

# Default: Snappy compression when pyarrow is installed
df.to_parquet("output.parquet", index=False)

# Control the engine explicitly
df.to_parquet("output.parquet", index=False, engine="pyarrow")
Tip: Set index=False to avoid persisting the pandas index unless you need it for joins.
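To see the difference, the short sketch below round-trips a DataFrame with a string index (the file names are illustrative): with the default setting the index is stored and restored on read, while index=False drops it and readers get a fresh RangeIndex.

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "country": ["US", "CA"]}, index=["a", "b"])

# Default: the string index is persisted and restored on read
df.to_parquet("with-index.parquet")
print(pd.read_parquet("with-index.parquet").index.tolist())  # ['a', 'b']

# index=False: nothing is stored, so readers see a fresh RangeIndex
df.to_parquet("no-index.parquet", index=False)
print(pd.read_parquet("no-index.parquet").index.tolist())  # [0, 1]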
Choose a codec that balances speed and size. Snappy is fast; Gzip yields smaller files at the cost of slower writes and reads; Brotli offers a good ratio at moderate speed.
import pandas as pd

df = pd.read_parquet("input.parquet")

# Snappy (default, balanced)
df.to_parquet("out-snappy.parquet", compression="snappy", index=False)

# Gzip (smaller, slower)
df.to_parquet("out-gzip.parquet", compression="gzip", index=False)

# Brotli (good ratio, moderate speed)
df.to_parquet("out-brotli.parquet", compression="brotli", index=False)
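If you want to measure the trade-off on your own data rather than rely on rules of thumb, a rough single-run sketch like this (assuming the same input.parquet as above) prints write time and file size per codec:

import os
import time

import pandas as pd

df = pd.read_parquet("input.parquet")

# One write per codec; timings are indicative, not a rigorous benchmark
for codec in ["snappy", "gzip", "brotli"]:
    path = f"out-{codec}.parquet"
    start = time.perf_counter()
    df.to_parquet(path, compression=codec, index=False)
    elapsed = time.perf_counter() - start
    size_mb = os.path.getsize(path) / 1e6
    print(f"{codec:>7}: wrote in {elapsed:.2f}s, {size_mb:.1f} MB on disk")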
Tip: Stay consistent across a dataset; mixing codecs within the same table can surprise downstream readers.
Partitioning creates directory-level splits (e.g., country=US/) that speed up reads when filters match the partition keys. Avoid tiny files; for analytics engines, target roughly 100–500 MB per file.
import pandas as pd

df = pd.read_parquet("input.parquet")

# Write to a directory with partitions
df.to_parquet(
    "s3://bucket/analytics/users/",
    index=False,
    partition_cols=["country", "year"],
)
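The payoff comes at read time: passing filters on the partition columns lets the PyArrow engine skip directories whose keys do not match. A minimal sketch, assuming the same bucket layout as above (reading s3:// paths from pandas requires s3fs) and an illustrative year cutoff:

import pandas as pd

# Only country=US/ subdirectories with matching years are scanned
us_users = pd.read_parquet(
    "s3://bucket/analytics/users/",
    filters=[("country", "==", "US"), ("year", ">=", 2023)],
)
print(us_users["country"].unique())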
Tip: After partitioned writes, check partition balance and consider repartition or coalesce in Spark to reduce small files.
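One quick way to spot imbalance before reaching for Spark is to walk the output and tally file counts and sizes per partition. A minimal sketch, assuming a hypothetical local copy of the dataset under analytics/users/:

from collections import defaultdict
from pathlib import Path

root = Path("analytics/users")  # hypothetical local copy of the partitioned output

# Group data files by their partition directory, e.g. country=US/year=2024
sizes = defaultdict(list)
for f in root.rglob("*.parquet"):
    partition = str(f.parent.relative_to(root))
    sizes[partition].append(f.stat().st_size)

for partition, file_sizes in sorted(sizes.items()):
    total_mb = sum(file_sizes) / 1e6
    print(f"{partition}: {len(file_sizes)} file(s), {total_mb:.1f} MB total")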