What is Apache Parquet? A Beginner's Guide

Parquet Explorer
Tags: parquet · beginners · data-engineering

If you work with data — or you are just getting started — you have probably seen files ending in .parquet and wondered what they are. Apache Parquet has quietly become one of the most important file formats in data engineering, analytics, and machine learning. This guide explains what Parquet is, why it exists, and how to start working with it today.

A Brief History

Apache Parquet was born in 2013 out of a collaboration between Twitter and Cloudera, inspired by Google’s Dremel paper. The mission was simple: create an open-source, language-agnostic columnar storage format that could handle the scale and complexity of modern analytical workloads. It graduated to a top-level Apache project and is now the de facto standard for data lakes, analytics engines, and ML pipelines worldwide.

What Does “Columnar” Mean?

Traditional formats like CSV and JSON are row-oriented. Each line contains all the fields for a single record:

name,age,city
Alice,30,Berlin
Bob,25,Tokyo
Carol,42,Lima

When you read this file, you read every field of every row — even if you only need the age column.

Parquet flips this around. It is column-oriented, storing all values for a single column together on disk:

name:  [Alice, Bob, Carol]
age:   [30, 25, 42]
city:  [Berlin, Tokyo, Lima]

This simple architectural change has profound consequences for performance.

Why Columnar Storage Matters

1. Faster Analytical Queries

Most analytical queries only touch a subset of columns. A query like SELECT AVG(age) FROM users only needs the age column. In a columnar format the engine skips every other column entirely, reading far less data from disk or over the network.

2. Better Compression

Values in the same column tend to be similar — they share the same data type, often have limited cardinality, and frequently repeat. Parquet exploits this by applying encoding techniques (dictionary encoding, run-length encoding, delta encoding) within each column, then compresses the result with codecs like Snappy, Zstd, or Gzip. Compression ratios of 5x to 10x compared to raw CSV are common.

3. Rich Type System

CSV has no built-in types — everything is a string. Parquet embeds a full schema with typed columns: integers, floats, strings, booleans, dates, timestamps, decimals, and complex nested types like structs, lists, and maps. No more guessing whether "42" is a number or a string, and no more date-parsing surprises.

How a Parquet File is Organized

Understanding the internal layout helps you appreciate why Parquet is so efficient.

Row Groups

A Parquet file is divided into one or more row groups. Each row group contains a horizontal slice of the data — say, 1 million rows. Row groups let engines parallelize reads and limit memory consumption by processing one group at a time.

Column Chunks

Inside each row group, data is stored per column in a column chunk. A column chunk holds all the values for one column within that row group.

Pages

Each column chunk is further divided into pages (typically 1 MB). Pages are the smallest unit of I/O and encoding. There are data pages, dictionary pages, and index pages.

At the end of every Parquet file sits a footer containing the schema, row group locations, column statistics (min/max values, null counts), and encoding information. A reader fetches this footer first and then decides exactly which columns to read (projection pushdown) and which row groups it can skip outright because their min/max statistics cannot match the query's filter (predicate pushdown).

+------------------+
|   Row Group 0    |
|  col0 | col1 |.. |
+------------------+
|   Row Group 1    |
|  col0 | col1 |.. |
+------------------+
|      ...         |
+------------------+
|     Footer       |
+------------------+

When Should You Use Parquet?

Parquet shines in these scenarios:

  • Analytics and BI: Any time you run aggregations, filters, or joins over large datasets.
  • Data lakes: Parquet is the standard storage format for data lakes on S3, GCS, and Azure Blob Storage.
  • Machine learning: Feature stores and training datasets are frequently stored as Parquet because of its compact size and fast column reads.
  • Data exchange: Sharing datasets between teams that use different tools (Python, R, Spark, DuckDB) is painless because the schema travels with the data.

When CSV or JSON Might Be Better

Parquet is not the right choice for everything:

  • Human readability: You cannot open a Parquet file in a text editor. For quick inspection, CSV is simpler — unless you have a tool that makes Parquet just as accessible.
  • Streaming or append-heavy workloads: Parquet files are immutable; appending a single row means rewriting the file. Formats like JSON Lines or Avro may be more convenient for streaming ingestion.
  • Very small datasets: For a 50-row config file, the overhead of Parquet is not worth it.

Working with Parquet Files in Practice

Because Parquet is a binary format, you need the right tools. The good news is that the ecosystem has matured well beyond simple file viewers.

Parquet Explorer is a free, browser-based Parquet platform powered by DuckDB-WASM. It goes far beyond just viewing files — think of it as a Swiss Army knife for Parquet. You can:

  • Query with full SQL: Write DuckDB-compatible SQL queries against your data, complete with window functions, CTEs, and aggregations. Everything runs 100% in your browser — no server, no uploads.
  • Inspect schemas and metadata: Browse column types including nested structures (STRUCT, LIST, MAP) in a tree view, examine row group details, compression codecs, and column-level statistics like min/max values, null counts, and distinct counts.
  • Profile your data: Get per-column statistics, histograms, semantic type detection (the profiler automatically identifies emails, URLs, UUIDs, IP addresses, and phone numbers), and an overall data quality score.
  • Convert between formats: Drag in a CSV, TSV, JSON, or JSONL file and export it as Parquet with your choice of compression (Snappy, Zstd, Gzip). Or go the other direction — export Parquet data to CSV or JSON.
  • Create and edit Parquet files: Define a schema from scratch, enter data manually, or edit existing Parquet files with inline cell editing, adding or removing rows and columns as needed.
  • Export results: Download query results or converted files in CSV, JSON, or Parquet format.

Your data never leaves your machine. Everything runs client-side in your browser tab, available in both English and Spanish.

For terminal workflows, DuckDB CLI lets you run SQL queries against Parquet files from the command line. And Python with pandas or PyArrow remains a solid option if you are already in a notebook environment.

Key Takeaways

Feature                  CSV                  Parquet
Storage layout           Row-oriented         Column-oriented
Compression              Poor                 Excellent (5-10x)
Schema                   None (all strings)   Typed, embedded
Read speed (analytics)   Slow                 Fast
Human-readable           Yes                  No (needs tooling)
Ecosystem support        Universal            Broad and growing

Getting Started

The fastest way to start working with Parquet is to open parquetexplorer.com and drag in a file. Within seconds you can browse the schema, profile the data, run SQL queries, convert formats, or even create a new Parquet file from scratch. There is nothing to install and no account to create.

Apache Parquet is not just a trend — it is the backbone of modern data infrastructure. Whether you are building pipelines, training models, or sharing a dataset with a colleague, understanding Parquet gives you a real advantage. Now that you know the basics, try opening one of your own files and see the difference for yourself.