Understanding Parquet Scans: How Readers Skip Work and Stay Fast
Parquet accelerates scans by skipping data through metadata driven pushdowns. This article explains how the main mechanisms work in practice.
Join the DZone community and get the full member experience.
Join For FreeParquet is a columnar file format designed for efficient data storage and retrieval. On disk, it is organized around row groups, column chunks, and pages. Along with that, each file also has a footer that describes how everything fits together. A Parquet reader that understands this layout can avoid a lot of work during the scan, such as skipping entire row groups, column chunks, and pages, and decoding only the values that matter.
This article uses a single sample Parquet file to explain exactly what happens during column reads and some common optimization techniques.
How Parquet Is Laid Out
The formal description of the Parquet file format is described here. In short, a Parquet file is divided into row groups. Each row group holds a horizontal slice of rows. Within each row group, each column is stored in its own column chunk. Each column chunk is made of pages, which can be a dictionary page and one or more data pages. At the end of the file, the footer stores the schema and metadata, including the location and statistics for each column chunk.
Here is a useful mental model:
Parquet file
└── Row group 0
│ ├── Column chunk: event_id
│ │ ├── [dictionary page?]
│ │ └── [data pages...]
│ ├── Column chunk: user.id
│ ├── Column chunk: user.country
│ ├── Column chunk: user.device.model
│ ├── Column chunk: user.device.version
│ └── Column chunk: score
└── Row group 1
└── ...
[footer with schema + statistics + offsets]
The footer information enables the Parquet reader to skip and not scan the entire file.
Creating the Sample File
The file used in all examples below can be created using:
import pyarrow as pa
import pyarrow.parquet as pq
import random
schema = pa.schema([
("event_id", pa.int64()),
("user", pa.struct([
("id", pa.int64()),
("country", pa.string()),
("device", pa.struct([
("model", pa.string()),
("version", pa.int64())
]))
])),
("score", pa.int32())
])
countries = ["US", "IN", "CA"]
models = ["iPhone", "Samsung", "Pixel"]
rows = []
num_rows = 10000
for i in range(1, num_rows + 1):
if i % 4 == 3:
rows.append({
"event_id": i,
"user": None,
"score": random.randint(0, 100),
})
else:
rows.append({
"event_id": i,
"user": {
"id": 100 + (i % 10),
"country": random.choice(countries),
"device": {
"model": random.choice(models),
"version": random.randint(1, 20),
},
},
"score": random.randint(0, 100),
})
table = pa.Table.from_pylist(rows, schema=schema)
pq.write_table(
table,
"sample_dict_big.parquet",
compression="SNAPPY",
use_dictionary=True,
row_group_size=4096,
)
Sample Schema and On-Disk Layout
Once the file is created, you can check out the schema of the file using the parquet-cli tool:
(venv) hitarth@hitarth ~ % parquet meta sample_dict_big.parquet
File path: sample_dict_big.parquet
Created by: parquet-cpp-arrow version 22.0.0
Properties:
ARROW:schema:
Schema:
message schema {
optional int64 event_id;
optional group user {
optional int64 id;
optional binary country (STRING);
optional group device {
optional binary model (STRING);
optional int64 version;
}
}
optional int32 score;
}
Row group 0: count: 4096 7.58 B records start: 4 total(compressed): 30.328 kB total(uncompressed):51.578 kB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
event_id INT64 S _ R 4096 5.53 B 0 "1" / "4096"
user.id INT64 S _ R 4096 0.08 B 1024 "100" / "109"
user.country BINARY S _ R 4096 0.22 B 1024 "CA" / "US"
user.device.model BINARY S _ R 4096 0.23 B 1024 "Pixel" / "iPhone"
user.device.version INT64 S _ R 4096 0.53 B 1024 "1" / "20"
score INT32 S _ R 4096 1.00 B 0 "0" / "100"
Row group 1: count: 4096 7.58 B records start: 31060 total(compressed): 30.327 kB total(uncompressed):51.578 kB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
event_id INT64 S _ R 4096 5.53 B 0 "4097" / "8192"
user.id INT64 S _ R 4096 0.08 B 1024 "100" / "109"
user.country BINARY S _ R 4096 0.22 B 1024 "CA" / "US"
user.device.model BINARY S _ R 4096 0.23 B 1024 "Pixel" / "iPhone"
user.device.version INT64 S _ R 4096 0.53 B 1024 "1" / "20"
score INT32 S _ R 4096 1.00 B 0 "0" / "100"
Row group 2: count: 1808 7.81 B records start: 62115 total(compressed): 13.795 kB total(uncompressed):23.179 kB
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
event_id INT64 S _ R 1808 5.44 B 0 "8193" / "10000"
user.id INT64 S _ R 1808 0.13 B 452 "100" / "109"
user.country BINARY S _ R 1808 0.25 B 452 "CA" / "US"
user.device.model BINARY S _ R 1808 0.26 B 452 "Pixel" / "iPhone"
user.device.version INT64 S _ R 1808 0.59 B 452 "1" / "20"
score INT32 S _ R 1808 1.15 B 0 "0" / "100"
Sample rows:
(venv) hitarth@hitarth ~ % parquet head sample_dict_big.parquet
{"event_id": 1, "user": {"id": 101, "country": "US", "device": {"model": "iPhone", "version": 3}}, "score": 17}
{"event_id": 2, "user": {"id": 102, "country": "CA", "device": {"model": "iPhone", "version": 2}}, "score": 10}
{"event_id": 3, "user": null, "score": 66}
{"event_id": 4, "user": {"id": 104, "country": "IN", "device": {"model": "Pixel", "version": 9}}, "score": 57}
{"event_id": 5, "user": {"id": 105, "country": "US", "device": {"model": "iPhone", "version": 1}}, "score": 14}
{"event_id": 6, "user": {"id": 106, "country": "IN", "device": {"model": "iPhone", "version": 18}}, "score": 96}
{"event_id": 7, "user": null, "score": 77}
{"event_id": 8, "user": {"id": 108, "country": "IN", "device": {"model": "Pixel", "version": 10}}, "score": 36}
{"event_id": 9, "user": {"id": 109, "country": "US", "device": {"model": "iPhone", "version": 10}}, "score": 28}
{"event_id": 10, "user": {"id": 100, "country": "US", "device": {"model": "Pixel", "version": 6}}, "score": 13}
Projection Pushdown
Projections pushdown is the simplest optimization. It means that when a query only needs some columns, the reader only fetches and decodes those columns.
Consider a query:
SELECT user.id FROM t;
The reader starts by reading the footer to find the column chunks that correspond to user.id in each row group. It then issues I/O only for those ranges in the file. It decompresses only those column chunks and decodes only those pages. Column chinks for event_id, user.country, user.device.model, user.device.version, and score are not even read from the storage for this query.
Since Parquet stores each column in its own column chunk inside a row group, projection pushdown essentially means that the Parquet reader does not open the column chunks it does not need, which is driven entirely by the footer.
Predicate Pushdown
Predicate pushdown uses filter conditions to avoid reading row ranges that cannot possibly contain matches. The simplest version works at the row group level. The footer stores basic statistics per column and per row group, such as minimum and maximum values and null counts. For a query like below, the reader can examine the statistics in each row group before reading any data pages.
SELECT event_id, score
FROM t
WHERE score > 80;
If the maximum score in a row group is 60, then there is no need to read column chunks from that row group since all rows are guaranteed to fail the filter. For row groups where the maximum is greater than 80, the reader proceeds and reads the necessary column chunks.
This avoids I/O and decoding for entire row ranges. The reader determines this purely from metadata, and there is no need to decode any values to decide whether a row group can be pruned. Some advanced Parquet readers also use page indexes to make further decisions within a column chunk, applying the same idea at a page level.
Dictionary Filtering
Dictionary encoding adds another opportunity for skipping work. In a dictionary encoded column like user.country in our sample file, the values are represented in two layers: a dictionary page with all distinct values and data pages that store integer indices into that dictionary.
For equality filters, the reader can first look only at the dictionary.
SELECT score
FROM t
WHERE user.country = 'ZZ'
If a query asks for user.country = ‘ZZ’ and the dictionary contains only [“US”, “UN”, “CA”], then no data pages for that column chunk can contain a match. The reader can therefore avoid reading and decoding that column chunk entirely for this predicate.
This technique is very effective on low-cardinality columns because the dictionary is small and cheap to check.
Dereference Pushdown
The schema in our example has a nested user struct:
user.id
user.country
user.device.model
user.device.version
Parquet stores each leaf field as a separate column, with definition and repetition levels to reconstruct nesting. A naive reader might load all four users.* columns, even if the query only needs one of them.
Dereference pushdown prevents that. For a query like:
SELECT user.id FROM t;
The Parquet reader resolves the projected paths and marks only user.id as needed. It loads column chunks for user.id, but not for user.country or user.device.*. During decoding, it uses the definition levels of user.id to decide which rows have a user struct at all, without paying the cost of reading other nested leaves.
This keeps nested scans efficient even when structs have many fields.
Vectorized Decoding
Once the reader has determined which row groups, column chunks, and pages to read, it must still decode values. Instead of decoding one value at a time, modern Parquet readers decode batches of values into contiguous memory. A batch decoder unpacks encodings, applies the definition levels, and writes values into columnar buffers with fewer branches and function calls compared to a naive loop.
This optimization doesn’t skip any data; it simply makes the remaining work faster. It becomes important because after the pushdown removes irrelevant data, the decode loop tends to consume most of the CPU time.
Closing Thoughts
Parquet’s structure gives readers enough information to avoid large amounts of unnecessary work, from skipping whole columns to bypassing row groups, pages, and unused nested fields. Once the reader knows what to keep, vectorized decoders handle the remaining pages efficiently.
Many other features exist in the Parquet format, such as page indexes, Bloom filters, column indexes, and I/O scheduling strategies, but they are outside the scope of this article. The goal here is to give a clear view of the core mechanisms that make typical analytical scans fast. Future pieces can explore the more specialized techniques in depth.
Opinions expressed by DZone contributors are their own.
Comments