Understanding Parquet Scans: How Readers Skip Work and Stay Fast

Parquet accelerates scans by skipping data through metadata driven pushdowns. This article explains how the main mechanisms work in practice.

Hitarth Trivedi

Jan. 06, 26 · Analysis

Likes (1)

Comment

Save

2.5K Views

Parquet is a columnar file format designed for efficient data storage and retrieval. On disk, it is organized around row groups, column chunks, and pages. Along with that, each file also has a footer that describes how everything fits together. A Parquet reader that understands this layout can avoid a lot of work during the scan, such as skipping entire row groups, column chunks, and pages, and decoding only the values that matter.

This article uses a single sample Parquet file to explain exactly what happens during column reads and some common optimization techniques.

How Parquet Is Laid Out

The formal description of the Parquet file format is described here. In short, a Parquet file is divided into row groups. Each row group holds a horizontal slice of rows. Within each row group, each column is stored in its own column chunk. Each column chunk is made of pages, which can be a dictionary page and one or more data pages. At the end of the file, the footer stores the schema and metadata, including the location and statistics for each column chunk.

Here is a useful mental model:

    Plain Text
   
 

   Parquet file
└── Row group 0
│ ├── Column chunk: event_id
│ │ ├── [dictionary page?]
│ │ └── [data pages...]
│ ├── Column chunk: user.id
│ ├── Column chunk: user.country
│ ├── Column chunk: user.device.model
│ ├── Column chunk: user.device.version
│ └── Column chunk: score
└── Row group 1
└── ...
[footer with schema + statistics + offsets]
  

The footer information enables the Parquet reader to skip and not scan the entire file.

Creating the Sample File

The file used in all examples below can be created using:

    Python
   
 

   import pyarrow as pa
import pyarrow.parquet as pq
import random

schema = pa.schema([
    ("event_id", pa.int64()),
    ("user", pa.struct([
        ("id", pa.int64()),
        ("country", pa.string()),
        ("device", pa.struct([
            ("model", pa.string()),
            ("version", pa.int64())
        ]))
    ])),
    ("score", pa.int32())
])

countries = ["US", "IN", "CA"]
models = ["iPhone", "Samsung", "Pixel"]

rows = []
num_rows = 10000

for i in range(1, num_rows + 1):
    if i % 4 == 3: 
        rows.append({
            "event_id": i,
            "user": None,
            "score": random.randint(0, 100),
        })
    else:
        rows.append({
            "event_id": i,
            "user": {
                "id": 100 + (i % 10),
                "country": random.choice(countries),
                "device": {
                    "model": random.choice(models),
                    "version": random.randint(1, 20),
                },
            },
            "score": random.randint(0, 100),
        })

table = pa.Table.from_pylist(rows, schema=schema)

pq.write_table(
    table,
    "sample_dict_big.parquet",
    compression="SNAPPY",
    use_dictionary=True,
    row_group_size=4096,
)
  

Sample Schema and On-Disk Layout

Once the file is created, you can check out the schema of the file using the parquet-cli tool:

     Shell
    
 

    (venv) hitarth@hitarth ~ % parquet meta sample_dict_big.parquet
File path:  sample_dict_big.parquet
Created by: parquet-cpp-arrow version 22.0.0
Properties:
  ARROW:schema:
Schema:
message schema {
  optional int64 event_id;
  optional group user {
    optional int64 id;
    optional binary country (STRING);
    optional group device {
      optional binary model (STRING);
      optional int64 version;
    }
  }
  optional int32 score;
}
Row group 0:  count: 4096  7.58 B records  start: 4  total(compressed): 30.328 kB total(uncompressed):51.578 kB 
--------------------------------------------------------------------------------
                     type      encodings count     avg size   nulls   min / max
event_id             INT64     S _ R     4096      5.53 B     0       "1" / "4096"
user.id              INT64     S _ R     4096      0.08 B     1024    "100" / "109"
user.country         BINARY    S _ R     4096      0.22 B     1024    "CA" / "US"
user.device.model    BINARY    S _ R     4096      0.23 B     1024    "Pixel" / "iPhone"
user.device.version  INT64     S _ R     4096      0.53 B     1024    "1" / "20"
score                INT32     S _ R     4096      1.00 B     0       "0" / "100"
Row group 1:  count: 4096  7.58 B records  start: 31060  total(compressed): 30.327 kB total(uncompressed):51.578 kB 
--------------------------------------------------------------------------------
                     type      encodings count     avg size   nulls   min / max
event_id             INT64     S _ R     4096      5.53 B     0       "4097" / "8192"
user.id              INT64     S _ R     4096      0.08 B     1024    "100" / "109"
user.country         BINARY    S _ R     4096      0.22 B     1024    "CA" / "US"
user.device.model    BINARY    S _ R     4096      0.23 B     1024    "Pixel" / "iPhone"
user.device.version  INT64     S _ R     4096      0.53 B     1024    "1" / "20"
score                INT32     S _ R     4096      1.00 B     0       "0" / "100"
Row group 2:  count: 1808  7.81 B records  start: 62115  total(compressed): 13.795 kB total(uncompressed):23.179 kB 
--------------------------------------------------------------------------------
                     type      encodings count     avg size   nulls   min / max
event_id             INT64     S _ R     1808      5.44 B     0       "8193" / "10000"
user.id              INT64     S _ R     1808      0.13 B     452     "100" / "109"
user.country         BINARY    S _ R     1808      0.25 B     452     "CA" / "US"
user.device.model    BINARY    S _ R     1808      0.26 B     452     "Pixel" / "iPhone"
user.device.version  INT64     S _ R     1808      0.59 B     452     "1" / "20"
score                INT32     S _ R     1808      1.15 B     0       "0" / "100"
   

Sample rows:

     Shell
    
 

    (venv) hitarth@hitarth ~ % parquet head sample_dict_big.parquet 
{"event_id": 1, "user": {"id": 101, "country": "US", "device": {"model": "iPhone", "version": 3}}, "score": 17}
{"event_id": 2, "user": {"id": 102, "country": "CA", "device": {"model": "iPhone", "version": 2}}, "score": 10}
{"event_id": 3, "user": null, "score": 66}
{"event_id": 4, "user": {"id": 104, "country": "IN", "device": {"model": "Pixel", "version": 9}}, "score": 57}
{"event_id": 5, "user": {"id": 105, "country": "US", "device": {"model": "iPhone", "version": 1}}, "score": 14}
{"event_id": 6, "user": {"id": 106, "country": "IN", "device": {"model": "iPhone", "version": 18}}, "score": 96}
{"event_id": 7, "user": null, "score": 77}
{"event_id": 8, "user": {"id": 108, "country": "IN", "device": {"model": "Pixel", "version": 10}}, "score": 36}
{"event_id": 9, "user": {"id": 109, "country": "US", "device": {"model": "iPhone", "version": 10}}, "score": 28}
{"event_id": 10, "user": {"id": 100, "country": "US", "device": {"model": "Pixel", "version": 6}}, "score": 13}


   

Projection Pushdown

Projections pushdown is the simplest optimization. It means that when a query only needs some columns, the reader only fetches and decodes those columns.

Consider a query:

     SQL
    
    SELECT user.id  FROM t;

The reader starts by reading the footer to find the column chunks that correspond to user.id in each row group. It then issues I/O only for those ranges in the file. It decompresses only those column chunks and decodes only those pages. Column chinks for event_id, user.country, user.device.model, user.device.version, and score are not even read from the storage for this query.

Since Parquet stores each column in its own column chunk inside a row group, projection pushdown essentially means that the Parquet reader does not open the column chunks it does not need, which is driven entirely by the footer.

Predicate Pushdown

Predicate pushdown uses filter conditions to avoid reading row ranges that cannot possibly contain matches. The simplest version works at the row group level. The footer stores basic statistics per column and per row group, such as minimum and maximum values and null counts. For a query like below, the reader can examine the statistics in each row group before reading any data pages.

    SQL
   
   SELECT event_id, score
FROM t
WHERE score > 80;

If the maximum score in a row group is 60, then there is no need to read column chunks from that row group since all rows are guaranteed to fail the filter. For row groups where the maximum is greater than 80, the reader proceeds and reads the necessary column chunks.

This avoids I/O and decoding for entire row ranges. The reader determines this purely from metadata, and there is no need to decode any values to decide whether a row group can be pruned. Some advanced Parquet readers also use page indexes to make further decisions within a column chunk, applying the same idea at a page level.

Dictionary Filtering

Dictionary encoding adds another opportunity for skipping work. In a dictionary encoded column like user.country in our sample file, the values are represented in two layers: a dictionary page with all distinct values and data pages that store integer indices into that dictionary.

For equality filters, the reader can first look only at the dictionary.

    SQL
   
   SELECT score
FROM t
WHERE user.country = 'ZZ'

If a query asks for user.country = ‘ZZ’ and the dictionary contains only [“US”, “UN”, “CA”], then no data pages for that column chunk can contain a match. The reader can therefore avoid reading and decoding that column chunk entirely for this predicate.

This technique is very effective on low-cardinality columns because the dictionary is small and cheap to check.

Dereference Pushdown

The schema in our example has a nested user struct:

    Plain Text
   
   user.id
user.country
user.device.model
user.device.version

Parquet stores each leaf field as a separate column, with definition and repetition levels to reconstruct nesting. A naive reader might load all four users.* columns, even if the query only needs one of them.

Dereference pushdown prevents that. For a query like:

    SQL
   
   SELECT user.id  FROM t;

The Parquet reader resolves the projected paths and marks only user.id as needed. It loads column chunks for user.id, but not for user.country or user.device.*. During decoding, it uses the definition levels of user.id to decide which rows have a user struct at all, without paying the cost of reading other nested leaves.

This keeps nested scans efficient even when structs have many fields.

Vectorized Decoding

Once the reader has determined which row groups, column chunks, and pages to read, it must still decode values. Instead of decoding one value at a time, modern Parquet readers decode batches of values into contiguous memory. A batch decoder unpacks encodings, applies the definition levels, and writes values into columnar buffers with fewer branches and function calls compared to a naive loop.

This optimization doesn’t skip any data; it simply makes the remaining work faster. It becomes important because after the pushdown removes irrelevant data, the decode loop tends to consume most of the CPU time.

Closing Thoughts

Parquet’s structure gives readers enough information to avoid large amounts of unnecessary work, from skipping whole columns to bypassing row groups, pages, and unused nested fields. Once the reader knows what to keep, vectorized decoders handle the remaining pages efficiently.

Many other features exist in the Parquet format, such as page indexes, Bloom filters, column indexes, and I/O scheduling strategies, but they are outside the scope of this article. The goal here is to give a clear view of the core mechanisms that make typical analytical scans fast. Future pieces can explore the more specialized techniques in depth.

Data structure readers sql

Opinions expressed by DZone contributors are their own.

Related

Trending