Data Engineering: Strategies for Data Retrieval on Multi-Dimensional Data
A comparative look at partitioning, indexing, clustering, and ordering techniques to match retrieval strategies with real-world query needs.
Join the DZone community and get the full member experience.
Join For FreeYou’ve likely heard about the benefits of partitioning data by a single dimension to boost retrieval performance. It’s a common practice in relational databases, NoSQL databases, and, notably, data lakes. For example, a very common dimension to partition data in data lakes is by date or time. However, what if your data querying requirements involve multiple dimensions? Let’s say you wish to query your data by field A and also by field B, or sometimes by field A but other times by field B.
In this post, I’ll go over several common options for such a case.
For convenience, I’ll give examples of how to implement it in the data lake using standard folder names and Parquet to store the data. You should know, however, that the paradigms are also valid for other areas like DBs, NoSQL DBs, memory storage, and so on.
The Default: Micro-Partitions
Micro-partitions are a technique used to sub-partition data within a dataset. Each micro-partition contains metadata for individual fields, providing valuable information for optimizing data consumption performance.
For instance, consider a scenario where data is organized into daily partitions stored in Parquet files.
<dataset_root>/day=20240101/data1.parquet
<dataset_root>/day=20240101/data2.parquet
<dataset_root>/day=20240101/data3.parquet
In this setup, each Parquet file (or even each page within a Parquet file) can be referred to as a micro-partition. Parquet files inherently store metadata per file and per page, which can enhance data consumption performance.
Snowflake also employs micro-partitions by default, but it uses richer metadata and superior indexing capabilities than the simple Parquet files. This enhanced metadata and indexing within Snowflake’s micro-partitions contribute to significant performance gains, making micro-partitions a highly beneficial feature within the platform.
The Obvious Approach: Nested Partitions
Let’s start with nested partitions. In a typical Hive partition structure, it looks like this:
<dataset_root>/<FieldA>=<value>/<FieldB>=<value>/data.parquet
While this works well for consistent queries involving both field A and field B, it falls short when you need the flexibility to query either field separately. For instance:
//Good for this:
Spark.sql("select * from my_data_set where FieldA=11 and FieldB=22 ");
//Not so good for this:
Spark.sql("select * from my_data_set where FieldA=11");
Spark.sql("select * from my_data_set where FieldB=22");
The Opposite Approach: Data Duplication With Separate Partitions
Another approach involves duplicating the data and partitioning it once by field A and once by field B. The directory structure in a hive-like structure might look like this:
<dataset_root>/<FieldA>=<value>/data.parquet
and
<dataset_root>/<FieldB>=<value>/data.parquet
It represents the opposite of the previous option, meaning:
//Good for this:
Spark.sql("select * from my_data_set where FieldA=11");
Spark.sql("select * from my_data_set where FieldB=22");
//Not good for this:
Spark.sql("select * from my_data_set where FieldA=11 and FieldB=22 ");
Also, maintaining data consistency becomes more challenging in this scenario.
Best of Both Worlds? Partitioning by Field A + Externally Indexing by Field B
A widely adopted strategy in databases. The advantage here is that the index serves as a reference to the data, not a copy of it.
In the data lake world, it means partitioning the data by field A, the same as before.
<dataset_root>/<FieldA>=<value>/data.parquet
And in addition, maintaining a slim dataset that references the same data files by field B values.
In data lakes, it’s possible to implement it yourself, though it's usually implemented using an additional data catalog. This is also one of the advantages of using lakehouses (like the Databricks data lakehouse) since you get it out of the box.
It’s ideal for cases where you need to query the data based on specific values for field B.
Spark.sql("select * from my_data_set where FieldB=22");
However, it’s less suitable for queries involving a range of values for field B.
Spark.sql("select * from my_data_set where FieldB>22");
The reason it is not as useful is that the indexed keys are not stored in a continuous manner on the machine, like partitions usually are.
Often Useful: Partitioning by Field A + Clustering or Sorting by Field B
This is an improvement over the micro-partitions approach. Here, we partition the data by field A as you normally do, but ensure that within each partition, the data is clustered by field B.
Here is one example of how to implement it using Spark:
// partition the data partitioned by A and inside partitioned by B
val sortedDF = df.repartitionByRange($"fieldA", $"fieldB")
// than write the data in a partitioned manner
sortedDF.write
.mode(SaveMode.Overwrite)
.partitionBy("fieldA")
.parquet("/dataset_root")
In the example above, data will be written partitioned by field A. But within each partition, the data will also be divided into files (micro-partitioned) by field B.
The theologies used need to support this, of course. In the case of Parquet, it works well since Parquet holds metadata for each field, which includes min and max values. Most technologies (like Apache Spark) take this into account, so they can skip files that do not include the required value for field B.
This is a solid choice for various use cases, while it is not the best approach for queries like this:
Spark.sql("select * from my_data_set where FieldB=22 ");
or
Spark.sql("select * from my_data_set where FieldB>22 ");
Since it means going over all partitions. However, since the data is grouped by field B within the partitions, at least some of the files may be skipped.
This approach is particularly useful when field B contains a wide range of possible values (high cardinality). It can also be a beneficial design when field B’s values are unevenly distributed (skewed).
This is why this paradigm is very common in multiple technologies, for example: clustering in BigQuery, sort key in DynamoDB. clustering within micro-partitions in Snowflake and so on.
The Secret Weapon: Z-Order
A less common but important option is to index or partition by a Z-order. In this case, the data is also sorted, but instead of being partitioned by field A and sorted by fields B, it will be sorted by a key that is a composite of both fields A and B:
<dataset_root>/<A combination of FieldA+fieldB>
This method is actually ideal for all of the query types mentioned so far.
The secret is in the method, which combines the two fields together; it makes sure that keys with similar values are stored in proximity to one another, and this holds true for both fields that make up the partition. So, no matter whether you’re retrieving data based on one field or both, whether you need a precise value or a range of values, this method will surely help. Also, like the previous method, this method is good for high cardinality and skews as well.
The implementation of this is not very common, though, and quite complex. Currently, the most common implementations are by hosted environments like Databricks Lakehouse.
Conclusion
Choosing the right strategy for multi-dimensional data querying depends on your specific use case. Each approach has its strengths and trade-offs. Whether you go for nested partitions, data duplication, external indexing, sorting, or Z-indexing, understanding these strategies equips you to make informed decisions based on your data lake architecture and querying needs.
Published at DZone with permission of Avi Yehuda. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments