Optimizing Trino Performance With Materialized Views in a Data Lake
In this article, learn how Trino materialized views boosted our Iceberg-based data lake, improving real-time query speed, reducing load, and cutting costs.
Join the DZone community and get the full member experience.
Join For FreeIn this article, I share how we improved the performance of our Trino-based data lake by using materialized views. Our service evolved from a dual-storage system built on HBase and Elasticsearch to a simplified, cost-efficient data lake architecture powered by Iceberg, Spark Streaming, and Trino. The transition brought significant advantages but also unexpected performance challenges that we solved through careful use of Trino’s materialized views.
Business Description
Our service receives data from a Kafka source on three different topics and inserts it into HBase and Elasticsearch. HBase was used for get-by-ID operations, while Elasticsearch handled GraphQL-style search queries. HBase is known for excellent insert performance and fast get-by-ID operations, and Elasticsearch provides powerful full-text search capabilities. Over time, however, we realized that we were not using most of Elasticsearch’s advanced search features. Maintaining both systems was costly, and the operational complexity of supporting two clusters — HBase and Elasticsearch — was high. We decided to migrate to a modern data lake architecture to improve scalability and cost efficiency.
To achieve this, we built a Spark Streaming job that consumes data from Kafka and writes it to our Iceberg catalog. For query execution, we adopted Trino as our distributed SQL engine. Trino was also used for joining data from multiple sources, but that part is outside the scope of this article.
Performance Challenges
We expected some performance degradation compared to HBase and Elasticsearch, but the loss was more significant than anticipated. Queries that previously took less than a second began taking more than 10 seconds. After investigating query logs, we found that around 85% of all queries were searching for data from the last one to two hours. Handling such high query volume under real-time conditions proved challenging for Trino. After trying all possible partitioning optimizations, we turned to one of Trino’s powerful features: materialized views.
New Technical Architecture
Trino materialized views are physical representations of query results at the time of refresh. They store precomputed data, allowing future queries to execute faster than if they had to re-run the underlying query. Below is an example syntax for creating a materialized view:
CREATE [ OR REPLACE ] MATERIALIZED VIEW
[ IF NOT EXISTS ] view_name
[ GRACE PERIOD interval ]
[ COMMENT string ]
[ WITH properties ]
AS query
In our implementation, we created a new view each time new data was inserted. Trino also provides the command “REFRESH MATERIALIZED VIEW,” which repopulates or refreshes the stored data. Using the refresh command gave us a 3–4× performance improvement, but each refresh took about 20–30 seconds. Because we needed near-real-time results, we used the “CREATE OR REPLACE MATERIALIZED VIEW” command every 20 seconds, querying data from the past two minutes. This approach gave us relevant data with an operation time of just 1–2 seconds.
Performance Improvements
After implementing materialized views, we observed a 2–3× improvement in real-time query performance compared to querying full Iceberg tables. This was because Trino didn’t need to scan the entire table, only the view containing recent data. We also monitored the Trino coordinator and found that CPU spikes decreased by 30–35% during peak load periods. This confirmed that materialized views significantly reduced query complexity and load on the coordinator without introducing new bottlenecks.
Possible Problems
Overall, we didn’t encounter major issues with creating or using views. The creation process was straightforward and consistently provided strong performance gains. For use cases that require the highest query speed and can tolerate slightly longer refresh times, using “REFRESH MATERIALIZED VIEW” is a great option. We did not notice significant pressure on the Trino coordinator, but we recommend monitoring the coordinator's CPU usage closely during refresh operations.
Future Improvements and Observations
After observing such strong performance gains from materialized views, we decided to extend this approach to our daily ETL jobs. We created specialized Trino views managed by Airflow. Each job first executed a “CREATE OR REPLACE MATERIALIZED VIEW” command followed by a refresh. This allowed the ETL processes to run 2–3× faster while ensuring the freshness of data. In the future, we plan to integrate incremental refresh strategies and add automatic monitoring with Prometheus and Airflow DAGs to trigger refresh operations dynamically.
Conclusion
Using Trino’s materialized views, we built a cost-efficient and scalable analytical platform capable of near-real-time queries over Iceberg tables. This solution improved query performance, reduced infrastructure costs, and simplified our overall data-lake architecture. It also demonstrated how materialized views can serve as a lightweight caching mechanism for time-sensitive analytics workloads in distributed data systems.
Opinions expressed by DZone contributors are their own.
Comments