High-Performance Real-Time Processing With Snowflake
High-Performance Real-Time Processing With Snowflake
Until recently, the standard solution to capture, analyze, and store data in near real-time involved using the Hadoop toolset. See a simpler solution here.
Join the DZone community and get the full member experience.Join For Free
Until recently, the standard solution to capture, analyze, and store data in near real-time involved using the Hadoop toolset. This article explains how Snowflake has simplified the solution, and now supports high-performance real-time data capture, transformation, and processing using nothing more complex than industry-standard SQL.
The requirements include the ability to capture, transform, and analyze data at a potentially massive velocity in near real-time. This involves capturing data from millions of electronic sensors and transforming and storing the results for real-time analysis on dashboards. The solution must minimize latency — the delay between a real-world event and its impact upon dashboards, to a few minutes.
Typical applications include:
- Monitoring Machine Sensors: Using embedded sensors in industrial machines or vehicles. For example, Progressive Insurance uses real-time speed data to help analyze customer behavior and deliver appropriate discounts. Similar technology is used by FedEx, which uses SenseAware to provide near real-time parcel tracking.
- Fraud Detection: To assess the risk of credit card fraud before authorizing or declining the transaction. This can be based upon a simple report of a lost or stolen card, or more likely, an analysis of aggregate spending behavior, aligned with machine learning techniques.
What's the Problem?
The primary challenge for systems architects is the potentially massive throughput required, which could exceed a million transactions per second. NoSQL databases can handle the data velocity but have the disadvantages associated with a lack of SQL access, no transaction support, and eventual consistency.
Finally, they don't support flexible join operations, and analytic query options are limited or non-existent. This means you can quickly retrieve a key-value pair for an event, but analyzing the data is a severe challenge.
However, it doesn't stop there.
The Components Needed
The diagram above illustrates the main architectural components needed to solve this problem. This includes:
- High-Velocity Data Capture: The ability to capture high-velocity message streams from multiple data sources in the range of hundreds of megabytes per second.
- Message Queuing: We can expect short-term spikes in data volume, which implies a message handling solution to avoid the need to scale up the entire solution for the worst possible case.
- Guaranteed message delivery: Which implies a scale-out, fault-tolerant, highly available solution that gracefully handles individual node failure, and guarantees message delivery.
- Architectural Separation: To decouple the source systems from the messaging, transformation and data storage components. Ideally, the solution should allow independent scaling of each component in the stack.
- Data integration: The transformation process will almost certainly need to combine real-time transaction streams with existing reference data from databases and other data sources. The solution must, therefore, provide excellent data source connectivity and seamless integration with other sources.
- Guaranteed once only processing: The transformation process needs to be resilient to failure and re-start operations, effectively guaranteeing every message will be processed once only.
- Massively Scalable: While the data capture component will help smooth out massive spikes in velocity, the solution must transparently scale to deal with both regular and unexpected workloads.
Storage and Analytics
- Unlimited Data Storage: The data storage solution must be capable of accepting, processing, and storing millions of transactions, ideally in a single repository. This implies an almost unlimited data storage capacity, combining both Data Lake and analytics capability on a single platform.
- Dashboard Connectivity: The solution must provide support for open connectivity standards including JDBC and ODBC to support Business Intelligence and dashboards. There's little value in capturing the data if it cannot be analyzed.
The Traditional Solution
The diagram above illustrates a common architecture referred to as Lambda Architecture, which includes a Speed Layer to process data in real-time with a Batch Layer to produce an accurate historical record. In essence, this splits the problem into two distinct components, and the results are combined at query time in the Serving Layer to deliver results.
"Keeping code written in two different systems perfectly in sync was really, really hard". - Jay Kreps on Lambda (LinkedIn)
While Lambda Architecture has many advantages, including decoupling and separation of responsibility, it also has the following disadvantages:
- Logic Duplication: Much of the logic to transform the data is duplicated in both the Speed and Batch layers. This adds to the system complexity and creates challenges for maintenance as code needs to be maintained in two places – often using two completely different technologies. As Jay Kreps who invented the Lambda Architecture while at LinkedIn testifies, keeping code written in two different systems was really hard.
- Batch Processing Effort: The batch processing layer assumes all input data is re-processed every time. This has the advantage of guaranteeing accuracy, as code changes are applied to the data every time but place a huge batch processing burden on the system.
- Serving Layer Complexity: As data is independently processed by the Batch and Speed layers, the Serving Layer must execute queries against two data sources and combine real-time and historical results into a single query. This adds additional complexity to the solution and may rule out direct access from some dashboard tools or need additional development effort.
- NoSQL Data Storage: While batch processing typically uses Hadoop/HDFS for data storage, the Speed Layer needs fast random access to data, and typically uses a NoSQL database, for example, HBase. This comes with huge disadvantages including no industry standard SQL interface, a lack of join operations, and no support for ad-hoc analytic queries.
When the only transformation tool available was Map-Reduce with NoSQL for data storage, Lambda Architecture was a sensible solution, and it has been successfully deployed at scale at Twitter and LinkedIn. However, there are more advanced (and simple) alternatives available.
The Snowflake Solution
The diagram above illustrates an alternative simple solution with a single real-time data flow from source to dashboard. The critical component that makes this possible is the Snowflake data warehouse which now includes a native Kafka connector in addition to Streams and Tasks to capture, transform, and analyze data in near real-time.
The components in the above solution are:
- Apache Kafka: For fault-tolerant message queuing and broadcast system.
- Snowflake Streams and Tasks: To receive the data, perform change data capture and transform and store data ready for analysis and presentation.
- Snowflake Multi-cluster Architecture: To seamlessly handle thousands of online concurrent users to analyze results.
- Tableau: For analytic presentation and dashboards.
The advantages of this architecture include:
- Absolute Simplicity: As a pipeline to capture, implement change data capture, and storage can be completed with just a handful of SQL statements.
- SQL Transformations: With all data transformation logic in the Snowflake transformation component (using industry-standard SQL), there's no code duplication or multiple technologies to cause maintenance issues.
- Real-Time Accuracy: As the database solution provides full relational support and ACID compliance, there is no issue around eventual consistency from NoSQL solutions.
The diagram below illustrates one of the unique benefits of Snowflake which delivers unlimited cloud-based compute resources as Virtual Warehouses over a shared storage system. This means, it's no longer necessary to provide a separate speed and batch processing layer, as queries can be continually streamed into the warehouse using Snowpipe, while being transformed on one virtual warehouse, and results analyzed on yet another.
Snowflake is uniquely positioned to provide a single platform for all your data warehouse storage, processing, and analysis needs. This includes:
- Near Real-Time Streaming: Using Snowpipe and the native Kafka connector to capture and ingest streaming data without contention.
- Semi-structured Processing: Which uses a simple SQL interface to provide a real-time schema-on-read view over JSON, AVRO, Parquet and XML. Unlike other databases that store semi-structured data as simple text, Snowflake parses the data and seamlessly stores it in a columnar data structure for fast retrieval.
- A Data Lake: Which can combine both semi-structured JSON and structured CSV formats. This helps abolish the separate data silos of Data Lake and Warehouse.
- A Data Warehouse: With full ACID compliance, and an industry-standard SQL interface over an unlimited data store.
- Integration to Transformation and Analytic tools: Including native connectors for Spark, Talend, Informatica, Looker, PowerBI and Tableau.
Disclaimer: The opinions expressed in my articles are my own, and will not necessarily reflect those of my employer (past or present).
Published at DZone with permission of John Ryan , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.