
High-Performance Real-Time Processing With Snowflake

By John Ryan · Jul. 06, 20 · Presentation

Until recently, the standard solution for capturing, analyzing, and storing data in near real-time involved the Hadoop toolset. This article explains how Snowflake has simplified that solution and now supports high-performance real-time data capture, transformation, and processing using nothing more complex than industry-standard SQL.

The Requirement

The requirements include the ability to capture, transform, and analyze data at a potentially massive velocity in near real-time. This involves capturing data from millions of electronic sensors, then transforming and storing the results for real-time analysis on dashboards. The solution must minimize latency (the delay between a real-world event and its impact upon dashboards) to a few minutes.

Typical applications include:

  • Monitoring Machine Sensors: Using embedded sensors in industrial machines or vehicles. For example, Progressive Insurance uses real-time speed data to help analyze customer behavior and deliver appropriate discounts, and FedEx uses SenseAware to provide near real-time parcel tracking.
  • Fraud Detection: To assess the risk of credit card fraud before authorizing or declining a transaction. This can be based upon a simple report of a lost or stolen card or, more likely, an analysis of aggregate spending behavior combined with machine learning techniques.

What's the Problem?

The primary challenge for systems architects is the potentially massive throughput required, which could exceed a million transactions per second. NoSQL databases can handle the data velocity, but at the cost of no SQL access, no transaction support, and eventual consistency.

They also lack flexible join operations, and analytic query options are limited or non-existent. This means you can quickly retrieve a key-value pair for an event, but analyzing the data is a severe challenge.
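
To make the contrast concrete, consider the kind of ad-hoc join-and-aggregate query that is trivial in a SQL engine but awkward or impossible in a pure key-value store. This is an illustrative sketch only; the table and column names are hypothetical:

    -- Average sensor reading per device model over the last hour,
    -- joining the live event table to reference data.
    SELECT d.model,
           AVG(r.reading) AS avg_reading,
           COUNT(*)       AS event_count
    FROM   sensor_readings  r
    JOIN   device_reference d ON d.sensor_id = r.sensor_id
    WHERE  r.reading_time >= DATEADD(hour, -1, CURRENT_TIMESTAMP())
    GROUP BY d.model
    ORDER BY event_count DESC;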

However, it doesn't stop there.

Data Processing Pipeline

The Components Needed

The diagram above illustrates the main architectural components needed to solve this problem. This includes:

Data Capture

  • High-Velocity Data Capture: The ability to capture high-velocity message streams from multiple data sources, in the range of hundreds of megabytes per second.
  • Message Queuing: We can expect short-term spikes in data volume, which implies a message-buffering solution that avoids the need to scale up the entire stack for the worst possible case.
  • Guaranteed Message Delivery: Which implies a scale-out, fault-tolerant, highly available solution that gracefully handles individual node failure and guarantees message delivery.
  • Architectural Separation: To decouple the source systems from the messaging, transformation, and data storage components. Ideally, the solution should allow independent scaling of each component in the stack.

Transformation

  • Data Integration: The transformation process will almost certainly need to combine real-time transaction streams with existing reference data from databases and other data sources. The solution must therefore provide excellent data source connectivity and seamless integration with other sources.
  • Exactly-Once Processing: The transformation process needs to be resilient to failure and restart operations, effectively guaranteeing every message is processed exactly once.
  • Massively Scalable: While the data capture component will help smooth out massive spikes in velocity, the solution must transparently scale to deal with both regular and unexpected workloads.

Storage and Analytics

  • Unlimited Data Storage: The data storage solution must be capable of accepting, processing, and storing millions of transactions, ideally in a single repository. This implies an almost unlimited data storage capacity, combining both Data Lake and analytics capability on a single platform.
  • Dashboard Connectivity: The solution must provide support for open connectivity standards including JDBC and ODBC to support Business Intelligence and dashboards. There's little value in capturing the data if it cannot be analyzed.

Classic Lambda Architecture

The Traditional Solution

The diagram above illustrates a common design referred to as the Lambda Architecture, which combines a Speed Layer that processes data in real-time with a Batch Layer that produces an accurate historical record. In essence, this splits the problem into two distinct components, and the results are combined at query time in the Serving Layer to deliver results.

"Keeping code written in two different systems perfectly in sync was really, really hard". - Jay Kreps on Lambda (LinkedIn)

While Lambda Architecture has many advantages, including decoupling and separation of responsibility, it also has the following disadvantages:

  • Logic Duplication: Much of the logic to transform the data is duplicated in both the Speed and Batch layers. This adds to the system complexity and creates maintenance challenges, as code needs to be maintained in two places, often using two completely different technologies. As Jay Kreps testified from his experience at LinkedIn, keeping code written in two different systems in sync was really hard.
  • Batch Processing Effort: The batch processing layer assumes all input data is re-processed every time. This has the advantage of guaranteeing accuracy, as code changes are applied to the full data set every time, but it places a huge batch processing burden on the system.
  • Serving Layer Complexity: As data is independently processed by the Batch and Speed layers, the Serving Layer must execute queries against two data sources and combine real-time and historical results into a single query. This adds complexity to the solution and may rule out direct access from some dashboard tools or require additional development effort.
  • NoSQL Data Storage: While batch processing typically uses Hadoop/HDFS for data storage, the Speed Layer needs fast random access to data and typically uses a NoSQL database, for example, HBase. This comes with huge disadvantages, including no industry-standard SQL interface, a lack of join operations, and no support for ad-hoc analytic queries.

When the only transformation tool available was MapReduce with NoSQL for data storage, the Lambda Architecture was a sensible solution, and it has been successfully deployed at scale at Twitter and LinkedIn. However, more advanced (and simpler) alternatives are now available.

Real Time Analytics with Snowflake

The Snowflake Solution

The diagram above illustrates an alternative, simpler solution with a single real-time data flow from source to dashboard. The critical component that makes this possible is the Snowflake data warehouse, which now includes a native Kafka connector in addition to Streams and Tasks to capture, transform, and analyze data in near real-time.

The components in the above solution are:

  • Apache Kafka: A fault-tolerant message queuing and broadcast system.
  • Snowflake Streams and Tasks: To receive the data, perform change data capture, and transform and store the results ready for analysis and presentation (see the sketch below).
  • Snowflake Multi-Cluster Architecture: To seamlessly handle thousands of concurrent online users analyzing results.
  • Tableau: For analytic presentation and dashboards.
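
As a minimal sketch of how Streams and Tasks fit together, assuming the Kafka connector lands messages in VARIANT columns (its default behavior) and using hypothetical table, stream, task, and warehouse names:

    -- Landing table populated by the Kafka connector, which by default
    -- writes each message into a pair of VARIANT columns.
    CREATE TABLE sensor_landing (
        record_metadata VARIANT,
        record_content  VARIANT
    );

    -- A stream performs change data capture, tracking new rows
    -- arriving in the landing table.
    CREATE STREAM sensor_stream ON TABLE sensor_landing;

    -- A task runs on a schedule, but only when the stream has data.
    -- Consuming the stream inside the task's transaction advances its
    -- offset, which is what delivers effectively exactly-once processing.
    -- (The target table sensor_readings is assumed to exist.)
    CREATE TASK transform_sensor_data
        WAREHOUSE = transform_wh
        SCHEDULE  = '1 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('SENSOR_STREAM')
    AS
        INSERT INTO sensor_readings (sensor_id, reading_time, reading)
        SELECT record_content:sensorId::STRING,
               record_content:timestamp::TIMESTAMP_NTZ,
               record_content:value::FLOAT
        FROM   sensor_stream;

    -- Tasks are created suspended; resume to start the pipeline.
    ALTER TASK transform_sensor_data RESUME;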

The advantages of this architecture include:

  • Absolute Simplicity: A pipeline to capture data, implement change data capture, and store the results can be built with just a handful of SQL statements, as the sketch above shows.
  • SQL Transformations: With all data transformation logic in the Snowflake transformation component (using industry-standard SQL), there's no code duplication or multiple technologies to cause maintenance issues.
  • Real-Time Accuracy: As the database solution provides full relational support and ACID compliance, there is no issue around eventual consistency from NoSQL solutions.

The diagram below illustrates one of the unique benefits of Snowflake, which delivers unlimited cloud-based compute resources as Virtual Warehouses over a shared storage system. This means it's no longer necessary to provide separate speed and batch processing layers: data can be continuously streamed into the warehouse using Snowpipe, transformed on one virtual warehouse, and analyzed on another, as sketched below.

Snowflake Workload Separation
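
In practice, that separation might look like the following, with independent virtual warehouses (names hypothetical) for transformation and analysis, each sized, scaled, and suspended on its own over the same shared storage:

    -- Warehouse used by the transformation task shown earlier.
    CREATE WAREHOUSE transform_wh
        WAREHOUSE_SIZE = 'MEDIUM'
        AUTO_SUSPEND   = 60      -- seconds of inactivity before suspending
        AUTO_RESUME    = TRUE;

    -- Multi-cluster warehouse for dashboard users, scaling out for
    -- concurrency. (Multi-cluster warehouses assume a Snowflake edition
    -- that supports them.)
    CREATE WAREHOUSE analytics_wh
        WAREHOUSE_SIZE    = 'LARGE'
        MIN_CLUSTER_COUNT = 1
        MAX_CLUSTER_COUNT = 4
        AUTO_SUSPEND      = 300
        AUTO_RESUME       = TRUE;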

Conclusion

Snowflake is uniquely positioned to provide a single platform for all your data warehouse storage, processing, and analysis needs. This includes:

  • Near Real-Time Streaming: Using Snowpipe and the native Kafka connector to capture and ingest streaming data without contention.
  • Semi-Structured Processing: Which uses a simple SQL interface to provide a real-time schema-on-read view over JSON, Avro, Parquet, and XML (see the sketch after this list). Unlike other databases that store semi-structured data as simple text, Snowflake parses the data and seamlessly stores it in a columnar data structure for fast retrieval.
  • A Data Lake: Which can combine both semi-structured JSON and structured CSV formats. This helps abolish the separate data silos of the Data Lake and the Warehouse.
  • A Data Warehouse: With full ACID compliance and an industry-standard SQL interface over an unlimited data store.
  • Integration With Transformation and Analytic Tools: Including native connectors for Spark, Talend, Informatica, Looker, Power BI, and Tableau.
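
To illustrate the schema-on-read point, here is a sketch of querying raw JSON held in a VARIANT column; the table, column, and path names are hypothetical:

    -- Path notation reads fields directly from the parsed JSON...
    SELECT payload:device.id::STRING     AS device_id,
           payload:device.region::STRING AS region
    FROM   raw_events
    WHERE  payload:device.region::STRING = 'EMEA';

    -- ...and LATERAL FLATTEN unnests an embedded array of readings.
    SELECT f.value:sensorId::STRING AS sensor_id,
           f.value:reading::FLOAT   AS reading
    FROM   raw_events e,
           LATERAL FLATTEN(input => e.payload:readings) f;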

Disclaimer: The opinions expressed in my articles are my own, and will not necessarily reflect those of my employer (past or present).


Published at DZone with permission of John Ryan, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
