Big Data Building Blocks: Selecting Architectures and Open-Source Frameworks


A developer dives into the world of big data, stream processing, and sentiment analysis. Read on to take the plunge with her!


This article is featured in the new DZone Guide to Big Data: Volume, Variety, and Velocity. Get your free copy for insightful articles, industry stats, and more!

From bare-metal servers to the cloud, data at scale is everywhere. What we don't talk about as much, however, is that it too often ends up as a tedious series of tasks: complex real-time processing, event processing, data analysis, data streaming, and building machine learning models, all the way down to simpler activities like cleaning and preparing data. Developers, DevOps teams, and data scientists across industries and organizations are searching for ways to automate these processes, but they still face another issue: finding the best solutions to big data's biggest challenges, like high availability, horizontal scaling, and fault tolerance. Many open-source tools, like Apache Kafka and Apache Spark, claim to solve these challenges, but that leaves teams asking questions like:

  • How do we choose the best tool for our organization and products?
  • Is there a generic architecture for data at scale?
  • Can we use a basic architecture to start?

In this article, we'll use a common business scenario, walking through our requirements and decision process, to help you see how to evaluate various architectures and open-source tools and find the best solution for your own situation.

The Scenario: Twitter Sentiment Analysis

Many customers use social media to talk about products and services. Twitter is no different. Opinion-filled tweets can go viral and dramatically impact your product's (and company's) reputation. So, in our sample scenario, let’s imagine we’re a regional retail company. We would like to track and analyze Twitter posts in real-time so that we can take action when necessary, showing our appreciation for positive feedback and mitigating customer dissatisfaction quickly.

  • The problem: We have many products and services, and across all business units, our customers generate 100,000 tweets each hour, seven days a week.
  • The business goal: An easy, automated way to understand social sentiment so that we catch issues before they escalate.
  • The solution requirements: Real-time social media monitoring, horizontal scaling to accommodate future growth, and a real-time summary and alert system that pings customer success/operations teams (based on certain criteria).

What We'll Use

  • A real-time subscriber that pulls data from the Twitter API.
  • A message queue.
  • A text parser that transforms tweets into a format our sentiment analysis engine can consume, and enriches them with elements such as likes, retweets, etc.
  • A sentiment analysis engine that evaluates each tweet's "feeling" and returns its sentiment (positive/negative).
  • A score engine that receives data from the sentiment analysis engine and the parser, runs an algorithm that determines tweet severity, and decides whether or not to fire an alert to our teams.
  • A database that stores our data, which the service layer uses to feed a dashboard in real time.
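To make the text parser step concrete, here is a minimal sketch in Python. It assumes tweets arrive as JSON payloads; the field names (`favorite_count`, `retweet_count`) are illustrative, not the exact Twitter API schema.

```python
import json

def parse_tweet(raw: str) -> dict:
    """Normalize a raw tweet JSON payload into the record our sentiment
    and score engines consume. Field names are illustrative."""
    tweet = json.loads(raw)
    return {
        "id": tweet["id"],
        "text": tweet["text"].strip(),
        "likes": tweet.get("favorite_count", 0),
        "retweets": tweet.get("retweet_count", 0),
    }

record = parse_tweet('{"id": 1, "text": " Love this product! ", "retweet_count": 3}')
```

Missing engagement fields default to zero, so downstream components can rely on every record having the same shape.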

Generic Architecture Options

Kappa Architecture

A Kappa architecture consists of a message queue, a real-time processing layer, and a service layer. It is designed for both parallel processing and asynchronous or synchronous pipelines.

Lambda Architecture

A Lambda architecture is similar to the Kappa, with one extra layer, the batch layer, which combines output from all ingested data according to the product needs. In our example, the service layer is in charge of creating our dashboard view, where we combine output from the real-time processing and batch processing input to create a view that consists of insights from both.

Our Solution Choice: Kappa Architecture

The Lambda architecture can be somewhat complex due to the need to maintain both batch and real-time processing codebases. Since we would like to start with a simplified solution, we'll use the Kappa architecture option.

Our goals:

  • Minimum latency: Evaluate new tweets as quickly as possible (i.e., within a few seconds).
  • High throughput: Digest many tweets in parallel.
  • Horizontal scalability: Accommodate increase in loads.
  • Integrations: Allow different components in our system to integrate with other systems, such as alerting systems, web services, and so on. This will help us evaluate future changes in the architecture to support new features.

[Figure: data architecture diagram]

Our Decision Process

For real-time subscribers, many commercial products exist, both in the cloud (e.g., Azure Logic Apps) and on-premises. The same goes for the alert system. One option is Graphite, which works well with Grafana. Although these two are popular, we chose Prometheus instead: it provides a full, out-of-the-box alert management system (which Graphite and Grafana lack), as well as data collection and basic visualization.

As for our message queue, there are multiple products (e.g., RabbitMQ, Apache ActiveMQ, and Apache Kafka). Many frameworks, like Apache Spark and Apache Storm, have built-in integration with Kafka. We selected Kafka for its scalability, high throughput, easy integration, and community support. On top of that, Kafka clients are available in various programming languages, which spares the team from learning a dedicated language.
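Kafka's in-order guarantee holds per partition: records with the same key always land on the same partition. The routing idea can be sketched as a toy model (the real clients, such as kafka-python or confluent-kafka, do this for you, and Kafka's default partitioner uses murmur2 rather than MD5):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Route a record key to a partition, the way Kafka's default
    partitioner does. We hash the key so that, e.g., all tweets from
    one user stay in order on a single partition."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Events with the same key route consistently:
p1 = partition_for("user_42", 6)
p2 = partition_for("user_42", 6)
```

This per-key ordering is what lets downstream consumers process one user's tweets sequentially while the cluster scales out across partitions.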

Our text parser, sentiment analysis engine, and score engine can be built with Apache Storm or Apache Spark Streaming. Apache Storm focuses on stream processing, performs task-parallel computations, and can be used with any programming language. Apache Spark includes a built-in machine learning library, MLlib, which could help us with the sentiment engine. But Spark Streaming processes data in micro-batches, which might limit our latency. Given this, and since MLlib doesn't support all machine learning algorithms, we chose to work with Apache Storm. With Storm as the streaming layer, we can process each tweet as one task, which ends up being faster than working with data-parallel-driven Spark. This meets our scalability, low-latency, and high-throughput goals.
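The score engine's decision can be sketched as a plain function; in the real system this logic would run inside a Storm bolt. The weights and threshold below are invented for illustration only:

```python
def severity(sentiment: float, likes: int, retweets: int) -> float:
    """Combine model sentiment (-1.0 very negative .. +1.0 very positive)
    with reach. Negative tweets that are spreading fast score highest.
    Weights are illustrative."""
    reach = likes + 2 * retweets  # retweets spread further than likes
    return max(0.0, -sentiment) * (1 + reach)

def should_alert(sentiment: float, likes: int, retweets: int,
                 threshold: float = 50.0) -> bool:
    """Fire an alert to the customer success team above a severity threshold."""
    return severity(sentiment, likes, retweets) >= threshold

# A very negative tweet with 100 retweets trips the alert:
fired = should_alert(-0.9, likes=10, retweets=100)
```

Positive tweets score zero severity and never alert; only negative sentiment amplified by engagement crosses the threshold.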

Finally, we need to evaluate our database options. Since we want to see a summary of our analyses in real-time on a dashboard, we'll use an indexed, in-memory DB like Redis for faster queries and minimum latency.
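The dashboard query pattern maps naturally onto a Redis sorted set keyed by severity score. Below is an in-memory stand-in so the idea is concrete; the real calls would be redis-py's `zadd` and `zrevrange` against a running Redis server:

```python
class DashboardStore:
    """Toy model of a Redis sorted set: tweets scored by severity, so
    the dashboard can fetch the worst N in one call. In production this
    would be redis-py: r.zadd("tweets", ...), r.zrevrange("tweets", 0, n-1)."""

    def __init__(self) -> None:
        self._scores: dict[str, float] = {}

    def zadd(self, tweet_id: str, score: float) -> None:
        self._scores[tweet_id] = score

    def top(self, n: int) -> list[str]:
        # ZREVRANGE equivalent: highest severity first.
        return sorted(self._scores, key=self._scores.get, reverse=True)[:n]

store = DashboardStore()
store.zadd("tweet:1", 12.5)
store.zadd("tweet:2", 89.0)
store.zadd("tweet:3", 3.1)
worst = store.top(2)
```

Because the set is kept sorted by score, the "worst tweets right now" view is a single O(log n) range read rather than a full table scan.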

[Figure: final data architecture diagram]

It's important to note that our architecture setup allows changes. As our system develops and we add statistical features, we can add a batch layer, turning the Kappa architecture into a Lambda one. The architecture also supports adding multiple inputs from various social networks, like LinkedIn. If there's a good chance that you'll add a batch layer in the future, consider using Apache Spark instead of Apache Storm. This way, you'll maintain one framework for both stream and batch processing.

Big Data Landscape at a Glance: Overview and Considerations

Apache Kafka provides a unified, high-throughput, low-latency platform for handling real-time data feeds, and most clouds offer a managed Kafka. Kafka also provides Kafka Streams for streaming applications, as well as Kafka Connect, with which we can create a database connector that reads data from Kafka and writes it to a desired database, like PostgreSQL, MongoDB, and more. Kafka delivers an in-order, persistent, and scalable messaging system, supports microservices, and has a thriving open-source community — making it an extremely popular message queue framework.

Redis is an open-source in-memory data structure project that implements a distributed, in-memory key-value database with optional durability. Redis supports different kinds of abstract data structures, such as strings, lists, maps, sets, sorted sets, HyperLogLogs, bitmaps, streams, and spatial indexes. In the Stack Overflow 2018 Developer Survey, Redis ranks as developers' "most loved database," but it's important to note that Redis doesn't support join operations or a query language out of the box. You'll also need to learn Lua to create your own stored procedures, so the learning process is longer and perhaps harder.

Apache Spark is an open-source, distributed, general-purpose cluster computing framework used most often for machine learning, stream processing, data integration, and interactive analytics. Spark Streaming is built on top of Apache Spark. It's a scalable, fault-tolerant stream processing system that natively supports both batch and streaming workloads, with out-of-the-box support for graph and machine learning libraries. Spark is relatively simple to grasp, is extremely fast, and has vast community support.
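The micro-batch trade-off mentioned earlier can be seen in miniature: Spark Streaming groups incoming events into fixed intervals, and nothing in a batch is processed until the batch closes. This is a simplified model of that behavior, not Spark's actual API:

```python
def micro_batches(events, batch_size):
    """Group an event stream into fixed-size batches, analogous to how
    Spark Streaming groups events by time interval. An event waits until
    its batch closes before processing begins, which is where the extra
    latency comes from compared to per-event (task-parallel) processing."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

batches = list(micro_batches(range(7), batch_size=3))
```

With per-event processing (Storm's model), event 0 would be handled immediately; here it waits until events 1 and 2 arrive and the batch closes.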

Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language. Usually compared with Spark Streaming, Storm is focused on stream processing and performs task-parallel computations. Storm is often used with Kafka, where Storm processes the data and Kafka serves as the event stream or message bus. Storm has strong support in the cloud and can be used with any programming language, since communication happens over a JSON-based protocol; adapters currently exist for Python, Ruby, JavaScript, Perl, and more.

Compared to Apache Samza, yet another stream processing framework, Storm shines in high-speed event processing (which is good for synchronous systems), while Samza shines in processing many gigabytes. Both are well-suited to work with massive amounts of real-time data.

What's Next

There are endless topics, technologies, scenarios, and requirements, so it's impossible to cover them all. This article gives you a foundation to evaluate your needs, understand the broader database architecture and technology landscape, and select from popular open-source tools and services.

However, it's important to remember that, as with all technology, new products and updates are released at a rapid pace. When you build your solution, think about how it'll handle future improvements, and continually evaluate if and how your solution meets your needs.

In summary: continue learning, always.

And I'm a lifelong learner who will be learning right along with you. If you'd like to see what I'm discovering, I write about all things big data on my personal blog.

If you'd like to build your own sentiment analysis tool, check out this Azure Stream Analytics tutorial.



