Let's take the classic problem of real-time sentiment analysis for a laptop brand.
For opinion analysis, the process flow should follow the steps shown in the diagram below:
Collect data from different sources such as Twitter, Facebook, and ecommerce sites.
Filter the data on the basis of keywords such as “High Throughput”.
Generate a sentiment for each message coming through the various sources.
Have a storage mechanism for storing the processed data.
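The four steps above can be sketched as a toy in-memory pipeline. Everything here is illustrative: the feed, the keyword, and the word lists used for sentiment scoring are all hypothetical stand-ins, and `store` is just a Python list standing in for a real datastore.

```python
# Toy sentiment pipeline: collect -> keyword filter -> sentiment -> store.
# The word lists below are illustrative, not a real sentiment model.
POSITIVE = {"great", "fast", "excellent", "high"}
NEGATIVE = {"slow", "bad", "poor", "overheats"}

def collect():
    # Stand-in for Twitter/Facebook/ecommerce feeds.
    return [
        "High Throughput and great battery on this laptop",
        "The laptop overheats and feels slow",
        "Unrelated post about the weather",
    ]

def keyword_filter(messages, keyword="high throughput"):
    # Keep only messages mentioning the keyword of interest.
    return [m for m in messages if keyword in m.lower()]

def sentiment(message):
    # Score a message by counting positive vs negative words.
    words = set(message.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

store = []  # stand-in for a real datastore
for msg in keyword_filter(collect()):
    store.append((msg, sentiment(msg)))
```

In a real deployment each step would be a separate distributed component; the point here is only the shape of the flow.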
Now the question is whether we can solve it using a Big Data system. Please see the process diagram below using Hadoop:
If we run a Hive query, a Pig script, or a MapReduce job, it will take hours to produce results, since each of them has to read data from HDFS (i.e., from disk) before doing any processing. So they can't, in practice, process data in real time: they follow a data-at-rest principle.
So the broad answer to the question of whether Hadoop can handle real-time processing is no: Hadoop is meant for batch processing and can take hours to generate a result.
To summarize, real-time problems can't be solved with Hadoop, since it takes a batch-oriented approach.
There are many such use cases where we require data processing in real time, such as:
Processing Customer Behaviour
So how do we solve this particular problem? By using a real-time streaming mechanism (everything is done in memory, following a data-in-motion principle).
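The data-in-motion idea can be sketched with a plain Python generator: events are processed one at a time as they arrive, entirely in memory, rather than being written to disk first and read back in a batch. The feed below is a hypothetical stand-in for a live source.

```python
import time

def stream_source():
    # Stand-in for a live feed: yields events as they arrive.
    for msg in ["msg-1", "msg-2", "msg-3"]:
        yield {"text": msg, "arrived_at": time.time()}

def process(event):
    # Everything happens in memory; nothing touches disk,
    # so each event is handled moments after it arrives.
    return {
        "text": event["text"].upper(),
        "latency": time.time() - event["arrived_at"],
    }

results = [process(e) for e in stream_source()]
```

Contrast this with the Hadoop flow above, where the whole dataset must land in HDFS before any query can start.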
The classical process flow for this real-time processing is shown below:
But if we follow this approach, the following questions need to be answered.
1. Data Streaming: Data needs to be sent as streams in the data pipeline.
2. Fault Tolerance: If any process goes down, what will be the failover mechanism?
3. Scaling: Can we scale our cluster easily to increase processing capacity if the data size increases?
4. Guaranteed Message Processing: Will there be a guarantee that messages are processed?
5. Programming Language Agnostic: Will it be programming-language independent?
There are real-time streaming mechanisms, such as Apache Storm, that help in solving this problem.
Now let's try to answer the above questions to see if Apache Storm has an answer.
Data Streaming
Data is sent as streams in the form of tuples.
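A tuple in this sense is an ordered list of values with named fields. A minimal sketch, using Python's `namedtuple` as a stand-in (the field names below are hypothetical, not part of any Storm schema):

```python
from collections import namedtuple

# Hypothetical field schema for messages flowing through the pipeline.
SentimentTuple = namedtuple("SentimentTuple", ["source", "message"])

t = SentimentTuple(source="twitter", message="High Throughput, loving it")

# Fields can be read by name or by position, like a Storm tuple.
by_name = t.message
by_position = t[1]
```

Each component in the topology consumes tuples like this from upstream and emits new tuples downstream.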
Scaling
Storm, being a distributed platform, allows you to add more nodes to your Storm cluster at runtime and increase the processing throughput of the application.
Fault Tolerance
In Storm, units of work are executed by workers in the cluster. When a worker dies, Storm will restart it, and if the node on which the worker is running dies, Storm will restart that worker on some other node in the cluster.
Guaranteed Message Processing
Storm provides a strong guarantee that each tuple passed to it for processing will be processed at least once. If processing of a tuple fails, Storm will replay the failed tuple.
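The at-least-once idea can be sketched with a toy queue: a tuple leaves the queue only when the consumer acknowledges it, and a failed tuple is put back to be replayed. This is a simplified model of the semantics, not Storm's actual acking implementation.

```python
import collections

pending = collections.deque(["t1", "t2"])
processed = []

def consume(fail_first=frozenset()):
    # Process the queue; tuples named in fail_first fail on their
    # first attempt and are re-queued, so they are replayed.
    attempts = collections.Counter()
    while pending:
        tup = pending.popleft()
        attempts[tup] += 1
        if tup in fail_first and attempts[tup] == 1:
            pending.append(tup)      # fail: replay the tuple later
        else:
            processed.append(tup)    # ack: processing is complete
    return attempts

attempts = consume(fail_first={"t2"})
```

Note that "t2" is attempted twice: at-least-once delivery means a tuple may be processed more than once, so downstream logic should tolerate duplicates.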
Programming Language Agnostic
Storm applications can be written in any programming language. Even though the Storm platform runs on the JVM, applications built on top of it can be written in any language that can read from and write to standard input/output.
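The essence of that stdin/stdout contract can be sketched as follows. This is a much-simplified illustration of the idea, not Storm's actual multi-lang protocol (which exchanges structured JSON messages); the uppercasing "bolt" logic is purely hypothetical.

```python
import sys

def run(infile=sys.stdin, outfile=sys.stdout):
    # A component in any language just reads one tuple per line
    # from stdin and writes its results to stdout.
    for line in infile:
        message = line.strip()
        if message:
            outfile.write(message.upper() + "\n")
```

Because the only interface is text over standard I/O, the same component could be rewritten in Ruby, Go, or any other language without touching the JVM side.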
I hope this article helps clarify the use of real-time streaming mechanisms such as Apache Storm for Big Data processing.