The Search for a Better BIG Data Analytics Pipeline
"Big Data Analytics"
has recently been one of the hottest buzzwords. It is a combination of
"Big Data" and "Deep Analysis." The former is a phenomenon of Web2.0
where a lot of transaction and user activity data have been collected,
which can be mined for extracting useful information. The latter is
about using advanced mathematical/statistical techniques to build models
from the data. In reality, I've found these two areas are quite
different and disjointed, and people working in each area have pretty
different backgrounds.
Big Data Camp
People working in this camp typically come from a Hadoop/PIG/Hive background. They have usually implemented some domain-specific logic to process a large amount of raw data, and often that logic is relatively straightforward, based on domain-specific business rules. From my personal experience, most of the people working in big data come from a computer science and distributed parallel processing systems background, not from a statistical or mathematical discipline.
Deep Analysis Camp
On the other hand, people working in this camp usually come from a statistical and mathematical background, where the first thing taught is how to use sampling to understand the characteristics of a large population. Notice the magic of "sampling": the accuracy of estimating the large population depends on the size of the sample, not the actual size of the population. In their world, there is never a need to process all the data in the population in the first place, so Big Data Analytics is unnecessary under this philosophy.
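To make the sampling point concrete, here is a minimal Python sketch (my illustration, not from the original article): it draws the same-sized sample from a small and from a very large population and shows that both estimates land about equally close to the true mean, because the standard error is driven by the sample size, not by the population size.

```python
# Minimal sketch: sample-based estimation accuracy depends on the sample
# size n, not on the population size N.
import random
import statistics

random.seed(42)

def estimate_mean(population, sample_size):
    """Estimate the population mean from a simple random sample."""
    sample = random.sample(population, sample_size)
    return statistics.mean(sample)

# Two populations of very different sizes but the same distribution.
small_population = [random.gauss(100, 15) for _ in range(10_000)]
large_population = [random.gauss(100, 15) for _ in range(1_000_000)]

for name, population in [("10K population", small_population),
                         ("1M population", large_population)]:
    estimate = estimate_mean(population, sample_size=1_000)
    print(f"{name}: true mean={statistics.mean(population):.2f}, "
          f"sample estimate={estimate:.2f}")
# Both estimates land within a similar distance of the true mean, because
# the standard error scales roughly as sigma / sqrt(n).
```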
Typical Data Processing Pipeline
Learning from my previous projects, I observe that most data processing pipelines fall into the following pattern.

In this model, data is created in the OLTP (Online Transaction Processing) system and flows into the Big Data Analytics system, which produces various outputs, including data marts/cubes for OLAP (Online Analytic Processing), reports for the consumption of business executives, and predictive models that feed decision support back to the OLTP system.
Big Data + Deep Analysis
The Big Data Analytics box usually runs in a batch fashion (e.g., once a day), and we usually see big data processing and deep data analysis happen at different stages of this batch process.

The big data processing part (colored in orange) is usually done with Hadoop/PIG/Hive technology implementing classical ETL logic. By leveraging the Map/Reduce model that Hadoop provides, we can linearly scale up the processing by adding more machines to the Hadoop cluster. Drawing on cloud computing resources (e.g., Amazon EMR) is a very common approach to performing this kind of task.
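The article keeps the ETL logic abstract, so the following is only a hedged sketch of what that Map/Reduce shape can look like, written as a Hadoop Streaming-style mapper and reducer in Python. The tab-separated input layout (user_id, event_type, amount) and the per-user sum are illustrative assumptions, not details from the original pipeline.

```python
# Hedged sketch of a classical ETL aggregation in Map/Reduce form,
# written as Hadoop Streaming scripts in Python. The input layout
# "user_id<TAB>event_type<TAB>amount" is an assumption for illustration.
import sys

def mapper():
    """Emit (user_id, amount) for every raw transaction line on stdin."""
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue                      # skip malformed records
        user_id, _event_type, amount = fields[0], fields[1], fields[2]
        print(f"{user_id}\t{amount}")

def reducer():
    """Sum amounts per user_id; Hadoop delivers keys in sorted order."""
    current_user, total = None, 0.0
    for line in sys.stdin:
        user_id, amount = line.rstrip("\n").split("\t")
        if user_id != current_user:
            if current_user is not None:
                print(f"{current_user}\t{total}")
            current_user, total = user_id, 0.0
        total += float(amount)
    if current_user is not None:
        print(f"{current_user}\t{total}")

if __name__ == "__main__":
    # Run as: script.py map    (mapper phase)
    #         script.py reduce (reducer phase)
    mapper() if sys.argv[1] == "map" else reducer()
```

A script like this would typically be wired into a Hadoop Streaming job (hadoop jar hadoop-streaming.jar -input ... -output ... -mapper ... -reducer ...), so scaling the processing out is just a matter of adding nodes to the cluster.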
The deep analysis part (colored in green) is usually done in R, SPSS, or SAS using a much smaller amount of carefully sampled data that fits into a single machine's capacity (usually less than a couple hundred thousand records). This part typically involves data visualization, data preparation, model learning (e.g., linear regression and regularization, k-nearest neighbors, support vector machines, Bayesian networks, neural networks, decision trees, and ensemble methods), and model evaluation. For those who are interested, please read my earlier posts on these topics.
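As a rough illustration of that workflow (the article names R, SPSS, and SAS; the equivalent in Python/scikit-learn is shown here purely as a sketch, with a synthetic dataset standing in for the carefully sampled one):

```python
# Hedged sketch of the "deep analysis" step: prepare a small sampled
# dataset, fit a regularized linear model, and evaluate on held-out data.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in for a carefully sampled dataset that fits in memory
# (well under the "couple hundred thousand records" mentioned above).
X, y = make_regression(n_samples=50_000, n_features=20, noise=10.0,
                       random_state=0)

# Data preparation: hold out a test set for honest model evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Model learning: linear regression with L2 regularization (ridge).
model = Ridge(alpha=1.0).fit(X_train, y_train)

# Model evaluation on the held-out sample.
print("R^2 on held-out data:", r2_score(y_test, model.predict(X_test)))
```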
Implementation Architecture
There are many possible ways to implement the data pipeline described above. Here is one common implementation that works well in many projects.

In this architecture, Flume is used to move data from the OLTP system into the Hadoop Distributed File System (HDFS). A workflow scheduler (typically a cron-tab entry calling a script) runs periodically to process the data using Map/Reduce. The data has two portions: a) raw transaction data from HDFS, and b) the previous model hosted in some NoSQL server. Finally, the "reducer" updates the previous model, which is then made available to the OLTP system.
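To sketch the model-update step in that loop (this is my illustration, not the article's code): the fresh aggregates are assumed to be the Map/Reduce output, a local JSON file stands in for the unnamed NoSQL server, and the exponential-decay blend is an assumed update policy.

```python
# Hedged sketch of the periodic model-update step. A JSON file stands in
# for the NoSQL server holding the previous model; the decay-weighted blend
# is an assumed policy, not taken from the article. Scheduling would be
# external, e.g. the cron-tab entry mentioned above.
import json
from pathlib import Path

MODEL_STORE = Path("model_store.json")   # stand-in for the NoSQL server
DECAY = 0.8                              # weight given to the previous model

def load_previous_model():
    """Fetch the previous model, or start empty on the first run."""
    if MODEL_STORE.exists():
        return json.loads(MODEL_STORE.read_text())
    return {}

def update_model(previous, fresh_aggregates):
    """Blend new per-key aggregates into the previous model."""
    updated = dict(previous)
    for key, value in fresh_aggregates.items():
        old = updated.get(key, 0.0)
        updated[key] = DECAY * old + (1.0 - DECAY) * value
    return updated

def publish_model(model):
    """Write the updated model back so the OLTP side can read it."""
    MODEL_STORE.write_text(json.dumps(model))

if __name__ == "__main__":
    # fresh_aggregates would normally be the reducer output pulled from
    # HDFS; hard-coded here only to keep the sketch self-contained.
    fresh_aggregates = {"user_42": 310.0, "user_7": 95.5}
    publish_model(update_model(load_previous_model(), fresh_aggregates))
```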
For most of the big data analytics projects I have been involved in, the above architecture works pretty well. I believe projects requiring a real-time feedback loop may see some limitations in this architecture. Real-time big data analytics is an interesting topic that I am planning to discuss in future posts.
Published at DZone with permission of Ricky Ho, DZone MVB. See the original article here.