Compete on Data Analytics Using Spring Cloud Data Flow
Becoming data-driven is one of the most desirable achievements for many organizations. Check out this solution using Spring Cloud Data Flow.
Being data-driven is one of the most essential prerequisites for any organization to achieve the desired digital transformation. In this regard, firms have started treating their data as assets and adjusting their strategies to emphasize them.
The bar was set high a decade ago, but the transformation has been slower than expected. There is ample evidence of modern enterprises failing to become data-driven. One example is NewVantage Partners’ 2019 Big Data and AI Executive Survey, which polled several C-level technology and business executives representing large corporations such as American Express, Ford Motor, General Electric, General Motors, and Johnson & Johnson.
Despite tremendous investments in big data and AI initiatives, negative responses made up more than 50% in every case the survey reports.
The Three V’s
Why are enterprises failing to be data-driven? It is because of the Three V’s effect — Volume, Variety, & Velocity:
Volume: It is becoming difficult to handle the volume of ever-growing data produced and consumed within and outside the organization. Every now and then organizations upgrade the storage and processing power to support the expected data volume.
Variety: The data produced by an enormous number of sensors and applications is of mixed type: structured, semi-structured, or unstructured. Many traditional and proprietary data platforms and tools find it difficult to onboard these different types.
Velocity: All these sensors and applications generate data at very high speed. Extracting, transforming, and loading it in real time, and delivering the analytics at the same pace, is a herculean task for many ETL tools.
One Such SMB
One of our customers, a leader in Through-Channel Marketing Automation in the small and medium-sized enterprise category, wanted to become a real-time data-driven company to compete on data analytics. They did not want to fall behind by using traditional ETL tools to analyze stale data.
The enterprise’s product had well-defined integrations built on Pentaho (Kettle) ETL features, which supported their business well for a decade. In the new data era, the business began to bring a different set of expectations:
- Analytics on Real-Time Data: The end users of the business wanted to make decisions based on real-time data rather than near-real-time/old data.
- Not Business-Developer Friendly: Business developers depended heavily on the ETL developers for every customization in the data pipeline, whether complex or simple.
- Lack of Automation: Delivering data pipelines to production took a large number of cycles because the steps were manual end to end.
- Multiple SaaS Channels: Pentaho supported Google and Salesforce as SaaS sources, but the business wanted most of the social platforms as data sources. Pentaho’s latest version, 8.0, supports real-time streaming data, but it does not receive streams from all the social platforms off the shelf; ETL developers have to configure several steps by hand. Ref: stream-data-from-twitter-api-with-oauth-using-kettle
- Support for Multiple Authentication Protocols: Each API demands a different authentication mechanism. Though most of the platforms support OAuth or OAuth 2, basic authentication was needed as well.
- Data Fields Selection & Transformation: The response from a stream usually looks quite different from how it is stored for further processing, and an ETL integration tool offers little beyond storing or transforming it. Two needs had to be addressed: selecting fields from the response easily, and applying minimal transformations such as locale-based date/time and currency formatting.
- Business Developer Friendly: Business developers should be able to create a pipeline for any of the social platforms without much effort. A simple drag and drop of components should be enough to create any data pipeline. This tailor-made data pipeline creator is not supported out of the box by any of the data integration tools in the market.
- Time to Market: If a business developer wants to change a step in the pipeline, either by adding a filter or by removing a verification step, the change should be deployed to the respective environment for testing and delivered to production in no time.
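The field selection and minimal transformation asked for in the list above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name `FieldSelector`, the flat `Map` response shape, and the sample field names are hypothetical and are not taken from the customer's implementation.

```java
import java.text.NumberFormat;
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: picks fields from a response and applies simple formatting.
class FieldSelector {

    // Keep only the chosen fields, in the order they were chosen.
    // Fields that are absent from the response are skipped.
    static Map<String, Object> select(Map<String, Object> response, List<String> fields) {
        Map<String, Object> selected = new LinkedHashMap<>();
        for (String field : fields) {
            if (response.containsKey(field)) {
                selected.put(field, response.get(field));
            }
        }
        return selected;
    }

    // Reformat an ISO-8601 instant (e.g. 2019-06-01T10:15:30Z)
    // with an explicit pattern and time zone.
    static String formatTimestamp(String isoInstant, String pattern, ZoneId zone) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(pattern).withZone(zone);
        return formatter.format(Instant.parse(isoInstant));
    }

    // Render an amount using the currency conventions of the given locale.
    static String formatCurrency(double amount, Locale locale) {
        return NumberFormat.getCurrencyInstance(locale).format(amount);
    }

    public static void main(String[] args) {
        Map<String, Object> response = new LinkedHashMap<>();
        response.put("id", "42");
        response.put("text", "hello");
        response.put("created_at", "2019-06-01T10:15:30Z");

        System.out.println(select(response, List.of("created_at", "id")));
        System.out.println(formatTimestamp("2019-06-01T10:15:30Z", "dd/MM/yyyy HH:mm", ZoneId.of("UTC")));
        System.out.println(formatCurrency(1234.5, Locale.US));
    }
}
```

An explicit date pattern is used rather than a locale-derived one because locale-derived patterns can differ between JDK versions; a real formatter would probably let the business developer choose either.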
We conducted day-long workshops and design thinking sessions to understand the pain points of all the stakeholders — Product Managers, Business Developers and Data Engineers.
The data points, both functional and non-functional requirements, were carefully collected, and we came up with a recommended solution: creating data pipelines dynamically by leveraging Spring Cloud Data Flow. It met the expectations!
The Functional View
- Data Sources: The incoming real-time streams from various social platforms.
- API Connector: Connectors supporting various API styles (SOAP, REST & Graph) and authentication protocols (OAuth, OpenID, Basic).
- Metadata Picker: Lists all the attributes of the respective stream, from which the ones to process further can be chosen.
- Data Formatter: Customizes the attributes by applying formatting and transformation logic.
- DB Tables: The sink tables (GCP - Bigtable) where the data analytics queries get fired.
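As a rough illustration of what the API Connector has to do for two of the authentication schemes listed above, the sketch below builds the `Authorization` header value for HTTP Basic and for an OAuth 2 bearer token. The class name `AuthHeaders` is hypothetical. OAuth 1.0a request signing and OpenID flows involve considerably more steps and are left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper that produces Authorization header values.
class AuthHeaders {

    // HTTP Basic: "Basic " followed by base64("user:password").
    static String basic(String user, String password) {
        String credentials = user + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    // OAuth 2 bearer token: the access token is sent as-is after "Bearer ".
    static String bearer(String accessToken) {
        return "Bearer " + accessToken;
    }

    public static void main(String[] args) {
        System.out.println(basic("Aladdin", "open sesame"));
        System.out.println(bearer("example-access-token"));
    }
}
```

A connector would attach the resulting string as the `Authorization` header of each outgoing request, choosing the scheme from the data source configuration.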
The Systems View
Authentication and Authorization
- The allowed users are from the GSuite (the enterprise’s Identity Provider)
- With the help of Cloud IAP, the access to the application is controlled.
GCR with App or Task Images
- Container repository of pre-built images of applications/tasks, each built as a Source, Processor, or Sink. These images are templates used for creating the components of any stream.
- These images accept configurable properties such as endpoint URI, accessToken, Consumer Key, and Consumer Secret, and pass them to the underlying applications/tasks so that they can consume, process, or receive the data, depending on their type.
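One way to picture how such properties reach the applications: Spring Cloud Data Flow describes a stream as a pipe-separated list of apps, each followed by `--key=value` options. The sketch below assembles such a definition string from per-app property maps. The app names (`social-source`, `field-formatter`, `bigtable-sink`) and property keys are hypothetical placeholders, not the customer's actual images, and quote escaping is deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical builder for a stream definition in the pipe-style DSL.
class StreamDefinitionBuilder {

    // Render one app as: name --key='value' --key2='value2'
    static String app(String name, Map<String, String> properties) {
        StringBuilder sb = new StringBuilder(name);
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            String value = entry.getValue();
            if (value.contains("'")) {
                // Escaping embedded quotes is out of scope for this sketch.
                throw new IllegalArgumentException("Single quotes are not supported: " + value);
            }
            sb.append(" --").append(entry.getKey()).append("='").append(value).append("'");
        }
        return sb.toString();
    }

    // Join the apps in order: source | processor | sink
    static String stream(String... apps) {
        return String.join(" | ", apps);
    }

    public static void main(String[] args) {
        Map<String, String> sourceProps = new LinkedHashMap<>();
        sourceProps.put("endpoint-uri", "https://api.example.com/feed");
        Map<String, String> formatterProps = new LinkedHashMap<>();
        formatterProps.put("fields", "id,created_at");
        Map<String, String> sinkProps = new LinkedHashMap<>();
        sinkProps.put("table", "social_events");

        System.out.println(stream(
                app("social-source", sourceProps),
                app("field-formatter", formatterProps),
                app("bigtable-sink", sinkProps)));
    }
}
```

Secrets such as access tokens or consumer secrets would normally be supplied through a secured mechanism rather than written into the definition text; the sketch sidesteps that by only passing non-sensitive values.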
Data Pipeline Creator UI
- To configure the data sources with respective authentication configurations
- To render the metadata for selection/ filtering
- To specify the data formatting per data source
Data Pipeline Creator Service
- A containerized microservice exposing the REST APIs used by the Data Pipeline Creator UI
- All the configurations specific to a data pipeline, from the source to the destination, are preserved in the metadata database to which this service has access.
- This service abstracts out the Spring Cloud Data Flow services for the given needs as shown below
Spring Cloud Data Flow Server
- The server responsible for creating a data pipeline per data source configuration and deploying it through the Skipper Server as a data stream
- It is configured with Kafka streams to support real-time data processing.
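To make the server's role more concrete, here is a minimal sketch of how a service could ask a Data Flow server to register and then deploy a stream over its REST API. The endpoint paths (`/streams/definitions` and `/streams/deployments/{name}`) and the form parameters follow the Spring Cloud Data Flow REST API guide as we understand it and should be verified against the server version in use. The class name `DataFlowRequests`, the base URL, and the stream name are assumptions for illustration. The requests are only built here, not sent.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that builds (but does not send) requests to a Data Flow server.
class DataFlowRequests {

    // Form body for registering a stream definition without deploying it yet.
    static String formBody(String name, String definition) {
        return "name=" + encode(name)
                + "&definition=" + encode(definition)
                + "&deploy=false";
    }

    // POST {baseUrl}/streams/definitions with a form-encoded body.
    static HttpRequest createStream(String baseUrl, String name, String definition) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/streams/definitions"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(name, definition)))
                .build();
    }

    // POST {baseUrl}/streams/deployments/{name} with an empty set of deployment properties.
    // Stream names are assumed to be simple identifiers (letters, digits, hyphens).
    static HttpRequest deployStream(String baseUrl, String name) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/streams/deployments/" + name))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{}"))
                .build();
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String baseUrl = "http://localhost:9393"; // assumed server address
        HttpRequest create = createStream(baseUrl, "social-feed", "time | log");
        HttpRequest deploy = deployStream(baseUrl, "social-feed");
        System.out.println(create.method() + " " + create.uri());
        System.out.println(deploy.method() + " " + deploy.uri());
    }
}
```

Sending either request is a matter of passing it to `java.net.http.HttpClient.send`. Keeping creation and deployment as two separate calls mirrors the idea that a pipeline is first defined from its configuration and only then deployed.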
The recommended solution has been implemented. The enterprise now has end-to-end automation for continuously deploying real-time data services and pipelines into production, driven by business developers with little help from ETL developers. Mission accomplished!
There are many reasons why an enterprise fails to become data-driven. Irrespective of the number of excuses and failures, the amount of data continues to rise exponentially. According to the independent research firm IDC, connected IoT devices are expected to generate 79.4 zettabytes of data in 2025.