Compete on Data Analytics Using Spring Cloud Data Flow
Becoming data-driven is one of the most desirable achievements for many organizations. Check out this solution using Spring Cloud Data Flow.
The Three V’s
Volume: Handling the ever-growing volume of data produced and consumed within and outside the organization is becoming difficult. Every now and then, organizations upgrade their storage and processing power to support the expected data volume.
Variety: The data produced by an enormous number of sensors and applications is of mixed type: structured, semi-structured, or unstructured. Many traditional and proprietary data platforms and tools find it difficult to onboard these different types.
Velocity: All these sensors and applications generate data at tremendous speed. Extracting, transforming, and loading it in real time, and delivering analytics at the same speed, is a herculean task for many ETL tools.
One Such SMB
- Analytics on Real-Time Data: Business end users wanted to make decisions based on real-time data rather than near-real-time or stale data.
- Not Business-Developer Friendly: Business developers depended heavily on ETL developers for every customization in the data pipeline, however simple or complex.
- Lack of Automation: Delivering data pipelines to production took a huge number of cycles because the steps were manual from end to end.
- Multiple SaaS Channels: Pentaho supported Google and Salesforce as SaaS sources, but the business wanted most of the social platforms as data sources. Pentaho's latest version, 8.0, supports real-time streaming data, but it is not built to receive streams from all the social platforms off the shelf; instead, it requires ETL developers to configure several steps by hand. Ref: stream-data-from-twitter-api-with-oauth-using-kettle
- Support for Multiple Authentication Protocols: Each API demands a different set of authentication mechanisms. Though most platforms support OAuth or OAuth 2, basic authentication was needed as well.
- Data Field Selection & Transformation: A streaming response looks quite different from how the data is stored for further processing, and little more than storing or transforming it could be expected from the ETL integration tool. Selecting the fields to keep from a response needed to be easy, and minimal transformations such as date/time or currency formatting based on locale were also expected.
- Business-Developer Friendly: Business developers should be able to create a pipeline for any of the social platforms with little effort; a simple drag and drop of components should be enough to build any data pipeline. Such a tailor-made data pipeline creator is not supported out of the box by any data integration tool on the market.
- Time to Market: If a business developer changes a step in the pipeline, whether by adding a filter or removing a verification step, the change should be deployed to the respective environment for testing and delivered to production in no time.
The Functional View
- Data Sources: The incoming real-time streams from various social platforms.
- API Connector: Connectors supporting various API styles (SOAP, REST, and Graph) and authentication protocols (OAuth, OpenID, Basic).
- Metadata Picker: Lists all the attributes of the respective stream that can be chosen for further processing.
- Data Formatter: Attributes are customized here by formatting them and applying transformation logic; a minimal processor sketch follows this list.
- DB Tables: The sink tables (GCP Bigtable) against which the data analytics queries are fired.
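To make the Data Formatter role concrete, here is a rough sketch of a custom processor written with Spring Cloud Stream's functional programming model. The selected field names (id, text, created_at) and the package name are illustrative assumptions, not the actual implementation.

```java
package com.example.formatter;

import java.util.function.Function;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;

@SpringBootApplication
public class DataFormatterApplication {

    private final ObjectMapper mapper = new ObjectMapper();

    public static void main(String[] args) {
        SpringApplication.run(DataFormatterApplication.class, args);
    }

    // Keeps only the selected attributes of the incoming social-media payload;
    // the Spring Cloud Stream binder wires this function to the middleware (Kafka here).
    @Bean
    public Function<String, String> formatRecord() {
        return payload -> {
            try {
                JsonNode in = mapper.readTree(payload);
                ObjectNode out = mapper.createObjectNode();
                out.put("id", in.path("id").asText());                 // selected field
                out.put("text", in.path("text").asText());             // selected field
                out.put("createdAt", in.path("created_at").asText());  // apply locale/date formatting here
                return mapper.writeValueAsString(out);
            } catch (Exception e) {
                throw new IllegalStateException("Malformed payload", e);
            }
        };
    }
}
```

Registered as a custom processor application in Spring Cloud Data Flow, such a component can be dropped between the API connector and the sink in any stream definition.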
The Systems View
- The allowed users come from G Suite (the enterprise's identity provider).
- Access to the application is controlled with the help of Cloud IAP.
- A container repository holds pre-built images of the applications/tasks developed as sources, processors, and sinks. These images serve as templates for creating the components of any stream.
- These images accept configurable properties such as the endpoint URI, access token, consumer key, and consumer secret, and pass them to the underlying applications/tasks so that they can consume, process, or receive data according to their type (see the configuration sketch after this list).
- A Data Pipeline Creator UI through which business developers can:
  - Configure the data sources with their respective authentication settings
  - Render the metadata for selection/filtering
  - Specify the data formatting per data source
- A containerized microservice exposes the REST APIs used by the Data Pipeline Creator UI.
- All the configuration specific to a data pipeline, from source to destination, is preserved in a metadata database to which this service has access.
- This service abstracts the Spring Cloud Data Flow services for these needs, as illustrated in the sketch after this list.
- The Spring Cloud Data Flow server is responsible for creating a data pipeline for the given configuration per data source and deploying it to the Skipper server as a data stream.
- It is configured with Kafka streams to support real-time data processing.
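As a rough illustration of how a pre-built source image might expose those settings, the sketch below binds them as externalized Spring Boot configuration properties. The connector prefix, the field names, and the use of a Java record (which assumes Spring Boot 3, where constructor binding for records is implicit) are assumptions for this example, not the project's actual code.

```java
package com.example.connector;

import org.springframework.boot.context.properties.ConfigurationProperties;

// Bound from properties supplied at deployment time, for example
// --connector.endpoint-uri=... --connector.access-token=... (names are illustrative).
// Must be enabled with @EnableConfigurationProperties(ConnectorProperties.class)
// or @ConfigurationPropertiesScan on the application class.
@ConfigurationProperties(prefix = "connector")
public record ConnectorProperties(
        String endpointUri,     // REST/Graph endpoint of the social platform
        String accessToken,     // OAuth access token
        String consumerKey,     // OAuth consumer key
        String consumerSecret   // OAuth consumer secret
) { }
```

When a stream is deployed, Spring Cloud Data Flow passes such values to the individual applications as application properties, so the same image can be reused for different data sources.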
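The call the pipeline service would make against the Data Flow server to turn a saved configuration into a running stream could look roughly like this, using the Spring Cloud Data Flow Java DSL. The stream name, application names, and server address are illustrative assumptions.

```java
package com.example.pipelines;

import java.net.URI;

import org.springframework.cloud.dataflow.rest.client.DataFlowOperations;
import org.springframework.cloud.dataflow.rest.client.DataFlowTemplate;
import org.springframework.cloud.dataflow.rest.client.dsl.Stream;

public class PipelineCreator {

    public static void main(String[] args) {
        // REST client for the Data Flow server (address is an assumption).
        DataFlowOperations dataFlow =
                new DataFlowTemplate(URI.create("http://dataflow-server:9393"));

        // Compose the registered source, processor, and sink applications into a
        // stream definition, then create and deploy it (via Skipper) as a pipeline.
        Stream pipeline = Stream.builder(dataFlow)
                .name("twitter-to-bigtable")
                .definition("social-source --connector.endpoint-uri=https://api.example.com/stream "
                        + "| formatter | bigtable-sink")
                .create()
                .deploy();

        System.out.println("Stream status: " + pipeline.getStatus());
    }
}
```

The Kafka connection itself is typically supplied through the binder configuration of the deployed applications and the Data Flow/Skipper servers rather than in the stream definition.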