
Data Science for the Modern Data Architecture

While everyone wants to predict the future, truly leveraging data science for predictive analytics remains the domain of a select few.

Our customers increasingly leverage data science and machine learning to solve complex predictive analytics problems. A few examples of these problems are churn prediction, predictive maintenance, image classification, and entity matching.

While everyone wants to predict the future, truly leveraging data science for predictive analytics remains the domain of a select few. To expand the reach of data science, the modern data architecture (MDA) needs to address the following four requirements:

  1. Enable apps to consume predictions and become smarter
  2. Bring predictive analytics to the IoT edge
  3. Become easier, more accurate, and faster to deploy and manage
  4. Fully support the data science life cycle

The diagram below shows where data science fits in the MDA.

Data-Smart Applications

End users consume data, analytics, and the results of data science via data-centric applications (or apps). The vast majority of these applications today don't leverage data science, machine learning, or predictive analytics. A new generation of enterprise and consumer-facing apps is being built to take advantage of data science and predictive analytics and to provide context-driven insights that nudge end users toward their next set of actions. These apps are called data-smart applications.

Writing data-smart apps is hard. The app developer needs to write not only the traditional app logic but also the logic to invoke predictive analytics. These data-smart apps also face a set of common problems such as entity disambiguation, data quality analysis, and anomaly detection. Since today's data platforms don't provide this functionality, app developers have to solve these problems themselves.

We have seen this issue before: frameworks such as Java EE and the Spring Framework evolved to address common application concerns. Now we need the next generation of application frameworks to make writing data-smart applications easier. We are starting to see this evolution. Salesforce Einstein is helping applications in Salesforce Cloud become smarter, but similar functionality is not yet available in open source.

Smarter Edge

The Internet of Things is expanding rapidly, and the market-size estimates are huge: IDC estimates that global IT spending on IoT-related items will reach $1.29 trillion by 2020. Edge intelligence has the potential to deliver insights and predictions where they are needed most, at a faster speed, and without requiring a persistent network connection. Predictions need to be delivered at the edge, but the predictive models need not be created there. Today, model training at the edge is painfully slow, and we can create better models faster in the data center. Those models then need to be delivered to the edge, where they can provide predictions even while disconnected from the data center. Models often degrade over time as the data drifts, so the edge also needs to report back on model performance and ask for a new model when performance falls below a certain threshold.
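The feedback loop described above, where the edge tracks model performance and asks for a replacement once it drops too far, can be sketched in a few lines of Python. This is only an illustrative sketch: the `EdgeModelMonitor` class, its rolling-accuracy metric, and the default threshold and window sizes are assumptions made for the example, not part of any Hortonworks or Apache API.

```python
from collections import deque


class EdgeModelMonitor:
    """Hypothetical helper: tracks rolling accuracy of a model deployed at the edge."""

    def __init__(self, threshold=0.8, window=100, min_samples=20):
        self.threshold = threshold      # assumed minimum acceptable accuracy
        self.min_samples = min_samples  # do not judge the model on too few outcomes
        self.outcomes = deque(maxlen=window)  # 1 = correct prediction, 0 = wrong

    def record(self, predicted, actual):
        """Store whether a prediction matched the label observed later."""
        self.outcomes.append(1 if predicted == actual else 0)

    def accuracy(self):
        """Rolling accuracy over the window, or None if nothing was recorded yet."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_new_model(self):
        """True once enough outcomes exist and accuracy is below the threshold."""
        if len(self.outcomes) < self.min_samples:
            return False
        return self.accuracy() < self.threshold
```

In a real deployment, the device would call `record` whenever ground truth becomes available and, when `needs_new_model()` returns true, send its metrics to the data center and request a retrained model the next time it has connectivity. How that request is transported is left open here.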

Faster, More Accurate, and Easier Management

Businesses are collecting ever-bigger datasets and running more compute-intensive deep learning and machine learning algorithms across bigger compute clusters. This requires a mature and sophisticated big data and big compute platform. The platform needs to leverage hardware advances such as GPUs, FPGAs, and RDMA and make them transparently available to compute frameworks, big data analytics, and data-smart apps, with the right level of resource sharing and isolation semantics. YARN already supports GPUs through node labels, and this functionality is going to evolve to provide finer-grained control.

Many data science workloads rely on Python libraries and R packages, and managing these dependencies in a distributed cluster is a non-trivial problem. We have made advances with package management in SparkR and virtual environment support in PySpark, but much more is needed. The upcoming Hadoop 3 will provide Docker support, which will allow a developer-packaged environment to run as a YARN job and make these dependencies easier to manage.

Tuning, debugging, and tracing a distributed system remains hard. As data science on big data goes mainstream, we need to make distributed systems easier to manage, debug, trace, and tune.

Complete Data Science Platform

Data science is a team sport. Data scientists collaborate, explore corporate datasets, wrestle with data, and deploy machine learning while keeping up with the onslaught of new machine learning techniques and libraries. A complete data science platform needs to support the full data science life cycle. It needs to give data scientists their choice of notebook, from Jupyter and Zeppelin to RStudio, and a wide choice of data science languages and frameworks. The platform should make collaboration easier and help data scientists align with modern software engineering practices such as code review, continuous integration, and continuous delivery.

Model deployment and management is a critical part of completing the data science loop. The framework needs to support model deployment, versioning, A/B testing, and champion/challenger evaluation, and it needs to provide standard ways to promote and use models.
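To make the champion/challenger idea concrete, here is a minimal Python sketch under stated assumptions: a small share of traffic goes to a challenger model, outcomes are tracked per model, and the challenger is promoted once it has enough samples and a higher accuracy. The `ChampionChallenger` class, the plain accuracy comparison, and the default traffic share are all hypothetical choices for illustration. A production platform would typically add statistical significance tests and model versioning.

```python
import random


class ChampionChallenger:
    """Hypothetical router that splits traffic between two models (plain callables)."""

    def __init__(self, champion, challenger, challenger_share=0.1, seed=None):
        self.models = {"champion": champion, "challenger": challenger}
        self.challenger_share = challenger_share  # fraction of requests sent to challenger
        self.rng = random.Random(seed)
        self._reset_stats()

    def _reset_stats(self):
        self.stats = {name: {"correct": 0, "total": 0} for name in self.models}

    def predict(self, features):
        """Pick a model at random and return (model_name, prediction)."""
        use_challenger = self.rng.random() < self.challenger_share
        name = "challenger" if use_challenger else "champion"
        return name, self.models[name](features)

    def record_outcome(self, name, predicted, actual):
        """Record whether the named model's prediction matched the true label."""
        entry = self.stats[name]
        entry["total"] += 1
        entry["correct"] += int(predicted == actual)

    def accuracy(self, name):
        entry = self.stats[name]
        return entry["correct"] / entry["total"] if entry["total"] else None

    def promote_if_better(self, min_samples=30):
        """Swap the two models if the challenger has enough data and higher accuracy."""
        if any(entry["total"] < min_samples for entry in self.stats.values()):
            return False
        if self.accuracy("challenger") > self.accuracy("champion"):
            self.models["champion"], self.models["challenger"] = (
                self.models["challenger"],
                self.models["champion"],
            )
            self._reset_stats()  # start a fresh comparison after promotion
            return True
        return False
```

A caller would keep the model name returned by `predict`, pass it to `record_outcome` once the true label is known, and call `promote_if_better` periodically.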

Deep learning (DL) is top of mind for many, and selecting the right DL framework for a given problem remains an art form. The platform needs to provide guidance on, and a choice of, the right DL frameworks, along with better integration with hardware resources to improve training time and performance.

Topics:
big data, data science, architecture, predictive analytics

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
