Data Science for the Modern Data Architecture
Our customers increasingly leverage data science and machine learning to solve complex predictive analytics problems. A few examples of these problems are churn prediction, predictive maintenance, image classification, and entity matching.
While everyone wants to predict the future, truly leveraging data science for predictive analytics remains the domain of a select few. To expand the reach of data science, the modern data architecture (MDA) needs to address the following four requirements:
- Enable apps to consume predictions and become smarter
- Bring predictive analytics to the IoT edge
- Become easier, more accurate, and faster to deploy and manage
- Fully support the data science life cycle
The diagram below shows where data science fits in the MDA.
Data-Smart Applications
End users consume data, analytics, and the results of data science via data-centric applications (or apps). The vast majority of these applications today don't leverage data science, machine learning, or predictive analytics. A new generation of enterprise and consumer-facing apps is being built to take advantage of data science and predictive analytics, providing context-driven insights that nudge end users toward their next set of actions. These apps are called data-smart applications.
Writing data-smart apps is hard. The app developer needs to write not only the traditional app logic but also the logic to invoke predictive analytics. These apps also face a set of common problems such as entity disambiguation, data quality analysis, and anomaly detection. Since today's data platforms don't provide this functionality, app developers are left to solve these problems themselves.
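To make the pattern concrete, here is a minimal sketch of what "app logic plus predictive analytics" looks like in practice. The `ChurnModel` stub, the feature names, and the 0.5 threshold are all illustrative stand-ins, not a real serving API:

```python
# Sketch of a data-smart request flow: traditional app logic plus a call
# into a predictive model. ChurnModel is a hypothetical stub standing in
# for a deployed model or model-serving endpoint.

class ChurnModel:
    """Stub for a deployed churn-prediction model (illustrative only)."""
    def predict(self, features: dict) -> float:
        # Toy heuristic standing in for a trained model's score.
        return 0.9 if features.get("days_since_login", 0) > 30 else 0.1

def handle_account_page(user: dict, model: ChurnModel) -> dict:
    # Traditional app logic: build the page payload.
    page = {"user_id": user["id"], "plan": user["plan"]}
    # Data-smart addition: attach a context-driven nudge when churn risk is high.
    if model.predict(user) > 0.5:
        page["nudge"] = "offer_retention_discount"
    return page

page = handle_account_page(
    {"id": 42, "plan": "pro", "days_since_login": 45}, ChurnModel()
)
print(page)
# → {'user_id': 42, 'plan': 'pro', 'nudge': 'offer_retention_discount'}
```

In a real data-smart app, the prediction call would go over the network to a model-serving layer, and the cross-cutting concerns above (entity disambiguation, anomaly detection) would ideally come from the framework rather than hand-rolled code.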
We have seen this issue before: frameworks such as Java EE and the Spring Framework evolved to address common application concerns. Now we need a next-generation application framework to make writing data-smart applications easier. We are starting to see this evolution. Salesforce Einstein is helping applications in the Salesforce Cloud become smarter, but similar functionality is not yet available in open source.
Predictive Analytics at the IoT Edge
The Internet of Things is rapidly expanding, and the market size estimates are huge: IDC estimates global IT spending on IoT-related items will reach $1.29 trillion by 2020. Edge intelligence has the potential to deliver insights and predictions where they are needed most, faster, and without requiring a persistent network connection. Predictions need to be delivered at the edge, but predictive models need not be created there. Model training at the edge is painfully slow, and better models can be created faster in the data center. What is needed is to deliver these models to the edge, where they can provide predictions even while disconnected from the data center. Models also degrade over time as data drifts; to address this, the edge needs to report back on model performance and request a new model when performance falls below a certain threshold.
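The edge-side feedback loop described above can be sketched as follows. The class name, the accuracy metric, and the 0.8 threshold are illustrative assumptions, not part of any specific product:

```python
# Sketch of an edge agent's model-monitoring loop: score locally, track
# accuracy as delayed ground truth arrives, and flag when a fresh model
# should be pulled from the data center. Names and thresholds are illustrative.

class EdgeModelAgent:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.correct = 0
        self.total = 0

    def record_outcome(self, predicted, actual):
        # Ground truth often arrives later at the edge; update rolling
        # accuracy whenever it does.
        self.total += 1
        if predicted == actual:
            self.correct += 1

    def accuracy(self) -> float:
        return self.correct / self.total if self.total else 1.0

    def needs_new_model(self) -> bool:
        # Report back / request a refreshed model once performance drifts
        # below the threshold.
        return self.accuracy() < self.threshold

agent = EdgeModelAgent(threshold=0.8)
for predicted, actual in [(1, 1), (0, 1), (1, 0), (1, 1)]:
    agent.record_outcome(predicted, actual)
print(agent.accuracy(), agent.needs_new_model())
# → 0.5 True
```

The key design point is that the expensive step (training) stays in the data center, while the edge only scores, monitors, and asks for refreshed models when drift is detected.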
Faster, More Accurate, and Easier Management
Businesses are collecting ever-bigger datasets and running more compute-intensive deep learning and machine learning algorithms across bigger compute clusters. This requires a mature and sophisticated big data and big compute platform. The platform needs to leverage hardware advances such as GPUs, FPGAs, and RDMA and make them transparently available to big data analytics and data-smart apps, with the right level of resource sharing and isolation semantics. YARN already supports GPUs with node labels, but this functionality will evolve to provide finer-grained control.
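As an illustration of the node-label approach, a Spark-on-YARN job can be steered onto GPU hosts. This sketch assumes a cluster admin has already created a `gpu` node label and attached it to the GPU machines; the script name is hypothetical:

```shell
# Illustrative only: steering Spark executors onto GPU-labeled YARN nodes.
# Assumes a "gpu" node label already exists on the cluster.
spark-submit \
  --master yarn \
  --conf spark.yarn.executor.nodeLabelExpression=gpu \
  my_training_job.py
```

This gives coarse placement only; finer-grained GPU scheduling (e.g. sharing a single GPU between containers) is the kind of control the text says is still evolving.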
Many data science workloads leverage Python libraries and R packages. Managing these dependencies in a distributed cluster is a non-trivial problem. We have made advances with package management in SparkR and virtual environment support in PySpark, but much more is needed. The upcoming Hadoop 3 will provide Docker support, allowing a developer-packaged environment to run as a YARN job and making it easier to manage.
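One common way to handle the PySpark dependency problem today is to ship a packaged Python environment alongside the job. The archive and file names below are illustrative:

```shell
# Sketch: ship a developer-packaged Python environment (e.g. built with
# conda-pack or venv-pack) with a PySpark job on YARN. Names are illustrative.
spark-submit \
  --master yarn \
  --archives my_env.tar.gz#environment \
  --conf spark.pyspark.python=./environment/bin/python \
  my_job.py
```

YARN unpacks the archive into each container's working directory under the `environment` alias, so every executor runs the same Python interpreter and libraries; Docker support in Hadoop 3 aims to make this kind of packaging more uniform.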
Tuning, debugging, and tracing a distributed system remains hard. As data science on big data goes mainstream, we need to make distributed systems easier to manage, debug, trace, and tune.
Complete Data Science Platform
Data science is a team sport. Data scientists collaborate, explore corporate datasets, wrestle with data, and deploy machine learning, all while keeping up with the onslaught of new machine learning techniques and libraries. A complete data science platform needs to support the full data science life cycle. It needs to offer data scientists their choice of notebook, from Jupyter and Zeppelin to RStudio, along with a wide choice of data science languages and frameworks. The platform should make collaboration easier and help data scientists align with modern software engineering practices such as code review, continuous integration, and continuous delivery.
Model deployment and management is a critical part of closing the data science loop. The framework needs to support model deployment, versioning, A/B testing, and champion/challenger testing, and provide standard ways to promote and use models.
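A champion/challenger setup can be sketched as a simple traffic router: most requests go to the current champion model, while a slice goes to the challenger so its live performance can be compared before promotion. The class, model names, and 10% split are illustrative:

```python
# Sketch of champion/challenger model serving. The split ratio and the
# model stubs are illustrative; real systems would also log each model's
# predictions and outcomes for comparison.
import random

class ModelRouter:
    def __init__(self, champion, challenger, challenger_share=0.1, seed=0):
        self.champion = champion
        self.challenger = challenger
        self.challenger_share = challenger_share
        self.rng = random.Random(seed)  # seeded for reproducibility

    def predict(self, features):
        # Route a small share of traffic to the challenger.
        model = (self.challenger
                 if self.rng.random() < self.challenger_share
                 else self.champion)
        return model["name"], model["fn"](features)

champion = {"name": "churn-v1", "fn": lambda f: 0.2}
challenger = {"name": "churn-v2", "fn": lambda f: 0.3}
router = ModelRouter(champion, challenger, challenger_share=0.1)

counts = {"churn-v1": 0, "churn-v2": 0}
for _ in range(1000):
    name, _score = router.predict({})
    counts[name] += 1
print(counts)  # roughly 90% champion, 10% challenger
```

If the challenger outperforms the champion on live traffic, it is promoted to champion; versioning and standard promotion paths make that switch auditable and reversible.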
Deep learning (DL) is top of mind for many, and selecting the right DL framework for a given problem remains an art form. The platform needs to guide the choice of DL framework and provide better integration with hardware resources to improve training time and performance.
Published at DZone with permission of Vinay Shukla. See the original article here.