Evolving from Descriptive to Predictive Analytics (Part 3): Fast Start Data Management

Learn about the tooling needed for efficient and scalable machine learning solutions to evolve from descriptive to predictive analytics.

The post that follows is the third in an ongoing series about a shift in focus from descriptive to predictive analytics. We hope you'll check out Part 1 and Part 2.

At this point in our journey, we had leadership support and the skills needed to transform from descriptive to predictive analytics. The next step was identifying the tools our team would need to succeed.

Machine learning is a subset of data science in which machines learn from data. If your data management strategy exists only to feed descriptive analytics, long-term success will very likely require that strategy to evolve. Most descriptive analytics solutions use purpose-built data marts that serve a specific function of the business. With machine learning, you'll quickly find you need to combine data you've never combined before. In particular, you might need to bring together disparate internal, external, structured, and unstructured data, which can present a major challenge. Don't let that stop you from moving forward. As Mark Twain once said, "The secret of getting ahead is getting started." We recommend moving forward on parallel paths: evolve your hybrid data management strategy and tools while building machine learning solutions on the data foundation you have today.

In later articles, we'll surface the details of building a data management strategy for machine learning, but for now, we'll focus on what you need to create machine learning solutions as quickly as possible. Our team didn't wait for a broader scalable solution to be in place. We needed to deliver business results as quickly as possible to provide early returns on our data science investment and generate excitement around the possibilities of machine learning.

During our initial research into machine learning, we spoke to others who had already made the transition to predictive analytics. A common theme was that the data management and governance challenges consumed 50-75% of most projects. We had a team of data scientists looking to apply their skills and a business eagerly awaiting results, so we needed a way to reduce the work of accessing, understanding, and preparing the data.

Data Storage

We needed to accomplish three basic functions with our data: move, store, and govern. We already had a robust relational Db2 datamart supporting our descriptive analytics, and a world-class database such as Db2 gave us a great starting point. There was still work to do, since we needed to greatly increase the scope and currency of our data. We quickly realized we should co-locate our data with our tools. Later, we'll discuss how 'data gravity' should pull the tools into the data's ecosystem, but in the short term, we did the opposite: we moved our datamart into the environment that contained our data science and visualization tools. This immediately improved our data access performance and reliability.

Data Movement

Our datamart was a decade-long product of many data engineers contributing their preferred data load solutions along the way. We moved the data using Db2 command line scripts, Cognos Data Manager, and InfoSphere DataStage. I can't overemphasize the importance of having a robust data strategy and solution, since data will need to move quickly and reliably to support your machine learning initiatives.

We also needed to standardize on an enterprise ETL platform. Our primary needs beyond robust ETL functionality were:

  • The ability to monitor and manage multiple jobs from a single graphical interface.
  • A highly scalable parallel framework.
  • Native integration into a broader suite of information governance tools.

Taken together, those requirements made DataStage the perfect solution. We didn't take the time to rebuild everything up front. Instead, we continue to migrate existing ETL jobs into DataStage while building new solutions directly in it. The resulting reliability and efficiency have turned out to be essential to our machine learning efforts.

Data Governance

It took us a bit longer to reach the next realization: the need to manage data quality. You can't expect high-performing ML models without data quality and completeness. We've also found that even limited success with ML invites others to engage in your projects. This meant we needed an easy way for new data scientists to understand which previously cleansed data was ready for ML, as well as the structure and quality of new data.

To address these new challenges, we implemented another part of the Information Server suite of products, Information Analyzer, which has helped us to quickly understand the structure, frequency, and quality of our data. We can then focus our efforts on understanding the content and cleansing the data to ensure that it's fit for use within our ML framework.
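To make the idea of profiling "structure, frequency, and quality" concrete, here is a minimal Python sketch of the kind of per-column summary a profiling tool produces. It is not how Information Analyzer works internally; the function names (`profile_column`, `profile_table`) and the sample rows are hypothetical and exist only for illustration.

```python
from collections import Counter

def profile_column(values):
    """Summarize one column: row count, null rate, distinct values, types, top values."""
    total = len(values)
    # Treat None and empty strings as missing (an assumption; adjust per data source).
    non_null = [v for v in values if v is not None and v != ""]
    freq = Counter(non_null)
    return {
        "rows": total,
        "null_rate": (total - len(non_null)) / total if total else 0.0,
        "distinct": len(freq),
        "types": dict(Counter(type(v).__name__ for v in non_null)),
        "top_values": freq.most_common(3),
    }

def profile_table(rows):
    """Profile every column of a table given as a list of dicts with the same keys."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return {name: profile_column(vals) for name, vals in columns.items()}

# Hypothetical sample data with a missing value and a mixed-type column.
rows = [
    {"region": "EMEA", "revenue": 120.0},
    {"region": "EMEA", "revenue": None},
    {"region": "", "revenue": 95.5},
    {"region": "APAC", "revenue": "n/a"},
]

report = profile_table(rows)
for column, summary in report.items():
    print(column, summary)
```

A summary like this surfaces the issues that matter before modeling: columns with high null rates, unexpected mixed types (such as a text placeholder in a numeric column), and skewed value frequencies.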

Storing, moving, and governing data is a foundational aspect of machine learning. Using Db2, DataStage and Information Analyzer as our hybrid data management and unified governance tools gave us an efficient and stable foundation for our short-term machine learning efforts.

In the next part of the blog series, we'll discuss another tool that's crucial to your early efforts: a data science tool.

big data, descriptive analytics, predictive analytics, data management, machine learning

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
