Building a Machine Learning Engine from Big Data
Let’s take a look at the underlying processes behind machine learning to understand more about what we need to make it work.
Machine learning (ML) is still growing as a field in big data and has lately made some significant advances. In fact, practical applications have become quite commonplace, and most of us have already benefited from them. Email providers use it to determine whether an email is spam or legitimate. Credit card companies use it to flag potentially fraudulent charges. Hospitals use it to improve outcomes for patients. If you’re a Netflix customer, you may have even had a movie recommended to you by machine learning :-) It’s obviously a hot topic and is making its way into newer industries as capabilities grow. A simple ‘machine learning’ search on Amazon returns over 13,000 books.
But with this rapid growth of machine learning applications, what does an organization actually need to create a well-oiled machine learning engine? Let’s take a look at the underlying processes behind machine learning to understand more about what we need to make it work.
Applied Mathematics and Data Science
There are many aspects enabling a cohesive machine learning process. Jean Georges Perrin wrote a great article on some of the mathematical aspects of what makes machine learning possible. Linear regression is a relatively basic technique that most of us have at least touched on in past algebra courses. The mathematics get more complex with advanced statistical techniques such as decision trees, neural networks, clustering, and Bayesian networks, to name a few. Each has its benefits and limitations, and it takes an understanding of the underlying process and how it works to know when to use which. If you do not already have a background in these techniques, you may want to consult a data scientist. They love this stuff!
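To make the linear regression starting point concrete, here is a minimal sketch of ordinary least squares fitting a straight line. The data values are hypothetical and purely illustrative; real pipelines would typically use a library such as scikit-learn rather than hand-rolled math.

```python
# Minimal sketch of simple linear regression (ordinary least squares).
# All data here is made up for illustration.
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y over variance of x gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: ad spend (x) vs. sales (y).
slope, intercept = fit_line([1, 2, 3, 4, 5], [3.1, 4.9, 7.2, 9.0, 10.8])
prediction = slope * 6 + intercept  # predict for an unseen input
```

The same two-line formula generalizes poorly to the more advanced techniques mentioned above, which is exactly why those call for deeper statistical background.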
Considerations for Data Ingestion
Once these techniques are determined, the analytic system still needs an influx of data (see illustration below). The data should be fresh, cleansed, and put into a format the analytic system can handle. Ideally, we even want the system to be able to take in information from multiple sources. We may have some data arriving once a day and other data streaming in continuously. Unfortunately, this part of the process has traditionally required the majority of the effort. In a typical 80/20 breakdown, this would be the 80% of the time.
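One way to picture multi-source ingestion is as a set of small adapters that normalize each feed into a common record format. The sketch below assumes two hypothetical sources, a nightly CSV-style batch file and a continuous JSON event stream; the field names and schemas are invented for illustration.

```python
import json

# Hypothetical adapters mapping two differently-shaped feeds
# into one common record schema.

def from_batch_row(row):
    """Parse a 'customer_id,amount,date' line from a nightly batch file."""
    cid, amount, date = row.split(",")
    return {"customer_id": cid, "amount": float(amount), "ts": date}

def from_stream_event(raw):
    """Parse a JSON event from a continuous stream."""
    event = json.loads(raw)
    return {"customer_id": event["cust"],
            "amount": float(event["amt"]),
            "ts": event["ts"]}

# Both feeds end up in the same shape, ready for downstream analytics.
records = [
    from_batch_row("C001,19.99,2016-05-01"),
    from_stream_event('{"cust": "C002", "amt": "5.25", "ts": "2016-05-01"}'),
]
```

However simple each adapter looks, writing and maintaining dozens of them for real, drifting source systems is where that 80% of the effort tends to go.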
Data Quality and Preparation
All of the source systems need to be identified and connected. Data needs to be extracted, quality-checked, and cleansed. Have we seen this data before? Did we get what we expected at the file level and at the field level? Will transforming the data make it more usable? How do we join multiple data sets together? Do we need to tokenize or mask any sensitive data before we process it further? All of these questions and more need to be addressed in the data preparation stage. Once this is defined, we can implement the learning algorithms discussed earlier to generate insights.
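The field-level checks and the tokenization question above can be sketched as two small functions. This is a hedged illustration only: the required fields, validation rules, and hash-based token scheme are assumptions for the example, not a description of any particular product's behavior.

```python
import hashlib

# Hypothetical schema for the example: which fields must be present.
REQUIRED_FIELDS = {"customer_id", "amount", "ssn"}

def quality_check(record):
    """Did we get what we expected at the field level?"""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        return False, f"missing fields: {sorted(missing)}"
    if record["amount"] < 0:
        return False, "negative amount"
    return True, "ok"

def tokenize(record):
    """Replace sensitive data with a one-way token before further processing."""
    masked = dict(record)
    masked["ssn"] = hashlib.sha256(record["ssn"].encode()).hexdigest()[:16]
    return masked

row = {"customer_id": "C001", "amount": 19.99, "ssn": "123-45-6789"}
ok, reason = quality_check(row)
clean = tokenize(row)  # the raw SSN never reaches downstream stages
```

A one-way hash like this lets downstream joins still match on the token while the original sensitive value stays out of the analytic system.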
Managing These Complex Data Processes
Zaloni’s Bedrock is a unified platform that manages the entire ingestion, data preparation, and machine learning process. Bedrock sits on top of Hadoop, inheriting its benefits while simplifying implementation. Setting up and updating connections is an easy process, and data can be extracted continuously or in batch mode. Bedrock provides various tools for performing transformations, data quality checks, and cleansing. If a data source changes format, it’s easy to update accordingly. Once the data is prepared, all of the analytics run on Hadoop, which allows for strong performance and scalability as needed.
Having the entire process defined within Bedrock helps to reduce that standard 80% share of the time in multiple ways. The automation of ingestion and workflows allows for reuse and constant updates to the data. With an interactive visual interface for transforming the data, needed changes are easy to find and implement. As the analytical processing gets more complex and these processor-hungry algorithms try to make sense of larger data sets, systems can bog down. By implementing on Hadoop, we can scale out by adding nodes, so today's performance can be maintained two years from now. Another option is to implement on a cloud-hosted solution, which can bring major cost benefits as well. Additionally, having the entire process in one platform enables an easy feedback loop to fix any issues.
Published at DZone with permission of Aashish Majethia. See the original article here.