Using Machine Learning to Mine Complex Datasets

New research uses machine learning to help scientists derive reliable information rapidly from large and complex datasets.

It's widely recognized that the rising power of AI is largely driven by the increase in data available to train the algorithms. In the Big Data Era, the quantity of data is seldom the issue; analyzing it successfully is often the much harder problem.

New research from the Department of Energy's Lawrence Berkeley National Laboratory (Berkeley Lab) and UC Berkeley uses machine learning to enable scientists to derive insights from incredibly complex datasets in record time.

"Take a human cell, for example. There are 10^170 possible molecular interactions in a single cell. That creates considerable computing challenges in searching for relationships," the authors explain. "Our method enables the identification of interactions of high order at the same computational cost as main effects, even when those interactions are local with weak marginal effects."

Unique Requirements

The team highlights the unique requirements of machine learning projects in science compared to those in other sectors. In some sectors, it is acceptable not to understand how an algorithm came to its conclusion. In science, it is not.

A detailed understanding of how and why something happens allows scientists to model the process and test whether it can be improved. As such, explainability is crucial for machine learning when used in scientific projects.

This is especially difficult in complex systems, which generally involve a huge number of variables, many of which behave in nonlinear ways. That makes it very hard to build a model that shows cause and effect.

"Unfortunately, in biology, you come across interactions of order 30, 40, 60 all the time," the authors explain. "It's completely intractable with traditional approaches to statistical learning."
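To get a feel for why exhaustive search breaks down at these orders, the short sketch below counts how many distinct feature combinations of a given order exist. The figure of 1,000 measured variables is purely an illustrative assumption, not a number from the research, and `candidate_interactions` is a hypothetical helper name.

```python
from math import comb


def candidate_interactions(n_features: int, order: int) -> int:
    """Number of distinct feature subsets of size `order` (n choose k)."""
    return comb(n_features, order)


if __name__ == "__main__":
    n_features = 1000  # assumed, for illustration only
    for order in (2, 3, 30):
        count = candidate_interactions(n_features, order)
        # Scientific notation keeps the huge values readable.
        print(f"order {order:>2}: {count:.3e} candidate interactions")
```

Under this assumption there are 499,500 pairwise candidates, but the count for order 30 is astronomically larger, which is why testing every combination one by one is not an option.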

Random Forests

The team's method, iterative random forests (iRF), builds on random forests and translates the internal state of the fitted model into a more human-readable interpretation. They believe the approach will allow researchers to safely search for complex interactions without incurring huge computational costs.

"There is no difference in the computational cost of detecting an interaction of order 30 versus an interaction of order two," they say. "And that's a sea change."
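One way to see how the cost can be independent of interaction order is to intersect the feature sets found on randomly chosen decision paths: features that keep surviving the intersections are candidates for an interaction, whatever their number. The sketch below is a simplified illustration of that idea on toy data, not the authors' iRF implementation. The function names, the planted feature set, and all parameter values are hypothetical.

```python
import random
from collections import Counter


def make_toy_paths(n_paths=200, n_noise=50, seed=1):
    """Build hypothetical decision paths as sets of feature names.

    Every other path contains a planted interaction; all paths also
    contain three randomly chosen noise features.
    """
    rng = random.Random(seed)
    planted = frozenset({"g1", "g2", "g3", "g4"})
    noise = [f"n{i}" for i in range(n_noise)]
    paths = []
    for i in range(n_paths):
        features = set(rng.sample(noise, 3))
        if i % 2 == 0:
            features |= planted
        paths.append(frozenset(features))
    return paths, planted


def random_intersections(paths, n_trials=500, depth=3, seed=0):
    """Count feature sets that survive intersecting `depth` random paths."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_trials):
        survivors = set(rng.choice(paths))
        for _ in range(depth - 1):
            survivors &= rng.choice(paths)
            if not survivors:
                break  # nothing in common, stop early
        if len(survivors) >= 2:  # keep only sets of two or more features
            counts[frozenset(survivors)] += 1
    return counts


if __name__ == "__main__":
    paths, planted = make_toy_paths()
    counts = random_intersections(paths)
    top_set, hits = counts.most_common(1)[0]
    print(sorted(top_set), hits)
```

Each trial costs a handful of set intersections regardless of how many features the planted interaction contains, so no enumeration of combinations is needed. A real analysis would take the paths from a fitted forest rather than from toy data.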

The algorithm was put through its paces on a couple of genomics problems: one involving the role of gene enhancers in fruit flies and the other alternative splicing in a human-derived cell line. In both experiments, the algorithm was able to confirm previous findings while also discovering some higher-order interactions for the team to follow up on in subsequent work.

The team is now testing the algorithm on a number of other problems in various other domains, and is confident that the work represents a fundamental shift in how science can be performed.

"We do prediction, but we introduce stability on top of prediction in iRF to more reliably learn the underlying structure in the predictors," they say. "This enables us to learn how to engineer systems for goal-oriented optimization and more accurately targeted simulations and follow-up experiments."

Published at DZone with permission of
