MACH Algorithm — A Breakthrough in Distributed Deep Learning
A new deep learning algorithm can train up to 10x faster, with up to 4x less memory. Find out how.
Researcher Anshumali Shrivastava of Rice University has announced a breakthrough in distributed deep learning. In collaboration with Amazon, he and his colleagues have shown that they can cut training time by 7x to 10x, using 2x to 4x less memory, on product search and other "extreme classification" problems.
In tests on an Amazon search dataset of 70 million queries and more than 49 million products, Shrivastava, Medini, and colleagues demonstrated their approach, "merged-average classifiers via hashing" (MACH), which required a fraction of the training resources of some state-of-the-art commercial systems.
The approach targets extreme classification problems: machine learning tasks with a very large number of possible outcomes, and therefore a very large number of parameters.
According to the researchers, deep learning models for extreme classification are so large that they typically must be trained on a supercomputer: a linked set of graphics processing units (GPUs) across which the parameters are distributed and run in parallel, often for several days. MACH takes a very different approach. Shrivastava describes it with a thought experiment that randomly divides 100 million products into three classes in each of two separate "worlds". These classes take the form of buckets, and a single bucket can contain iPhones as well as t-shirts or fishing rods.
"Now I feed a search to the classifier in world one, and it says bucket three, and I feed it to the classifier in world two, and it says bucket one," Shrivastava said. "What is this person thinking about? The most probable class is something that is common between these two buckets. If you look at the possible intersections of the buckets, there are three in world one times three in world two, or nine possibilities. So I have reduced my search space to one over nine, and I have only paid the cost of creating six classes."
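The thought experiment can be sketched in a few lines of code. This is an illustrative toy, not the paper's implementation: class counts and bucket assignments are made up, and a real MACH system would use hash functions and trained classifiers rather than stored tables.

```python
import random

# Toy version of the MACH thought experiment: assign many classes to a few
# buckets in several independent "worlds", then recover a class by
# intersecting the buckets predicted in each world. Sizes are illustrative
# stand-ins (1,000 classes instead of 100 million products).
NUM_CLASSES = 1_000
NUM_BUCKETS = 3    # buckets per world
NUM_WORLDS = 2     # independent random partitions

rng = random.Random(42)
# Each world assigns every class to one of NUM_BUCKETS buckets at random.
worlds = [[rng.randrange(NUM_BUCKETS) for _ in range(NUM_CLASSES)]
          for _ in range(NUM_WORLDS)]

def candidates(predicted_buckets):
    """Classes consistent with the bucket predicted in every world."""
    return [c for c in range(NUM_CLASSES)
            if all(worlds[w][c] == b for w, b in enumerate(predicted_buckets))]

# Suppose world one's classifier predicts bucket 3 (index 2) and world
# two's predicts bucket 1 (index 0), as in the quote above.
survivors = candidates([2, 0])

# On average, only NUM_CLASSES / NUM_BUCKETS**NUM_WORLDS classes survive:
# the search space shrinks to ~1/9 for the cost of training 3 + 3 = 6
# bucket-level classes instead of NUM_CLASSES ones.
print(len(survivors))
```

With more buckets and more worlds, the surviving candidate set shrinks geometrically while the number of classes each small classifier must handle stays tiny, which is the core of the memory saving.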
In the Amazon experiment, training the model took less time and memory than some of the best reported results for models with a comparable number of parameters, including Google's sparsely gated mixture-of-experts (MoE) model.
Most notably, the MACH algorithm requires no communication between parallel processors. Since inter-processor communication is often the bottleneck of a distributed algorithm, this makes MACH extremely scalable.
Medini, Shrivastava’s colleague, said, "In principle, you could train each of the 32 [models] on one GPU, which is something you could never do with a nonindependent approach."
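The independence claim can be made concrete with a small sketch. This is an assumption-laden illustration, not the paper's code: every repetition ("world") remaps the original labels through its own random hash, so the resulting training jobs share no state and could each run on a separate GPU with zero communication. The per-job trainer is left as a hypothetical stub.

```python
import random

# Sketch of MACH's embarrassingly parallel training. All names and sizes
# here are illustrative, not taken from the paper.
NUM_CLASSES = 10_000   # original label space
NUM_BUCKETS = 32       # output size of each small classifier
NUM_REPS = 4           # independent repetitions (the quote refers to 32)

rng = random.Random(0)
# One independent random label hash per repetition.
hashes = [[rng.randrange(NUM_BUCKETS) for _ in range(NUM_CLASSES)]
          for _ in range(NUM_REPS)]

# Toy training set: feature vectors paired with original class ids.
features = [[rng.random() for _ in range(8)] for _ in range(100)]
labels = [rng.randrange(NUM_CLASSES) for _ in features]

# Each job pairs the same inputs with its own hashed labels. The jobs
# share nothing, so each could train on its own GPU without talking to
# the others.
jobs = [(features, [hashes[r][y] for y in labels]) for r in range(NUM_REPS)]

for r, (X, y_hashed) in enumerate(jobs):
    assert max(y_hashed) < NUM_BUCKETS  # each job's label space is tiny
    # train_small_classifier(X, y_hashed)  # hypothetical per-GPU trainer
```

Because the only shared artifact is the fixed random hash (which can be regenerated from a seed on each machine), no gradients or parameters ever cross between processors during training.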
All of these characteristics make the MACH algorithm a strong fit for other large-output tasks, such as translation and answering general questions.