Understanding LLMs: Mixture of Experts

Mixture of Experts is an LLM architecture that powers some of today's most popular LLMs. Continue reading to learn how it works and why it works so well.

By Roger Oriol · Apr. 24, 2024 · Review

Unlike the Transformer architecture, Mixture of Experts is not a new idea. Still, it is the latest hot topic in Large Language Model architecture. It has been rumored to power OpenAI's GPT-4 (and maybe GPT-3.5-turbo), and it is the backbone of Mistral's Mixtral 8x7B, Grok-1, and Databricks' DBRX, which rival or even surpass GPT-3.5 despite being considerably smaller. Follow along to learn more about how this kind of architecture works and why it leads to such great results for LLMs.

Architecture

A Mixture of Experts is a model with a sparse layer and a router. The experts reside in the sparse layer: they are separate models with no connections between them, and each expert specializes in a specific task. The router is a gating mechanism that learns to decide which expert is best equipped to deal with a given input. The simplicity of this concept allows the architecture to work with any type of model. In this article, we will focus on Transformers, where the experts are feed-forward networks, but they might as well be RNNs, SVMs, or even linear regression models. Another possibility is hierarchical Mixtures of Experts, which use multiple routers at different levels.

[Figure: Mixture of Experts architecture]

The big advantage of this kind of architecture is conditional computation. A single inference does not need to use all of the model's weights. The gating mechanism is trained to choose the top k experts and route the input only to those. This choice also includes a degree of random noise, which prevents the most popular experts from being overloaded and ensures that the other experts are also trained on all kinds of data.
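To make the gating concrete, here is a minimal NumPy sketch of a MoE layer with noisy top-k routing. It follows the general scheme described above rather than any particular model's implementation; the dimensions, noise level, and randomly initialized weights are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_hidden = 64, 256          # illustrative sizes, not from any real model
num_experts, top_k = 8, 2            # e.g., a Mixtral-style 8-choose-2 setup

# Each expert is a small feed-forward network (two weight matrices).
experts = [
    (rng.normal(0, 0.02, (d_model, d_hidden)), rng.normal(0, 0.02, (d_hidden, d_model)))
    for _ in range(num_experts)
]
router_w = rng.normal(0, 0.02, (d_model, num_experts))  # gating weights

def moe_layer(x, noise_std=1e-2):
    """x: (d_model,) representation of a single token."""
    # 1. The router scores each expert; a little noise discourages always
    #    picking the same popular experts.
    logits = x @ router_w + rng.normal(0, noise_std, num_experts)
    # 2. Keep only the top-k experts and renormalize their scores with a softmax.
    top = np.argsort(logits)[-top_k:]
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()
    # 3. Only the chosen experts run; the rest of the layer's weights stay idle.
    out = np.zeros(d_model)
    for w, idx in zip(weights, top):
        w_in, w_out = experts[idx]
        hidden = np.maximum(x @ w_in, 0.0)   # ReLU feed-forward expert
        out += w * (hidden @ w_out)
    return out

token = rng.normal(size=d_model)
print(moe_layer(token).shape)   # (64,)
```

Because only top_k of the num_experts feed-forward blocks run for each token, the per-token compute stays close to that of a dense model with a single expert, even though the layer holds many times more parameters.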

History

The first sentence of this article states that Mixture of Experts is not a recent idea. In fact, it was first proposed in 1991 in the paper Adaptive Mixtures of Local Experts. In that paper, the authors proposed that when a model has to perform different tasks, it is beneficial to have different experts with decoupled weights, so that no expert is affected by the others fitting their weights to their own tasks.

Even though the idea is old, the Mixture of Experts architecture benefits a lot from today's computing power and horizontal scaling. MoE models can easily be distributed across multiple devices. Since not all of the model's weights activate on each inference, each expert can live on a different device, freeing the devices holding the other experts to handle other requests in parallel.

How Many Experts Should a Model Have?

When we train a Mixture of Experts model, we expect each expert to learn and become proficient at specific tasks. Experts do seem to specialize in handling specific inputs: in a language model, for example, experts tend to divide their expertise between handling nouns, verbs, punctuation, numbers and counting, and so on. However, they don't specialize along other lines that we would consider obvious. When we train a MoE model on a multilingual corpus, different experts don't learn different languages; they all seem to try to learn all of them.

A crucial decision when designing a Mixture of Experts model is the number of experts it will have. Normally, more experts mean more efficiency, since a smaller fraction of the whole model needs to run for each inference. However, there are some caveats. The advantage of adding another expert diminishes the more experts we have; 4 to 16 experts seem to be the sweet spot. Also, even though the model doesn't use all of its weights for every inference, which reduces compute time, it still must hold all of the weights in VRAM. Looking at some popular models, DBRX has 16 experts (4 active on any inference), while Mixtral and Grok have 8 (2 active).
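As a rough back-of-the-envelope illustration of that trade-off (the parameter counts below are made up for clarity and do not describe any released model), only the expert feed-forward weights are sparsely activated; attention, embeddings, and the router always run, and everything must sit in memory:

```python
# Hypothetical MoE configuration -- illustrative numbers only.
shared_params = 10e9          # attention, embeddings, router: always active
params_per_expert = 5e9       # feed-forward weights of one expert
num_experts, top_k = 16, 4    # e.g., a DBRX-like 16-choose-4 setup

total_params = shared_params + num_experts * params_per_expert
active_params = shared_params + top_k * params_per_expert

print(f"total:  {total_params/1e9:.0f}B parameters (all must fit in VRAM)")
print(f"active: {active_params/1e9:.0f}B parameters per token "
      f"({active_params/total_params:.0%} of the model does the compute)")
```

This is why a MoE can have the inference cost of a mid-sized dense model while needing the memory footprint of a much larger one.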

Fine-Tuning MoE

A particular problem with Mixture of Experts models is that they are hard to fine-tune. MoEs are very prone to overfitting: after fine-tuning, they tend to be bad at reasoning tasks while remaining good at knowledge tasks. One way to mitigate this is to reduce the number of experts, as fewer experts lead to better fine-tuning. A recent study has also shed some hope on MoE fine-tuning: it had great success fine-tuning a Flan MoE, suggesting that MoEs might benefit from instruction fine-tuning.

Scaling MoE

On the other hand, Mixture of Experts models are great for high-throughput scenarios, in contrast to dense models, and they can be scaled with a number of techniques.

A paper by Google named GShard explored solving device underutilization in order to scale a MoE horizontally across many devices. It replicated all non-MoE layers on every device, while each device held a different expert for the MoE layers. It also introduced the concept of expert capacity: the maximum number of tokens an expert can take before it is considered overflowed, after which the next expert in line takes over.
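As a sketch of how such a capacity limit could work (illustrative numbers; GShard's actual routing and overflow handling differ in detail), each expert accepts at most a fixed number of tokens per batch, computed from the batch size and a capacity factor, and tokens that overflow fall through to the next expert in line:

```python
import math

def expert_capacity(tokens_per_batch, num_experts, capacity_factor=1.25):
    # Each expert accepts at most this many tokens per batch; a factor > 1
    # leaves headroom for uneven routing.
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)

def assign(token_expert_choices, num_experts, capacity):
    """token_expert_choices: for each token, its candidate experts ranked best-first."""
    load = [0] * num_experts
    assignment = []
    for choices in token_expert_choices:
        # Try the preferred expert; if it is full (overflowed), fall back
        # to the next expert in line.
        for e in choices:
            if load[e] < capacity:
                load[e] += 1
                assignment.append(e)
                break
        else:
            assignment.append(None)  # every candidate was full: token is dropped
    return assignment

cap = expert_capacity(tokens_per_batch=8, num_experts=2)  # -> 5
choices = [(0, 1)] * 7 + [(1, 0)]                         # expert 0 is over-requested
print(cap, assign(choices, num_experts=2, capacity=cap))  # 5 [0, 0, 0, 0, 0, 1, 1, 1]
```

A capacity factor slightly above 1 leaves headroom for uneven routing; in this sketch, a token whose candidate experts are all full is simply dropped.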

Another paper, named Switch Transformers, looked at techniques to reduce communication costs between devices and to reduce training instabilities. To optimize parallelism, it proposed routing each token to a single expert and lowering the capacity factor so that tokens are divided almost equally between the experts (with a small amount of wiggle room for when a specific expert is chosen more often). Switch Transformers also proposed using bfloat16 precision for the expert layers while keeping full precision for the other layers. This stabilizes training, since layers like the router need higher precision because of an exponentiating function, while still reducing communication costs between experts.

Optimizing MoE

Mixture of Experts models can also be optimized through different means. Distilling a sparse model into a dense one keeps around 30% of the sparsity gains while being much smaller in total model size. Another technique is expert aggregation, which merges the weights of all experts into one model that still performs very well across tasks. Finally, QMoE is a quantization technique that can store 1.6 trillion parameters in less than 160 GB (0.8 bits per parameter!).
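That quantization figure is easy to sanity-check: at QMoE's reported rate of roughly 0.8 bits per parameter, 1.6 trillion parameters compress to about 160 GB.

```python
params = 1.6e12            # 1.6 trillion parameters
bits_per_param = 0.8       # QMoE's reported compression rate
print(f"{params * bits_per_param / 8 / 1e9:.0f} GB")   # -> 160 GB
```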

Conclusion

In conclusion, given that there is a need today for models that perform a multitude of different tasks for millions of people (think ChatGPT or similar products), MoE's excellence in high-throughput, distributed scenarios shines. Being efficient at both training and inference also means lower costs and faster innovation. Of course, not everything is great; there are some drawbacks. Being hard to fine-tune is a problem, as is needing a lot of VRAM to operate. What is certain is that we will keep seeing better techniques for optimizing sparse models, and that will lead to better LLMs.

Tags: Language Model, Large Language Model, AI

Published at DZone with permission of Roger Oriol. See the original article here.

Opinions expressed by DZone contributors are their own.
