A Visual Guide to Mixture of Experts (MoE)
Original title: A Visual Guide to Mixture of Experts (MoE)
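Analysis results
- Category
- AI
- Importance
- 60
- Trend score
- 24
- Summary
- Mixture of Experts (MoE) is a technique that leverages different expert models for specific tasks. This guide explains how MoE works in large language models and clarifies the mechanisms for selecting and combining experts. MoE is an important approach for using computational resources efficiently and improving model performance.
- Keywords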
A Visual Guide to Mixture of Experts (MoE)
Demystifying the role of MoE in Large Language Models
Maarten Grootendorst, Exploring Language Models, Oct 07, 2024

When looking at the latest releases of Large Language Models (LLMs), you will often see “MoE” in the title. What does this “MoE” represent and why are so many LLMs using it? In this visual guide, we will take our time to explore this important component, Mixture of Experts (MoE), through more than 50 visualizations! We will go through the two main components of MoE, namely Experts and the Router, as applied in typical LLM-based architectures.

What is Mixture of Experts?

Mixture of Experts (MoE) is a technique that uses many different sub-models (or “experts”) to improve the quality of LLMs. Two main components define an MoE:
- Experts: Each FFNN layer now has a set of “experts” of which a subset can be chosen. These “experts” are typically FFNNs themselves.
- Router or gate network: Determines which tokens are sent to which experts.

In each layer of an LLM with an MoE, we find (somewhat specialized) experts. Know that an “expert” is not specialized in a specific domain like “Psychology” or “Biology”.
At most, it learns syntactic information on a word level instead. More specifically, experts specialize in handling specific tokens in specific contexts.

The router (gate network) selects the expert(s) best suited for a given input. Each expert is not an entire LLM but a submodel that is part of an LLM’s architecture.

The Experts

To explore what experts represent and how they work, let us first examine what MoE is supposed to replace: the dense layers.

Dense Layers

Mixture of Experts (MoE) models all start from a relatively basic component of LLMs, namely the Feedforward Neural Network (FFNN). Remember that a standard decoder-only Transformer architecture has the FFNN applied after layer normalization. An FFNN allows the model to use the contextual information created by the attention mechanism, transforming it further to capture more complex relationships in the data. The FFNN, however, does grow quickly in size. To learn these complex relationships, it typically expands the input it receives into a larger hidden layer.

Sparse Layers

The FFNN in a traditional Transformer is called a dense model since all parameters (its weights and biases) are activated. Nothing is left behind and everything is used to calculate the output. If we take a closer look at the dense model, notice how the input activates all parameters to some degree.

In contrast, sparse models only activate a portion of their total parameters and are closely related to Mixture of Experts. To illustrate, we can chop up our dense model into pieces (so-called experts), retrain it, and only activate a subset of experts at a given time.

The underlying idea is that each expert learns different information during training. Then, when running inference, only the experts most relevant to a given task are used. When asked a question, we can select the expert best suited for a given task.

What does an Expert Learn?

As we have seen before, experts learn more fine-grained information than entire domains.
[1] As such, calling them “experts” has sometimes been seen as misleading.

Figure: Expert specialization of an encoder model in the ST-MoE paper.

Experts in decoder models, however, do not seem to have the same type of specialization. That does not mean, though, that all experts are equal. A great example can be found in the Mixtral 8x7B paper, where each token is colored with its first expert choice. That visualization also demonstrates that experts tend to focus on syntax rather than a specific domain. Thus, although decoder experts do not seem to have a specialism, they do seem to be used consistently for certain types of tokens.

The Architecture of Experts

Although it’s nice to visualize experts as a hidden layer of a dense model cut in pieces, they are often whole FFNNs themselves. Since most LLMs have several decoder blocks, a given text will pass through multiple experts before the text is generated. The chosen experts likely differ between tokens, which results in different “paths” being taken. If we update our visualization of the decoder block, it would now contain more FFNNs (one for each expert) instead. The decoder block now has multiple FFNNs (each an “expert”) that it can use during inference.

The Routing Mechanism

Now that we have a set of experts, how does the model know which experts to use? Just before the experts, a router (also called a gate network) is added, which is trained to choose which expert to use for a given token.

The Router

The router (or gate network) is also an FFNN and is used to choose the expert based on a particular input. It outputs probabilities which it uses to select the best matching expert. The expert layer returns the output of the selected expert multiplied by the gate value (selection probabilities). The router together with the experts (of which only a few are selected) makes up the MoE Layer.

A given MoE layer comes in two variants, either a sparse or a dense mixture of experts.
Both use a router to select experts, but a Sparse MoE only selects a few whereas a Dense MoE selects them all, potentially with different weights. For instance, given a set of tokens, a Dense MoE will distribute the tokens across all experts whereas a Sparse MoE will only select a few experts. In the current state of LLMs, when you see an “MoE” it will typically be a Sparse MoE, as it allows you to use a subset of experts. This is computationally cheaper, which is an important trait for LLMs.

Selection of Experts

The gating network is arguably the most important component of any MoE as it not only decides which experts to choose during inference but also during training. In its most basic form, we multiply the input (x) by the router weight matrix (W): H(x) = x · W. Then, we apply a SoftMax on the output to create a probability distribution G(x) per expert: G(x) = SoftMax(H(x)). The router uses this probability distribution to choose the best matching expert for a given input. Finally, we multiply the output of each selected expert by its router probability and sum the results. Putting everything together, the input flows through the router, which scores the experts, and then through the selected experts, whose outputs are weighted and summed.

The Complexity of Routing

However, this simple function often results in the router choosing the same expert since certain experts might learn faster than others. Not only will there be an uneven distribution of experts chosen, but some experts will hardly be trained at all. This results in issues during both training and inference. Instead, we want equal importance among experts during training and inference, which we call load balancing. In a way, it’s to prevent overfitting on the same experts.

Load Balancing

To balance the importance of experts, we will need to look at the router as it is the main component to decide which experts to choose at a given time.

KeepTopK

One method of load balancing the router is through a straightforward extension called KeepTopK [2].
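As a baseline for the load-balancing extensions, the basic gating from the Selection of Experts section (scores from x · W, a SoftMax, then a weighted sum of expert outputs) can be sketched as follows. This is a minimal NumPy sketch, not the article's code: the names `basic_moe_layer` and `W_router` are hypothetical, and it weights every expert by its gate value (the dense case) rather than selecting only a few.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def basic_moe_layer(x, W_router, experts):
    """Basic gating sketch (hypothetical helper).

    x:        (n_tokens, d_model) token representations
    W_router: (d_model, n_experts) router weight matrix
    experts:  list of callables, each mapping (n_tokens, d_model) to the same shape
    """
    H = x @ W_router                       # one score per token and expert
    G = softmax(H)                         # probability distribution over experts
    # Run every expert and stack: (n_tokens, n_experts, d_model)
    expert_out = np.stack([E(x) for E in experts], axis=1)
    # Weight each expert output by its gate value and sum over experts.
    y = (G[..., None] * expert_out).sum(axis=1)
    return y, G

# Toy usage with random weights; each expert is a one-layer ReLU network.
rng = np.random.default_rng(0)
d_model, n_experts, n_tokens = 8, 4, 5
tokens = rng.normal(size=(n_tokens, d_model))
router_W = rng.normal(size=(d_model, n_experts))
expert_Ws = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
toy_experts = [lambda t, W=W: np.maximum(t @ W, 0.0) for W in expert_Ws]
out, gates = basic_moe_layer(tokens, router_W, toy_experts)
print(out.shape, gates.sum(axis=1))
```

Because every expert runs for every token here, this version saves no compute; restricting the sum to a few experts per token is what makes the sparse variant cheaper.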
By introducing trainable (Gaussian) noise, we can prevent the same experts from always being picked. Then, all but the top k experts that you want to activate (for example 2) will have their weights set to -∞. By setting these weights to -∞, the output of the SoftMax on these weights will result in a probability of 0. The KeepTopK strategy is one that many LLMs still use despite many promising alternatives. Note that KeepTopK can also be used without the additional noise.

Token Choice

The KeepTopK strategy routes each token to a few selected experts. This method is called Token Choice [3] and allows for a given token to be sent to one expert (top-1 routing) or to more than one expert (top-k routing). A major benefit is that it allows the experts’ respective contributions to be weighed and integrated.

Auxiliary Loss

To get a more even distribution of experts during training, the auxiliary loss (also called load balancing loss) was added to the network’s regular loss. It adds a constraint that forces experts to have equal importance. The first component of this auxiliary loss is to sum the router values for each expert over the entire batch. This gives us the importance scores per expert, which represent how likely a given expert will be chosen regardless of the input. We can use this to calculate the coefficient of variation (CV), the standard deviation divided by the mean, which tells us how different the importance scores are between experts. For instance, if there are a lot of differences in importance scores, the CV will be high. In contrast, if all experts have similar importance scores, the CV will be low (which is what we aim for). Using this CV score, we can update the auxiliary loss during training such that it aims to lower the CV score as much as possible (thereby giving equal importance to each expert). Finally, the auxiliary loss is added as a separate loss to optimize during training.
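The KeepTopK routing and the CV-based auxiliary loss described above can be sketched together in NumPy. This is an illustrative sketch under stated assumptions, not the article's implementation: the function names, the softplus-scaled noise term, the squared CV, and the `w_importance` scaling factor are assumptions based on one common formulation of noisy top-k gating.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)                          # exp(-inf) evaluates to 0
    return e / e.sum(axis=axis, keepdims=True)

def keep_top_k_gates(x, W_router, k=2, W_noise=None, rng=None):
    """Gate values with all but the top-k scores per token set to -inf.

    W_noise is optional: KeepTopK also works without the added noise.
    """
    H = x @ W_router
    if W_noise is not None:
        if rng is None:
            rng = np.random.default_rng()
        # Assumption: Gaussian noise scaled by a trainable softplus term.
        noise_scale = np.log1p(np.exp(x @ W_noise))
        H = H + rng.standard_normal(H.shape) * noise_scale
    # Indices of the k largest scores for each token.
    top_idx = np.argpartition(H, -k, axis=-1)[:, -k:]
    masked = np.full_like(H, -np.inf)
    np.put_along_axis(masked, top_idx,
                      np.take_along_axis(H, top_idx, axis=-1), axis=-1)
    return softmax(masked)                 # masked experts get probability 0

def importance_cv_loss(gates, w_importance=0.01, eps=1e-10):
    """Auxiliary loss from the coefficient of variation of expert importance."""
    importance = gates.sum(axis=0)         # sum router values over the batch
    cv = importance.std() / (importance.mean() + eps)
    # Assumption: the loss is the squared CV times a scaling hyperparameter.
    return w_importance * cv ** 2, cv

# Toy usage: 6 tokens, 4 experts, top-2 routing without noise.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
router_W = rng.normal(size=(8, 4))
gates = keep_top_k_gates(tokens, router_W, k=2)
aux_loss, cv = importance_cv_loss(gates)
print(gates.round(2))
print("CV:", cv, "auxiliary loss:", aux_loss)
```

In training, a term like `aux_loss` would be added to the regular loss, so that gradients push the router toward similar importance scores across experts.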
Expert Capacity

Imbalance is not just found in the experts that were chosen but also in the distribution of tokens that are sent to each expert. For instance, if input tokens are disproportionately sent to one expert over another, then that might also result in undertraining. Here, it is not just about which experts are used but how much they are use