Architecture Deep Dive

Mixture of Experts

How to build a model with 140 billion parameters that costs about the same compute per token as a 14 billion parameter model. The architecture behind Mixtral, DeepSeek, and (reportedly) GPT-4.

🧠 Architecture deep-dive 🎬 5 animations 📊 Dense math 🔧 LLM application 🔗 Connects to GRPO series

The Core Idea

The most important insight behind MoE fits in one sentence: not every input needs every parameter.

A standard neural network applies all its weights to every input. A 70B dense model performs roughly 70 billion multiply-adds for every single token. A MoE model instead partitions its FFN parameters into E independent "expert" subnetworks and, for each token, activates only a small subset, typically 2 out of 8. If those 70B parameters all sat in 8 experts, each token would touch only about 17B of them: the inference FLOPs of a ~17B dense model, with roughly 4× the parameters (and correspondingly more room to store knowledge).

💡 The fundamental trade-off

Dense model: 70B parameters × 100% activated = ~70B multiply-adds per token. Every parameter for every token: predictable quality, high compute cost.

MoE model: 140B total parameters × 12.5% activated (2 of 16 experts) = ~17.5B active parameters per token. That is 2× the parameters of the dense model at roughly a quarter of its per-token compute (ignoring attention and other shared layers, which are always active). The catch: which expert to activate for which token must be learned.
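The arithmetic behind this trade-off can be checked in a few lines. This is a simplified sketch that assumes every parameter lives inside an expert; the helper name `active_params` is illustrative and not from any library.

```python
def active_params(total_params: float, num_experts: int, top_k: int) -> float:
    """Parameters touched per token, assuming all parameters sit in experts."""
    return total_params * top_k / num_experts

dense_params = 70e9                       # dense model: everything is active
moe_active = active_params(140e9, num_experts=16, top_k=2)

print(f"MoE active params per token: {moe_active / 1e9:.1f}B")
print(f"MoE compute relative to dense 70B: {moe_active / dense_params:.2f}x")
print(f"MoE parameters relative to dense 70B: {140e9 / dense_params:.1f}x")
```

In a real transformer the attention layers and embeddings are always active, so the true active fraction is somewhat higher than K/E.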

The Human Analogy

Think of a hospital. A patient arrives with a heart condition. The hospital doesn't consult every specialist simultaneously — the cardiologist, the radiologist, and the surgical team are activated. The dermatologist and the pediatric oncologist are not. The total institutional knowledge is vast, but each case activates only the relevant experts.

For language: a token about quantum mechanics routes to "physics" experts; a Python code token routes to "programming" experts. The model learns this routing — not by explicit labeling but entirely through gradient descent on the task objective.

Parameters vs FLOPs: Two Axes of Scale

Dense scaling conflates two things that MoE separates:

📚 Parameters = Knowledge capacity

More parameters = more facts memorized, more patterns stored, more languages and domains handled. Scales well with dataset size. MoE lets you scale this cheaply in compute: adding experts grows the memory footprint but leaves per-token FLOPs almost unchanged.

⚡ FLOPs = Computation per token

More active FLOPs = richer per-token processing. You always need enough FLOPs to reason well. MoE keeps active FLOPs fixed while expanding knowledge capacity. This is the key insight: the two axes scale differently.

History

MoE is not a 2023 invention. The core idea is over 30 years old. What changed is scale, hardware, and the realization that sparsity is essential.

1991

Original MoE — Jacobs, Jordan, Nowlan, Hinton

The founding paper. Small feedforward networks for simple classification tasks. Key idea: multiple "expert" networks + a "gating" network that decides which expert to use. Competitive mixture via softmax gating. All experts active (soft MoE, no sparsity yet).

1994

Hierarchical MoE — Jordan & Jacobs

Tree-structured expert decomposition. Different levels of hierarchy handle coarse vs fine aspects of the input. Precursor to modern fine-grained expert hierarchies.

2017

Sparsely-Gated MoE — Shazeer et al. (Google Brain)

The breakthrough. "Outrageously Large Neural Networks." Introduced sparse top-K routing: only the K highest-scoring experts are activated for each input. Trained MoE LSTMs with up to 137B parameters on language modeling and machine translation, beating previous models at a fraction of the compute. Introduced auxiliary load-balancing losses. This is the template all modern MoE follows.

2021

Switch Transformer — Fedus, Zoph, Shazeer (Google)

Applied MoE to T5-style transformers at scale. Key insight: Top-1 routing (only 1 expert per token) is simpler, faster, and surprisingly effective. Introduced the capacity factor and expert dropout. Reported up to 7× faster pre-training than T5-Base at equal compute; the 1.6T-parameter model reached T5-XXL quality about 4× faster.

2021

GLaM — Du et al. (Google)

1.2 trillion total parameters, about 97B activated per token (64 experts per MoE layer, 2 activated). Matched or exceeded GPT-3 quality on average while using about 1/3 of the training energy. Demonstrated that MoE scales to trillion-parameter regimes.

Dec 2023

Mixtral 8x7B — Mistral AI

The model that brought MoE into mainstream open-source. 8 experts per layer, 2 activated. 46.7B total parameters, ~12.9B active per token. Matches or outperforms Llama 2 70B on most benchmarks with roughly 5× fewer active parameters. Fully open weights. Mixtral 8x22B followed in 2024.

Jan 2024

DeepSeek-MoE — DeepSeek AI

Two innovations: (1) Fine-grained experts: many small experts instead of few large ones, giving the router more flexibility. (2) Shared experts: a subset always activated, capturing knowledge that all tokens need. DeepSeekMoE 16B matched LLaMA2 7B while using only about 40% of the compute.

2024–25

Grok-1, DBRX, DeepSeek-V2/V3/R1

Grok-1 (xAI): 314B total, 86B active. DBRX (Databricks): 132B total, 36B active. DeepSeek-V3/R1: 671B total, 37B active, among the strongest open-weight models as of early 2025. MoE is now a common architecture choice for frontier models.

Architecture — Dense FFN vs MoE Layer

MoE only changes one thing in the standard transformer: the feed-forward network (FFN) layer. Everything else — attention, layer norms, residual connections, tokenization — stays identical.

Standard Transformer FFN (what you have now)

In a standard transformer, every FFN layer has two weight matrices: W₁ ∈ ℝ^d×d_ff and W₂ ∈ ℝ^d_ff×d where d_ff = 4d typically.

\text{FFN}(x) = \text{GELU}(x W_1)\, W_2, \qquad W_1 \in \mathbb{R}^{d \times d_{ff}},\; W_2 \in \mathbb{R}^{d_{ff} \times d}

Standard dense FFN. All 2 × d × d_ff parameters are activated for every token.

MoE Layer (the replacement)

Replace the single FFN with E identical-architecture FFNs (the "experts"), plus a small router network.

h = \sum_{i \in \mathcal{T}_K(x)} g_i(x) \cdot E_i(x)

MoE output. T_K(x) = indices of top-K experts for input x. g_i = gating weight (how much to weight expert i's output).

g_i(x) = \frac{\exp(s_i)}{\sum_{j \in \mathcal{T}_K} \exp(s_j)}, \quad s = x W_g \in \mathbb{R}^E, \quad \mathcal{T}_K = \operatorname{TopK}(s, K)

Gating weights. W_g ∈ ℝ^{d×E} is the router (a tiny linear layer). s are per-expert scores. TopK selects the K highest. Weights normalized via softmax over selected experts only.

The Key Numbers

MoE vs Dense: Parameter and FLOPs accounting

Quantity	Formula	Notes
Total params	E × (2d × d_ff)	E times more than dense. A 7B dense FFN becomes 56B with E=8. But only 2 experts activate per token.
Active params/token	K × (2d × d_ff)	K/E fraction of parameters activated. K=2, E=8 → 25% active. Same FFN FLOPs as a dense FFN with 2× the hidden width, i.e. one quarter of the layer's stored parameters.
Router params	d × E	Tiny. For d=4096, E=8: 32,768 params vs billions in the experts. Routing cost is negligible.
Memory (inference)	All E experts	Must load all experts into memory even though only K activate. Memory cost = full model. Compute cost = K/E fraction.
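The accounting in this table can be reproduced for one MoE layer with a short helper. This is a sketch under the document's two-matrix FFN assumption (W₁ and W₂ only, d_ff = 4d); the function name `moe_ffn_accounting` is made up for illustration.

```python
def moe_ffn_accounting(d: int, d_ff: int, num_experts: int, top_k: int) -> dict:
    """Parameter counts for a single MoE layer with two-matrix expert FFNs."""
    per_expert = 2 * d * d_ff                  # W1 (d x d_ff) plus W2 (d_ff x d)
    return {
        "per_expert": per_expert,
        "total": num_experts * per_expert,     # what must sit in memory
        "active": top_k * per_expert,          # what one token actually uses
        "router": d * num_experts,             # W_g
        "active_fraction": top_k / num_experts,
    }

stats = moe_ffn_accounting(d=4096, d_ff=4 * 4096, num_experts=8, top_k=2)
for name, value in stats.items():
    print(f"{name:>16}: {value:,}" if isinstance(value, int) else f"{name:>16}: {value:.2%}")
```

The router's 32,768 parameters are about four orders of magnitude smaller than a single expert here, which is why routing cost is usually ignored in FLOP estimates.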

Figure 1 — Dense FFN vs MoE Layer: Architecture & Token Flow

Left: a dense FFN — all weights activate for every token. Right: a MoE layer with E=8 experts — the router scores each expert, selects the top-2, and the token is processed only by those two. The other 6 experts do zero computation for this token. The outputs of the two active experts are weighted and summed before passing to the next layer.

The Router — How Experts Are Selected

The router is the brain of a MoE model. It is a tiny linear layer (a d × E weight matrix mapping a token in ℝ^d to E scores) trained end-to-end with the rest of the model. At inference, it produces one score per expert and selects the top K.

Step-by-Step Routing for One Token

Compute router scores

For input token embedding x ∈ ℝ^d, compute s = x · W_g ∈ ℝ^E. One score per expert. W_g is a (d × E) weight matrix — tiny compared to the experts themselves.

Select top-K experts

Take the K indices with the highest scores: 𝒯_K = TopK(s, K). For K=2, E=8: select the 2 highest-scoring experts. The other 6 are completely ignored for this token — no computation, no gradient flow.

Compute gating weights

Apply softmax over only the selected K scores: g_i = exp(s_i) / Σ_j∈𝒯 exp(s_j). These are the combination weights — a higher weight means the expert's output contributes more to the final result.

Process through K experts

Send x to each selected expert E_i. Each expert independently computes its FFN output: E_i(x) = GELU(xW_1,i)W_2,i. These K forward passes are the main compute cost of the layer.

Weighted combination

Sum the K expert outputs weighted by gating scores: h = Σ_i∈𝒯 g_i · E_i(x). This h replaces what a standard FFN would have produced. Add it to the residual stream as normal.
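The five steps above can be sketched in dependency-free Python. This is a toy illustration, not a production kernel: the router scores are passed in directly instead of being computed as x · W_g, and the "experts" are stand-in functions that just scale their input.

```python
import math

def top_k_route(scores: list[float], k: int) -> dict[int, float]:
    """Steps 2-3: pick the k highest scores and softmax over only those."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)                      # subtract max for stability
    exps = {i: math.exp(scores[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

def moe_forward(x: list[float], scores: list[float], experts: list, k: int):
    """Steps 4-5: run the selected experts and sum their gated outputs."""
    gates = top_k_route(scores, k)
    out = [0.0] * len(x)
    for i, g in gates.items():
        y = experts[i](x)                                # only k experts are ever called
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, gates

# Hypothetical experts: expert i multiplies its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(4)]
scores = [0.2, 1.5, -0.3, 1.5]                           # step 1 result, given directly

h, gates = moe_forward([1.0, 2.0], scores, experts, k=2)
print(gates)   # experts 1 and 3 tie, so each gets weight 0.5
print(h)       # 0.5 * [2, 4] + 0.5 * [4, 8]
```

Real implementations batch tokens per expert so that each expert runs one large matrix multiply instead of many small ones.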

Figure 2 — Routing Mechanism: Scores, Top-K Selection, Weighted Combination

Interactive routing visualization. The bar chart shows router scores for each of the 8 experts for a given input token. The top-2 (highlighted) are selected. Their scores are normalized via softmax to produce gating weights. The token flows to only those 2 experts, and their outputs are combined proportionally.

Expert Choice Routing (alternative)

Standard top-K routing lets tokens choose experts. Expert Choice routing (Zhou et al., 2022) inverts this: each expert picks its top-C tokens from the batch. This guarantees perfect load balance by design — every expert processes exactly C tokens. The downside: a token might not be selected by any expert, requiring a pass-through with just the residual.

💡 Token choice vs Expert choice

Token choice (standard): Each token picks its top-K experts. Popular experts become overloaded. Load imbalance is the main training challenge. Requires auxiliary loss to fix.

Expert choice: Each expert picks its top-C tokens. Perfect balance guaranteed. Some tokens skipped. More complex gradient flow. Whether it is used in production frontier models is not publicly documented.
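A small sketch makes the inversion concrete. It assumes a precomputed token-by-expert affinity matrix (in the Expert Choice paper these come from the router's softmax; raw numbers are used here for readability), and the function name `expert_choice` is illustrative.

```python
def expert_choice(scores: list[list[float]], capacity: int) -> list[list[int]]:
    """scores[t][e] is the affinity of token t for expert e.
    Each expert keeps its `capacity` highest-scoring tokens."""
    num_tokens, num_experts = len(scores), len(scores[0])
    picks = []
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        picks.append(sorted(ranked[:capacity]))
    return picks

scores = [
    [0.9, 0.1],   # token 0
    [0.8, 0.3],   # token 1
    [0.7, 0.2],   # token 2
    [0.1, 0.6],   # token 3
]
picks = expert_choice(scores, capacity=2)
covered = {t for tokens in picks for t in tokens}
unrouted = sorted(set(range(len(scores))) - covered)

print(picks)      # every expert holds exactly 2 tokens
print(unrouted)   # token 2 was chosen by no expert and only gets the residual path
```

Note that token 1 is processed by both experts while token 2 is processed by none, which is exactly the trade-off described above.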

Load Balancing — The Central Training Challenge

Without explicit regularization, MoE training collapses: a few popular experts attract all the tokens, get the most gradient signal, improve the most, attract even more tokens. This positive feedback loop — the Matthew effect — leads to expert collapse within the first few thousand steps.

Figure 3 — Expert Utilization: Without vs With Load Balancing Loss

Token counts per expert across a training batch. Without auxiliary loss (left): within 500 steps, experts 2 and 5 capture 70%+ of all tokens. The other experts are essentially dead — they receive little gradient and fail to specialize. With auxiliary loss (right): all 8 experts receive roughly equal token counts. Each expert trains on a representative sample and learns a distinct specialization.

The Auxiliary Load Balancing Loss

Shazeer et al. (2017) first added balancing losses; the Switch Transformer introduced the simplified form that is now standard. Let f_i = fraction of tokens routed to expert i in the batch, and P_i = mean router probability assigned to expert i. The auxiliary loss penalizes uneven distributions:

\mathcal{L}_{aux} = \alpha \cdot E \sum_{i=1}^{E} f_i \cdot P_i

Switch Transformer auxiliary loss. α ≈ 0.01. E experts. f_i = fraction of tokens sent to expert i. P_i = fraction of router probability on expert i. Minimized when all f_i = P_i = 1/E.

Two signals penalized simultaneously: f_i is based on hard routing decisions (non-differentiable), P_i is differentiable through the softmax. The product connects a gradient path to the otherwise hard routing decision.
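A quick numeric check of the loss on two hand-picked distributions shows how it separates balanced from collapsed routing. This is a minimal sketch with f_i and P_i given directly as lists (in training they are computed per batch); α = 0.01 follows the value quoted above.

```python
def switch_aux_loss(f: list[float], p: list[float], alpha: float = 0.01) -> float:
    """alpha * E * sum_i f_i * P_i for one MoE layer."""
    num_experts = len(f)
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))

uniform = switch_aux_loss([0.25] * 4, [0.25] * 4)
collapsed = switch_aux_loss([0.7, 0.1, 0.1, 0.1], [0.7, 0.1, 0.1, 0.1])

print(f"uniform routing:   {uniform:.4f}")    # alpha * 1.0
print(f"collapsed routing: {collapsed:.4f}")  # alpha * 2.08
```

Under uniform routing the sum equals 1/E, so the loss bottoms out at α regardless of the number of experts.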

Expert Capacity

A second mechanism: each expert has a fixed capacity — the maximum tokens it can process per batch.

\text{capacity} = \left\lfloor\frac{\text{tokens\_per\_batch}}{E} \times C\right\rfloor

C = capacity factor, typically 1.0–1.25. C=1 gives each expert room for exactly tokens_per_batch/E tokens, which only suffices under perfectly uniform routing. C=1.25 allows a 25% overflow buffer.

If more tokens are routed to an expert than its capacity allows, the excess tokens are dropped — they pass through via the residual connection without any expert processing. This adds a small error but prevents any single expert from becoming a bottleneck. Monitoring the "overflow rate" (fraction of dropped tokens) is a key training health metric.

⚠️ Overflow rate as a training signal

Healthy MoE training: overflow < 1–2% of tokens. Overflow > 5% means the auxiliary loss weight α is too low — increase it. Overflow = 0 with a very high α means the auxiliary loss is dominating the task objective — the router becomes "random" and experts can't specialize. Tune α carefully.
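The capacity formula and the overflow rate can be computed directly from a list of routing decisions. The sketch below assumes top-1 routing (one expert index per token) and uses illustrative helper names; it is not taken from any particular training framework.

```python
import math
from collections import Counter

def expert_capacity(tokens_per_batch: int, num_experts: int, capacity_factor: float) -> int:
    """floor(tokens_per_batch / E * C), as in the formula above."""
    return math.floor(tokens_per_batch / num_experts * capacity_factor)

def overflow_rate(assignments: list[int], num_experts: int, capacity_factor: float) -> float:
    """Fraction of tokens dropped because their expert was already full."""
    cap = expert_capacity(len(assignments), num_experts, capacity_factor)
    counts = Counter(assignments)
    dropped = sum(max(0, n - cap) for n in counts.values())
    return dropped / len(assignments)

# 8 tokens, 4 experts, expert 0 is heavily over-subscribed.
assignments = [0, 0, 0, 0, 0, 1, 2, 3]
cap = expert_capacity(len(assignments), num_experts=4, capacity_factor=1.25)
rate = overflow_rate(assignments, num_experts=4, capacity_factor=1.25)

print(f"capacity per expert: {cap}")       # floor(2 * 1.25) = 2
print(f"overflow rate: {rate:.1%}")        # 3 of 8 tokens dropped
```

An overflow rate this high would sit far above the 5% warning threshold and point to a badly unbalanced router.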

DeepSeek's Device-Level Balancing

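Standard auxiliary loss balances experts globally. But if 8 experts run on 8 different GPUs, what matters for efficiency is that each GPU has roughly equal work. DeepSeekMoE and DeepSeek-V2 add a device-level balance loss that equalizes token counts per device, not just per expert. This prevents communication bottlenecks in distributed training even when global expert balance looks good. DeepSeek-V3 goes a step further and replaces most of the auxiliary loss with a per-expert bias on the routing scores, nudged up or down during training to keep the load even.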

Key Models — The MoE Taxonomy

Model	Year	Total Params	Active/Token	Experts	K	Key Innovation
Switch Transformer	2021	1.6T	~7B	2048	1	Top-1 routing, capacity factor, T5-based
GLaM	2021	1.2T	~97B	64	2	Decoder-only at scale; about 1/3 of GPT-3's training energy
ST-MoE	2022	269B	32B	64	2	Stable MoE training (router z-loss); encoder+decoder
Mixtral 8x7B	2023	46.7B	12.9B	8	2	Open weights, outperforms Llama 2 70B
Mixtral 8x22B	2024	141B	39B	8	2	Stronger open MoE, instruction-tuned variants
Grok-1	2024	314B	86B	8	2	Open weights, xAI; MoE with standard architecture
DBRX	2024	132B	36B	16	4	16 experts, 4 active; more routing flexibility
DeepSeek-MoE 16B	2024	16.4B	2.8B	64 routed + 2 shared	6	Fine-grained experts + shared experts
DeepSeek-V3	2024	671B	37B	256	8	State-of-art; Multi-head Latent Attention + MoE
OLMoE-1B-7B	2024	6.9B	1B	64	8	Open-source, fully transparent training

Three Architectural Families

Standard MoE (Switch, Mixtral)

E experts, K activated. K=1 (Switch) or K=2 (Mixtral). Large experts with full d_ff. Simple to implement, good baseline. Prone to load imbalance when K=1.

Fine-Grained MoE (DeepSeek)

Many small experts (E=64–256) with reduced d_ff. Router has more flexibility — can combine many smaller specialists. Better coverage of input diversity. DeepSeek-MoE adds shared experts on top.

Hybrid MoE (DeepSeek-V2/V3)

Shared experts (always active) + routed experts (top-K). Shared experts capture universal knowledge; routed experts capture specialized patterns. Also combines with Multi-head Latent Attention for attention efficiency.

DeepSeek-MoE: Fine-Grained + Shared Experts

DeepSeek's key innovations deserve detail. Instead of 8 large experts, they use 64 small experts with d_ff = d_ff_dense / 8. Plus 2 "shared" experts that always activate regardless of routing. The layer output becomes:

h = \underbrace{\sum_{i=1}^{N_s} E_i^{(s)}(x)}_{\text{shared experts (always active)}} + \underbrace{\sum_{i \in \mathcal{T}_K} g_i \cdot E_i^{(r)}(x)}_{\text{top-K routed experts}}

DeepSeek-MoE output. N_s shared experts always fire. K of N_r routed experts selected per token. Shared experts capture universal patterns; routed experts specialize.

Why fine-grained helps: With 8 large experts, each token gets 2 experts = 25% of FFN capacity. With 64 small experts (each 1/8 the size), selecting K=6 gives 6/64 = 9.4% — but more flexibly combined from a richer set of specialists. The router can pick the precise combination of micro-specialties needed.
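The "richer set of specialists" claim can be quantified by counting how many distinct expert combinations the router can choose from. This sketch uses only `math.comb` from the standard library and the E and K values quoted in the paragraph above.

```python
from math import comb

coarse = comb(8, 2)     # 8 large experts, pick 2
fine = comb(64, 6)      # 64 small experts, pick 6

print(f"coarse (E=8, K=2):  {coarse:,} possible expert sets")
print(f"fine   (E=64, K=6): {fine:,} possible expert sets")
print(f"active fraction: coarse {2 / 8:.1%}, fine {6 / 64:.1%}")
```

The fine-grained layer activates a smaller share of its parameters yet offers millions of times more ways to combine experts, which is the flexibility argument in numbers.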

Expert Specialization — What Do Experts Actually Learn?

Does routing create genuine specialization, or is it arbitrary? The evidence is mixed but real: experts do develop consistent specializations without any labels or explicit objectives, though the patterns are often syntactic or token-level rather than clean topic domains.

What Specialization Has Been Observed

Domain specialization

Different experts can handle different knowledge domains, but how strongly depends on the model. OLMoE's routing analysis reports experts with clear domain preferences, whereas the Mixtral paper found no obvious topic-level pattern in its routing and observed mostly syntactic structure instead.

Syntactic specialization

Some experts handle punctuation, structural tokens, and formatting. Others handle content words. Common function words (the, of, and) often route to a shared "syntax" expert.

Position specialization

Early tokens in a sequence and late tokens can route to different experts. Some experts specialize in context-setting (beginning of sentence); others in consequence/conclusion tokens.

Frequency specialization

High-frequency tokens (common words) often route to a small set of generalist experts. Low-frequency tokens (specialized terms) route to more diverse, narrower experts. This mirrors how information is distributed in language.

Figure 4 — Expert Specialization in a Text Generation Model

Simulated expert routing patterns for an LLM with 8 experts. Different token types route to different experts: output operation tokens (query, read_doc, citation) → Experts 1–3; numeric parameters (h=20, r=3, d=1.5) → Expert 5; entity references (query_0, source_0, result_0) → Expert 4; structural tokens (; , ( )) → Expert 7; boolean/assembly operations → Expert 6. Expert 8 acts as a generalist fallback.

🧬 How specialization emerges through training

Initially all experts are statistically interchangeable (independent random initializations of the same architecture), and the router is random too. As training proceeds, expert 1 might handle "output operations" slightly better than the others by random chance. The router learns to route those tokens there. Expert 1 receives more gradient from output examples, becoming even better at them. Other experts, freed from output tokens, specialize elsewhere. Positive feedback creates stable specialization within thousands of steps, entirely without supervision.

Training Challenges

1. Instability and the Router Cold Start

Early in training, the router is undertrained. It routes randomly, which means expert outputs have high variance. Two mitigations used in practice:

⚡ Router z-loss (ST-MoE)

Adds a penalty on the magnitude of router logits: L_z = (1/B)Σ_x (log Σ_e exp(s_e(x)))². Prevents logits from becoming very large or small, keeping gradients stable through the router early in training.

🧊 Expert dropout

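In the Switch Transformer, expert dropout means a much higher dropout rate inside the expert FFNs (0.4) than in the rest of the network (0.1), used mainly during fine-tuning, where sparse models overfit easily. A related variant randomly drops entire expert outputs (set to zero before adding to the residual), which forces the model not to over-rely on any single expert. Rates: 0.1–0.4.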
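The router z-loss formula above is short enough to evaluate by hand. The sketch below implements L_z for a small batch of router logits in plain Python; the two inputs are chosen so that both produce the same softmax, which isolates what the penalty actually targets.

```python
import math

def router_z_loss(logits_per_token: list[list[float]]) -> float:
    """Mean over tokens of (log sum_e exp(s_e))^2."""
    total = 0.0
    for s in logits_per_token:
        m = max(s)                                            # stable log-sum-exp
        lse = m + math.log(sum(math.exp(v - m) for v in s))
        total += lse ** 2
    return total / len(logits_per_token)

small = router_z_loss([[0.0, 0.0, 0.0, 0.0]])       # (log 4)^2
large = router_z_loss([[10.0, 10.0, 10.0, 10.0]])   # (10 + log 4)^2

print(f"logits near zero: {small:.3f}")
print(f"logits near ten:  {large:.3f}")
```

Both inputs give identical routing probabilities, yet the second is penalized far more: the z-loss constrains logit magnitude, not the routing decision itself.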

2. Expert Collapse

Even with auxiliary loss, expert collapse can happen. Signs: some experts have near-zero utilization after 10k+ steps. Causes: learning rate too high (router commits too early), auxiliary loss coefficient too low, or experts initialized identically (router has no initial preference to exploit).

Fix: Initialize expert weights with small noise (not identically). Use different random seeds per expert for the first few layers. This breaks symmetry and gives the router something to differentiate on.

3. Gradient Imbalance

Experts that receive more tokens get more gradient signal and train faster. This compounds the load imbalance problem — experts that happen to get more tokens early continue to attract them. The auxiliary loss mitigates but doesn't eliminate this. At very large batch sizes, even small imbalances in f_i compound significantly.

4. Communication Overhead in Distributed Training

In expert parallelism, different experts live on different GPUs. Routing a token to an expert on a different GPU requires all-to-all communication — every GPU sends some tokens to every other GPU. At scale, this communication becomes a significant overhead.

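\text{communication volume} \approx 2 \times K \times \frac{E-1}{E} \times |\text{batch}| \times d \times \text{bytes\_per\_elem}

Approximate all-to-all volume per MoE layer, assuming one expert per device and uniform routing. Each token is dispatched to K experts, and a fraction (E−1)/E of those dispatches lands on another device. With K=2, E=8 on 8 GPUs: each token sends about 1.75 copies of its d-dimensional activation off-device, and the factor of 2 covers dispatch plus the return of the expert outputs.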
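Cross-device traffic can also be counted directly from a concrete routing assignment, which is a useful sanity check on any closed-form estimate. The sketch below assumes one expert per device and a hand-built toy assignment; all names are illustrative.

```python
def all_to_all_bytes(token_device, token_experts, expert_device, d, bytes_per_elem):
    """Bytes moved between devices for one MoE layer (dispatch plus return)."""
    remote = sum(
        1
        for t, experts in enumerate(token_experts)
        for e in experts
        if expert_device[e] != token_device[t]      # dispatch leaves the token's device
    )
    return 2 * remote * d * bytes_per_elem, remote

num_devices = 4
token_device = [t % num_devices for t in range(8)]         # 8 tokens spread over 4 devices
expert_device = list(range(num_devices))                   # expert e lives on device e
token_experts = [[t % 4, (t + 1) % 4] for t in range(8)]   # top-2: one local, one remote

volume, remote = all_to_all_bytes(token_device, token_experts, expert_device,
                                  d=4096, bytes_per_elem=2)
print(f"remote dispatches: {remote} of {2 * len(token_experts)}")
print(f"all-to-all volume: {volume:,} bytes")
```

In this toy layout half of the dispatches stay local. Under uniform routing the remote share tends toward (E−1)/E, so real traffic is usually higher.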

DeepSeek-V2 and V3 address this by limiting how many devices (V2) or nodes (V3) a token's experts may span, so that each token's K experts are drawn from a small number of machines. Also: limit K to prevent communication from dominating.

Full Training Recipe

moe_training_recipe.py

from transformers import MixtralConfig, MixtralForCausalLM
import torch

# ── MoE Configuration ──────────────────────────────────────────────────────
config = MixtralConfig(
    num_local_experts=8,              # E: total experts per layer
    num_experts_per_tok=2,            # K: experts activated per token
    router_aux_loss_coef=0.02,        # alpha: auxiliary loss weight
    output_router_logits=True,        # return per-layer router logits in the model output
    hidden_size=4096,                 # d: model dimension
    intermediate_size=14336,          # d_ff: expert FFN hidden dim
    num_hidden_layers=32,
    num_attention_heads=32,
)

# ── Loss Computation ────────────────────────────────────────────────────────
def layer_aux_loss(router_logits, num_experts, top_k):
    """Switch-style balancing loss for one layer. router_logits: (num_tokens, E)."""
    routing_weights = torch.softmax(router_logits.float(), dim=-1)
    _, selected = torch.topk(routing_weights, k=top_k, dim=-1)      # (num_tokens, K)

    # f_i: fraction of routing slots assigned to expert i (hard counts, no gradient)
    tokens_per_expert = torch.bincount(selected.reshape(-1), minlength=num_experts).float()
    f_i = tokens_per_expert / tokens_per_expert.sum()               # normalize

    # P_i: mean router probability for expert i (differentiable)
    P_i = routing_weights.mean(dim=0)                               # (E,)

    # Auxiliary loss: E × Σ f_i × P_i (minimized at uniform distribution)
    return num_experts * (f_i * P_i).sum()


def compute_moe_loss(task_loss, router_logits_per_layer, config):
    """Total loss = task loss + alpha × auxiliary balancing loss over all MoE layers.

    task_loss should be the bare language-modeling loss. With output_router_logits=True
    and labels passed in, MixtralForCausalLM already folds its own auxiliary term into
    outputs.loss, so use either that or this function, not both.
    """
    aux_losses = [
        layer_aux_loss(logits, config.num_local_experts, config.num_experts_per_tok)
        for logits in router_logits_per_layer   # e.g. outputs.router_logits, one tensor per layer
    ]
    total_aux = config.router_aux_loss_coef * torch.stack(aux_losses).sum()
    return task_loss + total_aux

# ── Monitoring ──────────────────────────────────────────────────────────────
def log_routing_stats(router_logits, step, num_experts=8, top_k=2, overflow_rate=None):
    """Track expert utilization, which is critical for diagnosing MoE health.

    overflow_rate is whatever your training loop measures for dropped tokens;
    pass None to skip that line.
    """
    _, selected = torch.topk(router_logits, k=top_k, dim=-1)
    counts = torch.bincount(selected.reshape(-1), minlength=num_experts).float()
    share = counts / counts.sum()                  # fraction of routing slots per expert
    ratio = (share.max() / share.min().clamp_min(1e-9)).item()   # guard against dead experts

    print(f"Step {step} | Expert utilization: {share.tolist()}")
    print(f"  Max/min ratio: {ratio:.2f}x (healthy: <3x, concerning: >10x)")

    # Overflow rate: tokens dropped due to capacity limits
    if overflow_rate is not None:
        print(f"  Overflow rate: {overflow_rate:.1%} (healthy: <2%, problem: >5%)")

Inference — The Memory–Compute Trade-off

MoE models have a fundamental inference challenge: all expert weights must live in memory even though only K/E are used per token. You pay full memory cost for the parameter advantage, but only partial compute cost.

Memory vs Compute

Model	Memory (FP16)	Active FLOPs/token	KV cache	Throughput vs dense-equivalent
Dense 7B	~14 GB	7B	small	baseline
Mixtral 8x7B	~87 GB (all experts)	12.9B	same	~3-4× higher throughput at same latency
Dense 70B	~140 GB	70B	large	baseline
DeepSeek-V3 (671B)	~1.3 TB (all experts)	37B	compressed (MLA)	~5-8× higher throughput than dense-70B

💻 Practical deployment implications

Single GPU: Mixtral 8x7B needs ~87 GiB (about 93 GB) in FP16, so it takes 2×A100 (80GB) or offloading. A 4-bit quantized model is ~23 GB and fits on a single A100. Quantization pays off disproportionately for MoE because memory footprint, not compute, is the binding constraint.

Speculative decoding: Applicable to MoE, but verifying several draft tokens in one pass activates the union of their experts, so the speedup can be smaller than for a dense model with the same active parameter count.

Expert offloading: For very large MoE models, keep the most frequently used experts in GPU memory and offload the rest to CPU, fetching them on demand. Adds latency on expert swaps but makes very large models runnable on smaller hardware.
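The memory figures in this section follow from parameter count times bytes per parameter. The sketch below covers weights only (no KV cache, activations, or quantization metadata) and reports both decimal GB and binary GiB, since the ~87 figure for Mixtral corresponds to GiB. The helper name is illustrative.

```python
def weight_memory(total_params: float, bits_per_param: int) -> tuple[float, float]:
    """Return (GB, GiB) needed to hold the weights alone."""
    num_bytes = total_params * bits_per_param / 8
    return num_bytes / 1e9, num_bytes / 2**30

mixtral_fp16 = weight_memory(46.7e9, 16)
mixtral_int4 = weight_memory(46.7e9, 4)
deepseek_fp16 = weight_memory(671e9, 16)

for name, (gb, gib) in [("Mixtral 8x7B FP16", mixtral_fp16),
                        ("Mixtral 8x7B 4-bit", mixtral_int4),
                        ("DeepSeek-V3 FP16", deepseek_fp16)]:
    print(f"{name:>20}: {gb:8.1f} GB  ({gib:8.1f} GiB)")
```

All experts count toward this total regardless of K, which is why the active-parameter number says little about how much hardware a MoE model needs.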

Batch Size and Throughput

MoE models benefit disproportionately from larger batch sizes. When batch = 1 (chat), each expert processes at most a handful of tokens — very inefficient. When batch = 512 (serving), each expert processes many tokens in parallel. The K/E throughput advantage is fully realized only at large batch sizes. For interactive applications, MoE is less efficient than its FLOP numbers suggest.

MoE and Post-Training RL

This is where the RL guides and MoE connect. The same PPO/GRPO algorithms apply to MoE models, but there are specific complications.

The Routing Instability Problem in GRPO

Recall from the Beyond PPO guide: GRPO's probability ratio is computed at the token level:

r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{old}(a_t \mid s_t)}

For a dense model, this ratio is stable across PPO epochs — the same weights produce the same logits for the same input. For a MoE model, this ratio has an additional source of variance: routing decisions can change between the numerator and denominator evaluations.

When π_θ and π_old route a token to different experts, the numerator and denominator compute through different parameter subsets. The probability ratio can be large even when the intended "change in probability" is small. This is one of the main motivations behind GSPO.

s_i(\theta) = \exp\!\left(\frac{1}{|y_i|}\sum_t \log\frac{\pi_\theta(a_t|s_t)}{\pi_{old}(a_t|s_t)}\right)

GSPO's sequence-level ratio: the geometric mean of the token-level ratios across the whole response. A single noisy routing decision at one token is diluted by the average. MoE models specifically benefit from GSPO over GRPO.
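The dilution effect is easy to see numerically. In this sketch the per-token log-probabilities are made up: nine tokens are unchanged between the old and new policy, and one token's probability jumps 5× (standing in for a routing flip). Function names are illustrative.

```python
import math

def token_ratios(logp_new: list[float], logp_old: list[float]) -> list[float]:
    """GRPO-style per-token importance ratios."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def sequence_ratio(logp_new: list[float], logp_old: list[float]) -> float:
    """GSPO-style ratio: exp of the mean per-token log-ratio."""
    mean_log = sum(n - o for n, o in zip(logp_new, logp_old)) / len(logp_new)
    return math.exp(mean_log)

logp_old = [-3.0] * 10
logp_new = [-3.0] * 9 + [-3.0 + math.log(5.0)]   # one outlier token

per_token = token_ratios(logp_new, logp_old)
per_sequence = sequence_ratio(logp_new, logp_old)

print(f"largest token-level ratio: {max(per_token):.2f}")   # far outside any clip range
print(f"sequence-level ratio:      {per_sequence:.3f}")     # 5 ** (1/10)
```

The single 5× outlier would be clipped hard in a token-level objective, while the sequence-level ratio moves only about 17% away from 1.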

🔗 The GSPO–MoE connection (from the Beyond PPO guide)

GSPO was explicitly motivated by MoE instability. It comes from Alibaba's Qwen team, who report that it stabilizes RL training of their MoE models. The sequence-level ratio dilutes the noise from individual token routing mismatches. For a very large MoE model where small routing changes cause large individual token ratios, a sequence-level ratio of this kind is what keeps GRPO-style RL stable.

Expert Load Balance During RL

A second MoE-specific challenge during RL training: the policy update can shift the input distribution enough that the routing balance, carefully tuned during pre-training, breaks. Experts that were balanced during pre-training may become unbalanced after 1000 RL steps because the policy now generates different token distributions.

Fix: Keep the auxiliary load-balancing loss active during RL training. Use a lower coefficient than pre-training (α = 0.001–0.005 vs 0.01–0.02 during pre-training). Occasionally monitor expert utilization during RL and reset the router if collapse occurs.

Knowledge Retention and Expert Forgetting

MoE experts may be less prone to catastrophic forgetting during RL than dense models. Because each expert specializes in a subset of the input space, a gradient update on task-specific tokens mainly updates the relevant experts, not the language, math, or code experts. This is a practical benefit: MoE models tend to retain broader capabilities better during task-specific RL fine-tuning.

MoE for Text Generation

A MoE architecture is a natural fit for a multimodal LLM. Response sequences contain distinctly different token types that benefit from different processing strategies.

Why MoE Suits Text Specifically

✅ Diverse token types

response sequences mix: operation tokens (query, read_doc, citation), numeric parameters (h=20, r=3.5), entity IDs (query_0, source_4), structural tokens (; , ( )), and coordinate/output values. These require fundamentally different processing — ideal for specialization.

✅ Sparse knowledge activation

A language model for mechanical parts doesn't need its "organic shape" or "textile" knowledge for every token. MoE allows those experts to exist (for generalization) without paying their compute cost on every forward pass.

Suggested Expert Configuration for a Text MoE

Expert Group	Specializes in	Example tokens	Why separate
E1–E2 Output ops	Shape construction operations	query, read_doc, revolve, loft, sweep	Different spatial reasoning from parameters
E3–E4 Finishing ops	Detail/tolerance operations	citation, check_answer, draft, shell	Different from primary construction; tolerance-aware
E5 Numerics	Semantic parameter values	h=20, r=3.5, d=1, w=40	Dense numeric reasoning; different from operation routing
E6 References	Entity identifiers	query_0, source_4, face_2, result_0	Requires tracking entity state across sequence
E7 Boolean/Assembly	Multi-body operations	boolean_subtract, union, mirror, pattern	Different topology reasoning from single-body ops
E8 Structure	Syntax and sequence	; , ( ) = TAB EOL	Pure syntax; different from semantic ops

MoE + GRPO/DAPO for Text

Combining MoE with the RL training pipeline from the previous guides:

Pre-train with MoE architecture

Train on large text dataset + general text. Let experts specialize naturally. Monitor expert utilization — output tokens should route consistently to E1–E2 within ~10k steps.

SFT on text task

Fine-tune on (image → response_text) pairs. Routing patterns stabilize further. The auxiliary loss keeps experts balanced even on the narrower text distribution.

DAPO (not GRPO) for MoE RL

Use DAPO with Clip-Higher (ε_high=0.28). This matters especially for MoE, where rare operation tokens (loft, sweep) might route to experts that haven't been updated in a while, producing large ratios. Dynamic sampling filters out prompts whose sampled group is all-correct or all-wrong, since those contribute no gradient signal.

Consider GSPO for long sequences

For sequences >200 tokens (complex assemblies), use GSPO's sequence-level ratio to dampen MoE routing noise. The geometric mean across 200 token-level ratios is much more stable than any individual ratio.

Monitor expert balance throughout RL

Add expert utilization logging to your RL training loop. If output experts (E1–E2) start capturing >50% of all tokens during GRPO, increase α temporarily to push the router back toward balance.
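The Clip-Higher step in this pipeline amounts to an asymmetric clipping range in the PPO-style surrogate. The sketch below shows it for a single token; ε_high = 0.28 is the value quoted above, while ε_low = 0.2 is an assumption (the usual PPO default, not stated in this guide), and the function name is illustrative.

```python
def clipped_surrogate(ratio: float, advantage: float,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """min(r * A, clip(r, 1 - eps_low, 1 + eps_high) * A) for one token."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

# A token whose probability rose 25% under a positive advantage:
symmetric = clipped_surrogate(1.25, 1.0, eps_high=0.2)   # clipped at 1.2
clip_higher = clipped_surrogate(1.25, 1.0)               # still inside the range

print(f"symmetric clip:  {symmetric:.2f}")
print(f"clip-higher:     {clip_higher:.2f}")
print(f"large ratio:     {clipped_surrogate(1.5, 1.0):.2f}")    # capped at 1 + eps_high
print(f"negative adv.:   {clipped_surrogate(0.5, -1.0):.2f}")   # lower bound unchanged
```

The wider upper bound lets low-probability tokens gain probability faster before clipping removes their gradient, while the lower bound stays where it was.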

🔧 Practical recommendation

For your LLM: start with a dense SFT baseline, get RL working with DAPO/GRPO, then migrate to MoE. Don't start with MoE — the additional debugging complexity (routing, load balance, overflow) on top of RL training is overwhelming. Once RL is stable, MoE gives you 2–3× parameter capacity at the same inference cost, which translates directly to better answer accuracy on complex multi-body assemblies.

References

Jacobs, R.A., Jordan, M.I., Nowlan, S.J., & Hinton, G.E. (1991). Adaptive Mixtures of Local Experts. Neural Computation. doi:10.1162/neco.1991.3.1.79
Shazeer, N. et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR 2017. arXiv:1701.06538
Fedus, W., Zoph, B., & Shazeer, N. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR 2022. arXiv:2101.03961
Du, N. et al. (2021). GLaM: Efficient Scaling of Language Models with Mixture-of-Experts. ICML 2022. arXiv:2112.06905
Zoph, B. et al. (2022). ST-MoE: Designing Stable and Transferable Sparse Expert Models. arXiv:2202.08906
Zhou, Y. et al. (2022). Mixture-of-Experts with Expert Choice Routing. NeurIPS 2022. arXiv:2202.09368
Jiang, A.Q. et al. (2024). Mixtral of Experts. arXiv:2401.04088
Dai, D. et al. (2024). DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. arXiv:2401.06066
DeepSeek-AI (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437
Muennighoff, N. et al. (2024). OLMoE: Open Mixture-of-Experts Language Models. arXiv:2409.02060
Zheng, C. et al. (2025). GSPO: Group Sequence Policy Optimization. arXiv:2507.18071 (MoE instability motivation)