How Mixture of Experts Changed GPU Clusters
Mixture-of-Experts changed GPU clusters in a simple way: models got much bigger on paper, but much smaller in compute per token. That made training cheaper and inference lighter, but it also made moving tokens between GPUs a first-order bottleneck. The practical result is a different cluster shape: more GPUs sharing fast NVLink domains, serving stacks optimized for expert routing, and more frontier models fitting inside a single 8-GPU server than their headline parameter counts suggest.
Dense to Mixture of Experts
A dense model runs every parameter (one of the billions of numerical weights that encode what the model knows) on every token (the word or subword unit the model reads and writes). A 70-billion-parameter dense model like Llama 3.1 70B uses all 70 billion for every token.
A Mixture-of-Experts (MoE) model takes a different approach: instead of one big network where every parameter fires on every token, it splits its parameters into many smaller sub-networks called experts, typically a few hundred per layer. For each token, a small dispatcher inside the model called the router picks just a few experts to do the work, while the rest sit idle for that token. DeepSeek-V4-Pro has 384 experts and routes each token to 6 of them, so only 49 billion of its 1.6 trillion parameters are active per token. Most MoE models also include one or two shared experts that fire on every token regardless of routing, in addition to the routed picks. MoE models are usually described by both numbers: 1.6T total, 49B active.
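The routing step is easy to picture in code. Below is a minimal sketch of top-k gating in plain numpy, not any specific model's implementation; the expert count and top-k mirror the DeepSeek-V4-Pro figures above, while the hidden size and router weights are made up for illustration.

```python
import numpy as np

num_experts, top_k, hidden_dim = 384, 6, 1024  # hidden_dim is illustrative

def route(tokens, router_weights):
    """Pick top_k experts per token from softmax router probabilities."""
    logits = tokens @ router_weights                        # [n_tokens, num_experts]
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    topk_idx = np.argsort(probs, axis=-1)[:, -top_k:]       # chosen expert ids
    topk_gate = np.take_along_axis(probs, topk_idx, axis=-1)
    topk_gate /= topk_gate.sum(axis=-1, keepdims=True)      # renormalize gate weights
    return topk_idx, topk_gate

tokens = np.random.randn(8, hidden_dim)
router_weights = np.random.randn(hidden_dim, num_experts)
idx, gate = route(tokens, router_weights)
print(idx.shape)  # (8, 6): each of 8 tokens is dispatched to 6 of the 384 experts
```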
Closed labs went MoE first: GPT-4 was reportedly MoE in 2023, GPT-5 is MoE, and Gemini and Grok are MoE as well.
Open-weight models stayed dense through mid-2024: Llama 2 (70B) and Llama 3.1 (405B) followed the GPT-3 (175B) recipe of firing every parameter on every token, with Mixtral 8x7B (December 2023) the only notable exception. That changed when DeepSeek-V3 (671 billion total, 37 billion active) arrived in late 2024 as the first open MoE at closed-frontier quality [1], followed by Llama 4 [6], Kimi K2.5 [2], Qwen 3.5 [5], GLM-5.1 [3], DeepSeek-V4 [11], and Gemma 4 [7] through 2025 and 2026.
| Model | Total params | Active params | Experts | Experts per token |
|---|---|---|---|---|
| DeepSeek-V4-Pro [11] (Q2 2026) | 1.6T | 49B | 384 | 6 routed + 1 shared |
| Kimi K2.5 [2] (Q1 2026) | 1T | 32B | 384 | 8 routed + 1 shared |
| GLM-5.1 [3] (Q2 2026) | 754B | 40B | 256 | 8 routed + 1 shared |
| Qwen 3.5 [5] (Q1 2026) | 397B | 17B | 512 | 10 routed + 1 shared |
| DeepSeek-V4-Flash [12] (Q2 2026) | 284B | 13B | 256 | 6 routed + 1 shared |
| Gemma 4 [7] (Q2 2026) | 26B | 3.8B | 128 | 8 routed + 1 shared |
Firing fewer parameters per token cuts compute at both training and inference. DeepSeek-V3 was trained for ~$5.6 million on 2,048 H800 GPUs (the export-controlled H100 variant for the China market), a small fraction of what a dense model of comparable quality would cost, because a dense equivalent would have to fire every parameter on every training token. [1]
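A back-of-envelope calculation shows where the savings come from. The sketch below uses the common ~6 x params x tokens approximation for training FLOPs and the ~14.8T training tokens reported for DeepSeek-V3 [1]; the dense comparison is purely hypothetical.

```python
# Rough training-compute comparison: MoE fires only the active parameters
# per token; a dense model of the same total size would fire all of them.
TOKENS = 14.8e12          # DeepSeek-V3 training tokens (reported) [1]
ACTIVE, TOTAL = 37e9, 671e9

moe_flops   = 6 * ACTIVE * TOKENS   # ~3.3e24 FLOPs
dense_flops = 6 * TOTAL  * TOKENS   # ~6.0e25 FLOPs

print(f"MoE:   {moe_flops:.1e} FLOPs")
print(f"Dense: {dense_flops:.1e} FLOPs ({dense_flops / moe_flops:.0f}x more)")
```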
MoE's memory-heavy, compute-light profile also makes local inference plausible on devices like the Mac Studio, which carries up to 512 GB of unified memory at ~800 GB/s. A 4-bit-quantized DeepSeek-V3-class model (each parameter stored in 4 bits, ~369 GB of weights total) fits in that envelope, and once it fits, per-token compute is set by the ~37B active parameters, not the 671B total. The bottleneck becomes memory bandwidth more than raw FLOPs.
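A rough sketch of that claim: during decode, every generated token has to stream the active weights out of memory at least once, so memory bandwidth puts a hard ceiling on tokens per second. The numbers below come from the figures above; KV-cache reads and other overheads are ignored, so real throughput is lower.

```python
# Bandwidth-bound upper limit on local decode speed for a 4-bit MoE.
bandwidth_bytes_s = 800e9      # Mac Studio unified memory, ~800 GB/s
active_params    = 37e9        # DeepSeek-V3 active parameters per token
bytes_per_param  = 0.5         # 4-bit quantization

active_bytes = active_params * bytes_per_param   # ~18.5 GB read per token
ceiling = bandwidth_bytes_s / active_bytes
print(f"~{ceiling:.0f} tokens/s upper bound")    # ~43 tokens/s
```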
Expert parallelism and all-to-all traffic
Frontier models do not fit on one GPU, dense or MoE. The established way to split a dense model is tensor parallelism, often combined with other parallelism strategies (data, pipeline, context) at larger scales: the large matrix in each block is carved into N slices across N GPUs, and every GPU computes its slice of the matrix multiplication for every token. Every GPU participates in every token.
MoE does not replace that stack, but it changes the feed-forward layer (the per-token compute that runs after attention in each block) enough that expert parallelism becomes attractive. Instead of slicing each matrix across GPUs, you assign whole experts to specific GPUs and route tokens to whichever GPUs hold the experts the router picked. Attention, shared experts, and the other dense parts of the model still run as usual.
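A minimal sketch of what that routing means mechanically: tokens are grouped by the GPU that owns each chosen expert, and those groups become the send buffers of an all-to-all exchange (a second all-to-all brings the expert outputs back). The expert count, GPU count, and contiguous expert-to-GPU placement below are illustrative, not any particular framework's layout.

```python
import numpy as np

num_experts, num_gpus, top_k = 64, 8, 2
experts_per_gpu = num_experts // num_gpus   # contiguous placement: GPU g owns experts [8g, 8g+8)

def dispatch(topk_idx):
    """Group (token, expert) pairs by destination GPU; feeds an all-to-all in a real system."""
    buckets = {g: [] for g in range(num_gpus)}
    for token, experts in enumerate(topk_idx):
        for e in experts:
            buckets[int(e) // experts_per_gpu].append((token, int(e)))
    return buckets

topk_idx = np.random.randint(0, num_experts, size=(16, top_k))  # 16 tokens, 2 experts each
for gpu, pairs in dispatch(topk_idx).items():
    print(f"GPU {gpu}: {len(pairs)} token-expert pairs to process")
```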
The communication cost is significant. In the forward pass of several popular MoE models, inter-device communication accounts for 47% of total execution time on average. [4] NVIDIA reports that in DeepSeek-V3 training, EP communication can account for over 50% of total training time without optimization. [8]
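The per-token volume is easy to estimate: each MoE layer sends every token's hidden state to its routed experts and then gathers the outputs back, so traffic scales with top-k times the hidden size, twice. The sketch below plugs in numbers that approximate DeepSeek-V3 (hidden size 7168, 8 routed experts per token, ~58 MoE layers) and assumes BF16 activations; in practice some of that traffic stays local to the GPU that already holds the expert.

```python
# Back-of-envelope all-to-all volume per token (dispatch + combine).
hidden, top_k, moe_layers, act_bytes = 7168, 8, 58, 2   # approximate DeepSeek-V3, BF16

per_layer = 2 * top_k * hidden * act_bytes   # send to experts, then combine back
per_token = per_layer * moe_layers
print(f"~{per_layer/1e3:.0f} KB per token per MoE layer")    # ~229 KB
print(f"~{per_token/1e6:.1f} MB per token across the model") # ~13 MB
```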
Hardware Implications
Rack-scale is the next step
Since communication between GPUs is the bottleneck for expert parallelism, the first goal is to keep the model inside a single NVLink domain, the fastest interconnect available. Today that means one B200 or B300 server with 8 GPUs sharing roughly 1.5-2.3 TB of HBM at 1.8 TB/s of per-GPU NVLink bandwidth. Within that envelope, serving is simple. Past it, traffic has to cross a slower network.
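A quick weight-only check of what fits, using the per-GPU HBM figures quoted in this article and assuming FP8 (1 byte per parameter) for the weights; KV cache, activations, and framework overhead would eat into whatever headroom is left.

```python
# Do current frontier MoE weights fit inside one 8-GPU NVLink domain at FP8?
models = {"DeepSeek-V4-Pro": 1.6e12, "Kimi K2.5": 1.0e12, "GLM-5.1": 754e9}
hbm_per_gpu_gb = {"B200": 192, "B300": 288}

for gpu, per_gpu in hbm_per_gpu_gb.items():
    capacity_gb = 8 * per_gpu
    for name, params in models.items():
        weights_gb = params / 1e9            # 1 byte per parameter at FP8
        verdict = "fits" if weights_gb < capacity_gb else "does not fit"
        print(f"{name}: {weights_gb:.0f} GB on 8x{gpu} ({capacity_gb} GB) -> {verdict}")
```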
Models will keep getting bigger: the open frontier is already past 1T total parameters, and a bigger generation is always close behind. The natural next step is a wider NVLink domain, which is exactly why NVIDIA is pushing rack-scale NVLink: GB200 NVL72 (72 GPUs in one NVLink domain, 13.8 TB HBM, 130 TB/s aggregate NVLink bandwidth) [9], GB300 NVL72, and Vera Rubin NVL144 in H2 2026.
The current problem with rack-scale, besides the premium for a more complicated system, is that NVL72 draws roughly 120 kW per rack and requires liquid cooling, while most colocation inventory tops out at 40-60 kW air-cooled. Retrofitting data centers for rack-scale density is a 2-3 year capex project.
Scale-out networking will keep improving
For everything that cannot run on a rack-scale system, the pressure moves to scale-out networking. Multi-node MoE deployments rely on high-bandwidth networking between servers: InfiniBand is the default reference point, while NVIDIA's own stack also targets Spectrum-X Ethernet for the same job. [8] The bandwidth gap between intra-node NVLink and inter-node networking is where performance leaks once expert traffic has to leave the box.
Memory capacity keeps rising
This one is simple: the more memory a single accelerator carries, the fewer accelerators you have to stack together. HBM per GPU keeps growing generation over generation for the same scaling-law reason: B200 at 192 GB, B300 at 288 GB, Rubin continuing the ramp.
References
- DeepSeek-AI, "DeepSeek-V3 Technical Report" (December 2024)
- Moonshot AI, "Kimi K2.5: Visual Agentic Intelligence" (February 2026)
- Z.AI, "GLM-5.1: Open-Weight 754B Agentic Model" (April 2026)
- Shulai Zhang et al., "COMET: Fine-grained Computation-communication Overlapping for Mixture-of-Experts" (MLSys 2025)
- Qwen Team, "Qwen3.5-397B-A17B" (2026)
- Meta, "Introducing Llama 4" (April 2025)
- Google DeepMind, "gemma-4-26B-A4B-it" (2026)
- NVIDIA Technical Blog, "Optimizing Communication for Mixture-of-Experts Training with Hybrid Expert Parallel" (2026)
- NVIDIA Technical Blog, "Scaling Large MoE Models with Wide Expert Parallelism on NVL72 Rack Scale Systems" (2026)
- DeepSeek-AI, "DeepSeek-V4-Pro" (2026)
- DeepSeek-AI, "DeepSeek-V4-Flash" (2026)
Frequently Asked Questions
What is a Mixture of Experts (MoE) model?
An MoE model splits its parameters into many smaller sub-networks called experts, typically a few hundred per layer. A small router inside each MoE layer picks a few experts per token while the rest sit idle for that token. DeepSeek-V4-Pro has 384 experts per layer and routes each token to 6 of them, so only 49 billion of its 1.6 trillion parameters are active per token. MoE models are usually described by both numbers: 1.6T total, 49B active.
What are the benefits of MoE over dense models?
Cheaper training and lighter inference. DeepSeek-V3 was trained for ~$5.6 million on 2,048 H800 GPUs, a small fraction of what a dense model of comparable quality would cost, because a dense equivalent has to fire every parameter on every training token. At inference, the expensive feed-forward work scales with the active parameter count rather than the total. MoE also unlocks local deployment on unified-memory devices like the Mac Studio (up to 512 GB), where memory is generous but compute is modest.
What are the tradeoffs of MoE?
Memory footprint stays the same as a dense model of equivalent total parameters, because all experts must be resident in HBM (the router can pick any of them on any token). Splitting experts across GPUs creates an all-to-all traffic pattern when tokens are routed to experts on other GPUs, which can account for ~47% of total forward-pass execution time in popular MoE models. That is why fast interconnect like NVLink and rack-scale fabrics matter so much for MoE serving.
Why do MoE models need expert parallelism instead of tensor parallelism?
Tensor parallelism splits one large matrix across GPUs, so every GPU participates in every token. Expert parallelism assigns whole experts to specific GPUs, so only the GPUs holding the chosen experts do work for a given token. EP preserves MoE sparsity; TP would force every GPU to participate in every token and shrink per-expert batches below what tensor cores can use efficiently. In practice frameworks use a hybrid: tensor or data parallelism for attention (which is dense), expert parallelism for the MoE feed-forward layers.
Which frontier models are MoE?
Every frontier model with a confirmed spec is MoE. Closed: GPT-4 was reportedly MoE in 2023, GPT-5 is MoE, and Gemini and Grok are sparse MoE as well. Anthropic has not disclosed Claude's architecture. Open-weight models stayed dense through mid-2024 (Llama 2 and Llama 3.1 followed the dense GPT-3 recipe), with Mixtral 8x7B in December 2023 as the only notable exception, until DeepSeek-V3 in late 2024 became the first open MoE at closed-frontier quality, followed by Llama 4, Kimi K2.5, Qwen 3.5, GLM-5.1, DeepSeek-V4, and Gemma 4 through 2025 and 2026.