MiMo UltraSpeed: 1,000+ Tokens per Second on a Single 8-GPU Node
Xiaomi's MiMo-V2.5-Pro-UltraSpeed is a 1-trillion-parameter Mixture-of-Experts (MoE) model that generates text at over 1,000 tokens per second, peaking at 1,200, on a single server with eight commodity GPUs. For comparison, Claude 4 Opus averages around 32 tokens/second and GPT-5.5 averages 38–51 tokens/second on their respective proprietary infrastructure.
Architecture
The 1-trillion-parameter scale is made computationally tractable by the MoE architecture. Rather than activating all parameters for every token, a gating network routes each token to the most relevant subset of expert networks. For any given token, only a fraction of the total parameters are active.
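The routing step can be illustrated with a minimal top-k gating sketch. Everything here is an assumption made for illustration: the four toy "experts", the hand-picked gate logits, and the choice of top-2 routing with renormalized weights. In a real MoE layer the logits come from a learned linear layer applied to the token, and the article does not state how many experts MiMo activates per token.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(gate_logits, top_k=2):
    """Return [(expert_index, weight), ...] for the top_k experts of one token."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)  # renormalize over the chosen experts
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, gate_logits, experts, top_k=2):
    """Weighted sum of the selected experts' outputs; the other experts never run."""
    return sum(w * experts[i](token) for i, w in route_token(gate_logits, top_k))

# Hypothetical toy setup: four scalar "experts" and fixed gate logits.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
gate_logits = [0.1, 2.0, 1.5, -1.0]

selected = route_token(gate_logits)
output = moe_layer(3.0, gate_logits, experts)
```

Only experts 1 and 2 are evaluated for this token. That is the source of the compute saving: the cost per token scales with the number of selected experts, not with the total number of experts.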
The speed is achieved through a philosophy Xiaomi calls "Extreme Model-System Codesign": deep collaboration between model design and systems engineering teams, addressing latency at every layer simultaneously.
Three techniques behind the speed
MXFP4 quantization with QAT
The primary bottleneck for large models is memory bandwidth: during decoding, the weights must be moved from GPU memory to the compute cores for every generated token, and that traffic, rather than arithmetic, limits throughput.
MiMo UltraSpeed uses MXFP4 quantization, which converts 16-bit or 32-bit weights to 4-bit floating-point values that share a scale factor per small block of weights. This reduces the memory footprint by roughly 4x relative to 16-bit weights (about 8x relative to 32-bit), so less data moves per token and bandwidth-bound decoding speeds up roughly in proportion.
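A minimal sketch of the format may help make the savings concrete. It follows the published MXFP4 layout (blocks of 32 elements, 4-bit E2M1 values, one shared 8-bit power-of-two scale per block), but it is a simplification: rounding ties are not handled the way hardware would handle them, and nothing here is taken from Xiaomi's implementation. The sample weights are made up.

```python
import math

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes an E2M1 value can hold
BLOCK = 32                                            # elements sharing one scale

def quantize_block(values):
    """Return (shared_exponent, codes); codes are signed values from FP4_GRID."""
    amax = max(abs(v) for v in values)
    if amax == 0:
        return 0, [0.0] * len(values)
    # Shared power-of-two scale: align the block maximum with E2M1's top exponent (2).
    shared_exp = math.floor(math.log2(amax)) - 2
    scale = 2.0 ** shared_exp
    codes = []
    for v in values:
        mag = min(abs(v) / scale, 6.0)                 # clamp to the largest FP4 value
        nearest = min(FP4_GRID, key=lambda g: abs(g - mag))
        codes.append(math.copysign(nearest, v))
    return shared_exp, codes

def dequantize_block(shared_exp, codes):
    """Reconstruct approximate weights from the shared exponent and FP4 codes."""
    scale = 2.0 ** shared_exp
    return [c * scale for c in codes]

# Storage cost: 32 four-bit values plus one 8-bit scale per block.
bits_per_weight = (BLOCK * 4 + 8) / BLOCK              # 4.25 bits
compression_vs_fp16 = 16 / bits_per_weight             # a little under 4x

weights = [0.1 * i - 1.6 for i in range(BLOCK)]        # made-up sample block
shared_exp, codes = quantize_block(weights)
restored = dequantize_block(shared_exp, codes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Because the scale costs 8 extra bits per 32 weights, the effective size is 4.25 bits per weight, which is why the reduction relative to 16-bit weights is slightly less than an exact 4x.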
Aggressive 4-bit quantization typically degrades model quality. To prevent this, the model was fine-tuned using Quantization-Aware Training (QAT): training that simulates the effects of 4-bit precision so the model adapts to perform effectively at lower precision. The routing layers of the MoE architecture are kept at higher precision to preserve the accuracy of expert selection.
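One common way to implement QAT is "fake quantization" with a straight-through estimator: the forward pass uses the quantized weight, while the gradient is applied to a full-precision latent copy as if quantization were the identity. The sketch below shows that idea on a one-weight toy model. It is an assumption for illustration only; the article does not say which estimator or training recipe Xiaomi used, and the data, learning rate, and fixed scale of 1 are hypothetical.

```python
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quant(w):
    """Snap w to the nearest signed FP4 grid value (scale fixed at 1 for simplicity)."""
    mag = min(FP4_GRID, key=lambda g: abs(g - abs(w)))
    return mag if w >= 0 else -mag

def train(xs, ys, w=0.2, lr=0.01, steps=50):
    """Fit y = w * x while always running the forward pass with the quantized weight."""
    for _ in range(steps):
        w_q = fake_quant(w)  # the forward pass only ever sees the 4-bit weight
        grad = sum(2 * (w_q * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad       # straight-through: treat d(w_q)/d(w) as 1
    return w

xs = [1.0, 2.0, 3.0]
ys = [1.5 * x for x in xs]   # hypothetical target whose ideal weight lies on the grid

latent_w = train(xs, ys)     # full-precision weight kept during training
deployed_w = fake_quant(latent_w)  # what would be stored after training
```

The latent weight drifts upward until its quantized version reproduces the targets, at which point the gradient vanishes. The model has adapted to the grid during training instead of being rounded after the fact.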
DFlash block speculative decoding
Standard autoregressive decoding generates one token per forward pass. Conventional speculative decoding uses a small draft model to guess a few tokens ahead, which the large model verifies in one pass.
DFlash (Block Diffusion for Flash Speculative Decoding) predicts an entire block of 8 tokens in a single parallel forward pass rather than guessing token by token. In testing on coding tasks, the large model accepted an average of 6.3 out of every 8 proposed tokens (about 79%), so each verification pass advances the output by roughly six tokens instead of one.
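The verification step can be sketched as follows, assuming standard greedy speculative decoding: the large model scores every drafted position in one pass, the longest agreeing prefix is accepted, and the large model's own token replaces the first disagreement. The function name and the letter "tokens" are hypothetical, and the article does not confirm that DFlash uses exactly this acceptance rule.

```python
def verify_block(draft, target_next):
    """Greedy verification of one drafted block.

    draft:       tokens proposed by the drafter in one parallel pass
    target_next: target_next[i] is the large model's own choice at position i,
                 given the context plus draft[:i]; it has len(draft) + 1 entries,
                 all produced by a single forward pass of the large model
    Returns the tokens committed to the output in this step.
    """
    accepted = []
    for i, tok in enumerate(draft):
        if tok != target_next[i]:
            break                      # first disagreement ends the accepted prefix
        accepted.append(tok)
    # The large model's token at the first disagreement (or after a fully
    # accepted block) is always valid, so every step commits at least one token.
    accepted.append(target_next[len(accepted)])
    return accepted

# Hypothetical example: the drafter gets 6 of 8 tokens right.
draft = list("abcdefgh")
target_next = list("abcdefXhi")        # entries after the mismatch are discarded
committed = verify_block(draft, target_next)
```

Entries of `target_next` after the first mismatch were conditioned on rejected tokens, which is why they are thrown away and recomputed in the next step.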
Persistent kernel with warp specialization
At 1,000+ tokens/second, GPU execution overhead becomes a bottleneck. In conventional execution, each operation runs as a separate kernel: the GPU launches it, runs it to completion, releases its on-chip state (registers and shared memory), and waits for the next launch. These microsecond pauses accumulate into significant throughput loss at this scale.
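A back-of-the-envelope calculation shows why microseconds matter here. Only the 1,000 tokens/second figure comes from the article; the number of kernel launches per forward pass, the overhead per launch, and the tokens produced per pass are hypothetical placeholders chosen to make the arithmetic easy to follow.

```python
def launch_overhead_fraction(tokens_per_second, tokens_per_pass,
                             launches_per_pass, overhead_us):
    """Fraction of the per-pass time budget consumed by kernel launch overhead."""
    pass_budget_us = 1_000_000 * tokens_per_pass / tokens_per_second
    return launches_per_pass * overhead_us / pass_budget_us

# 1,000 tokens/s is from the article; the other three values are hypothetical.
fraction = launch_overhead_fraction(
    tokens_per_second=1000,   # target throughput
    tokens_per_pass=1,        # plain autoregressive decoding
    launches_per_pass=100,    # assumed number of separate kernels per pass
    overhead_us=5.0,          # assumed cost of each launch, in microseconds
)
```

With these placeholder values, half of the 1 ms per-token budget would go to launch overhead before any useful work is done, which is the kind of cost a persistent kernel is meant to remove.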
The TileRT team developed a persistent kernel that is launched once and stays resident on the GPU, eliminating the overhead of constant kernel launch and teardown. Within this kernel, warp specialization assigns fixed roles to different groups of warps (the 32-thread units a GPU schedules together):
- One group handles data movement
- One group runs mathematical computations
- One group manages communication
These three roles run concurrently as a pipeline: while one group computes on the current tile of data, another is already fetching the next, so the stages overlap instead of running one after another.
Test results
LeetCode hard problem
Given the full problem description for LeetCode #65 (Valid Number), the model generated a complete solution with logic explanation. Peak observed speed: 3,451 tokens/second. Total generation time: 3 seconds.
Three.js game generation
Prompt: "build a Subway Surfer style 3d game using three.js"
In 50 seconds, the model generated a self-contained HTML file with a functional 3D endless runner including lane-switching mechanics. Two iterative follow-up prompts added coins, obstacles, and a local high-score system. The result was a playable prototype created from text prompts in a few minutes.
Assessment
MiMo UltraSpeed's speed claims are consistent with its architectural choices. MXFP4 reduces memory bandwidth requirements, DFlash reduces the number of large-model forward passes needed per generated token, and the persistent kernel removes per-kernel launch overhead. The combination on a single 8-GPU node producing 1,000+ tokens/second is a meaningful engineering result.
The quality caveat is real. In complex generation tasks, outputs contained broken functionality and incomplete logic, suggesting the model's reasoning depth does not yet match that of frontier models running at 32–51 tokens/second. The tradeoff is explicit: MiMo UltraSpeed currently prioritizes throughput, and its quality at that throughput is still behind the leaders.
The hardware accessibility is the more significant claim. Demonstrating this speed on commodity 8-GPU hardware rather than proprietary TPU clusters or bespoke supercomputer infrastructure changes the deployment calculus for organizations without hyperscale cloud access.
Technical details are available in the research paper, and the model is accessible through Xiaomi's playground.