MiMo UltraSpeed: 1,000+ Tokens per Second on a Single 8-GPU Node

Stanley Ulili

Updated on June 15, 2026

Xiaomi's MiMo-V2.5-Pro-UltraSpeed is a 1-trillion-parameter Mixture of Experts (MoE) model that generates text at over 1,000 tokens per second, peaking at 1,200, on a single server with eight commodity GPUs. For comparison, Claude 4 Opus averages around 32 tokens/second and GPT-5.5 averages 38–51 tokens/second on their respective proprietary infrastructure.

Architecture

Table comparing average output speed of top AI models highlighting the dramatic difference of MiMo UltraSpeed

The 1-trillion-parameter scale is made computationally tractable through the MoE architecture. Rather than activating all parameters for every token, a gating network routes each token to the most relevant subset of expert networks. For any given token, only a fraction of the total parameters are active.
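The routing step can be illustrated with a minimal sketch of top-k gating. The article does not describe MiMo's actual router, so the expert count, the `top_k` value, and the function names `route_token` and `moe_layer` are all hypothetical, and the experts are toy scalar functions rather than neural networks.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(gate_logits, top_k):
    """Pick the top_k experts for one token and renormalize their weights.

    gate_logits: one score per expert, as a gating network would produce.
    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]

def moe_layer(token, gate_logits, experts, top_k=2):
    """Weighted sum of the outputs of the selected experts only."""
    output = 0.0
    for index, weight in route_token(gate_logits, top_k):
        output += weight * experts[index](token)  # unselected experts never run
    return output

# Hypothetical toy setup: 4 experts, each a scalar function.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
selected = route_token([0.1, 2.0, 1.5, -1.0], top_k=2)
print([index for index, _ in selected])  # [1, 2]
```

Only two of the four experts are evaluated for this token, which is the property that keeps the per-token cost far below the total parameter count.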

The speed is achieved through a philosophy Xiaomi calls "Extreme Model-System Codesign": deep collaboration between model design and systems engineering teams, addressing latency at every layer simultaneously.

Snippet from the research paper highlighting the term "Extreme Model-System Codesign"

Three techniques behind the speed

MXFP4 quantization with QAT

The primary bottleneck for large models is memory bandwidth: every generated token requires moving the active weights from GPU memory to the compute cores, and that transfer, rather than the arithmetic itself, limits throughput.
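A back-of-the-envelope model shows why bandwidth sets the ceiling. This is a rough sketch, not a figure from the paper: the active parameter count and the per-GPU bandwidth below are hypothetical, and the model ignores KV-cache reads, activations, and compute time.

```python
def bytes_per_token(active_params, bits_per_weight):
    """Bytes of weights read from GPU memory to decode one token."""
    return active_params * bits_per_weight / 8

def max_tokens_per_second(active_params, bits_per_weight, bandwidth_bytes_per_s):
    """Upper bound on single-stream decode speed when weight reads dominate."""
    return bandwidth_bytes_per_s / bytes_per_token(active_params, bits_per_weight)

# Hypothetical values, not taken from the paper.
ACTIVE_PARAMS = 50e9      # parameters touched per token
BANDWIDTH = 8 * 2e12      # 8 GPUs at 2 TB/s each, in bytes per second

ceiling_16bit = max_tokens_per_second(ACTIVE_PARAMS, 16, BANDWIDTH)
ceiling_4bit = max_tokens_per_second(ACTIVE_PARAMS, 4, BANDWIDTH)
print(f"16-bit ceiling: {ceiling_16bit:.0f} tokens/s")
print(f"4-bit ceiling:  {ceiling_4bit:.0f} tokens/s")
print(f"ratio: {ceiling_4bit / ceiling_16bit:.1f}x")
```

Under these assumptions the ceiling scales inversely with bits per weight, which is why weight precision is the first lever to pull.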

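MiMo UltraSpeed uses MXFP4 quantization, converting 16-bit or 32-bit weights to a 4-bit floating-point format in which each small block of values shares a single scale factor. This reduces the memory footprint by roughly 4x relative to 16-bit weights and roughly 8x relative to 32-bit weights, so less data moves per token and memory-bound decoding speeds up roughly in proportion.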
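The format can be sketched as follows. This is a simplified illustration of block-scaled 4-bit floating point (32 values per block, one shared power-of-two scale, E2M1 element magnitudes), not Xiaomi's implementation. The function names are hypothetical, and codes are kept as Python floats for readability where real kernels would pack 4-bit indices.

```python
import math

FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes
BLOCK_SIZE = 32

def quantize_block(block):
    """Return (shared_exponent, codes) for one block of floats."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Power-of-two scale that places the largest value near the top of the
    # FP4 range; 2 is the exponent of 4.0, the largest power of two in FP4.
    shared_exp = math.floor(math.log2(max_abs)) - 2
    scale = 2.0 ** shared_exp
    codes = []
    for v in block:
        target = min(abs(v) / scale, 6.0)  # clamp to the largest magnitude
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(math.copysign(nearest, v))
    return shared_exp, codes

def dequantize_block(shared_exp, codes):
    """Recover approximate floats from the shared exponent and FP4 codes."""
    scale = 2.0 ** shared_exp
    return [c * scale for c in codes]

# Toy block of 32 weights between -1.6 and 1.5.
weights = [0.1 * (i - 16) for i in range(BLOCK_SIZE)]
shared_exp, codes = quantize_block(weights)
restored = dequantize_block(shared_exp, codes)

# Storage cost: 32 four-bit elements plus one 8-bit shared scale.
bits_per_weight = (BLOCK_SIZE * 4 + 8) / BLOCK_SIZE
print(bits_per_weight)  # 4.25
```

The shared scale adds a quarter of a bit per weight, so the saving against 16-bit weights is slightly under 4x in this layout.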

Aggressive 4-bit quantization typically degrades model quality. To prevent this, the model was fine-tuned using Quantization-Aware Training (QAT): training that simulates the effects of 4-bit precision so the model adapts to perform effectively at lower precision. The routing layers of the MoE architecture are kept at higher precision to preserve the accuracy of expert selection.
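One common way to simulate low precision during training is "fake quantization" with a straight-through estimator. The article does not say which QAT recipe Xiaomi used, so the sketch below is an assumption chosen for illustration. It fits a single weight on a coarse grid (step 0.5 standing in for a 4-bit grid), and the names `fake_quantize` and `train` are hypothetical.

```python
def fake_quantize(w, step=0.5):
    """Round w to the nearest multiple of step (stand-in for a 4-bit grid)."""
    return round(w / step) * step

def train(samples, lr=0.05, epochs=200, step=0.5):
    """Fit y = w * x while the forward pass only ever sees the quantized w."""
    w = 0.0  # full-precision "shadow" weight kept by the optimizer
    for _ in range(epochs):
        for x, y in samples:
            w_q = fake_quantize(w, step)   # forward pass uses low precision
            error = w_q * x - y
            grad = 2 * error * x           # gradient of squared error w.r.t. w_q
            w -= lr * grad                 # straight-through: treat d(w_q)/d(w) as 1
    return w, fake_quantize(w, step)

# Toy data generated by y = 1.0 * x.
shadow, quantized = train([(1.0, 1.0), (2.0, 2.0)])
print(shadow, quantized)
```

The optimizer keeps adjusting the full-precision weight until its quantized version fits the data, which is the sense in which the model "adapts" to low precision.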

DFlash block speculative decoding

Standard autoregressive decoding generates one token per forward pass. Conventional speculative decoding uses a small draft model to guess a few tokens ahead, which the large model verifies in one pass.

DFlash (Block Diffusion for Flash Speculative Decoding) predicts an entire block of 8 tokens in a single parallel forward pass rather than guessing token by token. In testing on coding tasks, the large model accepted an average of 6.3 out of every 8 proposed tokens, so each verification pass of the large model yields several tokens instead of one.
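The verification half of block speculative decoding can be sketched in a few lines. This assumes greedy verification, where a drafted token is accepted only if it equals the large model's own choice at that position. The article does not specify DFlash's acceptance rule, and `verify_block` is a hypothetical name.

```python
def verify_block(draft_tokens, target_tokens):
    """Accept the longest prefix of the draft that the target model agrees with.

    draft_tokens:  the block proposed by the drafter.
    target_tokens: the target model's own greedy choice at each of the same
                   positions plus one extra (len(draft_tokens) + 1 entries),
                   all obtained from a single verification pass.
    Returns the tokens to append to the output.
    """
    accepted = []
    for position, proposed in enumerate(draft_tokens):
        if proposed != target_tokens[position]:
            break
        accepted.append(proposed)
    # The target's token at the first mismatch (or after a fully accepted
    # block) is always valid, so every pass emits at least one token.
    accepted.append(target_tokens[len(accepted)])
    return accepted

# Toy token ids: the draft is right for 5 positions, then diverges.
draft = [5, 9, 2, 7, 1, 1, 4, 8]
target = [5, 9, 2, 7, 1, 3, 0, 0, 6]
print(verify_block(draft, target))  # [5, 9, 2, 7, 1, 3]
```

If about 6.3 of 8 drafted tokens are accepted on average, as reported for coding tasks, each pass of the large model produces roughly six or more tokens rather than one.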

Table from the research paper showing acceptance length for DFlash in different scenarios especially coding

Persistent kernel with warp specialization

At 1,000+ tokens/second, GPU execution overhead becomes a bottleneck. In the standard execution model, the host launches a kernel, the GPU runs it to completion, the kernel's registers and shared memory are released, and the hardware waits for the next launch. These microsecond-scale gaps accumulate into significant throughput loss at this scale.
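A quick calculation shows how small gaps add up. The launch count and gap duration below are hypothetical placeholders rather than measurements from the paper, and the launch count is meant as an amortized per-token figure.

```python
def launch_overhead_fraction(tokens_per_second, launches_per_token, gap_us):
    """Share of the per-token time budget consumed by launch gaps."""
    budget_us = 1_000_000 / tokens_per_second   # time available per token
    overhead_us = launches_per_token * gap_us   # idle time from launches
    return overhead_us / budget_us

# Hypothetical values: 100 kernel launches per token, 5 microseconds each.
fraction = launch_overhead_fraction(1000, 100, 5.0)
print(f"{fraction:.0%} of the 1 ms budget goes to launch gaps")  # 50%

# The same gaps are negligible at a slower decode rate.
print(f"{launch_overhead_fraction(50, 100, 5.0):.1%}")  # 2.5%
```

The overhead per token is constant, but the budget shrinks as the target rate rises, so a cost that is invisible at 50 tokens/second dominates at 1,000.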

The TileRT team developed a persistent kernel that is launched once and stays resident on the GPU, avoiding the overhead of repeated kernel launch and teardown. Within this kernel, warp specialization assigns fixed roles to different groups of warps (the 32-thread units a GPU schedules together):

One group handles data movement
One group runs mathematical computations
One group manages communication

These three roles run concurrently as a pipeline, so data movement, computation, and communication overlap instead of executing one after another.
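The role split can be pictured with a CPU-side analogy: three long-lived workers, each with one fixed job, connected by small queues. This is only an analogy in Python threads, not GPU code and not TileRT's design. The names `mover`, `calculator`, and `communicator` are hypothetical, and the queues stand in for on-chip buffers and barriers.

```python
import queue
import threading

def run_pipeline(tiles):
    """Three long-lived workers with fixed roles, connected by queues."""
    loaded = queue.Queue(maxsize=2)
    computed = queue.Queue(maxsize=2)
    results = []
    done = object()  # sentinel that tells the next stage to stop

    def mover():          # role 1: data movement
        for tile in tiles:
            loaded.put(list(tile))   # stands in for a copy into fast memory
        loaded.put(done)

    def calculator():     # role 2: math
        while (tile := loaded.get()) is not done:
            computed.put(sum(v * v for v in tile))
        computed.put(done)

    def communicator():   # role 3: communication
        while (value := computed.get()) is not done:
            results.append(value)    # stands in for sending to another GPU

    workers = [threading.Thread(target=role)
               for role in (mover, calculator, communicator)]
    for worker in workers:
        worker.start()   # each worker starts once and keeps its role
    for worker in workers:
        worker.join()
    return results

print(run_pipeline([[1, 2], [3, 4], [5]]))  # [5, 25, 25]
```

While the calculator works on one tile, the mover can already fetch the next, which is the overlap that warp specialization aims for on the GPU.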

Diagram illustrating Warp Specialization with different hardware parts assigned permanent parallel roles

Test results

LeetCode hard problem

Given the full problem description for LeetCode #65 (Valid Number), the model generated a complete solution with logic explanation. Peak observed speed: 3,451 tokens/second. Total generation time: 3 seconds.

Performance graph from the LeetCode test showing a peak output speed of 3451 tokens/s

Three.js game generation

Prompt: "build a Subway Surfer style 3d game using three.js"

In 50 seconds, the model generated a self-contained HTML file with a functional 3D endless runner including lane-switching mechanics. Two iterative follow-up prompts added coins, obstacles, and a local high-score system. The result was a playable prototype created from text prompts in a few minutes.

Gameplay footage of the 3D endless runner game created by the MiMo UltraSpeed model

Assessment

MiMo UltraSpeed's speed claims are consistent with its architectural choices. MXFP4 reduces memory bandwidth requirements, DFlash reduces the number of forward passes needed per token generated, and the persistent kernel removes the idle gaps caused by repeated kernel launches. The combination on a single 8-GPU node producing 1,000+ tokens/second is a meaningful engineering result.

The quality caveat is real. In complex generation tasks, outputs contained broken functionality and incomplete logic, suggesting the model's reasoning depth does not yet match frontier models that run at 32–51 tokens/second. The tradeoff is explicit: the current MiMo UltraSpeed prioritizes throughput, and its quality at that throughput is still behind the leaders.

The hardware accessibility is the more significant claim. Demonstrating this speed on commodity 8-GPU hardware rather than proprietary TPU clusters or bespoke supercomputer infrastructure changes the deployment calculus for organizations without hyperscale cloud access.

Technical details are available in the research paper, and the model is accessible through Xiaomi's playground.
