oMLX: Apple Silicon-Optimized LLM Inference with Two-Tier KV Caching
oMLX is a macOS-native LLM inference server built on Apple's open-source MLX framework. Unlike cross-platform tools built to support NVIDIA and AMD GPUs across multiple operating systems, oMLX is designed exclusively for Apple Silicon and exploits capabilities of the unified memory architecture that those tools cannot use.
The hardware advantages oMLX exploits
Unified memory and zero-copy arrays
On a traditional PC, the CPU and GPU have separate memory pools. Model weights must be copied between system RAM and VRAM over the PCIe bus during inference, which creates a significant bandwidth bottleneck.
Apple Silicon integrates the CPU, GPU, and Neural Engine onto a single chip (SoC) with a shared physical memory pool. The MLX framework exploits this through zero-copy arrays: when the GPU computes on model weights, the CPU can read the result directly from the same memory address without copying any data. This eliminates the PCIe bottleneck entirely.
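As a rough illustration (MLX itself, not oMLX code), the snippet below runs one operation on the GPU and the next on the CPU against the same arrays. On Apple Silicon both devices address the same unified memory, so no transfer happens between the two steps.

```python
import mlx.core as mx

# Arrays are allocated once in unified memory.
weights = mx.random.uniform(shape=(4096, 4096))
activations = mx.random.uniform(shape=(4096, 4096))

# Run the heavy matmul on the GPU...
hidden = mx.matmul(weights, activations, stream=mx.gpu)

# ...then post-process on the CPU. Both devices see the same buffers,
# so there is no RAM-to-VRAM copy in between.
probs = mx.softmax(hidden, axis=-1, stream=mx.cpu)
mx.eval(probs)
```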
oMLX is built on MLX and inherits these properties.
Lazy computation
MLX uses lazy computation: rather than executing mathematical operations immediately, it builds a computation graph of the planned operations and executes them only when the final output is needed. This allows the framework to analyze the full sequence, fuse operations, and optimize data flow before any computation runs.
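A minimal example of this behavior in MLX itself (not oMLX-specific):

```python
import mlx.core as mx

a = mx.ones((2048, 2048))
b = mx.ones((2048, 2048))

# These lines only record nodes in a computation graph; no math runs yet.
c = a @ b
d = mx.exp(c) * 2

# Evaluation is triggered explicitly (or implicitly, e.g. when printing),
# so MLX sees the whole graph and can fuse and schedule operations first.
mx.eval(d)
```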
Two-tier KV caching
The KV (Key-Value) cache stores the conversational context an LLM needs to generate coherent responses. Every token in the prompt and response history is added to this cache as the conversation grows.
On systems with limited RAM, a large KV cache consumes a substantial portion of available memory, often over 30%, slowing both the model and the rest of the system. oMLX addresses this with a two-tier approach.
Tier 1 (hot cache): Active conversational context is kept in high-speed unified memory for maximum performance.
Tier 2 (cold cache): Older context and large static data like system prompts and tool definitions are frozen and moved to the SSD.
When the model needs to reference older context, oMLX loads it from the SSD in milliseconds. This extends effective context capacity significantly beyond what unified memory alone would allow, and preserves memory for active computations.
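The sketch below is a hypothetical illustration of the two-tier idea, not oMLX's actual implementation: recently used KV blocks stay in an in-memory map (the hot tier), older blocks are frozen to files on the SSD (the cold tier), and a miss in the hot tier falls back to reloading from disk. The block IDs, eviction policy, and serialization format are all assumptions made for the example.

```python
import pickle
from collections import OrderedDict
from pathlib import Path

class TwoTierKVCache:
    """Hypothetical two-tier KV cache: hot blocks in memory, cold blocks on SSD."""

    def __init__(self, cache_dir: str, max_hot_blocks: int = 64):
        self.hot = OrderedDict()          # block_id -> KV tensors, in unified memory
        self.cold_dir = Path(cache_dir)   # frozen blocks persisted to the SSD
        self.cold_dir.mkdir(parents=True, exist_ok=True)
        self.max_hot_blocks = max_hot_blocks

    def put(self, block_id: str, kv_block) -> None:
        self.hot[block_id] = kv_block
        self.hot.move_to_end(block_id)
        if len(self.hot) > self.max_hot_blocks:
            self._freeze_oldest()

    def get(self, block_id: str):
        if block_id in self.hot:                       # tier 1 hit
            self.hot.move_to_end(block_id)
            return self.hot[block_id]
        cold_path = self.cold_dir / f"{block_id}.pkl"  # tier 2: reload from SSD
        if cold_path.exists():
            kv_block = pickle.loads(cold_path.read_bytes())
            self.put(block_id, kv_block)               # promote back to the hot tier
            return kv_block
        return None                                    # not cached: must be recomputed

    def _freeze_oldest(self) -> None:
        # Evict the least-recently-used block to disk instead of discarding it.
        block_id, kv_block = self.hot.popitem(last=False)
        (self.cold_dir / f"{block_id}.pkl").write_bytes(pickle.dumps(kv_block))
```

The key design point is that eviction freezes context rather than discarding it, which is what makes the recovery behavior described later in the performance comparison possible.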
Setup
oMLX is distributed as a standard DMG file. After installation, the initial configuration window asks for:
- Base directory: where oMLX stores configuration files
- Model directory: where LLM files are stored
- Port: the server port (default 1337)
- API key: auto-generated for securing the local server
Clicking Start Server launches the background process. Clicking Open Admin Panel & Chat opens the dashboard in a browser.
Models can be downloaded directly from Hugging Face in the Models tab by providing a model ID. The dashboard shows real-time serving stats including total profiled tokens, cached tokens, cache efficiency (percentage served from cache versus recomputed), and token generation speed in tokens per second.
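As an illustration of how a client might talk to the running server, the snippet below assumes an OpenAI-compatible chat completions endpoint on the configured port and the generated API key; the actual routes and model IDs should be taken from the oMLX documentation.

```python
import requests

# Assumed endpoint and placeholder credentials; verify against the oMLX docs.
OMLX_URL = "http://localhost:1337/v1/chat/completions"
API_KEY = "paste-the-generated-key-here"

response = requests.post(
    OMLX_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "your-downloaded-model-id",  # placeholder model ID
        "messages": [{"role": "user", "content": "Hello from oMLX"}],
    },
    timeout=120,
)
print(response.json())
```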
Performance comparison: oMLX vs. LM Studio
Both tools were used to run the same task: having an AI coding agent (Codex CLI) build a full-stack movie search and watchlist application with a React frontend and Express.js backend, using the same Qwen3-30B-A3B-4bit model.
Token generation speed: oMLX averaged around 47 tokens per second. LM Studio averaged around 16 tokens per second, approximately one-third the speed.
Completion time: The task completed in approximately 20 minutes with oMLX and approximately 35 minutes with LM Studio.
System usability: With LM Studio, the MacBook Pro M2 became difficult to use for other tasks during inference. Video playback on a second monitor was impractical. With oMLX, the system remained responsive throughout: browsing, video playback, and general use continued normally while the agent ran in the background.
Context limit handling: During the oMLX session, the agent reached the model's context limit twice, producing 400 errors. Because oMLX persists the KV cache to SSD, re-running the prompt caused it to recognize the existing project state, reload the frozen cache instantly, and resume from exactly where it stopped. No progress was lost.
LM Studio did not hit the context limit during its session, indicating more conservative context management. This stability comes at the cost of the significant performance and usability differences described above.
Final thoughts
oMLX's performance advantage over LM Studio is substantial for the specific hardware it targets. The three-times higher token generation speed and the maintained system responsiveness during inference are direct consequences of its Apple Silicon-specific design: zero-copy unified memory access, lazy computation, and two-tier KV caching.
The two-tier caching system is the most practically significant feature for long-running agent sessions. Being able to recover from context limit events without losing work changes which kinds of tasks are feasible in a single session.
The tool is appropriate for Mac users with Apple Silicon who run LLMs locally as a regular part of their workflow. For users on other hardware or operating systems, LM Studio or similar cross-platform tools remain the relevant options.
Documentation and downloads are available at omlx.dev.