# oMLX: Apple Silicon-Optimized LLM Inference with Two-Tier KV Caching

[oMLX](https://omlx.dev/) is a macOS-native LLM inference server built on Apple's open-source [MLX framework](https://github.com/ml-explore/mlx). Unlike cross-platform tools built to support NVIDIA and AMD GPUs across multiple operating systems, oMLX is designed exclusively for Apple Silicon and exploits capabilities of the unified memory architecture that those tools cannot.

![Overview of Apple's MLX project webpage showing its purpose as an array framework for Apple Silicon](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/ac06a424-6076-401a-3126-a095c40ad500/lg1x =1280x720)

## The hardware advantages oMLX exploits

### Unified memory and zero-copy arrays

On a traditional PC, the CPU and GPU have separate memory pools. Model weights must be copied between system RAM and VRAM over the PCIe bus during inference, which creates a significant bandwidth bottleneck.

![Diagram showing the traditional PC architecture with separate CPU and GPU memory pools and data copying over PCIe](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/f60bcda7-d9c1-42c3-4b59-e13c0ab1b300/orig =1280x720)

Apple Silicon integrates the CPU, GPU, and Neural Engine onto a single chip (SoC) with a shared physical memory pool. MLX exploits this through zero-copy arrays: when the GPU computes on model weights, the CPU can read the result directly from the same memory address without copying any data, eliminating the PCIe bottleneck entirely.

![Diagram comparing traditional architecture with Apple Silicon SoC showing zero-copy access where both CPU and GPU engines access a unified memory pool directly](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/10eb94d3-fe14-4c27-3d3c-d8068ceb4600/public =1280x720)

oMLX is built on MLX and inherits these properties.

### Lazy computation

MLX uses lazy computation: rather than executing mathematical operations immediately, it builds a computation graph of the planned operations and executes them only when the final output is needed. This allows the framework to analyze the full sequence, fuse operations, and optimize data flow before any computation runs.

### Two-tier KV caching

The KV (Key-Value) cache stores the attention keys and values for the conversational context an LLM needs to generate coherent responses. As the conversation grows, every token in the prompt and response history adds entries to this cache.
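
The cache's growth is easy to estimate. A back-of-envelope sketch, assuming illustrative 7B-class hyperparameters (32 layers, 32 KV heads, head dimension 128, fp16); real models vary, and grouped-query attention shrinks this considerably:

```python
# Rough KV cache size: 2 (keys + values) per layer, per head, per position.
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32,
                   head_dim=128, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

gib = kv_cache_bytes(32_768) / 2**30
print(f"{gib:.1f} GiB for a 32K-token context")  # 16.0 GiB
```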

![Diagram of a modern GPU's RAM usage showing the KV cache consuming more than 30% of available memory alongside model parameters](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/52625f37-0e88-47c4-1630-cf1cc7cd1c00/lg2x =1280x720)

On systems with limited RAM, a large KV cache consumes a substantial portion of available memory, often over 30%, slowing both the model and the rest of the system. oMLX addresses this with a two-tier approach.

**Tier 1 (hot cache):** Active conversational context is kept in high-speed unified memory for maximum performance.

**Tier 2 (cold cache):** Older context and large static data like system prompts and tool definitions are frozen and moved to the SSD.

![Diagram explaining the two-tier KV cache system showing active context in unified memory connected to cold storage on SSD](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/d6ee04ad-4bf9-461c-364b-bd819aaf2100/public =1280x720)

When the model needs to reference older context, oMLX loads it from the SSD in milliseconds. This extends effective context capacity significantly beyond what unified memory alone would allow, and preserves memory for active computations.
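
The mechanics can be illustrated with a toy Python sketch: an LRU hot tier that freezes evicted entries to disk and reloads them on demand. This is purely illustrative, not oMLX's actual implementation:

```python
# Toy two-tier cache: hot tier in memory (LRU), cold tier on disk.
import pickle
from collections import OrderedDict
from pathlib import Path
from tempfile import mkdtemp

class TwoTierCache:
    def __init__(self, hot_capacity, cold_dir=None):
        self.hot = OrderedDict()                     # tier 1: memory
        self.hot_capacity = hot_capacity
        self.cold_dir = Path(cold_dir or mkdtemp())  # tier 2: SSD

    def put(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)
        while len(self.hot) > self.hot_capacity:
            old_key, old_value = self.hot.popitem(last=False)
            path = self.cold_dir / f"{old_key}.pkl"
            path.write_bytes(pickle.dumps(old_value))  # freeze to disk

    def get(self, key):
        if key in self.hot:                  # hot hit: no I/O
            self.hot.move_to_end(key)
            return self.hot[key]
        frozen = self.cold_dir / f"{key}.pkl"
        if frozen.exists():                  # cold hit: reload and promote
            value = pickle.loads(frozen.read_bytes())
            self.put(key, value)
            return value
        raise KeyError(key)

cache = TwoTierCache(hot_capacity=2)
for i in range(4):
    cache.put(f"block{i}", [i] * 4)          # block0/block1 spill to disk
print(cache.get("block0"))                   # reloaded from tier 2
```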

## Setup

oMLX is distributed as a standard DMG file. After installation, the initial configuration window asks for:

![Clean "Welcome to oMLX" initial configuration window](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/7f234eda-b07e-4eca-c64b-c7e697975500/public =1280x720)

- **Base directory:** where oMLX stores configuration files
- **Model directory:** where LLM files are stored
- **Port:** the server port (default `1337`)
- **API key:** auto-generated for securing the local server

Clicking **Start Server** launches the background process. Clicking **Open Admin Panel & Chat** opens the dashboard in a browser.

Models can be downloaded directly from Hugging Face in the Models tab by providing a model ID. The dashboard shows real-time serving stats including total profiled tokens, cached tokens, cache efficiency (percentage served from cache versus recomputed), and token generation speed in tokens per second.
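
With the server running, clients talk to it over HTTP. As a sketch, assuming oMLX exposes an OpenAI-compatible `/v1/chat/completions` endpoint (an assumption: this layout is common for local inference servers, but check the oMLX docs for the actual API), a request could be built like this:

```python
# Building a chat request against a local server. The endpoint path and
# payload shape assume an OpenAI-compatible API; that is a guess, not
# something confirmed here. The model ID is a placeholder.
import json
import urllib.request

def build_chat_request(base_url: str, api_key: str,
                       model: str, prompt: str) -> urllib.request.Request:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",          # hypothetical endpoint
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",   # key from the setup window
        },
    )

# Port 1337 is the default shown in the setup window.
req = build_chat_request("http://localhost:1337", "YOUR_API_KEY",
                         "some-model-id", "Hello")
# urllib.request.urlopen(req) would send it once the server is up.
```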

## Performance comparison: oMLX vs. LM Studio

Both tools were given the same task: having an AI coding agent (Codex CLI) build a full-stack movie search and watchlist application with a React frontend and Express.js backend, using the same Qwen3-30B-A3B-4bit model.

**Token generation speed:** oMLX averaged around 47 tokens per second; LM Studio averaged around 16, roughly one-third of oMLX's speed.

**Completion time:** The task completed in approximately 20 minutes with oMLX and approximately 35 minutes with LM Studio.

**System usability:** With LM Studio, the MacBook Pro M2 became difficult to use for other tasks during inference. Video playback on a second monitor was impractical. With oMLX, the system remained responsive throughout: browsing, video playback, and general use continued normally while the agent ran in the background.

**Context limit handling:** During the oMLX session, the agent reached the model's context limit twice, producing 400 errors. Because oMLX persists the KV cache to SSD, re-running the prompt caused it to recognize the existing project state, reload the frozen cache instantly, and resume from exactly where it stopped. No progress was lost.

LM Studio never hit the context limit during its session, suggesting more conservative context management. That caution, however, comes alongside the significant performance and usability gaps described above.

## Final thoughts

oMLX's performance advantage over LM Studio is substantial on the hardware it targets. The roughly threefold token generation speed and the system responsiveness it maintains during inference are direct consequences of its Apple Silicon-specific design: zero-copy unified memory access, lazy computation, and two-tier KV caching.

The two-tier caching system is the most practically significant feature for long-running agent sessions. Being able to recover from context limit events without losing work changes which tasks are feasible in a single session.

The tool is appropriate for Mac users with Apple Silicon who run LLMs locally as a regular part of their workflow. For users on other hardware or operating systems, LM Studio or similar cross-platform tools remain the relevant options.

Documentation and downloads are available at [omlx.dev](https://omlx.dev/).