Cactus: Low-Latency AI Inference for Mobile with Zero-Copy Memory Mapping and NPU Acceleration

Stanley Ulili

Updated on May 18, 2026

The memory problem on edge devices
Zero-copy memory mapping
NPU-first architecture
Hybrid Router
Real-time transcription benchmark
Final thoughts

Cactus is an AI inference engine designed for mobile and edge devices. Its architecture addresses two constraints specific to these environments: aggressive OS memory management that terminates apps with high RAM usage, and dedicated Neural Processing Unit (NPU) hardware that most inference engines do not fully utilize.

The engine is available via SDKs for Swift, Kotlin, React Native, Flutter, Python, and Rust.

The memory problem on edge devices

Loading a large AI model on mobile traditionally involves copying the entire model file from storage into RAM. This causes a sudden memory spike that the OS memory managers on iOS and Android may treat as a sign of a runaway process, and they may respond by terminating the app.

Smartphone overheating and displaying a dead battery icon illustrating problems from running heavy AI models on-device

Even after quantization (reducing model weight precision to shrink size), many models still require hundreds of megabytes of RAM during inference, more than mobile operating systems are willing to allocate to a background process.

Chart from Yole Intelligence showing memory requirements by AI model size during inference on mobile and edge devices

Zero-copy memory mapping

Rather than copying the model into RAM, Cactus maps the model's weights directly from storage. Tensors are pulled into the active compute cycle only when needed for a specific calculation, not preloaded. This keeps RAM usage stable and low throughout inference.
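The idea can be illustrated with Python's standard `mmap` module. This is a minimal sketch of the general technique, not Cactus's implementation: the flat `weights.bin` file and the `read_slice` helper are assumptions made for the example. The file is mapped read-only, and only the byte range that is actually requested gets decoded.

```python
import mmap
import os
import struct
import tempfile

FLOAT_SIZE = 4  # bytes per float32


def write_weights(path, values):
    """Write float32 values to a flat binary file (a stand-in for a model file)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}f", *values))


def read_slice(mapped, start, count):
    """Decode `count` float32 values beginning at element index `start`.

    The mapping itself does not load the file; the OS pages in only the
    region that this call touches.
    """
    return struct.unpack_from(f"<{count}f", mapped, start * FLOAT_SIZE)


# Hypothetical weights file with 100,000 values.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
write_weights(path, [float(i) for i in range(100_000)])

with open(path, "rb") as f:
    # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        window = read_slice(mapped, 50_000, 4)
    finally:
        mapped.close()

print(window)
```

Because the pages are backed by the file, the OS can evict them under memory pressure and reload them later without the process having to hold a private copy of the whole model.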

Using this approach, Cactus reports up to 10x lower RAM usage than other engines. The stable memory footprint allows the app to run in the background without triggering the OS memory manager, which enables use cases like continuous ambient listening that would be impractical with a conventional loading strategy.

The `.cact` model format

To support zero-copy memory mapping, Cactus v1 introduced a proprietary .cact model format. The format is structured specifically to allow memory-mapped access to individual tensors. It replaces the GGUF format that Cactus used previously.
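The internal layout of .cact is not described here, so the following is a purely hypothetical container, written only to show why a format built around a tensor index suits memory mapping. The header layout, the 64-byte alignment, and the names `write_container` and `load_tensor` are all assumptions for illustration and do not reflect the real format.

```python
import json
import mmap
import os
import struct
import tempfile

ALIGN = 64  # hypothetical alignment for each tensor's data


def pad(n, align=ALIGN):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align


def write_container(path, tensors):
    """Write {name: [floats]} as: u32 header length, JSON index, aligned data."""
    index, blobs, cursor = {}, [], 0
    for name, values in tensors.items():
        blob = struct.pack(f"<{len(values)}f", *values)
        index[name] = {"offset": cursor, "count": len(values)}
        blobs.append(blob.ljust(pad(len(blob)), b"\0"))
        cursor += pad(len(blob))
    header = json.dumps(index).encode("utf-8")
    data_start = pad(4 + len(header))
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(header)))
        f.write(header)
        f.write(b"\0" * (data_start - 4 - len(header)))
        for blob in blobs:
            f.write(blob)


def load_tensor(mapped, name):
    """Read the index, then decode only the named tensor's byte range."""
    (header_len,) = struct.unpack_from("<I", mapped, 0)
    index = json.loads(mapped[4:4 + header_len].decode("utf-8"))
    entry = index[name]
    data_start = pad(4 + header_len)
    return struct.unpack_from(
        f"<{entry['count']}f", mapped, data_start + entry["offset"]
    )


path = os.path.join(tempfile.mkdtemp(), "model.bin")
write_container(path, {"embed": [1.0, 2.0, 3.0], "attn.q": [0.5, 0.25]})

with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        attn_q = load_tensor(mapped, "attn.q")
    finally:
        mapped.close()

print(attn_q)
```

The point of the index is that a single tensor can be located by offset and read on its own, without parsing or loading the tensors stored around it.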

FAQ section from the Cactus website explaining the move from GGUF to the proprietary .cact format

NPU-first architecture

Modern mobile SoCs from Apple, Qualcomm (Snapdragon), MediaTek (Dimensity), and Samsung (Exynos) include a dedicated NPU designed for the matrix operations that neural networks require. These units are substantially more power-efficient for AI workloads than a GPU or CPU.

Block diagram showing the distinct components of a modern SoC including CPU, GPU, and the specialized NPU

Most inference engines default to GPU execution. Cactus communicates with the NPU directly using custom Cactus Kernels that bypass standard software translation layers. This reduces overhead and increases energy efficiency on supported chips. The Cactus model dashboard provides a curated list of models pre-optimized for specific NPUs (Apple Neural Engine, Snapdragon NPU, and others) ready for deployment.

Hybrid Router

On-device models have a reasoning ceiling: they handle simple to moderately complex tasks well but fail on requests that require multi-step reasoning, external data, or high accuracy on ambiguous input. The Hybrid Router addresses this by routing requests based on assessed complexity.

Simple requests (such as "Set the thermostat to 72 degrees") are routed to the on-device model on the NPU. The response is near-instantaneous, no data leaves the device, and there is no cloud cost. Complex requests (such as "Process a refund for order #4821 and notify the customer") are assessed as high-complexity and automatically routed to a cloud frontier model like Gemini. The application code does not change; the router makes the routing decision in the background.

This approach keeps costs low by defaulting to local processing while using cloud capacity only when the local model is likely to produce a poor result.
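How the router assesses complexity is not specified here, so the sketch below substitutes a toy keyword-and-length heuristic to show the control flow only. The names `complexity_score`, `route`, and `handle`, the hint list, and the 0.5 threshold are hypothetical and are not part of the Cactus SDK.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical signals; a production router would likely use something richer.
CLOUD_HINTS = ("refund", "notify", "order", "compare", "summarize")


@dataclass
class Route:
    target: str   # "local" or "cloud"
    score: float  # assessed complexity in [0, 1]


def complexity_score(request: str) -> float:
    """Toy heuristic: longer requests and more hint words score higher."""
    text = request.lower()
    words = len(text.split())
    hints = sum(1 for hint in CLOUD_HINTS if hint in text)
    return min(1.0, words / 40 + 0.3 * hints)


def route(request: str, threshold: float = 0.5) -> Route:
    """Send the request to the cloud only when the score reaches the threshold."""
    score = complexity_score(request)
    return Route("cloud" if score >= threshold else "local", score)


def handle(
    request: str,
    local: Callable[[str], str],
    cloud: Callable[[str], str],
    threshold: float = 0.5,
) -> str:
    """Call whichever backend the router picks; the caller's code stays the same."""
    decision = route(request, threshold)
    backend = cloud if decision.target == "cloud" else local
    return backend(request)


# Stand-in backends for the on-device and cloud models.
def local_model(request: str) -> str:
    return f"[local] {request}"


def cloud_model(request: str) -> str:
    return f"[cloud] {request}"


print(handle("Set the thermostat to 72 degrees", local_model, cloud_model))
print(handle("Process a refund for order #4821 and notify the customer",
             local_model, cloud_model))
```

With this heuristic the thermostat request stays local and the refund request goes to the cloud, mirroring the two examples above. The threshold is the knob that trades cloud cost against the risk of a weak local answer.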

Real-time transcription benchmark

A demo application using the Swift Cactus SDK on an iPhone 12 Pro (several generations old at the time of writing) ran the Parakeet-CTC-1.1b model in local mode and compared latency against cloud mode using Gemini-2.5-flash.

Local mode: average latency of 220–230ms from audio capture to transcribed text on screen. The device remains responsive throughout.

Cloud mode: average latency of 1,400–1,500ms for a three-second audio batch. The additional latency reflects the network round-trip to the data center and back, not just inference time.
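For a rough sense of scale, the two reported ranges can be compared directly. The short calculation below uses only the figures above, and it assumes the local and cloud measurements cover comparable spans of work, which the demo does not state explicitly.

```python
# Reported average latency ranges in milliseconds (low, high).
local_ms = (220, 230)
cloud_ms = (1400, 1500)

# Most conservative ratio: fastest cloud result vs slowest local result.
lowest_ratio = cloud_ms[0] / local_ms[1]
# Least conservative ratio: slowest cloud result vs fastest local result.
highest_ratio = cloud_ms[1] / local_ms[0]

print(f"Cloud mode is about {lowest_ratio:.1f}x to {highest_ratio:.1f}x slower")
```

On these numbers, cloud mode takes roughly six to seven times as long as local mode.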

The local latency result on a device this age suggests that the zero-copy and NPU optimizations remain effective on hardware several generations behind current flagship models.

Final thoughts

Cactus is most relevant for mobile applications that need persistent or low-latency AI inference without the reliability and cost problems of routing everything to the cloud. The zero-copy memory mapping solves the specific problem of OS termination due to RAM spikes, and the NPU-first kernels extract performance from hardware that most engines leave underutilized.

The Hybrid Router is a practical feature for production applications: it avoids hard-coding a choice between on-device and cloud, instead making that decision dynamically per request. For developers who have tried and abandoned on-device inference because of memory crashes or poor performance on older hardware, the architectural changes in Cactus are worth re-evaluating.

Documentation and SDK installation guides are at cactus.run.
