Liquid AI's LFM2-VL: running a vision-language model in the browser

Stanley Ulili
Updated on March 23, 2026

Liquid AI's LFM2-VL-1.6B is a 1.6 billion parameter Vision-Language Model (VLM) that runs entirely within a standard web browser. It uses WebGPU for GPU-accelerated inference and the ONNX Runtime as the execution engine, with no cloud API, no server, and no installation required. After the model files are downloaded and cached on the first visit, it operates fully offline.

Why on-device inference matters

Cloud-based AI services require sending data to a third-party server for processing. For use cases involving personal photos, private documents, or a live video feed, this creates privacy exposure. It also introduces network latency that makes real-time applications impractical, adds ongoing cost through API usage, and breaks entirely when internet connectivity is unavailable.

On-device inference addresses all of these: data stays on the local machine, network latency disappears, there are no API costs, and the model continues working offline. The tradeoff has historically been capability: on-device models were significantly less capable than their cloud-hosted counterparts. The LFM2-VL is designed to minimize that gap.

Infographic illustrating Liquid AI's on-device model, showing how WebGPU and the ONNX Runtime enable local processing for data security, offline capability, and cloud independence

Architecture

Hybrid design

Most large language models are based on the Transformer architecture, whose self-attention cost grows quadratically with input sequence length. The LFM2 uses a hybrid approach that avoids this scaling problem.

Schematic of the LFM2 hybrid architecture showing Gated Short Convolution Blocks and GQA Blocks working together

The architecture combines two components. Gated short-range convolution blocks act as efficient local filters that handle the bulk of computation in a memory-friendly way. Grouped Query Attention (GQA) is a more efficient variant of the standard multi-head attention mechanism that handles longer-range dependencies without the same memory overhead. Together, these allow the model to maintain a 32,000-token context window without the quadratic slowdowns that affect pure Transformer models on constrained hardware.
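The memory saving from GQA comes from several query heads sharing one key/value head, which shrinks the KV cache that grows with context length. The sketch below illustrates that head-sharing mapping; the head counts are illustrative examples, not LFM2's actual configuration.

```javascript
// Map each query head to the KV head it shares. With fewer KV heads than
// query heads, the KV cache shrinks by the ratio between the two counts.
function kvHeadFor(queryHead, numQueryHeads, numKvHeads) {
  if (numQueryHeads % numKvHeads !== 0) {
    throw new Error("query heads must divide evenly into KV groups");
  }
  const groupSize = numQueryHeads / numKvHeads; // query heads per KV head
  return Math.floor(queryHead / groupSize);
}

// Example: 8 query heads sharing 2 KV heads -> a 4x smaller KV cache.
const mapping = Array.from({ length: 8 }, (_, q) => kvHeadFor(q, 8, 2));
console.log(mapping); // [0, 0, 0, 0, 1, 1, 1, 1]
```

In standard multi-head attention the mapping would be one-to-one, so every query head would carry its own K/V tensors in the cache.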

Efficiency by design

Many small models are created by compressing or pruning a larger model, which reduces size at the cost of capability. Liquid AI's approach is different.

Slide from Liquid AI's technical documentation highlighting the "Efficiency by Design" philosophy with the phrase "efficiency by design, not compression"

The LFM2 is architected from first principles to be efficient on edge hardware. Memory layout, attention mechanisms, and parameter allocation are all optimized for the constraints of consumer devices rather than for data center GPUs. This is why the 1.6B parameter model outperforms models twice its size on several benchmarks.

Linear Input-Varying (LIV) systems

The "Liquid" in Liquid AI refers to the Linear Input-Varying (LIV) architecture. In standard models, weights are fixed after training. In LFM models, operator weights can change dynamically based on the current input, functioning as adaptive filters that compress information by focusing on the most locally relevant details. This is what makes the model efficient at handling long sequences, such as a continuous video stream, without accumulating memory or compute costs proportional to sequence length.
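The core idea of an input-varying operator can be shown in a few lines: a fixed weight matrix is modulated by a gate that is itself computed from the current input, so the effective operator changes from token to token. This is a conceptual toy, not Liquid AI's actual LIV operator; all matrices here are made-up examples.

```javascript
const dot = (a, b) => a.reduce((sum, ai, i) => sum + ai * b[i], 0);
const sigmoid = (z) => 1 / (1 + Math.exp(-z));

// Toy input-varying linear operator: each output row i has a fixed weight
// vector W[i] and a gating vector Wgate[i]. The gate value depends on the
// input x, so the effective weights adapt to what is currently being seen.
function livApply(W, Wgate, x) {
  return W.map((row, i) => sigmoid(dot(Wgate[i], x)) * dot(row, x));
}

const W = [[1, 0], [0, 1]];      // fixed weights (example values)
const Wgate = [[5, 5], [-5, -5]]; // gating weights (example values)

console.log(livApply(W, Wgate, [1, 1]));   // first row passes, second is damped
console.log(livApply(W, Wgate, [-1, -1])); // the gating flips with the input
```

The point of the sketch: the same stored parameters produce different effective filters for different inputs, which is how an adaptive operator can keep only the locally relevant information instead of attending to everything.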

WebGPU and ONNX Runtime

Two open web technologies make browser-based inference possible. WebGPU is a low-level API that provides direct access to the device GPU from the browser, enabling the kind of general-purpose GPU computation needed for neural network inference. ONNX Runtime is a cross-platform inference engine that executes models stored in the Open Neural Network Exchange (ONNX) format across CPU and GPU backends, including WebGPU. The LFM2-VL is distributed in ONNX format and executed via the ONNX Runtime's WebGPU backend.
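In code, the combination looks roughly like the sketch below: feature-detect WebGPU via `navigator.gpu`, then hand ONNX Runtime Web an execution-provider list that prefers `webgpu` and falls back to the WASM CPU backend. The fallback policy and the model URL are assumptions for illustration; the `ort.InferenceSession.create` call is shown in a comment for context rather than executed here.

```javascript
// Pick ONNX Runtime Web execution providers based on browser capability.
// navigator.gpu is only defined in WebGPU-capable browsers, so its absence
// means inference must fall back to the WASM (CPU) backend.
function selectExecutionProviders(nav = globalThis.navigator) {
  return nav && "gpu" in nav ? ["webgpu", "wasm"] : ["wasm"];
}

console.log(selectExecutionProviders());

// In a real page (assumes the onnxruntime-web package is loaded as `ort`
// and "model.onnx" is a placeholder URL):
//   const session = await ort.InferenceSession.create("model.onnx", {
//     executionProviders: selectExecutionProviders(),
//   });
```

Passing the provider list in preference order lets the runtime try WebGPU first and degrade gracefully, so the same page works (more slowly) on browsers without WebGPU.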

Running the webcam captioning demo

Liquid AI provides a live demo on Hugging Face Spaces that runs the LFM2-VL-1.6B model against a real-time webcam feed. The only requirement is a browser with WebGPU support, such as a recent version of Chrome or Edge.

Model loading and quantization

The demo offers three model variants with different precision and file size tradeoffs:

  • Vision Q4, Decoder Q4 (~1.8 GB): smallest and fastest, lower precision
  • Vision FP16, Decoder Q4 (~2.3 GB): balanced option
  • Vision FP16, Decoder FP16 (~3.5 GB): highest precision, largest download
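A page embedding the model might choose among these variants automatically from a rough download budget. The helper below mirrors the demo's variant names and sizes; the selection policy itself (largest variant that fits, else the smallest) is an assumption for illustration, not something the demo documents.

```javascript
// Variants from the table above: precision vs. download size tradeoffs.
const VARIANTS = [
  { name: "Vision Q4, Decoder Q4", sizeGb: 1.8 },
  { name: "Vision FP16, Decoder Q4", sizeGb: 2.3 },
  { name: "Vision FP16, Decoder FP16", sizeGb: 3.5 },
];

// Pick the highest-precision variant that fits the budget; if none fits,
// fall back to the smallest one.
function pickVariant(budgetGb) {
  const fitting = VARIANTS.filter((v) => v.sizeGb <= budgetGb);
  return fitting.length ? fitting[fitting.length - 1] : VARIANTS[0];
}

console.log(pickVariant(2.5).name); // "Vision FP16, Decoder Q4"
```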

Demo interface showing the quantization selection dropdown with Q4 and FP16 options and their corresponding file sizes

Selecting a quantization level and clicking Load starts the download. The model is cached in the browser after the first download and loads instantly on subsequent visits. Once loaded, clicking Start prompts for webcam permission and begins real-time captioning.
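The download-once behavior can be implemented with the browser's Cache Storage API. The sketch below assumes that pattern (the demo's actual caching mechanism isn't documented here, and some frameworks use IndexedDB instead); the cache and fetch implementations are injected so the logic can run outside a browser, where you would pass `await caches.open("models")` and the global `fetch`. The cache name and URL are placeholders.

```javascript
// Fetch a model file once, then serve it from the cache on later visits,
// which is what lets inference keep working with no network connection.
async function loadModelBytes(url, cache, fetchFn) {
  const hit = await cache.match(url);
  if (hit) return hit; // subsequent visits: served locally, works offline

  const response = await fetchFn(url); // first visit: network download
  // Real Response bodies are consumed by Cache.put, hence the clone();
  // the fallback keeps the sketch working with simple mock objects.
  await cache.put(url, response.clone ? response.clone() : response);
  return response;
}
```

With multi-gigabyte weights, this first-visit download is the expensive step; everything after it is a local cache read.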

Capabilities observed in testing

With the FP16 variant loaded and a live webcam feed active, the model produces continuous natural language descriptions updated in real time.

Basic subject description is accurate and specific. Pointing the camera at a person produces captions like "A man with a shaved head and a beard is wearing a dark hoodie and looking directly at the camera," identifying physical attributes and framing.

Object recognition updates immediately when new objects enter the frame. A smartphone is identified as "A man is holding a smartphone with a black case," and a microphone is correctly identified as a microphone.

The OCR capability is particularly notable. When a RØDE microphone is held toward the camera, the model identifies the device and reads the text printed on it, producing a caption that includes "...with the word 'RODE' on it."

Demo screen showing a live webcam feed with a RØDE microphone held up, with the caption accurately including the brand name "RODE"

Gesture recognition also works reliably. A peace sign produces "A man is making a peace sign with his hand" and a thumbs-up produces the corresponding description, with the caption updating within the same refresh cycle as the gesture change.

Offline verification

Disabling Wi-Fi while the webcam captioning is running produces no interruption. Captions continue generating at the same rate and accuracy with no network connection. The inference is handled entirely by the local GPU through WebGPU, with the model weights served from the browser cache.

Final thoughts

The LFM2-VL-1.6B is a practical demonstration that capable vision-language inference is achievable in a browser on consumer hardware. The combination of the hybrid LIV architecture, efficiency-first design, WebGPU acceleration, and ONNX Runtime execution produces a model that covers object recognition, OCR, and gesture understanding in real time without a cloud dependency.

The main constraint is the initial download, which ranges from 1.8 GB to 3.5 GB depending on the quantization level selected. After that, the model is fully local. For developers building privacy-sensitive applications, accessibility tools, or anything that needs to function without reliable internet access, the approach Liquid AI has taken here is worth close attention.

The model weights and demo are available on Hugging Face, and the technical details of the LFM architecture are documented on Liquid AI's research page.