Qwen 3.5 Small Models: Multimodal AI on Your Laptop and Phone, Offline
Alibaba's Qwen 3.5 small model series challenges a long-standing assumption in AI development: that sophisticated multimodal capabilities require enormous parameter counts. The 0.8B and 2B models in this series handle text, images, and code through a native multimodal architecture, rather than bolting vision onto a language model after the fact, and are small enough to run entirely offline on consumer hardware.
Intelligence density and native multimodal architecture
Most small models are primarily text-based. Vision or coding capabilities, when present, are typically added on top of a language model through separate modules or adapters. The Qwen 3.5 small series takes a different approach: multimodal processing is built into the architecture from the start, so the same model weights handle text, images, and code without switching between subsystems.
This unified design is what allows the 0.8B model to have functional vision and coding abilities despite its size. The Qwen team describes the goal as maximizing "intelligence density" per parameter rather than scaling parameter count.
Both models support a 262,000-token context window, large enough to process entire PDF documents, long codebases, or extended multi-turn conversations within a single session.
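For a rough sense of scale (assuming the common heuristic of about 4 characters or 0.75 words per English token, which varies by tokenizer and language), a 262,000-token window corresponds to roughly a megabyte of plain text:

```python
# Back-of-the-envelope estimate of how much text fits in the context window.
# The chars/words-per-token ratios are rough heuristics, not tokenizer-exact.
CONTEXT_TOKENS = 262_000
CHARS_PER_TOKEN = 4          # rough average for English text
WORDS_PER_TOKEN = 0.75       # rough average for English text

approx_chars = CONTEXT_TOKENS * CHARS_PER_TOKEN       # ~1,048,000 characters
approx_words = int(CONTEXT_TOKENS * WORDS_PER_TOKEN)  # ~196,500 words

print(f"~{approx_chars:,} characters, ~{approx_words:,} words")
```

That is on the order of a full-length novel or a mid-sized codebase in a single prompt.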
Benchmark performance
On the MMLU benchmark, which evaluates general knowledge and reasoning, the 2B model scores 66.5. For reference, Llama 2 at 7B parameters scored 45.3 on the same test; the 2B model outscores a model with more than three times its parameter count from a generation earlier.
The more notable results come from vision benchmarks. On OCRBench, which tests a model's ability to recognize and understand text within images, the 2B model scores 85.4 and the 0.8B model scores 79.1. These scores place them ahead of models several times their size on this specific task.
Running Qwen 3.5 offline on a laptop for coding
The combination of LM Studio and Cline (a VS Code extension) provides a straightforward path to running these models locally and connecting them to a code editor. LM Studio handles model management and serves a local API; Cline connects to that API and acts as a coding agent inside VS Code.
Setting up LM Studio
In LM Studio's search interface, search for Qwen 3.5 0.8B and download a GGUF version of the model. GGUF is a file format that stores model weights, typically quantized to lower precision, and is optimized for local inference. Repeat the process for Qwen 3.5 2B.
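Quantization is what makes these downloads small: weights are stored at reduced precision (for example 4 or 8 bits) instead of 16- or 32-bit floats. A minimal sketch of the idea behind symmetric 8-bit quantization, for illustration only (actual GGUF encodings use more sophisticated block-wise schemes):

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats to [-127, 127] via one scale."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the quantized integers."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.03, 0.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored weight is close to, but not exactly, the original value;
# that small, mostly harmless error is the price of a ~4x smaller file.
print(q, [round(r, 3) for r in restored])
```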
After downloading, open the Context and Offload settings for the model and increase the context length slider from its default (often 4096) to a higher value. This gives the model enough working memory to handle larger code generation tasks.
To start the local server, navigate to the local server tab, select the model from the dropdown, and click Start Server. LM Studio will load the model into memory and expose a local API endpoint, typically at http://127.0.0.1:1234.
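LM Studio's local server speaks the OpenAI-compatible chat completions API, so any OpenAI-style client can talk to it. A minimal sketch using only the Python standard library (the model identifier here is a placeholder; use whatever name LM Studio shows for your loaded model):

```python
import json
import urllib.request

# Payload for LM Studio's OpenAI-compatible endpoint at
# http://127.0.0.1:1234/v1/chat/completions (requires the server to be running).
payload = {
    "model": "qwen3.5-0.8b",  # placeholder; match the name shown in LM Studio
    "messages": [{"role": "user", "content": "Write a haiku about coffee."}],
    "max_tokens": 128,
}

def ask(payload, url="http://127.0.0.1:1234/v1/chat/completions"):
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# print(ask(payload))  # uncomment with the LM Studio server running
```

Because the endpoint is OpenAI-compatible, the same request shape works whether the loaded model is the 0.8B or the 2B variant.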
Connecting Cline to the local server
In VS Code, open the Cline settings panel and set the API Provider to LM Studio. Enable the custom base URL option and paste in the address from LM Studio. Select the matching model name from the dropdown.
With Wi-Fi disabled to confirm no network dependency, both models are ready to use.
The 0.8B model: coding results
Given the prompt "Build a complete, professional company website for a cafe called 'Power Brewers'. Use basic HTML, CSS, and Javascript without any external libraries," the 0.8B model completes the task in approximately one minute on an M2 MacBook Pro.
The output is a working multi-section website in a single index.html file. The design is minimal, with some contrast issues (dark text on dark backgrounds), and the model included Unsplash image URLs that don't load offline. Still, the model produced a complete, structured result from a single prompt with no internet access, which is the more significant point at this parameter count.
The 2B model: coding results
The same prompt given to the 2B model produces a noticeably different process. Before writing any code, the model outlines a plan covering the site structure, features, and implementation steps. The task takes around three minutes to complete.
The output is more polished: a coffee-themed brown and white color palette, cleaner layout, and an attempted shopping cart sidebar with item listings. The "Add to Cart" buttons were absent from the final output, but the structural intent was clear. During testing, the 2B model was more prone to getting stuck in generation loops and occasionally needed the task restarted.
Running Qwen 3.5 on an iPhone, offline
Apple's open-source MLX Swift framework enables hardware-accelerated inference on Apple Silicon by letting the CPU and GPU share memory directly through the unified memory architecture. A custom SwiftUI app using MLX Swift makes it possible to run Qwen 3.5 on-device. The source code is available at github.com/andrisgauracs/qwen-chat-ios.
All tests below were conducted with the iPhone in airplane mode after the models were downloaded.
The 0.8B model on iPhone
Text inference runs at over 22 tokens per second, which feels effectively instant for conversational use.
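To put 22 tokens per second in perspective (assuming a typical short chat reply of around 150 tokens, a figure chosen here for illustration):

```python
# Rough latency estimate for a conversational reply at the measured rate.
tokens_per_second = 22
reply_tokens = 150  # assumed length of a typical short chat reply

seconds = reply_tokens / tokens_per_second
print(f"~{seconds:.1f} s for a {reply_tokens}-token reply")  # under 7 seconds
```

Since tokens stream in as they are generated, the first words appear almost immediately, which is why the experience reads as instant.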
On a reasoning test ("The car wash is only 100 meters away from my house. Should I walk or drive?"), the model correctly answers "Drive" with reasonable justification. On vision tasks, results are more mixed. Given an image of a ripe banana with brown spots, it correctly identifies the fruit and notes it is overripe, but also appends the phrase "dog banana" and warns it may not be safe to eat. Shown a Corgi, it hallucinates a second dog in the image and misidentifies the breed as a Golden Retriever. On an OCR test with Latvian text, it misidentifies the language as Slovenian.
The 2B model on iPhone
The 2B model improves meaningfully on the vision and OCR tasks. On the banana image, it correctly describes the condition as "fully ripe and ready to eat," interpreting the brown spots accurately. On the Corgi, it still fails to identify the breed, guessing Pomeranian instead. On the Latvian OCR test, it correctly identifies the language and attempts a partial transcription and translation, a significant capability jump over the 0.8B result.
A note on the Qwen team
Shortly after the Qwen 3.5 release, reports emerged that key members of the Qwen team at Alibaba, including senior leadership and engineers, were departing to start a new AI company.
Whether this affects the pace of future Qwen releases remains to be seen. Alibaba has not announced any changes to the project's roadmap.
Final thoughts
The Qwen 3.5 small models make a genuine case that on-device multimodal AI is no longer a novelty. The 0.8B model is fast enough for real-time use on a phone and capable enough to generate functional code on a laptop, all without a network connection. The 2B model adds meaningfully better reasoning and OCR accuracy, particularly visible in the language identification and banana ripeness tests, at the cost of some speed.
Neither model is without hallucinations or rough edges, and the 2B model's tendency toward generation loops during coding tasks is a practical annoyance. But the overall capability-to-size ratio is notable, and the 262,000-token context window gives both models room to work with real-world inputs that most small models would struggle to fit in memory.
For developers building offline or privacy-sensitive applications, both models are worth evaluating. The iOS source code at github.com/andrisgauracs/qwen-chat-ios provides a starting point for on-device deployment on Apple hardware.