Qwen 3.5 Small Models: Multimodal AI on Your Laptop and Phone, Offline
Alibaba's Qwen 3.5 small model series challenges a long-standing assumption in AI development: that sophisticated multimodal capabilities require enormous parameter counts. The 0.8B and 2B models in this series handle text, images, and code through a native multimodal architecture, rather than bolting vision onto a language model after the fact, and are small enough to run entirely offline on consumer hardware.
Intelligence density and native multimodal architecture
Most small models are primarily text-based. Vision or coding capabilities, when present, are typically added on top of a language model through separate modules or adapters. The Qwen 3.5 small series takes a different approach: multimodal processing is built into the architecture from the start, so the same model weights handle text, images, and code without switching between subsystems.
This unified design is what allows the 0.8B model to have functional vision and coding abilities despite its size. The Qwen team describes the goal as maximizing "intelligence density" per parameter rather than scaling parameter count.
Both models support a 262,000-token context window, large enough to process entire PDF documents, long codebases, or extended multi-turn conversations within a single session.
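For a rough sense of scale (assuming the common heuristic of about 4 characters or 0.75 words per English token, which varies by tokenizer and language), a 262,000-token window corresponds to roughly a megabyte of plain text:

```python
# Back-of-the-envelope estimate of how much text fits in the context window.
# The chars/words-per-token ratios are rough heuristics, not tokenizer-exact.
CONTEXT_TOKENS = 262_000
CHARS_PER_TOKEN = 4          # rough average for English text
WORDS_PER_TOKEN = 0.75       # rough average for English text

approx_chars = CONTEXT_TOKENS * CHARS_PER_TOKEN       # ~1,048,000 characters
approx_words = int(CONTEXT_TOKENS * WORDS_PER_TOKEN)  # ~196,500 words

print(f"~{approx_chars:,} characters, ~{approx_words:,} words")
```

That is on the order of a full-length novel or a mid-sized codebase in a single prompt.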
Benchmark performance
On the MMLU benchmark, which evaluates general knowledge and reasoning, the 2B model scores 66.5. For reference, Llama 2 at 7B parameters scored 45.3 on the same test; the 2B model outscores a model with more than three times its parameter count from a generation earlier.
The more notable results come from vision benchmarks. On OCRBench, which tests a model's ability to recognize and understand text within images, the 2B model scores 85.4 and the 0.8B model scores 79.1. These scores place them ahead of models several times their size on this specific task.
Running Qwen 3.5 offline on a laptop for coding
The combination of LM Studio and Cline (a VS Code extension) provides a straightforward path to running these models locally and connecting them to a code editor. LM Studio handles model management and serves a local API; Cline connects to that API and acts as a coding agent inside VS Code.
Setting up LM Studio
In LM Studio's search interface, search for Qwen 3.5 0.8B and download a GGUF version of the model. GGUF is a file format that stores model weights, typically quantized to lower precision, and is optimized for local inference. Repeat the process for Qwen 3.5 2B.
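Quantization is what makes these downloads small: weights are stored at reduced precision (for example 4 or 8 bits) instead of 16- or 32-bit floats. A minimal sketch of the idea behind symmetric 8-bit quantization, for illustration only (actual GGUF encodings use more sophisticated block-wise schemes):

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats to [-127, 127] via one scale."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the quantized integers."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.03, 0.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored weight is close to, but not exactly, the original value;
# that small, mostly harmless error is the price of a ~4x smaller file.
print(q, [round(r, 3) for r in restored])
```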
After downloading, open the Context and Offload settings for the model and increase the context length slider from its default (often 4096) to a higher value. This gives the model enough working memory to handle larger code generation tasks.
To start the local server, navigate to the local server tab, select the model from the dropdown, and click Start Server. LM Studio will load the model into memory and expose a local API endpoint, typically at http://127.0.0.1:1234.
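LM Studio's local server speaks the OpenAI-compatible chat completions API, so any OpenAI-style client can talk to it. A minimal sketch using only the Python standard library (the model identifier here is a placeholder; use whatever name LM Studio shows for your loaded model):

```python
import json
import urllib.request

# Payload for LM Studio's OpenAI-compatible endpoint at
# http://127.0.0.1:1234/v1/chat/completions (requires the server to be running).
payload = {
    "model": "qwen3.5-0.8b",  # placeholder; match the name shown in LM Studio
    "messages": [{"role": "user", "content": "Write a haiku about coffee."}],
    "max_tokens": 128,
}

def ask(payload, url="http://127.0.0.1:1234/v1/chat/completions"):
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# print(ask(payload))  # uncomment with the LM Studio server running
```

Because the endpoint is OpenAI-compatible, the same request shape works whether the loaded model is the 0.8B or the 2B variant.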
Connecting Cline to the local server
In VS Code, open the Cline settings panel and set the API Provider to LM Studio. Enable the custom base URL option and paste in the address from LM Studio. Select the matching model name from the dropdown.
With Wi-Fi disabled to confirm no network dependency, both models are ready to use.
The 0.8B model: coding results
Given the prompt "Build a complete, professional company website for a cafe called 'Power Brewers'. Use basic HTML, CSS, and Javascript without any external libraries," the 0.8B model completes the task in approximately one minute on an M2 MacBook Pro.
The output is a working multi-section website in a single index.html file. The design is minimal, with some contrast issues (dark text on dark backgrounds), and the model included Unsplash image URLs that don't load offline. Still, the model produced a complete, structured result from a single prompt with no internet access, which is the more significant point at this parameter count.
The 2B model: coding results
The same prompt given to the 2B model produces a noticeably different process. Before writing any code, the model outlines a plan covering the site structure, features, and implementation steps. The task takes around three minutes to complete.
The output is more polished: a coffee-themed brown and white color palette, cleaner layout, and an attempted shopping cart sidebar with item listings. The "Add to Cart" buttons were absent from the final output, but the structural intent was clear. During testing, the 2B model was more prone to getting stuck in generation loops and occasionally needed the task restarted.
Running Qwen 3.5 on an iPhone, offline
Apple's open-source MLX Swift framework enables hardware-accelerated inference on Apple Silicon by letting the CPU and GPU share memory directly through the unified memory architecture. A custom SwiftUI app using MLX Swift makes it possible to run Qwen 3.5 on-device. The source code is available at github.com/andrisgauracs/qwen-chat-ios.
All tests below were conducted with the iPhone in airplane mode after the models were downloaded.
The 0.8B model on iPhone
Text inference runs at over 22 tokens per second, which feels effectively instant for conversational use.
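To put 22 tokens per second in perspective (assuming a typical short chat reply of around 150 tokens, a figure chosen here for illustration):

```python
# Rough latency estimate for a conversational reply at the measured rate.
tokens_per_second = 22
reply_tokens = 150  # assumed length of a typical short chat reply

seconds = reply_tokens / tokens_per_second
print(f"~{seconds:.1f} s for a {reply_tokens}-token reply")  # under 7 seconds
```

Since tokens stream in as they are generated, the first words appear almost immediately, which is why the experience reads as instant.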
On a reasoning test ("The car wash is only 100 meters away from my house. Should I walk or drive?"), the model correctly answers "Drive" with reasonable justification. On vision tasks, results are more mixed. Given an image of a ripe banana with brown spots, it correctly identifies the fruit and notes it is overripe, but also appends the phrase "dog banana" and warns it may not be safe to eat. Shown a Corgi, it hallucinates a second dog in the image and misidentifies the breed as a Golden Retriever. On an OCR test with Latvian text, it misidentifies the language as Slovenian.
The 2B model on iPhone
The 2B model improves meaningfully on the vision and OCR tasks. On the banana image, it correctly describes the condition as "fully ripe and ready to eat," interpreting the brown spots accurately. On the Corgi, it still fails to identify the breed, guessing Pomeranian instead. On the Latvian OCR test, it correctly identifies the language and attempts a partial transcription and translation, a significant capability jump over the 0.8B result.
A note on the Qwen team
Shortly after the Qwen 3.5 release, reports emerged that key members of the Qwen team at Alibaba, including senior leadership and engineers, were departing to start a new AI company.
Whether this affects the pace of future Qwen releases remains to be seen. Alibaba has not announced any changes to the project's roadmap.
Final thoughts
The Qwen 3.5 small models make a genuine case that on-device multimodal AI is no longer a novelty. The 0.8B model is fast enough for real-time use on a phone and capable enough to generate functional code on a laptop, all without a network connection. The 2B model adds meaningfully better reasoning and OCR accuracy, particularly visible in the language identification and banana ripeness tests, at the cost of some speed.
Neither model is without hallucinations or rough edges, and the 2B model's tendency toward generation loops during coding tasks is a practical annoyance. But the overall capability-to-size ratio is notable, and the 262,000-token context window gives both models room to work with real-world inputs that most small models would struggle to fit in memory.
For developers building offline or privacy-sensitive applications, both models are worth evaluating. The iOS source code at github.com/andrisgauracs/qwen-chat-ios provides a starting point for on-device deployment on Apple hardware.