Gemma 4 12B: Encoder-Free Multimodal Architecture with Linear Projection
Google's Gemma 4 12B eliminates the separate vision and audio encoders found in traditional multimodal models. Instead of using large encoder networks to translate images and audio into text-compatible embeddings, it uses lightweight linear projections that reformat raw data directly into the LLM's internal vector format (the hidden dimension). All visual and audio reasoning happens inside the unified transformer backbone.
Traditional encoder-based architecture
Most multimodal models add modality-specific encoders on top of an LLM backbone.
The typical pipeline:
- A vision encoder (often 500M+ parameters) processes an image through dozens of layers to produce embeddings
- A connector or adapter layer adjusts these embeddings to match the LLM's input format
- An audio encoder runs a parallel process for sound
- Only after this preprocessing does the LLM receive the data
This preprocessing exists because an LLM operates only on numerical vectors (embeddings), which for text are derived from tokens.
The encoder approach works but carries costs: multiple large models must be loaded simultaneously (consuming significant VRAM), encoding is sequential and adds latency, and because encoders are often frozen pre-trained models, fine-tuning is limited to the LLM backbone.
Gemma 4's encoder-free approach
Rather than using a separate neural network to interpret an image, Gemma 4 uses a thin mathematical layer to reformat the raw data into the LLM's existing internal format.
Image processing via linear projection
- Patching. The image is divided into 48×48 pixel patches. Each patch contains 48 × 48 = 2,304 pixel values.
- Linear projection. Each patch's 2,304-value vector is multiplied by a pre-trained weight matrix in a single matrix multiplication. This transforms it into a vector sized to match the LLM's hidden dimension.
- No visual analysis at this stage. The embedder does not identify edges, objects, or textures. It is purely a format converter.
All visual reasoning (object recognition, spatial relationships, text in images) is performed by the LLM's transformer layers once the data is in the correct format.
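The patching and projection steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not Gemma 4's actual implementation: the hidden dimension (HIDDEN_DIM), the random weight matrix, and the single-channel input are assumptions made so the example runs on its own. The 2,304-value figure is taken to mean one value per pixel; how colour channels are handled is not described here.

```python
import numpy as np

PATCH = 48                      # patch side length in pixels (from the text)
PATCH_VALUES = PATCH * PATCH    # 2,304 values per patch
HIDDEN_DIM = 1024               # hypothetical; the real hidden dimension is not stated


def patchify(image: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Split an (H, W) array into flattened, non-overlapping patches."""
    h, w = image.shape
    if h % patch or w % patch:
        raise ValueError("image sides must be multiples of the patch size")
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch) -> (rows, cols, patch, patch) -> (n, patch*patch)
    blocks = image.reshape(rows, patch, cols, patch).transpose(0, 2, 1, 3)
    return blocks.reshape(rows * cols, patch * patch)


def project(patches: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One matrix multiplication: (n, 2304) @ (2304, hidden) -> (n, hidden)."""
    return patches @ weight


rng = np.random.default_rng(0)
image = rng.random((96, 144), dtype=np.float32)   # stand-in single-channel image
# Random stand-in for the pre-trained weight matrix.
weight = rng.standard_normal((PATCH_VALUES, HIDDEN_DIM), dtype=np.float32)

patches = patchify(image)
embeddings = project(patches, weight)
print(patches.shape, embeddings.shape)   # (6, 2304) (6, 1024)
```

Nothing in this sketch looks at edges or objects. Each output row is a linear combination of the raw pixel values in one patch, which is what "format converter" means in practice.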
The hidden dimension is the standardized vector size used internally by the LLM. Every input, whether a text token, code, or a pixel patch, must be converted to fit this fixed format before the transformer layers process it.
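To illustrate why a shared vector size matters, the sketch below turns a few text tokens and a few image patches into vectors of the same width and joins them into one sequence. All sizes, the random weights, and the ordering of text before image are hypothetical choices for illustration; the source does not describe how Gemma 4 interleaves modalities.

```python
import numpy as np

HIDDEN_DIM = 64       # hypothetical, kept small for illustration
VOCAB_SIZE = 1000     # hypothetical
PATCH_VALUES = 2304   # values per image patch (from the text)

rng = np.random.default_rng(0)
token_table = rng.standard_normal((VOCAB_SIZE, HIDDEN_DIM))     # text: lookup table
patch_weight = rng.standard_normal((PATCH_VALUES, HIDDEN_DIM))  # image: projection matrix

token_ids = np.array([5, 42, 7])          # stand-in ids for a short text prompt
patches = rng.random((4, PATCH_VALUES))   # stand-in for four flattened patches

text_vectors = token_table[token_ids]     # (3, HIDDEN_DIM) via row lookup
image_vectors = patches @ patch_weight    # (4, HIDDEN_DIM) via matrix multiplication

# Both modalities now share one width, so they can form a single sequence
# that the transformer layers would process together.
sequence = np.concatenate([text_vectors, image_vectors], axis=0)
print(sequence.shape)   # (7, 64)
```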
Audio processing
Audio follows a similar pattern:
- Slicing. The audio waveform is cut into 40-millisecond segments.
- Sampling. Each segment is represented by 640 floating-point amplitude values.
- Direct projection. The 640-value vector is projected into the hidden dimension format.
Audio is a time-series stream structurally similar to a text sequence, so the transformer architecture processes it naturally once it is in token-like format.
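The audio steps can be sketched the same way. The figures of 640 values per 40 ms segment imply a 16 kHz sample rate (640 / 0.040 s). The hidden dimension, the random weights, and the choice to drop a trailing partial segment are assumptions for this example, not details from Gemma 4.

```python
import numpy as np

SEGMENT_MS = 40          # segment length (from the text)
SEGMENT_SAMPLES = 640    # amplitude values per segment (from the text)
SAMPLE_RATE = SEGMENT_SAMPLES * 1000 // SEGMENT_MS   # implied: 16000 Hz
HIDDEN_DIM = 1024        # hypothetical; the real hidden dimension is not stated


def slice_audio(waveform: np.ndarray) -> np.ndarray:
    """Cut a 1-D waveform into consecutive 640-sample segments."""
    n_segments = len(waveform) // SEGMENT_SAMPLES
    # Assumption: a trailing partial segment is simply dropped.
    trimmed = waveform[: n_segments * SEGMENT_SAMPLES]
    return trimmed.reshape(n_segments, SEGMENT_SAMPLES)


rng = np.random.default_rng(0)
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE                  # one second of timestamps
waveform = np.sin(2 * np.pi * 440 * t).astype(np.float32)  # 440 Hz test tone
# Random stand-in for the pre-trained projection weights.
weight = rng.standard_normal((SEGMENT_SAMPLES, HIDDEN_DIM), dtype=np.float32)

segments = slice_audio(waveform)
audio_vectors = segments @ weight   # (n_segments, 640) @ (640, hidden)
print(SAMPLE_RATE, segments.shape, audio_vectors.shape)   # 16000 (25, 640) (25, 1024)
```

One second of audio becomes 25 token-like vectors, each the same width as a text embedding.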
Parameter count comparison
The Gemma 4 vision embedder uses approximately 35 million parameters. Traditional vision encoders typically use 500 million or more. This reduction helps the full model fit within 16 GB of unified memory, whereas encoder-based models of comparable reasoning ability typically require much more.
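As a rough check on the memory figure, the weight footprint of a 12-billion-parameter model can be estimated from the bits stored per weight. This is back-of-envelope arithmetic only: it assumes exactly 12e9 parameters and ignores activations, the KV cache, and runtime overhead, all of which add to real usage.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 12e9   # assumption: "12B" taken as exactly 12 billion parameters

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
# 16-bit weights: 24.0 GB
# 8-bit weights: 12.0 GB
# 4-bit weights: 6.0 GB
```

By this estimate, the weights alone fit in 16 GB only at 8-bit precision or lower.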
The model scores close to Gemma 4 26B on several evaluations despite having less than half the parameters, a result attributed to the efficiency of unified end-to-end training without frozen encoder components.
Running locally on Apple Silicon with oMLX
For Apple Silicon (M1–M4) users, oMLX provides an optimized runtime. Setup steps:
- Install oMLX following instructions in the repository
- Download a quantized Gemma 4 12B model (8-bit versions are available on Hugging Face)
- Load the model through the oMLX web interface, which displays memory consumption
- Use the Chat interface to send text prompts and upload images
The interface is fully local with no cloud dependency.
Performance on image tasks
Two image tests were run offline on an M2 MacBook Pro:
Airport departures board. Given an image of a flight information display, the model identifies the board layout, performs OCR on the flight details, and returns structured output listing flights, times, statuses (Boarding, Flight closing), and gate numbers accurately.
Vikings scene. Given a slightly blurry scene from a TV show, the model identifies the setting as historical/fantasy, describes the central figure (a woman in a leadership role), the surrounding warriors' attire and weapons (axes, spears, wooden shields), the setting (sandy beach with dry grass), and the overall atmosphere.
Both responses are generated on device in near real time.
Final thoughts
The key architectural claim is that separate encoder networks are not necessary for multimodal understanding if the underlying LLM is powerful enough. Linear projection into the hidden dimension is sufficient to make raw pixel and audio data addressable by the transformer layers, and the transformer can then perform the reasoning that encoders previously handled.
The practical consequences are meaningful: the model is small enough to run on a laptop with 16 GB of memory, fine-tuning trains all parameters end-to-end rather than leaving encoder components frozen, and inference latency is lower because there is no sequential encode-then-reason pipeline.
Whether this architecture generalizes well to more demanding vision tasks (dense scene understanding, high-resolution document analysis) is an open question, but the benchmark results for a 12B model suggest the approach is viable at this scale.
Documentation and model downloads are at ai.google.dev/gemma.