Google Gemma 4: Per-Layer Embeddings, Multimodality, and On-Device Performance
Google's Gemma 4 is a family of open-weight models released under the Apache 2.0 license. The edge-focused models in the family use a new architecture called Per-Layer Embeddings (PLE) that allows them to perform well above what their active parameter count would suggest. The smallest models run on devices with 1.5 GB of RAM and ship with native multimodality, a 128K token context window, and support for over 140 languages.
Architecture
Per-Layer Embeddings
In a standard transformer, each input token is converted to a single embedding at the input layer, and every subsequent layer transforms that one representation. Gemma 4's PLE architecture gives each layer its own set of embeddings, allowing the model to re-introduce and refine token-specific information at every depth rather than relying on a single initial vector to carry all semantic weight.
This approach increases intelligence density — capability per active parameter — so the model can handle complex reasoning with a smaller active parameter footprint.
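The mechanism can be sketched in a few lines of Python. Everything here — table sizes, the combine step, the placeholder layer transform — is illustrative only, not Gemma 4's actual implementation:

```python
import random

random.seed(0)

VOCAB, DIM, LAYERS = 8, 4, 3

def rand_vec():
    return [random.uniform(-1, 1) for _ in range(DIM)]

# Standard transformer: one embedding table, looked up once at the input.
input_table = [rand_vec() for _ in range(VOCAB)]

# PLE (sketch): each layer additionally owns its own embedding table,
# so token-specific information is re-injected at every depth.
per_layer_tables = [[rand_vec() for _ in range(VOCAB)] for _ in range(LAYERS)]

def forward(token_id):
    hidden = list(input_table[token_id])
    for layer in range(LAYERS):
        ple = per_layer_tables[layer][token_id]
        # Illustrative combine: add the layer's own embedding for this token,
        # then apply a stand-in for the layer's usual attention/MLP transform.
        hidden = [h + p for h, p in zip(hidden, ple)]
        hidden = [0.5 * h for h in hidden]  # placeholder transform
    return hidden

print(forward(3))
```

One plausible reason this design helps on-device (an inference, not a claim from the release notes): the per-layer tables are simple lookups, so they can be streamed in one layer at a time rather than held resident like the active weights.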
Effective vs. active parameters
The smaller Gemma 4 models are named E2B and E4B, where the "E" stands for "Effective."
- Gemma 4 E2B: 2.3 billion active parameters, performing comparably to a 5.1 billion parameter model
- Gemma 4 E4B: 4.5 billion active parameters, performing comparably to an 8 billion parameter model
During inference only the active parameters are used, which keeps memory and compute requirements low while the PLE architecture maintains reasoning quality above what the active count alone would suggest.
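The memory implications are easy to estimate from the active counts alone. The bytes-per-parameter figures below are assumptions for common formats (fp16 and a 4-bit quantization), and the estimate covers weights only, ignoring KV cache and runtime overhead:

```python
def est_ram_gb(active_params_billions, bytes_per_param):
    """Rough weight-memory estimate: active parameters x bytes per parameter."""
    return active_params_billions * 1e9 * bytes_per_param / 2**30

for name, params in [("E2B", 2.3), ("E4B", 4.5)]:
    for fmt, bpp in [("fp16", 2.0), ("int4", 0.5)]:
        print(f"{name} {fmt}: ~{est_ram_gb(params, bpp):.1f} GB")
```

At an assumed 4-bit quantization, E2B's weights come out to roughly 1.1 GB, which is consistent with the article's claim that the smallest models fit in 1.5 GB of RAM once runtime overhead is added.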
Licensing
Gemma 4 is released under the Apache 2.0 license, which permits commercial use without restrictions. This distinguishes it from models that are open-weight but commercially restricted.
Other capabilities
Native multimodality: Text, vision, and audio inputs are handled within a single unified architecture rather than through separate adapters bolted onto a language model.
Thinking mode: An internal chain-of-thought reasoning process allows the model to verify its own logic before producing a final answer. This reduces logical fallacies and infinite reasoning loops that are common in smaller models.
Context and language support: The edge models ship with a 128K token context window. Pre-training covered more than 140 languages, which contributes to strong multilingual OCR and language identification performance.
Running Gemma 4 locally with LM Studio and Cline
LM Studio hosts the model and serves a local API. Cline, a VS Code extension, connects to that API as the coding agent.
In LM Studio, search for gemma 4 and download the quantized versions of gemma-4-e2b-it and gemma-4-e4b-it. Under the Local Server tab, select the model, set the context length to 131072 (128K), and click Start Server.
In Cline's settings, set the API provider to LM Studio and point the base URL at the local server address, typically http://127.0.0.1:1234. Disabling Wi-Fi confirms that all inference runs on-device.
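The same local server can be queried directly from a script. The sketch below assumes LM Studio's usual OpenAI-compatible endpoint (`/v1/chat/completions`) at the default address; the model name and prompt are placeholders:

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:1234"  # LM Studio's default local server address

def build_chat_request(model, prompt, base_url=BASE_URL):
    """Build a request for an OpenAI-compatible chat completions endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("gemma-4-e4b-it", "Say hello in Latvian.")
# With the server running: resp = urllib.request.urlopen(req)
print(req.full_url)
```

Because no network request is made until `urlopen` is called, the request can be constructed and inspected even with Wi-Fi disabled, matching the on-device check described above.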
Coding test: full-stack website generation
Both models were given the same prompt: build a complete HTML, CSS, and JavaScript website for a fictional cafe called "Power Brewers," including a functional shopping cart.
E2B (2.3B active parameters)
Completion time was approximately 1.5 minutes. The output had significant issues. The model appended its internal task list to the bottom of both index.html and style.css, requiring manual cleanup. It referenced a script.js file for the shopping cart but the file was empty. The design was minimal.
E4B (4.5B active parameters)
Completion time was approximately 3.5 minutes. The design remained plain but the JavaScript output was correct. script.js implemented a working shopping cart: items could be added from the menu, the total updated on each addition, and checkout triggered a confirmation alert.
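The cart behavior E4B produced can be summarized by logic along these lines — a Python sketch of the described behavior, not the generated script.js itself, with illustrative menu items and prices:

```python
class Cart:
    """Minimal cart mirroring the behavior described above:
    add items from a menu, keep a running total, confirm at checkout."""

    MENU = {"espresso": 3.00, "latte": 4.50, "croissant": 2.75}  # placeholder prices

    def __init__(self):
        self.items = []

    def add(self, name):
        if name not in self.MENU:
            raise KeyError(f"not on the menu: {name}")
        self.items.append(name)
        return self.total()  # total updates on each addition

    def total(self):
        return sum(self.MENU[i] for i in self.items)

    def checkout(self):
        # Stand-in for the confirmation alert the generated site showed
        return f"Order confirmed: {len(self.items)} items, ${self.total():.2f}"

cart = Cart()
cart.add("latte")
cart.add("croissant")
print(cart.checkout())  # Order confirmed: 2 items, $7.25
```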
The jump from E2B to E4B produces a meaningful difference in code generation reliability. The E2B model's failure to produce any working JavaScript makes it unsuitable for complex multi-file generation tasks.
On-device tests: iPhone via AI Edge Gallery
Google's AI Edge Gallery app runs Gemma 4 on-device using the LiteRT-LM inference framework, optimized for mobile GPUs.
Logical reasoning
Asked "The car wash is 50 meters away. Should I walk or drive?", the model produced a detailed multi-point analysis covering energy, convenience, and weather before recommending driving based on the assumption the user might be tired. The reasoning was internally coherent but the conclusion was incorrect for such a short distance. A partial pass.
Image recognition
Asked to identify the breed of a Corgi from a photo, the model correctly identified it as a dog but guessed Border Collie rather than Corgi. A common class of error for vision models where category-level recognition is strong but breed-level specificity is weaker.
OCR and translation
Given an image containing Latvian text, the model correctly identified the language, transcribed the text with high accuracy, and provided a correct English translation. Minor grammatical oddities appeared in the transcription but the overall result was strong.
Multilingual chat and knowledge cutoff
The model understood and responded to Latvian questions, though with unnatural grammar. When asked for its knowledge cutoff date it reported January 2025.
Final thoughts
The E4B model is the more capable choice for coding tasks. The E2B model falls short on complex multi-file generation but may be adequate for simpler prompts where its smaller footprint and faster inference are priorities.
The multimodal and multilingual capabilities are the clearest strengths: native OCR, language identification, and translation across more than 140 languages, in a model that runs on a phone without network access, add up to a meaningful capability. Breed-level identification errors in the vision stack remain a current limitation.
For developers building applications, the Apache 2.0 license may be the most practically significant aspect of the release: it removes the commercial barriers attached to many other small open-weight models.
Model weights are available on Hugging Face and through Google's AI Edge Gallery.