
IBM Granite models: From architecture to browser-based AI

Stanley Ulili
Updated on November 10, 2025

The AI landscape has long been dominated by massive, resource-intensive models locked away in powerful data centers. While these colossal models are impressive, a more practical revolution is taking place: the rise of small, efficient language models that can run on everyday hardware. These compact models represent a shift away from cloud dependency, making AI more accessible, private, and practical for a wider range of applications.

IBM's Granite 4.0 series exemplifies this trend. Rather than simply shrinking existing architectures, IBM engineered these models from the ground up for efficiency and performance. This makes them particularly well-suited for enterprise applications and local deployment scenarios where privacy, latency, and cost matter.

Why smaller models matter

The relentless pursuit of scale in AI development has produced incredible breakthroughs, but it has also created significant challenges. Giant models are astronomically expensive to train and operate, consume massive amounts of energy, and often prove too slow for real-time applications. Their reliance on centralized cloud infrastructure raises valid concerns about data privacy and accessibility.

The engineering focus is shifting from raw size to efficiency, creating models that are smarter rather than just bigger. This movement aims to democratize AI by enabling it to run on edge devices, personal computers, and even web browsers without requiring a constant internet connection. The benefits are substantial:

  • Enhanced privacy - When models run locally, data never leaves your device
  • Reduced costs - Eliminating expensive API calls and cloud GPU instances drastically lowers operational expenses
  • Lower latency - Local processing removes network lag, enabling instantaneous responses for applications like code completion
  • Broader accessibility - Anyone with a modern computer can experiment with these models

Companies like Microsoft with BitNet and NVIDIA with Nemotron have made significant strides in this area. IBM's Granite 4.0 family represents another major advancement in efficient AI.

The Granite 4.0 architecture

IBM's Granite 4.0 series combines intelligent architectural design with practical engineering to achieve both high performance and efficiency. The models aren't simply scaled-down versions of larger architectures; they incorporate novel approaches to handling context and computation.

Hybrid transformer and Mamba layers

Most modern language models rely exclusively on the transformer architecture, which excels at understanding complex relationships between words through self-attention mechanisms. However, the computational cost of this attention grows quadratically with input length, making it expensive and memory-intensive for long documents or conversations.

A diagram illustrating the hybrid architecture of the Granite model, showing the alternation between Mamba-2 blocks and Attention blocks.

Granite addresses this limitation by interleaving traditional transformer layers with Mamba layers. Mamba, based on State Space Models (SSMs), processes information sequentially and handles very long contexts with linear rather than quadratic complexity. This hybrid approach provides distinct advantages:

  • Transformer layers deliver deep contextual understanding and reasoning capabilities
  • Mamba layers offer remarkable efficiency in processing long sequences, reducing memory usage and speeding up inference

This combination allows Granite models to manage massive context windows spanning hundreds of thousands of tokens while remaining lightweight and fast. They can process entire documents, lengthy codebases, or extensive chat histories without losing track of earlier information, a common limitation of smaller, purely transformer-based models.

Model family variants

Granite 4.0 isn't a single model but a family of variants, each tailored for different use cases and hardware constraints. This allows developers to choose the optimal balance between performance and resource consumption.

A table listing the various Granite 4.0 models, their architecture type (Hybrid, Dense), model size, and intended use cases.

The family includes several key models:

  • Granite-4.0-Small (32B total parameters, 9B active) - Designed for cost-effective performance on enterprise tasks like multi-tool agents and customer support automation. Uses a Mixture of Experts (MoE) architecture where only a fraction of parameters are activated for any given input
  • Granite-4.0-Tiny (7B total parameters, 1B active) - A hybrid MoE model optimized for low-latency and edge applications where speed is critical
  • Granite-4.0-Micro (3B) - A dense hybrid model serving as a powerful building block for workflows like function calling and local applications
  • Granite-4.0-Nano (350M & 1B) - The smallest models in the family, ideal for on-device deployment and latency-sensitive use cases where computational resources are limited

Enterprise-grade security and trust

Beyond performance, IBM has emphasized making Granite models trustworthy and secure, addressing critical requirements for enterprise adoption.

The Granite family became the first open model family to be certified under ISO/IEC 42001:2023, attesting that its development aligns with internationally recognized best practices for responsible AI management. The models were trained on carefully curated, ethically acquired, and enterprise-cleared data to ensure trustworthiness.

All Granite model checkpoints are cryptographically signed before release to prevent tampering and verify authenticity. This focus on security and responsible development makes Granite particularly attractive for applications in regulated industries like finance, healthcare, and government.

Running Granite in the browser

IBM provides a "Granite-4.0 Tool Studio" demo on Hugging Face that runs entirely in the browser using WebGPU and Transformers.js. This demonstrates the practical capabilities of the Nano models for client-side AI applications.

The initial interface of the Granite-4.0 Tool Studio, showing various example prompts for the user to try.

Tool calling functionality

One of the most powerful features demonstrated is tool calling, which allows the AI model to understand user intent and trigger predefined JavaScript functions to perform actions or retrieve real-time information.

The "TOOL WORKSPACE" panel in the demo, displaying the JavaScript implementation and JSON schema for tools like `speak`, `get_location`, and `get_time`.

The demo's "Show Tools" panel displays available functions, each with a JavaScript implementation and a JSON schema describing what the function does and what parameters it accepts. The model uses these schemas to decide which tool to call.

When you ask "what time is it?", the model recognizes the intent, identifies the get_time tool as the appropriate function, and executes it. The function returns the current date and time in a nicely formatted component. This ability to interact with external functions transforms the AI model from a static text generator into a dynamic agent.

Data formatting capabilities

The model can also understand and structure data effectively. When you paste a large block of unstructured CSV-like customer data and request "please format this csv data to json," the Granite Nano model begins streaming a perfectly structured JSON array almost instantly.

The Granite-4.0 Tool Studio processing a large block of text data and correctly formatting it into a structured JSON output.

Each line of input data converts into a distinct JSON object with correctly identified keys (Index, CustomerID, FirstName, LastName, etc.). This demonstrates the model's powerful reasoning and formatting capabilities at a very small scale, running entirely client-side. Tasks that typically require server-side scripts or paid API calls can now be handled for free, offline, in the browser.
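As a rough illustration (with invented values), a few rows of input like:

Index,CustomerID,FirstName,LastName
1,C-1001,Ada,Lovelace
2,C-1002,Alan,Turing

would be streamed back as something like:

[
  { "Index": "1", "CustomerID": "C-1001", "FirstName": "Ada", "LastName": "Lovelace" },
  { "Index": "2", "CustomerID": "C-1002", "FirstName": "Alan", "LastName": "Turing" }
]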

Building an offline AI code assistant

An AI code assistant that provides real-time code suggestions is a practical way to put local AI to work in development tools. This type of application can run completely offline after the initial model download, suggesting the next lines of code as you type.

The basic concept involves a text area where users type JavaScript code. After a pause in typing, a local Granite AI model suggests the next few lines of code. Users can accept suggestions by pressing Tab.

The complete source code for this project is available on GitHub at github.com/andrisgauracs/AI-Code-Assistant. You can clone the repository to explore the full implementation, or watch the video below for a quick overview of how the application works.

HTML structure

The application requires a simple HTML structure with a loading indicator and text area:

index.html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>AI Code Assistant</title>
    <link rel="stylesheet" href="styles.css" />
</head>
<body>
    <div class="container">
        <h1>AI Code Assistant</h1>

        <div id="loading-indicator" style="display: none;">
            <div class="spinner"></div>
            <p>Loading model...</p>
        </div>

        <p>
            Start typing JavaScript code below. The AI will suggest completions after you stop typing for 1 second. Press <b>Tab</b> to accept suggestions or <b>Esc</b> to dismiss them.
        </p>

        <div id="editor">
            <textarea id="long-text-input" placeholder="Start typing JavaScript code..." autocomplete="off" spellcheck="false"></textarea>
        </div>
    </div>

    <script type="module" src="main.js"></script>
</body>
</html>

The textarea element with ID long-text-input is where users type and where the AI provides suggestions.

Model initialization with Transformers.js

The core logic imports Transformers.js components and loads the Granite model:

main.js
// Import necessary modules from Transformers.js
import { AutoTokenizer, AutoModelForCausalLM, env } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3';

// Import helper functions for the UI
import { createSuggestionElement, showSuggestion, hideSuggestion } from './src/suggestionUI.js';

// Define the model ID for the Granite 1B parameter model
const MODEL_ID = "onnx-community/granite-4.0-1B-onnx-web";

// Configure environment for local execution
env.backends.onnx.wasm.numThreads = navigator.hardwareConcurrency ?? 4;
env.useBrowserCache = true; // Cache the model in the browser
env.allowRemoteModels = true; // Allow downloading from Hugging Face Hub

// Initialize global variables
let model = null;
let tokenizer = null;
let typingTimeout = null;
let isGenerating = false;

// Initialize the model and tokenizer
async function loadModel() {
    try {
        console.log("Loading tokenizer...");
        tokenizer = await AutoTokenizer.from_pretrained(MODEL_ID);

        console.log("Loading model...");
        model = await AutoModelForCausalLM.from_pretrained(MODEL_ID, {
            dtype: 'q4', // Use 4-bit quantization for efficiency
            device: 'webgpu', // Use WebGPU for hardware acceleration
        });

        console.log("Model loaded successfully.");
        return true;
    } catch (error) {
        console.error("Error loading model:", error);
        return false;
    }
}

The AutoTokenizer and AutoModelForCausalLM classes automatically handle loading the correct configurations for a given model ID. Setting env.useBrowserCache to true is essential for offline functionality. The first time the app runs, it downloads the model from Hugging Face and stores it in the browser's cache. Subsequent visits load the model directly from cache, even without an internet connection.

The dtype: 'q4' option enables quantization, a technique that reduces the precision of the model's weights from 32-bit floats to 4-bit integers. This dramatically shrinks the model's size and speeds up computation with minimal impact on output quality, making it feasible to run in a browser. The device: 'webgpu' option tells Transformers.js to use the GPU for faster processing when available.
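Not every browser exposes WebGPU, so it is worth guarding the device choice. The snippet below is a small sketch, not part of the original project: it checks for WebGPU support and falls back to the WASM backend when the API is missing.

// Sketch only: prefer WebGPU when the browser exposes it, otherwise fall back to WASM.
const device = navigator.gpu ? 'webgpu' : 'wasm';

model = await AutoModelForCausalLM.from_pretrained(MODEL_ID, {
    dtype: 'q4',  // 4-bit quantized weights
    device,       // 'webgpu' or 'wasm' depending on browser support
});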

Generating code suggestions

The code generation function takes the user's current code as input, feeds it to the model, and returns a suggested completion:

A screenshot of the detailed prompt string in the VS Code editor, instructing the AI on how to behave as a code completion assistant.

main.js
// Function to generate the code suggestion
async function generateCodeSuggestion(input) {
    if (isGenerating || !model || !tokenizer) return null;
    isGenerating = true;

    try {
        // Create a detailed prompt for the model
        const messages = [{
            role: "user",
            content: `You are a code completion AI assistant.
INPUT: \`\`\`javascript\n<code>${input}</code>\`\`\`
Complete this JavaScript code with the most likely next lines of code in the <code> section.
Pay attention to closing tags, opening tags, indentation, syntax and code quality.
IMPORTANT: Only output the next lines of code to replace the <code> section.
IMPORTANT: Do not output the INPUT part. Only the <code> section.
IMPORTANT: Do not include comments in the code.`
        }];

        // Apply chat template to format the input properly
        const chatInput = tokenizer.apply_chat_template(messages, {
            add_generation_prompt: true,
            return_dict: true,
        });

        // Generate a response using the model
        const { input_ids } = chatInput;
        const sequences = await model.generate({
            ...chatInput,
            max_new_tokens: 128,
            do_sample: true,
            temperature: 1,
            return_dict_in_generate: true,
        });

        // Decode the generated text
        const response = tokenizer.batch_decode(
            sequences.sequences.slice(null, [input_ids.dims[1], null]),
            { skip_special_tokens: true }
        )[0];

        // Helper functions to clean the output
        const rawSuggestion = extractCodeFromResponse(response);
        return cleanSuggestion(rawSuggestion, input);

    } catch (error) {
        console.error("Error generating suggestion:", error);
        return null;
    } finally {
        isGenerating = false;
    }
}

Prompt engineering is critical here. The messages array contains a carefully crafted prompt that gives the model a role ("You are a code completion AI assistant") and specific instructions on how to format its output. This instruction block, sent here as part of the user message, is essential for getting reliable and clean code completions.

The model.generate() function runs the AI inference. Parameters like max_new_tokens limit the length of the completion, while temperature controls output creativity. A higher temperature (like 1.0) allows for more varied suggestions.

The model outputs a sequence of token IDs. The tokenizer.batch_decode() method converts these IDs back into human-readable text. Custom helper functions (extractCodeFromResponse and cleanSuggestion) then parse this text, remove unwanted conversational parts, and ensure it doesn't simply repeat the input.
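Those helpers are not reproduced in the article, but a minimal sketch of what they might look like is shown below. The actual implementations in the repository may differ.

// Sketch only: the repository's real helpers may differ.
// Pull the code out of a fenced ```javascript ... ``` block if the model wrapped it in one.
function extractCodeFromResponse(response) {
    const match = response.match(/```(?:javascript|js)?\n?([\s\S]*?)```/);
    return (match ? match[1] : response).trim();
}

// Drop any part of the suggestion that simply repeats the user's input.
function cleanSuggestion(suggestion, input) {
    if (!suggestion) return null;
    let cleaned = suggestion;
    if (cleaned.startsWith(input)) {
        cleaned = cleaned.slice(input.length);
    }
    cleaned = cleaned.replace(/^\s*\n/, '').trimEnd();
    return cleaned.length > 0 ? cleaned : null;
}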

Handling user input

Event listeners tie everything together, detecting when the user has stopped typing and triggering the suggestion function:

The AI Code Assistant app displaying a code suggestion for a JavaScript function.

main.js
// Main application initialization
async function initializeApp() {
    const longTextInput = document.getElementById('long-text-input');
    const loadingIndicator = document.getElementById('loading-indicator');
    const suggestionElement = createSuggestionElement();

    // Show loading indicator and load the model
    loadingIndicator.style.display = 'flex';
    const success = await loadModel();
    loadingIndicator.style.display = 'none';

    if (!success) {
        longTextInput.placeholder = "Failed to load AI model.";
        return;
    }

    let currentSuggestion = null;

    // Handle typing in the textarea with throttling
    longTextInput.addEventListener('input', () => {
        if (typingTimeout) clearTimeout(typingTimeout);
        hideSuggestion(suggestionElement);
        currentSuggestion = null;

        typingTimeout = setTimeout(async () => {
            const input = longTextInput.value.trim();
            if (input.length > 0) {
                const suggestion = await generateCodeSuggestion(input);
                if (suggestion && suggestion !== input) {
                    currentSuggestion = suggestion;
                    showSuggestion(suggestion, longTextInput, suggestionElement);
                }
            }
        }, 1000); // Wait for 1 second of inactivity
    });

    // Handle Tab key to accept suggestion
    longTextInput.addEventListener('keydown', (e) => {
        if (e.key === 'Tab' && currentSuggestion) {
            e.preventDefault(); // Prevent default tab behavior
            longTextInput.value += currentSuggestion;
            hideSuggestion(suggestionElement);
            currentSuggestion = null;
        } else if (e.key === 'Escape') {
            hideSuggestion(suggestionElement);
            currentSuggestion = null;
        }
    });
}

// Start the application
initializeApp();

Throttling with setTimeout prevents running the AI model on every keystroke, which would be inefficient. Instead, a 1-second delay ensures generateCodeSuggestion() only runs after the user has paused typing for a full second. If the user types again within that second, the timeout is cleared and reset.

An input event listener detects typing, while a keydown listener captures Tab and Escape keys. When Tab is pressed with an active suggestion, the suggested code is appended to the text area's current value.
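The UI helpers imported from ./src/suggestionUI.js are not shown above either. A minimal sketch, assuming a simple overlay element positioned under the textarea, could look like the following; the repository's version may be more elaborate.

// src/suggestionUI.js (sketch only; the actual repository implementation may differ)

// Create a single overlay element used to display the ghost-text suggestion.
export function createSuggestionElement() {
    const el = document.createElement('div');
    el.className = 'suggestion-overlay';
    el.style.display = 'none';
    document.body.appendChild(el);
    return el;
}

// Show the suggestion text just below the textarea.
export function showSuggestion(suggestion, textarea, el) {
    const rect = textarea.getBoundingClientRect();
    el.textContent = suggestion;
    el.style.position = 'absolute';
    el.style.left = `${rect.left + window.scrollX}px`;
    el.style.top = `${rect.bottom + window.scrollY}px`;
    el.style.display = 'block';
}

// Hide the suggestion overlay.
export function hideSuggestion(el) {
    el.style.display = 'none';
}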

The future of local AI

The IBM Granite 4.0 model family demonstrates that massive, cloud-based models aren't always necessary for building useful and intelligent applications. By combining innovative architectures like Mamba and transformer layers, employing techniques like quantization and Mixture of Experts, and prioritizing enterprise-grade security, IBM has delivered tools that are both powerful and accessible.

Running these models locally opens up significant possibilities. Applications can be faster, more private, and more cost-effective than ever before. This represents a meaningful step toward democratizing AI, moving it from a centralized resource to a personal tool that can enhance productivity and creativity in everyday tasks. The era of small, efficient, and local AI is here, and models like Granite are leading the charge.


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.