Running a Local LLM on a Raspberry Pi 1: Cross-Compilation, Quantization, and ARMv6 Constraints

Stanley Ulili

Updated on May 17, 2026

Model selection: Falcon H1-tiny
Quantization: Q4_K_S is the target format
Cross-compilation with dockcross
Setup walkthrough
Running inference
Summary

The original Raspberry Pi (700MHz single-core ARMv6, 512MB RAM) is undersized for most AI workloads by several orders of magnitude. This walkthrough documents the specific combination of model, quantization format, build flags, and tooling that makes local LLM inference possible on this hardware, including the ARMv6-specific constraints that rule out most modern approaches.

Model selection: Falcon H1-tiny

Most "small" open-source models (Llama 3 8B, Mistral 7B) require multiple gigabytes of RAM. The search for something that fits in 512MB leads to Falcon H1-tiny from the Technology Innovation Institute.

Falcon H1-tiny has 90 million parameters. For comparison, a 7B model is roughly 78 times larger. It uses a Hybrid Transformers + Mamba architecture: standard attention layers for contextual understanding and State Space Model (Mamba) layers for efficient long-sequence processing. This combination allows the model to maintain usable linguistic capability at unusually small scale.

Quantization: Q4_K_S is the target format

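At full 16-bit float precision, a 90M-parameter model needs about 180MB for its weights alone (90 million × 2 bytes), a large share of the Pi's 512MB once the operating system and inference buffers are accounted for. Quantization reduces weight precision from 16-bit floats to lower-precision integers, shrinking memory usage roughly in proportion to the bit width.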

Table of GGUF quantization formats showing Legacy formats, K-quants, and I-quants

The Raspberry Pi 1's ARMv6 processor rules out modern I-quant formats (importance quantization), which rely on bit-manipulation instructions that ARMv6 does not implement. That leaves the legacy formats and the K-quants.

Q4_K_S is the right choice here: a 4-bit K-quant that uses a two-level affine scheme (per-block scales and minimums that are themselves quantized inside a larger super-block) and runs on ARMv6. It provides the best quality-to-size ratio for this hardware.

2-bit quantization (Q2_K) fits in memory but degrades the model's outputs to gibberish. 8-bit (Q8_0) produces better quality but requires more RAM and runs at the same speed.
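
The memory trade-off between these formats can be estimated from the parameter count and the bits stored per weight. The sketch below is a rough calculation, not a measurement. It assumes the nominal block-level bits-per-weight figures of the llama.cpp formats (16 for F16, 8.5 for Q8_0, 4.5 for Q4_K, 2.625 for Q2_K); the function name weights_mb is made up for illustration, and real GGUF files will differ somewhat because they mix tensor types and carry metadata.

```shell
# weights_mb PARAMS BITS_PER_WEIGHT
# Prints the approximate weight storage in MB (10^6 bytes).
weights_mb() {
  LC_ALL=C awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 / 1000000 }'
}

params=90000000   # 90M parameters, as stated for Falcon H1-tiny

# name:bits-per-weight pairs (nominal values, see the assumptions above)
for entry in F16:16 Q8_0:8.5 Q4_K:4.5 Q2_K:2.625; do
  name=${entry%%:*}   # text before the colon
  bpw=${entry#*:}     # text after the colon
  printf '%-5s %6s MB\n' "$name" "$(weights_mb "$params" "$bpw")"
done
```

Under these assumptions the weights come to about 180 MB at F16, 96 MB at Q8_0, 51 MB at Q4_K, and 30 MB at Q2_K, before counting the runtime's own buffers.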

Cross-compilation with dockcross

Why not compile on the Pi

Compiling llama.cpp directly on the Pi would take approximately 18 hours and very likely exhaust the 512MB RAM before completing.

Terminal on the Raspberry Pi showing the cmake build process starting slowly with an 18-hour estimate

The standard solution is cross-compilation: building the ARMv6 binary on a faster machine using a toolchain that targets ARMv6.

dockcross provides Docker containers pre-configured with cross-compilation toolchains for various architectures. The linux-armv6 image is used here.

ARMv6 constraints in the build flags

The Raspberry Pi 1 lacks NEON instructions (the ARM SIMD extension that accelerates matrix multiplication). This is critical: virtually all modern AI libraries assume NEON is available.

Specification comparison table for Raspberry Pi models with ARMv6 instruction set circled for Pi 1

The cmake flags disable all incompatible optimizations: -DGGML_NATIVE=OFF, -DGGML_NEON=OFF, and -DGGML_OPENMP=OFF. Without these, the binary either crashes or produces incorrect results on ARMv6.
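
Whether a particular board really lacks NEON can be confirmed from the Features line of /proc/cpuinfo before choosing build flags. The following is a minimal sketch; has_neon is a hypothetical helper name, and it assumes the kernel reports SIMD support as "neon" (32-bit ARM) or "asimd" (64-bit ARM).

```shell
# has_neon [CPUINFO_FILE]
# Succeeds (exit 0) if the Features line lists neon or asimd.
# CPUINFO_FILE defaults to /proc/cpuinfo; the argument exists so the
# check can also be run against a saved copy of the file.
has_neon() {
  cpuinfo="${1:-/proc/cpuinfo}"
  grep -i '^Features' "$cpuinfo" | grep -qiw -e neon -e asimd
}

# Usage on the Pi:
#   if has_neon; then echo "NEON available"; else echo "no NEON: keep SIMD paths disabled"; fi
```

On a Raspberry Pi 1 the check is expected to fail, which is the case the build flags above are written for.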

Setup walkthrough

Prepare the Pi

Flash Raspberry Pi OS (Legacy, 32-bit) Lite using Raspberry Pi Imager. The Lite variant omits the desktop environment, preserving RAM. In the imager's advanced settings, enable SSH and configure Wi-Fi credentials before writing the card. This avoids needing the Pi's local terminal.

Raspberry Pi Imager application window ready for OS selection

Cross-compile llama.cpp

On the development machine, run the build from the repository root. The dockcross wrapper script mounts the current directory into the container, so cmake is pointed at the build directory with -B instead of being run from inside it:

git clone https://github.com/ggerganov/llama.cpp

cd llama.cpp

docker run --rm dockcross/linux-armv6 > ./dockcross

chmod +x ./dockcross

./dockcross cmake -S . -B build-pi -DBUILD_SHARED_LIBS=OFF -DGGML_NATIVE=OFF -DGGML_NEON=OFF -DGGML_OPENMP=OFF -DLLAMA_BUILD_EXAMPLES=ON

./dockcross cmake --build build-pi

cd build-pi

-DBUILD_SHARED_LIBS=OFF links the llama.cpp and ggml libraries statically into the executable, so no project .so files need to be copied alongside it. The completed binary is at bin/llama-completion inside the build-pi directory.
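
Before copying the binary to the Pi, it can be worth checking that it really targets ARMv6, since a binary built for a newer ARM architecture typically dies with "Illegal instruction" on this board. The sketch below inspects the ARM build attributes printed by readelf -A. The helper name is_armv6_build is hypothetical, and the check assumes GNU binutils' attribute names (Tag_CPU_arch, Tag_Advanced_SIMD_arch).

```shell
# is_armv6_build: reads `readelf -A` output on stdin.
# Succeeds only if the CPU architecture tag starts with v6 and
# no Advanced SIMD (NEON) attribute is recorded.
is_armv6_build() {
  attrs=$(cat)
  printf '%s\n' "$attrs" | grep -q 'Tag_CPU_arch: v6' || return 1
  if printf '%s\n' "$attrs" | grep -q 'Tag_Advanced_SIMD_arch'; then
    return 1
  fi
  return 0
}

# Usage on the development machine (requires binutils):
#   readelf -A bin/llama-completion | is_armv6_build && echo "looks like an ARMv6 build"
```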

Transfer files to the Pi

ssh your_username@<pi_ip>

On the Pi, create directories:

mkdir -p ~/llama-bin ~/models

From the development machine, copy the binary and model files:

scp bin/llama-completion your_username@<pi_ip>:~/llama-bin/

scp /path/to/Falcon-H1-Tiny-90M-Instruct-Q4_K_S.gguf your_username@<pi_ip>:~/models/

Download GGUF files for Q2_K, Q4_K_S, and Q8_0 from the Falcon H1-tiny model page on Hugging Face before transferring.
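
A truncated or corrupted model file tends to fail only at load time, which is slow to discover on this hardware, so comparing checksums after the copy is a cheap safeguard. The following is a sketch rather than part of the original procedure: sha256_of is a hypothetical helper, and it assumes sha256sum (or shasum as a fallback) is installed on the development machine and sha256sum on the Pi.

```shell
# sha256_of FILE: prints only the hex SHA-256 digest of FILE.
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{ print $1 }'
  else
    shasum -a 256 "$1" | awk '{ print $1 }'
  fi
}

# Usage (your_username and <pi_ip> are the same placeholders as above):
#   local_sum=$(sha256_of /path/to/Falcon-H1-Tiny-90M-Instruct-Q4_K_S.gguf)
#   remote_sum=$(ssh your_username@<pi_ip> "sha256sum ~/models/Falcon-H1-Tiny-90M-Instruct-Q4_K_S.gguf" | awk '{ print $1 }')
#   [ "$local_sum" = "$remote_sum" ] && echo "transfer verified"
```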

Running inference

The --no-mmap flag is required on 32-bit systems with limited RAM. Memory mapping on a 32-bit address space with 512MB physical RAM fails unpredictably. This flag forces the model to load into heap memory, which is slower but reliable.
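
Because --no-mmap loads the whole file into heap memory, a quick pre-flight check is to compare the model's size against the memory the kernel reports as available. This is a minimal sketch under stated assumptions: fits_in_ram is a hypothetical helper, the 64MB default headroom for the runtime's own buffers is a guess rather than a measured value, and the kernel must expose MemAvailable in /proc/meminfo.

```shell
# fits_in_ram MODEL_FILE [MEMINFO_FILE] [HEADROOM_KB]
# Succeeds if the model size plus headroom fits within MemAvailable.
fits_in_ram() {
  model="$1"
  meminfo="${2:-/proc/meminfo}"
  headroom_kb="${3:-65536}"   # assumed headroom (64MB), not a measured figure

  # File size in KB, rounded up
  model_kb=$(( ( $(wc -c < "$model") + 1023 ) / 1024 ))
  # MemAvailable is reported in kB
  avail_kb=$(awk '/^MemAvailable:/ { print $2 }' "$meminfo")

  [ -n "$avail_kb" ] && [ $(( model_kb + headroom_kb )) -le "$avail_kb" ]
}

# Usage on the Pi:
#   fits_in_ram ~/models/Falcon-H1-Tiny-90M-Instruct-Q4_K_S.gguf || echo "model may not fit in RAM"
```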

2-bit model (Q2_K): incoherent output

~/llama-bin/llama-completion \
  -m ~/models/Falcon-H1-Tiny-90M-Instruct-Q2_K.gguf \
  -p "Hello! How are you?" \
  -n 32 \
  --threads 1 \
  --ctx-size 128 \
  --no-mmap

Speed: approximately 0.35 tokens per second. Output: gibberish. The 2-bit compression has degraded the model's language representation past the point of usefulness.
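
To put that rate in wall-clock terms, the generation time is simply the token count divided by the tokens-per-second figure. The helper below is illustrative only (gen_seconds is a made-up name), and it ignores model loading and prompt processing, which add to the total.

```shell
# gen_seconds TOKENS TOKENS_PER_SECOND
# Prints the estimated generation time in whole seconds.
gen_seconds() {
  LC_ALL=C awk -v n="$1" -v r="$2" 'BEGIN { printf "%.0f\n", n / r }'
}

gen_seconds 32 0.35    # the -n 32 limit used in the commands here
gen_seconds 128 0.35   # a hypothetical longer reply
```

At 0.35 tokens per second, the 32-token reply takes roughly a minute and a half to generate.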

4-bit model (Q4_K_S): coherent output

~/llama-bin/llama-completion \
  -m ~/models/Falcon-H1-Tiny-90M-Instruct-Q4_K_S.gguf \
  -p "Hello! How are you?" \
  -n 32 \
  --threads 1 \
  --ctx-size 128 \
  --no-mmap

Terminal on a monitor with the Raspberry Pi in the foreground successfully generating a coherent response with the 4-bit model

Output: "Hello! I'm just a digital assistant, so I don't have feelings, but I'm here to help..." Coherent, contextually appropriate, and grammatically correct. This is the working configuration.

8-bit model (Q8_0): higher quality, mixed accuracy

At Q8_0, factual accuracy improves for well-known topics. When asked for the capital of Belgium it responds correctly ("Brussels"). When asked for the capital of Albania it responds incorrectly ("Kotor," which is a city in Montenegro; the correct answer is Tirana). The 90M parameter model retains frequently seen facts and hallucinates on less common ones.
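
The three runs differ only in the model file, so they can be wrapped in a small helper for side-by-side comparison. This is a sketch for bash, not the author's script: model_path and run_quant are hypothetical names, the Q8_0 file name is assumed to follow the same pattern as the Q2_K and Q4_K_S files, and the flags are the ones used in the commands above.

```shell
MODEL_DIR="${MODEL_DIR:-$HOME/models}"
LLAMA_BIN="${LLAMA_BIN:-$HOME/llama-bin/llama-completion}"

# model_path QUANT: prints the expected GGUF path for a quantization name.
model_path() {
  printf '%s/Falcon-H1-Tiny-90M-Instruct-%s.gguf\n' "$MODEL_DIR" "$1"
}

# run_quant QUANT PROMPT: runs one short completion with the settings used above.
run_quant() {
  "$LLAMA_BIN" \
    -m "$(model_path "$1")" \
    -p "$2" \
    -n 32 \
    --threads 1 \
    --ctx-size 128 \
    --no-mmap
}

# Usage on the Pi (bash, so that `time` can wrap a shell function):
#   for q in Q2_K Q4_K_S Q8_0; do
#     echo "== $q =="
#     time run_quant "$q" "What is the capital of Belgium?"
#   done
```

Keeping the prompt and flags fixed makes the quantization level the only variable when comparing output quality and timing.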

Summary

The working configuration is: Raspberry Pi 1 Model B, Raspberry Pi OS Legacy 32-bit Lite, llama.cpp cross-compiled via dockcross with NEON and native optimizations disabled, Falcon H1-tiny at Q4_K_S quantization, and the --no-mmap flag. The result is coherent output at approximately 0.35 tokens per second.

The constraints that shaped every decision: 512MB RAM limits model size and rules out memory mapping; ARMv6 lacks NEON and rules out I-quant formats and modern compiler optimizations; the single-core 700MHz CPU makes on-device compilation impractical.

This is not a practical deployment. The speed and reliability of the model's knowledge base are both inadequate for real applications. What it demonstrates is that the lower bound of useful LLM inference has dropped far enough that hardware from 2012 can produce coherent language output, which is a meaningful data point about how far model compression techniques have progressed.