# Running a Local LLM on a Raspberry Pi 1: Cross-Compilation, Quantization, and ARMv6 Constraints

The original Raspberry Pi (700MHz single-core ARMv6, 512MB RAM) is undersized for most AI workloads by several orders of magnitude. **This walkthrough documents the specific combination of model, quantization format, build flags, and tooling that makes local LLM inference possible on this hardware**, including the ARMv6-specific constraints that rule out most modern approaches.


<iframe width="100%" height="315" src="https://www.youtube.com/embed/GtTzO5ZOQr4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>



## Model selection: Falcon H1-tiny

Most "small" open-source models (Llama 3 8B, Mistral 7B) require multiple gigabytes of RAM. The search for something that fits in 512MB leads to Falcon H1-tiny from the Technology Innovation Institute.

![Falcon H1-tiny model card on Hugging Face showing the model's name and logo](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/d0c6ad59-9f17-48a7-a175-d73e35612b00/lg2x =1280x720)

Falcon H1-tiny has 90 million parameters. For comparison, a 7B model is roughly 78 times larger. It uses a **Hybrid Transformers + Mamba architecture**: standard attention layers for contextual understanding and State Space Model (Mamba) layers for efficient long-sequence processing. This combination allows the model to maintain usable linguistic capability at unusually small scale.

## Quantization: Q4_K_S is the target format

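At full 16-bit float precision, the weights of a 90M-parameter model occupy roughly 180MB (90 million parameters at 2 bytes each), a large share of the Pi's 512MB once the operating system and inference buffers are counted. Quantization reduces weight precision from 16-bit floats to lower-precision integers, shrinking the weights roughly in proportion to the bit width.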
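To put the memory budget in numbers, weight storage can be estimated as parameters × bits per weight ÷ 8. The sketch below uses nominal bit widths only and ignores per-block metadata, so real GGUF files will be somewhat larger. The function name `weights_mb` is made up for illustration.

```shell
#!/bin/sh
# Rough weight-storage estimate: parameters * bits_per_weight / 8, in decimal MB.
# Nominal bit widths only; per-block scales and mixed-precision tensors are ignored.
weights_mb() {
    params="$1"
    bits="$2"
    echo $(( params * bits / 8 / 1000000 ))
}

# Falcon H1-tiny has 90 million parameters.
for bits in 16 8 4 2; do
    echo "${bits}-bit: $(weights_mb 90000000 "$bits") MB"
done
```

For 90 million parameters this gives about 180 MB at 16 bits and about 45 MB at 4 bits, before runtime buffers and the operating system are counted.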

![Table of GGUF quantization formats showing Legacy formats, K-quants, and I-quants](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/a570ccb0-668c-464c-77d0-2e59536cbc00/public =1280x720)

The Raspberry Pi 1's ARMv6 processor rules out modern I-quant formats (importance quantization), which rely on bit-manipulation instructions that ARMv6 does not implement. That leaves the legacy formats and the K-quants.

**Q4_K_S** is the correct choice: a 4-bit K-quant that uses a two-level affine approach compatible with ARMv6. It provides the best quality-to-size ratio for this hardware.
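The "two-level" structure can be made concrete by counting bits. The sketch below assumes the Q4_K layout as implemented in llama.cpp (worth checking against the ggml source): weights are grouped into super-blocks of 256, split into 8 sub-blocks of 32; each sub-block carries a 6-bit scale and a 6-bit minimum, and each super-block carries two 16-bit floats that rescale those. The function name is illustrative.

```shell
#!/bin/sh
# Bit budget of one Q4_K super-block, under the layout assumed above.
q4k_superblock_bits() {
    weights=256                                  # weights per super-block
    sub_blocks=8                                 # sub-blocks of 32 weights each
    quant_bits=$(( weights * 4 ))                # 4-bit quantized values
    sub_meta_bits=$(( sub_blocks * (6 + 6) ))    # 6-bit scale + 6-bit minimum per sub-block
    super_meta_bits=$(( 2 * 16 ))                # two fp16 values per super-block
    echo $(( quant_bits + sub_meta_bits + super_meta_bits ))
}

bits=$(q4k_superblock_bits)
# awk handles the fractional division that POSIX shell arithmetic lacks.
awk -v b="$bits" 'BEGIN { printf "%d bits per super-block, %.3f bits per weight\n", b, b / 256 }'
```

Under these assumptions the format costs 4.5 bits per weight rather than a flat 4, which is the price of the extra scale and minimum metadata. A Q4_K_S file may also keep some tensors in higher-precision types, so its size will not match this figure exactly.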

2-bit quantization (Q2_K) fits in memory but degrades the model's outputs to gibberish. 8-bit (Q8_0) produces better quality but requires more RAM and runs at the same speed.

## Cross-compilation with dockcross

### Why not compile on the Pi

Compiling llama.cpp directly on the Pi would take approximately 18 hours and very likely exhaust the 512MB RAM before completing.

![Terminal on the Raspberry Pi showing the cmake build process starting slowly with an 18-hour estimate](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/d73f4a33-2dcc-4d39-dff2-45ad63f4d700/lg2x =1280x720)

The standard solution is cross-compilation: building the ARMv6 binary on a faster machine using a toolchain that targets ARMv6.

`dockcross` provides Docker containers pre-configured with cross-compilation toolchains for various architectures. The `linux-armv6` image is used here.

### ARMv6 constraints in the build flags

The Raspberry Pi 1 lacks NEON instructions (the ARM SIMD extension that accelerates matrix multiplication). This is critical: virtually all modern AI libraries assume NEON is available.

![Specification comparison table for Raspberry Pi models with ARMv6 instruction set circled for Pi 1](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/9b3e6a0d-eccd-4e20-a4b8-28de78923c00/orig =1280x720)

The cmake flags disable all incompatible optimizations: `-DGGML_NATIVE=OFF`, `-DGGML_NEON=OFF`, and `-DGGML_OPENMP=OFF`. Without these, the binary either crashes or produces incorrect results on ARMv6.
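Whether a given board has NEON can be confirmed on the device itself. The following is a minimal sketch, assuming a 32-bit ARM Linux kernel that lists CPU features on a `Features` line in `/proc/cpuinfo`; the helper name `has_neon` is hypothetical.

```shell
#!/bin/sh
# Report whether a cpuinfo file lists the "neon" CPU feature.
# Defaults to /proc/cpuinfo; a path argument allows checking a saved copy.
has_neon() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # 32-bit ARM kernels list SIMD support as the word "neon" on the Features line.
    grep -i '^Features' "$cpuinfo" | grep -qw 'neon'
}

if has_neon; then
    echo "NEON available"
else
    echo "No NEON: build with the scalar code paths"
fi
```

On a Raspberry Pi 1, `uname -m` reports `armv6l`, and this check should take the second branch.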

## Setup walkthrough

### Prepare the Pi

Flash **Raspberry Pi OS (Legacy, 32-bit) Lite** using Raspberry Pi Imager. The Lite variant omits the desktop environment, preserving RAM. In the imager's advanced settings, enable SSH and configure Wi-Fi credentials before writing the card. This avoids needing the Pi's local terminal.

![Raspberry Pi Imager application window ready for OS selection](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/24de2d59-670d-47fa-b53e-0cc7b03b0900/lg2x =1280x720)

### Cross-compile llama.cpp

On the development machine:

```command
git clone https://github.com/ggerganov/llama.cpp
```

```command
cd llama.cpp
```

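The `dockcross` wrapper mounts only the current directory into the container, so generate it and configure from the repository root rather than from inside the build directory:

```command
docker run --rm dockcross/linux-armv6 > ./dockcross
```

```command
chmod +x ./dockcross
```

```command
./dockcross cmake -S . -B build-pi -DBUILD_SHARED_LIBS=OFF -DGGML_NATIVE=OFF -DGGML_NEON=OFF -DGGML_OPENMP=OFF -DLLAMA_BUILD_EXAMPLES=ON
```

```command
./dockcross cmake --build build-pi
```

```command
cd build-pi
```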

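`-DBUILD_SHARED_LIBS=OFF` links llama.cpp's own libraries statically into the executable, so the binary transfers as a single file with no accompanying `.so` files. It typically still links dynamically against the system C and C++ runtime libraries on the Pi. The completed binary is at `bin/llama-completion`.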
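Before copying anything to the Pi, it is worth confirming that the build actually produced an ARM binary rather than one for the host. A minimal sketch, assuming the `file` utility is installed on the development machine; the helper name `is_arm32_elf` is illustrative.

```shell
#!/bin/sh
# Succeeds when a `file -b` description looks like a 32-bit ARM ELF executable.
is_arm32_elf() {
    case "$1" in
        *"ELF 32-bit"*ARM*) return 0 ;;
        *) return 1 ;;
    esac
}

binary="${1:-bin/llama-completion}"
if [ -f "$binary" ] && is_arm32_elf "$(file -b "$binary")"; then
    echo "$binary: 32-bit ARM ELF"
else
    echo "$binary: missing or not a 32-bit ARM ELF" >&2
fi
```

`file` does not distinguish ARMv6 from ARMv7. If a `readelf` that understands ARM is available, `readelf -A bin/llama-completion` prints build attributes, including a `Tag_CPU_arch` line, that can be used to check the target architecture.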

### Transfer files to the Pi

```command
ssh your_username@<pi_ip>
```

On the Pi, create directories:

```command
mkdir -p ~/llama-bin ~/models
```

From the development machine, copy the binary and model files:

```command
scp bin/llama-completion your_username@<pi_ip>:~/llama-bin/
```

```command
scp /path/to/Falcon-H1-Tiny-90M-Instruct-Q4_K_S.gguf your_username@<pi_ip>:~/models/
```

Download the GGUF files for `Q2_K`, `Q4_K_S`, and `Q8_0` from the Falcon H1-tiny model page on Hugging Face before transferring, and repeat the `scp` command for each file used in the tests below.

## Running inference

The `--no-mmap` flag is required on 32-bit systems with limited RAM. Memory mapping on a 32-bit address space with 512MB physical RAM fails unpredictably. This flag forces the model to load into heap memory, which is slower but reliable.
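Because `--no-mmap` loads the whole file into heap memory, a quick pre-flight check can compare the model's size against available RAM. This is a sketch only: the 64 MiB headroom is an arbitrary illustrative value, not a measured requirement, and the helper name `fits_in_ram` is hypothetical. It assumes a kernel that reports `MemAvailable` in `/proc/meminfo`.

```shell
#!/bin/sh
# Headroom reserved for inference buffers and the OS.
# 64 MiB is an arbitrary illustrative value, not a measured requirement.
HEADROOM_BYTES=$(( 64 * 1024 * 1024 ))

# fits_in_ram MODEL_BYTES AVAILABLE_KB -> exit status 0 if model plus headroom fits.
fits_in_ram() {
    model_bytes="$1"
    avail_bytes=$(( $2 * 1024 ))
    [ $(( model_bytes + HEADROOM_BYTES )) -le "$avail_bytes" ]
}

model="${1:-$HOME/models/Falcon-H1-Tiny-90M-Instruct-Q4_K_S.gguf}"
if [ -f "$model" ] && [ -r /proc/meminfo ]; then
    size=$(wc -c < "$model")
    avail=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)
    if fits_in_ram "$size" "$avail"; then
        echo "OK: model should fit with --no-mmap"
    else
        echo "Warning: model may not fit in available RAM" >&2
    fi
fi
```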

### 2-bit model (Q2_K): incoherent output

```command
~/llama-bin/llama-completion \
  -m ~/models/Falcon-H1-Tiny-90M-Instruct-Q2_K.gguf \
  -p "Hello! How are you?" \
  -n 32 \
  --threads 1 \
  --ctx-size 128 \
  --no-mmap
```

Speed: approximately 0.35 tokens per second. Output: gibberish. The 2-bit compression has degraded the model's language representation past the point of usefulness.
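To translate that rate into wall-clock time, divide the token count by the tokens-per-second figure. A small sketch using the numbers from the command above; the helper name `gen_seconds` is made up, and the estimate covers generation only, not model loading or prompt processing.

```shell
#!/bin/sh
# Seconds needed to generate TOKENS tokens at RATE tokens per second.
# Covers generation only; model load and prompt processing are extra.
gen_seconds() {
    awk -v n="$1" -v r="$2" 'BEGIN { printf "%.0f\n", n / r }'
}

echo "32 tokens at 0.35 tok/s: about $(gen_seconds 32 0.35) seconds"
```

At this rate the 32-token response takes roughly a minute and a half.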

### 4-bit model (Q4_K_S): coherent output

```command
~/llama-bin/llama-completion \
  -m ~/models/Falcon-H1-Tiny-90M-Instruct-Q4_K_S.gguf \
  -p "Hello! How are you?" \
  -n 32 \
  --threads 1 \
  --ctx-size 128 \
  --no-mmap
```

![Terminal on a monitor with the Raspberry Pi in the foreground successfully generating a coherent response with the 4-bit model](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/992f4708-4793-4880-0ed4-fe776bab9a00/public =1280x720)

Output: "Hello! I'm just a digital assistant, so I don't have feelings, but I'm here to help..." Coherent, contextually appropriate, and grammatically correct. This is the working configuration.

### 8-bit model (Q8_0): higher quality, mixed accuracy

At Q8_0, factual accuracy improves for well-known topics. When asked for the capital of Belgium it responds correctly ("Brussels"). When asked for the capital of Albania it responds incorrectly ("Kotor," which is a city in Montenegro; the correct answer is Tirana). The 90M parameter model retains frequently-seen facts and hallucinates on less common ones.
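The three runs differ only in the model file, so they can be wrapped in one small script for side-by-side comparison. This is a sketch under assumptions: the `Q8_0` filename is inferred from the naming pattern of the other two files, and `run_quant`, `LLAMA_BIN`, `MODEL_DIR`, and `DRY_RUN` are names introduced here for illustration.

```shell
#!/bin/sh
LLAMA_BIN="${LLAMA_BIN:-$HOME/llama-bin/llama-completion}"
MODEL_DIR="${MODEL_DIR:-$HOME/models}"

# Run (or, with DRY_RUN=1, just print) the same prompt against one quantization.
# The filename pattern is assumed to hold for every quantization level.
run_quant() {
    model="$MODEL_DIR/Falcon-H1-Tiny-90M-Instruct-$1.gguf"
    set -- "$LLAMA_BIN" -m "$model" -p "Hello! How are you?" \
        -n 32 --threads 1 --ctx-size 128 --no-mmap
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Dry run by default so the script is safe to try; set DRY_RUN=0 on the Pi.
DRY_RUN="${DRY_RUN:-1}"
for quant in Q2_K Q4_K_S Q8_0; do
    run_quant "$quant"
done
```

Running it as-is only prints the three command lines; running it with `DRY_RUN=0` on the Pi executes them in sequence.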

## Summary

The working configuration is: Raspberry Pi 1 Model B, Raspberry Pi OS Legacy 32-bit Lite, `llama.cpp` compiled via `dockcross` with NEON and native optimizations disabled, Falcon H1-tiny at Q4_K_S quantization, `--no-mmap` flag. The result is coherent output at approximately 0.35 tokens per second.

**Three constraints shaped every decision:** 512MB of RAM limits model size and rules out memory mapping; ARMv6 lacks NEON, which rules out I-quant formats and modern compiler optimizations; and the single-core 700MHz CPU makes on-device compilation impractical.

This is not a practical deployment. The speed and reliability of the model's knowledge base are both inadequate for real applications. What it demonstrates is that the lower bound of useful LLM inference has dropped far enough that hardware from 2012 can produce coherent language output, which is a meaningful data point about how far model compression techniques have progressed.

![frame_1_45.jpg](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/128dd2a5-7b50-41fc-8c20-247ca8744000/orig =1280x720)