llama-swap: On-Demand Model Switching for Local LLM Servers

Stanley Ulili

Updated on May 18, 2026

The problem it solves
Configuration
Starting the server
Making requests
VRAM management with TTL
Web UI
When to use llama-swap vs. alternatives
Final thoughts

llama-swap is a lightweight Go binary that manages multiple local LLM server processes behind a single API endpoint. When a request arrives specifying a model, llama-swap checks whether that model is running, starts it if not (stopping another model if VRAM is needed), and proxies the request. Client applications point to one stable URL and never need to be reconfigured when switching models.
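The check, start, and proxy sequence described above can be pictured with a small toy model. This is a conceptual sketch only, not llama-swap's actual implementation: the ToySwapper class, its max_loaded limit, and the event strings are all hypothetical names made up for illustration, and real process management is replaced by appending to a list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToySwapper:
    """Conceptual model of the swap decision; not llama-swap's real code."""

    max_loaded: int = 1  # hypothetical: how many models fit in VRAM at once
    loaded: List[str] = field(default_factory=list)  # oldest first
    events: List[str] = field(default_factory=list)  # what the proxy "did"

    def handle(self, model: str) -> None:
        if model not in self.loaded:
            # Make room by stopping the oldest loaded model, if necessary.
            while len(self.loaded) >= self.max_loaded:
                evicted = self.loaded.pop(0)
                self.events.append(f"stop {evicted}")
            self.events.append(f"start {model}")
            self.loaded.append(model)
        # Whether the model was already running or just started, proxy the request.
        self.events.append(f"proxy {model}")


swapper = ToySwapper(max_loaded=1)
for name in ["qwen-coder", "qwen-coder", "smallm2"]:
    swapper.handle(name)

for event in swapper.events:
    print(event)
```

With room for a single model, the second qwen-coder request is proxied immediately, while the smallm2 request first stops qwen-coder and then starts the new backend.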

The problem it solves

Using llama-server from llama.cpp directly gives precise control over flags like --n-gpu-layers, --ctx-size, and --flash-attn. The tradeoff is that switching models requires killing the running process, constructing a new command for the next model, waiting for it to load (sometimes several minutes), and updating every client's base URL to the new port.

Figure: animation of the kill → restart → repeat lifecycle, showing different server instances being constantly stopped and started.

llama-swap removes this cycle. All clients connect to one URL. Model startup, teardown, and VRAM management happen automatically in the background.

Configuration

llama-swap is configured with a single YAML file. Each entry in the models section defines a backend process, the command to start it, and optional tuning parameters.

config.yaml

healthCheckTimeout: 180
logLevel: info

models:
  qwen-coder:
    name: "Qwen2.5 Coder 7B"
    cmd:
      - /opt/homebrew/bin/llama-server
      - --model
      - /path/to/models/Qwen2.5-7B-Instruct-Q5_K_M.gguf
      - --port
      - "{PORT}"
      - --host
      - "127.0.0.1"
      - --ctx-size
      - "8192"
      - --n-gpu-layers
      - "99"
      - --flash-attn
    ttl: 120
    aliases:
      - qwen2.5-coder
      - qwen-coder-7b

  smallm2:
    name: "SmolLM2 1.7B"
    cmd:
      - /opt/homebrew/bin/llama-server
      - --model
      - /path/to/models/SmolLM-1.7B-Instruct-Q4_K_M.gguf
      - --port
      - "{PORT}"
      - --host
      - "127.0.0.1"
      - --ctx-size
      - "4096"
      - --n-gpu-layers
      - "99"
      - --flash-attn
    ttl: 60

{PORT} is a placeholder that llama-swap replaces with a dynamically assigned port when starting the process. The ttl value (in seconds) controls how long an idle model stays loaded before llama-swap terminates it. aliases lists additional names that clients can use to reference the same model.
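To make the placeholder and alias behaviour concrete, here is a minimal Python sketch of both ideas. It is an illustration under assumptions, not llama-swap's code: render_cmd and resolve_model are hypothetical helper names, the models dictionary is a hand-written stand-in for the parsed config.yaml, and port 5800 is an arbitrary example value.

```python
from typing import Dict, List, Optional


def render_cmd(cmd: List[str], port: int) -> List[str]:
    """Replace the {PORT} placeholder in every argument with a concrete port."""
    return [arg.replace("{PORT}", str(port)) for arg in cmd]


def resolve_model(requested: str, models: Dict[str, dict]) -> Optional[str]:
    """Map a requested name (a model key or one of its aliases) to the model key."""
    for key, entry in models.items():
        if requested == key or requested in entry.get("aliases", []):
            return key
    return None  # unknown model name


# Hand-written stand-in for the relevant parts of config.yaml above.
models = {
    "qwen-coder": {"aliases": ["qwen2.5-coder", "qwen-coder-7b"]},
    "smallm2": {},
}
cmd = ["/opt/homebrew/bin/llama-server", "--port", "{PORT}", "--host", "127.0.0.1"]

print(resolve_model("qwen2.5-coder", models))  # qwen-coder
print(resolve_model("unknown-model", models))  # None
print(render_cmd(cmd, port=5800))  # port 5800 is an arbitrary example
```

Because clients only ever see the model key or an alias, the port chosen for the backend can change on every start without any client noticing.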

Starting the server

./llama-swap --config config.yaml

llama-swap listens on port 8080 by default. No models are loaded at startup; each one starts on its first request.

Making requests

All requests go to the same endpoint. The model field in the request body determines which backend llama-swap activates.

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-coder",
    "messages": [{"role": "user", "content": "Write a Python function to calculate a factorial."}]
  }'

On the first request for qwen-coder, llama-swap starts the backend process and waits for the health check to pass before proxying. Subsequent requests to the same model are immediate because the process is already running.
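The wait for the health check can be thought of as a polling loop with a deadline, where healthCheckTimeout from the config sets the upper bound. The sketch below is a generic illustration of that pattern, not llama-swap's implementation: wait_until_healthy and its parameters are hypothetical, and the probe is a stand-in function rather than a real HTTP check against the backend.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll probe() until it returns True or timeout_s seconds have elapsed."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True  # backend is ready; requests can be proxied
        if clock() >= deadline:
            return False  # gave up; the caller should report an error
        sleep(interval_s)


# Stand-in probe: the pretend backend becomes healthy on the third check.
answers = iter([False, False, True])
ready = wait_until_healthy(lambda: next(answers), timeout_s=180, interval_s=0)
print(ready)  # True

# A backend that never becomes healthy fails once the deadline passes.
never_ready = wait_until_healthy(lambda: False, timeout_s=0, interval_s=0)
print(never_ready)  # False
```

The first request to a cold model therefore takes as long as the model needs to load, which is why the timeout in the config above is set generously.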

To use a different model, send to the same URL with a different model value:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "smallm2",
    "messages": [{"role": "user", "content": "What are three fun facts about llamas?"}]
  }'

If VRAM cannot hold both models at once, llama-swap terminates qwen-coder before starting smallm2. The client receives a normal response either way.
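The same requests can be sent from a Python script using only the standard library. This is a minimal sketch rather than an official client: build_chat_request and chat are hypothetical helper names, and the way the reply is read (choices[0].message.content) assumes the backend returns an OpenAI-style chat completion body, which the text above does not show.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # llama-swap address used in the curl examples


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request whose "model" field selects the backend."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the request and return the reply text.

    Assumes an OpenAI-style response body; adjust if your backend differs.
    The generous timeout allows for a cold model that must load first.
    """
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Building a request needs no running server, so it is safe to inspect:
request = build_chat_request("smallm2", "What are three fun facts about llamas?")
print(request.full_url)
print(json.loads(request.data)["model"])
```

With llama-swap running, chat("qwen-coder", "Write a Python function to calculate a factorial.") followed by chat("smallm2", "What are three fun facts about llamas?") switches backends without any change to BASE_URL.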

VRAM management with TTL

Figure: diagram of a long list of models on disk next to limited VRAM that can hold only one or two at once.

Consumer GPUs typically have 8 GB to 24 GB of VRAM, enough for one or two models at once. The ttl value starts a countdown after each request completes. If no further request for that model arrives before the countdown ends, llama-swap terminates the process and frees its VRAM. A request that arrives in time resets the timer.
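The countdown-and-reset behaviour can be modelled in a few lines. This is a toy illustration, not llama-swap's implementation: IdleTimer is a hypothetical class, and timestamps are passed in explicitly as plain numbers of seconds so the example does not depend on a real clock.

```python
from typing import Optional


class IdleTimer:
    """Toy model of the TTL countdown for one loaded model."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self.deadline: Optional[float] = None  # no countdown until a request completes

    def request_completed(self, now: float) -> None:
        """Start (or restart) the countdown when a request finishes."""
        self.deadline = now + self.ttl_seconds

    def should_unload(self, now: float) -> bool:
        """True once the model has been idle for at least ttl_seconds."""
        return self.deadline is not None and now >= self.deadline


timer = IdleTimer(ttl_seconds=120)   # same ttl as qwen-coder in the config above
timer.request_completed(now=0)       # countdown would end at t=120
print(timer.should_unload(now=100))  # False
timer.request_completed(now=100)     # a new request resets it to t=220
print(timer.should_unload(now=130))  # False
print(timer.should_unload(now=220))  # True
```

A short ttl frees VRAM quickly but makes reloads more frequent, so models used in bursts generally benefit from a longer value than models used occasionally.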

Web UI

llama-swap includes a monitoring interface at http://127.0.0.1:8080.

Models tab: shows all configured models and their current state (unloaded, loading, ready). Manual unload is available here regardless of TTL.

Figure: Models dashboard displaying the status of configured models, with one shown as Ready and another as Unloaded.

Activity tab: shows a log of requests with model ID, prompt processing speed, generation speed, and duration.

Figure: llama-swap Activity dashboard showing a table of recent requests with model ID and performance metrics.

Logs tab: provides a unified view of both llama-swap proxy logs and the upstream logs from the backend model server.

When to use llama-swap vs. alternatives

Ollama manages model downloads, storage, and server lifecycle through a simple CLI. It is the better choice when ease of use matters more than fine-grained control. llama-swap does not download or manage model files; it only manages the processes that serve them.

LM Studio provides a GUI for interactive chatting and model browsing. llama-swap is headless and suited for providing a backend API to other applications: IDE plugins, Python scripts, automated agents, or containers running in a homelab.

llama-swap is most appropriate when you are already using llama-server directly and want to eliminate the manual process of switching between models while retaining full control over the flags used to run each one.

Final thoughts

llama-swap solves a specific problem that appears at a particular point in the local LLM workflow: you have outgrown the simplicity of Ollama, but managing multiple llama-server instances by hand has become disruptive. Its YAML configuration gives you the same control over flags as running llama-server directly, while the proxy layer removes the port management and client reconfiguration.

Source code and releases are at github.com/mostlygeek/llama-swap.
