# llama-swap: On-Demand Model Switching for Local LLM Servers


[llama-swap](https://github.com/mostlygeek/llama-swap) is a lightweight Go binary that manages multiple local LLM server processes behind a single API endpoint. When a request arrives specifying a model, llama-swap checks whether that model is running, starts it if not (stopping another model if VRAM is needed), and proxies the request. Client applications point to one stable URL and never need to be reconfigured when switching models.

<iframe width="100%" height="315" src="https://www.youtube.com/embed/GtTzO5ZOQr4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>


## The problem it solves

Using `llama-server` from llama.cpp directly gives precise control over flags like `--n-gpu-layers`, `--ctx-size`, and `--flash-attn`. The tradeoff is that switching models requires killing the running process, constructing a new command for the next model, waiting for it to load (sometimes several minutes), and updating every client's base URL to the new port.

![Animation depicting the kill → restart → repeat process lifecycle showing the constant stopping and starting of different server instances](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/70baf7b6-58fd-4015-c1e8-6a4c61a7db00/md1x =1280x720)

llama-swap removes this cycle. All clients connect to one URL. Model startup, teardown, and VRAM management happen automatically in the background.
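As a rough mental model of that behavior, the sketch below tracks a single loaded model and records which processes would be started or stopped for each request. It is a conceptual illustration only, not llama-swap's actual code: the `SwapProxy` class, its method names, and the one-model-at-a-time assumption are all hypothetical.

```python
class SwapProxy:
    """Toy model of a single-slot model swapper (illustrative, not llama-swap's code)."""

    def __init__(self, configured_models):
        self.configured = set(configured_models)
        self.running = None  # name of the model currently loaded, if any
        self.events = []     # record of start/stop actions, for inspection

    def ensure_running(self, model):
        if model not in self.configured:
            raise KeyError(f"unknown model: {model}")
        if self.running == model:
            return  # already loaded, nothing to do
        if self.running is not None:
            # Assumption: only one model fits, so the old one is stopped first.
            self.events.append(("stop", self.running))
        self.events.append(("start", model))
        self.running = model

    def handle(self, model):
        # Every request goes through the same entry point, whatever the model.
        self.ensure_running(model)
        return f"proxied to {model}"


proxy = SwapProxy(["qwen-coder", "smallm2"])
proxy.handle("qwen-coder")
proxy.handle("qwen-coder")  # no restart: the model is already running
proxy.handle("smallm2")     # triggers a stop followed by a start
print(proxy.events)
```

The client only ever calls `handle`; which process is alive at any moment is the proxy's concern.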

## Configuration

llama-swap is configured with a single YAML file. Each entry in the `models` section defines a backend process, the command to start it, and optional tuning parameters.

![config.yaml file in a code editor showing configuration for two distinct models](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/b0221cb8-2f8d-49bd-d3a3-293c10695d00/orig =1280x720)

```yaml
[label config.yaml]
healthCheckTimeout: 180
logLevel: info

models:
  qwen-coder:
    name: "Qwen2.5 Coder 7B"
    cmd:
      - /opt/homebrew/bin/llama-server
      - --model
      - /path/to/models/Qwen2.5-7B-Instruct-Q5_K_M.gguf
      - --port
      - "{PORT}"
      - --host
      - "127.0.0.1"
      - --ctx-size
      - "8192"
      - --n-gpu-layers
      - "99"
      - --flash-attn
    ttl: 120
    aliases:
      - qwen2.5-coder
      - qwen-coder-7b

  smallm2:
    name: "SmolLM2 1.7B"
    cmd:
      - /opt/homebrew/bin/llama-server
      - --model
      - /path/to/models/SmolLM-1.7B-Instruct-Q4_K_M.gguf
      - --port
      - "{PORT}"
      - --host
      - "127.0.0.1"
      - --ctx-size
      - "4096"
      - --n-gpu-layers
      - "99"
      - --flash-attn
    ttl: 60
```

`{PORT}` is a placeholder that llama-swap replaces with a dynamically assigned port when starting the process. The `ttl` value (in seconds) controls how long an idle model stays loaded before llama-swap terminates it. `aliases` lists additional names that clients can use to reference the same model.
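To make the placeholder and alias behavior concrete, here is a minimal Python sketch of how a proxy could resolve an alias to its model entry and fill in the port before launching the command. The helper names `build_alias_index` and `render_cmd`, the shortened commands, and the port number `5800` are illustrative assumptions, not llama-swap internals.

```python
def build_alias_index(models):
    """Map every model key and alias to its canonical model key."""
    index = {}
    for key, spec in models.items():
        index[key] = key
        for alias in spec.get("aliases", []):
            index[alias] = key
    return index


def render_cmd(cmd, port):
    """Replace the {PORT} placeholder in each argument with a concrete port."""
    return [arg.replace("{PORT}", str(port)) for arg in cmd]


# Trimmed-down mirror of the config above (commands shortened for readability).
models = {
    "qwen-coder": {
        "cmd": ["llama-server", "--port", "{PORT}", "--ctx-size", "8192"],
        "aliases": ["qwen2.5-coder", "qwen-coder-7b"],
        "ttl": 120,
    },
    "smallm2": {
        "cmd": ["llama-server", "--port", "{PORT}", "--ctx-size", "4096"],
        "ttl": 60,
    },
}

index = build_alias_index(models)
canonical = index["qwen2.5-coder"]  # an alias resolves to "qwen-coder"
print(canonical)
print(render_cmd(models[canonical]["cmd"], 5800))  # 5800 is an arbitrary example port
```

Because the port is filled in at launch time, the config never hard-codes a port per model, and clients never see the backend ports at all.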

## Starting the server

```command
./llama-swap --config config.yaml
```

llama-swap starts on port `8080` by default. No models are loaded at startup; each one starts on its first request.

## Making requests

All requests go to the same endpoint. The `model` field in the request body determines which backend llama-swap activates.

```command
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-coder",
    "messages": [{"role": "user", "content": "Write a Python function to calculate a factorial."}]
  }'
```

On the first request for `qwen-coder`, llama-swap starts the backend process and waits for the health check to pass before proxying. Subsequent requests to the same model are immediate because the process is already running.
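The wait-for-health step can be pictured as a polling loop with a deadline. The sketch below is an assumption about the general shape of such a loop, not llama-swap's implementation: `wait_until_healthy` is a hypothetical helper, and the `check` callable stands in for whatever HTTP probe the proxy sends to the backend. The `timeout=180` value mirrors `healthCheckTimeout` from the config.

```python
import time


def wait_until_healthy(check, timeout, interval=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if check():
            return True   # backend is ready, requests can be proxied
        if clock() >= deadline:
            return False  # gave up: the model did not come up in time
        sleep(interval)


# Simulated backend that reports healthy on the third probe.
probes = iter([False, False, True])
ready = wait_until_healthy(lambda: next(probes), timeout=180, interval=0)
print(ready)
```

A large model can take a while to load, which is why the timeout in the config is set generously rather than to a few seconds.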

To use a different model, send to the same URL with a different `model` value:

```command
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "smallm2",
    "messages": [{"role": "user", "content": "What are three fun facts about llamas?"}]
  }'
```

If both models cannot fit in VRAM at the same time, llama-swap terminates `qwen-coder` before starting `smallm2`. The client receives a normal response either way.
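The same pattern from a Python script, using only the standard library, might look like the sketch below. The helper names `build_chat_request` and `chat` and the 300-second timeout are illustrative choices; the response parsing assumes the backend returns the usual OpenAI-style chat completions shape.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080/v1"  # llama-swap address used in the curl examples


def build_chat_request(model, prompt, base_url=BASE_URL):
    """Build a POST request for the chat completions endpoint."""
    body = json.dumps({
        "model": model,  # the only field that changes between models
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, prompt, timeout=300):
    """Send the request and return the text of the first choice.

    The generous timeout allows for a model load on the first request.
    """
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        reply = json.load(resp)
    # Assumes an OpenAI-compatible response body.
    return reply["choices"][0]["message"]["content"]
```

Calling `chat("qwen-coder", ...)` and then `chat("smallm2", ...)` reuses the same `BASE_URL`; only the `model` argument differs.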

## VRAM management with TTL

![Diagram showing a long list of models on disk but limited VRAM that can only hold one or two at once](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/f0eca13b-ffd2-4913-9493-b1aafa628600/md2x =1280x720)

Consumer GPUs typically have 8 GB to 24 GB of VRAM, enough for one or two models at once. The `ttl` value starts a countdown after each request completes. If no further requests arrive for that model before the countdown ends, llama-swap terminates the process and frees the VRAM. If a request arrives before the countdown expires, the timer resets.
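The countdown-and-reset rule is small enough to sketch directly. The `IdleTimer` class below is a hypothetical illustration of the logic described above, with timestamps passed in explicitly so the behavior is easy to trace; it is not how llama-swap is implemented.

```python
class IdleTimer:
    """Tracks whether a model has been idle for at least its ttl (in seconds)."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.last_finished = None  # time the most recent request completed

    def request_finished(self, now):
        # Each completed request restarts the countdown.
        self.last_finished = now

    def should_unload(self, now):
        if self.last_finished is None:
            return False  # no request has completed yet, nothing to time out
        return now - self.last_finished >= self.ttl


timer = IdleTimer(ttl=120)           # same ttl as qwen-coder in the config
timer.request_finished(now=0)
print(timer.should_unload(now=100))  # False: only 100 s idle
timer.request_finished(now=100)      # a new request resets the countdown
print(timer.should_unload(now=200))  # False: 100 s since the last request
print(timer.should_unload(now=220))  # True: 120 s idle, the model can be unloaded
```

A shorter `ttl` frees VRAM sooner at the cost of more frequent reloads; a longer one keeps a frequently used model warm.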

## Web UI

llama-swap includes a monitoring interface at `http://127.0.0.1:8080`.

**Models tab:** shows all configured models and their current state (unloaded, loading, ready). Manual unload is available here regardless of TTL.

![Models dashboard displaying the status of configured models with one shown as Ready and another as Unloaded](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/1987f236-0348-462d-43bc-647bfc48a900/md2x =1280x720)

**Activity tab:** shows a log of requests with model ID, prompt processing speed, generation speed, and duration.

![llama-swap Activity dashboard showing a table of recent requests with model ID and performance metrics](https://imagedelivery.net/xZXo0QFi-1_4Zimer-T0XQ/0f7d7ced-8277-4258-da10-620fc2c43e00/orig =1280x720)

**Logs tab:** provides a unified view of both llama-swap proxy logs and the upstream logs from the backend model server.

## When to use llama-swap vs. alternatives

**Ollama** manages model downloads, storage, and server lifecycle through a simple CLI. It is the better choice when ease of use matters more than fine-grained control. llama-swap does not download or manage model files; it only manages the processes that serve them.

**LM Studio** provides a GUI for interactive chatting and model browsing. llama-swap is headless and suited for providing a backend API to other applications: IDE plugins, Python scripts, automated agents, or containers running in a homelab.

llama-swap is most appropriate when you are already using `llama-server` directly and want to eliminate the manual process of switching between models while retaining full control over the flags used to run each one.

## Final thoughts

llama-swap solves a specific problem that appears at a particular point in the local LLM workflow: you have outgrown the simplicity of Ollama, but managing multiple `llama-server` instances by hand has become disruptive. Its YAML configuration gives you the same control over flags as running `llama-server` directly, while the proxy layer removes the port management and client reconfiguration.

Source code and releases are at [github.com/mostlygeek/llama-swap](https://github.com/mostlygeek/llama-swap).