llama-swap
llama.cpp's server can't swap models on demand. So let's swap the server on demand instead!
llama-swap is a proxy server that sits in front of llama-server. When a request for /v1/chat/completions comes in, it extracts the requested model and automatically swaps the underlying llama-server to the one configured for that model.
- ✅ easy to deploy: single binary with no dependencies
- ✅ full control over llama-server's startup settings
- ✅ ❤️ for users who rely on llama.cpp for LLM inference
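For example, two requests that name different models cause llama-swap to swap the upstream server between them. This is a sketch using the model names from the example configuration below; host is a placeholder, as in the log examples later in this README:

# served by the llama-server configured for "llama"
curl http://host/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama", "messages": [{"role": "user", "content": "hello"}]}'

# llama-swap stops that server and starts the one configured for "qwen"
curl http://host/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen", "messages": [{"role": "user", "content": "hello"}]}'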
config.yaml
llama-swap's configuration is purposefully simple.
# Seconds to wait for llama.cpp to load and be ready to serve requests
# Default (and minimum) is 15 seconds
healthCheckTimeout: 60

# define valid model values and how to start the upstream server for each
models:
  "llama":
    cmd: llama-server --port 8999 -m Llama-3.2-1B-Instruct-Q4_K_M.gguf

    # where to reach the server started by cmd
    proxy: http://127.0.0.1:8999

    # alternative model names that use this configuration
    aliases:
    - "gpt-4o-mini"
    - "gpt-3.5-turbo"

    # wait for this path to return an HTTP 200 before serving requests
    # defaults to /health to match llama.cpp
    #
    # use "none" to skip endpoint checking. This may cause requests to fail
    # until the server is ready
    checkEndpoint: /custom-endpoint

  "qwen":
    # environment variables to pass to the command
    env:
    - "CUDA_VISIBLE_DEVICES=0"

    # multiline cmd for readability
    cmd: >
      llama-server --port 8999
      --model path/to/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf
    proxy: http://127.0.0.1:8999
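The aliases can make llama-swap a stand-in for clients hard-coded to other OpenAI model names. With the configuration above, a request for "gpt-4o-mini" is served by the "llama" entry (host is a placeholder):

# resolves to the "llama" configuration via its alias
curl http://host/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "hello"}]}'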
Installation
- Create a configuration file, see config.example.yaml
- Download a release appropriate for your OS and architecture.
  - Note: Windows is currently untested.
- Run the binary with:
  llama-swap --config path/to/config.yaml
Monitoring Logs
The /logs endpoint is available to monitor what llama-swap is doing. It returns up to the last 10KB of logs, which is useful for monitoring the output of llama-server. It also supports streaming new logs as they are written.
Usage:
# basic, sends up to the last 10KB of logs
curl 'http://host/logs'
# add `stream` to stream new logs as they come in
curl -Ns 'http://host/logs?stream'
# add `skip` to skip history (only useful if used with stream)
curl -Ns 'http://host/logs?stream&skip'
# will output nothing :)
curl -Ns 'http://host/logs?skip'
Systemd Unit Files
Use this unit file to start llama-swap on boot. It has only been tested on Ubuntu.
/etc/systemd/system/llama-swap.service
[Unit]
Description=llama-swap
After=network.target
[Service]
User=nobody
# set this to match your environment
ExecStart=/path/to/llama-swap --config /path/to/llama-swap.config.yml
Restart=on-failure
RestartSec=3
StartLimitBurst=3
StartLimitInterval=30
[Install]
WantedBy=multi-user.target
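After creating the unit file, reload systemd and enable the service with the standard systemctl commands:

sudo systemctl daemon-reload
sudo systemctl enable --now llama-swap.service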
Building from Source
- Install golang for your system
- Run make clean all. Binaries will be built into the build/ directory.
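For example, assuming the project is cloned from its GitHub repository:

git clone https://github.com/mostlygeek/llama-swap.git
cd llama-swap
make clean all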