# llama-swap
# Introduction

llama-swap is an OpenAI API compatible server that gives you complete control over how you use your hardware. It automatically swaps to the configuration of your choice for serving a model. Since [llama.cpp's server](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) can't swap models, let's swap the server instead!

Features:

- ✅ Easy to deploy: single binary with no dependencies
- ✅ Easy to configure: single yaml file
- ✅ On-demand model switching
- ✅ Full control over server settings per model
- ✅ OpenAI API support (`v1/completions`, `v1/chat/completions`, `v1/embeddings` and `v1/rerank`)
- ✅ Multiple GPU support
- ✅ Run multiple models at once with `profiles`
- ✅ Remote log monitoring at `/logs`
- ✅ Automatic unloading of models from GPUs after timeout
- ✅ Use any local OpenAI compatible server (llama.cpp, vllm, tabbyAPI, etc.)
- ✅ Direct access to upstream HTTP server via `/upstream/:model_id` ([demo](https://github.com/mostlygeek/llama-swap/pull/31))
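
Because the API is OpenAI compatible, existing clients and SDKs can point at llama-swap directly. As a quick sketch (assuming llama-swap is listening on port 8080 and a model named `llama` is defined in the configuration below), requesting a model by name is all it takes to trigger a swap:

```shell
# llama-swap starts (or swaps to) the upstream server for "llama",
# waits for it to pass its health check, then proxies the request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```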
## Releases

Builds for Linux and OSX are available on the [Releases](https://github.com/mostlygeek/llama-swap/releases) page.

### Building from source

1. Install golang for your system
1. `git clone git@github.com:mostlygeek/llama-swap.git`
1. `make clean all`
1. Binaries will be in `build/` subdirectory
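
Putting the steps together, a build from a fresh checkout looks roughly like this (assuming `git`, `make`, and a Go toolchain are already installed):

```shell
# clone, build, and list the resulting binaries in build/
git clone git@github.com:mostlygeek/llama-swap.git
cd llama-swap
make clean all
ls build/
```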

## config.yaml

llama-swap's configuration is purposefully simple.

```yaml
# Seconds to wait for llama.cpp to load and be ready to serve requests
# Default (and minimum) is 15 seconds
healthCheckTimeout: 60

# define valid model values and the upstream server start
models:
  "llama":
    cmd: llama-server --port 8999 -m Llama-3.2-1B-Instruct-Q4_K_M.gguf

    # where to reach the server started by cmd, make sure the ports match
    proxy: http://127.0.0.1:8999

    # alias names that can be used to request this model
    aliases:
      - "gpt-4o-mini"
      - "gpt-3.5-turbo"

    # check this path for an HTTP 200 OK before serving requests
    # default: /health to match llama.cpp
    # use "none" to skip endpoint checking, but this may cause HTTP errors
    # until the model is ready
    checkEndpoint: /custom-endpoint

    # automatically unload the model after this many seconds
    # ttl value must be greater than 0
    # default: 0 = never unload model
    ttl: 60

  "qwen":
    # environment variables to pass to the command
    env:
      - "CUDA_VISIBLE_DEVICES=0"

    # multiline for readability
    cmd: >
      llama-server --port 8999
      --model path/to/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf
    proxy: http://127.0.0.1:8999

  # unlisted models do not show up in /v1/models or /upstream lists
  # but they can still be requested as normal
  "qwen-unlisted":
    cmd: llama-server --port 9999 -m Llama-3.2-1B-Instruct-Q4_K_M.gguf -ngl 0
    unlisted: true

# profiles make it easy to manage multi-model (and multi-GPU) configurations
#
# Tips:
#  - each model must be listening on a unique address and port
#  - the model name is in this format: "profile_name:model", like "coding:qwen"
#  - the profile will load and unload all models in the profile at the same time
profiles:
  coding:
    - "qwen"
    - "llama"
```
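
Models that belong to a profile are requested as `profile_name:model`. A minimal sketch (again assuming llama-swap is listening on port 8080, and that the models in the profile are configured on unique ports as the tips above note):

```shell
# loads every model in the "coding" profile at once,
# then routes this request to the "qwen" member
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "coding:qwen",
    "messages": [{"role": "user", "content": "Write a hello world program in Go"}]
  }'
```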
**Advanced examples**

- [config.example.yaml](config.example.yaml) includes examples for supporting `v1/embeddings` and `v1/rerank` endpoints
- [Speculative Decoding](examples/speculative-decoding/README.md) - using a small draft model can increase inference speeds from 20% to 40%. This example includes configurations for Qwen2.5-Coder-32B (2.5x increase) and Llama-3.1-70B (1.4x increase) in the best cases.
- [Optimizing Code Generation](examples/benchmark-snakegame/README.md) - find the optimal settings for your machine. This example demonstrates defining multiple configurations and testing which one is fastest.

## Installation

1. Create a configuration file, see [config.example.yaml](config.example.yaml)
1. Download a [release](https://github.com/mostlygeek/llama-swap/releases) appropriate for your OS and architecture.
   * _Note: Windows currently untested._
1. Run the binary with `llama-swap --config path/to/config.yaml`
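
Once it is running, the configured models can be listed over the API to confirm the configuration loaded correctly (assuming llama-swap is listening on port 8080; unlisted models will not appear here):

```shell
# returns the model names and aliases defined in config.yaml
curl http://localhost:8080/v1/models
```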
## Monitoring Logs

Open `http://<host>/logs` in your browser to get a web interface with streaming logs.

Of course, CLI access is also supported:

```shell
# sends up to the last 10KB of logs
curl 'http://host/logs'

# streams logs
curl -Ns 'http://host/logs/stream'

# stream and filter logs with linux pipes
curl -Ns http://host/logs/stream | grep 'eval time'

# skips history and just streams new log entries
curl -Ns 'http://host/logs/stream?no-history'
```
## Systemd Unit Files

Use this unit file to start llama-swap on boot. It has only been tested on Ubuntu.

`/etc/systemd/system/llama-swap.service`

```
[Unit]
Description=llama-swap
After=network.target

[Service]
User=nobody

# set this to match your environment
ExecStart=/path/to/llama-swap --config /path/to/llama-swap.config.yml

Restart=on-failure
RestartSec=3
StartLimitBurst=3
StartLimitInterval=30

[Install]
WantedBy=multi-user.target
```
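
After copying the unit file into place and adjusting the paths in `ExecStart`, the service can be enabled with the usual systemd commands:

```shell
# pick up the new unit, then start it now and on every boot
sudo systemctl daemon-reload
sudo systemctl enable --now llama-swap
```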