llama-swap
llama.cpp's server can't swap models on demand. So let's swap the server on demand instead!
llama-swap is a proxy server that sits in front of llama-server. When a request for /v1/chat/completions comes in, it extracts the requested model and automatically swaps the underlying llama-server to serve it.
- ✅ easy to deploy: single binary with no dependencies
- ✅ full control over llama-server's startup settings
- ✅ ❤️ for users who rely on llama.cpp for LLM inference
config.yaml
llama-swap's configuration is purposefully simple.
# Seconds to wait for llama.cpp to load and be ready to serve requests
# Default (and minimum) is 15 seconds
healthCheckTimeout: 60

# define valid model values and the upstream server start
models:
  "llama":
    cmd: llama-server --port 8999 -m Llama-3.2-1B-Instruct-Q4_K_M.gguf

    # where to reach the server started by cmd
    proxy: http://127.0.0.1:8999

    # alias model names that also use this configuration
    aliases:
    - "gpt-4o-mini"
    - "gpt-3.5-turbo"

    # wait for this path to return an HTTP 200 before serving requests
    # defaults to /health to match llama.cpp
    #
    # use "none" to skip endpoint checking. This may cause requests to fail
    # until the server is ready
    checkEndpoint: /custom-endpoint

    # automatically unload the model after 5 seconds
    # ttl must be greater than 0
    # default: 0 = never unload model
    ttl: 5

  "qwen":
    # environment variables to pass to the command
    env:
    - "CUDA_VISIBLE_DEVICES=0"

    # multiline for readability
    cmd: >
      llama-server --port 8999
      --model path/to/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf
    proxy: http://127.0.0.1:8999
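With the configuration above, a client selects a model by name (or alias) in a standard OpenAI-style request, and llama-swap starts or swaps to the matching llama-server before proxying it. A minimal sketch, assuming llama-swap is listening on localhost:8080 (adjust the host and port to your setup):

# request the "llama" model defined in config.yaml
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Sending "gpt-4o-mini" or "gpt-3.5-turbo" as the model name would reach the same "llama" configuration through its aliases.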
Installation
- Create a configuration file, see config.example.yaml
- Download a release appropriate for your OS and architecture.
- Note: Windows currently untested.
- Run the binary with
llama-swap --config path/to/config.yaml
Monitoring Logs
The /logs endpoint is available to monitor what llama-swap is doing. It sends the last 10KB of logs, which is useful for monitoring the output of llama-server, and it also supports streaming logs.
Usage:
# sends up to the last 10KB of logs
curl 'http://host/logs'
# streams logs using chunked encoding
curl -Ns 'http://host/logs/stream'
# skips history and just streams new log entries
curl -Ns 'http://host/logs/stream?no-history'
# streams logs using Server Sent Events
curl -Ns 'http://host/logs/streamSSE'
Systemd Unit Files
Use this unit file to start llama-swap on boot. It has only been tested on Ubuntu.
/etc/systemd/system/llama-swap.service
[Unit]
Description=llama-swap
After=network.target
[Service]
User=nobody
# set this to match your environment
ExecStart=/path/to/llama-swap --config /path/to/llama-swap.config.yml
Restart=on-failure
RestartSec=3
StartLimitBurst=3
StartLimitInterval=30
[Install]
WantedBy=multi-user.target
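Once the unit file is in place, reload systemd and enable the service so it starts now and on every boot. These are standard systemctl commands, shown here as a sketch:

# pick up the new unit file, then start it and enable it at boot
sudo systemctl daemon-reload
sudo systemctl enable --now llama-swap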
Building from Source
- Install golang for your system
- Run make clean all; binaries will be built into the build/ directory