
llama-swap

llama-swap header image

llama.cpp's server can't swap models on demand. So let's swap the server on demand instead!

llama-swap is a proxy server that sits in front of llama-server. When a request for /v1/chat/completions comes in, it extracts the requested model and swaps the underlying llama-server automatically.
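
For example, an OpenAI-style chat completion request selects a configuration by its "model" field. A sketch, assuming llama-swap is listening on localhost:8080 and that a model named "llama" is defined in config.yaml (see below):

# hypothetical request; adjust the host/port and model name to your setup
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

llama-swap reads the "model" value, starts the matching llama-server (stopping any other one it manages), waits for the configured health check to pass, and then forwards the request.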

  • easy to deploy: single binary with no dependencies
  • full control over llama-server's startup settings
  • ❤️ for users who rely on llama.cpp for LLM inference

config.yaml

llama-swap's configuration is purposefully simple.

# Seconds to wait for llama.cpp to load and be ready to serve requests
# Default (and minimum) is 15 seconds
healthCheckTimeout: 60

# define valid model values and the upstream server start
models:
  "llama":
    cmd: llama-server --port 8999 -m Llama-3.2-1B-Instruct-Q4_K_M.gguf

    # where to reach the server started by cmd
    proxy: http://127.0.0.1:8999

    # alternative model names (aliases) that map to this configuration
    aliases:
    - "gpt-4o-mini"
    - "gpt-3.5-turbo"

    # wait for this path to return an HTTP 200 before serving requests
    # defaults to /health to match llama.cpp
    #
    # use "none" to skip endpoint checking. This may cause requests to fail
    # until the server is ready
    checkEndpoint: /custom-endpoint

  "qwen":
    # environment variables to pass to the command
    env:
      - "CUDA_VISIBLE_DEVICES=0"

    # multiline for readability
    cmd: >
      llama-server --port 8999
      --model path/to/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf

    proxy: http://127.0.0.1:8999
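
With the configuration above, a request can name a model directly or via one of its aliases, and switching models makes llama-swap swap the underlying server. A sketch, again assuming llama-swap listens on localhost:8080:

# served by the "llama" configuration via its "gpt-4o-mini" alias
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "hi"}]}'

# naming "qwen" swaps the upstream: the current llama-server is stopped, the
# qwen cmd is started, and the request is forwarded once the health check passes
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen", "messages": [{"role": "user", "content": "hi"}]}'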

Installation

  1. Create a configuration file; see config.example.yaml.
  2. Download a release appropriate for your OS and architecture.
    • Note: Windows is currently untested.
  3. Run the binary with llama-swap --config path/to/config.yaml (see the example below).
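
Step 3 might look like the following. The listen address in the second command is an assumption; use whatever address your llama-swap instance is actually bound to:

# run with the configuration file from step 1
./llama-swap --config path/to/config.yaml

# in another terminal: confirm it is up by streaming its logs
# (assumes llama-swap is listening on localhost:8080)
curl -Ns 'http://localhost:8080/logs?stream'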

Monitoring Logs

The /logs endpoint can be used to monitor what llama-swap is doing. It sends the last 10KB of logs, which is useful for watching the output of llama-server, and it also supports streaming new log lines as they are written.

Usage:

# basic, sends up to the last 10KB of logs
curl 'http://host/logs'

# add `stream` to stream new logs as they come in
curl -Ns 'http://host/logs?stream'

# add `skip` to skip history (only useful if used with stream)
curl -Ns 'http://host/logs?stream&skip'

# will output nothing :)
curl -Ns 'http://host/logs?skip'

Systemd Unit Files

Use this unit file to start llama-swap on boot. It has only been tested on Ubuntu.

/etc/systemd/system/llama-swap.service

[Unit]
Description=llama-swap
After=network.target

[Service]
User=nobody

# set this to match your environment
ExecStart=/path/to/llama-swap --config /path/to/llama-swap.config.yml

Restart=on-failure
RestartSec=3
StartLimitBurst=3
StartLimitInterval=30

[Install]
WantedBy=multi-user.target
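
After saving the unit file, reload systemd and enable the service so it starts now and on boot. These are standard systemctl commands:

sudo systemctl daemon-reload
sudo systemctl enable --now llama-swap

# follow the service's output
journalctl -u llama-swap -f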

Building from Source

  1. Install golang for your system
  2. Run make clean all
  3. Binaries will be written to the build/ directory (see the example session below)
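
A typical build session might look like this. The repository URL is an assumption; substitute wherever you cloned the source from, and the exact binary name under build/ may vary by OS and architecture:

# assumed repository URL; clone from wherever you obtained the source
git clone https://github.com/mostlygeek/llama-swap.git
cd llama-swap
make clean all

# run a freshly built binary (exact name under build/ may differ)
./build/llama-swap --config path/to/config.yaml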