
llama-swap

llama-swap header image

llama.cpp's server can't swap models on demand. So let's swap the server on demand instead!

llama-swap is a proxy server that sits in front of llama-server. When a request for /v1/chat/completions comes in, it extracts the requested model and swaps the underlying llama-server automatically.
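
For example, an OpenAI-style chat completion request selects a configuration by its "model" field. A sketch, assuming llama-swap is listening on localhost:8080 and that a model named "llama" is defined in config.yaml (see below):

# hypothetical request; adjust the host/port and model name to your setup
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

llama-swap reads the "model" value, starts the matching llama-server (stopping any other one it manages), waits for the configured health check to pass, and then forwards the request.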

  • easy to deploy: single binary with no dependencies
  • full control over llama-server's startup settings
  • ❤️ for users who rely on llama.cpp for LLM inference

config.yaml

llama-swap's configuration is purposefully simple.

# Seconds to wait for llama.cpp to load and be ready to serve requests
# Default (and minimum) is 15 seconds
healthCheckTimeout: 60

# define valid model values and the upstream server start
models:
  "llama":
    cmd: llama-server --port 8999 -m Llama-3.2-1B-Instruct-Q4_K_M.gguf

    # where to reach the server started by cmd
    proxy: http://127.0.0.1:8999

    # alternative model names (aliases) that map to this configuration
    aliases:
    - "gpt-4o-mini"
    - "gpt-3.5-turbo"

    # wait for this path to return an HTTP 200 before serving requests
    # defaults to /health to match llama.cpp
    #
    # use "none" to skip endpoint checking. This may cause requests to fail
    # until the server is ready
    checkEndpoint: /custom-endpoint

  "qwen":
    # environment variables to pass to the command
    env:
      - "CUDA_VISIBLE_DEVICES=0"

    # multiline for readability
    cmd: >
      llama-server --port 8999
      --model path/to/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf

    proxy: http://127.0.0.1:8999
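
With the configuration above, a request can name a model directly or via one of its aliases, and switching models makes llama-swap swap the underlying server. A sketch, again assuming llama-swap listens on localhost:8080:

# served by the "llama" configuration via its "gpt-4o-mini" alias
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "hi"}]}'

# naming "qwen" swaps the upstream: the current llama-server is stopped, the
# qwen cmd is started, and the request is forwarded once the health check passes
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen", "messages": [{"role": "user", "content": "hi"}]}'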

Installation

  1. Create a configuration file; see config.example.yaml.
  2. Download a release appropriate for your OS and architecture.
    • Note: Windows is currently untested.
  3. Run the binary with llama-swap --config path/to/config.yaml (see the example below).
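
Step 3 might look like the following. The listen address in the second command is an assumption; use whatever address your llama-swap instance is actually bound to:

# run with the configuration file from step 1
./llama-swap --config path/to/config.yaml

# in another terminal: confirm it is up by streaming its logs
# (assumes llama-swap is listening on localhost:8080)
curl -Ns 'http://localhost:8080/logs?stream'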

Monitoring Logs

The /logs endpoint can be used to monitor what llama-swap is doing. It sends the last 10KB of logs, which is useful for watching the output of llama-server, and it also supports streaming new log lines as they are written.

Usage:

# basic, sends up to the last 10KB of logs
curl 'http://host/logs'

# add `stream` to stream new logs as they come in
curl -Ns 'http://host/logs?stream'

# add `skip` to skip history (only useful if used with stream)
curl -Ns 'http://host/logs?stream&skip'

# will output nothing :)
curl -Ns 'http://host/logs?skip'

Systemd Unit Files

Use this unit file to start llama-swap on boot. It has only been tested on Ubuntu.

/etc/systemd/system/llama-swap.service

[Unit]
Description=llama-swap
After=network.target

[Service]
User=nobody

# set this to match your environment
ExecStart=/path/to/llama-swap --config /path/to/llama-swap.config.yml

Restart=on-failure
RestartSec=3
StartLimitBurst=3
StartLimitInterval=30

[Install]
WantedBy=multi-user.target
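
After saving the unit file, reload systemd and enable the service so it starts now and on boot. These are standard systemctl commands:

sudo systemctl daemon-reload
sudo systemctl enable --now llama-swap

# follow the service's output
journalctl -u llama-swap -f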

Building from Source

  1. Install golang for your system
  2. Run make clean all
  3. Binaries will be written to the build/ directory (see the example session below)
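
A typical build session might look like this. The repository URL is an assumption; substitute wherever you cloned the source from, and the exact binary name under build/ may vary by OS and architecture:

# assumed repository URL; clone from wherever you obtained the source
git clone https://github.com/mostlygeek/llama-swap.git
cd llama-swap
make clean all

# run a freshly built binary (exact name under build/ may differ)
./build/llama-swap --config path/to/config.yaml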