Clean up and Documentation (#347) [skip ci]
* cmd,misc: move misc binaries to cmd/
* docs: add docs and move examples/ there
* misc: remove unused misc/assets dir
* docs: add configuration.md
* update README with better structure

Updates: #334
README.md
@@ -5,74 +5,165 @@

# llama-swap

Run multiple LLM models on your machine and hot-swap between them as needed. llama-swap works with any OpenAI API-compatible server, giving you the flexibility to switch models without restarting your applications.

Built in Go for performance and simplicity, llama-swap has zero dependencies and is incredibly easy to set up. Get started in minutes - just install and configure.

## Features:

- ✅ Easy to deploy and configure: one binary, one configuration file, no external dependencies
- ✅ On-demand model switching
- ✅ Use any local OpenAI compatible server (llama.cpp, vllm, tabbyAPI, etc)
  - future proof: upgrade your inference servers at any time
- ✅ OpenAI API supported endpoints:
  - `v1/completions`
  - `v1/chat/completions`
  - `v1/embeddings`
  - `v1/audio/speech` ([#36](https://github.com/mostlygeek/llama-swap/issues/36))
  - `v1/audio/transcriptions` ([docs](https://github.com/mostlygeek/llama-swap/issues/41#issuecomment-2722637867))
- ✅ llama-server (llama.cpp) supported endpoints:
  - `v1/rerank`, `v1/reranking`, `/rerank`
  - `/infill` - for code infilling
  - `/completion` - for completion endpoint
- ✅ llama-swap API (see the curl sketch after this list)
  - `/ui` - web UI
  - `/upstream/:model_id` - direct access to upstream server ([demo](https://github.com/mostlygeek/llama-swap/pull/31))
  - `/models/unload` - manually unload running models ([#58](https://github.com/mostlygeek/llama-swap/issues/58))
  - `/running` - list currently running models ([#61](https://github.com/mostlygeek/llama-swap/issues/61))
  - `/log` - remote log monitoring
  - `/health` - just returns "OK"
- ✅ Customizable
  - Run multiple models at once with `Groups` ([#107](https://github.com/mostlygeek/llama-swap/issues/107))
  - Automatic unloading of models after timeout by setting a `ttl`
  - Reliable Docker and Podman support using `cmd` and `cmdStop` together
  - Full control over server settings per model
  - Preload models on startup with `hooks` ([#235](https://github.com/mostlygeek/llama-swap/pull/235))

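Management endpoints such as `/health` and `/running` respond to plain HTTP requests. A minimal sketch, assuming llama-swap is listening on `localhost:8080` and that these endpoints accept simple GET requests:

```shell
# liveness check - returns "OK"
curl http://localhost:8080/health

# list the currently running models
curl http://localhost:8080/running
```
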
### Web UI

llama-swap includes a real time web interface for monitoring logs and controlling models:

<img width="1360" height="963" alt="image" src="https://github.com/user-attachments/assets/adef4a8e-de0b-49db-885a-8f6dedae6799" />

The Activity Page shows recent requests:

<img width="1360" height="963" alt="image" src="https://github.com/user-attachments/assets/5f3edee6-d03a-4ae5-ae06-b20ac1f135bd" />

## Installation

llama-swap can be installed in multiple ways:

1. Docker
2. Homebrew (macOS and Linux)
3. WinGet (Windows)
4. From release binaries
5. From source

### Docker Install ([download images](https://github.com/mostlygeek/llama-swap/pkgs/container/llama-swap))

Nightly container images with llama-swap and llama-server are built for multiple platforms (cuda, vulkan, intel, etc.).

```shell
$ docker pull ghcr.io/mostlygeek/llama-swap:cuda

# run with a custom configuration and models directory
$ docker run -it --rm --runtime nvidia -p 9292:8080 \
    -v /path/to/models:/models \
    -v /path/to/custom/config.yaml:/app/config.yaml \
    ghcr.io/mostlygeek/llama-swap:cuda
```

<details>
<summary>more examples</summary>

```shell
# pull latest images per platform
docker pull ghcr.io/mostlygeek/llama-swap:cpu
docker pull ghcr.io/mostlygeek/llama-swap:cuda
docker pull ghcr.io/mostlygeek/llama-swap:vulkan
docker pull ghcr.io/mostlygeek/llama-swap:intel
docker pull ghcr.io/mostlygeek/llama-swap:musa

# tagged llama-swap, platform and llama-server version images
docker pull ghcr.io/mostlygeek/llama-swap:v166-cuda-b6795
```

</details>

### Homebrew Install (macOS/Linux)

```shell
brew tap mostlygeek/llama-swap
brew install llama-swap
llama-swap --config path/to/config.yaml --listen localhost:8080
```

### WinGet Install (Windows)

> [!NOTE]
> WinGet is maintained by community contributor [Dvd-Znf](https://github.com/Dvd-Znf) ([#327](https://github.com/mostlygeek/llama-swap/issues/327)). It is not an official part of llama-swap.

```shell
# install
C:\> winget install llama-swap

# upgrade
C:\> winget upgrade llama-swap
```

### Pre-built Binaries

Binaries are available on the [release](https://github.com/mostlygeek/llama-swap/releases) page for Linux, Mac, Windows and FreeBSD.

Run the binary with `llama-swap --config path/to/config.yaml --listen localhost:8080`. Available flags:

- `--config`: Path to the configuration file (default: `config.yaml`).
- `--listen`: Address and port to listen on (default: `:8080`).
- `--version`: Show version information and exit.
- `--watch-config`: Automatically reload the configuration file when it changes. This will wait for in-flight requests to complete then stop all running models (default: `false`).

### Building from source

1. Building requires Go and Node.js (for the UI).
2. `git clone https://github.com/mostlygeek/llama-swap.git`
3. `make clean all`
4. Look for the llama-swap binary in the `build/` subdirectory.

## Configuration

```yaml
# minimum viable config.yaml

models:
  model1:
    cmd: llama-server --port ${PORT} --model /path/to/model.gguf
```

That's all you need to get started:

1. `models` - holds all model configurations
2. `model1` - the ID used in API calls (see the example request below)
3. `cmd` - the command used to start the inference server
4. `${PORT}` - an automatically assigned port number

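A quick way to exercise this config is a standard OpenAI-style request that references the model ID. A minimal sketch, assuming llama-swap is listening on `localhost:8080`:

```shell
curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"model1","messages":[{"role":"user","content":"tell me a joke"}]}'
```
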
Almost all configuration settings are optional and can be added one step at a time (see the sketch after this list):

- Advanced features
  - `groups` to run multiple models at once
  - `hooks` to run things on startup
  - `macros` for reusable snippets
- Model customization
  - `ttl` to automatically unload models
  - `aliases` to use familiar model names (e.g., "gpt-4o-mini")
  - `env` to pass custom environment variables to inference servers
  - `cmdStop` to gracefully stop Docker/Podman containers
  - `useModelName` to override model names sent to upstream servers
  - `healthCheckTimeout` to control model startup wait times
  - `${PORT}` automatic port variables for dynamic port assignment
  - `filters` to rewrite parts of requests before sending them to the upstream server

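As an illustration of how a few of these options compose, here is a rough sketch layered on the minimal config. The option names come from the list above; the exact schema and value formats (for example, the `ttl` unit) are assumptions to verify against the configuration documentation linked below.

```yaml
# sketch only - check the configuration documentation for the exact schema
models:
  model1:
    cmd: llama-server --port ${PORT} --model /path/to/model.gguf
    ttl: 300                      # assumed: seconds of inactivity before the model is unloaded
    aliases:
      - "gpt-4o-mini"             # requests using this name are routed to model1
    env:
      - "CUDA_VISIBLE_DEVICES=0"  # example environment variable passed to the inference server
```
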
See the [configuration documentation](docs/configuration.md) for all options.

## How does llama-swap work?

When a request is made to an OpenAI compatible endpoint, llama-swap will extract the `model` value and load the appropriate server configuration to serve it. If the wrong upstream server is running, it will be replaced with the correct one. This is where the "swap" part comes in: the upstream server is automatically swapped to handle the request correctly.

In the most basic configuration llama-swap handles one model at a time. For more advanced use cases, the `groups` feature allows multiple models to be loaded at the same time. You have complete control over how your system resources are used.

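For example, a `groups` entry can keep two models loaded side by side instead of swapping between them. A rough sketch; the group keys shown here are assumptions to verify against the configuration documentation:

```yaml
# sketch only - check the configuration documentation for the exact group options
models:
  model1:
    cmd: llama-server --port ${PORT} --model /path/to/model-a.gguf
  model2:
    cmd: llama-server --port ${PORT} --model /path/to/model-b.gguf

groups:
  "pair":               # hypothetical group name
    swap: false         # assumed: members are not swapped out, so both stay loaded
    members:
      - "model1"
      - "model2"
```
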
## Reverse Proxy Configuration (nginx)

If you deploy llama-swap behind nginx, disable response buffering for streaming endpoints. By default, nginx buffers responses, which breaks Server-Sent Events (SSE) and streaming chat completions. ([#236](https://github.com/mostlygeek/llama-swap/issues/236))

@@ -97,111 +188,7 @@ location /v1/chat/completions {

As a safeguard, llama-swap also sets `X-Accel-Buffering: no` on SSE responses. However, explicitly disabling `proxy_buffering` at your reverse proxy is still recommended for reliable streaming behavior.

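For illustration, a location block along these lines disables buffering for the streaming chat endpoint. This is a sketch; adjust the upstream address to wherever llama-swap listens:

```nginx
# sketch only: proxy the streaming endpoint to llama-swap with buffering disabled
location /v1/chat/completions {
    proxy_pass http://127.0.0.1:8080;  # assumed llama-swap listen address
    proxy_buffering off;               # required for SSE / streaming responses
    proxy_cache off;
}
```
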
## Monitoring Logs on the CLI

```shell
# sends up to the last 10KB of logs
```

@@ -227,11 +214,11 @@ curl -Ns 'http://host/logs/stream?no-history'

Any OpenAI compatible server would work. llama-swap was originally designed for llama-server and it is the best supported.

For Python based inference servers like vllm or tabbyAPI it is recommended to run them via podman or docker. This provides clean environment isolation and ensures they respond correctly to `SIGTERM` signals for proper shutdown.

## Star History

> [!NOTE]
> ⭐️ Star this project to help others discover it!

[Star History Chart](https://www.star-history.com/#mostlygeek/llama-swap&Date)