Implement Multi-Process Handling (#7)

Refactor code to support starting multiple backend llama.cpp servers. This functionality is exposed as `profiles`, which keeps the configuration format simple.
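
A minimal sketch of the resulting config shape (the model names, files, and ports below are placeholders, not taken from this commit; the real example is in the `config.yaml` diff further down):

```yaml
# each model runs its own llama-server process on its own port
models:
  "llama":
    cmd: llama-server --port 9001 -m Llama-3.2-1B-Instruct-Q4_K_M.gguf
    proxy: http://127.0.0.1:9001
  "qwen":
    cmd: llama-server --port 9002 -m Qwen2.5-1.5B-Instruct-Q4_K_M.gguf
    proxy: http://127.0.0.1:9002

# a profile starts and stops all of its models together
profiles:
  coding:
    - "qwen"
    - "llama"
```

Each model in a profile needs its own address and port since they run at the same time; requesting `coding/qwen` loads both models in the `coding` profile.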

Changes: 

* refactor proxy tests to get ready for multi-process support
* update proxy/ProxyManager to support multiple processes (#7)
* Add support for Groups in configuration
* improve handling of Model alias configs
* implement multi-model swapping
* improve code clarity for swapModel
* improve docs, rename groups to profiles in config
Authored by Benson Wong on 2024-11-23 19:45:13 -08:00, committed by GitHub.
Parent: 533162ce6a
Commit: 73ad85ea69
10 changed files with 361 additions and 124 deletions


@@ -2,17 +2,22 @@
![llama-swap header image](header.jpeg)
[llama.cpp's server](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) can't swap models on demand. So let's swap the server on demand instead!
llama-swap is a golang server that automatically swaps the llama.cpp server on demand. Since [llama.cpp's server](https://github.com/ggerganov/llama.cpp/tree/master/examples/server) can't swap models, let's swap the server instead!
llama-swap is a proxy server that sits in front of llama-server. When a request for `/v1/chat/completions` comes in, it will extract the `model` requested and change the underlying llama-server automatically.
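For example, a client only changes the `model` field of a normal OpenAI-style request (a sketch only; the address llama-swap listens on depends on how you start it):

```shell
# listen address is an assumption here; adjust to your setup.
# the proxy reads the "model" field and starts the matching llama-server.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama", "messages": [{"role": "user", "content": "hello"}]}'
```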
Features:
- easy to deploy: single binary with no dependencies
- full control over llama-server's startup settings
- ❤️ for users who rely on llama.cpp for LLM inference
- Easy to deploy: single binary with no dependencies
- Single yaml configuration file
- Automatic switching between models
- ✅ Full control over llama.cpp server settings per model
- ✅ OpenAI API support (`v1/completions` and `v1/chat/completions`)
- ✅ Multiple GPU support
- ✅ Run multiple models at once with `profiles`
- ✅ Remote log monitoring at `/log`
## config.yaml
llama-swap's configuration purposefully simple.
llama-swap's configuration is purposefully simple.
```yaml
# Seconds to wait for llama.cpp to load and be ready to serve requests
@@ -24,25 +29,24 @@ models:
"llama":
cmd: llama-server --port 8999 -m Llama-3.2-1B-Instruct-Q4_K_M.gguf
# where to reach the server started by cmd
# where to reach the server started by cmd, make sure the ports match
proxy: http://127.0.0.1:8999
# aliases model names to use this configuration for
# aliases names to use this model for
aliases:
- "gpt-4o-mini"
- "gpt-3.5-turbo"
# wait for this path to return an HTTP 200 before serving requests
# defaults to /health to match llama.cpp
#
# use "none" to skip endpoint checking. This may cause requests to fail
# until the server is ready
# check this path for an HTTP 200 OK before serving requests
# default: /health to match llama.cpp
# use "none" to skip endpoint checking, but may cause HTTP errors
# until the model is ready
checkEndpoint: /custom-endpoint
# automatically unload the model after 10 seconds
# automatically unload the model after this many seconds
# ttl values must be a value greater than 0
# default: 0 = never unload model
ttl: 5
ttl: 60
"qwen":
# environment variables to pass to the command
@@ -53,8 +57,18 @@ models:
    cmd: >
      llama-server --port 8999
      --model path/to/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf
    proxy: http://127.0.0.1:8999

# profiles make it easy to manage multi-model (and gpu) configurations.
#
# Tips:
# - each model must be listening on a unique address and port
# - the model name is in this format: "profile_name/model", like "coding/qwen"
# - the profile will load and unload all models in the profile at the same time
profiles:
  coding:
    - "qwen"
    - "llama"
```
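
A usage sketch for profiles (the listen address below is an assumption, not part of this diff): requesting a `profile_name/model` name loads every model in that profile and routes the request to the named one.

```shell
# "coding/qwen" = the "qwen" model inside the "coding" profile.
# listen address is an assumption; adjust to your setup.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "coding/qwen", "messages": [{"role": "user", "content": "hello"}]}'
```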
## Installation