Commit Graph

31 Commits

Author SHA1 Message Date
Benson Wong
b8f888f864 Logging Improvements (#88)
This change revamps the internal logging architecture to be more flexible and descriptive. Previously, all logs from both llama-swap and upstream services were mixed together, which made it harder to troubleshoot and identify problems. This PR adds these new endpoints:

- `/logs/stream/proxy` - just llama-swap's logs
- `/logs/stream/upstream` - stdout output from the upstream server
2025-04-04 21:01:33 -07:00
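
A minimal sketch of tailing one of the new log streams from Go; the endpoint paths come from the commit above, while the listen address and line-oriented framing are assumptions:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Tail llama-swap's own logs; swap the path for /logs/stream/upstream
	// to follow the upstream server's stdout instead.
	resp, err := http.Get("http://localhost:8080/logs/stream/proxy")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() { // blocks until the server emits the next line
		fmt.Println(scanner.Text())
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```
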
Benson Wong
b7f8cb5094 Limit Access-Control-Allow-Origin to OPTIONS preflight requests #85 2025-04-04 14:44:35 -07:00
Benson Wong
a23da6eb57 Sanitize CORS headers (#85)
Add a sanitization step for `Access-Control-Allow-Headers` when echoing back user-supplied headers
2025-04-01 08:43:53 -07:00
Benson Wong
84e2c07a7e Refactor wildcard out of CORS headers (#81)
Changes to CORS functionality: 

- `Access-Control-Allow-Origin: *` is set for all requests 
- for pre-flight OPTIONS requests
  - specify methods: `Access-Control-Allow-Methods: GET, POST, PUT, PATCH, DELETE, OPTIONS`
  - if the client sent `Access-Control-Request-Headers` then echo back the same value in `Access-Control-Allow-Headers`. If no `Access-Control-Request-Headers` were sent, then send back a default set
  - set `Access-Control-Max-Age: 86400`, which may improve performance
- Add CORS tests to the proxy-manager
2025-03-25 15:24:43 -07:00
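
The preflight behaviour listed above can be sketched as a gin middleware (the project moved to gin in c9233d2c9a). This is an illustrative reconstruction, not the project's actual handler; the fallback header set and listen address are assumptions:

```go
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

// corsMiddleware mirrors the behaviour described in the commit message:
// a wildcard origin on every response, full preflight headers on OPTIONS.
func corsMiddleware() gin.HandlerFunc {
	return func(c *gin.Context) {
		c.Header("Access-Control-Allow-Origin", "*")

		if c.Request.Method == http.MethodOptions {
			c.Header("Access-Control-Allow-Methods",
				"GET, POST, PUT, PATCH, DELETE, OPTIONS")

			// Echo the client's requested headers, or fall back to a
			// default set (the defaults here are assumptions).
			if reqHeaders := c.GetHeader("Access-Control-Request-Headers"); reqHeaders != "" {
				c.Header("Access-Control-Allow-Headers", reqHeaders)
			} else {
				c.Header("Access-Control-Allow-Headers", "Content-Type, Authorization")
			}

			// Let clients cache the preflight result for a day.
			c.Header("Access-Control-Max-Age", "86400")
			c.AbortWithStatus(http.StatusNoContent)
			return
		}
		c.Next()
	}
}

func main() {
	r := gin.Default()
	r.Use(corsMiddleware())
	r.Run(":8080") // listen address is an assumption
}
```
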
Benson Wong
680af28bcc Allow very permissive CORS headers (#77) 2025-03-20 15:50:21 -07:00
Benson Wong
5c97299e7b Add support for sending a custom model name to upstream (#69) (#71)
* add test for splitRequestedModel()
* Add `useModelName` parameter to model configuration
* add docs to README
2025-03-14 21:07:52 -07:00
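
A sketch of what a model entry with `useModelName` might look like, parsed here with `gopkg.in/yaml.v3`; only the `useModelName` key comes from the commit above, the other field name and the parser choice are assumptions:

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// modelConfig sketches a model's configuration entry; only useModelName
// is taken from the commit message, the cmd field is a placeholder.
type modelConfig struct {
	Cmd          string `yaml:"cmd"`
	UseModelName string `yaml:"useModelName"`
}

func main() {
	raw := []byte("cmd: llama-server -m model.gguf\nuseModelName: qwen-local\n")

	var cfg modelConfig
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	// When set, this name would replace the requested one upstream.
	fmt.Println(cfg.UseModelName) // qwen-local
}
```
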
Benson Wong
3201a68a04 Add /v1/audio/transcriptions support (#41)
* add support for /v1/audio/transcriptions
2025-03-13 13:49:39 -07:00
Florin-Gabriel Dumitru
3ac94ad20e Adds an endpoint '/running' (#61)
* Adds an endpoint '/running' that returns either an empty JSON object if no model has been loaded so far, or the last model loaded (`model` key) and its current state (`state` key). Possible state values are: stopped, starting, ready and stopping.

* Improves the `/running` endpoint by allowing multiple entries under the `running` key within the JSON response.
Refactors the `/running` method name (listRunningProcessesHandler).
Removes the unlisted filter implementation.

* Adds tests for:
- no model loaded
- one model loaded
- multiple models loaded

* Adds simple comments.

* Simplified code structure as per 250313 comments on PR #65.

---------

Co-authored-by: FGDumitru|B <xelotx@gmail.com>
2025-03-13 13:42:59 -07:00
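
A sketch of the documented response shape as a gin handler; only the `running`/`model`/`state` keys and the state names come from the commit above, while the struct and the hard-coded entry are illustrative:

```go
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

// runningEntry mirrors the documented keys: the model name under "model"
// and one of stopped, starting, ready, or stopping under "state".
type runningEntry struct {
	Model string `json:"model"`
	State string `json:"state"`
}

// listRunningProcessesHandler returns every loaded model under the
// "running" key; an empty list means nothing has been loaded yet.
func listRunningProcessesHandler(c *gin.Context) {
	// The real handler would read this from the process manager.
	entries := []runningEntry{{Model: "llama", State: "ready"}}
	c.JSON(http.StatusOK, gin.H{"running": entries})
}

func main() {
	r := gin.Default()
	r.GET("/running", listRunningProcessesHandler)
	r.Run(":8080")
}
```
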
Benson Wong
b3d331da0d Properly strip profile name slug from models (fixes #62)
The profile slug in a model name, `profile:model`, is specific to
llama-swap. This strips `profile:` out of the requested model name so
upstreams that expect just `model` work without needing to know about
the profile slug.
2025-03-09 12:41:52 -07:00
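
A minimal sketch of the stripping behaviour, assuming a plain split on the first `:`; the real implementation may differ:

```go
package main

import (
	"fmt"
	"strings"
)

// stripProfileSlug drops a leading "profile:" from the requested model
// name so the upstream only ever sees the bare model name.
func stripProfileSlug(requested string) string {
	if _, model, found := strings.Cut(requested, ":"); found {
		return model
	}
	return requested
}

func main() {
	fmt.Println(stripProfileSlug("coding:Qwen-2.5-Coder-32B")) // Qwen-2.5-Coder-32B
	fmt.Println(stripProfileSlug("plain-model"))               // plain-model
}
```
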
Benson Wong
082d5d0fc5 Add /unload endpoint (#58) to unload all currently running models 2025-03-03 10:33:36 -08:00
Benson Wong
09bdd86b54 Improve shutdown behaviour (#47) (#49)
Introduce `Process.Shutdown()` and `ProxyManager.Shutdown()`. These two functions required a lot of internal process state management refactoring. A key benefit is that `Process.start()` is now interruptible: when `Shutdown()` is called, it breaks the long health-check loop.

State management within Process is also improved. Added `starting`, `stopping` and `shutdown` states. Additionally, introduced a simple finite state machine to manage transitions.
2025-02-05 17:19:59 -08:00
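
A sketch of such a finite state machine over the named states; the transition table itself is an assumption:

```go
package main

import "fmt"

// processState enumerates the states named in the commit message.
type processState string

const (
	stateStopped  processState = "stopped"
	stateStarting processState = "starting"
	stateReady    processState = "ready"
	stateStopping processState = "stopping"
	stateShutdown processState = "shutdown"
)

// validTransitions is a minimal transition table; which transitions the
// project actually allows is an assumption.
var validTransitions = map[processState][]processState{
	stateStopped:  {stateStarting, stateShutdown},
	stateStarting: {stateReady, stateStopping, stateShutdown},
	stateReady:    {stateStopping, stateShutdown},
	stateStopping: {stateStopped, stateShutdown},
}

func transition(from, to processState) error {
	for _, next := range validTransitions[from] {
		if next == to {
			return nil
		}
	}
	return fmt.Errorf("invalid transition: %s -> %s", from, to)
}

func main() {
	fmt.Println(transition(stateStopped, stateStarting)) // <nil>
	fmt.Println(transition(stateReady, stateStarting))   // invalid transition
}
```
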
Benson Wong
2c3e3e27f7 Support OPTIONS requests (#42)
Add middleware that responds with permissive OPTIONS headers
for all request paths.
2025-01-31 10:09:07 -08:00
Benson Wong
abdc2bfdb3 Fix panic when requesting non-members of profiles
A panic occurs when a request for an invalid profile:model pair is made.
The edge case is that the profile exists and the model exists but they're
not configured as a pair.

This adds an additional check to make sure the profile:model pair is
valid before attempting to swap the model.
2025-01-16 12:06:38 -08:00
Benson Wong
3a1e9f81f1 support TTS /v1/audio/speech (#36) 2025-01-12 16:27:01 -08:00
Benson Wong
ae3ef9bc39 Refactor UI (#33)
- add html to / instead of 404
- add client side regex to /logs
2024-12-23 19:48:59 -08:00
Benson Wong
da5d9e8a6a fix HTTP logging so true path is printed 2024-12-20 11:25:01 -08:00
Benson Wong
84b667ca7a improve logging and error reporting for troubleshooting 2024-12-20 10:46:56 -08:00
Benson Wong
9c8860471e support v1/rerank endpoint 2024-12-17 21:22:25 -08:00
Benson Wong
9b4e3f307e rename proxy handler 2024-12-17 17:25:10 -08:00
Benson Wong
6fe37c3abf support /v1/embeddings (#4) 2024-12-17 17:25:10 -08:00
Benson Wong
891f6a5b5a Add /upstream endpoint (#30)
* remove catch-all route to upstream proxy (it was broken anyway)
* add /upstream/:model_id to swap and route to upstream path
* add /upstream HTML endpoint and unlisted option
* add /upstream endpoint to show a list of available models
* add `unlisted` configuration option to omit a model from /v1/models and /upstream lists
* add favicon.ico
2024-12-17 14:37:44 -08:00
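
A sketch of the `/upstream/:model_id` routing with `httputil.ReverseProxy`; the model registry is hard-coded and the swap-on-demand step is elided:

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"

	"github.com/gin-gonic/gin"
)

func main() {
	// Hard-coded stand-in for the configured models and their servers.
	upstreams := map[string]string{"llama": "http://localhost:9090"}

	r := gin.Default()
	r.Any("/upstream/:model_id/*path", func(c *gin.Context) {
		base, ok := upstreams[c.Param("model_id")]
		if !ok {
			c.AbortWithStatus(http.StatusNotFound)
			return
		}
		target, err := url.Parse(base)
		if err != nil {
			c.AbortWithStatus(http.StatusInternalServerError)
			return
		}
		// Strip the /upstream/:model_id prefix before forwarding.
		c.Request.URL.Path = c.Param("path")
		httputil.NewSingleHostReverseProxy(target).ServeHTTP(c.Writer, c.Request)
	})
	r.Run(":8080")
}
```
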
Benson Wong
18c134624d Add Access-Control-Allow-Origin CORS header to /v1/models endpoint
- match behavior of llama.cpp where the Origin in request is used
- add test for listModelsHandler
2024-12-03 15:53:59 -08:00
Benson Wong
04b4760e7e change profile split character to : (colon) (#21)
- change from `/` to `:` for multiple models loaded as part of a profile
- this is a breaking change for now, but it allows for more compatibility with other inference engines that may have model references like `coding:Qwen/Qwen-2.5-Coder-32B`
2024-12-01 09:10:50 -08:00
Benson Wong
cf82b3c633 Improve Concurrency and Parallel Request Handling (#19)
Rewrite the swap behaviour so that in-flight requests block process swapping until they are completed. 

Additionally: 

- add tests for parallel requests with proxy.ProxyManager and proxy.Process
- improve Process startup behaviour and simplify the code
- processes being stopped are sent SIGTERM and have 5 seconds to terminate before they are killed
2024-11-30 15:24:42 -08:00
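
A sketch of the swap-blocking and termination behaviour, assuming a `sync.WaitGroup` tracks in-flight requests; the actual synchronization in the project may differ:

```go
package main

import (
	"os/exec"
	"sync"
	"syscall"
	"time"
)

// Process sketches the swap-blocking idea: every in-flight request holds
// the WaitGroup, and Stop drains it before touching the child process.
type Process struct {
	inFlight sync.WaitGroup
	cmd      *exec.Cmd
}

// HandleRequest wraps request handling so a swap cannot start mid-request.
func (p *Process) HandleRequest(serve func()) {
	p.inFlight.Add(1)
	defer p.inFlight.Done()
	serve()
}

// Stop waits for in-flight requests, sends SIGTERM, then kills the child
// if it has not exited within five seconds.
func (p *Process) Stop() {
	p.inFlight.Wait()
	_ = p.cmd.Process.Signal(syscall.SIGTERM)

	done := make(chan error, 1)
	go func() { done <- p.cmd.Wait() }()
	select {
	case <-done: // exited gracefully
	case <-time.After(5 * time.Second):
		_ = p.cmd.Process.Kill()
		<-done // reap the killed process
	}
}

func main() {
	p := &Process{cmd: exec.Command("sleep", "60")}
	if err := p.cmd.Start(); err != nil {
		panic(err)
	}
	p.HandleRequest(func() { /* serve one proxied request */ })
	p.Stop()
}
```
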
Benson Wong
73ad85ea69 Implement Multi-Process Handling (#7)
Refactor code to support starting multiple backend llama.cpp servers. This functionality is exposed as `profiles` to create a simple configuration format.

Changes: 

* refactor proxy tests to get ready for multi-process support
* update proxy/ProxyManager to support multiple processes (#7)
* Add support for Groups in configuration
* improve handling of Model alias configs
* implement multi-model swapping
* improve code clarity for swapModel
* improve docs, rename groups to profiles in config
2024-11-23 19:45:13 -08:00
Benson Wong
533162ce6a add support for automatically unloading a model (#10) (#14)
* Make starting upstream process on-demand (#10)
* Add automatic unload of model after TTL is reached
* add a `ttl` configuration parameter to models, in seconds; the default is 0 (never unload)
2024-11-19 16:32:51 -08:00
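
A sketch of TTL-based unloading with a reset-on-request timer; the type and callback are illustrative, and not concurrency-safe as written:

```go
package main

import (
	"fmt"
	"time"
)

// ttlUnloader resets its timer on every request; when ttl elapses with no
// traffic, the unload callback fires.
type ttlUnloader struct {
	ttl    time.Duration
	timer  *time.Timer
	unload func()
}

func (u *ttlUnloader) touch() {
	if u.ttl == 0 {
		return // ttl of 0 means never unload, matching the default
	}
	if u.timer != nil {
		u.timer.Stop()
	}
	u.timer = time.AfterFunc(u.ttl, u.unload)
}

func main() {
	u := &ttlUnloader{
		ttl:    2 * time.Second,
		unload: func() { fmt.Println("unloading idle model") },
	}
	u.touch() // called on every proxied request
	time.Sleep(3 * time.Second)
}
```
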
Benson Wong
ba39ed4c18 Add support for legacy v1/completions API (#12) 2024-11-19 09:57:39 -08:00
Benson Wong
7eec51f3f2 Dechunk HTTP requests by default (#11)
ProxyManager already has all of the request body's data, so there is never
a need to use chunked transfer encoding to the upstream process.
2024-11-19 09:40:44 -08:00
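
A sketch of why buffering removes the need for chunking: handing `net/http` a `*bytes.Reader` sets an explicit `Content-Length`, which suppresses chunked transfer encoding; the endpoint URL and payload are placeholders:

```go
package main

import (
	"bytes"
	"io"
	"net/http"
)

// forwardBuffered sends an already-buffered body upstream. Because the
// body is a *bytes.Reader, net/http sets Content-Length automatically,
// which is what keeps chunked transfer encoding off the wire.
func forwardBuffered(upstreamURL string, body []byte) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodPost, upstreamURL, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return http.DefaultClient.Do(req)
}

func main() {
	resp, err := forwardBuffered("http://localhost:9090/v1/chat/completions",
		[]byte(`{"model":"llama"}`))
	if err != nil {
		return
	}
	defer resp.Body.Close()
	_, _ = io.Copy(io.Discard, resp.Body)
}
```
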
Benson Wong
c9233d2c9a use gin instead of standard http lib in main 2024-11-18 15:58:28 -08:00
Benson Wong
401aa88949 move log handlers to separate file 2024-11-18 15:33:06 -08:00
Benson Wong
e9e88fd229 rename proxy.go to proxymanager.go 2024-11-18 15:30:34 -08:00