Commit Graph

39 Commits

Author SHA1 Message Date
Ryan Steed
86e9b93c37 proxy,ui: add version endpoint and display version info in UI (#395)
- Add /api/version endpoint to ProxyManager that returns build date, commit hash, and version
- Implement SetVersion method to configure version info in ProxyManager
- Add version info fetching to APIProvider and display in ConnectionStatus component
- Include version info in UI context and update dependencies
- Add tests for version endpoint functionality
2025-11-17 10:43:47 -08:00
Ryan Steed
554d29e87d feat: enhance model listing to include aliases (#400)
introduce includeAliasesInList as a new configuration setting (default false) that includes aliases in v1/models

Fixes #399
2025-11-15 14:35:26 -08:00
Benson Wong
e250e71e59 Include metrics from upstream chat requests (#361)
* proxy: refactor metrics recording

- remove metrics_middleware.go as this wrapper is no longer needed. This
  also eliminiates double body parsing for the modelID
- move metrics parsing to be part of MetricsMonitor
- refactor how metrics are recording in ProxyManager
- add MetricsMonitor tests
- improve mem efficiency of processStreamingResponse
- add benchmarks for MetricsMonitor.addMetrics
- proxy: refactor MetricsMonitor to be more safe handling errors
2025-10-25 17:38:18 -07:00
David Wen Riccardi-Zhu
d58a8b85bf Refactor to use httputil.ReverseProxy (#342)
* Refactor to use httputil.ReverseProxy

Refactor manual HTTP proxying logic in Process.ProxyRequest to use the standard
library's httputil.ReverseProxy.

* Refactor TestProcess_ForceStopWithKill test

Update to handle behavior with httputil.ReverseProxy.

* Fix gin interface conversion panic
2025-10-13 16:47:04 -07:00
Benson Wong
caf9e98b1e Fix race conditions in proxy.Process (#349)
- Fix data races found in proxy.Process by go's race detector. 
- Add data race detection to the CI tests. 

Fixes #348
2025-10-13 16:42:49 -07:00
Benson Wong
70930e4e91 proxy: add support for user defined metadata in model configs (#333)
Changes: 

- add Metadata key to ModelConfig
- include metadata in /v1/models under meta.llamaswap key
- add recursive macro substitution into Metadata
- change macros at global and model level to be any scalar type

Note: 

This is the first mostly AI generated change to llama-swap. See #333 for notes about the workflow and approach to AI going forward.
2025-10-04 19:56:41 -07:00
Benson Wong
216c40b951 proxy/config: create config package and migrate configuration (#329)
* proxy/config: create config package and migrate configuration

The configuration is become more complex as llama-swap adds more
advanced features. This commit moves config to its own package so it can
be developed independently of the proxy package.

Additionally, enforcing a public API for a configuration will allow
downstream usage to be more decoupled.
2025-09-28 16:50:06 -07:00
Benson Wong
1a84926505 proxy: add unload of single model (#318)
This adds a new API endpoint, /api/models/unload/*model, that unloads a single model. In the UI when a model is in a ReadyState it will have a new button to unload it. 

Fixes #312
2025-09-24 20:53:48 -07:00
Artur Podsiadły
558801db1a Fix nginx proxy buffering for streaming endpoints (#295)
* Fix nginx proxy buffering for streaming endpoints

- Add X-Accel-Buffering: no header to SSE endpoints (/api/events, /logs/stream)
- Add X-Accel-Buffering: no header to proxied text/event-stream responses
- Add nginx reverse proxy configuration section to README
- Add tests for X-Accel-Buffering header on streaming endpoints

Fixes #236

* Fix goroutine cleanup in streaming endpoints test

Add context cancellation to TestProxyManager_StreamingEndpointsReturnNoBufferingHeader
to ensure the goroutine is properly cleaned up when the test completes.
2025-09-09 16:07:46 -07:00
Yandrik
977f1856bb add /completion endpoint (#275)
* feat: add /completion endpoint
* chore: reformat using gofmt
2025-08-28 21:41:02 -07:00
Benson Wong
52b329f7bc Fix #277 race condition in ProcessGroup.ProxyRequest when swap=true 2025-08-28 21:38:40 -07:00
Benson Wong
5dc6b3e6d9 Add barebones but working implementation of model preload (#209, #235)
Add barebones but working implementation of model preload

* add config test for Preload hook
* improve TestProxyManager_StartupHooks
* docs for new hook configuration
* add a .dev to .gitignore
2025-08-14 10:27:28 -07:00
Benson Wong
10569ed546 Fix model alias usage in upstream path (#230)
Model alias values are not properly resolved and work in upstream/ path.

Related to #229.
2025-08-07 20:16:56 -07:00
Ben Greene
5c63e0066c return models sorted by id in /v1/models (#222) 2025-08-06 10:04:52 -07:00
Benson Wong
0f583163f7 add /health (#211) 2025-07-30 10:37:10 -07:00
Benson Wong
01d4838fb3 Fix token metrics parsing (#199)
Fix #198

- use llama-server's `timings` info if available in response body
- send "-1" for token/sec when not able to accurately calculate
  performance
- optimize streaming body search for metrics information
2025-07-22 23:10:14 -07:00
g2mt
87dce5f8f6 Add metrics logging for chat completion requests (#195)
- Add token and performance metrics  for v1/chat/completions 
- Add Activity Page in UI
- Add /api/metrics endpoint

Contributed by @g2mt
2025-07-21 22:19:55 -07:00
Benson Wong
c867a6c9a2 Add name and description to v1/models list (#179)
* Add support for name and description in v1/models list
* add configuration example for name and description
2025-06-30 23:02:44 -07:00
Benson Wong
4236cec03a Add Filters to Model Configuration (#174)
llama-swap can strip specific keys in JSON requests. This is useful for removing the ability for clients to set sampling parameters like temperature, top_k, top_p, etc.
2025-06-23 10:52:29 -07:00
Benson Wong
e3a0b013c1 add content length test for #131 2025-05-14 19:50:01 -07:00
Benson Wong
7f37bcc6eb Improve testing around using SIGKILL (#127)
* Add test for SIGKILL of process
* silent TestProxyManager_RunningEndpoint debug output
* Ref #125
2025-05-13 21:21:52 -07:00
Benson Wong
519c3a4d22 Change /unload to not wait for inflight requests (#125)
Sometimes upstreams can accept HTTP but never respond causing requests
to build up waiting for a response. This can block Process.Stop() as
that waits for inflight requests to finish. This change refactors the
code to not wait when attempting to shutdown the process.
2025-05-13 11:39:19 -07:00
Sam
bc652709a5 Add config hot-reload (#106)
introduce --watch-config command line option to reload ProxyManager when configuration changes.
2025-05-11 17:37:00 -07:00
Benson Wong
21d7973d11 Improve content-length handling (#115)
ref: See #114

* Improve content-length handling
- Content length was not always being sent
- Add tests for content-length
2025-05-05 10:46:26 -07:00
Benson Wong
448ccae959 Introduce Groups Feature (#107)
Groups allows more control over swapping behaviour when a model is requested. The new groups feature provides three ways to control swapping: within the group, swapping out other groups or keep the models in the group loaded persistently (never swapped out). 

Closes #96, #99 and #106.
2025-05-02 22:35:38 -07:00
Benson Wong
5fad24c16f Make checkHealthTimeout Interruptable during startup (#102)
interrupt and exit Process.start() early if the upstream process exits prematurely or unexpectedly.
2025-04-24 14:39:33 -07:00
Benson Wong
192b2ae621 Remove no longer needed test 2025-04-04 14:46:01 -07:00
Benson Wong
84e2c07a7e Refactor wildcard out of CORS headers (#81)
Changes to CORS functionality: 

- `Access-Control-Allow-Origin: *` is set for all requests 
- for pre-flight OPTIONS requests
  - specify methods: `Access-Control-Allow-Methods: GET, POST, PUT, PATCH, DELETE, OPTIONS`
  - if the client sent `Access-Control-Request-Headers` then echo back the same value in `Access-Control-Allow-Headers`. If no `Access-Control-Request-Headers` were sent, then send back a default set
  - set `Access-Control-Max-Age: 86400` to that may improve performance 
- Add CORS tests to the proxy-manager
2025-03-25 15:24:43 -07:00
Benson Wong
5c97299e7b Add support for sending a custom model name to upstream (#69) (#71)
* add test for splitRequestedModel()
* Add `useModelName` parameter to model configuration
* add docs to README
2025-03-14 21:07:52 -07:00
Benson Wong
3201a68a04 Add /v1/audio/transcriptions support (#41)
* add support for /v1/audio/transcriptions
2025-03-13 13:49:39 -07:00
Florin-Gabriel Dumitru
3ac94ad20e Adds an endpoint '/running' (#61)
* Adds an endpoint '/running' that returns either an empty JSON object if no model has been loaded so far, or the last model loaded (model key) and it's current state (state key). Possible state values are: stopped, starting, ready and stopping.

* Improves the `/running` endpoint by allowing multiple entries under the `running` key within the JSON response.
Refactors the `/running` method name (listRunningProcessesHandler).
Removes the unlisted filter implementation.

* Adds tests for:
- no model loaded
- one model loaded
- multiple models loaded

* Adds simple comments.

* Simplified code structure as per 250313 comments on PR #65.

---------

Co-authored-by: FGDumitru|B <xelotx@gmail.com>
2025-03-13 13:42:59 -07:00
Benson Wong
b3d331da0d Properly strip profile name slug from models fixes (#62)
The profile slug in a model name, `profile:model`, is specific to
llama-swap. This strips `profile:` out of the model name request so
upstreams that expect just `model` work and do not require knowing about
the profile slug.
2025-03-09 12:41:52 -07:00
Benson Wong
082d5d0fc5 Add /unload endpoint (#58) to unload all currently running models 2025-03-03 10:33:36 -08:00
Benson Wong
09bdd86b54 Improve shutdown behaviour (#47) (#49)
Introduce `Process.Shutdown()` and `ProxyManager.Shutdown()`. These two function required a lot of internal process state management refactoring. A key benefit is that `Process.start()` is now interruptable. When `Shutdown()` is called it will break the long health check loop. 

State management within Process is also improved. Added `starting`, `stopping` and `shutdown` states. Additionally, introduced a simple finite state machine to manage transitions.
2025-02-05 17:19:59 -08:00
Benson Wong
abdc2bfdb3 Fix panic when requesting non-members of profiles
A panic occurs when a request for an invalid profile:model pair is made.
The edge case is that the profile exists and the model exists but they're
not configured as a pair.

This adds an additional check to make sure the profile:model pair is
valid before attempting to swap the model.
2025-01-16 12:06:38 -08:00
Benson Wong
18c134624d Add Access-Control-Allow-Origin CORS header to /v1/models endpoint
- match behavior of llama.cpp where the Origin in request is used
- add test for listModelsHandler
2024-12-03 15:53:59 -08:00
Benson Wong
04b4760e7e change profile split character to : (colon) (#21)
- change from `/` to `:` for multiple models loaded as part of a profile
- breaking change now, but allows for more compatibility with other inference engines that may have model references like `coding:Qwen/Qwen-2.5-Coder-32B`
2024-12-01 09:10:50 -08:00
Benson Wong
cf82b3c633 Improve Concurrency and Parallel Request Handling (#19)
Rewrite the swap behaviour so that in-flight requests block process swapping until they are completed. 

Additionally: 

- add tests for parallel requests with proxy.ProxyManager and proxy.Process
- improve Process startup behaviour and simplified the code 
- stopping of processes are sent SIGTERM and have 5 seconds to terminate, before they are killed
2024-11-30 15:24:42 -08:00
Benson Wong
73ad85ea69 Implement Multi-Process Handling (#7)
Refactor code to support starting of multiple back end llama.cpp servers. This functionality is exposed as `profiles` to create a simple configuration format. 

Changes: 

* refactor proxy tests to get ready for multi-process support
* update proxy/ProxyManager to support multiple processes (#7)
* Add support for Groups in configuration
* improve handling of Model alias configs
* implement multi-model swapping
* improve code clarity for swapModel
* improve docs, rename groups to profiles in config
2024-11-23 19:45:13 -08:00