Commit Graph

31 Commits

Author SHA1 Message Date
Benson Wong
70930e4e91 proxy: add support for user defined metadata in model configs (#333)
Changes: 

- add Metadata key to ModelConfig
- include metadata in /v1/models under meta.llamaswap key
- add recursive macro substitution into Metadata
- change macros at global and model level to be any scalar type

Note: 

This is the first mostly AI-generated change to llama-swap. See #333 for notes about the workflow and approach to AI going forward.
2025-10-04 19:56:41 -07:00
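
A minimal sketch of what this could look like in a config file, assuming a lowercase `metadata` key on the model entry (key names and shape are inferred from the notes above, not confirmed here):

```yaml
macros:
  ctx: 32768                     # per this change, macros may be any scalar type

models:
  "llama-8b":
    cmd: llama-server --port ${PORT} --ctx-size ${ctx}
    metadata:                    # assumed key name and casing
      quantization: "Q4_K_M"
      max_context: ${ctx}        # substitution is applied recursively inside metadata
```

Clients calling /v1/models would then find this object under each model's `meta.llamaswap` key.
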
Benson Wong
1f6179110c proxy/config: add model level macros (#330)
* proxy/config: add model level macros

Add macros to model configuration. Model macros override macros that are
defined at the global configuration level. They follow the same naming
and value rules as the global macros.

* proxy/config: fix bug with macro reserved name checking

The PORT reserved name was not properly checked

* proxy/config: add tests around model.filters.stripParams

- add check that model.filters.stripParams has no invalid macros
- renamed strip_params to stripParams for camel case consistency
- add legacy code compatibility so model.filters.strip_params continues to work

* proxy/config: add duplicate removal to model.filters.stripParams

* clean up some doc nits
2025-09-28 23:32:52 -07:00
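
A hedged sketch of the override behavior described in this entry; macro names and values are illustrative only:

```yaml
macros:
  gpuLayers: 99                        # global macro

models:
  "qwen-14b":
    macros:
      gpuLayers: 40                    # model-level value overrides the global one
    cmd: llama-server --port ${PORT} --n-gpu-layers ${gpuLayers}
    filters:
      stripParams: "temperature, top_p"   # renamed from strip_params; old name kept working
```
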
Benson Wong
a533aec736 small tweak to example config 2025-09-01 21:26:58 -07:00
Brett Profitt
97b17fc47d Add ${MODEL_ID} macro (#226)
The automatic ${MODEL_ID} macro includes the name of the model and can be used in Cmd and CmdStop.
2025-09-01 21:21:37 -07:00
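
For example, the macro could pass the configured model name through to the upstream server (a sketch; `--alias` is a llama-server flag used here purely for illustration):

```yaml
models:
  "llama-8b":
    # ${MODEL_ID} expands to "llama-8b" in this model's commands
    cmd: llama-server --port ${PORT} --alias ${MODEL_ID}
```
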
Benson Wong
c55d0cc842 Add docs for model.concurrencyLimit #263 [skip ci] 2025-08-22 16:08:37 -07:00
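
A sketch of where the setting would sit, assuming it caps simultaneous in-flight requests per model (semantics inferred from the key name, not from the docs):

```yaml
models:
  "embedder":
    cmd: llama-server --port ${PORT} -m /models/embed.gguf
    concurrencyLimit: 4   # assumed: at most 4 requests in flight at once for this model
```
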
Benson Wong
305e5a0031 improve example config [skip ci] 2025-08-17 09:19:04 -07:00
Benson Wong
5dc6b3e6d9 Add barebones but working implementation of model preload (#209, #235)
* add config test for Preload hook
* improve TestProxyManager_StartupHooks
* docs for new hook configuration
* add a .dev to .gitignore
2025-08-14 10:27:28 -07:00
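
A hypothetical sketch of a startup hook that preloads a model; all key names below are guesses based on the bullet points, not taken from the docs:

```yaml
hooks:                # hypothetical key names throughout this block
  on_startup:
    preload:
      - "llama-8b"    # model to start as soon as llama-swap boots
```
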
g2mt
87dce5f8f6 Add metrics logging for chat completion requests (#195)
- Add token and performance metrics for v1/chat/completions
- Add Activity Page in UI
- Add /api/metrics endpoint

Contributed by @g2mt
2025-07-21 22:19:55 -07:00
Benson Wong
c867a6c9a2 Add name and description to v1/models list (#179)
* Add support for name and description in v1/models list
* add configuration example for name and description
2025-06-30 23:02:44 -07:00
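
Sketch of the two new fields on a model entry, as described by the commit (values are illustrative):

```yaml
models:
  "llama-8b":
    cmd: llama-server --port ${PORT} -m /models/llama-8b.gguf
    name: "Llama 3.1 8B"                       # friendly name shown in v1/models
    description: "General-purpose chat model"  # also returned in v1/models
```
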
Benson Wong
4236cec03a Add Filters to Model Configuration (#174)
llama-swap can strip specific keys from JSON requests. This is useful for preventing clients from setting sampling parameters like temperature, top_k, top_p, etc.
2025-06-23 10:52:29 -07:00
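
A sketch using the key name from this era (`strip_params`, later renamed to `stripParams` in #330 above); the comma-separated string format is an assumption:

```yaml
models:
  "llama-8b":
    cmd: llama-server --port ${PORT} -m /models/llama-8b.gguf
    filters:
      # these keys are removed from JSON request bodies before proxying upstream;
      # the comma-separated string format is an assumption
      strip_params: "temperature, top_k, top_p"
```
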
Benson Wong
9a3c656738 New UI (#157, #164)
- Add a react UI to replace the plain HTML one. 
- Serve as a foundation for better GUI interactions
2025-06-16 16:45:19 -07:00
Benson Wong
f9ee7156dc update configuration examples for multiline yaml commands #133 2025-05-16 11:45:39 -07:00
Benson Wong
cb876c143b update example config 2025-05-12 10:20:18 -07:00
Benson Wong
fb7c808082 add timing for Process start, stop, total request time (#91) 2025-04-14 14:34:59 -07:00
Benson Wong
b8f888f864 Logging Improvements (#88)
This change revamps the internal logging architecture to be more flexible and descriptive. Previously, logs from both llama-swap and upstream services were mixed together, which made it harder to troubleshoot and identify problems. This PR adds these new endpoints:

- `/logs/stream/proxy` - just llama-swap's logs
- `/logs/stream/upstream` - stdout output from the upstream server
2025-04-04 21:01:33 -07:00
Benson Wong
314d2f2212 remove cmd_stop configuration and functionality from PR #40 (#44)
* remove cmd_stop functionality from #40
2025-01-31 12:42:44 -08:00
Benson Wong
baeb0c4e7f Add cmd_stop configuration to better support docker (#35)
Add `cmd_stop` to model configuration to run a command, instead of sending a SIGTERM, to shut down a process before swapping.
2025-01-30 16:59:57 -08:00
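
A sketch of the shape described (note this feature was removed again in #44, listed above); the docker image name is a placeholder:

```yaml
models:
  "containerized":
    cmd: docker run --name llama-ctr ghcr.io/example/llama-server   # placeholder image
    cmd_stop: docker stop llama-ctr    # run on swap instead of sending SIGTERM
```
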
Benson Wong
84b667ca7a improve logging and error reporting for troubleshooting 2024-12-20 10:46:56 -08:00
Benson Wong
9c8860471e support v1/rerank endpoint 2024-12-17 21:22:25 -08:00
Benson Wong
6fe37c3abf support /v1/embeddings (#4) 2024-12-17 17:25:10 -08:00
Benson Wong
891f6a5b5a Add /upstream endpoint (#30)
* remove catch-all route to upstream proxy (it was broken anyway)
* add /upstream/:model_id to swap and route to upstream path
* add /upstream HTML endpoint and unlisted option
* add /upstream endpoint to show a list of available models
* add `unlisted` configuration option to omit a model from /v1/models and /upstream lists
* add favicon.ico
2024-12-17 14:37:44 -08:00
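
Sketch of the new option on a model entry:

```yaml
models:
  "internal-test":
    cmd: llama-server --port 9292 -m /models/test.gguf
    proxy: http://127.0.0.1:9292
    unlisted: true   # omitted from the /v1/models and /upstream lists
```
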
Benson Wong
73ad85ea69 Implement Multi-Process Handling (#7)
Refactor code to support starting multiple backend llama.cpp servers. This functionality is exposed as `profiles` to create a simple configuration format.

Changes: 

* refactor proxy tests to get ready for multi-process support
* update proxy/ProxyManager to support multiple processes (#7)
* Add support for Groups in configuration
* improve handling of Model alias configs
* implement multi-model swapping
* improve code clarity for swapModel
* improve docs, rename groups to profiles in config
2024-11-23 19:45:13 -08:00
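
A minimal sketch, assuming a profile simply maps a name to a list of models that may run together (model names are illustrative):

```yaml
profiles:
  coding:
    - "qwen-coder"
    - "embedder"
```
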
Benson Wong
533162ce6a add support for automatically unloading a model (#10) (#14)
* Make starting upstream process on-demand (#10)
* Add automatic unload of model after TTL is reached
* add `ttl` configuration parameter to models, in seconds; default is 0 (never unload)
2024-11-19 16:32:51 -08:00
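
Example of the new parameter:

```yaml
models:
  "llama-8b":
    cmd: llama-server --port 9292 -m /models/llama-8b.gguf
    proxy: http://127.0.0.1:9292
    ttl: 300   # unload after 300 seconds of inactivity; the default 0 never unloads
```
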
Benson Wong
7eec51f3f2 Dechunk HTTP requests by default (#11)
ProxyManager already has all of the request body's data, so there is never
a need to use chunked transfer encoding to the upstream process.
2024-11-19 09:40:44 -08:00
Benson Wong
34f9fd7340 Improve timeout and exit handling of child processes. fix #3 and #5
llama-swap only waited a maximum of 5 seconds for an upstream
HTTP server to become available; if it took longer than that, the request
would error out. Now it waits up to the configured healthCheckTimeout,
or until the upstream process unexpectedly exits.
2024-11-01 14:32:39 -07:00
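
A sketch of the referenced setting, assuming a top-level key with seconds as the unit:

```yaml
healthCheckTimeout: 120   # assumed: wait up to 120s for the upstream server to become ready
```
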
Benson Wong
8448efa7fc revise health check logic to not error on 5 second timeout 2024-11-01 09:42:37 -07:00
Benson Wong
be82d1a6a0 Support multiline cmds in YAML configuration
Add support for multiline `cmd` configurations, allowing for nicer-looking YAML configuration files.
2024-10-19 20:06:59 -07:00
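
For example, a YAML literal block scalar keeps long commands readable (a sketch; the assumption is that llama-swap treats the newlines as argument breaks):

```yaml
models:
  "llama-8b":
    cmd: |
      llama-server
      --port 9292
      -m /models/llama-8b.gguf
    proxy: http://127.0.0.1:9292
```
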
Benson Wong
8eb5b7b6c4 Add custom check endpoint
Replace the previously hardcoded `/health` path used to check when the
server becomes ready to serve traffic. With this, llama-swap can support
any server that provides an OpenAI compatible inference endpoint.
2024-10-11 21:59:21 -07:00
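
Sketch of a model entry pointing the readiness check somewhere other than `/health` (the command is a placeholder):

```yaml
models:
  "custom-server":
    cmd: my-inference-server --port 9292   # placeholder command
    proxy: http://127.0.0.1:9292
    checkEndpoint: /v1/models              # assumed: polled until it responds successfully
```
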
Benson Wong
d682589fb1 support environment variables 2024-10-04 11:55:27 -07:00
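
A sketch of per-model environment variables, assuming a list of `KEY=value` strings:

```yaml
models:
  "gpu-model":
    env:
      - "CUDA_VISIBLE_DEVICES=0"   # assumed format: a list of KEY=value strings
    cmd: llama-server --port 9292 -m /models/model.gguf
    proxy: http://127.0.0.1:9292
```
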
Benson Wong
bfdba43bd8 improve error handling 2024-10-04 10:55:02 -07:00
Benson Wong
b63b81b121 first commit 2024-10-03 20:20:01 -07:00