Commit Graph

7 Commits

Benson Wong
cf82b3c633 Improve Concurrency and Parallel Request Handling (#19)
Rewrite the swap behaviour so that in-flight requests block process swapping until they complete (a sketch follows this entry).

Additionally: 

- add tests for parallel requests with proxy.ProxyManager and proxy.Process
- improve Process startup behaviour and simplify the code
- stopped processes are sent SIGTERM and given 5 seconds to terminate before they are killed
2024-11-30 15:24:42 -08:00
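
A minimal Go sketch of the behaviour this commit describes, with hypothetical names (`Process`, `BeginRequest`, and `EndRequest` are illustrative, not the repo's actual API): a `sync.WaitGroup` counts in-flight requests so a swap blocks until they drain, and shutdown sends SIGTERM before falling back to a hard kill after 5 seconds.

```go
package proxy

import (
	"os/exec"
	"sync"
	"syscall"
	"time"
)

// Process is a hypothetical sketch of an upstream process wrapper.
type Process struct {
	mu       sync.Mutex     // serializes swaps
	inFlight sync.WaitGroup // counts requests currently being proxied
	cmd      *exec.Cmd
}

// BeginRequest marks a request as in flight; call EndRequest when it
// finishes. In real code these must be coordinated with Stop so new
// requests cannot start once a swap has begun.
func (p *Process) BeginRequest() { p.inFlight.Add(1) }
func (p *Process) EndRequest()   { p.inFlight.Done() }

// Stop blocks further swaps, waits for in-flight requests to complete,
// then asks the process to exit with SIGTERM, killing it after 5s.
func (p *Process) Stop() error {
	p.mu.Lock()
	defer p.mu.Unlock()

	p.inFlight.Wait() // block the swap until requests drain

	if err := p.cmd.Process.Signal(syscall.SIGTERM); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() { done <- p.cmd.Wait() }()

	select {
	case err := <-done:
		return err
	case <-time.After(5 * time.Second):
		return p.cmd.Process.Kill() // grace period expired
	}
}
```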
Benson Wong
73ad85ea69 Implement Multi-Process Handling (#7)
Refactor the code to support starting multiple backend llama.cpp servers. This functionality is exposed as `profiles` in a simple configuration format (a sketch of the shape follows this entry).

Changes: 

* refactor proxy tests to get ready for multi-process support
* update proxy/ProxyManager to support multiple processes (#7)
* Add support for Groups in configuration
* improve handling of Model alias configs
* implement multi-model swapping
* improve code clarity for swapModel
* improve docs, rename groups to profiles in config
2024-11-23 19:45:13 -08:00
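
The commit does not show the configuration schema, so the field and tag names below are assumptions; this is only a sketch of how a `profiles` section grouping named models (with aliases) might be modelled in Go.

```go
package proxy

// Config sketches the shape the commit describes: named models (with
// aliases) grouped into profiles so several llama.cpp servers can be
// managed together. All field and tag names here are assumptions.
type Config struct {
	Models   map[string]ModelConfig `yaml:"models"`
	Profiles map[string][]string    `yaml:"profiles"` // profile name -> member models
}

type ModelConfig struct {
	Cmd     string   `yaml:"cmd"`     // command line that starts the llama.cpp server
	Proxy   string   `yaml:"proxy"`   // upstream address requests are forwarded to
	Aliases []string `yaml:"aliases"` // alternate names that resolve to this model
}
```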
Benson Wong
533162ce6a add support for automatically unloading a model (#10) (#14)
* Make starting the upstream process on-demand (#10)
* Add automatic unloading of a model after its TTL is reached
* add a `ttl` configuration parameter to models, in seconds; the default is 0 (never unload). A sketch follows this entry.
2024-11-19 16:32:51 -08:00
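
A minimal sketch of TTL-based unloading under the stated semantics (`ttl` in seconds, 0 means never unload); the type and method names are hypothetical, and the idea is simply to reset an idle timer on every proxied request.

```go
package proxy

import (
	"sync"
	"time"
)

// ttlUnloader is a hypothetical sketch: unload an idle model once its
// TTL elapses without any requests arriving.
type ttlUnloader struct {
	mu    sync.Mutex
	ttl   time.Duration // 0 means never unload
	timer *time.Timer
	stop  func() // stops the upstream process
}

// Touch resets the idle timer; call it on every proxied request.
func (u *ttlUnloader) Touch() {
	if u.ttl == 0 {
		return // ttl of 0: keep the model loaded forever
	}
	u.mu.Lock()
	defer u.mu.Unlock()
	if u.timer != nil {
		u.timer.Stop()
	}
	u.timer = time.AfterFunc(u.ttl, u.stop)
}
```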
Benson Wong
7eec51f3f2 Dechunk HTTP requests by default (#11)
ProxyManager already has all of the request body's data, so there is
never a need to use chunked transfer encoding when sending to the
upstream process.
2024-11-19 09:40:44 -08:00
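
In Go's net/http, a request is sent with chunked transfer encoding when its length is unknown; since the proxy already holds the full body, it can set `ContentLength` so a plain `Content-Length` header is used instead. A sketch of that idea (the function name is illustrative):

```go
package proxy

import (
	"bytes"
	"io"
	"net/http"
)

// dechunk buffers the request body and sets ContentLength so Go's
// http.Transport sends a Content-Length header to the upstream
// instead of Transfer-Encoding: chunked.
func dechunk(req *http.Request) error {
	body, err := io.ReadAll(req.Body)
	if err != nil {
		return err
	}
	req.Body.Close()
	req.Body = io.NopCloser(bytes.NewReader(body))
	req.ContentLength = int64(len(body))
	req.TransferEncoding = nil // drop any "chunked" carried over
	return nil
}
```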
Benson Wong
5021e0f299 remove the process handler override 2024-11-18 21:26:39 -08:00
Benson Wong
e5c909ddf7 add tests for proxy.Process 2024-11-17 20:49:14 -08:00
Benson Wong
36a31f450f add proxy.Process to manage upstream proxy logic 2024-11-17 16:41:15 -08:00
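
A rough sketch of what such a process wrapper might cover, assuming hypothetical names: launching the upstream llama.cpp server and building a reverse proxy that forwards requests to it.

```go
package proxy

import (
	"net/http/httputil"
	"net/url"
	"os/exec"
)

// startProcess is a hypothetical sketch of what proxy.Process wraps:
// launch the upstream llama.cpp server and build a reverse proxy
// that forwards incoming HTTP requests to it.
func startProcess(upstream, command string, args ...string) (*exec.Cmd, *httputil.ReverseProxy, error) {
	target, err := url.Parse(upstream) // e.g. "http://127.0.0.1:8080"
	if err != nil {
		return nil, nil, err
	}
	cmd := exec.Command(command, args...)
	if err := cmd.Start(); err != nil {
		return nil, nil, err
	}
	return cmd, httputil.NewSingleHostReverseProxy(target), nil
}
```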