* remove catch-all route to upstream proxy (it was broken anyways)
* add /upstream/:model_id to swap and route to upstream path
* add /upstream HTML endpoint and unlisted option
* add /upstream endpoint to show a list of available models
* add `unlisted` configuration option to omit a model from /v1/models and /upstream lists
* add favicon.ico
Stop Process TTL goroutine when process is not ready (#28)
- fix issue where the goroutine will continue even though the child
process is no longer running and the Process' state is not Ready
- fix issue where some logs were going to stdout instead of p.logMonitor
causing them to not show up in the /logs
- add units to unloading model message
- fixes#25 where requests that last longer than the TTL will cause the
process to be unloaded before the next request.
- new behavior, TTL waits until all requests are complete before
checking timeout
- change from `/` to `:` for multiple models loaded as part of a profile
- breaking change now, but allows for more compatibility with other inference engines that may have model references like `coding:Qwen/Qwen-2.5-Coder-32B`
Switch from using a naive strings.Fields() to shlex.Split() for parsing the model startup command into a string[]. This makes parsing much more reliable around newlines, quotes, etc.
Rewrite the swap behaviour so that in-flight requests block process swapping until they are completed.
Additionally:
- add tests for parallel requests with proxy.ProxyManager and proxy.Process
- improve Process startup behaviour and simplified the code
- stopping of processes are sent SIGTERM and have 5 seconds to terminate, before they are killed
Refactor code to support starting of multiple back end llama.cpp servers. This functionality is exposed as `profiles` to create a simple configuration format.
Changes:
* refactor proxy tests to get ready for multi-process support
* update proxy/ProxyManager to support multiple processes (#7)
* Add support for Groups in configuration
* improve handling of Model alias configs
* implement multi-model swapping
* improve code clarity for swapModel
* improve docs, rename groups to profiles in config
* Make starting upstream process on-demand (#10)
* Add automatic unload of model after TTL is reached
* add `ttl` configuration parameter to models in seconds, default is 0 (never unload)