llama-swap

Author	SHA1	Message	Date
Benson Wong	2e45f5692a	Update README.md Improve README documentation.	2025-01-01 12:51:24 -08:00
Benson Wong	c97b80bdfe	Update README.md	2025-01-01 12:25:45 -08:00
Benson Wong	ae3ef9bc39	Refactor UI (#33) - add html to / instead of 404 - add client side regex to /logs	2024-12-23 19:48:59 -08:00
Benson Wong	db6715bec3	update golang.org/x/net -> v0.33.0 for dependabot	2024-12-20 11:28:32 -08:00
Benson Wong	da5d9e8a6a	fix HTTP logging so true path is printed	2024-12-20 11:25:01 -08:00
Benson Wong	84b667ca7a	improve logging and error reporting for troubleshooting	2024-12-20 10:46:56 -08:00
Benson Wong	29657106fc	add more OpenAI API supported in README	2024-12-20 10:08:20 -08:00
Benson Wong	9c8860471e	support v1/rerank endpoint	2024-12-17 21:22:25 -08:00
Benson Wong	9b4e3f307e	rename proxy handler	2024-12-17 17:25:10 -08:00
Benson Wong	6fe37c3abf	support /v1/embeddings (#4)	2024-12-17 17:25:10 -08:00
Benson Wong	7f45493a37	Update README.md	2024-12-17 14:45:41 -08:00
Benson Wong	891f6a5b5a	Add /upstream endpoint (#30) * remove catch-all route to upstream proxy (it was broken anyways) * add /upstream/:model_id to swap and route to upstream path * add /upstream HTML endpoint and unlisted option * add /upstream endpoint to show a list of available models * add `unlisted` configuration option to omit a model from /v1/models and /upstream lists * add favicon.ico	2024-12-17 14:37:44 -08:00
Benson Wong	7183f6b43d	fix bad logging due to wrong []byte used #28	2024-12-16 16:22:14 -08:00
Benson Wong	d89bfeb441	add .DS_Store to .gitignore	2024-12-16 12:30:31 -08:00
Benson Wong	9a0c6bed40	Improve stop exceptions (#28) (#29) Stop Process TTL goroutine when process is not ready (#28) - fix issue where the goroutine will continue even though the child process is no longer running and the Process' state is not Ready - fix issue where some logs were going to stdout instead of p.logMonitor causing them to not show up in the /logs - add units to unloading model message	2024-12-16 12:29:25 -08:00
Benson Wong	d6ca535939	tweak release tagging so it is not based on number of commits	2024-12-14 15:46:10 -08:00
Benson Wong	27302c0c02	change llama-swap to use goreleaser default ldflag values	2024-12-14 10:30:06 -08:00
Benson Wong	d4e22cceaa	Fix security vulnerability with golang.org/x/crypto - does not affect the project as llama-swap does not use the crypto libraries - good practice to keep security deps updated!	2024-12-14 10:20:22 -08:00
Benson Wong	4c94927658	Move release to Makefile out of goreleaser - less complexity - easier - goreleaser, github, pipelines: 1... mostlygeek: 0	2024-12-14 10:16:46 -08:00
Benson Wong	a955a4a5c0	create tag to release	2024-12-14 10:07:20 -08:00
Benson Wong	22d3f1a4f9	Change versioning to use git commits counts instead of semver - less work for me - more frequent releases	2024-12-14 09:53:13 -08:00
Benson Wong	e2443251ad	update readme	2024-12-09 19:14:49 -08:00
Benson Wong	5fbd53c616	delay TTL check until after all requests are complete (#25) - fixes #25 where requests that last longer than the TTL will cause the process to be unloaded before the next request. - new behavior, TTL waits until all requests are complete before checking timeout	2024-12-09 19:08:03 -08:00
Benson Wong	97dae50dc4	update readme	2024-12-08 21:34:16 -08:00
Benson Wong	cb978f760f	add web interface to /logs	2024-12-08 21:26:22 -08:00
Benson Wong	387f0ef6c4	use new timings data in server response in run-benchmark.sh	2024-12-03 20:48:36 -08:00
Benson Wong	18c134624d	Add Access-Control-Allow-Origin CORS header to /v1/models endpoint - match behavior of llama.cpp where the Origin in request is used - add test for listModelsHandler	2024-12-03 15:53:59 -08:00
Benson Wong	da2326bdc7	add example: optimizing code generation	2024-12-03 10:25:43 -08:00
Benson Wong	da46545630	fix profile example in README	2024-12-01 10:13:31 -08:00
Benson Wong	04b4760e7e	change profile split character to : (colon) (#21) - change from `/` to `:` for multiple models loaded as part of a profile - breaking change now, but allows for more compatibility with other inference engines that may have model references like `coding:Qwen/Qwen-2.5-Coder-32B`	2024-12-01 09:10:50 -08:00
Benson Wong	9fc5d5b5eb	improve cmd parsing (#22) Switch from using a naive strings.Fields() to shlex.Split() for parsing the model startup command into a string[]. This makes parsing much more reliable around newlines, quotes, etc.	2024-12-01 09:02:58 -08:00
Benson Wong	cf82b3c633	Improve Concurrency and Parallel Request Handling (#19) Rewrite the swap behaviour so that in-flight requests block process swapping until they are completed. Additionally: - add tests for parallel requests with proxy.ProxyManager and proxy.Process - improve Process startup behaviour and simplified the code - stopping of processes are sent SIGTERM and have 5 seconds to terminate, before they are killed	2024-11-30 15:24:42 -08:00
Benson Wong	e363f8f498	clean up writing with AI :b	2024-11-28 22:12:44 -08:00
Benson Wong	c9629cf3a2	add speculative decoding example	2024-11-28 22:07:22 -08:00
Benson Wong	50426935a4	.	2024-11-28 22:06:29 -08:00
Benson Wong	2fceb78e8d	Add examples	2024-11-28 22:05:41 -08:00
Ikko Eltociear Ashimine	9a81c53664	chore: update process_test.go (#17) nonexistant -> nonexistent	2024-11-26 10:20:16 -08:00
Benson Wong	716d37de82	Update README.md fix grammar	2024-11-25 12:35:00 -08:00
Benson Wong	73ad85ea69	Implement Multi-Process Handling (#7) Refactor code to support starting of multiple back end llama.cpp servers. This functionality is exposed as `profiles` to create a simple configuration format. Changes: * refactor proxy tests to get ready for multi-process support * update proxy/ProxyManager to support multiple processes (#7) * Add support for Groups in configuration * improve handling of Model alias configs * implement multi-model swapping * improve code clarity for swapModel * improve docs, rename groups to profiles in config	2024-11-23 19:45:13 -08:00
Benson Wong	533162ce6a	add support for automatically unloading a model (#10) (#14) * Make starting upstream process on-demand (#10) * Add automatic unload of model after TTL is reached * add `ttl` configuration parameter to models in seconds, default is 0 (never unload)	2024-11-19 16:32:51 -08:00
Benson Wong	ba39ed4c18	Add support for legacy v1/completions API (#12)	2024-11-19 09:57:39 -08:00
Benson Wong	21f54f96c2	Merge pull request #13 from mostlygeek/set-content-length Dechunk HTTP requests by default (#11)	2024-11-19 09:46:03 -08:00
Benson Wong	7eec51f3f2	Dechunk HTTP requests by default (#11) ProxyManager already has all the Request body's data. There is never a need to use chunked transfer encoding to the upstream process.	2024-11-19 09:40:44 -08:00
Benson Wong	5021e0f299	remove the process handler override	2024-11-18 21:26:39 -08:00
Benson Wong	c9233d2c9a	use gin instead of standard http lib in main	2024-11-18 15:58:28 -08:00
Benson Wong	a33ac6f8fb	update README	2024-11-18 15:37:50 -08:00
Benson Wong	401aa88949	move log handlers to separate file	2024-11-18 15:33:06 -08:00
Benson Wong	e9e88fd229	rename proxy.go to proxymanager.go	2024-11-18 15:30:34 -08:00
Benson Wong	c3b4bb1684	use gin for http server	2024-11-18 15:30:16 -08:00
Benson Wong	e5c909ddf7	add tests for proxy.Process	2024-11-17 20:49:14 -08:00

91 Commits