LLaMAGate
A Go gateway that automatically manages llama-server, launching whichever model an incoming HTTP request asks for. Serve all the models you have downloaded without manually swapping between them.
Created because I wanted:
- ✅ easy to deploy: single binary with no dependencies
- ✅ full control over llama-server's startup settings
- ✅ ❤️ for Nvidia P40 users who rely on llama.cpp's row split mode for large models
YAML Configuration
# Seconds to wait for llama.cpp to be available to serve requests
# Default (and minimum): 15 seconds
healthCheckTimeout: 60

# define models
models:
  "llama":
    env:
      - "CUDA_VISIBLE_DEVICES=0"
    cmd: "llama-server --port 8999 -m Llama-3.2-1B-Instruct-Q4_K_M.gguf"

    # address where llama-server can be reached
    proxy: "http://127.0.0.1:8999"

    # list of aliases this llama.cpp instance can also serve
    aliases:
      - "gpt-4o-mini"
      - "gpt-3.5-turbo"

  "qwen":
    cmd: "llama-server --port 8999 -m path/to/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf"
    proxy: "http://127.0.0.1:8999"
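The config maps each model name (and its aliases) to the command used to start llama-server and the proxy address where that instance will listen. As a rough illustration of the schema only, not the project's actual internal types, here is a sketch of how the fields above could be parsed in Go with gopkg.in/yaml.v3:

```go
package main

import (
	"fmt"
	"log"

	"gopkg.in/yaml.v3"
)

// Illustrative types mirroring the YAML above; field names follow the config keys.
type ModelConfig struct {
	Env     []string `yaml:"env"`
	Cmd     string   `yaml:"cmd"`
	Proxy   string   `yaml:"proxy"`
	Aliases []string `yaml:"aliases"`
}

type Config struct {
	HealthCheckTimeout int                    `yaml:"healthCheckTimeout"`
	Models             map[string]ModelConfig `yaml:"models"`
}

func main() {
	raw := []byte(`
healthCheckTimeout: 60
models:
  "llama":
    cmd: "llama-server --port 8999 -m Llama-3.2-1B-Instruct-Q4_K_M.gguf"
    proxy: "http://127.0.0.1:8999"
    aliases: ["gpt-4o-mini", "gpt-3.5-turbo"]
`)
	var cfg Config
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		log.Fatal(err)
	}
	fmt.Println(cfg.Models["llama"].Aliases) // [gpt-4o-mini gpt-3.5-turbo]
}
```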
Testing with CURL
> curl http://localhost:8080/v1/chat/completions -N -d '{"messages":[{"role":"user","content":"write a 3 word story"}], "model":"llama"}'| jq -c '.choices[].message.content'
# will reuse the llama-server instance
> curl http://localhost:8080/v1/chat/completions -N -d '{"messages":[{"role":"user","content":"write a 3 word story"}], "model":"gpt-4o-mini"}'| jq -c '.choices[].message.content'
# swap to Qwen2.5-1.5B-Instruct-Q4_K_M.gguf
> curl http://localhost:8080/v1/chat/completions -N -d '{"messages":[{"role":"user","content":"write a 3 word story"}], "model":"qwen"}'| jq -c '.choices[].message.content'
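Any OpenAI-compatible client works the same way. A minimal Go sketch of the requests above, assuming the gateway is listening on localhost:8080 as in the curl examples:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string    `json:"model"`
	Messages []message `json:"messages"`
}

type chatResponse struct {
	Choices []struct {
		Message message `json:"message"`
	} `json:"choices"`
}

func main() {
	body, _ := json.Marshal(chatRequest{
		Model:    "llama", // or "gpt-4o-mini" / "qwen"; the gateway swaps backends as needed
		Messages: []message{{Role: "user", Content: "write a 3 word story"}},
	})

	resp, err := http.Post("http://localhost:8080/v1/chat/completions", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var out chatResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		log.Fatal(err)
	}
	for _, c := range out.Choices {
		fmt.Println(c.Message.Content)
	}
}
```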