diff --git a/examples/README.md b/examples/README.md
index 8e46a31..71bbdd2 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -1,9 +1,6 @@
-# Example Configurations
+# Example Configs and Use Cases
 
-Learning by example is best.
-
-Here in the `examples/` folder are llama-swap configurations that can be used on your local LLM server.
-
-## List
+A collection of use cases and examples for getting the most out of llama-swap.
 
 * [Speculative Decoding](speculative-decoding/README.md) - using a small draft model can increase inference speeds by 20% to 40%. This example includes configurations for Qwen2.5-Coder-32B (2.5x increase) and Llama-3.1-70B (1.4x increase) in the best cases.
+* [Optimizing Code Generation](benchmark-snakegame/README.md) - find the optimal settings for your machine. This example demonstrates defining multiple configurations and testing which one is fastest.
\ No newline at end of file
diff --git a/examples/benchmark-snakegame/README.md b/examples/benchmark-snakegame/README.md
new file mode 100644
index 0000000..93dd3b7
--- /dev/null
+++ b/examples/benchmark-snakegame/README.md
@@ -0,0 +1,123 @@
+# Optimizing Code Generation with llama-swap
+
+Finding the best mix of settings for your hardware can be time-consuming. This example demonstrates using a custom configuration file to automate testing different scenarios and find an optimal configuration.
+
+The benchmark writes a snake game in Python, TypeScript, and Swift using the Qwen 2.5 Coder models. The experiments were run on an RTX 3090 and a Tesla P40.
+
+**Benchmark Scenarios**
+
+Three scenarios are tested:
+
+- `3090-only`: just the main model on the 3090
+- `3090-with-draft`: the main and draft models both on the 3090
+- `3090-P40-draft`: the main model on the 3090 with the draft model offloaded to the P40
+
+**Available Devices**
+
+Use the following command to list the device IDs available for the configuration:
+
+```
+$ /mnt/nvme/llama-server/llama-server-f3252055 --list-devices
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 4 CUDA devices:
+  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+  Device 1: Tesla P40, compute capability 6.1, VMM: yes
+  Device 2: Tesla P40, compute capability 6.1, VMM: yes
+  Device 3: Tesla P40, compute capability 6.1, VMM: yes
+Available devices:
+  CUDA0: NVIDIA GeForce RTX 3090 (24154 MiB, 406 MiB free)
+  CUDA1: Tesla P40 (24438 MiB, 22942 MiB free)
+  CUDA2: Tesla P40 (24438 MiB, 24144 MiB free)
+  CUDA3: Tesla P40 (24438 MiB, 24144 MiB free)
+```
+
+**Configuration**
+
+The configuration file, `benchmark-config.yaml`, defines the three scenarios:
+
+```yaml
+models:
+  "3090-only":
+    proxy: "http://127.0.0.1:9503"
+    cmd: >
+      /mnt/nvme/llama-server/llama-server-f3252055
+      --host 127.0.0.1 --port 9503
+      --flash-attn
+      --slots
+
+      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
+      -ngl 99
+      --device CUDA0
+
+      --ctx-size 32768
+      --cache-type-k q8_0 --cache-type-v q8_0
+
+  "3090-with-draft":
+    proxy: "http://127.0.0.1:9503"
+    # --ctx-size 28500 is the max that fits on the 3090 after the draft model
+    cmd: >
+      /mnt/nvme/llama-server/llama-server-f3252055
+      --host 127.0.0.1 --port 9503
+      --flash-attn
+      --slots
+
+      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
+      -ngl 99
+      --device CUDA0
+
+      --model-draft /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf
+      -ngld 99
+      --draft-max 16
+      --draft-min 4
+      --draft-p-min 0.4
+      --device-draft CUDA0
+
+      --ctx-size 28500
+      --cache-type-k q8_0 --cache-type-v q8_0
+
+  "3090-P40-draft":
+    proxy: "http://127.0.0.1:9503"
+    cmd: >
+      /mnt/nvme/llama-server/llama-server-f3252055
+      --host 127.0.0.1 --port 9503
+      --flash-attn --metrics
+      --slots
+      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
+      -ngl 99
+      --device CUDA0
+
+      --model-draft /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf
+      -ngld 99
+      --draft-max 16
+      --draft-min 4
+      --draft-p-min 0.4
+      --device-draft CUDA1
+
+      --ctx-size 32768
+      --cache-type-k q8_0 --cache-type-v q8_0
+```
+
+> Note: in the `3090-with-draft` scenario the `--ctx-size` had to be reduced from 32768 to 28500 to make room for the draft model on the same GPU.
+
+**Running the Benchmark**
+
+To run the benchmark, execute the following commands:
+
+1. `llama-swap -config benchmark-config.yaml`
+2. `./run-benchmark.sh http://localhost:8080 "3090-only" "3090-with-draft" "3090-P40-draft"`
+
+The [benchmark script](run-benchmark.sh) generates a CSV output of the results, which can be converted to a Markdown table for readability.
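+
+For example, a small `awk` sketch like the one below handles the conversion. It assumes the benchmark's output was saved to a (hypothetical) `results.csv` file:
+
+```sh
+# Print each CSV row as a Markdown table row; emit the |---| separator
+# line immediately after the header row (NR==1).
+awk -F, '{ print "| " $1 " | " $2 " | " $3 " | " $4 " |" } NR==1 { print "|---|---|---|---|" }' results.csv
+```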
+
+**Results (tokens/second)**
+
+| model           | python | typescript | swift |
+|-----------------|--------|------------|-------|
+| 3090-only       | 34.03  | 34.01      | 34.01 |
+| 3090-with-draft | 106.65 | 70.48      | 57.89 |
+| 3090-P40-draft  | 81.54  | 60.35      | 46.50 |
+
+Many factors, such as the programming language being generated, can have a big impact on the performance gains. However, with a custom benchmarking configuration it is easy to test the variations and discover what works best on your hardware.
+
+Happy coding!
\ No newline at end of file
diff --git a/examples/benchmark-snakegame/run-benchmark.sh b/examples/benchmark-snakegame/run-benchmark.sh
new file mode 100755
index 0000000..797112c
--- /dev/null
+++ b/examples/benchmark-snakegame/run-benchmark.sh
@@ -0,0 +1,43 @@
+#!/usr/bin/env bash
+
+# This script generates a CSV file showing the tokens/second when generating a snake game in Python, TypeScript, and Swift.
+# It was created to test the effects of speculative decoding and the various draft settings on performance.
+#
+# Writing code with a low temperature seems to produce fairly consistent logic.
+#
+# Usage: ./run-benchmark.sh <url> <model1> [model2 ...]
+# Example: ./run-benchmark.sh http://localhost:8080 model1 model2
+
+if [ "$#" -lt 2 ]; then
+    echo "Usage: $0 <url> <model1> [model2 ...]"
+    exit 1
+fi
+
+url=$1; shift
+
+echo "model,python,typescript,swift"
+
+for model in "$@"; do
+
+    echo -n "$model,"
+
+    for lang in "python" "typescript" "swift"; do
+        # Request a snake game in $lang. The response body is discarded;
+        # the generation speed is scraped from the llama-swap logs below.
+        response=$(curl -s --url "$url/v1/chat/completions" -d "{\"messages\": [{\"role\": \"system\", \"content\": \"you only write code.\"}, {\"role\": \"user\", \"content\": \"write snake game in $lang\"}], \"temperature\": 0.1, \"model\":\"$model\"}")
+        if [ $? -ne 0 ]; then
+            time="error"
+        else
+            # Pull the most recent "N tokens per second" figure from the logs.
+            # [0-9] character classes are used because \d and (?:...) are PCRE
+            # features that grep -E does not support.
+            time=$(curl -s --url "$url/logs" | grep -oE '[0-9]+(\.[0-9]+)? tokens per second' | awk '{print $1}' | tail -n 1)
+            if [ -z "$time" ]; then
+                time="error"
+            fi
+        fi
+
+        # Comma-separate the columns; no trailing comma after the last one.
+        if [ "$lang" != "swift" ]; then
+            echo -n "$time,"
+        else
+            echo -n "$time"
+        fi
+    done
+
+    echo ""
+done
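+
+# Example: run all three scenarios and save the CSV for later conversion to
+# Markdown (assumes llama-swap is already running on its default port 8080):
+#   ./run-benchmark.sh http://localhost:8080 "3090-only" "3090-with-draft" "3090-P40-draft" | tee results.csv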