add example: optimizing code generation

examples/README.md
@@ -1,9 +1,6 @@
-# Example Configurations
+# Example Configs and Use Cases

-Learning by example is best.
+A collection of use cases and examples for getting the most out of llama-swap.

-Here in the `examples/` folder are llama-swap configurations that can be used on your local LLM server.
-
-## List
-
 * [Speculative Decoding](speculative-decoding/README.md) - using a small draft model can increase inference speeds by 20% to 40%. This example includes configurations for Qwen2.5-Coder-32B (up to a 2.5x increase) and Llama-3.1-70B (up to a 1.4x increase) in the best cases.
+* [Optimizing Code Generation](benchmark-snakegame/README.md) - find the optimal settings for your machine. This example demonstrates defining multiple configurations and testing which one is fastest.

examples/benchmark-snakegame/README.md (new file, 123 lines)
@@ -0,0 +1,123 @@

# Optimizing Code Generation with llama-swap

Finding the best mix of settings for your hardware can be time-consuming. This example demonstrates using a custom configuration file to automate testing different scenarios and find an optimal configuration.

The benchmark writes a snake game in Python, TypeScript, and Swift using the Qwen 2.5 Coder models. The experiments were run on a 3090 and a P40.

**Benchmark Scenarios**

Three scenarios are tested:

- 3090-only: just the main model on the 3090
- 3090-with-draft: the main and draft models both on the 3090
- 3090-P40-draft: the main model on the 3090 with the draft model offloaded to the P40

**Available Devices**

Use the following command to list the device IDs available for the configuration; the `CUDA0` and `CUDA1` identifiers it reports are what the configuration below passes to `--device` and `--device-draft`:

```
$ /mnt/nvme/llama-server/llama-server-f3252055 --list-devices
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: Tesla P40, compute capability 6.1, VMM: yes
  Device 2: Tesla P40, compute capability 6.1, VMM: yes
  Device 3: Tesla P40, compute capability 6.1, VMM: yes
Available devices:
  CUDA0: NVIDIA GeForce RTX 3090 (24154 MiB, 406 MiB free)
  CUDA1: Tesla P40 (24438 MiB, 22942 MiB free)
  CUDA2: Tesla P40 (24438 MiB, 24144 MiB free)
  CUDA3: Tesla P40 (24438 MiB, 24144 MiB free)
```

**Configuration**

The configuration file, `benchmark-config.yaml`, defines the three scenarios:

```yaml
models:
  "3090-only":
    proxy: "http://127.0.0.1:9503"
    cmd: >
      /mnt/nvme/llama-server/llama-server-f3252055
      --host 127.0.0.1 --port 9503
      --flash-attn
      --slots

      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
      -ngl 99
      --device CUDA0

      --ctx-size 32768
      --cache-type-k q8_0 --cache-type-v q8_0

  "3090-with-draft":
    proxy: "http://127.0.0.1:9503"
    # --ctx-size 28500 max that can fit on 3090 after draft model
    cmd: >
      /mnt/nvme/llama-server/llama-server-f3252055
      --host 127.0.0.1 --port 9503
      --flash-attn
      --slots

      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
      -ngl 99
      --device CUDA0

      --model-draft /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf
      -ngld 99
      --draft-max 16
      --draft-min 4
      --draft-p-min 0.4
      --device-draft CUDA0

      --ctx-size 28500
      --cache-type-k q8_0 --cache-type-v q8_0

  "3090-P40-draft":
    proxy: "http://127.0.0.1:9503"
    cmd: >
      /mnt/nvme/llama-server/llama-server-f3252055
      --host 127.0.0.1 --port 9503
      --flash-attn --metrics
      --slots
      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
      -ngl 99
      --device CUDA0

      --model-draft /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf
      -ngld 99
      --draft-max 16
      --draft-min 4
      --draft-p-min 0.4
      --device-draft CUDA1

      --ctx-size 32768
      --cache-type-k q8_0 --cache-type-v q8_0
```

> Note: in the `3090-with-draft` scenario the `--ctx-size` had to be reduced from 32768 to 28500 to accommodate the draft model.

**Running the Benchmark**

To run the benchmark, execute the following commands:

1. `llama-swap -config benchmark-config.yaml`
2. `./run-benchmark.sh http://localhost:8080 "3090-only" "3090-with-draft" "3090-P40-draft"`
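
Each benchmark request is an ordinary OpenAI-compatible chat completion; llama-swap swaps to whichever configuration is named in the request's `model` field. A single run can be reproduced by hand, for example (a minimal sketch, assuming llama-swap is listening on port 8080 as above):

```bash
# ask for one snake game; llama-swap loads the "3090-with-draft"
# configuration before forwarding the request to llama-server
curl -s --url http://localhost:8080/v1/chat/completions \
  -d '{
        "messages": [
          {"role": "system", "content": "you only write code."},
          {"role": "user", "content": "write snake game in python"}
        ],
        "temperature": 0.1,
        "model": "3090-with-draft"
      }'
```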

The [benchmark script](run-benchmark.sh) generates CSV output of the results, which can be converted to a Markdown table for readability, as sketched below.
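
The conversion is not part of the script; a small awk one-liner is one way to do it (a sketch, with `results.csv` as a hypothetical file name):

```bash
# capture the benchmark output, then render the CSV as a Markdown table
./run-benchmark.sh http://localhost:8080 "3090-only" "3090-with-draft" "3090-P40-draft" > results.csv

# print each row pipe-delimited; emit the separator row after the header
awk -F, '{ printf "| %s | %s | %s | %s |\n", $1, $2, $3, $4;
           if (NR == 1) print "|---|---|---|---|" }' results.csv
```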

**Results (tokens/second)**

| model           | python | typescript | swift |
|-----------------|--------|------------|-------|
| 3090-only       | 34.03  | 34.01      | 34.01 |
| 3090-with-draft | 106.65 | 70.48      | 57.89 |
| 3090-P40-draft  | 81.54  | 60.35      | 46.50 |

Many factors, including the programming language being generated, have a large impact on the gains: the draft model on the same GPU roughly tripled Python throughput (106.65 vs 34.03 tokens/second) but gave Swift only about a 1.7x boost. With a custom configuration file for benchmarking, however, it is easy to test the variations and discover what works best for your hardware.

Happy coding!

examples/benchmark-snakegame/run-benchmark.sh (new executable file, 43 lines)
@@ -0,0 +1,43 @@

#!/usr/bin/env bash

# This script generates a CSV file showing the tokens/second for generating a snake game in python, typescript and swift.
# It was created to test the effects of speculative decoding and the various draft settings on performance.
#
# Writing code with a low temperature seems to provide fairly consistent logic.
#
# Usage: ./run-benchmark.sh <url> <model1> [model2 ...]
# Example: ./run-benchmark.sh http://localhost:8080 model1 model2

if [ "$#" -lt 2 ]; then
    echo "Usage: $0 <url> <model1> [model2 ...]"
    exit 1
fi

url=$1; shift

echo "model,python,typescript,swift"

for model in "$@"; do
    echo -n "$model,"

    for lang in "python" "typescript" "swift"; do
        # each request names a model, which makes llama-swap load that configuration
        response=$(curl -s --url "$url/v1/chat/completions" -d "{\"messages\": [{\"role\": \"system\", \"content\": \"you only write code.\"}, {\"role\": \"user\", \"content\": \"write snake game in $lang\"}], \"temperature\": 0.1, \"model\":\"$model\"}")
        if [ $? -ne 0 ]; then
            time="error"
        else
            # pull the most recent tokens-per-second figure from llama-swap's logs
            # (grep -E has no \d or (?:...) syntax, so use a plain ERE character class)
            time=$(curl -s --url "$url/logs" | grep -oE '[0-9]+(\.[0-9]+)? tokens per second' | awk '{print $1}' | tail -n 1)
            if [ -z "$time" ]; then
                time="error"
            fi
        fi

        if [ "$lang" != "swift" ]; then
            echo -n "$time,"
        else
            echo -n "$time"
        fi
    done

    echo ""
done