llama-swap/examples/benchmark-snakegame

Optimizing Code Generation with llama-swap

Finding the best mix of settings for your hardware can be time-consuming. This example demonstrates how a custom configuration file can automate testing different scenarios to find an optimal configuration.

The benchmark has the Qwen 2.5 Coder models write a snake game in Python, TypeScript, and Swift. The experiments were run on an RTX 3090 and a Tesla P40.

Benchmark Scenarios

Three scenarios are tested:

  • 3090-only: just the main model on the 3090
  • 3090-with-draft: the main and draft models both on the 3090
  • 3090-P40-draft: the main model on the 3090 with the draft model offloaded to the P40

Available Devices

Use the following command to list the available device IDs for the configuration:

$ /mnt/nvme/llama-server/llama-server-f3252055 --list-devices
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: Tesla P40, compute capability 6.1, VMM: yes
  Device 2: Tesla P40, compute capability 6.1, VMM: yes
  Device 3: Tesla P40, compute capability 6.1, VMM: yes
Available devices:
  CUDA0: NVIDIA GeForce RTX 3090 (24154 MiB, 406 MiB free)
  CUDA1: Tesla P40 (24438 MiB, 22942 MiB free)
  CUDA2: Tesla P40 (24438 MiB, 24144 MiB free)
  CUDA3: Tesla P40 (24438 MiB, 24144 MiB free)

Configuration

The configuration file, benchmark-config.yaml, defines the three scenarios:

models:
  "3090-only":
    proxy: "http://127.0.0.1:9503"
    cmd: >
      /mnt/nvme/llama-server/llama-server-f3252055
      --host 127.0.0.1 --port 9503
      --flash-attn
      --slots

      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
      -ngl 99
      --device CUDA0

      --ctx-size 32768
      --cache-type-k q8_0 --cache-type-v q8_0

  "3090-with-draft":
    proxy: "http://127.0.0.1:9503"
    # --ctx-size 28500 max that can fit on 3090 after draft model
    cmd: >
      /mnt/nvme/llama-server/llama-server-f3252055
      --host 127.0.0.1 --port 9503
      --flash-attn
      --slots

      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
      -ngl 99
      --device CUDA0

      --model-draft /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf
      -ngld 99
      --draft-max 16
      --draft-min 4
      --draft-p-min 0.4
      --device-draft CUDA0

      --ctx-size 28500
      --cache-type-k q8_0 --cache-type-v q8_0

  "3090-P40-draft":
    proxy: "http://127.0.0.1:9503"
    cmd: >
      /mnt/nvme/llama-server/llama-server-f3252055
      --host 127.0.0.1 --port 9503
      --flash-attn --metrics
      --slots
      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
      -ngl 99
      --device CUDA0

      --model-draft /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf
      -ngld 99
      --draft-max 16
      --draft-min 4
      --draft-p-min 0.4
      --device-draft CUDA1

      --ctx-size 32768
      --cache-type-k q8_0 --cache-type-v q8_0

Note that in the 3090-with-draft scenario the --ctx-size had to be reduced from 32768 to 28500 to accommodate the draft model.
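
When tuning --ctx-size like this, it helps to watch VRAM headroom while a scenario is loaded. One generic way to do that (not part of this example) is to poll nvidia-smi:

# report per-GPU memory usage once per second (generic check, not part of the example)
watch -n 1 nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv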

Running the Benchmark

To run the benchmark, execute the following commands:

  1. llama-swap -config benchmark-config.yaml
  2. ./run-benchmark.sh http://localhost:8080 "3090-only" "3090-with-draft" "3090-P40-draft"
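
The scenario names passed to run-benchmark.sh match the model entries in benchmark-config.yaml. llama-swap picks which configuration to load based on the model field of each incoming request, so a request such as the one below (the prompt is only illustrative) would cause the 3090-with-draft configuration to be started:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "3090-with-draft",
    "messages": [{"role": "user", "content": "Write a snake game in Python."}]
  }'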

The benchmark script generates a CSV output of the results, which can be converted to a Markdown table for readability.
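
For example, assuming the results land in a file such as results.csv with a header row of model,python,typescript,swift (the filename and column layout here are assumptions; check the script's actual output), a small awk one-liner can do the conversion:

# convert a simple comma-separated results file into a Markdown table
# (results.csv and its column layout are assumed for illustration)
awk -F',' '{ printf "| %s | %s | %s | %s |\n", $1, $2, $3, $4 } NR==1 { print "|---|---|---|---|" }' results.csv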

Results (tokens/second)

model             python   typescript   swift
3090-only          34.03        34.01   34.01
3090-with-draft   106.65        70.48   57.89
3090-P40-draft     81.54        60.35   46.50

Many factors, such as the programming language being generated, can have a big impact on the performance gains. However, with a custom configuration file for benchmarking it is easy to test the different variations and discover what works best for your hardware.

Happy coding!