Files

Benson Wong 9fc0431531 Clean up and Documentation (#347 ) [skip ci]

* cmd,misc: move misc binaries to cmd/
* docs: add docs and move examples/ there
* misc: remove unused misc/assets dir
* docs: add configuration.md
* update README with better structure

Updates: #334

2025-10-19 14:53:13 -07:00

4.0 KiB

Raw Permalink Blame History

Optimizing Code Generation with llama-swap

Finding the best mix of settings for your hardware can be time consuming. This example demonstrates using a custom configuration file to automate testing different scenarios to find the an optimal configuration.

The benchmark writes a snake game in Python, TypeScript, and Swift using the Qwen 2.5 Coder models. The experiments were done using a 3090 and a P40.

Benchmark Scenarios

Three scenarios are tested:

3090-only: Just the main model on the 3090
3090-with-draft: the main and draft models on the 3090
3090-P40-draft: the main model on the 3090 with the draft model offloaded to the P40

Available Devices

Use the following command to list available devices IDs for the configuration:

$ /mnt/nvme/llama-server/llama-server-f3252055 --list-devices
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: Tesla P40, compute capability 6.1, VMM: yes
  Device 2: Tesla P40, compute capability 6.1, VMM: yes
  Device 3: Tesla P40, compute capability 6.1, VMM: yes
Available devices:
  CUDA0: NVIDIA GeForce RTX 3090 (24154 MiB, 406 MiB free)
  CUDA1: Tesla P40 (24438 MiB, 22942 MiB free)
  CUDA2: Tesla P40 (24438 MiB, 24144 MiB free)
  CUDA3: Tesla P40 (24438 MiB, 24144 MiB free)

Configuration

The configuration file, benchmark-config.yaml, defines the three scenarios:

models:
  "3090-only":
    proxy: "http://127.0.0.1:9503"
    cmd: >
      /mnt/nvme/llama-server/llama-server-f3252055
      --host 127.0.0.1 --port 9503
      --flash-attn
      --slots

      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
      -ngl 99
      --device CUDA0

      --ctx-size 32768
      --cache-type-k q8_0 --cache-type-v q8_0

  "3090-with-draft":
    proxy: "http://127.0.0.1:9503"
    # --ctx-size 28500 max that can fit on 3090 after draft model
    cmd: >
      /mnt/nvme/llama-server/llama-server-f3252055
      --host 127.0.0.1 --port 9503
      --flash-attn
      --slots

      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
      -ngl 99
      --device CUDA0

      --model-draft /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf
      -ngld 99
      --draft-max 16
      --draft-min 4
      --draft-p-min 0.4
      --device-draft CUDA0

      --ctx-size 28500
      --cache-type-k q8_0 --cache-type-v q8_0

  "3090-P40-draft":
    proxy: "http://127.0.0.1:9503"
    cmd: >
      /mnt/nvme/llama-server/llama-server-f3252055
      --host 127.0.0.1 --port 9503
      --flash-attn --metrics
      --slots
      --model /mnt/nvme/models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
      -ngl 99
      --device CUDA0

      --model-draft /mnt/nvme/models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf
      -ngld 99
      --draft-max 16
      --draft-min 4
      --draft-p-min 0.4
      --device-draft CUDA1

      --ctx-size 32768
      --cache-type-k q8_0 --cache-type-v q8_0

Note in the 3090-with-draft scenario the --ctx-size had to be reduced from 32768 to to accommodate the draft model.

Running the Benchmark

To run the benchmark, execute the following commands:

llama-swap -config benchmark-config.yaml
./run-benchmark.sh http://localhost:8080 "3090-only" "3090-with-draft" "3090-P40-draft"

The benchmark script generates a CSV output of the results, which can be converted to a Markdown table for readability.

Results (tokens/second)

model	python	typescript	swift
3090-only	34.03	34.01	34.01
3090-with-draft	106.65	70.48	57.89
3090-P40-draft	81.54	60.35	46.50

Many different factors, like the programming language, can have big impacts on the performance gains. However, with a custom configuration file for benchmarking it is easy to test the different variations to discover what's best for your hardware.

Happy coding!

4.0 KiB Raw Permalink Blame History

Optimizing Code Generation with llama-swap

4.0 KiB

Raw Permalink Blame History