clean up writing with AI :b

Benson Wong
2024-11-28 22:12:44 -08:00
parent c9629cf3a2
commit e363f8f498


# Speculative Decoding
Speculative decoding can significantly improve tokens per second. However, this comes at the cost of increased VRAM usage for the draft model. The examples provided are based on a server with three P40s and one 3090.
## Coding Use Case
This example uses Qwen2.5 Coder 32B with the 0.5B model as a draft. A quantization of Q8_0 was chosen for the draft model, as quantization has a greater impact on smaller models.
The models used are:
* [Bartowski Qwen2.5-Coder-32B-Instruct](https://huggingface.co/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF)
* [Bartowski Qwen2.5-Coder-0.5B-Instruct](https://huggingface.co/bartowski/Qwen2.5-Coder-0.5B-Instruct-GGUF)
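If the GGUF files are not already on disk, they can be fetched with `huggingface-cli` (a sketch only: the repositories match the links above, but the `--include` pattern for the 32B model is an assumption since its quantization is not stated here):
```bash
# Q8_0 draft model as described above; the Q4_K_M pattern for the 32B model is
# only an example, pick whichever quantization fits the available VRAM
huggingface-cli download bartowski/Qwen2.5-Coder-0.5B-Instruct-GGUF \
  --include "*Q8_0*" --local-dir ./models
huggingface-cli download bartowski/Qwen2.5-Coder-32B-Instruct-GGUF \
  --include "*Q4_K_M*" --local-dir ./models
```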
The llama-swap configuration is as follows:
```yaml
models:
  # ...
proxy: "http://127.0.0.1:9503"
```
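Only part of the configuration appears in this diff. As a rough sketch of what a complete llama-swap entry with a draft model can look like (the entry name, file paths, context size, and quantizations below are illustrative assumptions, not the exact settings used for the results that follow):
```yaml
models:
  # hypothetical entry name
  "qwen-coder-32b":
    cmd: >
      /path/to/llama-server --port 9503
      --model /models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
      --ctx-size 32000 --gpu-layers 99 --device CUDA0
      --model-draft /models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf
      --gpu-layers-draft 99 --device-draft CUDA1
      --draft-max 8 --draft-min 1 --draft-p-min 0.4
    proxy: "http://127.0.0.1:9503"
```
The `--device` and `--device-draft` flags place the main model and the draft model on separate GPUs, which is the arrangement described below.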
In this configuration, two GPUs are used: a 3090 (CUDA0) for the main model and a P40 (CUDA1) for the draft model. Although both models can fit on the 3090, relocating the draft model to the P40 freed up space for a larger context size. Despite the P40 being about 1/3rd the speed of the 3090, the small model still improved tokens per second.
Multiple tests were run with various parameters, and the fastest result was chosen for the configuration. In all tests, the 0.5B model produced the largest improvements to tokens per second.
Baseline: 33.92 tokens/second on 3090 without a draft model.
| draft-max | draft-min | draft-p-min | Python (tok/s) | TypeScript (tok/s) | Swift (tok/s) |
|-----------|-----------|-------------|----------------|--------------------|---------------|
| 8 | 1 | 0.4 | 67.07 | 55.17 | 48.46 |
| 4 | 1 | 0.4 | 50.13 | 44.96 | 40.79 |
The test script can be found in this [gist](https://gist.github.com/mostlygeek/da429769796ac8a111142e75660820f1). It is a simple curl script that prompts generating a snake game in Python, TypeScript, or Swift. Evaluation metrics were pulled from llama.cpp's logs.
```bash
# $model is assumed to be set to the llama-swap model name being tested
for lang in "python" "typescript" "swift"; do
    echo "Generating Snake Game in $lang using $model"
    # the response body is discarded; timing metrics are read from llama.cpp's logs instead
    curl -s --url http://localhost:8080/v1/chat/completions -d "{\"messages\": [{\"role\": \"system\", \"content\": \"you only write code.\"}, {\"role\": \"user\", \"content\": \"write snake game in $lang\"}], \"temperature\": 0.1, \"model\":\"$model\"}" > /dev/null
done
```
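The numbers above come from the server logs rather than the HTTP responses. Assuming llama-server's output is redirected to a file, something like the following pulls out the per-request speed lines (the exact wording of the timing lines can vary between llama.cpp versions, so treat the pattern as an assumption):
```bash
# collect the generation-speed lines printed by llama-server after each request
grep -i "tokens per second" llama-server.log
```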
Python consistently outperformed Swift in all tests, likely due to the 0.5B draft model being more proficient in generating Python code accepted by the larger 32B model.
## Chat
This configuration is for a regular chat use case. It produces approximately 13 tokens/second in typical use, up from ~9 tokens/second with only the 3xP40s. This is great news for P40 owners.
The models used are:
* [Bartowski Meta-Llama-3.1-70B-Instruct-GGUF](https://huggingface.co/bartowski/Meta-Llama-3.1-70B-Instruct-GGUF)
* [Bartowski Llama-3.2-3B-Instruct-GGUF](https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF)
```yaml
models:
"llama-70B":
    # ...
--tensor-split 0,1,1,1
```
In this configuration, Llama-3.1-70B is split across three P40s, and Llama-3.2-3B is on the 3090.
Some flags deserve further explanation (an illustrative sketch putting them together follows this list):
* `--split-mode row` - increases inference speeds using multiple P40s by about 30%. This is a P40-specific feature.
* `--tensor-split 0,1,1,1` - controls how the main model is split across the GPUs. This means 0% on the 3090 and an even split across the P40s. A value of `--tensor-split 0,5,4,1` would mean 0% on the 3090, 50%, 40%, and 10% respectively across the other P40s. However, this would exceed the available VRAM.
* `--ctx-size 80000` - the maximum context size that can fit in the remaining VRAM.
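A sketch of how these flags can sit together in the `llama-70B` entry (the paths, port, quantizations, and draft settings here are assumptions; only the flags discussed above are taken from the actual configuration):
```yaml
models:
  "llama-70B":
    # main model spread across the three P40s, draft model on the 3090 (CUDA0)
    cmd: >
      /path/to/llama-server --port 9602
      --model /models/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
      --gpu-layers 99 --split-mode row --tensor-split 0,1,1,1
      --ctx-size 80000
      --model-draft /models/Llama-3.2-3B-Instruct-Q8_0.gguf
      --gpu-layers-draft 99 --device-draft CUDA0
    proxy: "http://127.0.0.1:9602"
```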
## What is CUDA0, CUDA1, CUDA2, CUDA3?
These are the GPU device IDs used by llama.cpp. They can be listed with the `--list-devices` flag:
```bash
$ ./llama-server --list-devices
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
...
Available devices:
CUDA0: ...
CUDA1: Tesla P40 (24438 MiB, 24290 MiB free)
CUDA2: Tesla P40 (24438 MiB, 24290 MiB free)
CUDA3: Tesla P40 (24438 MiB, 24290 MiB free)
```
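The order of these IDs is what flags like `--device`, `--device-draft`, and `--tensor-split` refer to. For example (an illustrative fragment, with the model path as a placeholder, not the exact command from the configuration above):
```bash
# 0% of the model on CUDA0 (the 3090) and an even split across CUDA1-CUDA3 (the P40s)
./llama-server --model ./model.gguf --gpu-layers 99 \
  --split-mode row --tensor-split 0,1,1,1
```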