diff --git a/README.md b/README.md
index e9cef35..6518e33 100644
--- a/README.md
+++ b/README.md
@@ -72,6 +72,8 @@ profiles:
     - "llama"
 ```
 
+See the [examples](examples/README.md) for more complex configurations and use cases.
+
 ## Installation
 
 1. Create a configuration file, see [config.example.yaml](config.example.yaml)
diff --git a/examples/README.md b/examples/README.md
new file mode 100644
index 0000000..8e46a31
--- /dev/null
+++ b/examples/README.md
@@ -0,0 +1,9 @@
+# Example Configurations
+
+Learning by example is best.
+
+The `examples/` folder contains llama-swap configurations you can adapt for your local LLM server.
+
+## List
+
+* [Speculative Decoding](speculative-decoding/README.md) - using a small draft model can increase inference speed by 20% to 40%. This example includes configurations for Qwen2.5-Coder-32B (up to a 2.5x increase) and Llama-3.1-70B (up to a 1.4x increase).
diff --git a/examples/speculative-decoding/README.md b/examples/speculative-decoding/README.md
new file mode 100644
index 0000000..ff8340d
--- /dev/null
+++ b/examples/speculative-decoding/README.md
@@ -0,0 +1,3 @@
+# Qwen 2.5 Coder with a Draft Model
+
+Using a small draft model like Qwen2.5-Coder-0.5B can significantly speed up inference for the larger 32 billion parameter model.
\ No newline at end of file
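
The speculative-decoding example added above maps to a llama-swap model entry that passes llama.cpp's draft-model flags through to `llama-server`. Below is a minimal sketch, assuming a `llama-server` build with speculative decoding support and llama-swap's `${PORT}` macro; the model paths, quantizations, and draft parameters (`--draft-max`, `--draft-min`) are illustrative assumptions, not values taken from the example files:

```yaml
models:
  "qwen-coder-32b":
    # Main 32B model paired with a 0.5B draft model for speculative decoding.
    # llama-swap substitutes ${PORT} with the port it assigns at startup.
    # Paths and draft parameters below are placeholders; tune for your setup.
    cmd: >
      llama-server --host 127.0.0.1 --port ${PORT}
      -m /models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf
      -md /models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf
      --draft-max 16 --draft-min 4
    proxy: "http://127.0.0.1:${PORT}"
```

Note that speculative decoding requires the draft model's vocabulary to be compatible with the main model's, which is why a 0.5B model from the same Qwen2.5-Coder family is the natural pairing here.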