This is still a work in progress, but since recording the video I have added callbacks for tool use, added more tests, and published it as a Cargo crate. I'm currently working on speeding up the prefill. The decode speed is almost the same on my Ryzen 7950X (\~37 tokens/s), but the prefill is not yet optimized and runs at almost the same speed as decode. The model runs comfortably on a machine with 16GB of RAM; its memory usage stays within \~7GB. You can share the weights between multiple Agent instances, each with its own KV cache. If your agents use the same prompt, you can also clone an Agent instance so that the prefill work on the prompt is not repeated.
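To make the sharing and cloning idea concrete, here is a minimal sketch of how it could look. This is not the crate's actual API: `Weights`, `KvCache`, the fields and methods of `Agent`, and `prefill` are hypothetical names chosen for illustration, and the "prefill" only records token ids instead of running a model. The assumption is that the weights sit behind an `Arc`, so cloning an agent copies only its KV cache.

```rust
use std::sync::Arc;

/// Stand-in for the read-only model weights (hypothetical type).
struct Weights {
    data: Vec<f32>,
}

/// Stand-in for a per-agent KV cache (hypothetical type).
/// Here it only records which tokens have been processed.
#[derive(Clone)]
struct KvCache {
    tokens: Vec<u32>,
}

/// Hypothetical agent: shared weights plus a private KV cache.
#[derive(Clone)]
struct Agent {
    weights: Arc<Weights>, // cloning only bumps a reference count
    cache: KvCache,        // cloning copies the cached state
}

impl Agent {
    /// Create an agent with an empty cache over already-loaded weights.
    fn new(weights: Arc<Weights>) -> Self {
        Agent {
            weights,
            cache: KvCache { tokens: Vec::new() },
        }
    }

    /// Pretend prefill: a real engine would run the model over `prompt`
    /// and store keys and values; this sketch stores the token ids only.
    fn prefill(&mut self, prompt: &[u32]) {
        self.cache.tokens.extend_from_slice(prompt);
    }

    /// Number of tokens currently held in this agent's cache.
    fn cached_len(&self) -> usize {
        self.cache.tokens.len()
    }

    /// Number of weight values visible to this agent (shared, not copied).
    fn weight_count(&self) -> usize {
        self.weights.data.len()
    }
}
```

In this sketch you would prefill one base agent with the common prompt and then call `clone()` once per worker. Each clone starts with the prompt already in its cache, and further prefill or decode work on one clone leaves the others untouched, while all of them point at the same weights allocation.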
BeeLlama.cpp is a fork of llama.cpp that integrates DFlash speculative decoding, TurboQuant/TCQ KV-cache compression, and adaptive draft control, achieving up to 3x faster inference and 7.5x context expansion on the same hardware.