Tag
Ported the EXL3 LLM codec to Apple Silicon via Metal, with fast prefill and generation on an M5 Max (e.g., ~600 tok/s prefill and 17-80 tok/s generation, depending on the model).
An overview of popular open-source inference engines including vLLM, SGLang, llama.cpp, and ExLlamaV3 for hosting and running large language models.