BeeLlama.cpp is a performance-focused fork of llama.cpp that introduces DFlash speculative decoding and TurboQuant KV-cache compression, enabling high-speed local inference of large models like Qwen 3.6 27B on consumer hardware.
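The summary names DFlash speculative decoding but does not describe its internals; as background, the general speculative-decoding loop it builds on can be sketched as follows. This is a toy greedy-verification version with hypothetical `target` and `draft` callables, not BeeLlama.cpp's actual DFlash implementation.

```python
# Toy sketch of generic speculative decoding (not BeeLlama.cpp's DFlash
# specifics): a cheap draft model proposes k tokens, and the expensive
# target model accepts the longest prefix it agrees with, then emits one
# token of its own. All model callables here are illustrative stand-ins.

def speculative_step(target, draft, prefix, k=4):
    # Draft phase: propose k tokens autoregressively with the cheap model.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)

    # Verify phase: keep proposed tokens while the target model agrees.
    accepted, ctx = [], list(prefix)
    for t in proposal:
        if target(ctx) == t:  # greedy agreement check
            accepted.append(t)
            ctx.append(t)
        else:
            break
    # The target always contributes one token, so progress is guaranteed.
    accepted.append(target(ctx))
    return accepted

# Usage with toy deterministic "models": target counts up by one, the
# draft agrees except when the last token is divisible by 3.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] % 3 else ctx[-1]
print(speculative_step(target, draft, [1]))  # → [2, 3, 4]
```

When draft and target agree often, each target-model pass yields several tokens instead of one, which is the source of the speedup.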
A user shares a configuration for achieving over 80 tokens per second with Qwen3.6 35B A3B on a 12GB VRAM GPU using llama.cpp and Multi-Token Prediction (MTP). The post includes benchmark results and specific command-line parameters to optimize performance.
This paper introduces TwELL and Hybrid sparse formats with custom CUDA kernels to efficiently leverage unstructured sparsity in LLMs, achieving over 20% faster training and inference on H100 GPUs while reducing energy and memory usage.
vLLM v0.20.0, an open-source library for high-throughput LLM inference and serving, is released, featuring PagedAttention and support for various hardware architectures.
Researchers from MIT and IBM have developed a rapid tool that estimates AI power consumption in seconds, significantly faster than traditional emulation methods, to help optimize data center energy efficiency.
Deepseek open-sourced DeepEP V2 and TileKernels, new GPU kernel libraries aimed at accelerating AI workloads.
vLLM 0.20.0rc1 is released with major enhancements to throughput, quantization, speculative decoding, and multi-hardware support for scalable LLM serving.
A 31B-parameter model runs locally on a laptop via the Hermes agent at 15 tok/s, using 22.8 GB of VRAM and 94 W of power, highlighting fully autonomous, private AI inference without cloud dependencies.
NVIDIA and Google collaborate to optimize Gemma 4 models for local deployment across RTX GPUs, DGX Spark, and Jetson devices, enabling efficient on-device agentic AI with support for reasoning, coding, multimodal capabilities, and 35+ languages.
OpenAI presents comprehensive techniques for training large neural networks across distributed GPU clusters, covering data parallelism, pipeline parallelism, tensor parallelism, and mixture-of-experts approaches to overcome engineering and scalability challenges.
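Of the strategies listed above, data parallelism is the simplest to illustrate: each worker computes gradients on its own data shard, then an all-reduce averages them so every replica applies the same update. The sketch below is a minimal single-process simulation with a toy squared-error loss, not OpenAI's training code.

```python
# Minimal sketch of data parallelism: each worker computes a gradient on
# its own shard, an all-reduce averages the gradients, and every replica
# applies the identical update. The loss and data here are illustrative.

def local_gradient(w, shard):
    # Gradient of the mean loss 0.5 * (w*x - y)**2 over this worker's shard.
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    # Stand-in for an NCCL/MPI all-reduce: average gradients across workers.
    return sum(grads) / len(grads)

def data_parallel_step(w, shards, lr=0.1):
    grads = [local_gradient(w, s) for s in shards]  # computed in parallel
    g = all_reduce_mean(grads)                      # synchronization point
    return w - lr * g                               # same update on all replicas

# Usage: two workers, each holding half of a dataset where y = 2*x,
# so SGD should drive w toward 2.0.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
print(round(w, 3))  # → 2.0
```

Pipeline and tensor parallelism instead split the model itself (by layer and within layers, respectively), trading this simple gradient averaging for more intricate communication schedules.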