gpu-optimization

Tag · Cards List
#gpu-optimization

BeeLlama.cpp: advanced DFlash & TurboQuant with support for reasoning and vision. Qwen 3.6 27B Q5 with 200k context on 3090, 2-3x faster than baseline (peak 135 tps!)

Reddit r/LocalLLaMA · 3h ago

BeeLlama.cpp is a performance-focused fork of llama.cpp that introduces DFlash speculative decoding and TurboQuant KV-cache compression, enabling high-speed local inference of large models like Qwen 3.6 27B on consumer hardware.
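
DFlash is specific to the fork and its interface isn't described in the card; as a rough illustration of the speculative-decoding idea it builds on (a cheap draft model proposes a block of tokens that the large target model then verifies), here is a minimal greedy sketch in plain Python. The `draft_next` and `target_next` callables are hypothetical stand-ins, not BeeLlama.cpp APIs.

```python
# Toy sketch of draft-and-verify speculative decoding (greedy case).
# `draft_next` and `target_next` are hypothetical stand-ins for a small
# draft model and the large target model; neither comes from BeeLlama.cpp.

def speculative_step(prefix, draft_next, target_next, k=4):
    # 1) Draft model proposes k tokens autoregressively (cheap).
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2) Target model checks the proposed block and accepts the longest
    #    prefix it agrees with (simulated token-by-token here for clarity).
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        if target_next(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break

    # 3) Always emit one token from the target so progress is guaranteed.
    accepted.append(target_next(ctx))
    return prefix + accepted

# Example with trivial stand-ins: the draft guesses "next = last + 1",
# the target agrees only up to 10.
draft = lambda ctx: ctx[-1] + 1
target = lambda ctx: min(ctx[-1] + 1, 10)
print(speculative_step([7], draft, target))   # [7, 8, 9, 10, 10]
```

In a real implementation, step 2 is a single batched forward pass over all proposed tokens rather than a loop, which is where the speedup over token-by-token decoding comes from.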

#gpu-optimization

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

Reddit r/LocalLLaMA · 7h ago

A user shares a configuration for achieving over 80 tokens per second with Qwen3.6 35B A3B on a 12GB VRAM GPU using llama.cpp and Multi-Token Prediction (MTP). The post includes benchmark results and specific command-line parameters to optimize performance.
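
The post's exact command line isn't reproduced in the card; purely as a sketch of the kind of configuration involved (large context, full GPU offload, quantized KV cache, MoE expert weights kept in system RAM), a hypothetical llama-server launch might look like the following. The model filename and flag values are placeholders, and the MTP-specific options from the post are omitted.

```python
# Hypothetical llama-server launch; illustrative only. Model filename and
# values are placeholders, and the post's MTP-specific options are omitted.
import subprocess

cmd = [
    "llama-server",
    "-m", "Qwen3.6-35B-A3B-Q4_K_M.gguf",       # placeholder model file
    "-c", "131072",                             # 128K context window
    "--n-gpu-layers", "99",                     # offload all layers to the GPU
    "--cache-type-k", "q8_0",                   # quantize K cache to save VRAM
    "--cache-type-v", "q8_0",                   # quantize V cache (may need flash attention)
    "--override-tensor", "ffn_.*_exps.*=CPU",   # keep MoE expert weights in system RAM
    "--port", "8080",
]
subprocess.run(cmd, check=True)
```

Keeping the sparsely activated expert tensors in system RAM while the dense layers and KV cache stay on the GPU is a common way to fit large MoE models into 12 GB of VRAM.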

#gpu-optimization

@hardmaru: The human brain is incredibly efficient because it only activates the specific neurons needed for a thought. Modern LLM…

X AI KOLs Timeline · yesterday

This paper introduces TwELL and Hybrid sparse formats with custom CUDA kernels to efficiently leverage unstructured sparsity in LLMs, achieving over 20% faster training and inference on H100 GPUs while reducing energy and memory usage.
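
TwELL and Hybrid are the paper's own formats and aren't specified in the card; as background, the sketch below packs an unstructured sparse matrix into the classic ELL(PACK) layout that such formats typically extend: every row gets the same number of (value, column) slots, which keeps memory access regular enough for GPU kernels.

```python
# Minimal ELL(PACK)-style packing of an unstructured sparse matrix.
# Generic background only, not the TwELL/Hybrid format from the paper.
import numpy as np

def to_ell(dense, pad_col=0):
    rows, _ = dense.shape
    width = int((dense != 0).sum(axis=1).max())   # slots per row = max row nnz
    values = np.zeros((rows, width), dtype=dense.dtype)
    cols = np.full((rows, width), pad_col, dtype=np.int64)
    for r in range(rows):
        idx = np.nonzero(dense[r])[0]
        values[r, :len(idx)] = dense[r, idx]
        cols[r, :len(idx)] = idx
    return values, cols

def ell_matvec(values, cols, x):
    # y[r] = sum over slots of values[r, s] * x[cols[r, s]]
    return (values * x[cols]).sum(axis=1)

A = np.array([[0., 2., 0., 1.],
              [3., 0., 0., 0.],
              [0., 0., 4., 5.]])
vals, cols = to_ell(A)
x = np.arange(4, dtype=float)
assert np.allclose(ell_matvec(vals, cols, x), A @ x)
```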

#gpu-optimization

vllm-project/vllm v0.20.0

GitHub Releases Watchlist · 2026-04-27

vLLM, an open-source library for high-throughput LLM inference and serving featuring PagedAttention and support for various hardware architectures, releases v0.20.0.
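
The card doesn't detail the release itself; for context on how vLLM is used, a minimal offline-inference example with its Python API is shown below. The model id, prompt, and sampling settings are arbitrary placeholders rather than anything from the release notes.

```python
# Minimal offline batched generation with vLLM (model id is a placeholder).
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")        # any HF-compatible model id
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```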

#gpu-optimization

A faster way to estimate AI power consumption

MIT News — Artificial Intelligence · 2026-04-27

Researchers from MIT and IBM have developed a rapid tool that estimates AI power consumption in seconds, significantly faster than traditional emulation methods, to help optimize data center energy efficiency.
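
The article doesn't describe the estimator's method. Purely to make the quantity concrete, the sketch below does a crude utilization-times-TDP energy estimate; the formula and every number in it are illustrative assumptions, not the MIT/IBM tool.

```python
# Crude back-of-envelope GPU energy estimate; not the MIT/IBM tool's method.
# All inputs are hypothetical placeholders.
def training_energy_kwh(num_gpus, avg_utilization, tdp_watts, hours, pue=1.2):
    gpu_kw = num_gpus * avg_utilization * tdp_watts / 1000.0
    return gpu_kw * hours * pue   # PUE folds in cooling/facility overhead

print(training_energy_kwh(num_gpus=256, avg_utilization=0.85,
                          tdp_watts=700, hours=72))   # roughly 13,000 kWh
```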

#gpu-optimization

DeepSeek has released DeepEP V2 and TileKernels.

Reddit r/LocalLLaMA · 2026-04-23

DeepSeek has open-sourced DeepEP V2 and TileKernels, new GPU kernel libraries aimed at accelerating AI workloads.

#gpu-optimization

vllm-project/vllm v0.20.0rc1

GitHub Releases Watchlist · 2026-04-22

vLLM 0.20.0rc1 is released with major enhancements to throughput, quantization, speculative decoding, and multi-hardware support for scalable LLM serving.

#gpu-optimization

@sudoingX: this is a laptop running a 31b parameter model at 99% gpu autonomously through hermes agent, 15 tok/s sustained, 22.8 o…

X AI KOLs Timeline · 2026-04-20

A 31B-parameter model runs locally on a laptop via the Hermes agent at a sustained 15 tok/s, using 22.8 GB of VRAM and 94 W of power, highlighting fully autonomous, private AI inference without cloud dependencies.
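
As a sanity check on figures like the 22.8 GB reported here, a quantized model's memory footprint can be roughly estimated from its parameter count and bits per weight. The numbers below (quantization level, KV-cache and overhead sizes) are assumptions for illustration, not the poster's actual setup.

```python
# Rough VRAM estimate for a quantized model: weights + KV cache + overhead.
# All numbers are illustrative assumptions, not the exact setup from the post.
def vram_gb(params_b, bits_per_weight, kv_cache_gb=1.5, overhead_gb=1.0):
    weights_gb = params_b * 1e9 * bits_per_weight / 8 / 1024**3
    return weights_gb + kv_cache_gb + overhead_gb

# A 31B model at ~5 bits/weight needs about 18 GB for weights alone, so a
# total in the low 20s of GB is plausible once the KV cache and runtime
# buffers are included.
print(round(vram_gb(31, 5.0), 1))
```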

#gpu-optimization

From RTX to Spark: NVIDIA Accelerates Gemma 4 for Local Agentic AI

NVIDIA Blog · 2026-04-02

NVIDIA and Google collaborate to optimize Gemma 4 models for local deployment across RTX GPUs, DGX Spark, and Jetson devices, enabling efficient on-device agentic AI with support for reasoning, coding, multimodal capabilities, and 35+ languages.

#gpu-optimization

Techniques for training large neural networks

OpenAI Blog · 2022-06-09

OpenAI presents comprehensive techniques for training large neural networks across distributed GPU clusters, covering data parallelism, pipeline parallelism, tensor parallelism, and mixture-of-experts approaches to overcome engineering and scalability challenges.
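
As a toy illustration of one of those techniques, the sketch below shards a linear layer's weight matrix column-wise across two simulated devices (tensor parallelism) and checks that concatenating the partial outputs reproduces the unsharded result; it is a NumPy analogy, not OpenAI's training stack.

```python
# Toy tensor parallelism: shard a linear layer column-wise over two "devices"
# and recombine. NumPy stands in for per-GPU compute; not a training framework.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # batch of activations
W = rng.normal(size=(8, 6))          # full weight matrix

# Each device holds half of W's output columns and computes its slice locally.
W0, W1 = W[:, :3], W[:, 3:]
y0 = x @ W0                          # device 0's partial output
y1 = x @ W1                          # device 1's partial output

# An all-gather along the feature dimension reconstructs the full output.
y = np.concatenate([y0, y1], axis=1)
assert np.allclose(y, x @ W)
```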
