AirLLM is an open-source tool that optimizes inference memory usage, enabling 70B LLMs to run on a single 4GB GPU without quantization, and supports 405B models on 8GB VRAM.
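A usage sketch, modeled on AirLLM's README (the model ID, tokenizer call, and generation arguments are examples and may differ across AirLLM versions):

```python
from airllm import AutoModel

# AirLLM streams one transformer layer at a time from disk, so the full 70B
# model never has to fit in VRAM at once.
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

input_tokens = model.tokenizer(
    ["What is the capital of the United States?"],
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=128,
)

output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True,
)
print(model.tokenizer.decode(output.sequences[0]))
```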
This paper introduces CATS, a cascaded adaptive tree speculation framework designed to accelerate LLM inference on memory-constrained edge devices by optimizing memory usage while maintaining high token acceptance rates.
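For context, the acceptance rate here refers to the standard speculative-sampling accept/reject test; a minimal generic sketch of that rule (not CATS's cascaded tree logic, which the paper defines):

```python
import random

def accept_draft_token(p_target: float, q_draft: float) -> bool:
    """Standard speculative-sampling rule (generic, not CATS-specific): accept a
    drafted token with probability min(1, p/q), where p and q are the target and
    draft models' probabilities for that token. On rejection, the target model
    resamples from the residual distribution max(0, p - q) (not shown here)."""
    return random.random() < min(1.0, p_target / q_draft)
```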
This paper introduces DMI-Lib, a high-speed deep model inspector that enables efficient internal observability for LLM inference by decoupling monitoring from the inference hot path.
This paper introduces Ada-MK, an adaptive MegaKernel optimization method that uses automated DAG-based search to eliminate runtime branching and reduce shared memory usage in LLM inference. Integrated with TensorRT-LLM on NVIDIA Ada GPUs, it delivers up to 23.6% higher throughput than vanilla TensorRT-LLM in commercial advertising systems.
vLLM v0.21.0rc1 is a pre-release update for the high-performance LLM inference and serving library, featuring optimizations for throughput, quantization, and hardware support.
A user discusses the trade-offs between vLLM and llama.cpp for local, single-user inference on AMD hardware, questioning whether vLLM's performance benefits justify its complexity in non-enterprise settings.
A performance test demonstrates the impact of Low, Automatic, and High power modes on LLM inference speed on an M5 Max MacBook, showing significant differences in token generation rates and power consumption.
This paper identifies 'attention drift' in autoregressive speculative decoding models, where drafters' attention shifts from the prompt to their own generated tokens. The authors propose architectural changes, such as post-norm and RMSNorm, which improve acceptance rates and robustness across various benchmarks.
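For reference, a minimal generic PyTorch sketch of RMSNorm (a standard formulation, not the paper's exact drafter block):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: scale activations by the inverse root-mean-square of the hidden
    state with a learned gain -- no mean subtraction or bias as in LayerNorm."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight
```

In a post-norm configuration, this normalization sits after each attention or MLP sublayer rather than before it.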
Luce releases DFlash and PFlash support for AMD Strix Halo APUs, achieving 2.23x decode and 3.05x prefill speedups over llama.cpp HIP on Qwen3.6-27B.
A new toolset (DFlash + PFlash) achieves 2.5x faster inference than llama.cpp on AMD Ryzen AI MAX+ 395 iGPU, demonstrating significant speedups for Qwen3.6-27B with 128 GiB unified memory.
The article announces support for DFlash and PFlash speculative decoding in llama.cpp for AMD Strix Halo iGPUs, demonstrating significant speedups in inference performance using ROCm.
The author demonstrates how to reduce RTX 4090 power consumption by up to 40% while running quantized Qwen models via llama.cpp, without sacrificing inference speed. Capping the GPU power limit with nvidia-smi and adjusting llama-server parameters significantly reduces heat and noise and extends hardware lifespan.
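A minimal sketch of that workflow, assuming GPU index 0 and illustrative values for the power cap, model path, and server flags (nvidia-smi -pl requires root):

```python
import subprocess

POWER_LIMIT_W = 270  # assumption: ~40% below the RTX 4090's 450 W stock limit

# Cap the power limit on GPU 0 (persists until reboot or the next -pl call).
subprocess.run(["sudo", "nvidia-smi", "-i", "0", "-pl", str(POWER_LIMIT_W)], check=True)

# Start llama-server with all layers offloaded to the GPU; the GGUF path is a
# placeholder for whatever quantized Qwen model you are serving.
subprocess.run([
    "llama-server",
    "-m", "models/qwen-q4_k_m.gguf",  # hypothetical model path
    "-ngl", "99",                     # offload all layers to the GPU
])
```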
This paper introduces a fixed-contract diagnostic tool to analyze why KV cache compression methods succeed or fail in long-context LLM inference. It identifies three failure modes—missing evidence, scoring irrelevant tokens, and breaking related evidence—and evaluates them on LongBench and NeedleBench.
This paper analyzes KV cache quantization schemes inspired by TurboQuant, using statistical inference and a new 6D error framework to evaluate quality measures like KL divergence and geometric error.
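As a generic illustration of one such quality measure (the tensor names are assumptions, not the paper's framework), KL divergence between next-token distributions with full-precision versus quantized KV caches can be computed as:

```python
import torch
import torch.nn.functional as F

def kv_quant_kl(logits_fp16: torch.Tensor, logits_quant: torch.Tensor) -> torch.Tensor:
    """KL(p || q) between the model's next-token distribution with the
    full-precision KV cache (p) and with the quantized cache (q), averaged
    over positions. Both inputs are raw logits of shape (positions, vocab)."""
    p_log = F.log_softmax(logits_fp16, dim=-1)   # reference log-probabilities
    q_log = F.log_softmax(logits_quant, dim=-1)  # log-probabilities under quantized KV
    return F.kl_div(q_log, p_log, log_target=True, reduction="batchmean")
```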
The author presents a custom, hackable ML compiler written in Python that lowers LLMs to optimized CUDA kernels through a multi-stage IR pipeline, achieving performance competitive with or superior to PyTorch on specific operations. The article details the compiler's optimization passes, lowering rules, and CLI usage for generating efficient fused GPU kernels.
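As a toy illustration of the kind of optimization pass the article describes (a hypothetical mini-IR, not the author's actual compiler), an elementwise-fusion pass might merge adjacent pointwise ops so they lower to a single fused CUDA kernel:

```python
from dataclasses import dataclass

ELEMENTWISE = {"add", "mul", "relu"}  # ops that can share one fused kernel

@dataclass
class Node:
    op: str  # operation name in this toy IR

def fuse_elementwise(nodes: list[Node]) -> list[Node]:
    """Greedily merge consecutive elementwise nodes into one fused node, so the
    backend emits a single GPU kernel instead of one kernel per op."""
    out: list[Node] = []
    run: list[str] = []
    for n in nodes:
        if n.op in ELEMENTWISE:
            run.append(n.op)
            continue
        if run:
            out.append(Node("fused:" + "+".join(run)))
            run = []
        out.append(n)
    if run:
        out.append(Node("fused:" + "+".join(run)))
    return out

# [matmul, add, relu, matmul] -> [matmul, fused:add+relu, matmul]
```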
A corrected 2-bit GGUF model file has been uploaded to Hugging Face after fixing a bug in the imatrix computation, leading to improved logits recall and reduced error.
LocalAI 4.2.0 is released, featuring over 392 commits, 11 new backends including voice and face recognition, improved support for SGLang and vLLM, and contributions from over 16 new developers.
A user reports good performance from ds4 SSD caching when resuming a long-running LLM inference session with a 63k-token context, noting acceptable startup times.
Eldar Kurtic presents a comprehensive study on TurboQuant, revealing its real-world effects on accuracy, latency, and throughput beyond initial evaluations.
A tool enables running large language models such as Qwen3.5-35B on 16GB Macs by streaming model weights from the SSD, achieving up to 30 tok/s with an optimal configuration.
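The underlying trick is generic layer streaming; a hypothetical sketch (the per-layer file layout, single-Linear "layers", and paths are assumptions, not this tool's actual format):

```python
import torch
import torch.nn as nn

def stream_forward(hidden: torch.Tensor, num_layers: int, weight_dir: str) -> torch.Tensor:
    """Hold only one layer's weights in RAM at a time, loading each from SSD
    right before its forward pass and freeing it immediately afterward."""
    for i in range(num_layers):
        # mmap=True (PyTorch >= 2.1) maps the file instead of copying it into RAM.
        state = torch.load(f"{weight_dir}/layer_{i}.pt", map_location="cpu", mmap=True)
        layer = nn.Linear(hidden.shape[-1], hidden.shape[-1])
        layer.load_state_dict(state)  # assumes each file holds one layer's state_dict
        hidden = layer(hidden)
        del layer, state              # release this layer before loading the next
    return hidden
```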