This paper introduces BitCal-TTS, a runtime controller that improves accuracy and reduces premature halting in quantized reasoning models by calibrating confidence signals during test-time scaling.
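As a rough illustration of how a calibrated confidence signal can gate a halting decision (this is not the paper's controller; the temperature and threshold below are placeholders that would be fitted on a calibration set):

```python
import math

# Illustrative only, not BitCal-TTS itself: temperature-scale a raw confidence
# logit before comparing it to the halting threshold, so an overconfident
# quantized model does not stop its test-time scaling loop prematurely.
def should_halt(confidence_logit, temperature=1.8, threshold=0.9):
    calibrated = 1.0 / (1.0 + math.exp(-confidence_logit / temperature))
    return calibrated >= threshold
```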
Meta's In-Kernel Broadcast Optimization (IKBO) eliminates redundant user-embedding broadcasts in RecSys inference via kernel-model-system co-design, delivering up to a 2/3 latency reduction and a ~4x speedup on H100 GPUs, and serving as the backbone for the Meta Adaptive Ranking Model.
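The gist of the optimization, sketched below with assumed shapes and a toy dot-product scorer rather than Meta's kernels: keep a single copy of the user embedding and let the kernel broadcast it across candidates, instead of materializing one replica per candidate item before scoring.

```python
import torch

# Illustrative only: naive replication vs. relying on broadcasting inside the
# elementwise kernel when scoring one user against many candidate items.
N, D = 4096, 256
user = torch.randn(1, D)      # single user embedding
items = torch.randn(N, D)     # candidate item embeddings

# Naive: materialize N copies of the user row before the fused multiply,
# paying for the redundant memory traffic.
scores_naive = (user.repeat(N, 1) * items).sum(dim=-1)

# Broadcast-aware: keep the single row; the kernel broadcasts it across rows.
scores_bcast = (user * items).sum(dim=-1)

assert torch.allclose(scores_naive, scores_bcast)
```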
atomic.chat has optimized Gemma 4 26B inference in llama.cpp, achieving ~40% faster token generation on a MacBook Pro M5 Max using Multi-Token Prediction (MTP) speculative decoding. This is a notable win for local AI users running desktop apps, coding agents, and private on-device assistants.
The article discusses how AI agent workflows are shifting optimization focus from pure inference costs to broader challenges like latency, orchestration overhead, and reliability. It highlights a trend toward hybrid architectures and dynamic model routing to address these multi-step workflow complexities.
Google's Gemma 4 achieves up to 3x faster inference speeds through speculative decoding and multi-token prediction, enabling efficient on-device deployment.
Google released Multi-Token Prediction drafters for Gemma 4 to accelerate inference via speculative decoding, but support for MLX is currently unconfirmed or unavailable.
Modal engineers profiled SGLang's scheduler on multimodal VLM workloads and found that replacing expensive GPU memory bookkeeping with a simple Python dict cache improved throughput by 16% and reduced latency by over 13%, with the fix merged into SGLang v0.5.10.
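The underlying pattern, shown here as a generic sketch rather than the actual SGLang patch, is plain memoization: pay for the expensive lookup once and serve subsequent scheduler ticks from a Python dict.

```python
# Illustrative only (not the merged SGLang code): cache an expensive
# per-request computation in a dict instead of recomputing it from GPU-side
# bookkeeping on every scheduler iteration.
class MultimodalCache:
    def __init__(self):
        self._cache = {}

    def get(self, key, compute_fn):
        if key not in self._cache:
            self._cache[key] = compute_fn()   # expensive path, taken once
        return self._cache[key]
```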
Google DeepMind released Gemma 4 MTP drafters for the Gemma 4 family, enabling significant decoding speedups via speculative decoding while maintaining exact generation quality for low-latency applications.
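For context, speculative decoding keeps generation quality unchanged because every drafted token is checked against the target model before it is kept. A minimal greedy-verification sketch, with assumed `draft` and `target` callables returning next-token logits (not Google's implementation):

```python
# Greedy speculative decoding, simplified: the draft proposes k tokens, the
# target verifies them, and tokens are kept only while they match the target's
# own argmax, so the output equals plain greedy decoding with the target.
# Real systems verify all k positions in one batched forward pass; this is
# written sequentially for clarity.
def speculative_step(target, draft, prefix, k=4):
    proposal = list(prefix)
    for _ in range(k):                            # cheap drafting phase
        proposal.append(int(draft(proposal).argmax()))

    accepted = list(prefix)
    for i in range(len(prefix), len(proposal)):
        tok = int(target(accepted).argmax())      # target's choice here
        accepted.append(tok)
        if tok != proposal[i]:                    # mismatch: discard the rest
            break
    return accepted
```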
This article introduces Qwen3.6-27B-DFlash, a specialized drafter model for DFlash, a novel speculative decoding method that uses block diffusion to accelerate inference. It provides installation instructions for vLLM and SGLang to enable parallel drafting with the target Qwen3.6-27B model.
Developer @0xSero achieved high-performance inference on an optimized GLM-5.1-505B variant using NVFP4 quantization and 32% pruning, reaching 45 tokens/s decode and 1350 tokens/s prefill speeds.
FlashDrive reduces reasoning vision-language-action model inference latency from 716 ms to 159 ms on RTX PRO 6000—up to 5.7× faster—with zero accuracy loss, enabling real-time autonomous applications.
DFlash v0.1.4 ships custom Metal verify kernels for quantized Qwen3 hybrid models, delivering significant peak-memory reductions and 2.2x throughput improvements at long context on M5 Max GPUs.
The community discusses the identity of 'Elephant Alpha', a 100B-parameter model ranked #1 on OpenRouter with a 256K context window, fast inference, and strong coding capabilities but poor Chinese support, and speculates about which company might be behind it.
This paper identifies a Signal-to-Noise Ratio timestep (SNR-t) bias in diffusion probabilistic models during inference, where SNR-timestep alignment from training is disrupted at inference time. The authors propose a differential correction method that decomposes samples into frequency components and corrects each separately, improving generation quality across models like IDDPM, ADM, DDIM, EDM, and FLUX with minimal computational overhead.
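A rough sketch of the frequency-split idea only (the paper's actual correction terms are not reproduced; the cutoff and gains below are illustrative knobs):

```python
import torch

# Decompose a sample into low- and high-frequency components with an FFT mask,
# apply a separate scalar correction to each band, and recombine. This shows
# the structure of a per-band correction, not the paper's derived coefficients.
def frequency_split_correct(x, cutoff=0.25, low_gain=1.0, high_gain=0.9):
    X = torch.fft.fft2(x)                                # to frequency domain
    h, w = x.shape[-2:]
    fy = torch.fft.fftfreq(h).abs().view(-1, 1)
    fx = torch.fft.fftfreq(w).abs().view(1, -1)
    low_mask = ((fy < cutoff) & (fx < cutoff)).float()   # low-frequency band
    X_corr = X * low_mask * low_gain + X * (1 - low_mask) * high_gain
    return torch.fft.ifft2(X_corr).real                  # back to sample space
```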
This paper introduces STOP (Super Token for Pruning), a lightweight method that learns to prune unpromising reasoning paths early during parallel decoding by appending learnable tokens and reading KV cache states, achieving 70% token reduction while improving performance on AIME and GPQA benchmarks.
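A hypothetical pruning loop under assumed interfaces, included to illustrate the control flow rather than the paper's implementation:

```python
# Illustrative only: after each decode chunk, a lightweight head scores each
# parallel reasoning path from the hidden state of an appended probe token,
# and low-scoring paths stop generating further tokens, saving their budget.
def prune_paths(paths, score_fn, keep_threshold=0.3):
    """paths: list of dicts with 'tokens' and 'hidden' (probe-token state)."""
    survivors = []
    for path in paths:
        if score_fn(path["hidden"]) >= keep_threshold:
            survivors.append(path)      # promising: keep decoding this path
        # otherwise the path is pruned early
    return survivors
```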
This paper analyzes inference-time optimization techniques for AIMO 3, finding that model capability dominates over prompt engineering and diverse sampling strategies. The study shows that high-temperature sampling already decorrelates errors about as much as possible, leaving little room for prompt-based improvements, and identifies a 6-point selection-loss gap between a single model's pass@20 and its majority-voting consensus.
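For reference, the majority-voting baseline the study compares against amounts to the following general scheme (not the paper's exact pipeline): sample several solutions at high temperature, extract each final answer, and return the most common one.

```python
from collections import Counter

# Minimal majority-vote selection over extracted final answers.
def majority_vote(answers):
    return Counter(answers).most_common(1)[0][0]

# Example: majority_vote(["42", "41", "42", "42"]) -> "42"
```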
Forge-UGC is a four-phase universal graph compiler that speeds up transformer deployment on NPUs, cutting compilation time by 6.9-9.2×, inference latency by 18-36%, and energy by 30-41% versus OpenVINO and ONNX Runtime.
This paper introduces PagedAttention, an algorithm inspired by virtual memory paging, and vLLM, a serving system that significantly improves LLM throughput by reducing memory fragmentation in key-value caches.
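The core data structure is a per-sequence block table over a pool of fixed-size KV-cache blocks, so memory grows on demand instead of being reserved contiguously per sequence. A toy sketch of that bookkeeping (not vLLM's code; out-of-memory and preemption handling are omitted):

```python
class PagedKVCache:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                  # seq_id -> [block_id, ...]

    def append_token(self, seq_id, length_after_append):
        table = self.block_tables.setdefault(seq_id, [])
        needed = -(-length_after_append // self.block_size)  # ceil division
        while len(table) < needed:
            table.append(self.free_blocks.pop())  # grab a block on demand

    def free(self, seq_id):
        # Return all of the sequence's blocks to the pool when it finishes.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```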
OpenInfer demonstrates "vertical disaggregation" that boosts Qwen 3.5 27B throughput by ~50% by co-executing quantized layers across a single node’s AMD EPYC CPU and Nvidia L40S GPU with a custom SLA-aware scheduler.
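Conceptually, vertical disaggregation places part of the layer stack on each device within one node. The sketch below uses a fixed split point and plain PyTorch modules, since OpenInfer's SLA-aware scheduler is not public; a real system would pick the boundary dynamically per request.

```python
import torch

# Illustrative CPU/GPU layer split: early layers run on the GPU, the rest on
# the CPU, with a single activation transfer at the boundary per forward pass.
def place_layers(layers, gpu_layers=24):
    for i, layer in enumerate(layers):
        layer.to("cuda" if i < gpu_layers else "cpu")

def vertical_split_forward(layers, x, gpu_layers=24):
    x = x.to("cuda")
    for layer in layers[:gpu_layers]:
        x = layer(x)                  # GPU portion
    x = x.to("cpu")
    for layer in layers[gpu_layers:]:
        x = layer(x)                  # CPU portion
    return x
```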