efficient-inference

#efficient-inference

Qift: Shift-Friendly No-Zero W2 Post-Training Quantization for Rotated W2A4/KV4 LLM Inference

arXiv cs.LG · 2026-06-03

This paper introduces Qift, a fixed no-zero two-bit weight quantization level set designed for Hadamard-rotated LLMs, achieving improved W2A4/KV4 inference by leveraging the near-zero-centered Gaussian-like distribution of rotated weights. Experiments on LLaMA-2-7B and LLaMA-3.1-8B show consistent perplexity gains over standard W2 quantization.
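As a rough illustration of what a "no-zero" two-bit level set means, the sketch below rounds weights to the nearest of four levels. The level values, the `quantize` helper, and the scale are assumptions made for illustration; the summary does not give Qift's actual levels or its shift-friendly encoding.

```python
# Hypothetical level sets (in units of the scale); NOT taken from the paper.
STANDARD_INT2 = (-2, -1, 0, 1)          # usual two's-complement 2-bit grid
NO_ZERO_W2 = (-1.5, -0.5, 0.5, 1.5)     # symmetric grid with no zero level


def quantize(w, levels, scale):
    """Round a single weight to the nearest representable level."""
    nearest = min(levels, key=lambda q: abs(w - q * scale))
    return nearest * scale


if __name__ == "__main__":
    for w in (0.1, 1.9, -1.9):
        print(w, quantize(w, STANDARD_INT2, 1.0), quantize(w, NO_ZERO_W2, 1.0))
```

On the standard grid a small weight collapses to zero and the positive range is shorter than the negative one; the symmetric no-zero grid spends all four codes on nonzero magnitudes, which fits a distribution centered near zero.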

Video2LoRA: Parametric Video Internalization for Vision-Language Models

Hugging Face Daily Papers · 2026-06-03

This paper introduces Video2LoRA, a method that predicts Low-Rank Adaptation (LoRA) weights directly from video representations, enabling efficient video processing in frozen vision-language models. It reduces visual token load by up to 1500x and query TTFT by 6-80x while maintaining performance on video summarization and captioning benchmarks.

EPIC: Efficient and Parallel Inference under CFG Constraints for Diffusion Language Models

arXiv cs.CL · 2026-06-02

This paper presents EPIC, an efficient framework for context-free grammar constrained decoding in diffusion language models that reduces inference time by up to 67.5% while maintaining syntactic correctness.

dMoE: dLLMs with Learnable Block Experts

arXiv cs.CL · 2026-06-01

dMoE proposes block-level expert routing for diffusion LLMs, reducing the number of uniquely activated experts from 69.5 to 14.6 while retaining 99.11% performance and achieving 76-80% memory reduction with 1.14-1.66× speedup.

Robust and Efficient Guardrails with Latent Reasoning

arXiv cs.AI · 2026-05-29

CoLaGuard is a new guardrail model that transfers multi-step safety reasoning into a continuous latent space, achieving 12.9x speedup and 22.4x token reduction compared to explicit reasoning baselines while matching macro-F1 performance on ten safety benchmarks.

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

Hugging Face Daily Papers · 2026-05-28

PARCEL introduces a novel vision-language model architecture that uses pool-anchored resampling and conditioned elastic queries to improve efficiency and performance across different visual-token budgets, outperforming existing matryoshka baselines.

Linearizing Vision Transformer with Test-Time Training

Hugging Face Daily Papers · 2026-05-28

This paper proposes a method to convert pretrained Softmax attention models into linear-complexity Test-Time Training (TTT) architectures, achieving comparable text-to-image quality to fine-tuned Softmax models while significantly accelerating inference. The approach is validated by linearizing Stable Diffusion 3.5, resulting in SD3.5-T^5 with 1.32x speedup at 1K resolution.

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Hugging Face Daily Papers · 2026-05-28

VideoMLA replaces per-head KV caches in video diffusion models with a shared low-rank latent and decoupled 3D-RoPE positional keys, reducing per-token KV memory by 92.7% and improving throughput by 1.23x on a B200 while maintaining quality on VBench benchmarks.

Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving

arXiv cs.CL · 2026-05-25

Fast-dDrive is a block-diffusion VLA model for end-to-end autonomous driving that achieves state-of-the-art trajectory accuracy while delivering over 12x throughput speedup over autoregressive baselines, addressing the trade-off between high-fidelity planning and efficient inference for edge deployment.

I built a Mamba1 variant I call SM1 with d_state=1 that runs on Blackwell in pure PyTorch [P]

Reddit r/MachineLearning · 2026-05-23

The author presents SM1, a variant of Mamba1 with d_state=1, using two native PyTorch ops to replace the selective scan, reducing memory by 16x compared to d_state=16. The closed-form solution eliminates the state dimension, enabling efficient inference with constant memory per token.
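One way a scalar-state recurrence of this kind can be written in closed form is sketched below, using only a cumulative product and a cumulative sum. This is an assumption about the general technique, not the author's code: the post does not say which two PyTorch ops are used, and the names `scan_sequential` and `scan_closed_form` are hypothetical. The recurrence assumed is h_t = a_t * h_(t-1) + u_t with every a_t nonzero.

```python
from itertools import accumulate
import operator


def scan_sequential(a, u):
    """Reference: step the recurrence h_t = a_t * h_{t-1} + u_t from h_0 = 0."""
    h, out = 0.0, []
    for a_t, u_t in zip(a, u):
        h = a_t * h + u_t
        out.append(h)
    return out


def scan_closed_form(a, u):
    """Same result via h_t = P_t * sum_{s<=t} u_s / P_s, with P_t = a_1 * ... * a_t."""
    p = list(accumulate(a, operator.mul))                       # cumulative product
    s = list(accumulate(u_s / p_s for u_s, p_s in zip(u, p)))   # cumulative sum
    return [p_t * s_t for p_t, s_t in zip(p, s)]


if __name__ == "__main__":
    a = [0.9, 0.5, 0.8, 0.7]   # per-step decay factors (illustrative values)
    u = [1.0, 2.0, -1.0, 0.5]  # per-step inputs (illustrative values)
    print(scan_sequential(a, u))
    print(scan_closed_form(a, u))
```

Dividing by the running product can underflow on long sequences, so a practical version would likely work in log space or in chunks; the sketch only shows why a state of size one admits a scan-free form.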

PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models

arXiv cs.CL · 2026-05-21

PulseCol introduces a periodically refreshed column-sparse attention method for diffusion language models, achieving higher sparsity and up to 1.95x end-to-end speedup over FlashAttention while maintaining model quality.

Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization

arXiv cs.LG · 2026-05-21

Quant.npu introduces a fully static quantization framework for mobile NPUs, using learnable parameters and rotation matrices to enable efficient low-bit LLM inference without runtime re-computation, achieving up to 15.1% latency reduction.

Plug-and-Play Spiking Operators: Breaking the Nonlinearity Bottleneck in Spiking Transformers

arXiv cs.LG · 2026-05-21

This paper proposes a plug-and-play framework that implements spike-friendly approximations for Transformer nonlinearities (e.g., Softmax, SiLU, normalization) via population computation with LIF neurons and lightweight bit-shift scaling, achieving less than 1% accuracy drop on LLMs without fine-tuning.

Multi-Token Residual Prediction

arXiv cs.LG · 2026-05-20

Introduces Multi-token Residual Prediction (MRP), a lightweight module for diffusion language models that enables dependency-aware multi-token denoising within a single backbone forward pass, achieving up to 1.42× lossless speedup.

D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting

arXiv cs.LG · 2026-05-20

This paper introduces D-PACE, a dynamic position-aware cross-entropy loss for training speculative decoding drafters that adaptively weights positions to improve acceptance length and inference speed, achieving consistent wall-clock speedups across benchmarks with minimal overhead.

OlmoEarth v1.1: A more efficient family of models

Hugging Face Blog · 2026-05-19

OlmoEarth v1.1 is a new family of satellite imagery analysis models from Allen AI that reduces compute costs by up to 3x while maintaining performance, achieved by decreasing token sequence lengths in transformer-based models.

ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference

arXiv cs.LG · 2026-05-19

ProxyKV is a cross-model proxy pruning framework that offloads importance scoring to a lightweight small model, achieving high-precision KV cache pruning with much lower prefilling overhead and matching KVZip accuracy across the Llama-3.1, Qwen-2.5, and Qwen-3 families.

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

arXiv cs.CL · 2026-05-19

RTPurbo converts full-attention LLMs into sparse models with only a few hundred training steps, achieving near-lossless accuracy and up to 9.36x prefill and 2.01x decode speedups.

TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens

arXiv cs.AI · 2026-05-19

The paper introduces TTE-Flash, a method that replaces explicit chain-of-thought reasoning with latent think tokens to generate reasoning-aware multimodal representations at constant inference cost, outperforming explicit CoT baselines on the MMEB-v2 benchmark.

TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload

Hugging Face Daily Papers · 2026-05-19

TIDE is a lossless inference system for diffusion large language models that leverages temporal stability of expert activations to reduce I/O overhead and computation, achieving up to 1.4-1.5x throughput improvements on single GPU-CPU systems.
