efficient-inference

#efficient-inference

BitNet Text Embeddings

arXiv cs.CL ↗ · 17h ago

This paper introduces BitEmbed, an extreme low-bit framework for LLM-based text embeddings that converts pretrained LLM backbones into BitNet-style encoders with ternary weights and quantized activations. It achieves comparable performance to full-precision models while significantly reducing encoding and storage costs.
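For readers unfamiliar with ternary weights, here is a minimal sketch of absmean ternary quantization in the style of BitNet b1.58. It is an illustration only: whether BitEmbed uses exactly this scheme is an assumption, and the names `ternarize` and `dequantize` are hypothetical, not taken from the paper.

```python
def ternarize(weights, eps=1e-8):
    """Map float weights to {-1, 0, 1} plus one shared scale (absmean scheme).

    Assumption: a single per-tensor scale, as in BitNet b1.58-style models.
    """
    # The scale is the mean absolute value of the weights.
    scale = sum(abs(w) for w in weights) / len(weights)
    # Divide by the scale, round to the nearest integer, then clip to [-1, 1].
    ternary = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return ternary, scale


def dequantize(ternary, scale):
    """Approximate reconstruction of the original weights."""
    return [t * scale for t in ternary]


if __name__ == "__main__":
    q, s = ternarize([0.9, -0.05, 0.4, -1.2])
    print(q, s)
    print(dequantize(q, s))
```

Each weight then needs under two bits of storage plus one shared floating-point scale, which is the source of the storage savings the summary mentions.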

#efficient-inference

I mapped the KLD of KV cache quantization for Qwen3.6-35B-A3B and Gemma4-E2B QAT

Reddit r/LocalLLaMA ↗ · 2d ago

The author maps the Kullback-Leibler divergence of KV cache quantization for the Qwen3.6-35B-A3B and Gemma4-E2B QAT models.
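As a rough sketch of what such a measurement involves, the KL divergence can be computed between the next-token distribution of a reference run and that of a run with a quantized KV cache. The helper names `softmax` and `kl_divergence` and the example logits below are illustrative assumptions, not taken from the post.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; p is the reference distribution, q the quantized one."""
    # Terms with p_i == 0 contribute nothing; q_i is floored to avoid log(0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    # Hypothetical logits for one token position.
    reference = softmax([2.0, 1.0, 0.1])   # unquantized KV cache
    quantized = softmax([1.9, 1.1, 0.1])   # quantized KV cache
    print(kl_divergence(reference, quantized))
```

In practice this value would be averaged over many token positions; a result near zero means the quantized cache barely changes the model's predictions.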

#efficient-inference

@onchainmilady: ANTHROPIC TRIED TO BAN HIS GITHUB Chinese guy published 70B parameter LLM, 20,000 starts on Github + a lawsuit from big…

X AI KOLs Timeline ↗ · 5d ago

A Chinese developer published a 70B parameter LLM that runs locally on minimal hardware (4GB GPU) using flat memory and layer-by-layer loading, potentially replacing expensive subscription services.

#efficient-inference

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

Hugging Face Daily Papers ↗ · 2026-06-17

PerceptionDLM introduces a multimodal diffusion language model that enables parallel region perception via structured attention masking and efficient prompting, achieving faster inference without sacrificing caption quality. Experiments show competitive performance with substantial speed improvements for multi-region perception tasks.

#efficient-inference

ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

Hugging Face Daily Papers ↗ · 2026-06-17

ImageWAM proposes replacing video generation with pretrained image editing models in world action models for robot control, achieving superior performance while reducing FLOPs to 1/6 and latency to 1/4 of video-based approaches.

#efficient-inference

@AdinaYakup: GLM 5.2 is here 753B ( smaller than you expect? ) 1M context MIT license GLM IndexShare: reuses the indexer across laye…

X AI KOLs Following ↗ · 2026-06-16

GLM 5.2 is released as a 753B parameter open-source model with 1M context length, MIT license, and achieves 99.2 on AIME 2026, outperforming GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.8.

#efficient-inference

Forced Deferral: Manipulating Routing Decisions in Multimodal LLM Cascades

arXiv cs.AI ↗ · 2026-06-16

This paper introduces the Forced Deferral Attack (FDA), an adversarial image attack that manipulates confidence scores in multimodal LLM cascades, causing queries to be unnecessarily routed to stronger (more expensive) models, thereby shifting compute costs to the provider without degrading answer correctness.

#efficient-inference

Efficient On-Device Diffusion LLM Inference with Mobile NPU

arXiv cs.LG ↗ · 2026-06-15

This paper presents llada.cpp, an NPU-aware inference framework for accelerating diffusion large language models (dLLMs) on smartphones. It introduces three techniques—Multi-Block Speculative Decoding, Dual-Path Progressive Revision, and Swap-Optimized Memory Runtime—to align dLLM inference with mobile NPU characteristics, achieving 17-42x latency reduction over CPU baseline.

#efficient-inference

SP^3: Spherical Priors for Plug-and-Play Restoration

Hugging Face Daily Papers ↗ · 2026-06-15

This paper introduces SP³, a method using Spherical Encoder priors for Plug-and-Play image restoration, achieving perceptual quality comparable to zero-shot diffusion priors while being 3–630× faster across tasks.

#efficient-inference

@Tono_Ken3: I noticed that there might be another person who realized that gemma-4-12b could rival qwen3.6-35b in practical work Ye…

X AI KOLs Timeline ↗ · 2026-06-14

A tweet highlights that the abliterated, NVFP4 quantized Gemma-4-12B model (7.7 GB) can rival Qwen 3.6-35B in practical tasks while running fast on Blackwell GPUs, demonstrating significant efficiency gains.

#efficient-inference

@TeksEdge: With MiniMax M3 open source now out, here is what to expect on quants and sizes, including VRAM needed: MiniMax M3 (428…

X AI KOLs Following ↗ · 2026-06-12

MiniMax M3, a 428B MoE model with ~23B active parameters, is now open source. It offers ultra-long context (up to 1M) and efficiency improvements, with various quantized sizes and VRAM requirements for local deployment.

#efficient-inference

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Hugging Face Daily Papers ↗ · 2026-06-12

Nemotron 3 Ultra is a 550B parameter hybrid Mamba-Attention mixture-of-experts language model, pre-trained on 20T tokens, extended to 1M context, and post-trained with SFT, RL, and MOPD. It achieves up to 6x higher inference throughput than state-of-the-art LLMs with comparable accuracy, and is open-sourced.

#efficient-inference

Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting [R]

Reddit r/MachineLearning ↗ · 2026-06-11

This paper introduces an adaptive video tokenisation method that exploits temporal redundancy in latent space to allocate tokens dynamically, achieving efficient compression without auxiliary networks. The proposed Latent Inpainting Transformer reconstructs dropped positions, delivering 31x speedup over ElasticTok-CV and 2x over InfoTok.

#efficient-inference

CRUMB: Efficient Prior Fitted Network Inference via Distributionally Matched Context Batching

arXiv cs.LG ↗ · 2026-06-11

This paper proposes CRUMB, a three-stage inference wrapper that clusters test queries and selects a distributionally matched training subset via MMD minimization to enable efficient Prior-Fitted Network inference on large datasets, achieving state-of-the-art context selection on 51 TabArena datasets.

#efficient-inference

HiLo-Token: Input-Adaptive High-Low Frequency Token Compression for Efficient Image Editing

Hugging Face Daily Papers ↗ · 2026-06-11

HiLo-Token introduces an input-adaptive token compression framework for Diffusion Transformers that allocates more tokens to high-frequency regions, achieving up to 3.13x speedup in image editing tasks without quality loss.

#efficient-inference

IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

arXiv cs.LG ↗ · 2026-06-10

IntentKV introduces a cross-turn intent-aware KV cache pruning method for multi-turn LLM agents, maintaining session-level query memory to efficiently prune cache without accuracy loss, significantly reducing token usage and KV reads.

#efficient-inference

Streaming Knowledge Compilation: Proactive Materiality-Scored Pinning for Time-Evolving LLM Wikis

arXiv cs.LG ↗ · 2026-06-10

This paper formalizes Streaming Knowledge Compilation for LLM wikis, introducing a materiality signal to proactively pin important documents from a streaming corpus under a token budget. It proves an O(√(T log K)) regret bound and validates the approach in finance and Wikipedia domains, showing that regret analysis is a reliable evaluation metric.

#efficient-inference

FlashMemory DeepSeek-V4 Retriever (GitHub Repo)

TLDR AI ↗ · 2026-06-10

Introduces FlashMemory DeepSeek-V4 Retriever, a lightweight model that sparsifies DeepSeek-V4's CSA KV-cache by predicting which chunks will be attended to next, keeping only ~10-15% on-device while matching full-attention performance.

#efficient-inference

ScaleSweep: Accurate NVFP4 Post-Training Quantization of LLMs via Block Scale Initialization

arXiv cs.LG ↗ · 2026-06-09

ScaleSweep proposes a new block scale initialization method for NVFP4 post-training quantization of LLMs, achieving improved accuracy by sweeping over feasible block scale candidates. Experiments on Llama and Qwen models show it preserves over 93% of full-precision performance under aggressive quantization.

#efficient-inference

Efficient Punctuation Restoration via Weighted Lookahead Scoring Method for Streaming ASR Systems

arXiv cs.CL ↗ · 2026-06-05

Proposes a non-autoregressive scoring method for punctuation restoration in streaming ASR that preserves the input transcript and outperforms prompt-based and fine-tuned baselines under a limited lookahead budget.
