This paper introduces BitEmbed, an extreme low-bit framework for LLM-based text embeddings that converts pretrained LLM backbones into BitNet-style encoders with ternary weights and quantized activations. It performs comparably to full-precision models while significantly reducing encoding and storage costs.
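As a rough illustration of what "ternary weights" means, the sketch below applies the absmean scheme known from BitNet b1.58: divide by the mean absolute weight, round, and clip to {-1, 0, 1}. The summary does not say which quantizer BitEmbed uses, so the scheme, the function names, and the sample weights are all assumptions made for illustration.

```python
def ternary_quantize(weights, eps=1e-8):
    """Absmean ternary quantization (assumed scheme, not confirmed for BitEmbed)."""
    # One scale per tensor: the mean absolute value of the weights.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round each scaled weight and clip it into {-1, 0, 1}.
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def dequantize(ternary, scale):
    """Map ternary codes back to approximate real-valued weights."""
    return [t * scale for t in ternary]


# Hypothetical weight row, chosen only to show the mapping.
weights = [0.9, -0.05, 0.4, -1.2, 0.02, 0.6]
ternary, scale = ternary_quantize(weights)
approx = dequantize(ternary, scale)
print(ternary, round(scale, 4))
```

Each weight then needs under two bits of storage plus one shared scale, which is where the storage saving in a scheme like this would come from.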
The author maps the Kullback-Leibler divergence of KV cache quantization for the Qwen3.6-35B-A3B and Gemma4-E2B QAT models.
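One common way to measure this kind of divergence is to compare the next-token distribution produced with a full-precision KV cache against the one produced with a quantized cache. The sketch below assumes that setup; the logits are made-up numbers, and the summary does not describe the author's actual measurement procedure.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits for one token position.
ref_logits = [2.0, 1.0, 0.1]      # full-precision KV cache
quant_logits = [1.9, 1.1, 0.0]    # quantized KV cache

p_ref = softmax(ref_logits)
p_quant = softmax(quant_logits)
kl = kl_divergence(p_ref, p_quant)
print(f"KL(ref || quant) = {kl:.6f} nats")
```

Averaging this value over many token positions gives a single number per quantization setting, with 0 meaning the quantized cache leaves the output distribution unchanged.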
A Chinese developer published a 70B-parameter LLM that runs locally on minimal hardware (a 4GB GPU) using flat memory and layer-by-layer loading, and it is pitched as a potential replacement for expensive subscription services.
PerceptionDLM introduces a multimodal diffusion language model that enables parallel region perception via structured attention masking and efficient prompting, achieving faster inference without sacrificing caption quality. Experiments show competitive performance with substantial speed improvements for multi-region perception tasks.
ImageWAM proposes replacing video generation with pretrained image editing models in world action models for robot control, achieving superior performance while reducing FLOPs to 1/6 and latency to 1/4 of video-based approaches.
GLM 5.2 is released as a 753B-parameter open-source model with a 1M context length under the MIT license. It scores 99.2 on AIME 2026, outperforming GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.8.
This paper introduces the Forced Deferral Attack (FDA), an adversarial image attack that manipulates confidence scores in multimodal LLM cascades, causing queries to be unnecessarily routed to stronger (more expensive) models, thereby shifting compute costs to the provider without degrading answer correctness.
This paper presents llada.cpp, an NPU-aware inference framework for accelerating diffusion large language models (dLLMs) on smartphones. It introduces three techniques (Multi-Block Speculative Decoding, Dual-Path Progressive Revision, and Swap-Optimized Memory Runtime) to align dLLM inference with mobile NPU characteristics, achieving a 17-42x latency reduction over a CPU baseline.
This paper introduces SP³, a method using Spherical Encoder priors for Plug-and-Play image restoration, achieving perceptual quality comparable to zero-shot diffusion priors while being 3–630× faster across tasks.
A tweet highlights that the abliterated, NVFP4-quantized Gemma-4-12B model (7.7 GB) can rival Qwen 3.6-35B in practical tasks while running fast on Blackwell GPUs, demonstrating significant efficiency gains.
MiniMax M3, a 428B MoE model with ~23B active parameters, is now open source. It offers ultra-long context (up to 1M) and efficiency improvements, with various quantized sizes and VRAM requirements for local deployment.
Nemotron 3 Ultra is a 550B parameter hybrid Mamba-Attention mixture-of-experts language model, pre-trained on 20T tokens, extended to 1M context, and post-trained with SFT, RL, and MOPD. It achieves up to 6x higher inference throughput than state-of-the-art LLMs with comparable accuracy, and is open-sourced.
This paper introduces an adaptive video tokenisation method that exploits temporal redundancy in latent space to allocate tokens dynamically, achieving efficient compression without auxiliary networks. The proposed Latent Inpainting Transformer reconstructs dropped positions, delivering 31x speedup over ElasticTok-CV and 2x over InfoTok.
This paper proposes CRUMB, a three-stage inference wrapper that clusters test queries and selects a distributionally matched training subset via MMD minimization to enable efficient Prior-Fitted Network inference on large datasets, achieving state-of-the-art context selection on 51 TabArena datasets.
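To make the MMD criterion concrete, the sketch below computes a biased estimate of squared Maximum Mean Discrepancy with an RBF kernel and uses it to pick the candidate training subset closest in distribution to a batch of test queries. The kernel choice, the `gamma` value, the toy 2-D points, and the choose-between-two-subsets setup are assumptions for illustration; CRUMB's actual three-stage procedure is not specified in the summary.

```python
import math


def rbf(x, y, gamma=1.0):
    """RBF kernel between two equal-length feature tuples."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy


# Hypothetical test queries and two candidate training subsets.
queries = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2)]
near = [(0.1, 0.1), (0.0, 0.0), (0.2, 0.2)]
far = [(3.0, 3.1), (2.9, 3.0), (3.2, 2.8)]

candidates = {"near": near, "far": far}
best = min(candidates, key=lambda name: mmd2(queries, candidates[name]))
print(best)
```

A lower value means the subset's distribution sits closer to the queries, so the selected subset would serve as the context passed to the Prior-Fitted Network.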
HiLo-Token introduces an input-adaptive token compression framework for Diffusion Transformers that allocates more tokens to high-frequency regions, achieving up to 3.13x speedup in image editing tasks without quality loss.
IntentKV introduces a cross-turn intent-aware KV cache pruning method for multi-turn LLM agents, maintaining session-level query memory to efficiently prune cache without accuracy loss, significantly reducing token usage and KV reads.
This paper formalizes Streaming Knowledge Compilation for LLM wikis, introducing a materiality signal to proactively pin important documents from a streaming corpus under a token budget. It proves an O(√(T log K)) regret bound and validates the approach in finance and Wikipedia domains, showing that regret analysis is a reliable evaluation metric.
Introduces FlashMemory DeepSeek-V4 Retriever, a lightweight model that sparsifies DeepSeek-V4's CSA KV-cache by predicting which chunks will be attended to next, keeping only ~10-15% on-device while matching full-attention performance.
ScaleSweep proposes a new block scale initialization method for NVFP4 post-training quantization of LLMs, achieving improved accuracy by sweeping over feasible block scale candidates. Experiments on Llama and Qwen models show it preserves over 93% of full-precision performance under aggressive quantization.
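The idea of sweeping block scales can be sketched as follows, assuming the usual NVFP4 layout of 16-element blocks whose values are rounded to the FP4 (E2M1) magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, and 6. The candidate multipliers, the squared-error objective, and the omission of FP8 rounding of the scale itself are simplifications introduced here, not details taken from ScaleSweep.

```python
import math

# Representable magnitudes of the FP4 E2M1 format.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block, scale):
    """Round each value to the nearest FP4 magnitude times scale, keeping sign."""
    out = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


def block_error(block, scale):
    """Squared reconstruction error of a block under a given scale."""
    return sum((x - q) ** 2 for x, q in zip(block, quantize_block(block, scale)))


def sweep_scale(block, multipliers=(0.7, 0.8, 0.9, 1.0, 1.1)):
    """Pick the candidate scale with the lowest error (hypothetical candidates)."""
    base = max(abs(x) for x in block) / 6.0  # default: map the block max to 6
    if base == 0.0:
        return 1.0  # all-zero block: any scale reconstructs it exactly
    return min((base * m for m in multipliers), key=lambda s: block_error(block, s))


# Hypothetical 16-element weight block.
block = [0.11, -0.42, 0.05, 0.93, -0.27, 0.64, -0.08, 0.31,
         0.02, -0.76, 0.18, 0.49, -0.55, 0.07, 0.23, -0.14]
base_scale = max(abs(x) for x in block) / 6.0
best_scale = sweep_scale(block)
print(block_error(block, base_scale), block_error(block, best_scale))
```

Because the default scale is among the candidates, the swept scale can never reconstruct the block worse than the default under this objective.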
Proposes a non-autoregressive scoring method for punctuation restoration in streaming ASR that preserves the input transcript and outperforms prompt-based and fine-tuned baselines under a limited lookahead budget.