Tag
This post presents the second update of a benchmark for local vision language models, comparing 23 models across 30 images with revised settings, and provides performance recommendations for different VRAM tiers. Key findings include that thinking mode hurts vision performance and that MoE models underperform dense models for perception tasks.
ZPPO introduces a replay buffer for hard questions in reinforcement learning for LLMs/VLMs, allowing repeated exposure to gradually improve rollout accuracy without policy drift. The method graduates more hard questions than GRPO, especially those with near-zero initial accuracy.
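The replay idea in this summary can be illustrated with a toy buffer. This is a minimal sketch, not ZPPO's actual algorithm: the class name `HardQuestionBuffer`, the `graduate_at` threshold, and the rule of using mean rollout reward as accuracy are all assumptions made for illustration.

```python
import random


class HardQuestionBuffer:
    """Toy replay buffer that keeps questions whose rollout accuracy is still low."""

    def __init__(self, graduate_at=0.5, seed=0):
        # Hypothetical threshold: a question leaves the buffer once its
        # rollout accuracy reaches this value.
        self.graduate_at = graduate_at
        self.accuracy = {}  # question id -> latest rollout accuracy
        self.rng = random.Random(seed)

    def update(self, question_id, rollout_rewards):
        """Record the latest rollouts; return True if the question is at or above the threshold."""
        acc = sum(rollout_rewards) / len(rollout_rewards)
        if acc < self.graduate_at:
            self.accuracy[question_id] = acc  # keep (or add) as a hard question
            return False
        self.accuracy.pop(question_id, None)  # graduate: remove if it was buffered
        return True

    def sample(self, k):
        """Draw up to k buffered hard questions to replay in the next batch."""
        ids = list(self.accuracy)
        return self.rng.sample(ids, min(k, len(ids)))


buf = HardQuestionBuffer()
buf.update("q1", [0, 0, 0, 0])  # near-zero accuracy: stays in the buffer
buf.update("q2", [1, 1, 0, 1])  # already above threshold: never buffered
replayed = buf.sample(4)        # only "q1" is available for replay
```

Under these assumptions, a question is replayed until a later call to `update` reports accuracy at or above the threshold, at which point it is dropped from the buffer.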
This paper presents a validated VLM-judge protocol for evaluating single-image-to-3D mesh quality, showing that cheap proxies like render-CLIP and geometry statistics fail to reliably track perceived quality.
This paper studies how self-driving car systems and humans perform on visual question answering tasks across different geographic locations (Lima and New York City), finding that both humans and VLMs show similar performance regardless of location but diverge based on question type.
Merve (@mervenoyann) shares day-two findings from a pipeline that uses multiple small VLMs as judges for road sign detection, achieving mAP@50 = 0.8028 with only 1.3k examples. The thread compares model rejection rates and discusses dataset shrinking, super-specific prompts, and plans to generalize the library.
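For readers unfamiliar with the metric: the "@50" refers to an intersection-over-union (IoU) threshold of 0.5, meaning a predicted box counts as a match only if its IoU with a ground-truth box is at least 0.5. The sketch below shows only the IoU step, assuming boxes in `(x1, y1, x2, y2)` corner format; the function name `iou` and the example boxes are illustrative and not taken from the thread.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 10x10 boxes overlapping by half: intersection 50, union 150.
score = iou((0, 0, 10, 10), (5, 0, 15, 10))
is_match_at_50 = score >= 0.5  # False here, so this prediction would not count at IoU 0.5
```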
NVIDIA has launched SpatialClaw, a code-based training-free agent framework for complex visual-spatial reasoning tasks, achieving an average of 59.9% on 20 benchmarks, 11.2 points higher than the previous best model.
A call for open training frameworks in AI research, introducing FeynRL, a modular and explicit framework for RL post-training of LLMs, VLMs, and agents, designed to make training processes visible and modifiable.
ProCUA-SFT is a large-scale synthetic dataset of 3.1M step-level SFT samples for training computer-use agents, produced via an automated pipeline using a single VLM (Kimi-K2.5). Fine-tuning UI-TARS 7B on it achieves 45.0% on OSWorld, an 18.7 point improvement over the base model.
Google's Gemma 4 12B model, released last week, has already surpassed 4 million downloads on HuggingFace, making it the most popular encoder-free VLM and the first general-purpose LLM with encoder-free audio input. The model balances size and performance, enabling local laptop use with multi-step reasoning and agentic workflows.
NVIDIA introduces SpatialClaw, a training-free spatial reasoning agent that uses a VLM to write Python code in a persistent kernel, compose perception tools, and revise plans, achieving +11.2 points over prior agents on 20 benchmarks.
AutoMine is a robust self-refining scenario mining method using LLMs and VLMs to mine high-value scenarios from autonomous driving logs, achieving top scores in the Argoverse 2 Scenario Mining Competition at CVPR 2026.
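This paper introduces the KCSAT-ML benchmark, which contains ten years of Korean college entrance exam math problems together with nationwide test-taker error rates, and proposes the difficulty-aligned reasoning gain (DRG) metric. It reveals how model errors align with human difficulty and shows that models with the same accuracy can exhibit very different reasoning behavior.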
A visual breakdown of 8 major AI model architectures including LLMs, VLMs, MoE, SLMs, and more, plus a bonus mention of recursive language models from MIT.
VAMPS is a new benchmark of 1,168 multimodal bilingual math problems designed to evaluate whether LLMs can benefit from constructing and reasoning over graphs/visualizations. Key finding: direct analytical solving surprisingly outperforms tool-enabled visual solving even on problems where plotting is a natural strategy.
This paper explores using visual graph mind maps as reasoning scaffolds for LLMs, finding that visual guidance remains effective even without direct answer hints, while textual flattening of graphs loses benefits.
NVIDIA researchers developed a technique to speed up bounding box detection by 10x by eliminating the autoregressive token-by-token prediction step used in VLM grounding models.
Function2Scene generates 3D indoor layouts from functional descriptions by parsing user needs and applying design constraints through an iterative refinement loop combining geometric analysis, LLM reasoning, and VLM assessment, outperforming baselines in satisfying functional requirements.
A new open-source KV cache calculator tool named KVANTA has been released, supporting any LLM/VLM from Hugging Face.
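As background, the core arithmetic behind a KV cache estimate can be sketched in a few lines. This is not KVANTA's code: the function name `kv_cache_bytes` and the example configuration are hypothetical, and the formula assumes a standard transformer that caches one key and one value tensor per layer for every token, with no sliding-window attention or cache compression.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a standard transformer decoder."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model: 32 layers, 8 KV heads (grouped-query attention),
# head dimension 128, 8192-token context, 16-bit cache entries.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)
print(f"{size / 2**30:.2f} GiB")  # prints "1.00 GiB"
```

The estimate scales linearly with context length and batch size, so doubling either doubles the cache under these assumptions.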
Open-sourcing Marlin-2B, a tiny VLM for extracting structured information from videos, fine-tuned to answer 'what is happening and when'. Best open model in its weight class, competitive with Gemini-2.5-flash.
The author discusses building a small VLM for desktop GUI automation to move data between apps without APIs, expressing interest in non-coding autonomous use cases for local models.