vlm

#vlm

Best local model for vision - 2nd benchmark update - 21 Jun 2026

Reddit r/LocalLLaMA ↗ · 2d ago

This post presents the second update of a benchmark for local vision language models, comparing 23 models across 30 images with revised settings, and provides performance recommendations for different VRAM tiers. Key findings include that thinking mode hurts vision performance and that MoE models underperform dense models for perception tasks.

#vlm

Revisiting Hard Questions with Replay Buffers (8 minute read)

TLDR AI ↗ · 5d ago

ZPPO introduces a replay buffer for hard questions in reinforcement learning for LLMs/VLMs, allowing repeated exposure to gradually improve rollout accuracy without policy drift. The method graduates more hard questions than GRPO, especially those with near-zero initial accuracy.

#vlm

A Cross-Model VLM-Judge Protocol for Single-Image 3D Mesh Quality (and Why Cheap Proxies Fall Short)

arXiv cs.LG ↗ · 6d ago

This paper presents a validated VLM-judge protocol for evaluating single-image-to-3D mesh quality, showing that cheap proxies like render-CLIP and geometry statistics fail to reliably track perceived quality.

#vlm

Robusto-2: Benchmarking Humans & VLMs for Autonomous Driving in Lima & New York City

Hugging Face Daily Papers ↗ · 6d ago

This paper studies how self-driving car systems and humans perform on visual question answering tasks across different geographic locations (Lima and New York City), finding that both humans and VLMs show similar performance regardless of location but diverge based on question type.

#vlm

@mervenoyann: day 2 findings on this pipeline > it works, got map@50=0.8028 on road sign detection against human annotations, with on…

X AI KOLs Timeline ↗ · 2026-06-17

Merve (@mervenoyann) shares day-two findings from a pipeline that uses multiple small VLMs as judges for road sign detection, achieving an mAP@50 of 0.8028 against human annotations with only 1.3k examples. The thread compares model rejection rates and discusses dataset shrinking, highly specific prompts, and plans to generalize the library.

#vlm

@Phoenixyin13: NVIDIA's SpatialClaw is fresh out. This framework directly lets VLM write code step by step in a persistent Python environment, like Jupyter. From calling SAM3 to see things, compute depth, use NumPy and SciPy to process data, view results in real time, if it doesn't work…

X AI KOLs Timeline ↗ · 2026-06-17

NVIDIA has launched SpatialClaw, a code-based training-free agent framework for complex visual-spatial reasoning tasks, achieving an average of 59.9% on 20 benchmarks, 11.2 points higher than the previous best model.

#vlm

Open weights are not enough: we need open training frameworks for research and better algorithms [P]

Reddit r/MachineLearning ↗ · 2026-06-15

A call for open training frameworks in AI research, introducing FeynRL, a modular and explicit framework for RL post-training of LLMs, VLMs, and agents, designed to make training processes visible and modifiable.

#vlm

ProCUA-SFT Technical Report

Hugging Face Daily Papers ↗ · 2026-06-15

ProCUA-SFT is a large-scale synthetic dataset of 3.1M step-level SFT samples for training computer-use agents, produced via an automated pipeline using a single VLM (Kimi-K2.5). Fine-tuning UI-TARS 7B on it achieves 45.0% on OSWorld, an 18.7 point improvement over the base model.

#vlm

@AndreasPSteiner: Released last week, and already more than 4M downloads on HuggingFace alone This makes Gemma 4 12B the most popular enc…

X AI KOLs Timeline ↗ · 2026-06-12

Google's Gemma 4 12B model, released last week, has already surpassed 4 million downloads on Hugging Face, making it the most popular encoder-free VLM and the first general-purpose LLM with encoder-free audio input. The model balances size and performance, enabling local laptop use with multi-step reasoning and agentic workflows.

#vlm

@HuggingPapers: SpatialClaw NVIDIA drops a training-free spatial reasoning agent that uses code as its action interface. A VLM writes P…

X AI KOLs Following ↗ · 2026-06-12

NVIDIA introduces SpatialClaw, a training-free spatial reasoning agent that uses a VLM to write Python code in a persistent kernel, compose perception tools, and revise plans, achieving +11.2 points over prior agents on 20 benchmarks.

#vlm

AutoMine Solution for AV2 2026 Scenario Mining Challenge

arXiv cs.AI ↗ · 2026-06-11

AutoMine is a robust self-refining scenario mining method using LLMs and VLMs to mine high-value scenarios from autonomous driving logs, achieving top scores in the Argoverse 2 Scenario Mining Competition at CVPR 2026.

#vlm

KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

arXiv cs.CL ↗ · 2026-06-10

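This paper introduces the KCSAT-ML benchmark, which comprises ten years of Korean college entrance exam math problems together with nationwide examinee error rates, and proposes the Difficulty-aligned Reasoning Gain (DRG) metric. The metric reveals how model errors align with human difficulty and shows that models with the same accuracy can exhibit markedly different reasoning behavior.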

#vlm

@_avichawla: 8 AI model architectures, visually explained: There's a tendency to treat LLMs as the whole field. But they're one fami…

X AI KOLs Timeline ↗ · 2026-06-09

A visual breakdown of 8 major AI model architectures including LLMs, VLMs, MoE, SLMs, and more, plus a bonus mention of recursive language models from MIT.

#vlm

VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

arXiv cs.AI ↗ · 2026-06-04

VAMPS is a new benchmark of 1,168 multimodal bilingual math problems designed to evaluate whether LLMs can benefit from constructing and reasoning over graphs/visualizations. Key finding: direct analytical solving surprisingly outperforms tool-enabled visual solving even on problems where plotting is a natural strategy.

#vlm

Visual Graph Scaffolds for Structural Reasoning in Large Language Models

arXiv cs.AI ↗ · 2026-06-03

This paper explores using visual graph mind maps as reasoning scaffolds for LLMs, finding that visual guidance remains effective even without direct answer hints, while textual flattening of graphs loses benefits.

#vlm

@DataChaz: NVIDIA just pulled off something crazy: making bounding box detection 10x faster by ripping out the exact step the enti…

X AI KOLs Timeline ↗ · 2026-06-01

NVIDIA researchers developed a technique to speed up bounding box detection by 10x by eliminating the autoregressive token-by-token prediction step used in VLM grounding models.

#vlm

Function2Scene: 3D Indoor Scene Layout from Functional Specifications

Hugging Face Daily Papers ↗ · 2026-05-29

Function2Scene generates 3D indoor layouts from functional descriptions by parsing user needs and applying design constraints through an iterative refinement loop combining geometric analysis, LLM reasoning, and VLM assessment, outperforming baselines in satisfying functional requirements.

#vlm

(Yet Another) KV cache calculator - kvanta.vcerny.cz

Reddit r/LocalLLaMA ↗ · 2026-05-25

A new open-source KV cache calculator tool named KVANTA has been released, supporting any LLM/VLM from Hugging Face.

#vlm

@HappyyPablo: open sourcing Marlin-2B a tiny VLM to extract structured information from videos Marlin is finetuned for two questions …

X AI KOLs Timeline ↗ · 2026-05-19

Marlin-2B is a newly open-sourced tiny VLM for extracting structured information from videos, fine-tuned to answer two questions: what is happening, and when. Its author describes it as the best open model in its weight class and competitive with Gemini-2.5-flash.

#vlm

what non-coding tasks have you gotten a local model to do autonomously?

Reddit r/LocalLLaMA ↗ · 2026-05-19

The author discusses building a small VLM for desktop GUI automation to move data between apps without APIs, expressing interest in non-coding autonomous use cases for local models.
