Ten AI models were asked about the best approach for answering questions; they recommended a council for high-stakes decisions and a grounded fact-checker for factual queries. This led RoundTable to build 'Check mode', a new feature pairing a strong model with a web-grounded fact-checker.
GAVEL introduces a new task for verifying, explaining, and localizing errors in image-text pairs, along with a dataset and benchmark. A supervised baseline shows improvements over strong closed-source models.
This paper proposes a training-free 'identify-before-answer' (IBA) framework for Knowledge-Based Visual Question Answering (KB-VQA) that decouples entity identification from evidence ranking, outperforming fine-tuned multi-modal retrieval-augmented generation baselines while reducing complexity.
This paper introduces DiagFlowBench, a benchmark dataset of 1,676 multi-turn diagnostic conversations derived from industrial flowcharts, designed to evaluate how well language models handle off-procedure inputs and abstain from giving inappropriate advice.
This paper introduces ViGOS, a method for multimodal on-policy self-distillation that decouples perception and reasoning by having the student model first produce a visual description before reasoning, reducing shortcut reliance and improving image-grounding behavior.
This paper introduces visually grounded thinking, a method that lets vision-language models interleave natural-language reasoning with explicit visual evidence, grounded as points or boxes. A scalable synthesis pipeline and grounding-aware reinforcement learning improve reasoning accuracy, enabling a 4B model to match or surpass a 27B model on spatial and counting benchmarks.
Introduces MINARD, a pipeline for generating narrated, region-grounded walkthrough videos from scientific figures and their papers, along with the FigTalk benchmark and new grounding metrics.
This paper introduces a phrasing-controlled benchmark to measure how much vision-language models rely on textual priors versus image content. Experiments across eleven models show significant degradation when text leakage is minimized, and the authors demonstrate that in-context learning and GRPO post-training can reduce this reliance.
Proposes Reroute, a training-free plug-in for vision-language models that replaces irreversible visual-token pruning with recoverable routing, allowing tokens to re-enter the pipeline later to improve grounding under aggressive token reduction while maintaining VQA performance.
Introduces GATE (Grounding After Test from Execution), a method that bootstraps missing semantic groundings from execution feedback to handle under-specified user phrases in text-to-SQL tasks, consistently improving over strong baselines.
NVIDIA researchers developed a technique to speed up bounding box detection by 10x by eliminating the autoregressive token-by-token prediction step used in VLM grounding models.
This paper demonstrates that training a world model through random physical exploration leads to latent representations that encode spatial semantic structure (direction and position) without any linguistic supervision, highlighting physical geometry as the organizing principle.
This paper introduces Graph Alignment Topology as an inductive bias for grounding detection, using a graph neural network to model alignment structure between reference information and LLM outputs. The method achieves state-of-the-art results on multiple hallucination and question-answering datasets, outperforming GPT-4o.
This paper introduces Text2Opt-Bench, a scalable benchmark for text-to-optimization, and identifies that LLMs struggle with 'binding' (grounding problem data) rather than 'modeling' (choosing optimization structure). The authors propose BIND, a simple inference-time method that externalizes numeric data, significantly improving accuracy across models.
A detailed evaluation of a RAG customer support chatbot finds that retrieval issues often masquerade as LLM problems, heuristic evaluators can be misleading, deduplication improves quality, stricter grounding trades helpfulness for accuracy, and sweeping across candidate models can dramatically reduce cost while improving performance.
This paper introduces Grounded Continuation, a linear-time runtime verifier for LLM conversations that maintains an explicit dependency graph to detect whether a next utterance is supported by prior conversation, achieving accuracy gains over baselines on benchmarks including LongMemEval and LoCoMo.
Falcon Perception is a 0.6B-parameter early-fusion Transformer model released by the Technology Innovation Institute (TII) in the UAE for open-vocabulary grounding and segmentation from natural-language prompts, using hybrid attention and specialized heads.
DeepMind introduces FACTS Grounding, a comprehensive benchmark with 1,719 examples for evaluating how accurately large language models ground their responses in source material and avoid hallucinations. The benchmark includes a public dataset and an online Kaggle leaderboard tracking LLM performance on factual accuracy and grounding tasks.