General Intuition raised $320M at a $2.3B valuation to develop AI agents trained on video game action labels, demonstrating a single model that can play games and control real-world robots with minimal fine-tuning.
Highlights three recent AI papers: SpatialClaw (training-free spatial reasoning via code), SkillWeaver (compositional skill routing with a decompose-retrieve-compose pipeline), and PreAct (compiling agent runs into fast state machines for repeated tasks).
An informal chessboard experiment shows that vision-language models often fail at spatial reasoning and precise structured output even when they recognize the pieces correctly, exposing a key gap in VLM evaluation.
General Intuition, a startup building a foundation model for training AI agents in spatial-temporal reasoning using video game data, is in talks to raise $300 million at a $2 billion valuation, with backing from Jeff Bezos and Eric Schmidt.
NVIDIA has launched SpatialClaw, a code-based training-free agent framework for complex visual-spatial reasoning tasks, achieving an average of 59.9% on 20 benchmarks, 11.2 points higher than the previous best model.
This paper introduces visually grounded thinking, a method for vision-language models to interleave natural-language reasoning with explicit visual evidence grounding using points or boxes. A scalable synthesis pipeline and grounding-aware reinforcement learning improve reasoning accuracy, enabling a 4B model to match or surpass a 27B model on spatial and counting benchmarks.
NVIDIA introduces SpatialClaw, a training-free spatial reasoning agent that uses a VLM to write Python code in a persistent kernel, compose perception tools, and revise plans, achieving +11.2 points over prior agents on 20 benchmarks.
This paper proposes a self-supervised reinforcement learning framework that uses consistency verifiers (reward functions that check geometric and semantic consistency under transformations) to improve spatial reasoning in large reasoning models without requiring ground-truth annotations. The method approaches the accuracy of supervised fine-tuning and generalizes across diverse tasks.
The paper proposes SVoT, a reinforcement learning framework that generates interleaved, verifiable intermediate states and visualizations for multi-hop spatial reasoning in MLLMs, achieving significant accuracy gains on new benchmarks involving multi-object interactions and numerical reasoning.
SpatialClaw is a training-free framework that uses code as an action interface to enable flexible, stateful spatial reasoning in vision-language models, achieving superior performance across diverse 3D/4D spatial reasoning tasks.
This paper presents Architect-Ant, an editable automatic furnishing framework for architectural floor plans, together with a curated dataset (AntPlan-270) of 270 floor plans with furniture annotations. The method uses a fine-tuned vision-language model and a domain-specific language to generate geometrically valid and functionally plausible furniture layouts that can be rasterized into blueprint-style images.
A training-free framework for spatial reasoning from egocentric videos that enables revisiting conclusions through synthesized novel-view videos generated from predicted 3D geometry.
This paper identifies a failure mode called PhysHack in LLM-based LEGO assembly generation and proposes PVPO, a sample-efficient reinforcement learning method with model-based data selection that improves physical and semantic alignment using only a small fraction of training data.
AlloSpatial is an agentic framework that enhances spatial reasoning in foundation models by converting egocentric observations into structured allocentric representations, using cognitive mapping and tool-use reasoning. It improves performance by 5-18% on benchmarks and outperforms larger models through cold-start reinforcement learning.
SpatialWorld is a unified benchmark for evaluating interactive spatial reasoning in multimodal agents across diverse real-world tasks, revealing that even the strongest models achieve low task success rates.
The paper proposes Astra, an agentic framework that couples a VLM policy trained with reinforcement learning with a world simulator that generates novel-view observations, improving spatial reasoning in vision-language models.
A benchmark tests LLMs on strict Sokoban puzzles with formatting constraints, finding only ChatGPT, Qwen3.7-max, and Gemini 3.5-thinking succeed, while others fail due to illegal moves or formatting errors.
Proposes SpecFlow, a lightweight multimodal spatial reasoning framework that represents intermediate visual thoughts in a fixed-size discrete cosine space, reducing computation and KV cache costs by up to 2.1 times while maintaining competitive performance.
Imaginative Perception Tokens (IPT) enhance vision-language models' spatial reasoning by externalizing intermediate perceptual representations from alternative viewpoints, outperforming traditional text-based reasoning on perspective taking, path tracing, and multiview counting tasks.
GridVQA-X introduces a diagnostic framework to evaluate cross-modal explainability by distinguishing genuine spatial-relational reasoning from cross-modal shortcuts in multimodal models.