Tag
COMET is a model-based reinforcement learning algorithm that combines a frozen object-centric encoder with a transformer-based world model and Monte Carlo Tree Search, using causal attention to focus on task-relevant objects, achieving higher scores on visual RL benchmarks.
WISE proposes a long-horizon agent framework for Minecraft that enhances low-level controllers with a Causal Event Graph for episodic memory, enabling robust recall under viewpoint changes and opportunistic task reordering via causal reasoning. It also features a multi-scale progressive exploration strategy and demonstrates improved success and efficiency on long-horizon sparse tasks.
This paper proposes 'Trivium,' a framework that introduces long-horizon temporal regret and epistemic regret as first-class objectives alongside outcome regret for causal-memory controllers in agentic LLM systems. The authors prove that outcome-only learning cannot distinguish causal from spurious structure without an intervention channel, while their approach achieves O(log E) temporal regret on CausalBench-Seq experiments versus linear growth for baselines.
Introduces Discrete-WAM, a unified discrete latent vision-action world policy that enables compositional causal reasoning and counterfactual reasoning in autonomous driving through aligned discrete tokens and a shared discrete diffusion framework.
PropLLM integrates hop-by-hop scene reconstruction with LLMs for network fault diagnosis. It uses a dual-layer knowledge graph and a temporal causal propagation attention mechanism to trace back along propagation paths, improving accuracy and reducing hallucinations.
This paper introduces compatibility and incompatibility scores for evaluating collections of bivariate causal statements without relying on faithfulness, and demonstrates their applicability by analyzing causal claims from large language models.
The BEAMS Initiative presents a benchmark suite for evaluating AI tools in modeling and simulation, focusing on human-centered and responsible AI practices. Tests reveal variability across LLM-based engines, with better performance in qualitative tasks than in causal reasoning.
Introduces SVI-Bench, a large-scale benchmark for strategic video intelligence using team sports, designed to evaluate models on dynamic scene understanding, causal reasoning, strategic simulation, and agentic synthesis. The benchmark reveals a capability cliff where models perform well on perceptual tasks but sharply degrade on higher-level strategic reasoning.
This paper argues that large language models struggle with causal reasoning and long-horizon planning due to a mismatch between sequence prediction and reasoning over latent environment dynamics, and introduces the Latent Dynamics Inference perspective along with the Flux environment to study these limitations.
This article argues that fundamental architectural limitations, not scaling deficits, prevent current LLMs from achieving true rationality, defined as the ability to recognize and switch frames. It cites empirical failures such as the reversal curse and frame-transfer issues, and suggests that scaling alone may not bridge this gap.
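This paper proposes and evaluates a class of world models for causal reasoning called event graph substrates, which answer counterfactual queries through deterministic replay over typed RDF event logs. They outperform baseline models on multiple benchmarks while guaranteeing inspectability and replay consistency.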
Google's new paper Nexus proposes transforming time series forecasting from statistical extrapolation to multi-agent reasoning. It uses event context to improve prediction accuracy, achieving an 86.6% reduction in MAPE on the Zillow dataset.
This article introduces ReplaySCM, a benchmark designed to evaluate language models' ability to induce executable causal mechanisms from interventional evidence, focusing on semantic replay behavior rather than syntactic matches.
This paper identifies a critical 'model collapse' issue in standard fine-tuning for causal reasoning and proposes a semantic loss function with graph-based logical constraints to prevent it.