Tag
PropLLM integrates hop-by-hop scene reconstruction with LLMs for network fault diagnosis. It uses a dual-layer knowledge graph and a temporal causal propagation attention mechanism to trace back along propagation paths, improving accuracy and reducing hallucinations.
This paper introduces compatibility and incompatibility scores for evaluating collections of bivariate causal statements without relying on faithfulness, and demonstrates their applicability by analyzing causal claims from large language models.
The BEAMS Initiative presents a benchmark suite for evaluating AI tools in modeling and simulation, focusing on human-centered and responsible AI practices. Tests reveal variability across LLM-based engines, which perform better on qualitative tasks than on causal reasoning.
Introduces SVI-Bench, a large-scale benchmark for strategic video intelligence using team sports, designed to evaluate models on dynamic scene understanding, causal reasoning, strategic simulation, and agentic synthesis. The benchmark reveals a capability cliff where models perform well on perceptual tasks but sharply degrade on higher-level strategic reasoning.
This paper argues that large language models struggle with causal reasoning and long-horizon planning due to a mismatch between sequence prediction and reasoning over latent environment dynamics, and introduces the Latent Dynamics Inference perspective along with the Flux environment to study these limitations.
This article argues that fundamental architectural limitations, not scaling deficits, prevent current LLMs from achieving true rationality, defined as the ability to recognize and switch frames. It cites empirical failures such as the reversal curse and frame-transfer issues, and suggests that scaling alone may not bridge this gap.
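This paper proposes and evaluates a class of causal-reasoning world models called event-graph substrates, which answer counterfactual queries by deterministic replay over typed RDF event logs. They outperform baseline models on multiple benchmarks while guaranteeing inspectability and replay consistency.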
Google's new paper, Nexus, proposes recasting time series forecasting from statistical extrapolation into multi-agent reasoning that uses event context to improve prediction accuracy, reporting an 86.6% reduction in mean absolute percentage error (MAPE) on the Zillow dataset.
This article introduces ReplaySCM, a benchmark designed to evaluate language models' ability to induce executable causal mechanisms from interventional evidence, focusing on semantic replay behavior rather than syntactic matches.
This paper identifies a critical 'model collapse' issue in standard fine-tuning for causal reasoning and proposes a semantic loss function with graph-based logical constraints to prevent it.