Tag
This paper introduces structural uncertainty, a framework that evaluates LLM reasoning consistency by measuring the stability of self-preference rankings among sampled reasoning solutions, complementing traditional answer-dispersion methods for identifying unreliable reasoning.
This paper analyzes the thinking-answer inconsistency in multimodal reinforcement learning with verifiable rewards (RLVR) for large vision-language models and proposes CORA, a method that introduces a consistency reward model and hybrid reward advantage splitting to improve faithfulness and task performance.
PermaVid introduces a multi-modal context memory that disentangles appearance and geometric structure to maintain long-term video consistency after editing operations, outperforming prior methods.
This paper investigates whether different LLMs share common inference patterns when predicting the same token, using interaction-based explanations. Results show that advanced LLMs exhibit consistent interaction patterns, suggesting implicit optimization toward shared inference mechanisms.
Discusses memory hygiene in AI agents, an overlooked problem in which long-term storage accumulates stale, unreliable context, and asks whether the industry is ignoring a looming global issue.
This paper proposes a neuro-symbolic framework for constructing ontology-grounded knowledge graphs from text by deferring consistency corrections to a post-extraction stage, reducing token usage while improving KG consistency and maintaining QA performance.
WBench is a comprehensive multi-turn benchmark for evaluating interactive world models across five dimensions using 289 test cases and 1,058 interaction turns, providing automatic sub-metrics and diagnostic insights. It reveals that no single model excels across all dimensions.
An opinion piece arguing that LLMs perform better with boring, consistent languages and ecosystems (like Ruby on Rails) because the training corpus has lower variance, leading to more reliable agentic output, while fragmented ecosystems (like JavaScript) produce poor results.
A reflection on the gap between impressive AI agent demos and dependable real-world execution, arguing that current agents excel at structured tasks but fail under unpredictable conditions, suggesting near-term AI roles will focus on narrow automation with human oversight.
Presents S-Bus, an HTTP middleware that uses a DeliveryLog mechanism to automatically reconstruct read sets and enforce Observable-Read Isolation consistency, preventing structural race conditions in multi-agent LLM coordination.
The article introduces A²RD, a novel architecture for generating consistent long videos using agentic autoregressive diffusion. It proposes a Retrieve–Synthesize–Refine–Update cycle and a new benchmark, LVBench-C, to address semantic drift in long-horizon video synthesis.
A detailed tutorial on four methods for maintaining character consistency and plot coherence in AI short dramas made with Seedance 2.0 and GPT-image2: extending reference videos, using a keyframe as the first frame, compositing multiple video segments, and converting storyboards to video.
OpenAI released an upgraded image model that keeps character appearance perfectly consistent across frames and renders crisp, stable text.