Tag
This arXiv preprint challenges the 'Garbage In, Garbage Out' heuristic, arguing that aggressive manual data cleaning can limit predictive performance in high-dimensional tabular data by reducing dimensionality needed to triangulate latent drivers.
This paper introduces Vigil, an evaluation framework for embodied agents that disentangles task execution success from the agent's ability to correctly recognize and report task completion.
This paper introduces a critique-and-routing controller for multi-agent LLM systems that formulates coordination as a sequential decision problem. It uses policy gradients to optimize the controller for iterative refinement, outperforming baselines while reducing reliance on top-tier models.
This paper presents a structured framework for benchmarking generative, multimodal, and agentic AI in healthcare, addressing the gap between high benchmark scores and real-world clinical reliability, safety, and relevance.
This paper introduces PathBoost, a gradient tree boosting method for graph-level prediction that uses path-based features to compete with graph neural networks while offering better interpretability.
This paper introduces EnvSimBench, a benchmark for evaluating Large Language Models' ability to simulate environments for agent training. It identifies a 'state change cliff' in current LLMs and proposes a constraint-driven pipeline to reduce hallucinations and costs.
This paper proposes AGWM, an affordance-grounded world model that uses a dynamic prerequisite graph to track action executability in environments with compositional prerequisites. Experiments show it reduces prediction error and improves generalization compared to standard world models.
This paper introduces BeliefMem, a novel memory paradigm for LLM agents that stores multiple candidate conclusions with probabilities to handle partial observability and reduce self-reinforcing errors. Empirical evaluations show it outperforms deterministic baselines on LoCoMo and ALFWorld benchmarks.
This paper introduces Adaptive Q-Chunking (AQC), a reinforcement learning method that dynamically selects action chunk sizes to balance reactive control and long-horizon planning. It achieves state-of-the-art results on OGBench and Robomimic, enhancing the performance of large-scale VLA models in robotics tasks.
This paper introduces WARDEN, a distributionally robust adversarial training framework for large language models that uses f-divergence to dynamically reweight adversarial examples, significantly reducing attack success rates while maintaining computational efficiency.