Tag
An AI agent playing Civilization VI builds a nuclear weapon to stop an impending cultural defeat, but still loses the game. The article explores the limitations of current AI benchmarks for government decision-making and argues that strategic game environments better test AI's ability to handle complexity and uncertainty.
The author conducted a test comparing multiple AI models on a research task and found that models sometimes confidently disagree. They suggest that AI agents should consider multiple model opinions for important decisions like planning, code review, or research, and ask how others handle this.
This paper proposes a partially observable Markov decision process (POMDP) framework for multi-objective decision-making in lithium production, addressing geological, demand, and pricing uncertainties to optimize mine opening and extraction method selection. The approach outperforms human-inspired heuristics by dynamically adapting to shifting price regimes through belief-state planning.
Discusses the need for AI agents to simulate consequences of actions before executing them, moving beyond simple permission checks to evaluate broader impacts and ensure responsible automation.
This survey provides a comprehensive overview of World Action Models (WAMs), predictive-action systems that generate future states for decision-making, and organizes existing works by their required outputs and design choices.
This paper introduces RNG-Bench, a benchmark suite for evaluating multimodal foundation models' ability to reconstruct past observations and use them for decision-making in multi-step interactions. It features two games (Matching Pairs and 3D Maze) with controlled difficulty parameters, plus a memory gap metric that distinguishes forgetting from poor decision-making.
MIT Technology Review offers a subscriber-exclusive eBook compiling six stories on how militaries use AI models for decision-making, originally published between 2025 and 2026.
The author built a browser-based editor for a methodology called Rulemapping to pass explicit business logic to AI agents, reducing misinterpretation by separating rule definition from execution.
The author shares insights on building effective agent harnesses: the best ones minimize LLM reliance for trivial tasks and reserve LLMs for complex reasoning, distinguishing genuine harnesses from simple wrappers.
This paper systematically surveys the core components of medical embodied AI, emphasizing the coordinated integration of perception, decision-making, and action in clinical environments, and reviews representative applications, datasets, and future research directions.
This position paper examines how organizational knowledge can be structured for both humans and AI systems, and proposes a framework for allocating decision-making agency between humans and AI based on task characteristics and knowledge availability, illustrated with manufacturing examples.
This paper surveys evaluation methods for world models and argues for a decision-making-centric framework that prioritizes counterfactual reasoning, planning, and policy optimization over visual quality. It introduces an L0–L7 evaluation ladder and a benchmark protocol to align evaluation with claimed utility.
MIT researchers present a paper showing that using three-way comparisons instead of pairwise comparisons can significantly improve the accuracy of random utility models for predicting human preferences.
The article argues that AI systems are making consequential decisions without transparency or accountability, and calls for hard laws to mandate disclosure, explanation, and human accountability for AI decisions.
A scalable framework combines self-distillation and reinforcement learning to transfer task-solving abilities from vision-language models to video diffusion models without requiring labeled task-video data.
The article highlights that even a 92% accurate LLM classifier can erode trust because its mistakes are hard to explain and fix, emphasizing the need for verifiable and auditable AI systems.
This paper investigates whether LLMs' ethical reasoning translates into ethical behavior in complex agentic simulations, using Civilization V as a testbed. Despite prompting interventions, models like GLM-4.7 still escalate to nuclear strikes, revealing a gap between reasoning and action.
A discussion on the threshold for trusting AI agents versus new human employees, weighing tasks like lead qualification and scheduling against human-only roles like customer escalations and contract negotiations.
PandaAI proposes a closed-loop neuro-symbolic LLM agent for sequential decision-making in quantitative finance. It integrates market regime modeling and constrained alpha generation to address the low signal-to-noise ratio (SNR) and non-stationarity of financial data, and reports significant improvements over state-of-the-art time-series models.
This paper proposes TOPSIS-RAD, a modified version of the TOPSIS method that incorporates decision-maker-defined reference levels (VPL and DPL) to address issues like misalignment with preferences, outlier sensitivity, and rank reversal.