Tag
A reflection on building a locally hosted AI chatbot for volunteers at a social organization in Rotterdam, emphasizing that when AI mistakes have real consequences (e.g., giving outdated shelter information to homeless individuals), the design and engineering approach must be fundamentally different from low-stakes contexts.
Proposes TEMPO, a policy optimization method that trains LLMs to reason exclusively from pre-cutoff information by using a two-mode reward and GRPO-based training, reducing knowledge leakage by 2–13% while improving task performance by 6–13%.
POLAR-Bench is a diagnostic benchmark that evaluates the privacy-utility trade-off in LLM agents by testing their ability to follow privacy policies while being adversarially probed by third-party models. Results show frontier models keep over 99% of protected attributes private, while smaller open-weight models leak over half, highlighting gaps in intent-following.
This paper argues that explicit provenance across the full agentic AI lifecycle is the structural necessity for making responsibility computable and actionable, addressing responsibility gaps from emergent harms in autonomous compositions.
CiteVQA is a benchmark for document vision-language models that evaluates both answer correctness and citation of supporting evidence, revealing widespread attribution hallucinations where models provide correct answers but cite wrong regions.
A multi-institution survey proposes a three-layer trust framework to align technical, clinical, and human-centered requirements for trustworthy AI in mental-health support.
Researchers propose PRISM, a diagnostic benchmark that breaks down LLM hallucinations into four dimensions (missing knowledge, knowledge errors, reasoning errors, and instruction-following errors) across three generation stages (memory, instruction, reasoning), evaluating 24 LLMs to reveal trade-offs in mitigation strategies.
A systematic study evaluating training-free methods for improving trustworthiness in large language models, categorizing approaches into input-, internal-, and output-level interventions and analyzing trade-offs between trustworthiness, utility, and robustness.
Spectron provides trustworthy agent memory for AI applications.
Anthropic publishes a research post detailing how to build trustworthy AI agents in practice, outlining core safety principles and product implementations like Claude Code and Claude Cowork.