trustworthy-ai

#trustworthy-ai

Building AI where mistakes matter

Reddit r/AI_Agents ↗ · 2026-05-21

A reflection on building a locally hosted AI chatbot for volunteers at a social organization in Rotterdam, emphasizing that when AI mistakes have real consequences (e.g., giving outdated shelter information to homeless individuals), the design and engineering approach must be fundamentally different from low-stakes contexts.

TEMPO: Temporal Enforcement via Mode-Separated Policy Optimization for Trustworthy LLM Backtesting

arXiv cs.LG ↗ · 2026-05-20

Proposes TEMPO, a policy optimization method that trains LLMs to reason exclusively from pre-cutoff information by using a two-mode reward and GRPO-based training, reducing knowledge leakage by 2–13% while improving task performance by 6–13%.

POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

arXiv cs.AI ↗ · 2026-05-20

POLAR-Bench is a diagnostic benchmark that evaluates the privacy-utility trade-off in LLM agents by testing their ability to follow privacy policies while being adversarially probed by third-party models. Results show frontier models protect over 99% of protected attributes but smaller open-weight models leak over half, highlighting gaps in intent-following.

Responsible Agentic AI Requires Explicit Provenance

arXiv cs.AI ↗ · 2026-05-19

This paper argues that explicit provenance across the full agentic AI lifecycle is structurally necessary for making responsibility computable and actionable, addressing responsibility gaps from emergent harms in autonomous compositions.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

Hugging Face Daily Papers ↗ · 2026-05-13

CiteVQA is a benchmark for document vision-language models that evaluates both answer correctness and citation of supporting evidence, revealing widespread attribution hallucinations where models provide correct answers but cite wrong regions.

Aligning Human-AI-Interaction Trust for Mental Health Support: Survey and Position for Multi-Stakeholders

arXiv cs.CL ↗ · 2026-04-23

A multi-institution survey proposes a three-layer trust framework to align technical, clinical, and human-centered requirements for trustworthy AI in mental-health support.

PRISM: Probing Reasoning, Instruction, and Source Memory in LLM Hallucinations

arXiv cs.CL ↗ · 2026-04-21

Researchers propose PRISM, a diagnostic benchmark that breaks down LLM hallucinations into four dimensions (missing knowledge, knowledge errors, reasoning errors, and instruction-following errors) across three generation stages (memory, instruction, reasoning). It evaluates 24 LLMs to reveal trade-offs in mitigation strategies.

A Systematic Study of Training-Free Methods for Trustworthy Large Language Models

arXiv cs.CL ↗ · 2026-04-20

A systematic study evaluating training-free methods for improving trustworthiness in large language models, categorizing approaches into input, internal, and output-level interventions while analyzing trade-offs between trustworthiness, utility, and robustness.

Spectron

Product Hunt ↗ · 2026-04-09

Spectron provides trustworthy agent memory for AI applications.

Trustworthy agents in practice (Policy, Apr 9, 2026)

Anthropic Research ↗ · 2026-05-08

Anthropic publishes a research post detailing how to build trustworthy AI agents in practice, outlining core safety principles and product implementations like Claude Code and Claude Cowork.
