abstention

#abstention

Ask Before You Diagnose: Safe-Psych, a Sequential Evaluation Benchmark for LLMs in Psychiatry

arXiv cs.AI · 4d ago

This paper introduces Safe-Psych, a sequential benchmark for evaluating how large language models handle diagnostic uncertainty in psychiatry. It finds that even strong models often fail to abstain or seek clarification when clinical evidence is incomplete.


AgentAbstain: Do LLM Agents Know When Not to Act?

arXiv cs.AI · 6d ago

This paper presents AgentAbstain, the first systematic evaluation framework for LLM agents' ability to abstain from acting when appropriate, including a benchmark of 263 paired tasks across 8 abstention scenarios. The best agent achieves only 59.5% paired accuracy, revealing a critical gap independent of general task-solving capability.
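
The summary does not define "paired accuracy". As a rough sketch, assuming a pair counts as correct only when the agent acts on the variant where acting is appropriate and abstains on its counterpart, the metric could be computed as below. The `PairResult` type, its field names, and the sample data are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    """Outcome for one hypothetical task pair (field names are illustrative)."""
    acted_on_actionable: bool   # agent completed the variant where acting is appropriate
    abstained_on_abstain: bool  # agent declined the variant where abstaining is appropriate


def paired_accuracy(results: list[PairResult]) -> float:
    """Fraction of pairs in which BOTH variants were handled correctly (assumed definition)."""
    if not results:
        raise ValueError("need at least one pair")
    correct = sum(r.acted_on_actionable and r.abstained_on_abstain for r in results)
    return correct / len(results)


# Made-up outcomes for four pairs; only the first and last are fully correct.
results = [
    PairResult(True, True),
    PairResult(True, False),
    PairResult(False, True),
    PairResult(True, True),
]
print(paired_accuracy(results))
```

Under this assumed definition, an agent that always acts or always abstains scores zero, which is one way such a metric could separate abstention ability from general task-solving capability.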


Agentic Abstention: Do Agents Know When to Stop Instead of Act?

arXiv cs.AI · 2026-06-30

This paper defines agentic abstention, the problem of deciding when an LLM agent should stop acting under uncertainty, and evaluates it across web shopping, terminal environments, and question answering. It introduces convolve, a context engineering method that improves timely abstention without updating model parameters.


Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models

arXiv cs.CL · 2026-06-26

This paper introduces Know2Guess, a contamination-aware multi-zone benchmark designed to evaluate the transition from answerable knowledge to expected abstention in large language models, addressing data contamination, prompt sensitivity, and refusal behavior. The authors assess FLAN-T5, Qwen2.5-Instruct, and Llama-3-Instruct models, finding that stronger models show selective but incomplete abstention. The benchmark and dataset are publicly released.


OpenBioRQ: Unsolved Biomedical Research Questions for Agents

Hugging Face Daily Papers · 2026-06-20

OpenBioRQ is a new benchmark of 12,553 unsolved biomedical research questions that tests agentic models' ability to verify sources and avoid false citations. It reveals that current models often link to wrong papers and suffer from agentic collapse on hard questions.


Our ICML paper on predictable hallucination (information-budget abstention gate), + ntkMirror: a training-free open-weight implementation we're releasing today

Reddit r/LocalLLaMA · 2026-06-09

A paper accepted at ICML 2026 addresses predictable hallucination with an information-budget abstention gate. The authors also release ntkMirror, a training-free open-weight implementation that reduces hallucination by abstaining when information is insufficient, reporting 0.0–0.7% hallucination at ~24% abstention.


What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents

arXiv cs.AI · 2026-06-03

This paper argues that current benchmarks for autonomous agents fail to evaluate whether an agent should have proceeded at all, which introduces a 'compliance bias'. The authors propose a taxonomy of abstention-warranted scenarios and new evaluation protocols (Safety Rate, Usability Rate, Informed Refusal Rate), with preliminary results showing tunable safety–usability tradeoffs across model families.


Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

Hugging Face Daily Papers · 2026-05-28

The paper introduces SpatialUncertain, a benchmark to evaluate whether vision-language models recognize when they cannot answer spatial questions due to occlusion or perspective ambiguity, revealing overconfidence and poor abstention behavior.


Fair and Calibrated Toxicity Detection with Robust Training and Abstention

arXiv cs.LG · 2026-05-15

This paper studies fairness in toxicity classification across three axes: ranking, calibration, and abstention. It compares ERM, reweighted ERM, and Group DRO methods with post-hoc interventions, finding that calibration disparity is a hidden fairness violation and that abstention itself can be unfair.


NoisyCoconut: Counterfactual Consensus via Latent Space Reasoning

arXiv cs.LG · 2026-05-12

This paper introduces NoisyCoconut, an inference-time method that improves LLM reliability by injecting noise into latent trajectories to generate diverse reasoning paths. The approach enables models to abstain when uncertain, significantly reducing error rates in mathematical reasoning tasks without requiring retraining.
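
The summary does not say how agreement across reasoning paths is turned into an abstention decision. The sketch below shows only a generic consensus-or-abstain step over final answers, using plain majority agreement. It does not implement the paper's latent-space noise injection, and the function name and the `min_agreement` threshold are assumptions made for illustration.

```python
from collections import Counter
from typing import Optional


def consensus_or_abstain(answers: list[str], min_agreement: float = 0.6) -> Optional[str]:
    """Return the majority answer if enough paths agree, otherwise None (abstain).

    `answers` holds one final answer per sampled reasoning path; how those
    paths are generated is outside the scope of this sketch.
    """
    if not answers:
        return None  # nothing to vote on, so abstain
    top_answer, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= min_agreement:
        return top_answer
    return None


# Four of five hypothetical paths agree, so the answer is kept.
print(consensus_or_abstain(["42", "42", "42", "42", "41"]))
# No answer reaches the threshold, so the function abstains.
print(consensus_or_abstain(["1", "2", "3", "1"]))
```

Raising `min_agreement` trades coverage for a lower error rate: more questions go unanswered, but the answers that remain have stronger agreement behind them.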
