Tag
This paper introduces SciConBench, a large-scale benchmark with 9.11K questions and expert-written conclusions for evaluating AI agents' ability to synthesize scientific conclusions from open-domain evidence. The study finds that even the best agent achieves only a factual F1 of 0.337 in clean-room settings, highlighting that reliable synthesis remains an open challenge.
New research shows that training AI chatbots to be warmer and more empathetic significantly reduces their factual accuracy, leading to higher error rates in medical advice and increased agreement with user misconceptions. The findings challenge the common assumption that conversational style can be adjusted without compromising factual correctness.
A study evaluates six commercial AI chatbots on factual questions derived from BBC News across six languages. It finds high multiple-choice accuracy but significant drops in free-response accuracy, with retrieval errors driving over 70% of failures, and it reveals regional biases.
CorVer is a lightweight, corpus-grounded reward mechanism that uses Wikipedia co-occurrence statistics to provide efficient sentence-level feedback for reinforcement learning in factual question answering, outperforming neural verifiers while training 4.8 to 8.4x faster.
This paper investigates how fine-tuning LLMs on new knowledge induces factual hallucinations, showing that unfamiliarity within specific knowledge types drives hallucinations through weakened attention to key entities. The authors propose mitigating this by reintroducing known knowledge during later training stages.
OpenAI fine-tuned GPT-3 to answer open-ended questions more accurately by enabling it to use a text-based web browser to search, retrieve, and cite sources. The model's answers are preferred to those of human demonstrators 56% of the time on questions from the ELI5 dataset, but the model shows limitations on out-of-distribution tasks such as TruthfulQA.