factuality

#factuality

ConflictScore: Identifying and Measuring How Language Models Handle Conflicting Evidence

arXiv cs.CL · 2d ago

ConflictScore is a new metric that quantifies how well language models acknowledge conflicting evidence in their grounding documents. It decomposes responses into atomic claims and measures conflict balance. The paper also introduces ConflictBench, a benchmark covering diverse forms of conflict, and shows that the metric can improve truthfulness on TruthfulQA.

@FinanceYF5: 3/ Improved Accuracy: GPT-5.5 Instant shows significant improvements in factual accuracy, particularly in fields with high accuracy requirements such as medicine, law, and finance.

X AI KOLs Following · 2026-05-10

The post claims that GPT-5.5 Instant shows significant improvements in factual accuracy, particularly in high-stakes fields such as medicine, law, and finance.

MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

arXiv cs.CL · 2026-04-20

MoshiRAG combines a compact full-duplex speech language model with asynchronous retrieval-augmented generation to improve factuality while maintaining real-time interactivity. The approach leverages natural temporal gaps in conversation to retrieve external knowledge without disrupting the natural flow of dialogue.

FACTS Benchmark Suite: Systematically evaluating the factuality of large language models

Google DeepMind Blog · 2025-12-09

Google DeepMind and Kaggle have launched the FACTS Benchmark Suite, a comprehensive set of evaluations including parametric, search, multimodal, and grounding benchmarks to systematically measure the factuality of large language models.

FACTS Grounding: A new benchmark for evaluating the factuality of large language models

Google DeepMind Blog · 2024-12-17

DeepMind introduces FACTS Grounding, a comprehensive benchmark with 1,719 examples for evaluating how accurately large language models ground their responses in source material and avoid hallucinations. The benchmark includes a public dataset and an online Kaggle leaderboard tracking LLM performance on factual accuracy and grounding tasks.

Introducing SimpleQA

OpenAI Blog · 2024-10-30

OpenAI introduces SimpleQA, a factuality benchmark of 4,326 short fact-seeking questions designed to evaluate whether frontier language models can answer accurately without hallucinating. Dual independent annotation and rigorous criteria keep the estimated error rate to about 3%, and GPT-4o scores less than 40%.
