Tag
Report claims that GPT-5.5 Instant shows significant improvements in factual accuracy, particularly in high-stakes fields like medicine, law, and finance.
MoshiRAG combines a compact full-duplex speech language model with asynchronous retrieval-augmented generation to improve factuality while maintaining real-time interactivity. The approach uses the natural temporal gaps in conversation to retrieve external knowledge without disrupting the flow of dialogue.
Google DeepMind and Kaggle have launched the FACTS Benchmark Suite, a comprehensive set of evaluations including parametric, search, multimodal, and grounding benchmarks to systematically measure the factuality of large language models.
DeepMind introduces FACTS Grounding, a comprehensive benchmark with 1,719 examples for evaluating how accurately large language models ground their responses in source material and avoid hallucinations. The benchmark includes a public dataset and an online Kaggle leaderboard tracking LLM performance on factual accuracy and grounding tasks.
OpenAI introduces SimpleQA, a factuality benchmark dataset of 4,326 short fact-seeking questions designed to evaluate how accurately frontier language models answer without hallucinating. The dataset's quality rests on dual independent annotation and rigorous criteria, giving an estimated error rate of only ~3%; GPT-4o scores less than 40%.