New research from Cornell University shows that a single snippet of user-generated text, as short as 13 words, from sites such as Reddit or Wikipedia can be used to manipulate AI search tools such as ChatGPT and Google AI Search, highlighting a growing vulnerability in AI-powered information retrieval.
This paper introduces a claim-centric auditing framework for identifying error spans in deep-research agent trajectories, along with a new benchmark, TELBench, to improve process-level reliability assessment.
DR³-Eval is a benchmark for evaluating deep-research agents on multimodal, multi-file report generation. It combines a realistic simulated web environment with a comprehensive evaluation framework that measures information recall, factual accuracy, citation coverage, instruction following, and depth quality.