UsefulBench: Towards Decision-Useful Information as a Target for Information Retrieval
Summary
UsefulBench introduces a domain-specific benchmark dataset that distinguishes between document relevance and usefulness for information retrieval, showing that similarity-based IR systems conflate these concepts while LLMs can address this but lack domain expertise.
# UsefulBench: Towards Decision-Useful Information as a Target for Information Retrieval

Source: https://arxiv.org/html/2604.15827

Tobias Schimanski1,2, Stefanie Lewandowski3, Christian Woerle4, Nicola Reichenau3, Yauheni Huryn4, Markus Leippold1,5

1 University of Zurich; 2 ETH Zurich; 3 score4more GmbH; 4 Climate+Tech Think Tank; 5 Swiss Finance Institute (SFI)

###### Abstract

Conventional information retrieval focuses on identifying the **relevance** of texts for a given query. Yet, the conventional definition of relevance is dominated by aspects of text similarity, leaving unaddressed whether the text is truly **useful** for addressing the query. For instance, when answering whether Paris is larger than Berlin, texts about Paris being in France are relevant (lexical/semantic similarity), but not useful. In this paper, we introduce **UsefulBench**, a domain-specific dataset curated by three professional analysts labeling whether a text is connected to a query (relevance) or holds practical value in responding to it (usefulness). We show that classic similarity-based information retrieval aligns more strongly with **relevance**. While LLM-based systems can counteract this bias, we find that domain-specific problems require a high degree of expertise, which current LLMs do not fully incorporate. We explore approaches to (partially) overcome this challenge. However, **UsefulBench** presents a dataset challenge for targeted information retrieval systems.

## 1 Introduction

Modern Large Language Models (LLMs) have become increasingly capable of solving a wide range of tasks like coding and math (e.g., GPT-5 Team, 2025; Yang et al., 2025; Llama3 Team, 2024). Using LLMs for question answering tasks usually comes with concerns about hallucination and outdated information. To mitigate these concerns, a wide range of applications employ retrieval augmented generation (RAG), where an LLM is provided with external context to answer a question (Lewis et al., 2021).
However, with the emergence of RAG, an ongoing debate has also arisen about its reliability in various use cases (Liu et al., 2023; Schimanski et al., 2024a; Xian et al., 2025). One major bottleneck in reliable RAG is the retrieval process. Prior research has shown that large contexts can distort QA performance (Liu et al., 2024). At the same time, highly ranked documents that do not contain an answer, or that contain misleading evidence, can substantially degrade QA performance (Cuconasu et al., 2024; Xian et al., 2025; Zeng et al., 2026). This motivates a differentiation of documents into two categories: **relevant** and **useful**. Relevant documents contain query-related information (which may help contextualize an answer), while useful documents contain information that directly answers the query (see Figure 1).

**Figure 1:** Relevance and usefulness examples. Documents can be highly relevant, yet not useful in producing an answer to the query. Highly relevant, yet non-useful documents can harm answer quality (see e.g., Cuconasu et al., 2024; Xian et al., 2025).

Yet, conventional information retrieval systems lean towards what we define as relevance. They rank documents according to lexical or semantic similarity (e.g., Robertson and Zaragoza, 2009; Santhanam et al., 2022), which we confirm in this paper (Section 4.3). At the same time, prior research has shown that advanced similarity-based embedding systems can fail on simple queries due to unavoidable theoretical limitations (Weller et al., 2025). To overcome some of these concerns, prior research has suggested using LLM-based relevance annotators (Es et al., 2024; Saad-Falcon et al., 2024), also incorporating degrees of relevance (Ni et al., 2025). However, to the best of the authors' knowledge, no project has aimed to disentangle relevant and useful documents.
For this reason, we introduce **UsefulBench**, a domain-specific dataset that contains documents annotated by professional analysts for degrees of relevance and usefulness. Specifically, we focus on the domain of sustainability reporting, where queries often cover a wide range of relevant and decision-useful content. For instance, the topic of the query "What are a firm's CO2 emissions?" may be discussed at length in a sustainability report, for example by mentioning the importance of emissions, how the firm defines them, or background information on their origin. However, only the concrete CO2 emissions number is **useful** information (see Figure 1).

To create **UsefulBench**, we employ an expert analysis process by equipping three professional sustainability analysts with real-world queries and query descriptions. The analysts search through firms' sustainability reports and identify relevant and useful documents. Each identified document is annotated as either non-relevant, partially relevant, or fully relevant, and as either non-useful, partially useful, or fully useful (see Figure 2). As a result, we obtain a dataset covering 1,110 annotated documents across 15 sustainability reports using 64 different queries (**UsefulBench-gold**). We also extend **UsefulBench-gold** to a report-level dataset covering all documents in a report, with and without relevance and usefulness, which yields a dataset of 53K report-query-document triplets (**UsefulBench-full**).

Using **UsefulBench**, we show that relevance and usefulness are naturally connected, as expected. However, there remain clear differences. Conventional similarity-based IR systems align more with our definition of relevance. While LLM-based systems can counteract some of this bias and effectively improve in detecting usefulness, we observe an early upper bound of performance.
When two professional analysts investigate the misclassifications, they find that LLM-based judgments do not fully incorporate expert knowledge and may misinterpret given query descriptions. We further show that among a set of remedies (providing examples, refining query descriptions, prompting techniques, and finetuning), gains are often only partial and trade off classification against calibration. The most promising approaches directly incorporate expert knowledge in the classification process. However, accurate judgments for these domain-specific queries remain a challenge.

Collectively, our contributions are:

- We introduce **UsefulBench**, a dataset annotated by three professional analysts to differentiate relevance and usefulness of documents to a query.
- We show that, while relevance and usefulness are entangled, there are clear differences, and conventional IR systems align more towards relevance.
- LLM-based systems that integrate expert knowledge can serve as a partial remedy, but ultimately still suffer from a lack of expert knowledge.

**Figure 2:** UsefulBench creation pipeline. Three professional analysts search for documents (text passages) that are relevant and useful for a given query and query description. The analysts then consolidate their findings in a consensus discussion to determine the final labels.

## 2 Literature Background

**Retrieval and relevance.** Information retrieval (IR) has traditionally focused on identifying documents that are **relevant** to a query. In classical IR, relevance is approximated through lexical matching (e.g., in BM25 (Robertson and Zaragoza, 2009)). More recent neural retrievers replace lexical overlap with semantic similarity, for example, in DPR (Karpukhin et al., 2020) or ColBERTv2 (Santhanam et al., 2022). Benchmarks such as BEIR have standardized the comparison of lexical and dense retrieval models across domains (Thakur et al., 2021).
However, across these settings, relevance is still largely operationalized as topical or semantic relatedness, which does not guarantee that a document contains the information needed to answer a query. This limitation has become more pronounced with recent work showing theoretical and empirical limitations of embedding-based retrieval for certain query types (Weller et al., 2025).

**Retrieval for RAG.** The rise of retrieval-augmented generation (RAG) has made this limitation more consequential, because retrieval errors directly affect answer generation. Prior work shows that passages that are topically related but do not contain the answer can harm downstream performance (Cuconasu et al., 2024; Zeng et al., 2025), while long contexts make it harder for language models to identify and use the right evidence (Liu et al., 2024). Similar performance decreases have been shown when the evidence contains contradictory or malicious documents (Xian et al., 2025; Zeng et al., 2026). At the same time, RAG systems suffer when no useful evidence for QA is present at all (Peng et al., 2025). These challenges are particularly salient in domain-specific settings such as climate and sustainability analysis, where systems must retrieve from long technical reports (Vaghefi et al., 2023; Ni et al., 2023).

**LLM-based relevance annotation.** A related line of work uses LLMs to evaluate or annotate retrieval quality. RAGAS and ARES introduce automated frameworks for evaluating retrieval and generation components in RAG pipelines (Es et al., 2024; Saad-Falcon et al., 2024). At the same time, prior work has outlined challenges in domain-specific retrieval (Ni et al., 2025; Schimanski et al., 2024b). Generally, LLM-based retrieval systems improve retrieval quality at the price of more computational effort (Ni et al., 2025; Zhang et al., 2025). LLMs can work out more fine-grained definitions of relevance, which lean towards what we define as usefulness.
However, no prior work has taken a stance on an active differentiation between documents that are **relevant** to a query and documents that are **useful** for answering it. This distinction matters especially in knowledge-intensive domains, where many passages discuss the queried topic broadly, but only a small subset provides actionable, decision-useful evidence.

## 3 UsefulBench

This project aims to differentiate relevance and usefulness in information retrieval processes. Since, to the best of the authors' knowledge, no dataset explicitly distinguishes between these two concepts, we introduce **UsefulBench**.

### 3.1 Data Creation

At the core of **UsefulBench** is the distinction between the conventional understanding of relevance and the concept of usefulness. Relevance is typically defined through lexical or semantic similarity between a query and a document. Usefulness, in contrast, reflects whether the information contained in a document contributes directly to answering the query or supporting a decision.

We argue that the distinction between relevance and usefulness becomes particularly important in domain-specific settings that require deep contextual understanding (Szymanski et al., 2025; Ni et al., 2025). For this reason, we focus on the sustainability domain. Sustainability and climate disclosures are inherently qualitative and complex. Even widely used indicators such as emissions can be interpreted in multiple ways. When simplifying a query to "What are a firm's CO2 emissions?", sustainability reports typically contain a large amount of relevant background information, such as descriptions of emission reduction strategies, methodological explanations, or discussions of climate risks. While such passages are clearly relevant, only a small subset of documents contains the exact emission numbers required to answer the query. Consequently, relevant information can overshadow truly useful information (see also Figure 1).
To reflect a realistic expert workflow, the dataset is constructed based on a professional sustainability analysis process. Three professional sustainability analysts examine firms' sustainability reports using real-world queries accompanied by descriptions (Figure 5). Examples include information regarding energy efficiency, carbon footprints, or green IT. Overall, the analysts examine 15 sustainability reports of firms from 10 different industries. Each report is analyzed with, on average, 20.5 queries (standard deviation: 8). This follows a structure where each company is evaluated based on seven industry-specific and up to 14 industry-agnostic queries. For each query, the analysts read the entire report and identify passages that contain potentially relevant or decision-useful information. On average, 3.3 documents are annotated per query.

The analysts follow the definitions below when retrieving passages (the analysts call a query "criteria"; detailed prompts in Figures 6 and 7):

- **(High) Relevance:** The information has a strong and direct connection to the criteria. It contains specific keywords, concepts, or themes that are central to the analysis of the criteria. The content is explicitly describing the criteria or highly related and on-topic.
- **(High) Usefulness:** The information provides significant practical value. It contains clear, actionable insights, specific numbers, concrete achievements, solutions, or detailed plans that can be directly used. The content offers tangible value for understanding the company's efforts related to the criteria.

These definitions reflect the central distinction of this paper. While usefulness generally implies relevance, the reverse does not necessarily hold: information can be highly relevant yet provide little value for answering the query.
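
The two-axis label scheme, and the gap between keyword similarity and usefulness, can be illustrated with a small sketch. The Python below is a toy illustration under stated assumptions, not the authors' pipeline: the names `Level`, `AnnotatedPassage`, `overlap_score`, and `relevant_but_not_useful` are hypothetical, both passages and the emission figure in them are invented, and plain token overlap stands in for a real lexical retriever such as BM25.

```python
import re
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Three-step scale, applied separately to relevance and usefulness."""
    NONE = 0
    PARTIAL = 1
    FULL = 2


@dataclass(frozen=True)
class AnnotatedPassage:
    """One query-passage pair with both labels (hypothetical schema)."""
    query: str
    text: str
    relevance: Level
    usefulness: Level


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, text: str) -> float:
    """Fraction of query tokens found in the passage (toy lexical similarity)."""
    q = tokens(query)
    return len(q & tokens(text)) / len(q) if q else 0.0


def relevant_but_not_useful(passages: list[AnnotatedPassage]) -> list[AnnotatedPassage]:
    """Passages that are on-topic but carry no value for answering the query."""
    return [
        p for p in passages
        if p.relevance >= Level.PARTIAL and p.usefulness == Level.NONE
    ]


QUERY = "What are a firm's CO2 emissions?"

# Invented example passages; the emission figure is made up for illustration.
background = AnnotatedPassage(
    query=QUERY,
    text="CO2 emissions are a key topic for our firm, and we explain what drives them.",
    relevance=Level.FULL,
    usefulness=Level.NONE,
)
figure = AnnotatedPassage(
    query=QUERY,
    text="Our CO2 emissions in 2023 were 12,400 tonnes.",
    relevance=Level.FULL,
    usefulness=Level.FULL,
)
passages = [figure, background]

# Rank by lexical overlap only, highest score first.
ranked = sorted(passages, key=lambda p: overlap_score(QUERY, p.text), reverse=True)

for p in ranked:
    print(f"{overlap_score(QUERY, p.text):.2f}  useful={p.usefulness.name}  {p.text}")
```

In this toy setup, the keyword-heavy background passage ranks above the passage that actually states the number, which mirrors the point that similarity-based ranking tracks relevance more closely than usefulness. Keeping the two labels as separate fields makes it straightforward to isolate the "relevant but not useful" cases that the definitions above single out.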