IRC-Bench: Recognizing Entities from Contextual Cues in First-Person Reminiscences
Summary
This paper introduces IRC-Bench, a benchmark for recognizing implicit entities in first-person reminiscences using contextual cues rather than explicit mentions. It evaluates various LLM and retrieval configurations, finding QLoRA-adapted Llama 3.1 8B to be the top performer in open-world settings.
# IRC-Bench: Recognizing Entities from Contextual Cues in First-Person Reminiscences

Source: [https://arxiv.org/abs/2605.06142](https://arxiv.org/abs/2605.06142) · [View PDF](https://arxiv.org/pdf/2605.06142)

> Abstract: When people recount personal memories, they often refer to people, places, and events indirectly, relying on contextual cues rather than explicit names. Such implicit references are central to reminiscence narratives: first-person accounts of lived experience used in therapeutic, archival, and social settings. They pose a difficult computational problem because the intended entity must be inferred from dispersed narrative evidence rather than from a local mention. We introduce IRC-Bench, the Implicit Reminiscence Context Benchmark, for evaluating implicit entity recognition in reminiscence transcripts. The benchmark targets non-locality: entity-identifying cues are distributed across multiple, non-contiguous clauses, unlike named entity recognition, entity linking, or coreference resolution. IRC-Bench comprises 25,136 samples constructed from 12,337 Wikidata-linked entities across 1,994 transcripts spanning 11 thematic domains. Each sample pairs an Entity-Grounded Narrative, in which the target entity is explicitly mentioned, with an Entity-Elided Narrative, in which direct mentions are removed. We evaluate 19 configurations across LLM generation, dense retrieval, RAG, and fine-tuning. QLoRA-adapted Llama 3.1 8B performs best in the open-world setting (38.94% exact match; 51.59% Jaccard), while fine-tuned DPR leads closed-world retrieval (35.38% Hit@1; 71.49% Hit@10). We release IRC-Bench with data, code, and evaluation tools.

## Submission history

From: Yehudit Aperstein [[view email](https://arxiv.org/show-email/806637db/2605.06142)]

**[v1]** Thu, 7 May 2026 12:39:49 UTC (1,211 KB)
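The abstract reports three metric families: exact match and Jaccard for open-world generation, and Hit@k for closed-world retrieval. A minimal sketch of how such metrics are typically computed is below; the normalization choices (lowercasing, whitespace tokenization) are assumptions on my part, and the paper's released evaluation tools define the official versions.

```python
def exact_match(pred: str, gold: str) -> bool:
    """Case-insensitive string equality after whitespace stripping (assumed normalization)."""
    return pred.strip().lower() == gold.strip().lower()

def jaccard(pred: str, gold: str) -> float:
    """Token-level Jaccard overlap between predicted and gold entity strings."""
    p, g = set(pred.lower().split()), set(gold.lower().split())
    return len(p & g) / len(p | g) if p | g else 0.0

def hit_at_k(ranked: list[str], gold: str, k: int) -> bool:
    """True if the gold entity appears among the top-k retrieved candidates."""
    return gold in ranked[:k]
```

On this reading, Jaccard gives partial credit when a prediction like "the Eiffel Tower" overlaps a gold label "Eiffel Tower", which is why the reported Jaccard scores (51.59%) exceed exact match (38.94%).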
Similar Articles
OBLIQ-Bench: Exposing Overlooked Bottlenecks in Modern Retrievers with Latent and Implicit Queries
OBLIQ-Bench is a new benchmark that exposes weaknesses in current retrieval systems when handling oblique queries requiring latent or implicit reasoning, showing that even sophisticated retrieval pipelines fail to surface relevant documents that reasoning LLMs can easily verify.
Cognis: Context-Aware Memory for Conversational AI Agents
Lyzr Cognis introduces a unified, open-source memory system for conversational AI that fuses BM25 and Matryoshka vector search with version-aware ingestion, achieving SOTA on LoCoMo and LongMemEval benchmarks.
From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents
Researchers introduce Memora, a benchmark that evaluates LLMs’ ability to retain, update, and forget long-term user memories over conversations spanning weeks to months, revealing frequent reuse of obsolete memories.
Beyond Static Personas: Situational Personality Steering for Large Language Models
This paper introduces IRiS, a training-free framework for situational personality steering in LLMs that moves beyond static persona modeling by identifying and leveraging situation-dependent persona neurons. The approach demonstrates that LLM behavior varies contextually and proposes neuron-based identification, retrieval, and weighted steering methods validated on PersonalityBench and a new SPBench benchmark.
PRISM: Probing Reasoning, Instruction, and Source Memory in LLM Hallucinations
Researchers propose PRISM, a diagnostic benchmark that breaks down LLM hallucinations into four dimensions (missing knowledge, knowledge errors, reasoning errors, and instruction-following errors) across three generation stages (memory, instruction, reasoning), evaluating 24 LLMs to reveal trade-offs in mitigation strategies.