HistoriQA-ThirdRepublic: Multi-Hop Question Answering Corpus for Historical Research, Parliamentary Debates from the French Third Republic (1870-1940)
Summary
This paper presents HistoriQA-ThirdRepublic, a French-language multi-hop question answering dataset derived from parliamentary debates and newspapers of the French Third Republic (1870-1940), designed to evaluate retrieval-augmented and large language model (LLM) systems in historical research contexts.
Source: [https://arxiv.org/abs/2606.31325](https://arxiv.org/abs/2606.31325) ([PDF](https://arxiv.org/pdf/2606.31325))

> Abstract: We present HistoriQA-ThirdRepublic, a French-language dataset of multi-hop historical questions derived from parliamentary debates and newspapers of the French Third Republic. Designed in collaboration with a historian, the corpus captures complex reasoning patterns typical of historical inquiry, including cross-source synthesis, temporal reasoning, and the integration of sparse evidence. The dataset comprises 1,782 questions and emphasizes multi-hop connections across heterogeneous historical documents, providing a resource for evaluating retrieval-augmented and large language model systems in domain-specific contexts. We describe the methodology for constructing the corpus, including the selection and alignment of sources, question validation, and metadata integration. While the dataset focuses on French historical documents, our methodology can be readily adapted to other languages and national corpora. Finally, we demonstrate how the corpus can support realistic evaluation scenarios for multi-hop question answering, bridging the gap between NLP benchmarks and the needs of historical scholarship.

## Submission history

From: Aurelien PELLET [via CCSD proxy]

**[v1]** Tue, 30 Jun 2026 08:28:42 UTC (5,043 KB)
Similar Articles
Asking For An Old Friend: Diagnosing and Mitigating Temporal Failure Modes in LLM-based Statutory Question Answering
This paper studies temporal failure modes in LLM-based statutory question answering, including post-cutoff staleness and recency bias. It introduces a benchmark of 312 expert-validated German statutory QA pairs and evaluates LLMs under various inference settings.
HistoRAG: Embedding Historical Methodology in Retrieval-Augmented Generation Through Critical Technical Practice
This paper introduces HistoRAG, a framework that translates historiographical principles (decoupled retrieval/generation, temporal windowing, LLM-as-judge evaluation) into architectural interventions for standard RAG. The framework is applied to a corpus of 102,189 Der Spiegel articles and targets interpretive rather than factual question-answering needs.
EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries
This paper introduces EHRNote-ChatQA, an expert-validated benchmark for evidence-grounded multi-turn clinical question answering over multiple discharge summaries. Benchmarking 22 LLMs reveals challenges in evidence grounding and multi-turn error accumulation.
Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants
This paper introduces the first parallel Arabic cultural QA benchmark spanning Modern Standard Arabic and multiple dialects, converting multiple-choice questions to open-ended formats and evaluating LLMs with chain-of-thought reasoning to address gaps in culturally grounded and dialect-specific knowledge.
TRACE: State-Aware Query Processing over Temporal Evidence Graphs for Conversational Data
This paper presents TRACE, a query processing framework that models conversational data as temporal evidence graphs to enable state-aware reasoning over evolving user states, improving temporal and multi-hop reasoning for long-conversation QA.