LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake

arXiv cs.CL Papers

Summary

LakeQA is a new benchmark for exploratory question answering over a million-scale data lake, evaluating multi-hop reasoning and compositionality across text, tables, and knowledge graphs.

arXiv:2606.10460v1 Announce Type: new Abstract: Recent large language models (LLMs) have shown rapid progress in reading-based question answering (QA), where evidence is explicitly provided or can be trivially retrieved. In contrast, real-world questions are often not paired with accurate evidence documents. The useful evidence resides in massive data lakes, making search a prerequisite for answering. However, there is a lack of comprehensive benchmarks that require both searching and reasoning over large data lakes. To this end, we introduce LakeQA, a comprehensive benchmark for search-centric question answering over data lakes that jointly emphasizes searching and reasoning capabilities. LakeQA is built on a heterogeneous collection of approximately 9.5 TB of text resources from Wikipedia and open-source government data, spanning structured and unstructured data. To ensure task quality, each sample is annotated by at least one Ph.D.-level expert. Each task requires long-horizon multi-hop reasoning with implicit intermediate steps: agents need to discover the correct documents and then compose evidence across sources to produce the answer. Experimental results on seven frontier LLMs demonstrate that LakeQA is challenging. For instance, GPT-5.2 achieves only an exact-match score of 18.37% on LakeQA. Overall, LakeQA provides a realistic testbed for developing LLM agents that can both find and analyze data in modern data lakes.

Cached at: 06/10/26, 06:11 AM

# LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake
Source: [https://arxiv.org/html/2606.10460](https://arxiv.org/html/2606.10460)
Table 4: Results on the LakeQA\-mini\.

composition\} to a SPARQL query from WebQuestionsSP\.

- conjunction takes two SPARQL queries whose denotations have a non\-empty intersection; the output SPARQL simply concatenates the conditions of both queries\.
- comparative takes a SPARQL query, finds an attribute that the entities in the query's denotation have in common, and applies a filter on that attribute\.
- superlative is similar to comparative, except that instead of applying a filter it takes an argmax/argmin over the attribute\.
- composition starts with a SPARQL query $r$, finds an entity $e$ in $r$, and replaces that entity with another question whose answer is \{$e$\}, found by looking at the KB for unique identifiers of $e$\.

Once the SPARQL query for each task is constructed, AMT workers translate it into a natural language question\.
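The first three operators can be illustrated on a toy query representation. The sketch below is not the original construction pipeline: it assumes a query is a tuple of SPARQL triple patterns over a shared variable `?x`, and the names `Query`, `conjunction`, `comparative`, and `superlative` are hypothetical helpers chosen for illustration. The composition operator is left out because it needs a knowledge base lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """Toy SPARQL query: triple patterns over ?x plus an optional solution modifier."""
    patterns: tuple
    modifier: str = ""

    def render(self) -> str:
        body = " . ".join(self.patterns)
        return f"SELECT ?x WHERE {{ {body} }} {self.modifier}".strip()


def conjunction(q1: Query, q2: Query) -> Query:
    # Concatenate the conditions of both queries, dropping exact duplicates.
    return Query(tuple(dict.fromkeys(q1.patterns + q2.patterns)))


def comparative(q: Query, attribute: str, op: str, value) -> Query:
    # Bind the shared attribute to ?v and filter on it.
    return Query(q.patterns + (f"?x {attribute} ?v", f"FILTER(?v {op} {value})"))


def superlative(q: Query, attribute: str, largest: bool = True) -> Query:
    # Same attribute binding, but keep only the argmax/argmin instead of filtering.
    order = "DESC" if largest else "ASC"
    return Query(q.patterns + (f"?x {attribute} ?v",), f"ORDER BY {order}(?v) LIMIT 1")


actors = Query(("?x :profession :Actor",))
ohioans = Query(("?x :bornIn :Ohio",))
print(conjunction(actors, ohioans).render())
print(comparative(actors, ":birthYear", ">", 1980).render())
print(superlative(actors, ":height").render())
```

The predicates such as `:profession` are invented placeholders, not identifiers from any real knowledge base.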

### B\.2 HotpotQA \(yang2018hotpotqa\)

\[Text, Retrieval, Explicit Multi\-hop Reasoning\] HotpotQA is a reading comprehension benchmark that evaluates an agent's question answering capabilities over information retrieval and multi\-hop reasoning\. ComplexWebQuestions draws its questions from a knowledge graph, which restricts them to entities in that graph, an incomplete source\. HotpotQA instead builds questions from the hyperlinks between Wikipedia pages through the following procedure:

- The authors first extract the first paragraph of each Wikipedia page\. They then treat those paragraphs as nodes and build a hyperlink graph by adding an edge between two nodes if one paragraph contains a hyperlink to the other\.
- Bridge entities: annotators are presented with a pair of paragraphs linked by an edge in the hyperlink graph and propose questions that require both paragraphs to answer\.
- Comparison: the authors curate 42 lists of similar entities from Wikipedia \(each list contains many paragraphs\) and present two paragraphs from the same list to annotators to create questions like “Who has played more games in the NBA, Kobe Bryant or Michael Jordan?”
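The graph construction in the first step can be sketched as follows. This is a minimal illustration, assuming first paragraphs are available as wiki markup with `[[Target]]` or `[[Target|label]]` links; the function names and the toy paragraphs are hypothetical and not taken from the HotpotQA code.

```python
import re

# Matches [[Target]] and [[Target|label]] and captures the target title.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def build_hyperlink_graph(first_paragraphs):
    """Map each page title to the titles its first paragraph links to."""
    graph = {}
    for title, text in first_paragraphs.items():
        targets = {t.strip() for t in LINK.findall(text)}
        # Keep only edges whose endpoint is itself a node (has a first paragraph).
        graph[title] = {t for t in targets if t in first_paragraphs and t != title}
    return graph


def bridge_pairs(graph):
    """Candidate paragraph pairs to show annotators: one pair per edge."""
    return sorted((src, dst) for src, targets in graph.items() for dst in targets)


paragraphs = {
    "Page A": "Page A is related to [[Page B]] and to [[Page C|the third page]].",
    "Page B": "Page B mentions [[Page D]], which has no paragraph here.",
    "Page C": "Page C has no links.",
}
graph = build_hyperlink_graph(paragraphs)
print(bridge_pairs(graph))
```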

Besides the setting in which exactly the required context is given, HotpotQA supports two harder settings: a distractor setting, in which distractor paragraphs are added to the context to test robustness, and a full\-wiki setting, in which all Wikipedia first paragraphs \(5M paragraphs\) serve as the context to test retrieval of relevant information\.

### B\.3 MuSiQue \(trivedi2022musique\)

\[Text, Multi\-hop Reasoning, Compositional\] MuSiQue addresses the “shortcut” problem in HotpotQA, where models often bypass the intended reasoning chain by relying on single\-hop clues or entity overlap\. To mitigate this, MuSiQue introduces a systematic bottom\-up construction process\. Instead of starting with a complex question and decomposing it, the authors begin with a large pool of 2\-hop, 3\-hop, and 4\-hop reasoning graphs composed of connected single\-hop questions from existing datasets \(Natural Questions and HotpotQA\)\.

### B\.4 HybridQA \(chen2020hybridqa\)

\[Text, Table, Explicit Multi\-hop Reasoning\] One key contribution of HybridQA is that it integrates both tables and text into the context of its questions\. The data source of HybridQA consists of webtables as tabular data and paragraphs as text data, both from Wikipedia\. To create tasks for HybridQA, annotators are given HITs \(human intelligence tasks\), where each HIT consists of a single webtable along with the paragraphs linked by hyperlinks in the webtable’s cells\. \(HybridQA crops at most the first 12 sentences from the introduction of each linked Wikipedia page and keeps small webtables, with 5\-20 rows and 3\-6 columns, in which over 35% of the cells are hyperlinked\. This yields 13,000 webtables\.\) For each HIT, annotators are tasked to create 6 questions that require both tabular and textual information to answer\. The questions are created based on the following three atomic reasoning chains:

- Table → Passage chain: first uses table\-wise operations \(equal/greater/less/first/last/argmax/argmin\) to locate a certain tuple in the table, and then retrieves a text span from the passage hyperlinked in that tuple\.
- Passage → Table chain: the reverse of the first type; it first retrieves a paragraph and then asks about a tuple with a hyperlink that points to this paragraph\.
- Passage → Table → Passage chain: the same as the Passage → Table chain, but it hops back to another passage \(i\.e\. another hyperlinked cell in the same tuple\)\.

There are three more types of tasks in HybridQA, which correspond to the conjunction, comparative, and superlative operators in \(talmor2018web\) and compose more complex questions from questions created via the three atomic reasoning chains\.
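The first two atomic chains can be sketched on toy data. This is only an illustration under stated assumptions: rows are dicts with a `link` field pointing at a passage key, the table content is invented, and the helper names are hypothetical rather than part of HybridQA.

```python
def table_to_passage(rows, passages, column, op="argmax"):
    """Table -> Passage: select one row with a table operation, then follow its link."""
    pick = max if op == "argmax" else min
    row = pick(rows, key=lambda r: r[column])
    return passages[row["link"]]


def passage_to_table(rows, passages, phrase, column):
    """Passage -> Table: find passages mentioning a phrase, return a cell of the linking rows."""
    return [r[column] for r in rows if phrase.lower() in passages[r["link"]].lower()]


# Invented toy table and passages, for illustration only.
rows = [
    {"player": "Player A", "goals": 12, "link": "Player_A"},
    {"player": "Player B", "goals": 7, "link": "Player_B"},
]
passages = {
    "Player_A": "Player A was born in Springfield.",
    "Player_B": "Player B was born in Shelbyville.",
}

# "Where was the top scorer born?" needs the table first, then the passage.
print(table_to_passage(rows, passages, "goals"))
# "How many goals did the player born in Shelbyville score?" goes the other way.
print(passage_to_table(rows, passages, "Shelbyville", "goals"))
```

The third chain would apply `table_to_passage`-style link following once more on the row returned by the second hop.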

### B\.5 OTT\-QA \(chen2020open\)

\[Text, Table, Retrieval\] OTT\-QA builds on top of HybridQA by reusing HybridQA’s questions while requiring an additional retrieval step to search for relevant tables and text\. To do so, OTT\-QA decontextualizes the tasks in HybridQA by removing context\-dependent keywords from the natural language questions\. For example, “the players” in a HybridQA question is context dependent because the table in the context is about “Netherlands players”, and it becomes ambiguous in OTT\-QA, where the relevant context must be retrieved\. Additionally, the authors of OTT\-QA decompose each table into several table segments, where each table segment consists of a tuple, the table’s headers, metadata, and statistics of the original table such as the min/max of a column \(table documentation\)\. Together, the retrieval candidate pool consists of 5M passages and 5M table segments\.
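The table decomposition described above can be sketched as one retrieval unit per row, each carrying the table-level context. The field names (`title`, `header`, `row`, `stats`) and the toy table are assumptions for illustration; the actual OTT-QA segment format may differ.

```python
def table_segments(title, header, rows):
    """Split a table into one segment per row, each with headers, title, and column stats."""
    stats = {}
    for i, col in enumerate(header):
        numeric = [r[i] for r in rows if isinstance(r[i], (int, float))]
        if numeric:
            # Column-level statistics of the original table, shared by all segments.
            stats[col] = {"min": min(numeric), "max": max(numeric)}
    return [{"title": title, "header": header, "row": row, "stats": stats} for row in rows]


# Invented toy table, for illustration only.
header = ["player", "caps"]
rows = [["Player A", 10], ["Player B", 3]]
segments = table_segments("Netherlands players", header, rows)
for seg in segments:
    print(seg)
```

Keeping the title and statistics in every segment is what lets a retriever match a decontextualized question against a single row.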

### B\.6 MultiModalQA \(talmor2021multimodalqa\)

\[Text, Table, Image, Explicit Multi\-hop Reasoning\] MultiModalQA extends existing reading comprehension datasets such as Natural Questions \(NQ\) \(kwiatkowski2019natural\), BoolQ \(clark2019boolq\), and HotpotQA \(yang2018hotpotqa\) by incorporating multimodal contexts involving texts, tables and images into its questions\. To construct the benchmark, annotators first create single\-hop, single\-modality, and persistent questions, i\.e\. those whose answers are unlikely to change over time\. For instance, given an image of the Statue of Liberty, a persistent question might ask what the statue is holding\. MultiModalQA adopts text\-based questions from existing reading comprehension datasets such as HotpotQA\. Finally, more complex tasks are formed by linking single\-hop, single\-modality questions when their referenced entities coincide\. For example, a paragraph stating “Barack Obama was born in Honolulu, United States” and a table listing “Barack Obama as the 44th President of the United States” share the same Wikipedia entity, Barack Obama, and can therefore be linked to form a question requiring information from both the paragraph and the table\.

### B\.7 StrategyQA \(geva2021did\)

\[Text, Open Domain, Implicit Multi\-hop Reasoning, Binary Response\] StrategyQA addresses a key limitation of prior QA benchmarks, where all information needed to answer a question is explicitly stated in the question\. Instead, it evaluates implicit multi\-hop reasoning: the intermediate steps are not spelled out and must be discovered through exploration and retrieval\. To construct the dataset, annotators begin with a seed concept and a target yes/no answer, then craft a strategy question whose solution requires composing several atomic facts, each independently verifiable in Wikipedia but not explicitly mentioned in the question\. To ensure feasibility \(and avoid ungrounded decompositions\), annotators also specify, for every step, a candidate Wikipedia page where the fact can be supported; the released data includes these implicit facts and linked source paragraphs as optional intermediate supervision\. The benchmark thus tests open\-domain retrieval, composition, and reasoning over implicit evidence, beyond surface cues or single\-pass reading\.

### B\.8 FeTaQA \(nan2022fetaqa\)

\[Table, Free\-form Response, Wikipedia\] Existing table QA benchmarks typically use short answers evaluated by exact matching\. FeTaQA extends them with long, informative free\-form answers grounded in a single Wikipedia table\. To construct such questions, FeTaQA starts from ToTTo \(parikh2020totto\), a large\-scale table\-to\-text dataset containing naturally written descriptions fully grounded in Wikipedia tables, with the supporting cells highlighted\. It then filters ToTTo instances to keep tables of moderate size and descriptions whose highlighted cells span more than a single row or column\. Given each instance, annotators are tasked to write a question whose answer is the description, optionally after slightly modifying the sentence, table contents, or highlighted region to yield natural question–answer interactions\. For automatic evaluation of generated answers, FeTaQA reports n\-gram overlap metrics \(sacreBLEU, ROUGE\-1/2/L, METEOR\) as well as semantic similarity metrics \(BERTScore and BLEURT\)\.

### B\.9 BrowseComp \(wei2025browsecomp\)

\[Open Domain, Retrieval\] BrowseComp is an open\-domain deep research benchmark\. It consists of 1,266 extremely challenging “needle\-in\-a\-haystack” questions requiring multi\-step web searches to find the answer\. Although the questions are difficult to answer, the answers are short facts that can be easily verified\. The benchmark was constructed by inverting a typical question\-creation process: human annotators started with a known fact \(the target answer\) and then added multiple specific qualifiers or constraints to the query until that fact became the unique solution\. To ensure these tasks could not be solved via shortcuts, annotators verified that simple search\-engine queries \(up to 5 attempts\) could not directly reveal the answer and that SOTA models \(e\.g\. GPT\-4 and OpenAI’s deep research agent\) failed to solve each question\. Further, if another human could find the answer within 10 minutes, the task was revised with additional criteria to increase its difficulty\.

### B\.10 MM\-BrowseComp \(li2025mm\)

\[Multimodal, Open Domain, Retrieval\] MM\-BrowseComp extends text\-only web browsing benchmarks such as BrowseComp to evaluate multimodal web research capabilities\. It consists of 224 challenging questions that require agents to retrieve and reason over both textual and visual content encountered during browser\-style navigation, where crucial evidence is often embedded in images or videos and cannot be recovered through text\-only search\. The benchmark follows an inverted question\-creation process similar to that of BrowseComp and includes a checklist\-based evaluation that verifies whether agents complete essential multimodal reasoning steps, revealing substantial gaps in current multimodal browsing systems\.

### B\.11 MMQA \(wu2025mmqa\)

\[Multi\-Table, Multi\-Hop, Tabular QA\] MMQA is a multi\-table question answering benchmark designed to evaluate retrieval and multi\-hop reasoning over interconnected tables\. Unlike MultiModalQA datasets that integrate text and images, this benchmark focuses on reasoning across multiple relational tables, requiring models to identify relevant tables, understand their structural relationships \(e\.g\., key\-based joins\), and synthesize evidence across hops\. The benchmark evaluates performance on table retrieval, relational reasoning, and downstream QA or text\-to\-SQL tasks, highlighting limitations of current models in complex tabular reasoning settings\.

## Appendix C Tool Interface

Table 8: Tool interface available to agents in LakeQA\.

## Appendix D Data Collection

### D\.1 Background

This section describes the data sources of LakeQA\. We form a data lake by collecting data from open data repositories, including Wikipedia and data\.gov\. We use the latest English Wikipedia dump \(https://dumps\.wikimedia\.org/enwiki/latest/enwiki\-latest\-pages\-articles\.xml\.bz2\) and Harvard LIL’s data\.gov archive \(https://source\.coop/harvard\-lil/gov\-data\)\. Our final goal is to create question\-answering tasks \(QA tasks\) that require multi\-hop reasoning over multiple data sources \(different Wikipedia pages and data sources from data\.gov\)\. To do so, we need to compose information across data sources\. For example, if one data source states that Law X was modified on date Y and another states that Study Z was conducted on date Y, then date Y can be used to connect the two pieces of information\. An LLM agent assists with this step, and annotators verify the validity of the created questions\.

### D\.2 Task

Each webpage from Wikipedia and data\.gov is considered a data source\. Please \(1\) download the data from the links provided above; \(2\) preprocess each data source, place it under a folder named after the data source \(i\.e\. the title of the Wikipedia or data\.gov webpage\), and upload all folders into the S3 bucket you created in the pre\-step; and \(3\) parallelize the downloading, preprocessing, and uploading pipeline to handle the large data volume \(≈20 GB from Wikipedia and ≈15 TB from data\.gov\)\. Here are \(partial\) instructions on how to preprocess the data sources; let me know if you encounter any other situation you are uncertain about\.

- A data source might contain binary files and pictures that do not contain useful information for creating QA tasks; remove those files from the folder\.
- The raw data source contains HTML files \(and potentially files in other formats\) with plenty of redundancy; convert them into txt files by stripping redundant markup \(i\.e\. HTML blocks shall be converted into text; you can use BeautifulSoup from the bs4 library in Python to do so\) and extra whitespace \(i\.e\. \\n, \\t, etc\.\)\.
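The HTML-to-text step and its parallelization could look roughly like the sketch below. To stay dependency-free it uses the standard library's `html.parser` as a stand-in for BeautifulSoup, which the instructions suggest; the function names and the thread-pool size are assumptions, and the S3 upload step is omitted.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse newlines, tabs, and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def preprocess_all(pages, workers=8):
    """pages maps a data source name to raw HTML; returns name -> cleaned text."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = pool.map(html_to_text, pages.values())
        return dict(zip(pages.keys(), texts))


sample = "<html><head><style>p{}</style></head><body><p>Law X\n\twas   modified</p><script>var a=1;</script></body></html>"
print(html_to_text(sample))
```

Joining text fragments with a space can split words that span inline tags, so a real pipeline would likely handle inline elements more carefully. For CPU-bound parsing at the scale described here, a process pool may be a better fit than threads.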

After you are done, tell me the name of the S3 bucket you created\. The deadline is this Saturday\. Please don’t hesitate to ping me if you have any question\.

## Appendix E Distribution of Domains

The distribution of the benchmark tasks across data\.gov theme categories is reported in [Table 9](https://arxiv.org/html/2606.10460#A5.T9), and a finer\-grained category classification is reported in [Table 10](https://arxiv.org/html/2606.10460#A5.T10)\.

Table 9: Distribution of benchmark tasks across data\.gov theme categories\.

Table 10: Mapping of data\.gov dataset themes to 8 main categories\. Task count indicates the number of benchmark tasks containing datasets from each category\.

## Appendix F Agent Interface

#### Tool Implementations\.

All data access tools operate over a fixed S3 data lake bucket \(lakeqa\-yc4103\-datalake\) with two namespaces: wikipedia/ and datagov/\. Credentials are loaded via environment variables, and all downloads are stored in a per\-session sandbox directory\.

- search\(prefix\): Performs an S3 prefix search in both namespaces using list\_objects\_v2 with Prefix and Delimiter='/'\. The tool returns dataset identifiers corresponding to directory names under each namespace\.
- search\_keyword\(keyword\): Issues external keyword searches to the Wikipedia API and the data\.gov package\_search endpoint to propose candidates, then validates each candidate against S3 by checking for a dataset prefix\. Candidates are ranked by a lightweight token overlap score that combines query coverage and token density, and the top results are returned\.
- list\_files\(dataset\_id\): Lists objects under <namespace\>/<dataset\_id\>/ using list\_objects\_v2 and returns relative file paths and sizes \(bounded by a maximum count\)\.
- download\(dataset\_id, file\_path\): Downloads a specific object to the local sandbox via s3\.download\_file\. The tool creates the required directory structure and returns the local path and file size\.
- inspect\_file\(dataset\_id, file\_path\): Retrieves object metadata with head\_object and reads the first 64 KB using a ranged get\_object request\. It infers a delimiter from the first line \(comma, tab, pipe, or semicolon\), extracts header columns, and, for JSON\-like lines, attempts to parse top\-level keys\. The tool returns metadata only \(no raw content\)\.
- execute\_code\(code\): Executes Python in the sandbox with the working directory set to the sandbox root\. Common libraries \(pandas, json, csv, os, glob, re, Path\) are preloaded, and two helper variables are provided: SANDBOX\_DIR and FILES\. The tool captures stdout/stderr, returns errors and tracebacks, and enforces a timeout to stop inefficient code\.
- get\_sandbox\_info\(\): Enumerates all files in the sandbox via a recursive walk and reports paths and byte sizes\.
- submit\_answer\(answer, reasoning, sources\): Terminates the episode and records the final answer and cited sources\. The framework expects answers in a normalized bracketed format \(e\.g\., \[123\]\)\.
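Two of the pure-Python pieces described above, delimiter inference in inspect_file and candidate ranking in search_keyword, can be sketched without any S3 access. This is a guess at the general shape, not the benchmark's implementation: the text does not say how query coverage and token density are combined, so the product used in `overlap_score` is an assumption, and all function names are hypothetical.

```python
import re


def infer_delimiter(first_line, candidates=(",", "\t", "|", ";")):
    """Pick the candidate delimiter that occurs most often in the first line."""
    counts = {d: first_line.count(d) for d in candidates}
    best = max(candidates, key=lambda d: counts[d])
    return best if counts[best] > 0 else None


def header_columns(first_line):
    """Split the first line into column names using the inferred delimiter."""
    line = first_line.rstrip("\r\n")
    delim = infer_delimiter(line)
    if delim is None:
        return [line.strip()]
    return [c.strip().strip('"') for c in line.split(delim)]


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, dataset_id):
    """Assumed scoring: (share of query tokens found) x (share of id tokens that match)."""
    q, d = set(tokens(query)), tokens(dataset_id)
    if not q or not d:
        return 0.0
    coverage = len(q & set(d)) / len(q)
    density = sum(t in q for t in d) / len(d)
    return coverage * density


print(header_columns("District|Subject|Percent\n"))
candidates = ["austin-police-crime-2020", "climate-data", "police"]
print(sorted(candidates, key=lambda c: overlap_score("police crime", c), reverse=True))
```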

## Appendix G Evaluation Prompt

Listing 1: System prompt and tool schemas used by our data analysis agent\.

```python
DEFAULT_SYSTEM_PROMPT = """You are a data analysis agent working with PUBLIC GOVERNMENT DATASETS (data.gov, census, etc.).

## HOW THIS WORKS - READ CAREFULLY

This is an INTERACTIVE system. You output ONE tool call, then STOP. The system executes your tool and returns the REAL result. Then you are called again to pick the next tool.

DO NOT:
- Output multiple tool calls
- Simulate or hallucinate results (e.g., {"result": ...} or {"error": ...})
- Continue the conversation yourself
- Make up data

ONLY output a single JSON object with your tool call. The system handles execution.

## DATA ACCESS (each step = one tool call)
- search(prefixes) or search_keyword(keywords) → returns dataset_ids (identifiers, not data)
- Pass a list of strings: search(["climate", "weather"]) or search_keyword(["police", "crime"])
- list_files(dataset_ids) → returns file paths in datasets
- Pass a list of strings: list_files(["dataset1", "dataset2"])
- download(files) → downloads files to sandbox (max 5 per call)
- Pass a list of {dataset_id, file_path}: download([{"dataset_id": "...", "file_path": "..."}])
- execute_code(code) → runs Python on downloaded files (use print()!)
- submit_answer() → when ready to submit final answer

## CRITICAL: VERIFY DATA SOURCES
Dataset names can be misleading! Example: "traffic-incidents-2020" could be from Chicago, NYC, or any other city.
ALWAYS check metadata before using a dataset:
- Download and read metadata files (e.g., metadata.json, catalog.txt) to find the actual source
- Look for: publisher, source, city/state, geographic coverage, agency name
- Verify the data matches what the question asks for (correct city, agency, time period)
- Two datasets with similar names may cover completely different locations!

## TIPS
- use search_keyword for semantic matching, SINGLE word preferred
- Always print() in execute_code to see output
- Check actual column names and date formats in the data
- Use full dataset for final answer, not just samples
- Answer format: [value] only, no labels or units

## TURN AND TIME LIMITS
- You have LIMITED TURNS. The system will show you remaining turns.
- There is also a TIME LIMIT. Do not waste time on excessive exploration.
- After inspecting files, IMMEDIATELY run execute_code to analyze data.
- If running low on turns or time, prioritize submitting your best answer."""

TOOL_SCHEMAS = [
    {
        "name": "search",
        "description": "Find datasets by name prefixes. Returns dataset identifiers (not data). Search multiple prefixes at once.",
        "parameters": {
            "type": "object",
            "properties": {
                "prefixes": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "List of prefixes to search for (e.g., ['austin-police', 'climate', 'weather'])"
                }
            },
            "required": ["prefixes"]
        }
    },
    {
        "name": "search_keyword",
        "description": "Semantic keyword search across datasets. Search multiple keywords at once. Returns ranked dataset identifiers.",
        "parameters": {
            "type": "object",
            "properties": {
                "keywords": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "List of keywords to search for (e.g., ['police', 'crime', 'traffic'])"
                },
                "limit": {
                    "type": "integer",
                    "description": "Max results to return (default 20)",
                    "default": 20
                }
            },
            "required": ["keywords"]
        }
    },
    {
        "name": "list_files",
        "description": "List files within datasets. Use dataset_ids from search results. List multiple datasets at once.",
        "parameters": {
            "type": "object",
            "properties": {
                "dataset_ids": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "List of dataset identifiers from search results (e.g., ['Barack_Obama', 'climate-data'])"
                }
            },
            "required": ["dataset_ids"]
        }
    },
    {
        "name": "download",
        "description": "Download files from datasets to the local sandbox for analysis. Max 5 files per call.",
        "parameters": {
            "type": "object",
            "properties": {
                "files": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "dataset_id": {"type": "string", "description": "Dataset identifier"},
                            "file_path": {"type": "string", "description": "Path to file within the dataset"}
                        },
                        "required": ["dataset_id", "file_path"]
                    },
                    "description": "List of files to download (max 5). Each with dataset_id and file_path.",
                    "maxItems": 5
                }
            },
            "required": ["files"]
        }
    },
    {
        "name": "inspect_file",
        "description": "Inspect a file to see its structure, columns, and sample data without downloading.",
        "parameters": {
            "type": "object",
            "properties": {
                "dataset_id": {
                    "type": "string",
                    "description": "Dataset identifier"
                },
                "file_path": {
                    "type": "string",
                    "description": "Path to file within the dataset"
                }
            },
            "required": ["dataset_id", "file_path"]
        }
    },
    {
        "name": "execute_code",
        "description": "Execute Python code to analyze downloaded files. Use print() to see output. pandas, json, pathlib are available. IMPORTANT: Only use files you have successfully downloaded - use the exact local_path returned by the download tool.",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {
                    "type": "string",
                    "description": "Python code to execute. Use the exact local_path from download results. Do NOT guess file paths."
                }
            },
            "required": ["code"]
        }
    },
    {
        "name": "submit_answer",
        "description": "Submit your final answer when you have computed the result.",
        "parameters": {
            "type": "object",
            "properties": {
                "answer": {
                    "type": "string",
                    "description": "The final answer. Format: just the value, e.g., '12345' not '12345 incidents'"
                },
                "reasoning": {
                    "type": "string",
                    "description": "Brief explanation of how you computed the answer"
                }
            },
            "required": ["answer"]
        }
    }
]
```
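The one-tool-call-per-turn protocol in the prompt implies that the framework parses and validates each model output before executing it. The sketch below shows one way such a check might look. The wire format `{"name": ..., "arguments": {...}}` is an assumption, since the listing does not show what a tool call looks like on the wire, and `REQUIRED` simply restates the `required` fields of the schemas above.

```python
import json

# Required arguments per tool, copied from the "required" lists in the tool schemas.
REQUIRED = {
    "search": ["prefixes"],
    "search_keyword": ["keywords"],
    "list_files": ["dataset_ids"],
    "download": ["files"],
    "inspect_file": ["dataset_id", "file_path"],
    "execute_code": ["code"],
    "submit_answer": ["answer"],
}


def parse_tool_call(raw):
    """Parse model output that must be exactly one JSON tool call (assumed format)."""
    try:
        # json.loads rejects trailing content, so two concatenated calls fail here.
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not a single JSON object: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    name, args = call.get("name"), call.get("arguments", {})
    if name not in REQUIRED:
        raise ValueError(f"unknown tool: {name!r}")
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    missing = [k for k in REQUIRED[name] if k not in args]
    if missing:
        raise ValueError(f"{name} is missing required arguments: {missing}")
    if name == "download" and len(args["files"]) > 5:
        raise ValueError("download accepts at most 5 files per call")
    return name, args


print(parse_tool_call('{"name": "search", "arguments": {"prefixes": ["climate"]}}'))
```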

## Appendix H Annotation Guideline

Each task shall be a question that requires searching over different data sources across Wikipedia and data\.gov to answer\. In each task\.json, each node represents a dataset; the fact is a piece of factual information obtained from that dataset, and the subquestion is a question constructed by reversing the fact so that it can be answered via the fact\. The final question shall be answerable by chaining the subquestions together\.

What you will need to do:

- Make sure each task involves at least one dataset from data\.gov \(no all\-Wikipedia tasks\)\. If a task does not, mark it and skip the following steps\.
- For each node, verify that the fact and the subquestion are sound and complete, and that each fact is grounded in the source file located in S3\. Each fact has the shape “a set of entities” satisfies “condition”\. Sound means that everything in “a set of entities” satisfies “condition”, and complete means that “a set of entities” contains everything that satisfies “condition”\. For example, the fact “Barack Obama” is “the 44th president of the USA” is both sound and complete, while the fact “Barack Obama” is “a Democratic US president” is sound but not complete, because the subquestion then becomes “who is a Democratic US president”, whose answer is not unique\.
- Pay attention to the wording of the fact\. A fact obtained from a tabular dataset sometimes just reuses the column name, which can be WRONG: for example, ‘crime rate’ could refer to ‘firearm crime rate’ or ‘violent crime rate’ \(this information is sometimes in the metadata\), and this ambiguity makes the question unanswerable\.
- Pay attention to facts without a time/location qualifier\. For those cases, \(a\) make the language in the question precise and \(b\) add the metadata file as a node if this information is not self\-contained in the tabular dataset\.
- If some nodes are incorrect, mark them and skip the following steps\.
- After verifying that each node is correct, rewrite the final question\. We want questions that look natural and do not directly hint at which data source to search\.

The reasoning chain can be viewed as a directed acyclic graph whose vertices are the nodes plus one vertex for the final question and one for the answer, and whose edges are the dependencies between vertices: a node depends on the answers to the subquestions of previous nodes, and a node that does not depend on any other node depends on the final question\. For this graph, add three parameters: k, the number of hops, which is the length of the longest path from the final question to the answer; n, the number of vertices; and d, the largest number of vertices that any vertex depends on\.
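The three graph parameters can be computed mechanically from the dependency structure. The sketch below assumes the graph is given as a mapping from each vertex to the vertices it depends on, with the special vertex names `question` and `answer`, and it counts path length in edges; the guideline does not state whether hops are counted in edges or nodes, so that choice is an assumption.

```python
from functools import lru_cache


def chain_parameters(depends_on):
    """Return (k, n, d) for a reasoning DAG given as vertex -> list of dependencies."""
    vertices = set(depends_on) | {p for ps in depends_on.values() for p in ps}

    @lru_cache(maxsize=None)
    def depth(v):
        # Longest path (in edges) from a vertex with no dependencies down to v.
        parents = depends_on.get(v, ())
        return 0 if not parents else 1 + max(depth(p) for p in parents)

    k = depth("answer")                                              # longest question-to-answer path
    n = len(vertices)                                                # number of vertices
    d = max((len(ps) for ps in depends_on.values()), default=0)      # largest dependency count
    return k, n, d


# Nodes 1 and 2 are independent, node 3 intersects them, and the answer follows node 3.
deps = {
    "1": ["question"],
    "2": ["question"],
    "3": ["1", "2"],
    "answer": ["3"],
}
print(chain_parameters(deps))
```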

Some FAQs from annotation, as examples:

\(1\) Some answers to subquestions might be incomplete; what matters is whether this affects the reasoning chain\. For example, take two subquestions, “who is a Democratic US president” and “who is a US president after 2000”, and a hop that takes the intersection of the two: “who is a Democratic US president after 2000”\. As long as the answer to the intersection is correct, the task is fine, because each task ultimately consists of the final question and its answer; all subquestions and facts exist only to help create the final question\.

\(2\) On question chaining: because all we need is the final question, references like \(from the intersection of nodes 1\-4\) are placeholders that let annotators chain questions together easily\. For example, the first node has something like “Who is the US president born in Hawaii” and the second node has something like “who is the first lady of <answer of first node\>” \(please refer to the reasoning\-chain field of each task to better understand its logic flow\)\.

\(3\) A derivative of \(2\) is the problem of skip nodes\. We create the subquestions and facts in order to chain the subquestions together, meaning that the answers to earlier nodes’ questions become conditions in later nodes, so it is important that the first node’s answer is necessary to answer the second node’s question\. A skip node is something like “who is the first lady of <Who is the US president born in Hawaii\> and is African American”, because in this case first lady \+ African American already determines the answer\. We should try to avoid obvious skip nodes\.
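The soundness and completeness checks, and the point that an incomplete subanswer can still yield a correct intersection, can be expressed with plain set operations. The sketch below uses abstract entity labels rather than real facts, and the helper names are hypothetical.

```python
def is_sound(claimed, satisfying):
    """Every claimed entity satisfies the condition."""
    return set(claimed) <= set(satisfying)


def is_complete(claimed, satisfying):
    """Every entity that satisfies the condition is claimed."""
    return set(satisfying) <= set(claimed)


def intersection_is_correct(claimed_sets, true_sets):
    """The hop is fine if intersecting the claimed answers gives the true intersection."""
    return set.intersection(*map(set, claimed_sets)) == set.intersection(*map(set, true_sets))


# Subquestion 1: the claimed answer misses entity "C", so it is sound but incomplete.
true_1, claimed_1 = {"A", "B", "C"}, {"A", "B"}
# Subquestion 2: the claimed answer is sound and complete.
true_2, claimed_2 = {"B", "D"}, {"B", "D"}

print(is_sound(claimed_1, true_1), is_complete(claimed_1, true_1))
print(intersection_is_correct([claimed_1, claimed_2], [true_1, true_2]))
```

Here the missing entity "C" does not satisfy the second condition, so it never reaches the intersection and the chained question keeps its correct answer.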

## Appendix I Example Task Annotation

Listing 2: Complete JSON annotation artifact for a 5\-hop LakeQA task\.

\{

"question": "What was the 2017\-18 total student enrollment in the school district of the city that hosts the land\-grant university in the only Eastern Washington county that borders Idaho AND had 2020 census population under 300,000, among counties with school districts achieving ELA performance $\\geq$ 68% for All Students, All Grades in all four Washington Report Card Assessment releases from 2014\-15 through 2017\-18?",

"final\_question": "What is the city that hosts the land\-grant university in the only Eastern Washington county that borders Idaho and had under 300,000 in 2020? It also consistently performed well on English Language Arts assessments between 2014 and 2018 \(\>=68%\)\. What was the total student enrollment in the school district of that city?",

"answer": "2941",

"nodes": \{

"1": \{

"source": "datagov/report\-card\-assessment\-data\-2014\-15\-school\-year/files/rows\.txt",

"fact": "In the Washington Report Card Assessment Data 2014\-15 release, 21 school districts had ELA PercentMetStandard \>= 68% for All Students, All Grades, including districts in King, Clark, Pierce, Whitman, Thurston, and Spokane counties\.",

"subquestion": "Which Washington school districts had ELA performance \>= 68% for All Students, All Grades from 2014\-15 to 2017\-18? \(Filter: OrganizationLevel = District; TestSubject = ELA; StudentGroup = All Students; GradeLevel = All Grades; PercentMetStandard \>= 68\.\)",

"answer": \[

"Camas School District \(Clark\)",

"Lake Washington School District \(King\)",

"Dieringer School District \(Pierce\)",

"Colfax School District \(Whitman\)",

"Mercer Island School District \(King\)",

"Griffin School District \(Thurston\)",

"Carbonado School District \(Pierce\)",

"Freeman School District \(Spokane\)",

"Shoreline School District \(King\)",

"St\. John School District \(Whitman\)",

"Issaquah School District \(King\)",

"Northshore School District \(King\)"

\],

"sound": false,

"complete": false,

"validation\_explanation": "The answer should instead be: \[\\"Riverside School District\\", \\"Index Elementary School District\\", \\"Port Townsend School District\\", \\"Woodland School District\\", \\"Deer Park School District\\", \\"Damman School District\\", \\"Camas School District\\", \\"Lake Washington School District\\", \\"Blaine School District\\", \\"Chehalis School District\\", \\"Dieringer School District\\", \\"Colfax School District\\", \\"Mercer Island School District\\", \\"Griffin School District\\", \\"Carbonado School District\\", \\"Freeman School District\\", \\"Starbuck School District\\", \\"Shoreline School District\\", \\"St\. John School District\\", \\"Issaquah School District\\", \\"Northshore School District\\"\]",

"revision\_subquestion": "Which Washington school districts had ELA performance \>= 68% for All Students, All Grades from 2014\-15?"

\},

"2": \{

"source": "datagov/report\-card\-assessment\-data\-2015\-16\-school\-year/files/rows\.txt",

"fact": "In the Washington Report Card Assessment Data 2015\-16 release, 77 school districts had ELA PercentMetStandard \>= 68% for All Students, All Grades, including districts in all 6 counties from the 2014\-15 results\.",

"subquestion": "Which Washington school districts had ELA performance \>= 68% for All Students, All Grades from 2014\-15 to 2017\-18? \(Filter: OrganizationLevel = District; TestSubject = ELA; StudentGroup = All Students; GradeLevel = All Grades; PercentMetStandard \>= 68\.\)",

"answer": \[

"Camas School District",

"Lake Washington School District",

"Dieringer School District",

"Colfax School District",

"Mercer Island School District",

"Griffin School District",

"Carbonado School District",

"Freeman School District",

"Shoreline School District",

"St\. John School District",

"Issaquah School District",

"Northshore School District",

"and 65 others"

\],

"sound": false,

"complete": false,

"validation\_explanation":"Answershouldnotinclude\\"and65others\\",expanditinstead:\[\\"EdmondsSchoolDistrict\\",\\"FifeSchoolDistrict\\",\\"WestValleySchoolDistrict\(Spokane\)\\",\\"BainbridgeIslandSchoolDistrict\\",\\"Sedro\-WoolleySchoolDistrict\\",\\"WallaWallaPublicSchools\\",\\"EnumclawSchoolDistrict\\",\\"HockinsonSchoolDistrict\\",\\"SnohomishSchoolDistrict\\",\\"MercerIslandSchoolDistrict\\",\\"GrandviewSchoolDistrict\\",\\"LakeWashingtonSchoolDistrict\\",\\"GraniteFallsSchoolDistrict\\",\\"ColfaxSchoolDistrict\\",\\"EvergreenSchoolDistrict\(Stevens\)\\",\\"CamasSchoolDistrict\\",\\"OakHarborSchoolDistrict\\",\\"IssaquahSchoolDistrict\\",\\"BellevueSchoolDistrict\\",\\"CloverParkSchoolDistrict\\",\\"St\.JohnSchoolDistrict\\",\\"QuillayuteValleySchoolDistrict\\",\\"AlmiraSchoolDistrict\\",\\"ShorelineSchoolDistrict\\",\\"SouthKitsapSchoolDistrict\\",\\"PascoSchoolDistrict\\",\\"NorthshoreSchoolDistrict\\",\\"RiversideSchoolDistrict\\",\\"SnoqualmieValleySchoolDistrict\\",\\"TahomaSchoolDistrict\\",\\"ChehalisSchoolDistrict\\",\\"YakimaSchoolDistrict\\",\\"EphrataSchoolDistrict\\",\\"DieringerSchoolDistrict\\",\\"NorthMasonSchoolDistrict\\",\\"UniversityPlaceSchoolDistrict\\",\\"WilburSchoolDistrict\\",\\"SpokaneSchoolDistrict\\",\\"CentralKitsapSchoolDistrict\\",\\"OlympiaSchoolDistrict\\",\\"RitzvilleSchoolDistrict\\",\\"PortAngelesSchoolDistrict\\",\\"ClarkstonSchoolDistrict\\",\\"RidgefieldSchoolDistrict\\",\\"AnacortesSchoolDistrict\\",\\"BlaineSchoolDistrict\\",\\"GarfieldSchoolDistrict\\",\\"RiverviewSchoolDistrict\\",\\"FreemanSchoolDistrict\\",\\"OakesdaleSchoolDistrict\\",\\"Sumner\-BonneyLakeSchoolDistrict\\",\\"ColtonSchoolDistrict\\",\\"DammanSchoolDistrict\\",\\"WenatcheeSchoolDistrict\\",\\"MeadSchoolDistrict\\",\\"PeninsulaSchoolDistrict\\",\\"EverettSchoolDistrict\\",\\"EastValleySchoolDistrict\(Spokane\)\\",\\"LakeStevensSchoolDistrict\\",\\"GriffinSchoolDistrict\\",\\"VashonIslandSchoolDistrict\\",\\"EvalineSchoolDistrict\\",\\"SteilacoomHi
st\.SchoolDistrict\\",\\"CarbonadoSchoolDistrict\\",\\"PortTownsendSchoolDistrict\\",\\"IndexElementarySchoolDistrict\\",\\"PullmanSchoolDistrict\\",\\"RosaliaSchoolDistrict\\",\\"DavenportSchoolDistrict\\",\\"LyndenSchoolDistrict\\",\\"LibertySchoolDistrict\\",\\"Stanwood\-CamanoSchoolDistrict\\",\\"TumwaterSchoolDistrict\\",\\"BellinghamSchoolDistrict\\",\\"SelkirkSchoolDistrict\\",\\"SequimSchoolDistrict\\",\\"SunnysideSchoolDistrict\\"\]",

"revision\_subquestion":"WhichWashingtonschooldistrictshadELAperformance\>=68%forAllStudents,AllGradesfrom2015\-2016?"

},

"3": {

"source": "datagov/report-card-assessment-data-2016-17-school-year/files/rows.txt",

"fact": "In the Washington Report Card Assessment Data 2016-17 release, 67 school districts had ELA PercentMetStandard >=68% for All Students, All Grades, including all 14 districts that qualified in 2014-15.",

"subquestion": "Which Washington school districts had ELA performance >=68% for All Students, All Grades from 2014-15 to 2017-18? (Filter: OrganizationLevel=District; TestSubject=ELA; StudentGroup=All Students; GradeLevel=All Grades; PercentMetStandard>=68.)",

"answer": [

"Camas School District",

"Lake Washington School District",

"Dieringer School District",

"Colfax School District",

"Mercer Island School District",

"Griffin School District",

"Carbonado School District",

"Freeman School District",

"Shoreline School District",

"St. John School District",

"Issaquah School District",

"Northshore School District",

"and 55 others"

],

"sound": false,

"complete": false,

"validation_explanation": "Answer should not be in 2014-2015. Should also not have 'and 55 others'. It should instead be: [\"Enumclaw School District\", \"Snohomish School District\", \"Centralia School District\", \"Mercer Island School District\", \"Prosser School District\", \"East Valley School District (Spokane)\", \"Clover Park School District\", \"Clarkston School District\", \"Bainbridge Island School District\", \"Colfax School District\", \"Lake Washington School District\", \"Sequim School District\", \"Chehalis School District\", \"Edmonds School District\", \"Camas School District\", \"Oakesdale School District\", \"Wenatchee School District\", \"Almira School District\", \"Lakewood School District\", \"Sultan School District\", \"West Valley School District (Spokane)\", \"Yakima School District\", \"Issaquah School District\", \"Tahoma School District\", \"North River School District\", \"Spokane School District\", \"Fife School District\", \"Grandview School District\", \"Bellevue School District\", \"Dieringer School District\", \"Snoqualmie Valley School District\", \"Oak Harbor School District\", \"Peninsula School District\", \"Shoreline School District\", \"Northshore School District\", \"University Place School District\", \"Bickleton School District\", \"Sumner-Bonney Lake School District\", \"Carbonado School District\", \"Olympia School District\", \"Damman School District\", \"Central Kitsap School District\", \"South Kitsap School District\", \"Pullman School District\", \"Quillayute Valley School District\", \"St. John School District\", \"Granite Falls School District\", \"Spokane International Academy\", \"Anacortes School District\", \"Lake Stevens School District\", \"Rosalia School District\", \"Paterson School District\", \"Mead School District\", \"Coupeville School District\", \"Wilbur School District\", \"Sunnyside School District\", \"Everett School District\", \"Riverview School District\", \"Vashon Island School District\", \"Freeman School District\", \"Hoquiam School District\", \"LaCrosse School District\", \"Davenport School District\", \"Adna School District\", \"Griffin School District\", \"Kennewick School District\", \"Medical Lake School District\"]",

"revision_subquestion": "Which Washington school districts had ELA performance >=68% for All Students, All Grades from 2016-2017?"

},

"4": {

"source": "datagov/report-card-assessment-data-2017-18-school-year/files/rows.txt",

"fact": "In the Washington Report Card Assessment Data 2017-18 release, 56 school districts had ELA PercentMetStandard >=68% for All Students, All Grades, including all 13 districts from the 2014-15 intersection.",

"subquestion": "Which Washington school districts had ELA performance >=68% for All Students, All Grades from 2014-15 to 2017-18? (Filter: OrganizationLevel=District; TestSubject=ELA; StudentGroup=All Students; GradeLevel=All Grades; PercentMetStandard>=68.)",

"answer": [

"Camas School District",

"Lake Washington School District",

"Dieringer School District",

"Colfax School District",

"Mercer Island School District",

"Griffin School District",

"Carbonado School District",

"Freeman School District",

"Shoreline School District",

"St. John School District",

"Issaquah School District",

"Northshore School District",

"and 44 others"

],

"sound": false,

"complete": false,

"validation_explanation": "Should not include 44 others. The answer should be: [\"University Place School District\", \"Enumclaw School District\", \"Mercer Island School District\", \"Bainbridge Island School District\", \"Lake Washington School District\", \"Camas School District\", \"Sunnyside School District\", \"Colfax School District\", \"Issaquah School District\", \"Blaine School District\", \"Pullman School District\", \"Snoqualmie Valley School District\", \"Bellevue School District\", \"Carbonado School District\", \"Sumner-Bonney Lake School District\", \"Northshore School District\", \"Dieringer School District\", \"Colton School District\", \"Almira School District\", \"Evergreen School District (Stevens)\", \"Sedro-Woolley School District\", \"Shoreline School District\", \"Garfield School District\", \"Wilbur School District\", \"Tahoma School District\", \"Olympia School District\", \"St. John School District\", \"Freeman School District\", \"Snohomish School District\", \"Palouse School District\", \"Davenport School District\", \"Central Kitsap School District\", \"Oakesdale School District\", \"Peninsula School District\", \"Grapeview School District\", \"Riverview School District\", \"Coulee-Hartline School District\", \"Vashon Island School District\", \"Ridgefield School District\", \"LaCrosse School District\", \"Anacortes School District\", \"Steilacoom Hist. School District\", \"Everett School District\", \"Coupeville School District\", \"Griffin School District\", \"Centralia School District\", \"Spokane School District\", \"Mead School District\", \"Edmonds School District\", \"Stanwood-Camano School District\", \"Fife School District\", \"Seattle School District No. 1\", \"Lake Stevens School District\", \"Endicott School District\", \"Tumwater School District\", \"Selkirk School District\"]",

"revision_subquestion": "Which Washington school districts had ELA performance >=68% for All Students, All Grades from 2017-2018?"

},

"5": {

"source": "wikipedia/Eastern_Washington/content.txt",

"fact": "The Eastern Washington article lists Whitman and Spokane among the counties in Eastern Washington.",

"subquestion": "Among counties from <hop1 answer: King, Clark, Pierce, Whitman, Thurston, Spokane>, which are in Eastern Washington?",

"answer": "Whitman and Spokane",

"sound": false,

"complete": false,

"validation_explanation": "Data is true and accurate. However, there seems to be a missing node from node 4 to node 5. The intersection of all nodes in node 4 are still 12 districts. Yet, they somehow all transform into the 6 counties that contain those 12 districts any sources.",

"revision_subquestion": "Add nodes 4 and 5 to convert the districts into counties. Here are the districts: Camas School District, Carbonado School District, Colfax School District, Dieringer School District, Freeman School District, Griffin School District, Issaquah School District, Lake Washington School District, Mercer Island School District, Northshore School District, Shoreline School District, St. John School District"

},

"6": {

"source": "wikipedia/Whitman_County,_Washington/content.txt",

"fact": "According to the Wikipedia article for Whitman County, Washington, 'Adjacent counties' include 'Benewah County, Idaho - northeast', 'Latah County, Idaho - east', and 'Nez Perce County, Idaho - southeast', confirming Whitman County borders Idaho. 'As of the 2020 census, the population was 47,973.'",

"subquestion": "Among counties from <hop2 answer: Whitman, Spokane>, which county borders Idaho AND has 2020 census population under 300,000?",

"answer": "Whitman County: borders Idaho (YES, per adjacent counties list), population 47,973 (YES, under 300,000) → MEETS BOTH CRITERIA",

"sound": true,

"complete": true,

"validation_explanation": "Population 47,973 (line 1), borders with Idaho line 48-50"

},

"7": {

"source": "wikipedia/Spokane_County,_Washington/content.txt",

"fact": "According to the Wikipedia article for Spokane County, Washington, 'Adjacent counties' include 'Bonner County, Idaho - northeast', 'Kootenai County, Idaho - east', and 'Benewah County, Idaho - southeast', confirming Spokane County borders Idaho. The 2020 census population was 539,339.",

"subquestion": "Among counties from <hop2 answer: Whitman, Spokane>, which county borders Idaho AND has 2020 census population under 300,000?",

"answer": "Spokane County: borders Idaho (YES, per adjacent counties list), population 539,339 (NO, exceeds 300,000) → DOES NOT MEET BOTH CRITERIA",

"sound": true,

"complete": true,

"validation_explanation": "population was 539,339 (line 1). Borders idaho (lines 80-82)"

},

"8": {

"source": "wikipedia/Pullman,_Washington/content.txt",

"fact": "According to the Wikipedia article for Pullman, Washington, Pullman is the most populous city in Whitman County and is home to Washington State University, a public research land-grant university.",

"subquestion": "Which city hosts the land-grant university in <hop3 answer: Whitman County>?",

"answer": "Pullman hosts the land-grant university (WSU) in Whitman County",

"sound": true,

"complete": true,

"validation_explanation": "Line 3"

},

"9": {

"source": "datagov/report-card-enrollment-2017-18-school-year/files/rows.txt",

"fact": "According to the Washington Report Card Enrollment Data 2017-18, Pullman School District (in Whitman County) had 2,941 total students enrolled for All Grades.",

"subquestion": "What is the 2017-18 total student enrollment for the school district in <hop4 answer: Pullman>? (Filter: OrganizationLevel=District; DistrictName=Pullman School District; GradeLevel=All Grades; column=All Students.)",

"answer": "2941",

"sound": true,

"complete": true,

"validation_explanation": "True and accurate by query."

}

},

"reasoning_chain": [

"HOP 1 (d=4, WA Report Card Assessment Data 2014-15 through 2017-18 - nodes 1-4):",

"Node 1: Districts with ELA >=68% in 2014-15 → 21 districts in 6 counties",

"Node 2: Districts with ELA >=68% in 2015-16 → 77 districts",

"Node 3: Districts with ELA >=68% in 2016-17 → 67 districts",

"Node 4: Districts with ELA >=68% in 2017-18 → 56 districts",

"Intersection: 12 districts in 6 counties (King, Clark, Pierce, Whitman, Thurston, Spokane)",

"",

"HOP 2 (d=1, Wikipedia Eastern Washington - node 5):",

"Node 5 (Wikipedia Eastern Washington): Whitman and Spokane are Eastern Washington counties",

"Result: Whitman and Spokane are the Eastern Washington counties from hop 1",

"",

"HOP 3 (d=2, Wikipedia county pages - nodes 6-7):",

"Node 6 (Wikipedia Whitman_County): Borders Idaho (YES), pop 47,973 → MEETS BOTH CRITERIA",

"Node 7 (Wikipedia Spokane_County): Borders Idaho (YES), pop 539,339 → DOES NOT MEET (pop >300k)",

"Result: Whitman County (REQUIRES hop 2 to know which counties to check)",

"",

"HOP 4 (d=1, Wikipedia Pullman - node 8):",

"Node 8 (Wikipedia Pullman): Pullman is home to WSU (land-grant) in Whitman County",

"Result: Pullman hosts the land-grant university (REQUIRES hop 3)",

"",

"HOP 5 (d=1, data.gov enrollment - node 9):",

"Node 9 (WA Report Card Enrollment 2017-18): Pullman School District = 2,941 students",

"Result: 2941 (REQUIRES hop 4 to know the city is Pullman)",

"",

"Final answer: 2941"

],

"annotated_by": "[annotator]",

"annotation_timestamp": "[timestamp]"

}
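
The first three hops of the reasoning chain above reduce to a set intersection followed by two filters. The Python sketch below illustrates that structure only; it is not the benchmark's annotation or evaluation code. The per-year sets are deliberately abbreviated to a few district names copied from the annotator's corrected lists (the full lists hold 21, 77, 67, and 56 districts), and every variable and function name is a hypothetical chosen for illustration.

```python
from functools import reduce

# Abbreviated per-year sets of districts passing the ELA >= 68% filter.
# Assumption: only a few names are kept per year, so the intersection here
# has 4 members instead of the 12 reported in the example.
qualifying_by_year = {
    "2014-15": {"Camas", "Colfax", "Freeman", "St. John", "Blaine", "Damman"},
    "2015-16": {"Camas", "Colfax", "Freeman", "St. John", "Blaine", "Damman", "Pullman"},
    "2016-17": {"Camas", "Colfax", "Freeman", "St. John", "Damman", "Pullman"},
    "2017-18": {"Camas", "Colfax", "Freeman", "St. John", "Blaine", "Pullman"},
}

# District -> county, taken from the county labels in the node 1 answer.
county_of = {
    "Camas": "Clark",
    "Colfax": "Whitman",
    "Freeman": "Spokane",
    "St. John": "Whitman",
}

# Counties named by node 5 as lying in Eastern Washington.
eastern_washington = {"Whitman", "Spokane"}

# Facts quoted in nodes 6 and 7: (borders Idaho, 2020 census population).
county_facts = {"Whitman": (True, 47_973), "Spokane": (True, 539_339)}


def run_chain(sets_by_year, population_cap=300_000):
    """Return the hop 1 districts, their counties, and the hop 3 result."""
    # Hop 1: districts that qualify in every school year.
    districts = reduce(set.intersection, sets_by_year.values())
    counties = {county_of[d] for d in districts}
    # Hop 2: keep only the Eastern Washington counties.
    eastern = counties & eastern_washington
    # Hop 3: must border Idaho and fall under the population cap.
    selected = sorted(
        c for c in eastern
        if county_facts[c][0] and county_facts[c][1] < population_cap
    )
    return districts, counties, selected


districts, counties, selected = run_chain(qualifying_by_year)
print(sorted(districts))  # ['Camas', 'Colfax', 'Freeman', 'St. John']
print(selected)           # ['Whitman']
```

Note that the district-to-county mapping in this sketch is exactly the step the annotator flags as missing a supporting source in node 5; here it is hard-coded rather than retrieved.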
