Latent Bridges for Multi-Table Question Answering
Summary
GRAB uses a GNN encoder to convert relational tables into latent tokens for frozen LLMs, achieving significant performance gains in multi-table question answering.
# Latent Bridges for Multi-Table Question Answering
Source: [https://arxiv.org/html/2606.28916](https://arxiv.org/html/2606.28916)
Simone Varriale¹, Tamara Cucumides², Floris Geerts², Paolo Papotti¹ (¹EURECOM, ²University of Antwerp)
###### Abstract
We introduce GRAB, a constructor–encoder–bridge pipeline for table question answering. Our method lifts relational data into a heterogeneous graph, encodes it via message passing, and transfers the signals to an LLM through a small set of query-conditioned latent tokens. This provides the LLM with a compact, task-relevant structural representation together with the flattened text. Crucially, the LLM remains strictly frozen to preserve its general reasoning capabilities; we train only the lightweight graph encoder and latent bridge (91M parameters), allowing the entire pipeline to be trained efficiently. Our pipeline significantly improves performance on relational question answering, with the largest gains in demanding multi-table settings, offering an efficient, principled way to connect relational deep learning with LLMs. [Link to Code](https://anonymous.4open.science/r/Graph-Relational-Attention-Bridge)
## 1 Introduction
Table question answering (TQA) asks language models to answer natural language (NL) questions grounded in structured data. While Text-to-SQL is popular for querying databases, TQA is essential because SQL struggles with messy, unnormalized data, implicit relationships, or hybrid contexts where tables are mixed with free text Badaro et al. ([2023](https://arxiv.org/html/2606.28916#bib.bib3)). Most LLM-based approaches treat tables as text: they serialize rows and columns into a 1D sequence and rely on the model to recover the underlying structure. This strategy is convenient, but it mismatches the semantics of tables, where row–column organization, permutation invariance, hierarchical headers, and cross-cell dependencies are central to meaning. This loss of structure is a key reason why LLMs remain brittle on table reasoning Li et al. ([2025](https://arxiv.org/html/2606.28916#bib.bib18)).
A natural alternative is to treat tables as a separate modality and interface them with LLMs through learned representations rather than raw serialization alone. In this work, we encode tables with a dedicated neural network and inject the resulting features into an LLM as latent tokens. Our starting point is that many of the difficulties of TQA are inherently relational: relevant evidence is distributed across rows, columns, and value groups, and in the multi-table setting the model must reason across linked tables through join dependencies.
We therefore propose GRAB (Graph-Relational Attention Bridge), a GNN-based table encoder for LLM-based *multi-table* question answering. As shown in Figure [1](https://arxiv.org/html/2606.28916#S1.F1), our encoder lifts relational data into a graph, uses message passing to capture structural dependencies, and compresses the result into a small set of latent tokens consumed by the LLM. The representation is conditioned on the NL question, allowing the encoder to produce question-relevant structure rather than a single static summary of the input table.
Adapting LLMs to tabular data typically requires computationally expensive full fine-tuning or parameter-heavy adapters, e.g., LoRA Hu et al. ([2022](https://arxiv.org/html/2606.28916#bib.bib15)). While LoRA freezes the pretrained backbone and learns task-specific low-rank updates, it alters the model’s internal representations. This can cause the adapted model to over-specialize to the fine-tuning domain, potentially reducing out-of-distribution cross-task performance and causing catastrophic forgetting Huang et al. ([2024](https://arxiv.org/html/2606.28916#bib.bib16)). In contrast, our approach keeps the LLM strictly frozen and isolates the task-specific learning entirely within our 91M-parameter external module. This lightweight design allows the entire pipeline to be trained efficiently on a single GPU, lowering the hardware barrier for multi-table reasoning.
Figure 1: Architecture Overview. Tables are processed via two parallel streams: text serialization and a tripartite graph that explicitly captures multi-table joins. A Query-Conditioned Latent Resampler uses the natural language question to actively filter the GNN-encoded graph into dynamic soft tokens. These structural tokens guide a frozen LLM to generate the final answer. Only the lightweight graph and resampler modules are updated during training.

Our central claim is that graph-conditioned latent interfaces provide a practical middle ground between two extremes: pure text serialization, which underuses relational structure, and symbolic pipelines, which often sacrifice the flexibility of LLMs on underspecified or compositional questions. By combining a relational graph encoder with a lightweight latent bridge, we preserve structural bias while keeping the downstream model fully compatible with autoregressive LLM reasoning. Empirically, we show that this design is especially effective on multi-table and structurally demanding questions. Conceptually, our results confirm that, for TQA, tables should not be treated merely as text, but as a structured modality that requires its own encoder and interface to the LLM.

In summary, our main contributions are threefold. First, we introduce GRAB’s architecture (Section [4](https://arxiv.org/html/2606.28916#S4)) and formalize it (Section [5](https://arxiv.org/html/2606.28916#S5)). Second, we design a stress-test taxonomy that isolates structural evidence localization from exact arithmetic, providing a novel diagnostic tool for LLMs (Section [6](https://arxiv.org/html/2606.28916#S6)). Finally, we show that GRAB consistently outperforms serialization-only across 13 single- and multi-table QA benchmarks (Section [7](https://arxiv.org/html/2606.28916#S7)).
## 2 Problem Setting
We formulate multi-table question answering (TQA) as a conditional generation task over a relational database. (Footnote 1: The proposed solution also works for other tasks, such as tabular fact checking; we report these results in Appendix [A](https://arxiv.org/html/2606.28916#A1).) Let $\mathcal{T}=\{T_1,T_2,\dots,T_n\}$ be a set of tables, where each table $T_i$ consists of a set of columns $C_i$, rows $R_i$, and cell values $V_i$. The database is accompanied by metadata $\mathcal{M}$, which defines the schema, including the foreign-key (FK) relationships $\mathcal{F}$ linking tables. Each column has a type $\tau\in\{\mathrm{cat(egorical)},\mathrm{num(erical)},\mathrm{text}\}$.

Given a natural language question $Q$ and the relational context $(\mathcal{T},\mathcal{M})$, the goal is to generate a target answer $A$, which may be free-form generated text, an extractive span, or a numeric aggregate.

In the standard setting, an LLM is either used in a zero/few-shot setting (i.e., frozen) or trained to maximize the probability of the correct answer $P(A\mid Q,\mathcal{T},\mathcal{M})$. Typically, the input tables are flattened into a text sequence $S_{\text{table}}$, and the LLM implicitly reconstructs row-column alignments and cross-table linkages Badaro et al. ([2023](https://arxiv.org/html/2606.28916#bib.bib3)). However, tabular data exhibits unique semantic properties, such as permutation invariance of rows and strict hierarchical header structures, that are distorted or even lost by serialization.
## 3 Related Work
LLMs for Table Question Answering. Early approaches to TQA rely on encoder-only models pretrained on flattened tabular data Herzig et al. ([2020](https://arxiv.org/html/2606.28916#bib.bib13)); Liu et al. ([2022a](https://arxiv.org/html/2606.28916#bib.bib20)). Other models mitigate the loss of structure caused by serialization with specialized attention biases to capture row-column alignments and preserve permutation invariance Yang et al. ([2022](https://arxiv.org/html/2606.28916#bib.bib34)), but injecting these structural biases requires retraining of the LLM. Indeed, with LLMs, the paradigm shifted to text serialization Xie et al. ([2022](https://arxiv.org/html/2606.28916#bib.bib31)), where tables are converted into (Markdown or HTML) sequences and processed via standard autoregressive generation Zhang et al. ([2024](https://arxiv.org/html/2606.28916#bib.bib36)). However, current pure-LLM TQA strategies suffer context-window fragmentation when serializing multiple tables, losing structural coherence Contalbo et al. ([2025](https://arxiv.org/html/2606.28916#bib.bib9)); Chen et al. ([2024](https://arxiv.org/html/2606.28916#bib.bib6)).

Recent models such as TAMO Li et al. ([2025](https://arxiv.org/html/2606.28916#bib.bib18)) inject hypergraph-encoded tables as soft tokens. TAMO constructs a hypergraph over cell occurrences: each cell is a primitive node, while rows, columns, and the whole table act as hyperedges. In contrast, GRAB canonicalizes repeated values within column classes and merges foreign-key-linked column occurrences into shared classes. Equality patterns and join keys therefore become explicit graph connectivity rather than implicit textual or embedding-level coincidences. Moreover, TAMO’s encoding is query-agnostic, whereas our latent bridge is conditioned on the NL question.
Graph Learning for Relational Data. Tabular ML is dominated by tree-based models Chen and Guestrin ([2016](https://arxiv.org/html/2606.28916#bib.bib7)) and multilayer perceptrons, which treat rows as independent, identically distributed samples. Tabular foundation models Hollmann et al. ([2025](https://arxiv.org/html/2606.28916#bib.bib14)); Chang et al. ([2025](https://arxiv.org/html/2606.28916#bib.bib5)) introduce cross-row attention but are designed for row-level predictions. For TQA, we argue that relational databases are too dense to be losslessly compressed as static tokens.

Conversely, relational deep learning Robinson et al. ([2024](https://arxiv.org/html/2606.28916#bib.bib28)) explicitly models multi-table databases as heterogeneous graphs, using message passing to capture foreign key links. While these models excel at node classification over databases, they are rarely integrated into LLM pipelines for TQA. Our work bridges this gap by using graph constructors to encode tabular data before reasoning.

Soft Tokens & Multimodality. Parameter-efficient fine-tuning methods Li and Liang ([2021](https://arxiv.org/html/2606.28916#bib.bib19)); Liu et al. ([2022b](https://arxiv.org/html/2606.28916#bib.bib21)) introduced “soft tokens”: continuous, learnable prompt vectors that steer frozen LLMs without updating their weights. This paradigm has been adapted for multimodal bridging, where models use latent resamplers to compress continuous signals from vision/audio encoders into few soft prefix tokens Alayrac et al. ([2022](https://arxiv.org/html/2606.28916#bib.bib2)); Li et al. ([2023](https://arxiv.org/html/2606.28916#bib.bib17)). We are the first to use query-conditioned latent resamplers to compress relational structures. Rather than acting as a generic summary, our latents act as a learned structural retrieval bridge (see Appendix [I](https://arxiv.org/html/2606.28916#A9), Table [15](https://arxiv.org/html/2606.28916#A9.T15) for a feature comparison).
## 4 Method
As illustrated in Figure [1](https://arxiv.org/html/2606.28916#S1.F1), instead of relying only on $S_{\text{table}}$, we define a graph constructor $\gamma$ that lifts the relational context into an explicit heterogeneous graph $\mathcal{G}=\gamma(\mathcal{T},\mathcal{F})$. In this graph, rows, column classes, and value groups are represented as typed nodes, and foreign-key metadata induces shared column classes. A graph encoder processes $\mathcal{G}$ to capture row–column–value dependencies and cross-table connectivity. To interface with the LLM, we define a query-conditioned latent bridge that projects the encoded graph into a fixed-length sequence of $K$ soft tokens, denoted $Z=\{z_1,\dots,z_K\}$. The LLM then generates the answer from both the textual prompt and our structural prefix. Throughout this process, the LLM weights remain frozen; only the graph encoder and bridge are updated.
Relational Graph Constructor. Intuitively, our constructor translates relational tables into a graph where rows, columns, and actual cell values are all treated as individual nodes. Edges are drawn simply based on inclusion: a row is connected to the values it contains, and a column is connected to the values it can hold. This naturally forces repeated values and foreign-key joins to become shared connection points (hubs) in the graph.
The graph constructor $\gamma$ is a fixed, deterministic processing map. It exposes row–column–value incidence and cross-table join structure before any neural message passing occurs. We provide more details in [Appendix C](https://arxiv.org/html/2606.28916#A3). Let us consider the foreign-key metadata in $\mathcal{M}$, that is, $\mathcal{F}=\bigl\{\bigl((i,c),(j,d)\bigr)\mid\text{column $c$ of $T_i$ joins column $d$ of $T_j$}\bigr\}$. The constructor maps the relational input to a tripartite graph $\gamma(\mathcal{T},\mathcal{F})=\bigl(\mathcal{G},H^{(0)}\bigr)$ with $\mathcal{G}=\bigl(\mathcal{V}_R\cup\mathcal{V}_C\cup\mathcal{V}_V,\ \mathcal{E}_{RV}\cup\mathcal{E}_{CV}\bigr)$, where $\mathcal{V}_R$ are row nodes, $\mathcal{V}_C$ are column-class nodes, $\mathcal{V}_V$ are value-group nodes, $\mathcal{E}_{RV}$ are row–value incidence edges, and $\mathcal{E}_{CV}$ are column–value incidence edges. Furthermore, $H^{(0)}$ denotes the initial row, column, and value node features $H_R^{(0)}$, $H_C^{(0)}$, and $H_V^{(0)}$.
A *distinguishing characteristic* of our construction is that columns linked by foreign-key pairs in $\mathcal{F}$ are represented by a single column-class node. This embeds joins directly in the graph and makes related tables accessible to later message passing. For example, in Figure [1](https://arxiv.org/html/2606.28916#S1.F1), the city columns in $T_1$ and $T_2$ map to a single node in $\mathcal{G}$. Nodes and edges are otherwise defined in the natural way: row–value edges record which value groups occur in each row, and column–value edges record which value groups belong to each column class. For value nodes, we take values after applying a standardization map. Here, categorical and textual values are normalized and canonicalized within their column class, while numerical values are mapped to quantile buckets.
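The construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the input layout (`tables` as a dictionary of column names and rows), the helper names `canonicalize` and `build_graph`, and the toy standardization (trim plus case-folding) are all assumptions made for illustration, and quantile bucketing of numerical values is omitted. Foreign-key-linked columns are merged into one column class with a small union-find.

```python
def canonicalize(cell):
    """Toy standardization (assumption): trim and case-fold the string form."""
    return str(cell).strip().casefold()


def build_graph(tables, foreign_keys):
    """Build row, column-class, and value-group nodes plus incidence edges.

    tables: {table_name: {"columns": [names], "rows": [[cells], ...]}}
    foreign_keys: list of ((table, column), (table, column)) pairs.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Register every column occurrence, then merge FK-linked occurrences.
    for name, table in tables.items():
        for col in table["columns"]:
            find((name, col))
    for a, b in foreign_keys:
        parent[find(a)] = find(b)

    column_classes = sorted({find(c) for c in list(parent)})
    row_nodes, value_nodes, seen_values = [], [], set()
    e_rv, e_cv = set(), set()  # sets keep the incidence binary
    for name, table in tables.items():
        for r_idx, row in enumerate(table["rows"]):
            row_id = (name, r_idx)
            row_nodes.append(row_id)
            for col, cell in zip(table["columns"], row):
                col_class = find((name, col))
                # A value group is identified by (column class, canonical value).
                value_id = (col_class, canonicalize(cell))
                if value_id not in seen_values:
                    seen_values.add(value_id)
                    value_nodes.append(value_id)
                e_rv.add((row_id, value_id))
                e_cv.add((col_class, value_id))
    return row_nodes, column_classes, value_nodes, e_rv, e_cv


# Hypothetical two-table example joined on the city column.
tables = {
    "customers": {"columns": ["city", "name"],
                  "rows": [["Paris", "Ann"], ["Rome", "Bob"]]},
    "stores": {"columns": ["city", "store"], "rows": [["paris ", "S1"]]},
}
foreign_keys = [(("customers", "city"), ("stores", "city"))]
rows, cols, vals, e_rv, e_cv = build_graph(tables, foreign_keys)
```

In this toy example the two city columns collapse into one column class, so the "paris" value group becomes a hub shared by a customer row and a store row.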
For initial features, we use a fixed token-level text encoder and extend it to strings by mean pooling. The graph hidden dimension is inherited from the embedding model used to initialize the nodes. In our implementation, row and column nodes are initialized from the same text embedding model, while value nodes are constructed directly in the same hidden dimension. Hence, all node types already lie in a common embedding space, and no additional projection is applied at initialization.
Column nodes are initialized from the header embedding of the corresponding column occurrence. When a column node represents multiple columns identified by foreign-key links, we average their header embeddings. Value nodes are not initialized from the raw cell-value text. Instead, we use a deterministic embedding of the value-group identifier, augmented with column-type information, constructed in the graph hidden dimension. Thus, value nodes primarily act as structural anchors: they indicate where identical categorical/textual values or discretized numeric buckets occur, while exact values remain available to the LLM through $S_{\text{table}}$.
Graph Encoder. Given the initialized node representations $H_R^{(0)}$, $H_C^{(0)}$, and $H_V^{(0)}$, the graph encoder applies message passing over the tripartite incidence structure of $\mathcal{G}$. The encoder maintains separate hidden states for row, column, and value nodes, and updates them through the row-value and column-value adjacency matrices. Let $A_{RV}\in\{0,1\}^{|\mathcal{V}_R|\times|\mathcal{V}_V|}$ denote the row-value incidence matrix, where $(A_{RV})_{ig}=1$ if row node $r_i$ is connected to value node $v_g$. Similarly, let $A_{CV}\in\{0,1\}^{|\mathcal{V}_C|\times|\mathcal{V}_V|}$ denote the column-value incidence matrix, where $(A_{CV})_{jg}=1$ if column node $c_j$ is connected to value node $v_g$.
At layer $\ell$, value nodes aggregate messages from their incident row and column nodes, while row and column nodes aggregate messages from their incident value nodes:

$$M_V^{(\ell)} = A_{RV}^{\top} H_R^{(\ell)} + A_{CV}^{\top} H_C^{(\ell)}, \qquad M_R^{(\ell)} = A_{RV} H_V^{(\ell)}, \qquad M_C^{(\ell)} = A_{CV} H_V^{(\ell)}.$$
The row, column, and value states are then updated in parallel using type-specific residual blocks. For each node type $X$, where $X$ may denote rows $R$, columns $C$, or values $V$, the update is:

$$\tilde{H}_X^{(\ell)} = H_X^{(\ell)} + \mathrm{Drop}\!\left(N_X^{\mathrm{msg}}\big(M_X^{(\ell)}\big)\right),$$

$$H_X^{(\ell+1)} = \tilde{H}_X^{(\ell)} + \mathrm{Drop}\!\left(F_X\!\left(N_X^{\mathrm{ffn}}\big(\tilde{H}_X^{(\ell)}\big)\right)\right),$$

where $\mathrm{Drop}$, $N_X^{\mathrm{msg}}$, $N_X^{\mathrm{ffn}}$, and $F_X$ denote dropout, message normalization, FFN normalization, and the feed-forward network, respectively. The update follows a Transformer-style residual design: the first residual branch injects aggregated graph messages, while the second residual branch applies a type-specific feed-forward refinement to the message-updated node state. Each node type has separate normalization and feed-forward parameters.

This message-passing scheme lets row and column representations exchange information through value nodes. As a result, rows that contain the same canonical value group can influence one another, columns receive signals from the values they generate, and in the multi-table case, shared value or foreign-key-linked structures allow information to propagate across tables. After $L$ graph layers, the encoder outputs contextualized node representations $H_R^{(L)}$, $H_C^{(L)}$, and $H_V^{(L)}$. These representations are not pooled into a single graph vector; instead, they are passed to the latent bridge, which compresses the variable-size graph into a fixed number of soft tokens for the LLM.
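One layer of the message-passing scheme above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' code: the incidence matrices are stored as edge lists of index pairs, normalization is a parameter-free layer norm, dropout is omitted (as at inference time), and the feed-forward networks are passed in as hypothetical callables keyed by node type. A real implementation would use a tensor library with sparse operations.

```python
import math


def layer_norm(v, eps=1e-5):
    """Parameter-free layer normalization of one vector (assumed form of N_X)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def scatter_sum(pairs, source, num_targets):
    """For each (target, src) index pair, add source[src] into row `target`."""
    dim = len(source[0])
    out = [[0.0] * dim for _ in range(num_targets)]
    for tgt, src in pairs:
        for k in range(dim):
            out[tgt][k] += source[src][k]
    return out


def aggregate(h_r, h_c, h_v, e_rv, e_cv):
    """e_rv holds (row, value) pairs and e_cv holds (column, value) pairs."""
    # M_V = A_RV^T H_R + A_CV^T H_C
    from_rows = scatter_sum([(g, i) for i, g in e_rv], h_r, len(h_v))
    from_cols = scatter_sum([(g, j) for j, g in e_cv], h_c, len(h_v))
    m_v = [[a + b for a, b in zip(x, y)] for x, y in zip(from_rows, from_cols)]
    m_r = scatter_sum(e_rv, h_v, len(h_r))  # M_R = A_RV H_V
    m_c = scatter_sum(e_cv, h_v, len(h_c))  # M_C = A_CV H_V
    return m_r, m_c, m_v


def residual_update(h, m, ffn):
    """H~ = H + N_msg(M), then H' = H~ + F(N_ffn(H~)); dropout omitted."""
    h_tilde = [[a + b for a, b in zip(hi, layer_norm(mi))]
               for hi, mi in zip(h, m)]
    return [[a + b for a, b in zip(ht, ffn(layer_norm(ht)))] for ht in h_tilde]


def message_passing_layer(h_r, h_c, h_v, e_rv, e_cv, ffn):
    """ffn maps 'R', 'C', 'V' to a per-vector callable (type-specific F_X)."""
    m_r, m_c, m_v = aggregate(h_r, h_c, h_v, e_rv, e_cv)
    return (residual_update(h_r, m_r, ffn["R"]),
            residual_update(h_c, m_c, ffn["C"]),
            residual_update(h_v, m_v, ffn["V"]))
```

All three node types are updated from the same layer-$\ell$ states, which mirrors the parallel update in the equations: messages are computed first, and only then are the residual blocks applied.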
Latent Bridge to the LLM. In standard prompt- and prefix-tuning Li and Liang ([2021](https://arxiv.org/html/2606.28916#bib.bib19)), the LLM is steered by a set of task-specific soft tokens that remain static at inference time: the same prompt vectors are prepended regardless of the specific input instance. In our TQA setting, however, the structural context changes dynamically based on both the input graph $\mathcal{G}$ and the user question $Q$. Therefore, we design a query-conditioned latent resampler, in the style of a Perceiver Resampler Alayrac et al. ([2022](https://arxiv.org/html/2606.28916#bib.bib2)), that generates a sequence of soft tokens for every forward pass.
Let $H_{\text{graph}}\in\mathbb{R}^{N\times d}$ denote all final node embeddings output by the GNN, and $H_Q\in\mathbb{R}^{M\times d}$ denote the contextualized embeddings of the NL question (e.g., via a lightweight RoBERTa model). To bridge the modality gap without exhausting the LLM’s context window, we initialize a fixed number $K$ of learnable query vectors $Z^{(0)}\in\mathbb{R}^{K\times d}$.

To make the soft tokens question-conditioned, the resampler uses the question embeddings $H_Q$ as additional context during latent extraction. The initial learnable latent queries are partitioned by node type as $Z^{(0)}=[Z_R^{(0)};Z_C^{(0)};Z_V^{(0)}]$, where $Z_R^{(0)}\in\mathbb{R}^{K_R\times d}$, $Z_C^{(0)}\in\mathbb{R}^{K_C\times d}$, and $Z_V^{(0)}\in\mathbb{R}^{K_V\times d}$ correspond to row, column, and value latents, respectively. The question is encoded once to obtain $H_Q$ and is then reused by the latent groups. Each group performs cross-attention over its corresponding graph node representations concatenated with the same question embeddings along the sequence dimension: $Z_R=\mathrm{Attn}(Z_R^{(0)},[H_R;H_Q])$, $Z_C=\mathrm{Attn}(Z_C^{(0)},[H_C;H_Q])$, and finally $Z_V=\mathrm{Attn}(Z_V^{(0)},[H_V;H_Q])$.

The output is a sequence of $K$ soft tokens $Z\in\mathbb{R}^{K\times d}$, linearly projected to match the hidden dimension $d_L$ of the frozen LLM. Unlike static prefix tokens, these $K$ latents compress the multi-table graph into a summary conditioned on the question.
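The type-partitioned cross-attention above can be sketched as follows. This is a simplified illustration, not the paper's resampler: it uses a single attention head with no learned query/key/value projections, no residual or feed-forward sublayers, and no final projection to $d_L$; the function names and the dictionary layout keyed by `"R"`, `"C"`, `"V"` are assumptions. It only shows how each latent group attends over its own node type concatenated with the shared question embeddings.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(latents, context):
    """Scaled dot-product attention of latent queries over context vectors."""
    d = len(latents[0])
    outputs, weights = [], []
    for z in latents:
        scores = [sum(a * b for a, b in zip(z, c)) / math.sqrt(d)
                  for c in context]
        w = softmax(scores)
        weights.append(w)
        # Each output is a convex combination of the context vectors.
        outputs.append([sum(wi * c[k] for wi, c in zip(w, context))
                        for k in range(d)])
    return outputs, weights


def resample(z0, nodes, h_q):
    """z0 and nodes are dicts keyed by node type; h_q is the question context."""
    tokens = []
    for node_type in ("R", "C", "V"):
        context = nodes[node_type] + h_q  # [H_X ; H_Q] along the sequence axis
        z_x, _ = cross_attention(z0[node_type], context)
        tokens.extend(z_x)
    return tokens  # K = K_R + K_C + K_V vectors of size d
```

Because the number of latents is fixed by `z0`, the output length does not depend on how many rows, columns, or values the graph contains, which is the property that keeps the structural prefix compact.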
LLM Interface and Answer Generation. The latent bridge outputs a fixed-length sequence of graph-derived soft tokens $Z=\{z_1,\ldots,z_K\}$ in the graph encoder hidden space. Before being passed to the LLM, these vectors are mapped to the LLM embedding dimension through a learned projector: $\hat{Z}=\mathrm{Proj}(Z)$. The projected latents $\hat{Z}$ are then injected as soft prefix embeddings before the prompt.

The final LLM input consists of the graph latents, the task description, the serialized textual table context, and the NL question: $[\hat{Z};D;S_{\text{table}};Q]$, where $D$ denotes the instruction or dataset-specific description, $S_{\text{table}}$ denotes the retained textual serialization of the table or table segments, and $Q$ is the question. Thus, the LLM receives both explicit textual context and a compact structural prefix.
Training Objective. Our training paradigm is designed to be highly parameter-efficient. Let $\Theta_{\text{LLM}}$ denote the parameters of the large language model, and $\Phi_{\text{GNN+Bridge}}$ denote the parameters of our graph encoder and query-conditioned resampler ($\sim$91M parameters). During training, we freeze $\Theta_{\text{LLM}}$ and optimize only $\Phi_{\text{GNN+Bridge}}$ to minimize the standard autoregressive negative log-likelihood of the target answer $A$:

$$\mathcal{L} = -\sum_{t=1}^{|A|} \log P_{\Theta_{\text{LLM}}}\big(a_t \mid a_{<t}, Q, \hat{Z}, S_{\text{table}}\big).$$

Because only the lightweight $\Phi_{\text{GNN+Bridge}}$ modules are updated, no weight gradients or optimizer states are kept for the LLM. This substantially reduces the memory footprint, allowing the framework to be trained end-to-end on a single GPU while fully preserving the LLM’s pre-trained general knowledge.
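As a small worked illustration of the loss above, the sketch below computes the summed negative log-likelihood over answer positions from per-step logits. The function names and the plain-list representation of logits are assumptions for illustration; in practice the logits would come from the frozen LLM conditioned on the prefix and prompt, and the computation would run in a deep learning framework.

```python
import math


def log_softmax(logits):
    """Log-probabilities from raw logits, using the log-sum-exp trick."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def answer_nll(step_logits, answer_ids):
    """Sum of -log P(a_t | a_<t, ...) over the answer tokens only.

    step_logits[t] holds the vocabulary logits predicted at answer step t;
    answer_ids[t] is the index of the gold token a_t.
    """
    if len(step_logits) != len(answer_ids):
        raise ValueError("one logit vector is needed per answer token")
    return -sum(log_softmax(logits)[token]
                for logits, token in zip(step_logits, answer_ids))
```

With uniform logits over a vocabulary of size 4 and a three-token answer, the loss equals $3\log 4$, since each token receives probability $1/4$.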
## 5 Structural Analysis
Why Explicit Graph Construction Helps. A serialized table represents equality, co-occurrence, and joins only indirectly, as repeated strings scattered across a sequence. The constructor turns these relations into graph structure. Repeated categorical or textual values become shared value nodes; foreign-key-linked columns become shared column classes. As a result, duplicate elimination becomes a value-node degree property, and joins become bounded-hop paths through shared value groups. The constructor therefore makes common relational operations directly available to the encoder.
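The two observations above (duplicates as a degree property, joins as short paths) can be made concrete on a toy edge list. The data and helper names below are hypothetical and only illustrate the graph-level view; they are not part of the paper's pipeline.

```python
from collections import defaultdict

# Row-value incidence edges: (row node, value-group node).
# Row nodes are (table, row index); value nodes are (column class, value).
E_RV = [
    (("customers", 0), ("city", "paris")),
    (("customers", 1), ("city", "rome")),
    (("customers", 2), ("city", "paris")),
    (("stores", 0), ("city", "paris")),
]


def rows_per_value(e_rv):
    """Index each value node by the set of rows incident to it."""
    index = defaultdict(set)
    for row, value in e_rv:
        index[value].add(row)
    return index


def distinct_values(e_rv, column_class):
    """Duplicate elimination: one value node exists per distinct value."""
    return {value for _, value in e_rv if value[0] == column_class}


def repeated_values(e_rv):
    """Value nodes with row-degree above one occur in several rows."""
    return {v for v, rows in rows_per_value(e_rv).items() if len(rows) > 1}


def join_partners(row, e_rv):
    """Rows of other tables reachable in two hops: row -> value -> row."""
    index = rows_per_value(e_rv)
    partners = set()
    for r, value in e_rv:
        if r == row:
            partners |= {other for other in index[value] if other[0] != row[0]}
    return partners
```

Here the store row reaches both Paris customers through the single shared value node, which is the two-hop path that a two-layer encoder can traverse.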
What Message Passing Can Extract. The encoder is a typed message-passing network over row, column-class, and value-group nodes. Each layer expands the accessible neighborhood by one graph hop, so an $L$-layer encoder extracts bounded-depth structural features of $\mathcal{G}$. In GRAB, these features include row membership, column membership, repeated-value support, row-level co-occurrence, and local foreign-key connectivity. The encoder is most useful when the bottleneck is structural access, e.g., locating rows, forming groups, detecting repeated values, or following joins. It is less useful when the relevant evidence has already been located and the remaining difficulty is exact symbolic or numerical computation.

Limits of the Latent Bridge. The latent bridge is a readout and compression mechanism, not an additional source of graph expressivity. It receives the node representations produced by message passing and turns them into a fixed number of soft tokens. It can select, weight, and summarize information exposed by the graph encoder, but it cannot recover distinctions that were not represented in the constructed graph or were erased during message passing.

This fixed-capacity compression creates a bottleneck. A question-agnostic bridge must compress the whole graph into the same sketch regardless of what is being asked, so it may waste capacity on irrelevant details or discard facts needed for a particular query. A question-conditioned bridge mitigates this by using the question to decide which rows, columns, values, and joins should be emphasized. Thus, it does not make the encoder more expressive, but it makes the limited latent capacity more useful for the current question. Appendix [D](https://arxiv.org/html/2606.28916#A4) formalizes this intuition as a sketch-complexity separation.
## 6 Experimental Setup
Datasets. We evaluate our method on a suite of benchmarks covering both single- and multi-table QA, as listed in Table [1](https://arxiv.org/html/2606.28916#S6.T1).
| Setting | Dataset | Extra | Train | Main Challenge |
| --- | --- | --- | --- | --- |
| Single-table | StructQA | No | 4.5k | Structure sensitivity |
| Single-table | HiTab | No | 7.4k | Hierarchical tables |
| Single-table | WTQ | No | 11.3k | Compositional QA |
| Single-table | WikiSQL | No | 56.4k | Filtering & aggregation |
| Single-table | HCTQA | Layout | 62.1k | Complex layouts |
| Single-table | TabMWP | Text | 23.1k | Math reasoning |
| Multi-table | MultiHierTT | Text | 7.0k | Multi-hop reasoning |
| Multi-table | SciTaT | Text | 11.6k | Scientific evidence |
| Multi-table | MMQA | Text | 2.3k | Cross-modal reasoning |
| Multi-table | TQA-Bench | No | 9.8k | Long-context joins |
| Multi-table | Atis | No | 0.4k | Flight-query reasoning |
| Multi-table | GeoQuery | No | 0.5k | Geographic querying |
| Multi-table | Spider | No | 6.0k | SQL result generation |

Table 1: Datasets used in our experiments. The suite spans standard single-table QA, structure-heavy and layout-heavy settings, and multi-table or hybrid benchmarks requiring reasoning across tables and text.

For single-table experiments, we use six datasets. StructQA Li et al. ([2025](https://arxiv.org/html/2606.28916#bib.bib18)) focuses on table structure understanding and robustness to structural variation. HiTab Cheng et al. ([2022](https://arxiv.org/html/2606.28916#bib.bib8)) emphasizes hierarchical headers and aggregation-heavy reasoning. WTQ Pasupat and Liang ([2015](https://arxiv.org/html/2606.28916#bib.bib26)) and WikiSQL Zhong et al. ([2017](https://arxiv.org/html/2606.28916#bib.bib39)) provide standard flat-table benchmarks with broad use in the literature, while HCTQA Ahmad et al. ([2026](https://arxiv.org/html/2606.28916#bib.bib1)) focuses on human-centric tables with complex layouts. TabMWP Lu et al. ([2023](https://arxiv.org/html/2606.28916#bib.bib22)) complements these datasets with table-grounded math word problems that require numerical and multi-step reasoning.
For multi-table experiments, we use seven datasets, ranging from relational reasoning to hybrid QA over tables and text. MultiHiertt Zhao et al. ([2022](https://arxiv.org/html/2606.28916#bib.bib38)), MMQA Wu et al. ([2025](https://arxiv.org/html/2606.28916#bib.bib30)), and SCITAT Zhang et al. ([2025b](https://arxiv.org/html/2606.28916#bib.bib37)) combine tables with textual evidence; TQA-Bench Qiu et al. ([2024](https://arxiv.org/html/2606.28916#bib.bib27)) focuses on scalable multi-table relational reasoning under long contexts (8K); and Atis, GeoQuery, and Spider Pal et al. ([2023](https://arxiv.org/html/2606.28916#bib.bib25)) target QA where the output itself may be tabular.
Baselines. We compare our method, GRAB, against representative baselines. We include an Inference-only baseline, where the model receives only the serialized table(s) and question in zero-shot form, without any trainable table encoder or prompt parameters. For the Frozen LLM setting, we compare against Prompt Tuning, where the base LLM remains fixed and only a small set of learned soft prompt vectors is optimized, and TAMO (Footnote 2: TAMO is excluded from our multi-table experiments, as the released implementation supports only single-table inputs), which encodes each table with a hypergraph neural network and injects the resulting latent table features into the frozen LLM as soft tokens. Finally, we consider a Tuned LLM setting, where the LLM is adapted with LoRA with and without the GRAB encoder. As an additional reference, we report results for GPT-5.4-mini OpenAI ([2026](https://arxiv.org/html/2606.28916#bib.bib24)) under an inference-only setup. We also include two task-specialized baselines: TableLlama Zhang et al. ([2024](https://arxiv.org/html/2606.28916#bib.bib36)) for the single-table setting and MultiTabQA for the multi-table setting. These baselines are fine-tuned separately on each dataset, providing supervised references for the corresponding evaluation regimes.
Table 2: Single-table question answering results with Qwen3-4B-Base as the base model. We report denotation accuracy on all datasets, except F1 and Complete Containment (CC) on HCTQA.

Table 3: Multi-table question answering results with Qwen3-4B-Base. We report denotation accuracy on MultiHiertt and MMQA, EM/F1 on SCITAT, accuracy on TQA-Bench, and Table Exact Match (T-EM) and Cell F1 on Atis, GeoQuery and Spider.

Stress-test Taxonomy. Existing benchmarks conflate three orthogonal sources of difficulty: the form of the expected answer, the structural depth of table access required to locate evidence, and the computational path needed to derive the final value. Appendix [F](https://arxiv.org/html/2606.28916#A6) formalizes this as a three-axis taxonomy (answer type, structural depth, and computational path), which lets us attribute failures to a specific axis rather than to aggregate “hardness.” To isolate the contributions of GRAB, we design a targeted stress-test over 15 relational tables that independently varies the structural axis (0–2 categorical filters, plus optional group-by keys) and the computational axis (lookup, count, max, avg).
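To make the two varied axes concrete, the sketch below enumerates one possible grid of stress-test configurations. Treating group-by as a binary switch and taking the full cross product of the axes are assumptions made here for illustration; the paper's actual question templates and any excluded combinations are specified in its appendix, not in this sketch.

```python
from itertools import product

NUM_FILTERS = (0, 1, 2)                          # structural axis: categorical filters
GROUP_BY = (False, True)                         # structural axis: optional group-by key
OPERATIONS = ("lookup", "count", "max", "avg")   # computational axis


def stress_grid():
    """Enumerate every (filters, group_by, op) combination (assumed full grid)."""
    return [
        {"filters": f, "group_by": g, "op": op}
        for f, g, op in product(NUM_FILTERS, GROUP_BY, OPERATIONS)
    ]


grid = stress_grid()
```

Holding the operation fixed while increasing the number of filters isolates structural difficulty; holding the filters fixed while moving from `lookup` to `avg` isolates computational difficulty.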
Implementation Details. Our reference backbone is Qwen3-4B-Base Yang et al. ([2025](https://arxiv.org/html/2606.28916#bib.bib33)). For the graph encoder, we initialize textual row and column representations with embeddings from Qwen3-Embedding-0.6B. All experiments, including both inference and fine-tuning, are run on NVIDIA A100 GPUs. The trainable encoder–bridge component has roughly 91M parameters. Appendix [E](https://arxiv.org/html/2606.28916#A5) reports the full training setup for GRAB and the baselines (optimizer, schedule, batch size, LoRA configuration, GNN hyperparameters) as well as dataset preprocessing details.
Evaluation Metrics. We report the original metric from each benchmark. For most datasets this is accuracy: a prediction is correct only if, after normalization, its multiset of answer values exactly matches the gold answers. We parse both the prediction and the reference into a multiset of values, apply case-folding and numeric canonicalization (comma stripping and integer/float unification), and require multiset equality. The comparison is order-invariant but count-sensitive, and partial overlap earns no credit. For HCTQA we report F1 together with Complete Containment (CC), a binary score that is 1 only when the prediction fully covers the gold answer set (recall = 1). As TQA-Bench answers are multiple-choice, we report accuracy by extracting the choice from the generated text before comparing it to the gold option. For SciTaT, we follow its protocol and split examples by answer length: short-form answers are scored by Exact Match, while free-form answers are scored by token-level F1. For the SQL-style multi-table benchmarks (ATIS, GeoQuery, Spider-SQL), the model generates the answer as a linearized result table, which we evaluate by table-level exact match (T-EM), i.e., the full ordered table must match, and by cell-level F1, which scores cells as unordered multisets.
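The denotation-accuracy and Complete Containment checks described above can be sketched as follows. This is a minimal illustration, not the evaluation script used in the paper: the `|` separator, the function names, and the exact normalization steps are assumptions, and CC is implemented here as set coverage of the gold values.

```python
from collections import Counter


def canonicalize(value: str) -> str:
    """Case-fold a value and unify numeric formats (assumed normalization)."""
    text = value.strip().casefold()
    try:
        # Comma stripping is only attempted for the numeric interpretation,
        # so textual values such as "paris, france" keep their comma.
        number = float(text.replace(",", ""))
    except ValueError:
        return text
    # Integer/float unification: 1200.0 and 1200 share one canonical form.
    return str(int(number)) if number.is_integer() else repr(number)


def to_multiset(answer: str, sep: str = "|") -> Counter:
    """Parse a delimited answer string into a multiset of canonical values."""
    return Counter(canonicalize(part) for part in answer.split(sep) if part.strip())


def denotation_correct(prediction: str, gold: str) -> bool:
    """Order-invariant, count-sensitive match; partial overlap earns no credit."""
    return to_multiset(prediction) == to_multiset(gold)


def complete_containment(prediction: str, gold: str) -> int:
    """Return 1 only if every gold value appears in the prediction (recall = 1)."""
    predicted, reference = to_multiset(prediction), to_multiset(gold)
    return int(all(value in predicted for value in reference))
```

Under these assumptions, `"Paris|1,200"` matches the gold answer `"1200.0|paris"`, while `"a|a|b"` does not match `"a|b"` because the multiset counts differ.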
## 7 Main Results
Tables [2](https://arxiv.org/html/2606.28916#S6.T2) and [3](https://arxiv.org/html/2606.28916#S6.T3) report results across single- and multi-table benchmarks. Comparing GRAB against serialization, graph, and scale baselines reveals four patterns:
(i) Massive gains on demanding structures. GRAB consistently outperforms serialized baselines, with the largest jumps exactly where 1D text struggles most: complex layouts (HCTQA: +17.5 F1) and multi-table joins (Spider: +13.3 Cell F1; TQA-Bench: +7.1 Acc). This confirms that explicitly modeling table structure relieves the LLM from implicitly reconstructing relational structure.
(ii) Frozen GRAB rivals LoRA fine-tuning. Despite training only ∼91M parameters while keeping the LLM frozen, GRAB approaches or beats a LoRA-adapted LLM reading serialized text (e.g., 84.80 vs. 77.93 on StructQA). Injecting structural bias proves more efficient than updating LLM weights to brute-force structure from text. Combining both yields the strongest results overall.
(iii) Query-conditioned graphs beat static hypergraphs. GRAB outperforms TAMO on every single-table benchmark: dynamically allocating latent capacity based on the question is superior to query-agnostic compression.
(iv) Punching above its weight class. GRAB consistently outperforms fine-tuned, TQA-specialized models (TableLlama and MultiTabQA). Our 4B-parameter pipeline rivals or exceeds GPT-5.4-mini (e.g., +29.2 F1 on HCTQA, +37.0 Acc on TQA-Bench), a closed-source model roughly 100× larger, demonstrating that explicit relational encoding can bridge massive gaps in raw parametric scale.
Table 4: Stress-test results (F1) for representative categories. Full table in Appendix [G](https://arxiv.org/html/2606.28916#A7).
Performance by Query Type. Table [4](https://arxiv.org/html/2606.28916#S7.T4) reports a condensed view of the diagnostic tests (full results in Appendix [G](https://arxiv.org/html/2606.28916#A7)), revealing how GRAB’s topology mitigates structural bottlenecks while exposing inherent LLM arithmetic limits. On locating evidence (lookup), GRAB gains +14 to +24 F1 across all condition depths, as the graph explicitly links rows via shared value nodes to route condition-matching information. This structural advantage is most pronounced on counting: while a serialized baseline must reconstruct frequencies from sequence position, cardinality in our graph reduces to a local structural property (a value node’s degree), yielding our largest absolute gains (up to +46 F1). Similarly, for group-by operations, GRAB’s explicit column-class nodes act as natural partition anchors, substantially recovering performance (e.g., Group-by max: 42.9 → 72.6 F1). Conversely, exact arithmetic (avg) defines the ceiling of our structural bridge. Because our graph maps continuous data to quantile buckets, it cannot execute exact math; it merely isolates the correct rows for the frozen LLM. Consequently, avg remains the hardest operator, with gains appearing only when filtering shrinks the operand set. These patterns are not an artifact of backbone size: rerunning the stress-test with a larger LLM (Qwen3-14B, Table [14](https://arxiv.org/html/2606.28916#A7.T14)) shows that the GRAB gap widens rather than closes, the avg ceiling is preserved exactly, and the serialized baseline does not improve on Group-by despite the larger model. The bottleneck is structural, not parametric.
Ablations. GNN Depth and Latent Count. A sequential search over latent count $K\in\{1,\dots,256\}$ and GNN depth $L\in\{1,\dots,16\}$ (Appendix [B](https://arxiv.org/html/2606.28916#A2)) shows that $K=32$ suffices and that deeper encoders yield negligible improvement over $L=1$. This is consistent with the tripartite structure: row-node initialization encodes per-row cell content, value nodes act as sinks for rows sharing an attribute value, and column nodes aggregate their values, so a single message-passing step exposes co-occurrence and column membership to every value node.
Question Conditioning Variants. Removing question conditioning from the resampler costs 1–3 points on three of four ablation benchmarks (Appendix [B](https://arxiv.org/html/2606.28916#A2), Table [9](https://arxiv.org/html/2606.28916#A2.T9)). The effect is modest but consistent in direction on multi-table benchmarks, supporting the sketch-complexity argument (Appendix [D](https://arxiv.org/html/2606.28916#A4)) that question conditioning helps most where the same graph must answer diverse queries.
Serialized Table. Removing the serialized table text from the prompt causes a collapse across datasets, e.g., 84.80 to 10.07 on StructQA (Appendix [B](https://arxiv.org/html/2606.28916#A2), Table [9](https://arxiv.org/html/2606.28916#A2.T9)), confirming that GRAB acts as a structural supplement rather than replacing the textual evidence needed by the LLM for exact values.
Additional Ablations. Appendix [B](https://arxiv.org/html/2606.28916#A2) reports further checks on architectural choices and robustness. First, a linear projection head matches or outperforms deeper MLP bridges, indicating that the graph encoder and latent resampler already perform the relevant structural abstraction. Second, results across multiple random seeds show low variance, confirming that the gains are stable rather than driven by a favorable initialization.
## 8 Conclusion
We presented GRAB, a graph-relational latent bridge for multi-table QA with frozen LLMs. Beyond feeding relational evidence into a sequential prompt, GRAB constructs a typed graph over rows, column classes, and value groups, applies message passing to expose structural dependencies, and compresses the resulting representation into question-conditioned soft tokens. This design preserves the flexibility of autoregressive language models while adding an explicit inductive bias for joins, repeated values, and cross-row evidence aggregation. Across single- and multi-table benchmarks, GRAB improves over serialization-only and soft-prompt baselines. The results support the view that tables should be treated as a structured modality. Our analysis shows that latent graph tokens act as structural guidance: they help the LLM locate relevant evidence, while arithmetic and fine-grained symbolic computation remain challenging.
Our findings suggest a broader direction for table-language modeling: pre-trained relational encoders that can be reused across datasets, analogous to the role of pre-trained encoders in vision-language models Dai et al. ([2023](https://arxiv.org/html/2606.28916#bib.bib11)). We see GRAB as a step toward such table foundation interfaces, where relational structure is encoded explicitly and integrated with LLM reasoning without sacrificing the general capabilities of the underlying LLM.
## 9 Limitations
While GRAB is parameter-efficient relative to LLM fine-tuning, it introduces additional preprocessing and graph-encoding overhead. This cost is modest in our setting, but it may become more significant for very large databases, highly connected schemas, or applications requiring low-latency inference. Future work should study scalable table retrieval jointly with graph construction, stronger pretraining for relational encoders, and tighter integration with symbolic tools for exact arithmetic and executable reasoning.
Because GRAB supplements rather than replaces the serialized text, the approach is still bounded by the context window limits of the underlying LLM (e.g., dropping or truncating tables that exceed 8K tokens). Also, in the graph constructor, missing, noisy, or ambiguous schema information may weaken the structural graph and reduce the benefit of message passing.
TQA benchmarks provide only the tables required to answer each question. In the more general setting, where an entire database is given as input, the current implementation of GRAB assumes that a relevant subset of tables has already been retrieved. Standard pipelines handle this with retrieval modules that first extract such a subset Zhang et al. ([2025a](https://arxiv.org/html/2606.28916#bib.bib35)); Shen et al. ([2024](https://arxiv.org/html/2606.28916#bib.bib29)). A natural next step is to jointly learn retrieval, schema linking, and graph construction so that structural representations remain robust under incomplete or noisy database metadata.
## 10 Use of AI assistants
When writing this paper, we used ChatGPT to improve the flow of writing and the vocabulary of the initial drafts we manually wrote\. Each suggestion has been manually validated by the authors\.
## Acknowledgment
This work was funded by the French government, through the 3IA Côte d’Azur Investments in the IA\-cluster project managed by the National Research Agency \(ANR\-23\-IACL\-0001\)\. This project was provided with resources by GENCI at IDRIS, thanks to grants 2025\-AD010616649 and 2025\-AD010616180\.
## References
- Ahmad et al\. \(2026\)Mohammad S\. Ahmad, Zan A\. Naeem, Michaël Aupetit, Ahmed Elmagarmid, Mohamed Eltabakh, Xiaosong Ma, Mourad Ouzzani, Chaoyi Ruan, and Hani Al\-Sayeh\. 2026\.[Hct\-qa: A benchmark for question answering on human\-centric tables](https://arxiv.org/abs/2504.20047)\.*Preprint*, arXiv:2504\.20047\.
- Alayrac et al\. \(2022\)Jean\-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, and 8 others\. 2022\.[Flamingo: a visual language model for few\-shot learning](https://openreview.net/forum?id=EbMuimAbPbs)\.In*Proceedings of the 36th International Conference on Neural Information Processing Systems*, NeurIPS ’22, Red Hook, NY, USA\. Curran Associates Inc\.
- Badaro et al\. \(2023\)Gilbert Badaro, Mohammed Saeed, and Paolo Papotti\. 2023\.[Transformers for tabular data representation: A survey of models and applications](https://doi.org/10.1162/tacl_a_00544)\.*Transactions of the Association for Computational Linguistics*, 11:227–249\.
- Barceló et al\. \(2020\)Pablo Barceló, Egor V\. Kostylev, Mikael Monet, Jorge Pérez, Juan Reutter, and Juan Pablo Silva\. 2020\.[The logical expressiveness of graph neural networks](https://openreview.net/forum?id=r1lZ7AEKvB)\.In*International Conference on Learning Representations*\.
- Chang et al\. \(2025\)Shuaichen Chang, Madelon Hulsebos, Qian Liu, Wenhu Chen, and Huan Sun, editors\. 2025\.[*Proceedings of the 4th Table Representation Learning Workshop*](https://doi.org/10.18653/v1/2025.trl-1.0)\. Association for Computational Linguistics, Vienna, Austria\.
- Chen et al\. \(2024\)Peter Baile Chen, Yi Zhang, and Dan Roth\. 2024\.[Is table retrieval a solved problem? exploring join\-aware multi\-table retrieval](https://doi.org/10.18653/v1/2024.acl-long.148)\.In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 2687–2699, Bangkok, Thailand\. Association for Computational Linguistics\.
- Chen and Guestrin \(2016\)Tianqi Chen and Carlos Guestrin\. 2016\.[Xgboost: A scalable tree boosting system](https://doi.org/10.1145/2939672.2939785)\.In*Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, KDD ’16, page 785–794, New York, NY, USA\. Association for Computing Machinery\.
- Cheng et al\. \(2022\)Zhoujun Cheng, Haoyu Dong, Zhiruo Wang, Ran Jia, Jiaqi Guo, Yan Gao, Shi Han, Jian\-Guang Lou, and Dongmei Zhang\. 2022\.[HiTab: A hierarchical table dataset for question answering and natural language generation](https://doi.org/10.18653/v1/2022.acl-long.78)\.In*Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 1094–1110, Dublin, Ireland\. Association for Computational Linguistics\.
- Contalbo et al\. \(2025\)Michele Luca Contalbo, Sara Pederzoli, Francesco Del Buono, Venturelli Valeria, Francesco Guerra, and Matteo Paganelli\. 2025\.[GRI\-QA: a comprehensive benchmark for table question answering over environmental data](https://doi.org/10.18653/v1/2025.findings-acl.814)\.In*Findings of the Association for Computational Linguistics: ACL 2025*, pages 15764–15779, Vienna, Austria\. Association for Computational Linguistics\.
- Cucumides and Geerts \(2026\)Tamara Cucumides and Floris Geerts\. 2026\.[Grables: Tabular learning beyond independent rows](https://arxiv.org/abs/2602.03945)\.*Preprint*, arXiv:2602\.03945\.
- Dai et al\. \(2023\)Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi\. 2023\.[InstructBLIP: Towards general\-purpose vision\-language models with instruction tuning](https://openreview.net/forum?id=vvoWPYqZJA)\.In*Thirty\-seventh Conference on Neural Information Processing Systems*\.
- Gilmer et al\. \(2017\)Justin Gilmer, Samuel S\. Schoenholz, Patrick F\. Riley, Oriol Vinyals, and George E\. Dahl\. 2017\.[Neural message passing for quantum chemistry](https://proceedings.mlr.press/v70/gilmer17a/gilmer17a.pdf)\.In*Proceedings of the 34th International Conference on Machine Learning \- Volume 70*, ICML’17, page 1263–1272\. JMLR\.org\.
- Herzig et al\. \(2020\)Jonathan Herzig, Pawel Krzysztof Nowak, Thomas Müller, Francesco Piccinno, and Julian Eisenschlos\. 2020\.[TaPas: Weakly supervised table parsing via pre\-training](https://doi.org/10.18653/v1/2020.acl-main.398)\.In*Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics*, pages 4320–4333, Online\. Association for Computational Linguistics\.
- Hollmann et al\. \(2025\)Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter\. 2025\.[Accurate predictions on small data with a tabular foundation model](https://doi.org/10.1038/S41586-024-08328-6)\.*Nat\.*, 637\(8044\):319–326\.
- Hu et al\. \(2022\)Edward J\. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen\-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen\. 2022\.[Lora: Low\-rank adaptation of large language models](https://openreview.net/forum?id=nZeVKeeFYf9)\.In*The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25\-29, 2022*\. OpenReview\.net\.
- Huang et al\. \(2024\)Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, and Min Lin\. 2024\.[Lorahub: Efficient cross\-task generalization via dynamic loRA composition](https://openreview.net/forum?id=TrloAXEJ2B)\.In*First Conference on Language Modeling*\.
- Li et al\. \(2023\)Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi\. 2023\.[Blip\-2: bootstrapping language\-image pre\-training with frozen image encoders and large language models](https://proceedings.mlr.press/v202/li23q.html)\.In*Proceedings of the 40th International Conference on Machine Learning*, ICML’23\. JMLR\.org\.
- Li et al\. \(2025\)Liyao Li, Chao Ye, Wentao Ye, Yifei Sun, Zhe Jiang, Haobo Wang, Jiaming Tian, Yiming Zhang, NINGTAO WANG, Xing Fu, Gang Chen, and Junbo Zhao\. 2025\.[Table as a modality for large language models](https://openreview.net/forum?id=kurEZdWU9G)\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*\.
- Li and Liang \(2021\)Xiang Lisa Li and Percy Liang\. 2021\.[Prefix\-tuning: Optimizing continuous prompts for generation](https://doi.org/10.18653/v1/2021.acl-long.353)\.In*Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\)*, pages 4582–4597, Online\. Association for Computational Linguistics\.
- Liu et al\. \(2022a\)Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, and Jian\-Guang Lou\. 2022a\.[TAPEX: Table pre\-training via learning a neural SQL executor](https://openreview.net/forum?id=O50443AsCP)\.In*International Conference on Learning Representations*\.
- Liu et al\. \(2022b\)Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang\. 2022b\.[P\-tuning: Prompt tuning can be comparable to fine\-tuning across scales and tasks](https://doi.org/10.18653/v1/2022.acl-short.8)\.In*Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\)*, pages 61–68, Dublin, Ireland\. Association for Computational Linguistics\.
- Lu et al\. \(2023\)Pan Lu, Liang Qiu, Kai\-Wei Chang, Ying Nian Wu, Song\-Chun Zhu, Tanmay Rajpurohit, Peter Clark, and Ashwin Kalyan\. 2023\.[Dynamic prompt learning via policy gradient for semi\-structured mathematical reasoning](https://openreview.net/forum?id=DHyHRBwJUTN)\.In*The Eleventh International Conference on Learning Representations*\.
- Morris et al\. \(2019\)Christopher Morris, Martin Ritzert, Matthias Fey, William L\. Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe\. 2019\.[Weisfeiler and Leman go neural: Higher\-order graph neural networks](https://doi.org/10.1609/aaai.v33i01.33014602)\.In*AAAI*\.
- OpenAI \(2026\)OpenAI\. 2026\.Introducing gpt\-5\.4 mini and nano\.[https://openai\.com/index/introducing\-gpt\-5\-4\-mini\-and\-nano/](https://openai.com/index/introducing-gpt-5-4-mini-and-nano/)\.Accessed: 2026\-05\-26\.
- Pal et al\. \(2023\)Vaishali Pal, Andrew Yates, Evangelos Kanoulas, and Maarten de Rijke\. 2023\.[MultiTabQA: Generating tabular answers for multi\-table question answering](https://doi.org/10.18653/v1/2023.acl-long.348)\.In*Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 6322–6334, Toronto, Canada\. Association for Computational Linguistics\.
- Pasupat and Liang \(2015\)Panupong Pasupat and Percy Liang\. 2015\.[Compositional semantic parsing on semi\-structured tables](https://aclanthology.org/P15-1142/)\.In*Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\)*, pages 1470–1480\.
- Qiu et al\. \(2024\)Zipeng Qiu, You Peng, Guangxin He, Binhang Yuan, and Chen Wang\. 2024\.[Tqa\-bench: Evaluating llms for multi\-table question answering with scalable context and symbolic extension](https://arxiv.org/abs/2411.19504)\.*Preprint*, arXiv:2411\.19504\.
- Robinson et al\. \(2024\)Joshua Robinson, Rishabh Ranjan, Weihua Hu, Kexin Huang, Jiaqi Han, Alejandro Dobles, Matthias Fey, Jan Eric Lenssen, Yiwen Yuan, Zecheng Zhang, Xinwei He, and Jure Leskovec\. 2024\.[Relbench: A benchmark for deep learning on relational databases](https://openreview.net/forum?id=WEFxOm3Aez)\.In*The Thirty\-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track*\.
- Shen et al\. \(2024\)Zhili Shen, Pavlos Vougiouklis, Chenxin Diao, Kaustubh Vyas, Yuanyi Ji, and Jeff Z\. Pan\. 2024\.[Improving retrieval\-augmented text\-to\-SQL with AST\-based ranking and schema pruning](https://doi.org/10.18653/v1/2024.emnlp-main.449)\.In*Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 7865–7879, Miami, Florida, USA\. Association for Computational Linguistics\.
- Wu et al\. \(2025\)Jian Wu, Linyi Yang, Dongyuan Li, Yuliang Ji, Manabu Okumura, and Yue Zhang\. 2025\.[Mmqa: Evaluating llms with multi\-table multi\-hop complex questions](https://proceedings.iclr.cc/paper_files/paper/2025/file/794a425a2e47e05d29d30f79b79a692d-Paper-Conference.pdf)\.In*International Conference on Learning Representations*, volume 2025, pages 48626–48643\.
- Xie et al\. \(2022\)Tianbao Xie, Chen Henry Wu, Peng Shi, Ruiqi Zhong, Torsten Scholak, Michihiro Yasunaga, Chien\-Sheng Wu, Ming Zhong, Pengcheng Yin, Sida I\. Wang, Victor Zhong, Bailin Wang, Chengzu Li, Connor Boyle, Ansong Ni, Ziyu Yao, Dragomir Radev, Caiming Xiong, Lingpeng Kong, and 4 others\. 2022\.[UnifiedSKG: Unifying and multi\-tasking structured knowledge grounding with text\-to\-text language models](https://doi.org/10.18653/v1/2022.emnlp-main.39)\.In*Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing*, pages 602–631, Abu Dhabi, United Arab Emirates\. Association for Computational Linguistics\.
- Xu et al\. \(2019\)Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka\. 2019\.[How powerful are graph neural networks?](https://openreview.net/forum?id=ryGs6iA5Km)In*International Conference on Learning Representations*\.
- Yang et al\. \(2025\)An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others\. 2025\.[Qwen3 technical report](https://arxiv.org/abs/2505.09388)\.*arXiv preprint arXiv:2505\.09388*\.
- Yang et al\. \(2022\)Jingfeng Yang, Aditya Gupta, Shyam Upadhyay, Luheng He, Rahul Goel, and Shachi Paul\. 2022\.[TableFormer: Robust transformer modeling for table\-text encoding](https://doi.org/10.18653/v1/2022.acl-long.40)\.In*Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 528–537, Dublin, Ireland\. Association for Computational Linguistics\.
- Zhang et al\. \(2025a\)Chi Zhang, Meihui Zhang, Yuxin Yang, Tao Chen, and Zhaojing Luo\. 2025a\.[Aixelask: A stepwise\-guided retrieval and reasoning framework for large table qa](https://doi.org/10.1145/3769831)\.*Proc\. ACM Manag\. Data*, 3\(6\)\.
- Zhang et al\. \(2024\)Tianshu Zhang, Xiang Yue, Yifei Li, and Huan Sun\. 2024\.[TableLlama: Towards open large generalist models for tables](https://doi.org/10.18653/v1/2024.naacl-long.335)\.In*Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\)*, pages 6024–6044, Mexico City, Mexico\. Association for Computational Linguistics\.
- Zhang et al\. \(2025b\)Xuanliang Zhang, Dingzirui Wang, Baoxin Wang, Longxu Dou, Xinyuan Lu, Keyan Xu, Dayong Wu, and Qingfu Zhu\. 2025b\.[SCITAT: A question answering benchmark for scientific tables and text covering diverse reasoning types](https://doi.org/10.18653/v1/2025.findings-acl.199)\.In*Findings of the Association for Computational Linguistics: ACL 2025*, pages 3859–3881, Vienna, Austria\. Association for Computational Linguistics\.
- Zhao et al\. \(2022\)Yilun Zhao, Yunxiang Li, Chenying Li, and Rui Zhang\. 2022\.[MultiHiertt: Numerical reasoning over multi hierarchical tabular and textual data](https://aclanthology.org/2022.acl-long.454)\.In*Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 6588–6600, Dublin, Ireland\. Association for Computational Linguistics\.
- Zhong et al\. \(2017\)Victor Zhong, Caiming Xiong, and Richard Socher\. 2017\.[Seq2sql: Generating structured queries from natural language using reinforcement learning](https://arxiv.org/abs/1709.00103)\.*Preprint*, arXiv:1709\.00103\.
## Appendix A Beyond Table Question Answering
The main focus of our experiments is on table question answering, where the model must extract or reason over tabular content to produce a natural-language answer. We further hypothesize that the representations learned by GRAB are not specific to QA, but can transfer to structurally different tabular tasks. To test this hypothesis, we evaluate on two additional benchmarks: TabFact, a fact verification benchmark in which the model must classify a statement as entailed or refuted by a given table, and Spider, a text-to-SQL benchmark in which the model must generate a structured query from a natural-language question and a database schema. These tasks differ from QA both in output format and in the type of reasoning required, providing a broader test of the encoder’s ability to capture table structure in a task-agnostic way. In Table [5](https://arxiv.org/html/2606.28916#A1.T5), we report accuracy for TabFact and execution accuracy for Spider. The results confirm that GRAB is able to extract task-agnostic information that can be used for tasks beyond question answering.
Table 5: Comparison of Base, Soft Prompt, and GRAB on TabFact and Spider with Qwen3-4B-Base.
## Appendix B Ablation
### B.1 GNN Architecture and Latent Bridge Ablations
In the first stage, we fix the GNN at 2 layers and the resampler at 1 head and 8 layers, and sweep over latent count $K\in\{1,2,4,8,16,32,64,128,256\}$, with results reported in Table [6](https://arxiv.org/html/2606.28916#A2.T6).
We find that $K=32$ and $K=64$ perform best while larger values bring no further gain, and very small values ($K\leq 2$) degrade performance noticeably. Based on this, we carry forward $K\in\{4,32,64\}$ as representative small, medium, and large capacity settings.
Table 6: Stage 1 ablation: latent count sweep with GNN fixed at 2 layers and resampler fixed at 1 head and 8 layers. Bold denotes the best result per column.
In the second stage, we fix the resampler and jointly sweep GNN depth $L\in\{1,2,4,8,16\}$ against the three selected latent counts, for a total of 15 configurations. Results are reported in Table [7](https://arxiv.org/html/2606.28916#A2.T7).
Table 7: Joint ablation of GNN depth and latent count $K$ (average over StructQA, HiTab, MultiHierTT, and MMQA). The resampler is fixed at 1 head and 8 layers throughout. Bold denotes the best overall configuration.
We further ablate the design of the projection head that bridges the GNN encoder to the LLM input space. Specifically, we compare three variants: a single linear layer (Linear), a two-layer MLP with a non-linearity (MLP), and a deeper MLP with additional hidden layers (Deep MLP). All other components are kept fixed.
As shown in Table [8](https://arxiv.org/html/2606.28916#A2.T8), the linear projection consistently matches or outperforms the non-linear alternatives, suggesting that the resampler already provides sufficient expressivity and that additional capacity in the projection head does not yield further gains, and can even hurt performance, as seen on MMQA and MultiHierTT.
Table 8: Ablation comparison of projection heads across StructQA, HiTab, MultiHierTT, and MMQA.
### B.2 Impact of Conditioning and Serialization
To assess the contribution of key components of GRAB, we conduct two ablation studies. In the first, we remove the question conditioning from the resampler, meaning the table embeddings are produced independently of the input question rather than attending to it during the cross-attention aggregation. In the second, we remove the serialized table from the LLM prompt entirely, relying solely on the projected GNN embeddings to convey table content to the language model. These two ablations isolate, respectively, the role of question-aware table encoding and the role of the text-based table representation as a complementary signal to the graph embeddings. Results are reported in Table [9](https://arxiv.org/html/2606.28916#A2.T9).
Table 9: Ablation comparison across StructQA, HiTab, MultiHierTT, and MMQA.
### B.3 Robustness
To verify that the reported results are not an artifact of a particular random initialization, we retrain the best-performing configuration with three different random seeds and report mean and standard deviation across runs. A small variance would confirm that the model converges reliably and that comparisons with baselines are meaningful beyond a single lucky initialization. Results are reported in Table [10](https://arxiv.org/html/2606.28916#A2.T10).
Table 10: Accuracy results across random seeds.
## Appendix C Constructor Details
For $n\in\mathbb{N}$, $n\neq 0$, we let $[n]=\{1,2,\ldots,n\}$.
The main text already provided some details on the GRAB graph constructor. Here, we give the remaining details in [Section C.1](https://arxiv.org/html/2606.28916#A3.SS1). For completeness, [Section C.2](https://arxiv.org/html/2606.28916#A3.SS2) includes the corresponding constructor for the TAMO/HyTrel-style baseline used in our comparisons.
### C.1 GRAB: Graph Constructor
The graph constructor $\gamma$ is a fixed, deterministic processing map. It exposes row–column–value incidence and cross-table join structure before any neural message passing occurs. Let $\mathcal{F}$ denote the foreign-key metadata in $\mathcal{M}$:
$$\mathcal{F}=\bigl\{\bigl((i,c),(j,d)\bigr)\mid\text{column $c$ of $T_{i}$ joins column $d$ of $T_{j}$}\bigr\}.$$
The constructor maps the relational input to a tripartite graph $\gamma(\mathcal{T},\mathcal{F})=\bigl(\mathcal{G},H^{(0)}\bigr)$, with $\mathcal{G}=\bigl(\mathcal{V}_{R}\cup\mathcal{V}_{C}\cup\mathcal{V}_{V},\,\mathcal{E}_{RV}\cup\mathcal{E}_{CV}\bigr)$, where $\mathcal{V}_{R}$ are row nodes, $\mathcal{V}_{C}$ are column-class nodes, $\mathcal{V}_{V}$ are value-group nodes, $\mathcal{E}_{RV}$ are row–value incidence edges, and $\mathcal{E}_{CV}$ are column–value incidence edges. Furthermore, $H^{(0)}=\{h^{(0)}_{v}\}_{v\in\mathcal{V}_{R}\cup\mathcal{V}_{C}\cup\mathcal{V}_{V}}$ are the initial node features. In more detail:
Column classes. Let $\mathcal{C}_{\mathrm{occ}}=\{(i,c)\mid i\in[n],\,c\in C_{i}\}$ be the set of column occurrences across all tables. The relation $\sim_{\mathcal{F}}$ is the smallest equivalence relation on $\mathcal{C}_{\mathrm{occ}}$ containing all foreign-key pairs in $\mathcal{F}$. We write $[(i,c)]_{\mathcal{F}}$ for the equivalence class of column occurrence $(i,c)$, and use $\alpha$ for a generic class. When $\mathcal{F}=\emptyset$, the relation is the identity and no columns are merged. We denote the set of column classes by $\mathcal{C}_{\mathcal{F}}=\mathcal{C}_{\mathrm{occ}}/\!\sim_{\mathcal{F}}$. For example, in Figure [1](https://arxiv.org/html/2606.28916#S1.F1), the city columns in $T_{1}$ and $T_{2}$ map to a single node in $\mathcal{G}$. Each class $\alpha$ inherits a single type $\tau_{\alpha}\in\{\mathrm{cat},\mathrm{num},\mathrm{text}\}$.
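One way to compute the smallest equivalence relation containing the foreign-key pairs is a union-find pass over the column occurrences. The sketch below is illustrative only: the function name `column_classes` and the example tables are hypothetical, and the source does not state which algorithm the released implementation uses.

```python
def column_classes(occurrences, foreign_keys):
    """Partition column occurrences (i, c) into FK-induced equivalence classes.

    occurrences: iterable of (table_index, column_name) pairs.
    foreign_keys: iterable of ((i, c), (j, d)) pairs, as in the set F.
    Returns a set of frozensets, one per column class.
    """
    occurrences = list(occurrences)
    parent = {occ: occ for occ in occurrences}

    def find(x):
        # Follow parent pointers to the class representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right in foreign_keys:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two classes

    classes = {}
    for occ in occurrences:
        classes.setdefault(find(occ), set()).add(occ)
    return {frozenset(members) for members in classes.values()}


# Hypothetical schema: the city columns of tables 1 and 2 are joined,
# and table 2's city column is also joined with table 3's town column.
occ = [(1, "name"), (1, "city"), (2, "city"), (2, "population"), (3, "town")]
fks = [((1, "city"), (2, "city")), ((2, "city"), (3, "town"))]
merged = column_classes(occ, fks)
```

Because the relation is closed under transitivity, the three joined columns end up in one class even though no foreign key links table 1 directly to table 3. With an empty foreign-key set, every occurrence stays in its own singleton class.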
Nodes. Row nodes are indexed globally across tables, column nodes correspond to FK-induced column classes, and value nodes correspond to canonical value groups within a column class: $\mathcal{V}_{R}=\bigl\{\rho_{i,r}\mid i\in[n],\,r\in R_{i}\bigr\}$, $\mathcal{V}_{C}=\bigl\{c_{\alpha}\mid\alpha\in\mathcal{C}_{\mathcal{F}}\bigr\}$, and $\mathcal{V}_{V}=\bigl\{v_{\alpha,g}\mid\alpha\in\mathcal{C}_{\mathcal{F}},\,g\in\mathrm{Im}(g_{\alpha})\bigr\}$. Here, for each column class $\alpha$, the grouping map $g_{\alpha}$ standardizes cell contents:
$$g_{\alpha}(x)=\begin{cases}\mathrm{id}_{\alpha}\bigl(\mathrm{norm}(x)\bigr),&\tau_{\alpha}\in\{\mathrm{cat},\mathrm{text}\},\\[2.0pt] q_{\alpha}(x),&\tau_{\alpha}=\mathrm{num},\end{cases}$$
where $\mathrm{norm}(\cdot)$ normalizes the cell string and $\mathrm{id}_{\alpha}(\cdot)$ assigns a unique identifier to each distinct normalized value within $\alpha$, while $q_{\alpha}(\cdot)$ maps values to quantile-based buckets fitted on all numeric values in the class. A value node $v_{\alpha,g}$ therefore represents a repeated categorical/textual value or a numeric range, not an individual cell occurrence. Value groups are not replacements for exact numeric content, which remains available in the textual prompt.
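A grouping map of this kind can be sketched in a few lines. The normalization rule (whitespace collapsing and lowercasing), the number of buckets, and the helper names `fit_categorical` and `fit_numeric` are assumptions chosen for illustration; the text does not specify them.

```python
import bisect
import statistics


def norm(cell):
    """Assumed normalization: collapse whitespace and lowercase."""
    return " ".join(str(cell).split()).lower()


def fit_categorical(values):
    """Assign one integer id per distinct normalized value in the class."""
    ids = {}
    for v in values:
        ids.setdefault(norm(v), len(ids))
    return lambda x: ids[norm(x)]


def fit_numeric(values, n_buckets=4):
    """Fit quantile cut points on all numeric values in the class."""
    cuts = statistics.quantiles(values, n=n_buckets)  # n_buckets - 1 cut points
    return lambda x: bisect.bisect_right(cuts, x)


# Toy values, purely illustrative.
g_city = fit_categorical(["Paris", " paris ", "Antwerp"])
g_age = fit_numeric([21, 25, 30, 34, 40, 52, 58, 63], n_buckets=4)
```

In this toy setting `g_age(21)` and `g_age(25)` land in the same bucket, which illustrates why a value node stands for a numeric range: the bucket id alone does not determine the exact number.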
Edges. For each cell $x^{(i)}_{r,c}$ (value in column $c$ in row $r$ in $R_{i}$), let $\alpha=[(i,c)]_{\mathcal{F}}$ and $g=g_{\alpha}(x^{(i)}_{r,c})$. The constructor adds the row–value edge $\{\rho_{i,r},v_{\alpha,g}\}$ to $\mathcal{E}_{RV}$ and the column–value edge $\{c_{\alpha},v_{\alpha,g}\}$ to $\mathcal{E}_{CV}$. Thus, repeated values become shared graph neighborhoods, and foreign-key joins become explicit connectivity through shared column classes and value groups.
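The edge rule can be illustrated with a small sketch that treats every column as categorical. The function `build_edges`, the tuple encoding of nodes, the `class_of` mapping, and the toy tables are all hypothetical; the inline normalization is only a stand-in for the grouping map $g_{\alpha}$.

```python
def build_edges(tables, class_of):
    """tables: {table: list of row dicts}; class_of: {(table, column): class label}."""
    e_rv, e_cv = set(), set()
    for t, rows in tables.items():
        for r, row in enumerate(rows):
            for col, cell in row.items():
                alpha = class_of[(t, col)]
                g = " ".join(str(cell).split()).lower()  # stand-in for g_alpha
                value_node = ("val", alpha, g)
                e_rv.add((("row", t, r), value_node))   # row-value incidence
                e_cv.add((("col", alpha), value_node))  # column-value incidence
    return e_rv, e_cv


# Toy tables; the two city columns share one FK-induced class.
tables = {
    "T1": [{"name": "Ann", "city": "Paris"}, {"name": "Bob", "city": "Antwerp"}],
    "T2": [{"city": "Paris", "country": "France"}],
}
class_of = {
    ("T1", "name"): "name",
    ("T1", "city"): "city",
    ("T2", "city"): "city",
    ("T2", "country"): "country",
}
e_rv, e_cv = build_edges(tables, class_of)
paris = ("val", "city", "paris")
rows_at_paris = {row for row, v in e_rv if v == paris}
```

Here `rows_at_paris` contains one row from each table: the join on the city column is visible as two rows sharing a value node, without any string matching at query time.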
Initial features. Let $\phi_{\mathrm{tok}}$ be a fixed token-level text encoder. For any string $s\in\Sigma^{*}$, define $\phi(s)$ as $\mathrm{MeanPool}\bigl(\phi_{\mathrm{tok}}(s)\bigr)$.
The graph's hidden feature dimension is inherited from the embedding model used to initialize the nodes. In our implementation, row and column nodes are initialized from the same text embedding model, while value nodes are constructed directly in the same hidden dimension. Therefore, all node types already lie in a common embedding space and no additional projection is applied at initialization. We write $\|$ for string concatenation.
For a row node $\rho_{i,r}$, we encode a row-level string formed by concatenating header–value pairs:
$$s_{i,r}=\big\|_{c\in C_{i}}\bigl[\,h^{(i)}_{c}:x^{(i)}_{r,c}\,\bigr],\qquad h^{(0)}_{\rho_{i,r}}=\phi(s_{i,r}).$$
For a column node $c_{\alpha}$, we initialize from the header embedding of the corresponding column occurrence. If $\alpha$ contains multiple FK-linked column occurrences, we average their header embeddings: $h^{(0)}_{c_{\alpha}}=\frac{1}{|\alpha|}\sum_{(i,c)\in\alpha}\phi\bigl(h^{(i)}_{c}\bigr)$.
For a value node $v_{\alpha,g}$, we do not initialize from the cell-value text. Instead, we use a deterministic embedding of the value-group identifier, augmented with the column-type information, constructed in the graph hidden dimension. Thus, value nodes primarily act as structural anchors: they indicate where identical categorical/textual values or discretized numeric buckets occur, while exact values are available to the LLM through $S_{\text{table}}$.
### C.2 TAMO: Graph Constructor
We describe the TAMO/HyTrel structural encoder for a single flat table. Let $T$ be a table with row set $R$, column set $C$, and cell values $x_{r,c}$ in $V$, for $r\in R$ and $c\in C$. Each column carries a header string $h_{c}$ and a declared type $\tau_{c}\in\{\mathrm{cat},\mathrm{num},\mathrm{text}\}$.
TAMO/HyTrel constructs a hypergraph in which cell occurrences are primitive nodes, while rows, columns, and the whole table are represented as hyperedges. Equivalently, we represent this hypergraph by its typed incidence graph. The constructor returns $\gamma_{\mathrm{HT}}(T)=\bigl(\mathcal{G}_{\mathrm{HT}},H^{(0)}_{\mathrm{HT}}\bigr)$, with
$$\mathcal{G}_{\mathrm{HT}}=\bigl(\mathcal{V}_{X}\cup\mathcal{V}_{R}\cup\mathcal{V}_{C}\cup\mathcal{V}_{T},\ \mathcal{E}_{XR}\cup\mathcal{E}_{XC}\cup\mathcal{E}_{XT}\bigr),$$
where $\mathcal{V}_{X}$ are cell-occurrence nodes, $\mathcal{V}_{R}$ are row-hyperedge nodes, $\mathcal{V}_{C}$ are column-hyperedge nodes, $\mathcal{V}_{T}$ contains the table-hyperedge node, and $\mathcal{E}_{XR}$, $\mathcal{E}_{XC}$, and $\mathcal{E}_{XT}$ encode incidences between cell nodes and row, column, and table hyperedge nodes. The initial node features are
$$H^{(0)}_{\mathrm{HT}}=\{h^{(0)}_{v}\}_{v\in\mathcal{V}_{X}\cup\mathcal{V}_{R}\cup\mathcal{V}_{C}\cup\mathcal{V}_{T}}.$$
#### Cell, row-hyperedge, column-hyperedge, and table-hyperedge nodes.
For every cell occurrence $x_{r,c}$ we introduce a cell node; for every row we introduce a row-hyperedge node; for every column we introduce a column-hyperedge node; and for the whole table we introduce a table-hyperedge node:
$$\begin{aligned}\mathcal{V}_{X}&=\bigl\{u_{r,c}\mid r\in R,\ c\in C\bigr\},&\mathcal{V}_{R}&=\bigl\{\rho_{r}\mid r\in R\bigr\},\\ \mathcal{V}_{C}&=\bigl\{\kappa_{c}\mid c\in C\bigr\},&\mathcal{V}_{T}&=\bigl\{\theta_{T}\bigr\}.\end{aligned}$$
Here $u_{r,c}$ denotes the node corresponding to the cell occurrence whose surface value is $x_{r,c}$. Thus, equal cell values in different positions still give distinct cell nodes. The node $\theta_{T}$ represents the whole-table hyperedge and provides a global aggregation channel.
#### Incidence edges.
The incidence edges encode membership of cells in rows, columns, and the whole table. For each cell occurrence $u_{r,c}$, we add $(u_{r,c},\rho_{r})$ to $\mathcal{E}_{XR}$, $(u_{r,c},\kappa_{c})$ to $\mathcal{E}_{XC}$, and $(u_{r,c},\theta_{T})$ to $\mathcal{E}_{XT}$:
$$\begin{aligned}\mathcal{E}_{XR}&=\bigl\{(u_{r,c},\rho_{r})\mid r\in R,\ c\in C\bigr\},\\ \mathcal{E}_{XC}&=\bigl\{(u_{r,c},\kappa_{c})\mid r\in R,\ c\in C\bigr\},\\ \mathcal{E}_{XT}&=\bigl\{(u_{r,c},\theta_{T})\mid r\in R,\ c\in C\bigr\}.\end{aligned}$$
Equivalently, each cell node belongs to exactly one row hyperedge, one column hyperedge, and the table hyperedge.
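The three incidence sets can be enumerated directly from the row and column index sets. This is a minimal sketch with a hypothetical tuple encoding of nodes and an assumed function name `hytrel_incidence`; it only mirrors the set definitions and does not reproduce any TAMO/HyTrel code.

```python
from itertools import product


def hytrel_incidence(rows, cols):
    """Build cell nodes and their row, column, and table incidence edges."""
    cells = [("cell", r, c) for r, c in product(rows, cols)]
    e_xr = {(u, ("row", u[1])) for u in cells}   # cell -> its row hyperedge
    e_xc = {(u, ("col", u[2])) for u in cells}   # cell -> its column hyperedge
    e_xt = {(u, ("table",)) for u in cells}      # cell -> the table hyperedge
    return cells, e_xr, e_xc, e_xt


# Toy table shape: 3 rows, 2 columns.
cells, e_xr, e_xc, e_xt = hytrel_incidence(rows=[0, 1, 2], cols=["city", "country"])
```

Each of the three edge sets has exactly $|R|\cdot|C|$ elements, one per cell occurrence, which contrasts with a value-group construction where equal values are shared between rows.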
#### Initial features.
Let $\phi_{\mathrm{tok}}$ be a fixed token-level text encoder. For any string $s\in\Sigma^{*}$, define its pooled representation by
$$\phi(s)=\mathrm{MeanPool}\bigl(\phi_{\mathrm{tok}}(s)\bigr).$$
For a cell node $u_{r,c}$, the cell-value text is used:
$$s^{X}_{r,c}=x_{r,c},\qquad h^{(0)}_{u_{r,c}}=\phi(s^{X}_{r,c}).$$
Thus, the value of the cell is encoded as the content of a cell-occurrence node. For a column-hyperedge node $\kappa_{c}$, one initializes from the column header:
$$h^{(0)}_{\kappa_{c}}=\phi(h_{c}).$$
For a row-hyperedge node $\rho_{r}$, a learned or random initialization is used:
$$h^{(0)}_{\rho_{r}}=u^{R}_{r},$$
where $u^{R}_{r}\in\mathbb{R}^{d_{h}}$ is an initialized row-hyperedge embedding. For the table-hyperedge node $\theta_{T}$, one initializes from a table caption, title, or identifier when available:
$$h^{(0)}_{\theta_{T}}=\phi(s^{T}),$$
where $s^{T}$ denotes the available textual description of the table. If no such description is available, a learned or random table-hyperedge initialization can be used instead:
$$h^{(0)}_{\theta_{T}}=u^{T},$$
where $u^{T}\in\mathbb{R}^{d_{h}}$ is an initialized table-hyperedge embedding.
## Appendix D Formal View of the Latent Bridge
We next formalize the discussion in [Section 5](https://arxiv.org/html/2606.28916#S5) about the role and limits of the latent bridge. As we will see, the bridge does not increase the expressive power of the message-passing encoder. Rather, it provides a finite, question-conditioned readout over the graph features already made available by the constructor and encoder.
The relevant starting point is that the theoretical limits of message-passing GNNs are well studied. Their expressive power is closely related to the Weisfeiler–Leman hierarchy and to corresponding fragments of first-order logic Gilmer et al. ([2017](https://arxiv.org/html/2606.28916#bib.bib12)); Morris et al. ([2019](https://arxiv.org/html/2606.28916#bib.bib23)); Xu et al. ([2019](https://arxiv.org/html/2606.28916#bib.bib32)); Barceló et al. ([2020](https://arxiv.org/html/2606.28916#bib.bib4)). More recently, the Grables framework Cucumides and Geerts ([2026](https://arxiv.org/html/2606.28916#bib.bib10)) extended this perspective to tabular learning, showing that row-local methods fail on "extension-sensitive" queries: tasks whose answers depend on cross-row structure, such as counting, overlaps, or joins. This motivates making the graph construction explicit, as we do here, so that these relations are exposed to message passing rather than left implicit in a serialized table.
### D.1 Logical View of the Constructed Graph
Here we use the connection between message passing and graded modal logic Barceló et al. ([2020](https://arxiv.org/html/2606.28916#bib.bib4)). Our graph constructor maps the relational input to the tripartite structure
$$\mathcal{G}=\gamma(\mathcal{T},\mathcal{F})=\bigl(\mathcal{V}_{R}\cup\mathcal{V}_{C}\cup\mathcal{V}_{V},\ \mathcal{E}_{RV}\cup\mathcal{E}_{CV}\bigr),$$
where row nodes are denoted by $\rho_{i,r}$ (with $i$ indicating the source table), column-class nodes by $c_{\alpha}$, and value-group nodes by $v_{\alpha,g}$. We view $\mathcal{G}$ as a finite relational structure with unary predicates
$$\mathrm{Row}(x),\qquad\mathrm{Col}(x),\qquad\mathrm{Val}(x),$$
type-specific refinements such as $\mathrm{Row}_{i}(x)$ (identifying rows belonging to table $i$) and $\mathrm{Val}_{\alpha}(x)$ (identifying values in column class $\alpha$), and binary incidence relations $\mathcal{E}_{RV}(r,v)$ and $\mathcal{E}_{CV}(c,v)$.
Let $\mathrm{GML}_{L}$ denote graded modal logic of modal depth at most $L$ over this tripartite vocabulary. Its characteristic modality is
$$\exists^{\geq N}y\,\bigl(\mathcal{E}_{\rho}(y,x)\wedge\varphi(y)\bigr),$$
where $\mathcal{E}_{\rho}$ ranges over the typed incidence relations and their inverses. This modality identifies nodes $x$ that have at least $N$ neighbors $y$ satisfying $\varphi$. Under standard expressivity assumptions on typed multiset aggregation, an $L$-layer message-passing encoder can represent depth-$L$ GML facts: one message-passing layer corresponds to one step of graded neighborhood inspection. Thus, the graph encoder is best understood as a local logical feature extractor over the tripartite table graph.
The constructor matters because it makes useful table relations local. For example, repeated values in a column class are detected by the value-node formula
$$\mathrm{Dup}_{\alpha}(v):=\mathrm{Val}_{\alpha}(v)\wedge\exists^{\geq 2}r\,\bigl(\mathcal{E}_{RV}(r,v)\wedge\mathrm{Row}(r)\bigr).$$
Hence duplicate and equality information is no longer merely a coincidence between cell strings in a serialization; it is a one-hop counting fact around a value node. Similarly, if two foreign-key-linked columns are represented by the same column class $\alpha$, then rows joined by that key are connected through a shared value node. A bounded-hop join therefore becomes a bounded-depth GML pattern in $\mathcal{G}$.
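Read operationally, the duplicate formula is a degree test on value nodes. The sketch below evaluates it on an explicit edge set; the function name `dup_values`, the node encoding (value nodes as `(alpha, g)` pairs), and the toy edges are assumptions for illustration only.

```python
from collections import Counter


def dup_values(e_rv, alpha):
    """Value nodes of class `alpha` with at least two incident row nodes."""
    # e_rv is a set of (row_node, value_node) pairs, so each row counts once.
    degree = Counter(v for _row, v in e_rv if v[0] == alpha)
    return {v for v, d in degree.items() if d >= 2}


# Toy row-value edges.
e_rv = {
    (("T1", 0), ("city", "paris")),
    (("T1", 1), ("city", "antwerp")),
    (("T2", 0), ("city", "paris")),
    (("T1", 0), ("name", "ann")),
}
dups_city = dup_values(e_rv, "city")
dups_name = dup_values(e_rv, "name")
```

Only the `paris` value node satisfies the formula in this toy graph, and deciding so requires looking one hop around the value node and nothing else.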
### D.2 The Bridge as a Question-Conditioned Readout
After message passing, the bridge receives three typed multisets of node states: rows, columns, and values. The question-conditioned resampler selects and compresses information from these states into $K$ latent tokens. A useful logical abstraction is therefore a typed, question-conditioned readout over GML-definable node properties.
For a formula $\varphi$ and a node type $X\in\{R,C,V\}$, write
$$\#_{X}\varphi(\mathcal{G})=\bigl|\{x\in\mathcal{V}_{X}\mid\mathcal{G},x\models\varphi(x)\}\bigr|,$$
for the number of nodes of type $X$ satisfying $\varphi$. The bridge can be idealized as a finite sketch
$$\begin{aligned}s_{Q}(\mathcal{G})=\eta_{Q}\bigl(&\#_{R}\varphi^{R}_{Q,1}(\mathcal{G}),\ldots,\#_{R}\varphi^{R}_{Q,m_{R}}(\mathcal{G}),\\&\#_{C}\varphi^{C}_{Q,1}(\mathcal{G}),\ldots,\#_{C}\varphi^{C}_{Q,m_{C}}(\mathcal{G}),\\&\#_{V}\varphi^{V}_{Q,1}(\mathcal{G}),\ldots,\#_{V}\varphi^{V}_{Q,m_{V}}(\mathcal{G})\bigr),\end{aligned}$$
where the formulas are depth-$L$ GML formulas and the finite map $\eta_{Q}$ may depend on the question. We denote this abstraction by
$$\mathrm{QRead}_{B}(\mathrm{GML}_{L}).$$
This abstraction should be read conservatively. The bridge does not create new message-passing information; it selects and compresses what the encoder has already made available. If two graphs have the same typed multisets of depth-$L$ GML node types, then any permutation-invariant bridge reading only those encoded states receives the same graph-side information. Question conditioning at the bridge can improve relevance and compression, but it cannot recover distinctions erased by the constructor or encoder.
### D.3 Why Question Conditioning Helps
The bridge has fixed capacity, so the relevant issue is not only which facts are locally encodable, but also how many graph states the bridge must separate. Let $\mathfrak{G}$ be a finite set of possible constructed tripartite graphs. Each question $Q\in\mathcal{Q}$ induces an answer map
$$a_{Q}:\mathfrak{G}\to\mathcal{A}_{Q}$$
and therefore a partition $\Pi_{Q}$ of $\mathfrak{G}$:
$$\mathcal{G}\equiv_{Q}\mathcal{G}^{\prime}\quad\Longleftrightarrow\quad a_{Q}(\mathcal{G})=a_{Q}(\mathcal{G}^{\prime}).$$
Let
$$\Pi_{\mathcal{Q}}=\bigwedge_{Q\in\mathcal{Q}}\Pi_{Q}$$
be the common refinement. Thus, two graphs are equivalent under $\Pi_{\mathcal{Q}}$ exactly when they have the same answer to every question in $\mathcal{Q}$.
We call a bridge *exact* for a question family if its code is sufficient to recover the correct answer for every graph in $\mathfrak{G}$ and every question in the family. In the question-agnostic case, the bridge uses one code map for all questions. In the question-conditioned case, the bridge may use a different code map for each $Q$.
###### Proposition 1 (Exact sketch complexity).
For a finite graph class $\mathfrak{G}$ and finite question family $\mathcal{Q}$, the minimum number of bits required by a question-agnostic exact bridge is
$$C_{\mathrm{agn}}(\mathcal{Q})=\left\lceil\log_{2}|\Pi_{\mathcal{Q}}|\right\rceil.$$
The minimum number of bits required by a question-conditioned exact bridge is
$$C_{\mathrm{cond}}(\mathcal{Q})=\left\lceil\log_{2}\max_{Q\in\mathcal{Q}}|\Pi_{Q}|\right\rceil.$$
Consequently,
$$C_{\mathrm{cond}}(\mathcal{Q})\leq C_{\mathrm{agn}}(\mathcal{Q}),$$
and the inequality can be strict.
###### Proof.
A question-agnostic sketch $s:\mathfrak{G}\to\{0,1\}^{B}$ induces a partition $\Pi_{s}$ of graph states. If $s$ is exact for all $Q\in\mathcal{Q}$, then two graphs with the same sketch must have the same answer to every question. Hence $\Pi_{s}$ refines $\Pi_{\mathcal{Q}}$, so $s$ must use at least $|\Pi_{\mathcal{Q}}|$ distinct codes. Thus $2^{B}\geq|\Pi_{\mathcal{Q}}|$. This gives the lower bound, and it is tight by assigning one code to each block of $\Pi_{\mathcal{Q}}$.
For the question-conditioned case, the same argument applies separately to each $Q$. Exactness for $Q$ requires at least $|\Pi_{Q}|$ codes, and this is tight by encoding the block of $\Pi_{Q}$. Since the same bit budget must work for every question, the required number of bits is
$$\left\lceil\log_{2}\max_{Q\in\mathcal{Q}}|\Pi_{Q}|\right\rceil.$$
Finally, $\Pi_{\mathcal{Q}}$ refines every $\Pi_{Q}$, so $|\Pi_{Q}|\leq|\Pi_{\mathcal{Q}}|$ for all $Q\in\mathcal{Q}$. ∎
#### Exponential separation in code space.
The gap can be exponential at the level of distinguishable graph classes. Let there be $m$ independent binary facts $b_{1}(\mathcal{G}),\ldots,b_{m}(\mathcal{G})\in\{0,1\}$ about the constructed graph, and let $\mathcal{Q}=\{Q_{1},\ldots,Q_{m}\}$ where
$$a_{Q_{j}}(\mathcal{G})=b_{j}(\mathcal{G}).$$
Assume every bit vector $b\in\{0,1\}^{m}$ is realized by some graph in $\mathfrak{G}$. Then, for each fixed question $Q_{j}$, the partition $\Pi_{Q_{j}}$ has two blocks, so
$$C_{\mathrm{cond}}(\mathcal{Q})=1.$$
However, the joint answer vector
$$\bigl(a_{Q_{1}}(\mathcal{G}),\ldots,a_{Q_{m}}(\mathcal{G})\bigr)=\bigl(b_{1}(\mathcal{G}),\ldots,b_{m}(\mathcal{G})\bigr)$$
can take all $2^{m}$ values. Hence
$$|\Pi_{\mathcal{Q}}|=2^{m},\qquad C_{\mathrm{agn}}(\mathcal{Q})=m.$$
Thus, the question-conditioned bridge needs to distinguish only two answer classes for the current question, while a question-agnostic bridge must distinguish $2^{m}$ joint classes. In bits, this is a gap of $m$ versus $1$; in codewords, it is exponential.
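The separation can be checked numerically by enumerating the graph states as bit vectors and counting partition blocks. This is an illustrative sketch; the helper names `num_blocks`, `c_agn`, and `c_cond` and the choice $m=5$ are ours, and a graph state is abstracted to its vector of binary facts.

```python
from itertools import product
from math import ceil, log2


def num_blocks(states, answer):
    """Number of blocks in the partition of `states` induced by `answer`."""
    return len({answer(s) for s in states})


def c_agn(states, questions):
    """Bits for a question-agnostic exact code (common refinement)."""
    def joint(s):
        return tuple(q(s) for q in questions)
    return ceil(log2(num_blocks(states, joint)))


def c_cond(states, questions):
    """Bits for a question-conditioned exact code (largest single partition)."""
    return ceil(log2(max(num_blocks(states, q) for q in questions)))


m = 5
states = list(product([0, 1], repeat=m))             # every bit vector is realized
questions = [lambda s, j=j: s[j] for j in range(m)]  # question j asks for fact b_j
print(c_agn(states, questions), c_cond(states, questions))  # 5 1
```

Increasing `m` grows the agnostic budget linearly in bits, and hence exponentially in codewords, while the conditioned budget stays at one bit.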
## Appendix E Experimental Setup Details
### E.1 Dataset and Benchmark Details
Table [11](https://arxiv.org/html/2606.28916#A5.T11) summarises the number of samples per split and the fraction excluded by our token-budget filter. To keep all inputs within the context window of the LLM backbone, we discard any sample whose linearised table exceeds 8,192 tokens, as measured by the Qwen3-4B tokenizer. The resulting indices are stored in a skip file and applied consistently to the training, validation, and test splits, as well as to graph pre-computation, ensuring that excluded samples never appear in any evaluation. The same filter is applied to all baselines during training and testing. The vast majority of datasets are unaffected ($0\%$ skipped); the most impacted datasets are MMQA ($\approx 13$–$14\%$ per split, owing to long concatenated table texts) and Spider ($\approx 9\%$ train, $27\%$ test). Most datasets are well within the 8,192-token budget at the median, confirming that filtering has a minor impact for most of them. The high maxima for WTQ (25,943 tokens) and HCTQA (7,705 tokens) reflect a small number of extremely wide Wikipedia tables. For any dataset that did not provide a split, the full dataset was split into three parts.
Table 11: Dataset split sizes and percentage of samples removed by the 8,192-token table-length filter. Counts reflect the dataset size before filtering. \*Since Spider has no test set available, the validation set is used for testing only.
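The filtering step amounts to computing a set of skipped indices once and reusing it everywhere. The sketch below assumes a generic `count_tokens` callable; the whitespace counter used in the demo is only a stand-in for the Qwen3-4B tokenizer named above, and the function names and toy strings are hypothetical.

```python
def build_skip_set(table_texts, count_tokens, budget=8192):
    """Indices of samples whose linearised table exceeds the token budget."""
    return {
        idx
        for idx, text in enumerate(table_texts)
        if count_tokens(text) > budget
    }


def apply_skip(indices, skip):
    """Drop skipped samples from any split or pre-computation pass."""
    return [i for i in indices if i not in skip]


def whitespace_count(text):
    # Stand-in tokenizer for the demo; a real run would use the LLM tokenizer.
    return len(text.split())


# Toy linearised tables with a tiny budget so the filter is visible.
table_texts = ["a | b | c", "a | b | c | d | e | f"]
skip = build_skip_set(table_texts, whitespace_count, budget=8)
kept = apply_skip(range(len(table_texts)), skip)
```

Computing the skip set once and passing it to every consumer is what keeps the splits and the graph pre-computation consistent with each other.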
### E.2 Per-Row and Column-Header Token Statistics
For graph-based encoder variants, each table row is tokenised independently in the format $\text{col}_{n}$: $\text{val}_{n}$, and each column header is tokenised separately. Table [12](https://arxiv.org/html/2606.28916#A5.T12) reports the maximum observed lengths using the Qwen3-Embedding-0.6B tokenizer across all splits, which determine safe settings for the maximum row and column-header lengths.
Table 12: Maximum row and column-header token lengths per dataset using the Qwen3-Embedding-0.6B tokenizer across all splits.
### E.3 Training Setup
We use a unified training setup across all experiments to ensure that differences in performance are attributable to the model components rather than optimization choices. Unless otherwise stated, the LLM backbone is kept frozen and only the task-specific adaptation modules are trained.
All models are optimized with AdamW ($\beta_{1}=0.9$, $\beta_{2}=0.95$, weight decay $0.05$) at a learning rate of $10^{-4}$, following a half-cycle cosine decay schedule with 1 epoch of linear warmup and a minimum learning rate of $5\times10^{-6}$. Gradient norms are clipped to $0.1$. Training runs for up to 10 epochs with early stopping (patience 3), using an effective batch size of 32, adapted to the number of GPUs. Training time depends on the number of GPUs, since data parallelism reduces the time per epoch linearly. The per-step cost, however, remains dominated by the frozen LLM's forward pass, which cannot be skipped because its intermediate activations are required for the gradient signal to flow back through the projector and table encoder. Consequently, total time scales with dataset size: a small dataset can complete in under two hours (StructQA), while a large one may require close to a full day (HCT-QA), even with the same hardware configuration. All experiments use seed 42.
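The learning-rate schedule can be written as a function of the (possibly fractional) epoch. This is one common formulation of half-cycle cosine decay with linear warmup, given as a sketch: warmup starting from zero and the decay ending exactly at the last epoch are assumptions, and the function name `lr_at` is ours.

```python
import math


def lr_at(epoch, base_lr=1e-4, min_lr=5e-6, warmup_epochs=1, total_epochs=10):
    """Learning rate at a fractional epoch in [0, total_epochs]."""
    if epoch < warmup_epochs:
        # Linear warmup from 0 to base_lr (assumed to start at zero).
        return base_lr * epoch / warmup_epochs
    # Half-cycle cosine decay from base_lr down to min_lr.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + (base_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at(e) for e in range(11)]
```

Under these assumptions the rate peaks at the base value when warmup ends and reaches the minimum at epoch 10; with early stopping, a run may end before the floor is reached.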
#### Base LLM and LoRA.
When LoRA is applied, we use rank $r=8$, $\alpha=16$, dropout $0.05$, targeting the query and value projection matrices (q\_proj, v\_proj), with no bias adaptation. In all GNN encoder runs the LLM is otherwise frozen, leaving only the table encoder and projector trainable.
#### Soft prompt baseline\.
The soft prompt model prepends 10 learnable virtual tokens to the LLM input with a dimension of 1024\.
#### TableLlama.
TableLlama Zhang et al. ([2024](https://arxiv.org/html/2606.28916#bib.bib36)) is fine-tuned with LoRA using rank $r=16$, scaling factor $\alpha=32$, and dropout $0.05$. LoRA adapters are applied to all attention projection matrices: q\_proj, k\_proj, v\_proj, and o\_proj. The model is trained with learning rate $2\times10^{-4}$, weight decay $0.01$, and a linear warmup over the first $6\%$ of training steps. LLaMA-2's native context window is 4,096 tokens, so the effective input length is clamped to 4,096 tokens in all runs. Within this limit, only the table is truncated, while the instruction prefix and question are always preserved.
#### MultiTabQA.
MultiTabQA Pal et al. ([2023](https://arxiv.org/html/2606.28916#bib.bib25)) is implemented by fine-tuning the base model checkpoint with learning rate $10^{-4}$, weight decay $0.01$, and a linear warmup over the first $6\%$ of training steps. For Atis, GeoQuery, and Spider, the checkpoints were already available. The model's absolute position embeddings are capped at $1{,}024$ tokens due to architectural constraints. As a result, the effective input length is limited to $1{,}024$ tokens, regardless of the configured max\_length. This substantially limits the amount of table content that the model can process and makes it less suitable for large inputs.
#### GNN encoder.
The graph encoder operates on a tripartite graph of row ($R$), column ($C$), and value ($V$) nodes. Node embeddings are precomputed offline using Qwen3-Embedding-0.6B and stored as float16 tensors, together with question token embeddings, row/column validity masks, and adjacency matrices encoding $R$–$V$ and $C$–$V$ edges. The GNN applies 1 message-passing layer over this structure, producing 32 row latents, 32 column latents, and 32 value latents. A cross-attention resampler with 4 heads and 2 layers then pools these into a fixed-size representation per table. GNN dropout is $0.1$. A single linear projector maps the encoder output to the LLM embedding space, initialized with Xavier uniform (gain $0.01$).
## Appendix F Structural Hierarchy of TQA Queries
This appendix introduces a taxonomy of table question answering queries along three orthogonal axes: the form of the expected answer, the structural depth of table access required to locate evidence, and the computational path required to derive the final value. The taxonomy supports the stress-test in Appendix [G](https://arxiv.org/html/2606.28916#A7), which varies the structural and computational axes independently to attribute GRAB's gains (and limitations) to a specific source. The answer-type axis is included for completeness and to justify why we hold it fixed in the stress-test, as we discuss below.
### F.1 Axis I: Answer Type
The first axis concerns the *form* of the expected answer, independently of how it is computed. We distinguish three classes.
#### Boolean queries (A1).
The answer is a truth value (yes/no), e.g., *"Is the value in column $c$ greater than $k$?"* or *"Do any two rows share the same value in column $c$?"*
#### Retrieval queries (A2).
The answer is a value that exists verbatim in the table and is identified by locating the right position, e.g., *"What is the value of column $c$ for the row where column $d$ equals $a$?"*
#### Derived queries (A3).
The answer is computed from the table and does not need to appear in it, e.g., *"What is the average of column $c$ for rows where column $d$ equals $a$?"*
#### Why we do not vary this axis\.
Answer type governs the evaluation protocol but not the reasoning required to produce a correct answer\. A boolean question such as*“Is the average salary of employees in London higher than the company\-wide average?”*is A1 by output form but requires two aggregations and a comparison, making it as demanding as any A3 query\. The only effect that is specific to A1 is that the binary output space inflates baseline accuracy through guessing, masking rather than revealing reasoning difficulty\. We therefore fix the answer type at A2/A3 in the stress\-test and vary the two axes that actually expose hardness: structural depth and computational path\.
### F.2 Axis II: Structural Depth
The second axis concerns how much relational structure must be accessed to locate the evidence required for the answer. We distinguish four levels, ordered by minimum structural access. Each question is assigned to the lowest sufficient level.
#### Scan-sufficient queries (S1).
The answer is determined by locating the right row and reading off a cell value; no aggregation across rows is needed. Example: *"What is the value in column $c$ for the row where column $d$ equals $a$?"* A model reading the serialized table sequentially has access to all required information.
#### Single-column queries (S2).
The answer requires aggregating or filtering over the values of a single column, e.g., *"How many rows have value $a$ in column $c$?"* or *"What is the average of column $c$ for rows where column $d$ equals $a$?"* The flat serialization contains all necessary values but does not make cross-row statistics explicit.
#### Multi-column queries (S3).
The answer requires reasoning jointly over two or more columns within a single table, e.g., *"Which value in column $c_{1}$ co-occurs most often with value $a$ in column $c_{2}$?"* or *"Is there a pair of rows sharing values in both $c_{1}$ and $c_{2}$?"* These cannot be decomposed into independent single-column computations; the joint distribution across columns must be accessible to the model.
#### Multi-table queries (S4).
The answer requires traversing two or more tables via foreign-key relationships, e.g., *"What is the total of column $c$ in $T_{2}$ for all rows linked to this row in $T_{1}$ via column $d$?"* The relevant evidence is distributed across tables and cannot be recovered from any single table in isolation.
### F\.3Axis III: Computational Path
The third axis concerns the arithmetic operations performed*after*the relevant data has been located\. This axis is orthogonal to both answer type and structural depth: a scan\-sufficient boolean query may require multi\-step arithmetic, while a multi\-table retrieval query may require no computation beyond identification\. Separating this axis is essential for diagnosis: a model that correctly identifies the relevant rows but produces a wrong numeric answer is failing on computation, not on structural reasoning, and the two failure modes call for different interventions\.
#### Lookup (C0).
No arithmetic. The answer is read off once the relevant row or cell is located, e.g., *"Which department does employee $x$ belong to?"* C0 is the cleanest probe of structural reasoning in isolation: any failure is attributable to incorrect row identification.
#### Counting (C1).
The answer is the number of rows satisfying a condition, e.g., *"How many transactions were placed by customers over 30?"* Counting maps naturally to degree statistics on value nodes in GRAB's graph but can be unreliable for LLMs on serialized text as the matching set grows.
#### Single aggregation (C2).
A single arithmetic operation over a set of values: sum, max, min, or mean, e.g., *"What is the average age of employees in the engineering department?"* These operations require exact numeric values. Because GRAB's graph encodes numeric values as quantile buckets, the graph alone cannot perform exact arithmetic; the exact values must be recovered from the textual serialization.
#### Multi\-step derivation \(C3\)\.
Chained arithmetic where the output of one operation feeds the next, e\.g\.,*“By how much does the average order value of returning customers exceed that of new customers?”*C3 queries compound structural and arithmetic errors: a failure may stem from incorrect row selection, from arithmetic error at any step, or from both simultaneously\.
### F.4 Using the Taxonomy
The stress-test in Appendix [G](https://arxiv.org/html/2606.28916#A7) uses this taxonomy as an experimental scaffold. It holds the answer-type axis fixed (queries are A2 or A3) and varies the structural and computational axes independently, so that each empirical result can be attributed to a single source.
Three diagnostic recipes follow from the orthogonality of the axes:
- Fix S, vary C. Holds the structural access pattern constant while increasing arithmetic demand. Failures isolate the arithmetic ceiling of the model and are not attributable to incorrect row identification.
- Fix C, vary S. Holds the computation constant (typically at C0, pure lookup) while increasing the structural depth required to find the right rows. Failures isolate structural reasoning capacity.
- Match (S, C), compare with/without GRAB. At matched cells of the (S, C) grid, the difference between the serialized baseline and GRAB isolates the contribution of the graph encoder.
A characteristic signature follows: if the graph encoder improves performance at C0 across S levels but not at C2–C3, the structural token is locating the right evidence, but arithmetic remains the binding constraint. Appendix [G](https://arxiv.org/html/2606.28916#A7) reports exactly this.
## Appendix G Fine-Grained Error Analysis and Stress-Test Design
Building on the taxonomy in Appendix [F](https://arxiv.org/html/2606.28916#A6), we now apply it as an experimental scaffold. The stress-test holds the answer-type axis fixed and varies the structural and computational axes jointly, generating a controlled grid of questions over which each failure mode can be attributed to a specific axis. We begin by stating two predictions about GRAB’s expected behavior on this grid, then describe the design and analyze the results.
#### Prediction 1: arithmetic ceiling at C2.
GRAB’s graph constructor encodes numeric values as quantile buckets, not as exact numeric content. The graph therefore cannot perform exact arithmetic; it can only isolate the correct rows over which arithmetic must be applied by the LLM. We predict that pure aggregation queries (C2, especially Avg) will fail at low structural depth (S1, no filtering required), since there is no structural bottleneck for GRAB to relieve: only the arithmetic step remains, and that step is unchanged by the presence of the graph. The clearest case is the unconditioned Avg query: the model must sum a set of values and divide by their count, two operations that LLMs perform unreliably on serialized input regardless of how cleanly the input is presented. Conversely, when filtering reduces the value set, GRAB should help indirectly by shrinking the operand set the LLM must aggregate over.
#### Prediction 2: structural gain at S2–S3.
Multiple simultaneous conditions on different columns (S3) require the model to identify rows satisfying joint constraints. On a serialized table, this demands matching patterns distributed across distant positions in the sequence; on GRAB’s graph, the conditions correspond to explicit incidence edges from a row node to typed value nodes. We predict that the graph encoder will help most on (S2–S3, C0–C1) cells: queries where locating the right rows is the dominant subtask and no exact arithmetic is required. Note that the flattening of hierarchical column headers in GRAB’s graph construction (where a nested header such as *Export > 2020 > Q1* is expanded into separate columns) means that filtering on a deeply nested cell is structurally equivalent to applying multiple column conditions simultaneously, so both manifest as the same pattern in the graph.
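The incidence-edge view of joint conditions can be sketched as a set intersection. This is only an illustration of the access pattern, not GRAB's message-passing encoder: the toy table, the column names, and the helpers `build_incidence` and `select_rows` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy table with three categorical columns and one numeric column.
rows = [
    {"region": "north", "year": "2020", "sector": "export", "value": 12.0},
    {"region": "north", "year": "2021", "sector": "export", "value": 15.0},
    {"region": "south", "year": "2020", "sector": "export", "value": 9.0},
    {"region": "north", "year": "2020", "sector": "import", "value": 7.0},
]

def build_incidence(table, categorical_columns):
    """Map each typed value node (column, value) to the ids of rows linked to it."""
    incidence = defaultdict(set)
    for row_id, row in enumerate(table):
        for column in categorical_columns:
            incidence[(column, row[column])].add(row_id)
    return incidence

def select_rows(incidence, conditions, n_rows):
    """Return ids of rows linked to every (column, value) node in conditions."""
    selected = set(range(n_rows))  # S1: no condition keeps all rows
    for condition in conditions:
        selected &= incidence.get(condition, set())
    return selected

incidence = build_incidence(rows, ["region", "year", "sector"])
s2_rows = select_rows(incidence, [("region", "north")], len(rows))
s3_rows = select_rows(
    incidence, [("region", "north"), ("year", "2020")], len(rows)
)
print(sorted(s2_rows), sorted(s3_rows))
```

Each added condition is one more edge lookup, independent of where the matching cells sit in a serialized sequence.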
### G.1 Stress-Test Set Design
We construct a controlled set of questions that varies the structural and computational axes independently, so that each cell of the (S, C) grid can be attributed to a specific source of difficulty. A model that fails on Avg queries, for instance, may be failing because it cannot identify the correct rows (structural), because it cannot compute the mean over correctly identified rows (arithmetic), or both; varying one axis at a time disentangles these explanations.
#### Tables.
The stress-test is conducted on 15 standard relational tables constructed in the style of HCT-QA. Concretely, the tables span the same domains as the real-world HCT-QA sources (scientific paper benchmarks, government statistics, and demographic census reports) and follow the same structural conventions as the HCT-QA synthetic generator: a flat relational layout where categorical attributes define row identity and numerical attributes fill the value columns. All tables contain at least three categorical columns and two numerical columns, a requirement imposed to support the increasing levels of structural difficulty examined below. Tables from the original HCT-QA pool were not used directly because too few contained the minimum number of categorical columns needed for the multi-condition experiments; the synthetic tables are otherwise in-distribution with the HCT-QA data the model was trained on.
#### Design.
Questions are generated by independently varying two axes:
- Structural axis (S1–S3). The number of simultaneous filter conditions applied before computing the answer: no conditions (global, over all rows; S1), one condition (single categorical filter; S2), or two conditions (joint filter on two distinct categorical columns; S3), optionally followed by a group-by operator partitioning rows by one or two categorical keys (G1 and G2 respectively).
- Computational axis (C0–C2). The aggregation operator applied to the retrieved values: Lookup (C0), Count (C1), Max (C2), and Avg (C2).
This design, extended with group-by variants, yields 2,337 questions across 15 tables and isolates each source of difficulty independently, allowing failures to be attributed to their origin rather than conflating structural and arithmetic errors.
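The scalar part of this grid can be enumerated with a short sketch. It covers only the filter levels and operators listed above and leaves out the group-by variants, since the text does not specify which group-by combinations were instantiated; the names `STRUCTURAL_LEVELS`, `OPERATORS`, and `enumerate_cells` are hypothetical.

```python
from itertools import product

# Number of simultaneous filter conditions per structural level (from the design).
STRUCTURAL_LEVELS = {"S1": 0, "S2": 1, "S3": 2}
# Operator -> computational level (from the design); Max and Avg share C2.
OPERATORS = {"Lookup": "C0", "Count": "C1", "Max": "C2", "Avg": "C2"}

def enumerate_cells():
    """List every scalar (structural level, operator) cell of the grid."""
    return [
        {"s": s, "n_filters": n, "operator": op, "c": c}
        for (s, n), (op, c) in product(
            STRUCTURAL_LEVELS.items(), OPERATORS.items()
        )
    ]

cells = enumerate_cells()
print(len(cells))
```

Each cell would then be instantiated with concrete filter values and target columns per table, which is where the per-table question counts come from.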
#### Arithmetic bottleneck (verifies Prediction 1).
Table [13](https://arxiv.org/html/2606.28916#A7.T13) reveals a clear operator hierarchy. Lookup (C0) is the strongest category across all models and filter levels, with the serialized baseline achieving F1 of 66–74 regardless of the number of conditions. Max (C2) occupies a middle ground (F1 = 52–69), while Avg (C2) is the hardest operator by a large margin: both models score near zero on unconditioned average queries (F1 = 1.23 at (S1, C2)), where no filtering is required and arithmetic alone must be performed. This rules out a structural explanation for average failures and identifies exact arithmetic as a persistent limitation of LLMs on serialized tables, one that +GRAB cannot fully resolve either, exactly as predicted by the quantile-bucket encoding of value nodes.
Count (C1) shows a qualitatively different pattern: the serialized baseline performs poorly (F1 = 8.63 with no condition), while +GRAB achieves 54.90, the largest absolute gain in the entire table (+46.27 F1). This aligns with the natural capacity of message-passing GNNs to aggregate over neighborhoods, which maps directly onto counting over row-conditioned subgraphs.
#### Filtering makes questions easier, not harder.
A consistent and counterintuitive pattern emerges across all operators: adding filter conditions *reduces* difficulty rather than increasing it. For Lookup, performance on the serialized baseline improves monotonically from S1 (F1 = 66.59) to S3 (F1 = 73.90). This is because additional conditions identify the target row more precisely, reducing ambiguity in the answer. The LLM benefits from additional anchors in the text: each extra condition narrows the search space and makes the correct row easier to locate by pattern-matching over the serialized sequence.
The same effect holds for Avg: performance rises from F1 = 1.23 with no filter to 28.62 with two conditions, because filtering also reduces the number of values over which the mean must be computed, easing the arithmetic load. For Count and Max, the trend is less pronounced but consistent in direction. This finding has a practical implication: the stress-test categories without any filter condition are structurally the hardest, not the easiest, contrary to the intuition that more conditions imply more complexity.
GRAB amplifies this effect (verifying Prediction 2). The graph encoder provides column-typed value nodes that act as precise structural anchors, allowing the model to locate target cells even more reliably than pattern-matching over flat text. The gain is largest on Lookup (+14–24 F1 across S2–S3) and Count (+22–46 F1 across S1–S3), where locating the correct rows is the dominant subtask.
#### Group-by as the hardest structural probe.
Group-by questions expose a qualitatively harder regime than any other category. Unlike lookup or aggregation, group-by requires the model to simultaneously partition the table by one or two categorical keys, apply an aggregation within each partition, and produce a complete structured answer with one tuple per group. This demands not only locating the relevant rows, but retaining and organizing them across multiple groups before producing the output: a form of working memory over the table that serialized text processing handles poorly.
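The three steps named above (partition, aggregate within each partition, emit one tuple per group) correspond to the following reference computation. It is a sketch of what a correct answer contains, not of how the model or GRAB produces it; the toy table and the helper `group_by_aggregate` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy table with two categorical keys and one numeric column.
rows = [
    {"region": "north", "year": "2020", "value": 12.0},
    {"region": "north", "year": "2021", "value": 15.0},
    {"region": "south", "year": "2020", "value": 9.0},
    {"region": "south", "year": "2020", "value": 11.0},
]

def group_by_aggregate(table, keys, value_column, aggregate):
    """Partition rows by keys, then apply aggregate to each group's values."""
    groups = defaultdict(list)
    for row in table:
        group = tuple(row[key] for key in keys)
        groups[group].append(row[value_column])
    # One output tuple per group: the answer is complete only if all are present.
    return {group: aggregate(values) for group, values in groups.items()}

single_key_max = group_by_aggregate(rows, ["region"], "value", max)
double_key_count = group_by_aggregate(rows, ["region", "year"], "value", len)
print(single_key_max)
print(double_key_count)
```

A double-key query multiplies the number of groups that must be tracked at once, which is consistent with it being the harder variant.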
The serialized baseline drops to F1 = 42.92 on Group-by Max single key and CC = 0.00 on all double-key variants, meaning the model essentially never produces a fully correct multi-group answer. GRAB recovers substantially: F1 = 72.59 on Group-by Max single key (+29.67) with CC rising from 4.76 to 48.57, indicating that the graph encoder helps produce *complete* structured outputs rather than only partial ones.
The counterintuitive filtering effect reappears here too. Adding a filter to a group-by query does not consistently increase difficulty and in some cases reduces it, presumably because the filter constrains the set of groups the model must track, reducing the working memory demand. Group-by Avg remains the hardest setting overall, since it compounds the group-tracking difficulty with exact arithmetic; GRAB reaches at most F1 = 56.59 here, consistent with the arithmetic ceiling identified in the scalar setting.
Table 13: Stress-test results. F1 and CC scores averaged over questions across 15 different tables. +Prompt Tuning uses learned soft prompt vectors without a graph encoder; +GRAB adds the full graph encoder and query-conditioned latent bridge; Serialized only uses the frozen LLM with flattened table input. Δ reports the gain of +GRAB over Serialized only.
### G.2 Taxonomy Verification
We close by mapping the results back onto the diagnostic recipes in Appendix [F.4](https://arxiv.org/html/2606.28916#A6.SS4), then verifying that the observed pattern is not an artifact of model scale.
#### Fixing S, varying C.
At S1 (no filtering), the operator hierarchy is C0 > C1 > C2 for both the serialized baseline and +GRAB. Avg at (S1, C2) is the unique cell where neither model exceeds F1 = 1.23, isolating exact arithmetic as the binding constraint when no structural reasoning is required. This is Prediction 1 confirmed.
#### Fixing C, varying S.
At C0, the serialized baseline improves with more filters (66.59 → 73.90 F1) and +GRAB improves further (80.72 → 95.52), confirming that structural localization is the binding constraint when no arithmetic is required. The gap between baseline and +GRAB widens at higher S, indicating that the graph’s incidence structure provides increasing relative value as the number of joint conditions grows. This is Prediction 2 confirmed.
#### Matched (S, C), with vs. without GRAB.
GRAB’s gains concentrate at (S2–S3, C0–C1) cells, where the graph encoder relieves a real structural bottleneck without competing with arithmetic demand. Gains shrink at (S1, C2), where no structural bottleneck exists and only arithmetic remains. This is the characteristic signature predicted in Appendix [F.4](https://arxiv.org/html/2606.28916#A6.SS4): the structural token is locating the right evidence, and arithmetic remains the binding constraint where it appears.
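The matched-cell comparison is a per-cell subtraction, sketched below on the F1 values quoted in this appendix for the Qwen3-4B backbone. The labels attached to the two Lookup endpoints are an assumption about which filter level each value belongs to (the text gives them only as the endpoints of the filter range), and `matched_cell_gains` is a hypothetical helper.

```python
# F1 values quoted in the text: (serialized baseline, +GRAB).
# The cell labels are an assumption about which grid cell each value belongs to.
reported_f1 = {
    ("Lookup", "fewest filters"): (66.59, 80.72),
    ("Lookup", "most filters"): (73.90, 95.52),
    ("Count", "S1"): (8.63, 54.90),
    ("Avg", "S1"): (1.23, 1.23),
}

def matched_cell_gains(f1_by_cell):
    """Return the +GRAB minus baseline F1 difference for every matched cell."""
    return {
        cell: round(with_grab - baseline, 2)
        for cell, (baseline, with_grab) in f1_by_cell.items()
    }

gains = matched_cell_gains(reported_f1)
for cell, gain in gains.items():
    print(cell, f"{gain:+.2f}")
```

Read this way, a gain that is large on Lookup and Count and zero on unconditioned Avg is the signature of a structural contribution with an unchanged arithmetic ceiling.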
#### Robustness to model scale.
We change the underlying LLM from Qwen3-4B to Qwen3-14B and rerun the stress-test on both the serialized baseline and +GRAB (Table [14](https://arxiv.org/html/2606.28916#A7.T14)). Three patterns confirm that the gains attributed to GRAB reflect a structural rather than a capacity bottleneck. First, the overall +GRAB gain *widens* with scale, from +21.42 F1 at 4B to +31.68 F1 at 14B: a stronger backbone does not absorb the structural signal, it amplifies the benefit of receiving it in encoded form. Second, the arithmetic ceiling at (S1, C2) is preserved exactly (1.23 F1 across all four 4B/14B × Serialized/+GRAB configurations), as predicted by the quantile-bucket encoding of value nodes. Third, the largest 14B gains concentrate precisely where the diagnostic recipes predict. On Count, +GRAB adds +54.90, +51.43, and +26.98 F1 across the three condition depths, since counting still maps onto value-node degree regardless of backbone size. On group-by, the contrast is sharper: the 14B serialized baseline actually *regresses* relative to 4B on several Group-by Count settings (e.g., single key 34.58 → 24.65 F1, double key 37.44 → 18.70), consistent with the working-memory failure mode being orthogonal to model scale, while +GRAB recovers above 73 F1 in every Group-by Count cell and lifts CC on Group-by Max single key by +65.72 points. Taken together, the predicted (S, C) signature reproduces at scale, and the graph encoder contributes along an axis of difficulty that additional parameters do not resolve.
Table 14: Stress-test results across two backbones (Qwen3-4B and Qwen3-14B). F1 and CC scores averaged over questions across 15 different tables. Serialized only uses the frozen LLM with flattened table input; +GRAB adds the full graph encoder and query-conditioned latent bridge. ΔF1 reports the F1 gain of +GRAB over Serialized only within each backbone.
## Appendix H Licenses
We indicate the licenses of the artifacts used in this work, based on the official repositories, dataset cards, or release pages whenever available. All the artifacts used in this paper can be used for research: HiTab (Computational Use of Data Agreement v1.0), WikiTableQuestions (CC BY-SA 4.0), WikiSQL (BSD-3-Clause repository license), HCT-QA (MIT), TabMWP (CC BY-NC-SA 4.0), MultiHierTT (MIT), TQABench (GPL-3.0 license), GeoQuery (GPL-2.0), Spider (CC BY-SA 4.0), TabFact (CC BY 4.0), Qwen models (Apache 2.0), MultiTabQA (MIT), TableLlama (MIT).
## Appendix I Architectural Comparison with Frozen-LLM Table Adapters
Table [15](https://arxiv.org/html/2606.28916#A9.T15) summarizes the main architectural differences between GRAB and the closest frozen-LLM table adaptation baselines. The comparison highlights whether each method explicitly models multi-table structure, conditions its latent representation on the question, and incorporates foreign-key information.
Table 15: Comparison of GRAB with representative frozen-LLM table adaptation methods. GRAB differs by explicitly modeling multi-table relational structure, conditioning its latent structural tokens on the question, and incorporating foreign-key-aware graph construction.