AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA

arXiv cs.AI Papers

Summary

AgentFinVQA is a multi-agent pipeline for financial chart question answering that decomposes queries into planning, OCR, legend grounding, visual inspection, and verification steps, recording each step in a traceable Model Evaluation Packet. It achieves significant accuracy gains over zero-shot baselines while enabling on-premise deployment and auditability.

arXiv:2606.19782v1 Announce Type: new Abstract: Financial chart question answering in regulated settings demands more than accuracy: practitioners must know which answers to trust before acting on them, and many institutions cannot send client data to external model providers. Yet existing chart-QA agents are accuracy-focused and opaque, and most assume proprietary API access; to our knowledge, none combines auditability with on-premise deployability without significant accuracy compromise. We present AgentFinVQA, a multi-agent pipeline that decomposes each query into planning, OCR, legend grounding, visual inspection, and verification, recording every step in a traceable Model Evaluation Packet (MEP) per sample. On FinMME, AgentFinVQA improves $+7.68$ pp over a primary-backbone matched zero-shot baseline with a proprietary backbone (Gemini-3 Flash; 71.24% vs. 63.56%, McNemar $p \approx 1.1 \times 10^{-16}$), and $+4.84$ pp with open-weights Qwen3.6-27B-FP8 served locally. The verifier's verdict also serves as a useful confidence signal (68.2% vs. 55.6% exact accuracy on confirmed vs. revised answers), enabling human-in-the-loop review routing. Error analysis shows that question misunderstanding, legend confusion and extraction error account for nearly two-thirds of failures and are the categories least detected by the verifier, identifying clear directions for future work. Together these results show that auditable, on-premise financial chart QA is practical and that the open-weights system keeps most of the accuracy gains while enabling full data residency. We release our code to support reproducible evaluation.

Cached at: 06/20/26, 02:33 PM

# AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA
Source: [https://arxiv.org/html/2606.19782](https://arxiv.org/html/2606.19782)
Aravind Narayanan Shaina Raza Vector Institute \{aravind\.narayanan,shaina\.raza\}@vectorinstitute\.ai

###### Abstract

Financial chart question answering in regulated settings demands more than accuracy: practitioners must know which answers to trust before acting on them, and many institutions cannot send client data to external model providers\. Yet existing chart\-QA agents are accuracy\-focused and opaque, and most assume proprietary API access; to our knowledge, none combines auditability with on\-premise deployability without significant accuracy compromise\. We present AgentFinVQA, a multi\-agent pipeline that decomposes each query into planning, OCR, legend grounding, visual inspection, and verification, recording every step in a traceable Model Evaluation Packet \(MEP\) per sample\. On FinMME, AgentFinVQA improves $+7.68$ pp over a primary\-backbone matched zero\-shot baseline with a proprietary backbone \(Gemini\-3 Flash; 71\.24% vs\. 63\.56%, McNemar $p \approx 1.1 \times 10^{-16}$\), and $+4.84$ pp with open\-weights Qwen3\.6\-27B\-FP8 served locally\. The verifier’s verdict also serves as a useful confidence signal \(68\.2% vs\. 55\.6% exact accuracy on confirmed vs\. revised answers\), enabling human\-in\-the\-loop review routing\. Error analysis shows that question misunderstanding, legend confusion and extraction error account for nearly two\-thirds of failures and are the categories least detected by the verifier, identifying clear directions for future work\. Together these results show that auditable, on\-premise financial chart QA is practical and that the open\-weights system keeps most of the accuracy gains while enabling full data residency\. We release our code to support reproducible evaluation\. [Project Code](https://github.com/VectorInstitute/AgentFinVQA/)


## 1 Introduction

Financial charts are a primary medium for communicating economic data, and automated question answering over such charts could substantially reduce the analytical burden on practitioners Shu et al\. \([2025](https://arxiv.org/html/2606.19782#bib.bib10)\)\. However, deployment in financial settings demands more than raw accuracy\. Recent work shows that such systems exhibit significant hallucination rates on financial tasks, posing direct operational and regulatory risk to practitioners Zhang et al\. \([2025](https://arxiv.org/html/2606.19782#bib.bib21)\)\. A system that is frequently correct but unpredictably wrong is difficult to trust, and one that gives no visibility into its reasoning is impossible to audit\.

Figure [1](https://arxiv.org/html/2606.19782#S1.F1) shows a concrete case: the zero\-shot baseline over\-selects fiscal years above the 40% threshold, while AgentFinVQA checks each MCQ option against the 10k–25k segment values and recovers the exact multi\-select answer\. A further hard constraint is that many financial institutions cannot send sensitive client documents to external model providers, making accurate local deployment a necessity rather than a preference\.

Agentic decomposition, which breaks a chart question into specialised reasoning steps, has improved accuracy on general chart QA Wang et al\. \([2026](https://arxiv.org/html/2606.19782#bib.bib1)\); Liu et al\. \([2023a](https://arxiv.org/html/2606.19782#bib.bib12)\)\. The closest systems are agentic chart\-QA frameworks such as ChartAgent Kaur et al\. \([2026](https://arxiv.org/html/2606.19782#bib.bib16)\), which decompose queries into visual subtasks but rely on local computer\-vision tools to manipulate the image, and ChartSketcher Huang et al\. \([2026](https://arxiv.org/html/2606.19782#bib.bib17)\), which requires a two\-stage fine\-tune on hundreds of thousands of annotated samples\. To our knowledge, no prior system delivers chart QA that is at once prompting\-only \(no local segmentation models or task\-specific fine\-tuning\), auditable per answer, and deployable on open weights in\-house, the combination that regulated financial settings require\.

![Refer to caption](https://arxiv.org/html/2606.19782v1/x1.png)

Figure 1: AgentFinVQA corrects zero\-shot over\-selection on a FinMME stacked ticket\-size chart\. Zero\-shot lists FY18–FY22 for the 10k–25k segment above 40%, while AgentFinVQA verifies each MCQ option independently, excludes FY23 at 39\.6%, and returns the exact answer: FY18 \+ FY19 \+ FY20\.

In this work we ask: *can a financial chart\-QA system be auditable and run entirely in\-house without sacrificing accuracy?* To answer this, we develop AgentFinVQA, a multi\-agent pipeline that coordinates a text\-only planner, an OCR reader, a legend grounder, a lightweight deterministic colour\-area measurement stage, and a vision\-and\-verify loop, recording each stage in a Model Evaluation Packet \(MEP\) for full auditability and confidence\-based human review routing\.

![Refer to caption](https://arxiv.org/html/2606.19782v1/x2.png)

Figure 2: AgentFinVQA pipeline\. Input flows left\-to\-right through five required stages \(blue/teal\): Planner, OCR Reader, Vision Agent, Verifier, and MEP output\. Purple boxes indicate conditionally gated stages: Legend Grounder \(multi\-series charts\), Colour\-Area Tool \(bar/pie comparison questions\), Forced\-Choice Retry \(MCQ over\-refusal\), and Confidence Gate \(low\-confidence verifier revisions\)\.

On the FinMME benchmark Luo et al\. \([2025](https://arxiv.org/html/2606.19782#bib.bib9)\), our pipeline with a proprietary backbone \(Gemini\-3\) improves \+7\.68 pp over a model\-matched zero\-shot baseline \($p \approx 1.1 \times 10^{-16}$\), with the largest gains on MCQ questions \(\+8\.1 pp\)\. The same pipeline with open\-weights Qwen3\.6\-27B\-FP8, served locally on a single A100, yields \+4\.84 pp \($p \approx 3.0 \times 10^{-6}$\), confirming the gains do not depend on a proprietary API\.

Our contributions are: \(1\) AgentFinVQA, a multi\-agent chart\-QA pipeline whose every step is recorded in a traceable MEP, making each answer auditable; \(2\) evidence that accuracy gains transfer from proprietary to locally\-served open\-weights models, supporting in\-house deployment with modest accuracy trade\-off; and \(3\) a verifier verdict that acts as a confidence signal, letting human reviewers focus on the answers most likely to be wrong\.

## 2 Related Work

#### Chart QA benchmarks\.

Early datasets such as FigureQA Kahou et al\. \([2018](https://arxiv.org/html/2606.19782#bib.bib4)\), DVQA Kafle et al\. \([2018](https://arxiv.org/html/2606.19782#bib.bib5)\), and PlotQA Methani et al\. \([2020](https://arxiv.org/html/2606.19782#bib.bib6)\) established the task on synthetic charts, while ChartQA Masry et al\. \([2022](https://arxiv.org/html/2606.19782#bib.bib7)\) and ChartQA Pro Masry et al\. \([2025](https://arxiv.org/html/2606.19782#bib.bib8)\) introduced human\-written questions over real charts\. FinMME Luo et al\. \([2025](https://arxiv.org/html/2606.19782#bib.bib9)\), FinChart\-Bench Shu et al\. \([2025](https://arxiv.org/html/2606.19782#bib.bib10)\) and MME\-Finance Gan et al\. \([2024](https://arxiv.org/html/2606.19782#bib.bib11)\) confirm the difficulty of the financial setting but offer no deployable agentic solution, the gap we address\.

#### Decomposition and structured extraction\.

DePlot Liu et al\. \([2023a](https://arxiv.org/html/2606.19782#bib.bib12)\) converts charts into data tables for one\-shot LLM reasoning, separating modality conversion from reasoning; this informs our design, where the OCR Reader and Legend Grounder produce structured text metadata that grounds later stages\. Unlike MatCha Liu et al\. \([2023b](https://arxiv.org/html/2606.19782#bib.bib13)\) and UniChart Masry et al\. \([2023](https://arxiv.org/html/2606.19782#bib.bib14)\), which require domain\-specific pretraining, our pipeline achieves comparable extraction through prompting alone and works with any vision language model \(VLM\) backend\.

#### Agentic chart understanding\.

ReAct Yao et al\. \([2023](https://arxiv.org/html/2606.19782#bib.bib15)\) established the reason\-and\-act paradigm; our Plan, OCR, Ground, Inspect, Verify pipeline applies it to chart QA\. The closest systems, ChartAgent Kaur et al\. \([2026](https://arxiv.org/html/2606.19782#bib.bib16)\) and a YOLO\-based variant Wang et al\. \([2026](https://arxiv.org/html/2606.19782#bib.bib1)\), decompose queries into visual subtasks but physically manipulate the chart image with local computer\-vision models\. AgentFinVQA instead injects OCR output and legend maps as text, avoiding local segmentation infrastructure\. ChartSketcher Huang et al\. \([2026](https://arxiv.org/html/2606.19782#bib.bib17)\) reaches strong results but requires a two\-stage fine\-tune on 300K samples with RL; our pipeline is prompting\-only and runs on both proprietary and open\-weights backends\. MAC\-SQL Wang et al\. \([2025](https://arxiv.org/html/2606.19782#bib.bib18)\) shows the same plan–reason–verify pattern in text\-to\-SQL\.

#### Verification and deployment\.

Prior work reduces hallucination through self\-correction or judge\-based analysis Pan et al\. \([2026](https://arxiv.org/html/2606.19782#bib.bib19)\); Zheng et al\. \([2023](https://arxiv.org/html/2606.19782#bib.bib22)\); our verifier follows this intuition but runs as a prompting\-only inference stage and provides a confidence signal for review routing\. We use judge\-based analysis only for post\-hoc failure taxonomy, while answer accuracy is measured by the rule\-based scorer in Section [4\.1](https://arxiv.org/html/2606.19782#S4.SS1)\. FAITH Zhang et al\. \([2025](https://arxiv.org/html/2606.19782#bib.bib21)\) motivates the need to manage silent failures in financial settings\.

## 3 AgentFinVQA Framework

#### Problem setup\.

Given a financial chart image $c$, a natural\-language question $q$, and an optional set of MCQ choices $O=\{o_1,\dots,o_k\}$, the task is to produce an answer $a$ together with an auditable trace $\mathcal{M}$ of the reasoning that led to it\. For open\-ended \(standard\) questions $a$ is a numeric value or short text string; for single\-select MCQ $a \in O$; for multi\-select MCQ $a \subseteq O$\. An answer is scored by a rule\-based scorer that applies numeric tolerance to standard answers and partial credit to MCQ answers, giving a per\-sample score in $[0,1]$ \(mean answer accuracy\); a sample counts as exactly correct when its score is $\geq 0.999$ \(exact accuracy\)\. Beyond producing $a$, we impose two deployment constraints that the architecture is built to satisfy while preserving strong answer accuracy: the system must be \(i\) *prompting\-only*, using no task\-specific fine\-tuning and no local computer\-vision models, and \(ii\) runnable on a self\-hosted open\-weights backbone, so sensitive charts never leave the institution\.
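
As an illustration of the two metrics, the following minimal sketch scores numeric and MCQ answers and then aggregates them\. The 0\.999 threshold comes from the text; the 5% relative tolerance and the Jaccard\-style partial credit are assumptions made for this example, since the exact rules of the scorer are not stated here, and all function names are hypothetical\.

```python
from typing import Iterable, Set, Tuple

EXACT_THRESHOLD = 0.999  # from the text: score >= 0.999 counts as exactly correct
REL_TOL = 0.05           # ASSUMED tolerance; the actual value is not given in the text


def score_numeric(pred: float, gold: float, rel_tol: float = REL_TOL) -> float:
    """Return 1.0 if pred lies within a relative tolerance of gold, else 0.0."""
    if gold == 0:
        return 1.0 if abs(pred) <= rel_tol else 0.0
    return 1.0 if abs(pred - gold) <= rel_tol * abs(gold) else 0.0


def score_mcq(pred: Set[str], gold: Set[str]) -> float:
    """Partial credit as Jaccard overlap (an ASSUMED scheme, for illustration)."""
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)


def summarise(scores: Iterable[float]) -> Tuple[float, float]:
    """Return (mean answer accuracy, exact accuracy) over per-sample scores."""
    scores = list(scores)
    mean_acc = sum(scores) / len(scores)
    exact_acc = sum(1 for s in scores if s >= EXACT_THRESHOLD) / len(scores)
    return mean_acc, exact_acc


# Over-selecting two extra options earns partial credit but is not exactly correct.
scores = [
    score_mcq({"A", "B", "C"}, {"A", "B", "C"}),
    score_mcq({"A", "B", "C", "D", "E"}, {"A", "B", "C"}),
    score_numeric(39.0, 39.6),
]
print(summarise(scores))
```

Under these assumptions an over\-selected multi\-select answer still contributes to mean answer accuracy while counting as a miss for exact accuracy, which is why the two metrics can diverge\.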

AgentFinVQA is a multi\-agent system of specialised agents, a planner, OCR reader, legend grounder, deterministic colour\-area tool, vision agent, and verifier, that operate sequentially over $(c,q,O)$ and produce a verified answer alongside a fully traceable Model Evaluation Packet \(MEP\)\. The pipeline executes stages in a fixed order, with gated stages skipped when their trigger conditions are not met: Plan $\rightarrow$ OCR $\rightarrow$ Ground $\rightarrow$ Colour\-Area $\rightarrow$ Inspect $\rightarrow$ Verify\. The order is not arbitrary: each stage’s structured output becomes structured evidence for the next, so cheaper and more reliable text extraction \(OCR, legend\) constrains the harder visual estimation that follows, and verification runs last against an independent view of the chart\. Figure [2](https://arxiv.org/html/2606.19782#S1.F2) shows the architecture\. Every stage writes its inputs, outputs, tool traces, and timestamps into the per\-sample MEP, a portable JSON artifact \(see Appendix [B](https://arxiv.org/html/2606.19782#A2)\) enabling reproducible evaluation, post\-hoc error attribution, and audit without re\-running the pipeline\. The formal definitions of all stage abbreviations and terminology used throughout this section are provided in Appendix [A](https://arxiv.org/html/2606.19782#A1)\.
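
The fixed\-order, gated execution with per\-stage logging can be sketched as follows\. This is an illustrative skeleton only: the stage functions are stubs standing in for LLM/VLM calls, only a subset of stages is shown, and the MEP field names \(`sample_id`, `stages`, `input_keys`, and so on\) are hypothetical rather than the schema given in Appendix B\.

```python
import json
import time
from typing import Any, Callable, Dict, List, Optional, Tuple

Context = Dict[str, Any]
# (stage name, stage function, optional gate predicate); all names are hypothetical.
Stage = Tuple[str, Callable[[Context], Dict[str, Any]], Optional[Callable[[Context], bool]]]


def run_pipeline(sample: Context, stages: List[Stage]) -> Dict[str, Any]:
    """Run stages in a fixed order and log each one into a MEP-like dict."""
    context: Context = dict(sample)
    mep: Dict[str, Any] = {"sample_id": sample.get("id"), "stages": []}
    for name, fn, gate in stages:
        if gate is not None and not gate(context):
            # Gated stage whose trigger condition is not met: record the skip.
            mep["stages"].append({"stage": name, "skipped": True})
            continue
        started = time.time()
        output = fn(context)
        mep["stages"].append({
            "stage": name,
            "skipped": False,
            "input_keys": sorted(context),
            "output": output,
            "started_at": started,
            "finished_at": time.time(),
        })
        context[name] = output  # later stages see earlier outputs as evidence
    return mep


# Stub stages standing in for the real model calls (subset of the pipeline).
stages: List[Stage] = [
    ("plan", lambda ctx: {"focus_points": ["compare series"]}, None),
    ("ocr", lambda ctx: {"legend_entries": ctx["legend"]}, None),
    ("ground", lambda ctx: {"legend_map": {}}, lambda ctx: len(ctx["legend"]) > 1),
    ("verify", lambda ctx: {"verdict": "CONFIRM"}, None),
]
mep = run_pipeline(
    {"id": "demo-1", "question": "Which year is highest?", "legend": ["Revenue"]},
    stages,
)
print(json.dumps(mep, indent=2))
```

Recording skipped stages explicitly, rather than omitting them, keeps the trace self\-describing: an auditor can tell that a gate did not fire without re\-running anything\.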

#### Planner\.

A text\-only LLM receives the question and MCQ choices but does not see the chart image\. It outputs a structured JSON inspection plan comprising two to three focus points, a question type classification, and an answerability assessment\. For MCQ questions, the planner explicitly instructs the vision agent to check *each* choice independently and verify that the selected answer is not contradicted by other data\. For multi\-select questions, it instructs the agent to compile *all* supported choices rather than stopping at the first match\.

#### OCR Reader\.

A single focused VLM call transcribes all visible text, producing structured metadata: chart type, axis labels and ticks, legend entries, data labels, and annotations \(see Appendix [B](https://arxiv.org/html/2606.19782#A2), stage field structure\)\. This output serves as structured evidence for visible text in all subsequent stages, preventing the vision agent from misreading labels already reliably extracted\.

#### Legend Grounder\.

A targeted VLM call maps each legend entry to its visual properties: colour description, approximate RGB, line style, and confidence\. This legend map is injected into the vision prompt as explicit structured evidence with the instruction not to reassign colours during value extraction\. A compliance check verifies that the vision agent’s explanation references at least one legend label by name; if not, the vision call is retried\. Compliance retries fired on 13\.2% of MEPs, confirming the check is actively catching non\-compliance\. The stage is gated on chart type and legend size\.
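
The compliance check described above can be sketched as a small wrapper around the vision call\. This is a hedged illustration: the case\-insensitive substring match, the single\-retry limit, and the names `references_legend`, `vision_with_compliance`, and `call_vision` are assumptions for the example, not the released implementation\.

```python
from typing import Any, Callable, Dict, Iterable, List


def references_legend(explanation: str, legend_labels: Iterable[str]) -> bool:
    """True if the explanation names at least one legend label (case-insensitive)."""
    text = explanation.casefold()
    return any(label.casefold() in text for label in legend_labels if label.strip())


def vision_with_compliance(
    call_vision: Callable[[], Dict[str, Any]],
    legend_labels: List[str],
    max_retries: int = 1,  # ASSUMED; the text only says the call "is retried"
) -> Dict[str, Any]:
    """Call the vision agent, retrying while its explanation ignores the legend."""
    result = call_vision()
    retries = 0
    while retries < max_retries and not references_legend(
        result["explanation"], legend_labels
    ):
        result = call_vision()
        retries += 1
    result = dict(result)
    result["compliance_retries"] = retries  # logged so the retry is auditable
    return result


# Stubbed vision agent: the first response ignores the legend, the second names it.
responses = iter([
    {"explanation": "The tallest bar is about 40."},
    {"explanation": "The Revenue series peaks in FY20."},
])
result = vision_with_compliance(lambda: next(responses), ["Revenue", "Cost"])
print(result)
```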

Table 1: AgentFinVQA vs\. model\-matched zero\-shot baseline on FinMME\. McNemar: Gemini\-3 $p \approx 1.1 \times 10^{-16}$; Qwen3\.6\-27B\-FP8 $p \approx 3.0 \times 10^{-6}$\. †Standard $\Delta$ for Qwen within noise \(CI $\approx \pm 10$ pp\)\.

#### Colour\-Area Tool\.

To address perceptual estimation errors on stacked bar and pie charts, we introduce a deterministic pixel\-counting stage between legend grounding and vision\. Given the legend RGB map produced by the previous stage, the tool applies per\-series HSV colour masks \($\pm 10$ hue, $\pm 40$ saturation/value\) to count matched pixels per legend entry, producing a dominant\-label hint injected into the vision prompt\. Unlike the local computer\-vision models used by prior agentic systems Kaur et al\. \([2026](https://arxiv.org/html/2606.19782#bib.bib16)\), this stage requires no GPU, no trained model, and no segmentation infrastructure; it is a lightweight rule\-based computation that runs anywhere Python and OpenCV are available\.

The stage is gated conservatively: it fires only when chart type is bar, pie, or donut; the legend has $>1$ entry; the question contains a comparison keyword \(largest, smallest, most, least, etc\.\); and no colour ambiguity is detected between legend entries \(pairwise HSV hue distance $>15$\)\. Suppression flags and the full pixel breakdown are stored in the MEP ColorArea field for post\-hoc audit \(Appendix [B](https://arxiv.org/html/2606.19782#A2)\)\.
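
The gating and pixel counting can be illustrated with a dependency\-free sketch\. It uses Python's standard `colorsys` module in place of OpenCV so that it runs anywhere, and it assumes OpenCV\-style HSV ranges \(hue 0–179, saturation/value 0–255\) for the tolerances; the text does not state the scale\. The keyword set contains only the four keywords named above, and all function names are hypothetical\.

```python
import colorsys
import re
from typing import Dict, Iterable, Optional, Tuple

RGB = Tuple[int, int, int]
HUE_TOL, SV_TOL, MIN_HUE_GAP = 10, 40, 15  # tolerances and gate from the text
KEYWORDS = {"largest", "smallest", "most", "least"}  # the text's list continues ("etc.")


def to_hsv(rgb: RGB) -> Tuple[float, float, float]:
    """RGB (0-255) to HSV on an ASSUMED OpenCV-like scale: H 0-180, S/V 0-255."""
    h, s, v = colorsys.rgb_to_hsv(*(c / 255.0 for c in rgb))
    return h * 180.0, s * 255.0, v * 255.0


def hue_distance(h1: float, h2: float) -> float:
    """Circular distance between two hues on the 0-180 scale."""
    d = abs(h1 - h2) % 180.0
    return min(d, 180.0 - d)


def should_fire(chart_type: str, legend: Dict[str, RGB], question: str) -> bool:
    """Apply the four gating conditions described in the text."""
    hues = [to_hsv(rgb)[0] for rgb in legend.values()]
    unambiguous = all(
        hue_distance(hues[i], hues[j]) > MIN_HUE_GAP
        for i in range(len(hues)) for j in range(i + 1, len(hues))
    )
    has_keyword = bool(KEYWORDS & set(re.findall(r"[a-z]+", question.lower())))
    return chart_type in {"bar", "pie", "donut"} and len(legend) > 1 \
        and has_keyword and unambiguous


def count_pixels(pixels: Iterable[RGB], legend: Dict[str, RGB]) -> Dict[str, int]:
    """Count pixels falling inside each legend entry's HSV tolerance box."""
    targets = {label: to_hsv(rgb) for label, rgb in legend.items()}
    counts = {label: 0 for label in legend}
    for px in pixels:
        h, s, v = to_hsv(px)
        for label, (th, ts, tv) in targets.items():
            if hue_distance(h, th) <= HUE_TOL and abs(s - ts) <= SV_TOL \
                    and abs(v - tv) <= SV_TOL:
                counts[label] += 1
    return counts


def dominant_label(counts: Dict[str, int]) -> Optional[str]:
    """Label with the most matched pixels, or None if nothing matched."""
    return max(counts, key=counts.get) if any(counts.values()) else None


legend = {"Retail": (220, 40, 40), "Wholesale": (40, 40, 220)}
pixels = [(220, 40, 40), (210, 50, 50), (220, 40, 40), (40, 40, 220), (255, 255, 255)]
if should_fire("bar", legend, "Which segment is the largest?"):
    counts = count_pixels(pixels, legend)
    print(counts, dominant_label(counts))
```

On a real chart the pixel list would come from the decoded image array; the white background pixel in the toy input falls outside both tolerance boxes and is ignored\.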

#### Vision Agent\.

A CrewAI\-orchestrated agent CrewAI \([2026](https://arxiv.org/html/2606.19782#bib.bib3)\) executes the inspection plan with the chart image, OCR metadata, legend map, and colour\-area hint \(when available\)\. It produces a draft answer, explanation, and per\-choice confidence scores\. Three prompt paths handle single\-select MCQ, multi\-select MCQ, and open\-ended questions\. If the vision agent returns UNANSWERABLE on an MCQ question, a forced\-choice retry re\-runs with an explicit instruction to select the most plausible option, reducing the UNANSWERABLE rate\.

#### Verifier\.

A second independent VLM call audits the draft answer against the chart image\. It receives the draft answer, explanation, per\-choice analysis, MCQ choices, analyst caption, and the sample’s related sentences\. Trace analysis showed that many confirmed\-wrong cases contradicted information explicitly stated in this field\. The verifier produces a CONFIRM or REVISE verdict and a self\-reported confidence score, both recorded in the per\-sample MEP \(Appendix [B](https://arxiv.org/html/2606.19782#A2)\)\. A confidence gate downgrades revisions with confidence $<0.75$ to confirmations, preventing the verifier from overriding high\-quality vision answers with uncertain revisions\. Note that the verifier is distinct from the evaluation judge: the verifier is a pipeline stage that runs on every sample at inference time, whereas the judge is a separate model used only for post\-hoc failure categorisation in Section [4\.4](https://arxiv.org/html/2606.19782#S4.SS4)\.
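
The confidence gate amounts to a one\-line rule, sketched below\. The 0\.75 threshold is taken from the text; the `Verdict` dataclass, its field names, and the choice to restore the draft answer when a revision is downgraded are assumptions made for this illustration\.

```python
from dataclasses import dataclass, replace

CONFIDENCE_GATE = 0.75  # threshold from the text


@dataclass(frozen=True)
class Verdict:
    """Hypothetical container for the verifier's output."""
    verdict: str        # "CONFIRM" or "REVISE"
    confidence: float   # self-reported, not calibrated
    answer: str         # the answer the verifier proposes
    gated: bool = False # True if the gate downgraded a revision


def apply_confidence_gate(
    draft_answer: str, v: Verdict, threshold: float = CONFIDENCE_GATE
) -> Verdict:
    """Downgrade low-confidence revisions to confirmations of the draft answer."""
    if v.verdict == "REVISE" and v.confidence < threshold:
        # ASSUMPTION: a downgraded revision falls back to the vision agent's draft.
        return replace(v, verdict="CONFIRM", answer=draft_answer, gated=True)
    return v


uncertain = apply_confidence_gate("FY19", Verdict("REVISE", 0.60, "FY20"))
confident = apply_confidence_gate("FY19", Verdict("REVISE", 0.90, "FY20"))
print(uncertain)
print(confident)
```

Keeping a `gated` flag in the record means a reviewer can later distinguish a genuine confirmation from a revision that was suppressed by the gate\.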

#### Backend flexibility\.

All LLM stages route through a configurable backend abstraction supporting the Gemini API, OpenAI\-compatible endpoints, and locally\-served models via vLLM, allowing different stages to use different models: lightweight ones for structured extraction \(OCR, legend grounding\) and more capable ones for reasoning\-intensive stages \(planning, vision, verification\)\. For the open\-weights evaluation we serve Qwen3\.6\-27B\-FP8 Qwen Team \([2026](https://arxiv.org/html/2606.19782#bib.bib23)\) on a single A100\-80G with no prompt modification, satisfying on\-premise data\-residency requirements for chart inference without any external API dependency\.

Table 2: Component contributions to the AgentFinVQA pipeline\. Multi\-select MCQ support is the largest isolated accuracy gain\. The colour\-area tool’s primary value is deterministic grounding and audit traceability rather than aggregate accuracy lift\.

## 4 Results

### 4\.1 Experimental Setup

Dataset\. We evaluate on FinMME Luo et al\. \([2025](https://arxiv.org/html/2606.19782#bib.bib9)\), a financial VQA benchmark of $\sim$11,000 samples spanning bar, line, pie, stacked bar, and combination charts across single\-select MCQ, multi\-select MCQ, and open\-ended standard question formats\.

Models\. We compare AgentFinVQA against primary\-backbone matched zero\-shot baselines, each a single structured VLM call requesting JSON output: Gemini\-3 Flash Google DeepMind \([2025](https://arxiv.org/html/2606.19782#bib.bib2)\) for the proprietary configuration and Qwen3\.6\-27B\-FP8 for the open\-weights configuration\. All systems receive the same auxiliary analyst caption and related\-sentence fields when available, so these inputs are controlled across comparisons\. In the Gemini\-based configuration, Gemini\-3 Flash handles planning, vision, and verification; OCR and legend grounding use Gemini\-2\.5 Flash Lite, which suffices for structured text extraction at lower cost\. In the Qwen\-based configuration, Qwen3\.6\-27B\-FP8 runs all inference stages via vLLM on a single A100\-80G; Gemini Batch API is used only for post\-hoc error analysis\.

Metrics\. Accuracy is measured by a rule\-based scorer with numeric tolerance for standard answers and partial credit for MCQ \(mean answer accuracy\), and by the fraction of samples scoring $\geq 0.999$ \(exact accuracy\)\. Statistical significance uses the paired McNemar test on binary strict correctness\.
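
For readers unfamiliar with the paired McNemar test, the sketch below computes it from per\-sample strict correctness of the baseline and the agent\. It uses the exact two\-sided binomial form on the discordant pairs; which variant \(exact or chi\-square\) the reported p\-values use is not stated here, so this is an assumption, and the toy data are invented for illustration\.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(
    baseline_correct: Sequence[bool], agent_correct: Sequence[bool]
) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired binary outcomes.

    Returns (b, c, p) where b counts samples only the baseline got right,
    c counts samples only the agent got right, and p is the p-value under
    the null hypothesis that discordant pairs split 50/50.
    """
    if len(baseline_correct) != len(agent_correct):
        raise ValueError("paired samples must have equal length")
    pairs = list(zip(baseline_correct, agent_correct))
    b = sum(1 for base, agent in pairs if base and not agent)
    c = sum(1 for base, agent in pairs if agent and not base)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence of a difference
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)


# Toy paired outcomes (1 = exactly correct); not data from the paper.
baseline = [1, 1, 0, 0, 0, 0, 1, 0]
agent = [1, 0, 1, 1, 1, 1, 1, 1]
print(mcnemar_exact(baseline, agent))
```

Only the discordant pairs carry information: samples that both systems answer correctly, or both answer incorrectly, do not affect the statistic\.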

### 4\.2 Performance Evaluation

Our first experiment tests the core claim: does agentic decomposition beat a model\-matched zero\-shot baseline, and does the gain hold across both a proprietary and an open\-weights backbone? Table [1](https://arxiv.org/html/2606.19782#S3.T1) presents the comparison\.

The agent improves over zero\-shot on both backbones, confirming the gains are not artefacts of a specific proprietary model\. MCQ questions drive the aggregate on both backbones \(\+8\.1 pp Gemini, \+5\.3 pp Qwen\), while standard open\-ended questions show a smaller gain for Gemini \(\+3\.0 pp\) and no reliable gain for Qwen \($-1.0$ pp, wide CI\)\. The smaller overall Qwen gain \(4\.84 pp vs\. 7\.68 pp\) is partly attributable to verifier over\-revision: the Qwen verifier modifies 41% of vision\-agent answers \(vs\. 17% for Gemini\), and revised answers score substantially lower than confirmed ones, suggesting the verifier frequently overrides correct vision outputs\.

### 4\.3 Ablations

Our second experiment isolates the contribution of each major design decision, drawn from the most cleanly isolated comparisons in our development data \(Table [2](https://arxiv.org/html/2606.19782#S3.T2)\)\. Development runs were conducted on a 25% sample of FinMME \($n \approx 2{,}775$\) due to API cost constraints\.

The most defensible isolated gain is multi\-select MCQ support: adding separate planner, vision, and verifier prompt paths for select\-all\-that\-apply questions raised multiple\_choice accuracy from 30\.2% to 53\.5% \(\+23\.3 pp\), the largest single\-component contribution in the pipeline\. The full development progression is in Appendix [C](https://arxiv.org/html/2606.19782#A3)\.

### 4\.4 Error Analysis and Uncertainty

We classify all incorrect outputs from the best run using a judge\-generated taxonomy \(Figure [3](https://arxiv.org/html/2606.19782#S4.F3)\)\. Three major categories account for nearly two\-thirds of the Gemini\-3 failures: question misunderstanding, where the agent answers a related but different question \(typically inverting a superlative or misreading a trend direction\); legend confusion, where the series\-to\-colour mapping is incorrect, most common on multi\-series line or complex stacked charts; and extraction error, where the numeric value is read wrongly despite the correct chart element being identified\. The remaining failures are split across hallucinated elements, axis misreads, and other minor categories\.

![Refer to caption](https://arxiv.org/html/2606.19782v1/fig4a_failure_distribution.png)

Figure 3: Failure type distribution across backbones \(deltas: Qwen3\.6 $-$ Gemini\-3\)\. The proprietary backbone fails more on question misunderstanding and hallucinated elements; the open\-weights backbone shows higher extraction error and axis misread rates\.

![Refer to caption](https://arxiv.org/html/2606.19782v1/fig4b_verifier_heatmap.png)

Figure 4: Verifier revision rate per failure category\. Green cells indicate failure modes the verifier frequently flags for revision; red cells indicate under\-detection\. Qwen3\.6 consistently shows higher revision rates across all categories, reflecting greater verifier intervention on a weaker backbone\.

#### Verifier as a routing signal\.

Figure [4](https://arxiv.org/html/2606.19782#S4.F4) shows the verifier revision rate per failure category\. A practitioner can implement a simple routing rule: revised answers are prioritized for analyst review, while confirmed answers are treated as lower\-priority review items\. This concentrates human effort on the $\sim$19% of outputs most likely to be wrong, while the remaining $\sim$81% achieve 68% exact accuracy, offering a pragmatic accuracy\-auditability tradeoff for financial deployment\.
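
The routing rule can be sketched in a few lines over per\-sample records\. The record layout \(`id`, `verdict`\) is hypothetical and stands in for whatever fields the MEP actually exposes; the rule itself is the one stated above\.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def route_for_review(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Split records into (priority, lower_priority) queues by verifier verdict.

    Revised answers go to the priority queue for analyst review; confirmed
    answers go to the lower-priority queue. Input order is preserved.
    """
    priority = [r for r in records if r["verdict"] == "REVISE"]
    lower_priority = [r for r in records if r["verdict"] != "REVISE"]
    return priority, lower_priority


# Toy records with hypothetical field names; not data from the paper.
records = [
    {"id": "s1", "verdict": "CONFIRM"},
    {"id": "s2", "verdict": "REVISE"},
    {"id": "s3", "verdict": "CONFIRM"},
    {"id": "s4", "verdict": "REVISE"},
]
priority, lower_priority = route_for_review(records)
print([r["id"] for r in priority], [r["id"] for r in lower_priority])
```

Because confirmed answers are still wrong a substantial fraction of the time, treating them as lower priority rather than exempt from review matches the wording of the rule\.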

## 5 Conclusion

We presented AgentFinVQA, a multi\-agent pipeline for financial chart question answering that decomposes each query into planning, OCR, legend grounding, visual inspection, and verification, recording every step in a traceable Model Evaluation Packet\. On FinMME it improves \+7\.68 pp over a model\-matched zero\-shot baseline with a proprietary backbone and \+4\.84 pp with an open\-weights model served locally, confirming that the gains do not depend on a proprietary API\. The verifier’s verdict serves as a review\-routing signal, letting analysts focus on the outputs most likely to require correction\. Error analysis reveals that question misunderstanding, legend confusion and extraction error account for nearly two\-thirds of remaining Gemini\-3 failures, precisely the categories the verifier is least effective at self\-detecting\. Together these results show that auditable, on\-premise financial chart QA is practical while preserving the accuracy gains of an agentic approach\.

## Limitations

We acknowledge some limitations of our evaluation\. First, we evaluate only on FinMME; while it is large and domain\-diverse, we have not tested cross\-dataset generalization to established financial chart benchmarks such as FinChart\-Bench or MME\-Finance, so transfer to other annotation styles and chart distributions remains open\. Second, the open\-weights configuration recovers most of the MCQ gain but shows no reliable gain on open\-ended standard questions \($-1.0$ pp, within noise\), indicating that on\-premise deployment with open\-weight models still carries a modest accuracy cost on the hardest question type\. Third, the colour\-area tool activates on only 5% of the dataset, so evidence for it is suggestive rather than conclusive at this distribution; its contribution is architectural \(deterministic grounding and audit trace\) rather than aggregate accuracy, and its benefit may be larger on chart distributions richer in stacked\-bar and pie comparisons\. Finally, the verifier’s confidence is self\-reported rather than calibrated, and although we show that its CONFIRM/REVISE verdict separates answer quality, we do not yet evaluate the proposed human\-in\-the\-loop routing with real analysts; quantifying reviewer time saved is left to future work\.

## Acknowledgments

Resources used in preparing this research were provided, in part, by the Province of Ontario and the Government of Canada through CIFAR, as well as companies sponsoring the Vector Institute \([http://www\.vectorinstitute\.ai/\#partners](http://www.vectorinstitute.ai/#partners)\)\.

This research was funded by the European Union’s Horizon Europe research and innovation programme under the AIXPERT project \(Grant Agreement No\. 101214389\), which aims to develop an agentic, multi\-layered, GenAI\-powered framework for creating explainable, accountable, and transparent AI systems\.

## References

- CrewAI\. Note: [https://crewai\.com](https://crewai.com/) Accessed: 2026\-06\-10\. Cited by: [§3](https://arxiv.org/html/2606.19782#S3.SS0.SSS0.Px6.p1.1)\.
- Z\. Gan, Y\. Lu, D\. Zang, H\. Li, C\. Liu, J\. Liu, J\. Liu, H\. Wu, C\. Fu, Z\. Xu, R\. Zhang, and Y\. Dai \(2024\) MME\-Finance: a multimodal finance benchmark for expert\-level understanding and reasoning\. arXiv preprint arXiv:2411\.03314\. External Links: [Link](https://arxiv.org/abs/2411.03314)\. Cited by: [§2](https://arxiv.org/html/2606.19782#S2.SS0.SSS0.Px1.p1.1)\.
- Google DeepMind \(2025\) Gemini 3 Flash model card\. Technical report, Google\. External Links: [Link](https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf)\. Cited by: [§4\.1](https://arxiv.org/html/2606.19782#S4.SS1.p2.1)\.
- M\. Huang, L\. Zhang, J\. Ma, H\. Lai, F\. Xu, Y\. Li, W\. Wu, Y\. Wu, and J\. Liu \(2026\) ChartSketcher: reasoning with multimodal feedback and reflection for chart understanding\. Advances in Neural Information Processing Systems 38, pp\. 70485–70515\. Cited by: [§1](https://arxiv.org/html/2606.19782#S1.p3.1), [§2](https://arxiv.org/html/2606.19782#S2.SS0.SSS0.Px3.p1.1)\.
- K\. Kafle, B\. Price, S\. Cohen, and C\. Kanan \(2018\) DVQA: understanding data visualizations via question answering\. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp\. 5648–5656\. External Links: [Link](https://openaccess.thecvf.com/content_cvpr_2018/papers/Kafle_DVQA_Understanding_Data_CVPR_2018_paper.pdf)\. Cited by: [§2](https://arxiv.org/html/2606.19782#S2.SS0.SSS0.Px1.p1.1)\.
- S\. E\. Kahou, V\. Michalski, A\. Atkinson, A\. Kadar, A\. Trischler, and Y\. Bengio \(2018\) FigureQA: an annotated figure dataset for visual reasoning\. External Links: 1710\.07300, [Link](https://arxiv.org/abs/1710.07300)\. Cited by: [§2](https://arxiv.org/html/2606.19782#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Kaur, N\. Srishankar, Z\. Zeng, S\. Ganesh, and M\. Veloso \(2026\) ChartAgent: a multimodal agent for visually grounded reasoning in complex chart question answering\. External Links: 2510\.04514, [Link](https://arxiv.org/abs/2510.04514)\. Cited by: [§1](https://arxiv.org/html/2606.19782#S1.p3.1), [§2](https://arxiv.org/html/2606.19782#S2.SS0.SSS0.Px3.p1.1), [§3](https://arxiv.org/html/2606.19782#S3.SS0.SSS0.Px5.p1.2)\.
- F\. Liu, J\. Eisenschlos, F\. Piccinno, S\. Krichene, C\. Pang, K\. Lee, M\. Joshi, W\. Chen, N\. Collier, and Y\. Altun \(2023a\) DePlot: one\-shot visual language reasoning by plot\-to\-table translation\. In Findings of the Association for Computational Linguistics: ACL 2023, A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\), Toronto, Canada, pp\. 10381–10399\. External Links: [Link](https://aclanthology.org/2023.findings-acl.660/), [Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.660)\. Cited by: [§1](https://arxiv.org/html/2606.19782#S1.p3.1), [§2](https://arxiv.org/html/2606.19782#S2.SS0.SSS0.Px2.p1.1)\.
- F\. Liu, F\. Piccinno, S\. Krichene, C\. Pang, K\. Lee, M\. Joshi, Y\. Altun, N\. Collier, and J\. Eisenschlos \(2023b\) MatCha: enhancing visual language pretraining with math reasoning and chart derendering\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\), Toronto, Canada, pp\. 12756–12770\. External Links: [Link](https://aclanthology.org/2023.acl-long.714/), [Document](https://dx.doi.org/10.18653/v1/2023.acl-long.714)\. Cited by: [§2](https://arxiv.org/html/2606.19782#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Luo, Z\. Kou, L\. Yang, X\. Luo, J\. Huang, Z\. Xiao, J\. Peng, C\. Liu, J\. Ji, X\. Liu, S\. Han, M\. Zhang, and Y\. Guo \(2025\) FinMME: benchmark dataset for financial multi\-modal reasoning evaluation\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 29465–29489\. External Links: [Link](https://aclanthology.org/2025.acl-long.1426/), [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1426), ISBN 979\-8\-89176\-251\-0\. Cited by: [§1](https://arxiv.org/html/2606.19782#S1.p5.2), [§2](https://arxiv.org/html/2606.19782#S2.SS0.SSS0.Px1.p1.1), [§4\.1](https://arxiv.org/html/2606.19782#S4.SS1.p1.1)\.
- A\. Masry, M\. S\. Islam, M\. Ahmed, A\. Bajaj, F\. Kabir, A\. Kartha, M\. T\. R\. Laskar, M\. Rahman, S\. Rahman, M\. Shahmohammadi, M\. Thakkar, M\. R\. Parvez, E\. Hoque, and S\. Joty \(2025\) ChartQAPro: a more diverse and challenging benchmark for chart question answering\. In Findings of the Association for Computational Linguistics: ACL 2025, W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 19123–19151\. External Links: [Link](https://aclanthology.org/2025.findings-acl.978/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.978), ISBN 979\-8\-89176\-256\-5\. Cited by: [§2](https://arxiv.org/html/2606.19782#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Masry, P\. Kavehzadeh, X\. L\. Do, E\. Hoque, and S\. Joty \(2023\) UniChart: a universal vision\-language pretrained model for chart comprehension and reasoning\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\), Singapore, pp\. 14662–14684\. External Links: [Link](https://aclanthology.org/2023.emnlp-main.906/), [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.906)\. Cited by: [§2](https://arxiv.org/html/2606.19782#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Masry, D\. X\. Long, J\. Q\. Tan, S\. Joty, and E\. Hoque \(2022\) ChartQA: a benchmark for question answering about charts with visual and logical reasoning\. In Findings of the Association for Computational Linguistics: ACL 2022, S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\), Dublin, Ireland, pp\. 2263–2279\. External Links: [Link](https://aclanthology.org/2022.findings-acl.177/), [Document](https://dx.doi.org/10.18653/v1/2022.findings-acl.177)\. Cited by: [§2](https://arxiv.org/html/2606.19782#S2.SS0.SSS0.Px1.p1.1)\.
- N\. Methani, P\. Ganguly, M\. M\. Khapra, and P\. Kumar \(2020\) PlotQA: reasoning over scientific plots\. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp\. 1527–1536\. External Links: [Link](https://openaccess.thecvf.com/content_WACV_2020/papers/Methani_PlotQA_Reasoning_over_Scientific_Plots_WACV_2020_paper.pdf)\. Cited by: [§2](https://arxiv.org/html/2606.19782#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Pan, Y\. Wu, J\. Hua, J\. Feng, S\. Yan, B\. Deng, Z\. Cao, and J\. Ye \(2026\)Through the lens of contrast: self\-improving visual reasoning in vlms\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.19782#S2.SS0.SSS0.Px4.p1.1)\.
- Qwen Team \(2026\)Qwen3\.6\-27B: flagship\-level coding in a 27B dense model\.External Links:[Link](https://qwen.ai/blog?id=qwen3.6-27b)Cited by:[§3](https://arxiv.org/html/2606.19782#S3.SS0.SSS0.Px8.p1.1)\.
- D\. Shu, H\. Yuan, Y\. Wang, Y\. Liu, H\. Zhang, H\. Zhao, and M\. Du \(2025\)FinChart\-bench: benchmarking financial chart comprehension in vision\-language models\.External Links:2507\.14823,[Link](https://arxiv.org/abs/2507.14823)Cited by:[§1](https://arxiv.org/html/2606.19782#S1.p1.1),[§2](https://arxiv.org/html/2606.19782#S2.SS0.SSS0.Px1.p1.1)\.
- B\. Wang, C\. Ren, J\. Yang, X\. Liang, J\. Bai, L\. Chai, Z\. Yan, Q\. Zhang, D\. Yin, X\. Sun, and Z\. Li \(2025\)MAC\-SQL: a multi\-agent collaborative framework for text\-to\-SQL\.InProceedings of the 31st International Conference on Computational Linguistics,O\. Rambow, L\. Wanner, M\. Apidianaki, H\. Al\-Khalifa, B\. D\. Eugenio, and S\. Schockaert \(Eds\.\),Abu Dhabi, UAE,pp\. 540–557\.External Links:[Link](https://aclanthology.org/2025.coling-main.36/)Cited by:[§2](https://arxiv.org/html/2606.19782#S2.SS0.SSS0.Px3.p1.1)\.
- B\. Wang, X\. Wang, Y\. Chen, X\. Li, J\. Xu, J\. Yuan, and C\. Liu \(2026\)Chartagent: a chart understanding framework with tool integrated reasoning\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 2773–2782\.External Links:[Link](https://openaccess.thecvf.com/content/CVPR2026F/papers/Wang_ChartAgent_A_Chart_Understanding_Framework_with_Tool_Integrated_Reasoning_CVPRF_2026_paper.pdf)Cited by:[§1](https://arxiv.org/html/2606.19782#S1.p3.1),[§2](https://arxiv.org/html/2606.19782#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023\)ReAct: synergizing reasoning and acting in language models\.InInternational Conference on Learning Representations \(ICLR\),External Links:[Link](https://arxiv.org/abs/2210.03629)Cited by:[§2](https://arxiv.org/html/2606.19782#S2.SS0.SSS0.Px3.p1.1)\.
- M\. Zhang, J\. Fu, T\. Warrier, Y\. Wang, T\. Tan, and K\. Huang \(2025\)FAITH: a framework for assessing intrinsic tabular hallucinations in finance\.InProceedings of the 6th ACM International Conference on AI in Finance,pp\. 159–167\.External Links:[Link](https://dl.acm.org/doi/pdf/10.1145/3768292.3770433)Cited by:[§1](https://arxiv.org/html/2606.19782#S1.p1.1),[§2](https://arxiv.org/html/2606.19782#S2.SS0.SSS0.Px4.p1.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. Xing, H\. Zhang, J\. Gonzalez, and I\. Stoica \(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.InAdvances in Neural Information Processing Systems,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),Vol\.36,pp\. 46595–46623\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2023/file/91f18a1287b398d378ef22505bf41832-Paper-Datasets_and_Benchmarks.pdf)Cited by:[§2](https://arxiv.org/html/2606.19782#S2.SS0.SSS0.Px4.p1.1)\.

## Appendix A Notation and Stage Definitions

Table [3](https://arxiv.org/html/2606.19782#A1.T3) defines the key terms and stage abbreviations used throughout this paper\.

Table 3: Notation and stage definitions used throughout the paper\.

## Appendix B MEP Schema and Example

Each pipeline run produces one Model Evaluation Packet \(MEP\): a portable JSON artifact that records every stage’s inputs, outputs, parsed results, tool traces, and wall\-clock timestamps\. Figure [2](https://arxiv.org/html/2606.19782#S1.F2) in the main paper points to the MEP as the pipeline’s audit output; this appendix shows its structure\.

#### Top\-level fields\.

The MEP contains the following top\-level keys:

```
{
  "schema_version":   "mep.v1",
  "run_id":           "1748647a-...",
  "sample_id":        "finmme_000006",
  "config": {
    "planner_model":  "gemini-3-flash-preview",
    "vision_model":   "gemini-3-flash-preview",
    "judge_backend":  "gemini"
  },
  "plan":             { ... },
  "ocr":              { ... },
  "legend_grounding": { ... },
  "color_area":       null,
  "vision":           { ... },
  "verifier":         { ... },
  "answer":           "C",
  "answer_accuracy":  1.0,
  "verifier_verdict": "confirmed",
  "timestamps":       { "start": "2026-05-12T19:48:34Z",
                        "end":   "2026-05-12T19:49:18Z" },
  "errors":           [],
  "lf_trace_id":      "2d62cdb2-..."
}
```
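
As a minimal sketch of how a consumer might sanity\-check a packet, the Python below tests for the top\-level keys listed above and derives the wall\-clock duration from the timestamps\. The function names `missing_top_level_keys` and `run_duration_seconds` are hypothetical and are not taken from the released code; the key set is copied from the example, and treating every key as mandatory is an assumption\.

```python
from datetime import datetime

# Keys copied from the example packet above. Treating all of them as
# mandatory is an assumption, not a rule stated in the paper.
EXPECTED_KEYS = {
    "schema_version", "run_id", "sample_id", "config", "plan", "ocr",
    "legend_grounding", "color_area", "vision", "verifier", "answer",
    "answer_accuracy", "verifier_verdict", "timestamps", "errors",
    "lf_trace_id",
}

# Timestamp layout as it appears in the example (UTC with a literal "Z").
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def missing_top_level_keys(mep: dict) -> list[str]:
    """Return the expected top-level keys absent from the packet, sorted."""
    return sorted(EXPECTED_KEYS - set(mep))


def run_duration_seconds(mep: dict) -> float:
    """Return the wall-clock duration between the start and end timestamps."""
    start = datetime.strptime(mep["timestamps"]["start"], TIMESTAMP_FORMAT)
    end = datetime.strptime(mep["timestamps"]["end"], TIMESTAMP_FORMAT)
    return (end - start).total_seconds()
```

Applied to the timestamps in the example packet, `run_duration_seconds` gives 44 seconds\.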

#### Stage field structure\.

Every stage field \(e\.g\. ocr, vision\) follows a common schema:

```
{
  "chart_type":  "bar",
  "x_axis":      { "ticks": ["FY19","FY20","FY21",
                              "FY24F","FY25F","FY26E"] },
  "legend":      ["Employee cost (INR Mn) (LHS)",
                  "Rental cost (INR Mn) (LHS)",
                  "Overheads as a % of sales (RHS)"],
  "data_labels": ["19.3%","18.9%"],
  "parse_error": false,
  "tool_trace":  {
    "tool":       "ocr_reader_tool",
    "model":      "gemini-2.5-flash-lite",
    "elapsed_ms": 1640.8
  }
}
```
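
One way to use this structure for auditing is to collect the parse status and tool latency of each stage that ran\. The sketch below assumes that every non\-null stage carries `parse_error` and a `tool_trace` with `elapsed_ms`, as in the example above; that is an inference from the single stage shown, and the helper name `stage_summary` is hypothetical\.

```python
# Stage keys as they appear in the top-level packet example.
STAGES = ("plan", "ocr", "legend_grounding", "color_area", "vision", "verifier")


def stage_summary(mep: dict) -> dict:
    """Map each stage that ran to a (parse_error, elapsed_ms) pair.

    Assumes each stage object has "parse_error" and "tool_trace.elapsed_ms";
    missing fields fall back to False and None rather than raising.
    """
    summary = {}
    for name in STAGES:
        stage = mep.get(name)
        if stage is None:
            # A null stage (e.g. color_area when not triggered) is skipped.
            continue
        trace = stage.get("tool_trace") or {}
        summary[name] = (
            bool(stage.get("parse_error", False)),
            trace.get("elapsed_ms"),
        )
    return summary
```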

#### Colour\-area fields\.

The color\_area field is null when gating conditions are not met \(chart type, comparison keyword, or colour ambiguity checks fail\); on this sample the stage was not triggered\. When active, it stores:

```
{
  "triggered":            true,
  "breakdown":            {"Series A": 14823,
                            "Series B": 9102},
  "largest":              "Series A",
  "total_pixels_matched": 23925,
  "low_confidence":       false,
  "color_ambiguity":      false,
  "parse_error":          false
}
```
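
In the example above, the breakdown counts sum to the reported total \(14823 \+ 9102 = 23925\) and the largest entry matches `largest`\. A reviewer could check these relations with a short sketch like the one below\. That the two relations hold for every packet is an assumption drawn from this one example, and `check_color_area` is a hypothetical helper, not part of the released code\.

```python
def check_color_area(stage: dict) -> dict:
    """Cross-check the derived fields of an active color_area stage.

    Assumes "breakdown" is a non-empty mapping of series name to pixel
    count, as in the example above.
    """
    breakdown = stage["breakdown"]
    total = sum(breakdown.values())
    largest = max(breakdown, key=breakdown.get)
    return {
        # Does the reported total equal the sum of per-series counts?
        "total_matches": total == stage["total_pixels_matched"],
        # Does the reported largest series have the highest count?
        "largest_matches": largest == stage["largest"],
        # Per-series share of matched pixels (empty if nothing matched).
        "shares": {
            name: count / total for name, count in breakdown.items()
        } if total else {},
    }
```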

## Appendix C Development Progression

Table [4](https://arxiv.org/html/2606.19782#A3.T4) shows the iterative development history\. It is provided for reproducibility; the main paper reports component contributions rather than version history\.

Table 4: Iterative development history on a fixed development sample \($n \approx 2{,}775$, 25% of the FinMME dataset\)\. The v6–v7 bug was caught by inspecting MEP traces, demonstrating the audit value of per\-sample traceability\.
