ARMOR: An Agentic Framework for Reaction Feasibility Prediction via Adaptive Utility-aware Multi-tool Reasoning

arXiv cs.AI 05/11/26, 04:00 AM Papers
Summary
The article introduces ARMOR, an agentic framework for predicting chemical reaction feasibility by adaptively prioritizing multiple AI tools and resolving conflicts among their predictions. It reports better performance than single-tool, tool aggregation, and tool selection baselines on a public dataset.
arXiv:2605.07103v1 Announce Type: new Abstract: Reaction feasibility prediction, as a fundamental problem in computational chemistry, has benefited from diverse tools enabled by recent advances in artificial intelligence, particularly large language models. However, the performance of individual tools varies substantially across reactions, making it difficult for any single tool to consistently perform well across all cases. This raises a critical challenge: how to effectively leverage multiple tools to obtain more accurate feasibility predictions. To address this, we propose ARMOR, an agentic framework that explicitly models tool-specific utilities, adaptively prioritizes tools, and further resolves the potential tool conflicts to produce the final prediction for each reaction. Unlike existing approaches that rely on simple aggregation or heuristic assignment over various tools, ARMOR organizes tools into a hierarchy that prioritizes top-performing tools and defers others when needed, characterizes their strengths through tool-specific patterns, and resolves conflicts via memory-augmented reasoning. Extensive experiments on a public dataset demonstrate that ARMOR consistently outperforms strong baselines, including single-tool methods as well as various tool aggregation and tool selection approaches. Further analysis shows that the improvements are particularly significant on reactions with conflicting tool predictions, highlighting the effectiveness of ARMOR in leveraging the complementary strengths of multiple tools. The code is available via https://anonymous.4open.science/r/ARMOR-E13F.
Cached at: 05/11/26, 07:12 AM
# ARMOR: An Agentic Framework for Reaction Feasibility Prediction via Adaptive Utility-aware Multi-tool Reasoning
Source: [https://arxiv.org/html/2605.07103](https://arxiv.org/html/2605.07103)
Ye Liu¹, Botao Yu², Xinyi Ling², Daniel Adu-Ampratwum³, Xia Ning¹,²,³,⁴

¹Department of Biomedical Informatics, ²Department of Computer Science and Engineering, ³Division of Medicinal Chemistry and Pharmacognosy, ⁴Translational Data Analytics Institute, The Ohio State University

{liu.12989, ling.303, yu.3737, adu-ampratwum.1, ning.104}@osu.edu

###### Abstract

Reaction feasibility prediction, as a fundamental problem in computational chemistry, has benefited from diverse tools enabled by recent advances in artificial intelligence, particularly large language models. However, the performance of individual tools varies substantially across reactions, making it difficult for any single tool to consistently perform well across all cases. This raises a critical challenge: how to effectively leverage multiple tools to obtain more accurate feasibility predictions. To address this, we propose ARMOR, an agentic framework that explicitly models tool-specific utilities, adaptively prioritizes tools, and further resolves the potential tool conflicts to produce the final prediction for each reaction. Unlike existing approaches that rely on simple aggregation or heuristic assignment over various tools, ARMOR organizes tools into a hierarchy that prioritizes top-performing tools and defers others when needed, characterizes their strengths through tool-specific patterns, and resolves conflicts via memory-augmented reasoning. Extensive experiments on a public dataset demonstrate that ARMOR consistently outperforms strong baselines, including single-tool methods as well as various tool aggregation and tool selection approaches. Further analysis shows that the improvements are particularly significant on reactions with conflicting tool predictions, highlighting the effectiveness of ARMOR in leveraging the complementary strengths of multiple tools. The code is available via [https://anonymous.4open.science/r/ARMOR-E13F](https://anonymous.4open.science/r/ARMOR-E13F).

## 1 Introduction

Reaction feasibility prediction, which aims to assess whether a chemical reaction is feasible or not, is a fundamental problem in computational chemistry and chemical synthesis (Warr, [2014](https://arxiv.org/html/2605.07103#bib.bib15)). Recent advances in artificial intelligence, particularly large language models (LLMs), have led to a variety of tools for this task, such as classification-based feasibility predictors (Chainani et al., [2025](https://arxiv.org/html/2605.07103#bib.bib11)), forward-generative models (Irwin et al., [2022](https://arxiv.org/html/2605.07103#bib.bib13)), and LLM-based feasibility reasoners (Rubin et al., [2022](https://arxiv.org/html/2605.07103#bib.bib44)). However, the performance of existing tools varies across reactions, and no single tool consistently produces correct predictions in all cases. For example, classification-based predictors tend to perform well on reactions with relatively regular structures (Probst et al., [2022](https://arxiv.org/html/2605.07103#bib.bib37)), while LLM-based methods are more effective on reactions that require complex reasoning or contextual understanding (Krishnan et al., [2026](https://arxiv.org/html/2605.07103#bib.bib43)). Such complementary strengths suggest that effectively leveraging multiple tools for each reaction is crucial for more accurate feasibility prediction.

To this end, prior work has explored leveraging multiple tools, such as dynamic ensemble selection methods (Cruz et al., [2020](https://arxiv.org/html/2605.07103#bib.bib45)) and mixture-of-experts models (Huang et al., [2024](https://arxiv.org/html/2605.07103#bib.bib51)). However, these approaches typically rely on simple aggregation or heuristic assignment strategies, without explicitly distinguishing which tool is more appropriate for which reaction, making their final predictions less stable or accurate. Meanwhile, recent advances in LLM-based agents have demonstrated strong capability in coordinating multiple tools through planning and reasoning (Qu et al., [2025](https://arxiv.org/html/2605.07103#bib.bib30); Qin et al., [2024b](https://arxiv.org/html/2605.07103#bib.bib33); Ye et al., [2025](https://arxiv.org/html/2605.07103#bib.bib55)). Nevertheless, they mainly focus on orchestrating tools with different functionalities to accomplish multi-step tasks, where each tool typically serves a distinct sub-goal. In contrast, leveraging multiple tools for a single task, especially in domain-specific scenarios such as reaction feasibility prediction, remains underexplored.

In this paper, we study how to effectively leverage multiple tools for accurate reaction feasibility prediction, where different tools exhibit varying strengths across reactions. We propose ARMOR, an Agentic framework for adaptive utility-aware Multi-tool Reasoning and cOnflict resolution for Reaction feasibility prediction, which models tool utilities, prioritizes tools with respect to different reactions, and further resolves potential tool conflicts with the support of contrastive demonstrations. ARMOR consists of three key components: (1) a tool hierarchy construction module that organizes multiple tools into a two-level structure, where the first level includes top-performing tools for initial decision-making, while the second level contains the remaining tools that exhibit specialized performance for reactions with specific characteristics; (2) a utility-aware tool prioritization module that selects tools by characterizing the reaction-specific utilities of different tools, thus adaptively prioritizing tools that tend to make correct predictions for each reaction; and (3) a tool conflict resolution module that resolves potentially conflicting predictions of the selected tools via a novel memory-augmented reasoning mechanism, which leverages historical reasoning behaviors over contrastive reaction-tool demonstrations to obtain the final predictions. Overall, ARMOR enables reaction-specific utility assessment over multiple tools, adaptively prioritizes appropriate tools, and resolves tool conflicts to accurately predict reaction feasibility.

Although we focus on reaction feasibility prediction in this work, the proposed ARMOR is a general framework that can be applied to other tasks where different tools exhibit varying performance across inputs. We use the reaction feasibility prediction task as a representative setting to evaluate ARMOR, and conduct extensive experiments on a public reaction feasibility dataset with diverse tools. The results demonstrate that ARMOR consistently outperforms strong baselines, including single-tool methods and various tool aggregation and tool selection approaches. In particular, ARMOR achieves superior and balanced performance, without bias toward either feasible or infeasible reactions. Further analysis shows that the performance gain primarily stems from its ability to model tool-specific utilities and effectively resolve tool conflicts.

Our contributions can be summarized as follows: (1) We investigate how to leverage multiple tools for reaction feasibility prediction and, for the first time, propose the direction of explicitly modeling the utilities of different tools, moving beyond direct aggregation or heuristic mixture-of-experts approaches. (2) We develop ARMOR, an agentic framework that characterizes tool utilities and leverages them to select appropriate tools for each reaction, followed by a tool conflict resolution module to derive the final predictions. (3) Extensive experiments demonstrate that ARMOR consistently achieves state-of-the-art performance. It shows significant advantages on reactions with persistent tool conflicts, where leveraging complementary tool strengths becomes especially important.

## 2 Related Work

##### Reaction Feasibility Prediction

Early studies on reaction feasibility prediction relied on expert-designed rules (Warr, [2014](https://arxiv.org/html/2605.07103#bib.bib15); Zhong et al., [2025](https://arxiv.org/html/2605.07103#bib.bib14)), which assess feasibility through handcrafted constraints such as functional-group compatibility and valence rules (Jorgensen et al., [1990](https://arxiv.org/html/2605.07103#bib.bib16); Aithal and Upadhyay, [2012](https://arxiv.org/html/2605.07103#bib.bib17)). While these methods provide interpretable decision criteria, they require substantial manual effort and often fail to generalize beyond predefined rule sets (Fooshee et al., [2018](https://arxiv.org/html/2605.07103#bib.bib18)). To overcome these limitations, subsequent work has explored machine learning approaches, which can be broadly categorized into classification-based predictors and forward-generation models (Park et al., [2022](https://arxiv.org/html/2605.07103#bib.bib19)). Classification-based methods train supervised models using reaction fingerprints or molecular descriptors (Probst et al., [2022](https://arxiv.org/html/2605.07103#bib.bib37); Yang et al., [2024](https://arxiv.org/html/2605.07103#bib.bib1); Chainani et al., [2025](https://arxiv.org/html/2605.07103#bib.bib11)), while forward-generation models assess feasibility by predicting plausible products and checking their consistency with the target outputs (Schwaller et al., [2019](https://arxiv.org/html/2605.07103#bib.bib12); Irwin et al., [2022](https://arxiv.org/html/2605.07103#bib.bib13)).

More recently, large language models (LLMs) have been applied to reaction understanding and synthesis planning (Murakumo et al., [2023](https://arxiv.org/html/2605.07103#bib.bib22)). For feasibility prediction, LLMs can perform zero-shot reasoning or leverage in-context learning (ICL) (Kojima et al., [2022](https://arxiv.org/html/2605.07103#bib.bib41); Brown et al., [2020](https://arxiv.org/html/2605.07103#bib.bib42)), and can be further enhanced by incorporating additional external signals (Rubin et al., [2022](https://arxiv.org/html/2605.07103#bib.bib44); Krishnan et al., [2026](https://arxiv.org/html/2605.07103#bib.bib43)). These advances highlight the flexibility of LLMs in integrating diverse information and performing contextual reasoning for reaction analysis.

##### Tool Selection

Selecting appropriate tools for a given input has been widely studied in machine learning and AI systems (Cruz et al., [2020](https://arxiv.org/html/2605.07103#bib.bib45)). Early approaches rely on static strategies such as majority voting or weighted aggregation (Dietterich, [2000](https://arxiv.org/html/2605.07103#bib.bib47)), which combine predictions without considering instance-specific characteristics and thus fail to fully exploit tool complementarity (Rokach, [2010](https://arxiv.org/html/2605.07103#bib.bib46)). To address this, dynamic selection methods have been proposed to adaptively select tools based on input characteristics (Britto Jr et al., [2014](https://arxiv.org/html/2605.07103#bib.bib48)). Representative approaches include dynamic ensemble selection methods such as KNORA (Ko et al., [2008](https://arxiv.org/html/2605.07103#bib.bib49)) and DES variants (Woloszynski et al., [2012](https://arxiv.org/html/2605.07103#bib.bib50)), as well as mixture-of-experts (MoE) models that route inputs to different experts via learned gating mechanisms (Shazeer et al., [2017](https://arxiv.org/html/2605.07103#bib.bib52); Huang et al., [2024](https://arxiv.org/html/2605.07103#bib.bib51)). However, these methods typically rely on implicit competence estimation.

Beyond these approaches, and benefiting from the advances of LLMs, a variety of agent-based tool selection methods have been explored (Qin et al., [2024a](https://arxiv.org/html/2605.07103#bib.bib29); Qu et al., [2025](https://arxiv.org/html/2605.07103#bib.bib30)), where LLM-based agents are used to coordinate multiple tools for complex tasks. In such settings, tools are usually assigned to different sub-tasks, and the agent focuses on orchestrating their interactions (Qin et al., [2024b](https://arxiv.org/html/2605.07103#bib.bib33)). In contrast, leveraging multiple tools for a single task remains underexplored, especially in domain-specific scenarios such as reaction feasibility prediction, where multiple tools with the same functionality need to be compared. Our work addresses this gap by explicitly modeling tool utilities and adaptively selecting appropriate tools for each reaction.

## 3 ARMOR Framework

##### Problem Definition

Given a reaction $r=(S\rightarrow D)$, where $S$ and $D$ denote the reactants and products, respectively, reaction feasibility prediction is formulated as a binary classification problem to predict $r$ as either feasible ($y=1$) or infeasible ($y=0$) (Yang et al., [2024](https://arxiv.org/html/2605.07103#bib.bib1)). In this work, we consider a setting in which multiple feasibility prediction tools $\mathcal{T}=\{t_{i}\}$ are available and have varying performance across reactions, and we aim to optimally leverage these tools to derive accurate predictions through a novel agentic framework, ARMOR.

##### Overview

ARMOR dynamically identifies the most suitable tool(s) for reaction feasibility prediction by measuring tool utilities and prioritizing tools with respect to different reactions, and resolves potential prediction conflicts among tools by learning and reasoning through contrastive demonstrations. As illustrated in Figure [1](https://arxiv.org/html/2605.07103#S3.F1), ARMOR consists of three key components: (1) Tool Hierarchy Construction (Section [3.1](https://arxiv.org/html/2605.07103#S3.SS1)), which organizes tools into two levels: tools on the first level have strong overall performance across reactions and can therefore be used for initial prediction, whereas tools on the second level exhibit more pronounced reaction-dependent performance and are thus better suited for specialized prediction scenarios or reactions with specific characteristics. ARMOR leverages this hierarchy to balance overall robustness with reaction-specific specialization in tool utilization. (2) Utility-aware Tool Prioritization (Section [3.2](https://arxiv.org/html/2605.07103#S3.SS2)), which performs tool selection by leveraging patterns that characterize tool utility across different reactions, thereby prioritizing tools that are more likely to generate correct predictions for specific reaction characteristics. This prioritization enables more adaptive and reaction-aware tool utilization, improving predictive accuracy while better leveraging the complementary strengths of different tools. (3) Tool Conflict Resolution (Section [3.3](https://arxiv.org/html/2605.07103#S3.SS3)), which reconciles conflicting predictions from the selected tools through novel memory-augmented reasoning over contrastive reaction-tool demonstrations to derive the final prediction. The conflict resolution enables more reliable and context-aware decision-making by leveraging complementary evidence and historical reasoning behaviors across tools.

### 3.1 Tool Hierarchy Construction

Tool performance often varies across reactions, with no single tool consistently performing best in all cases. To assess overall performance, ARMOR quantifies the performance of each tool in the tool set $\mathcal{T}$ on the validation set (e.g., using accuracy). The top $\rho\%$ best-performing tools are placed in the first level of the tool hierarchy, denoted as $\mathcal{T}^{(1)}$, and the remaining tools in the second level, denoted as $\mathcal{T}^{(2)}$, where $\rho\in(0,100)$ is a predefined ratio. The tools in $\mathcal{T}^{(1)}$ exhibit strong overall performance and are thus used for the initial decision-making, while the tools in $\mathcal{T}^{(2)}$ provide complementary evidence and are used when reactions cannot be consistently handled by $\mathcal{T}^{(1)}$. This hierarchical organization enables ARMOR to adaptively leverage tools for different reactions, leading to more accurate predictions.
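The split described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `build_hierarchy`, the tool names, the accuracy values, and the choice to round up and keep at least one first-level tool are all assumptions made here.

```python
import math

def build_hierarchy(val_accuracy, rho):
    """Split tools into level 1 (top rho% by validation accuracy) and level 2."""
    if not 0 < rho < 100:
        raise ValueError("rho must lie in the open interval (0, 100)")
    # Rank tool names from best to worst validation accuracy.
    ranked = sorted(val_accuracy, key=val_accuracy.get, reverse=True)
    # Assumption: round up and always keep at least one tool on level 1.
    n_top = max(1, math.ceil(len(ranked) * rho / 100))
    return ranked[:n_top], ranked[n_top:]

# Hypothetical tools and validation accuracies, for illustration only.
acc = {"clf_a": 0.81, "fwd_a": 0.77, "llm_a": 0.84, "llm_b": 0.70}
level1, level2 = build_hierarchy(acc, rho=50)
print(level1, level2)
```

With four tools and `rho=50`, the two most accurate tools land on the first level and the other two on the second.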

![Refer to caption](https://arxiv.org/html/2605.07103v1/x1.png)

Figure 1: ARMOR framework. The robot icon indicates that the corresponding module is agentic.

### 3.2 Utility-aware Tool Prioritization

Given a reaction $r$, ARMOR first applies the tools in $\mathcal{T}^{(1)}$ to produce initial predictions. If these predictions are consistent, indicating strong agreement and high confidence among tools with robust overall performance, ARMOR adopts the consensus as the final prediction. Otherwise, discrepancies among predictions from $\mathcal{T}^{(1)}$ are further addressed through utility-aware tool prioritization, which selectively leverages tools from both $\mathcal{T}^{(1)}$ and $\mathcal{T}^{(2)}$ based on their reaction-specific utilities. To select the most suitable tools for the reaction $r$, ARMOR uses a two-step process: (1) assess tool utilities using patterns; and (2) select tools using their patterns with respect to $r$.
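The overall control flow (first-level consensus, then prioritized tools, then conflict resolution) can be sketched as a three-stage cascade. The sketch below is a hedged illustration only: tools are modeled as plain callables returning 1 or 0, and `select_tools` and `resolve` are hypothetical stand-ins for the prioritization step of Section 3.2.2 and the conflict resolution agent of Section 3.3.

```python
def consensus(predictions):
    """Return the shared label if all predictions agree, otherwise None."""
    labels = set(predictions.values())
    return labels.pop() if len(labels) == 1 else None

def predict(reaction, level1, select_tools, resolve):
    # Stage 1: every first-level tool predicts; accept a unanimous answer.
    first = {name: tool(reaction) for name, tool in level1.items()}
    label = consensus(first)
    if label is not None:
        return label
    # Stage 2: prioritized tools (possibly from both levels) predict.
    selected = select_tools(reaction)  # hypothetical: returns {name: callable}
    second = {name: tool(reaction) for name, tool in selected.items()}
    label = consensus(second)
    if label is not None:
        return label
    # Stage 3: remaining disagreement goes to conflict resolution.
    return resolve(reaction, second)  # hypothetical resolver

# Toy tools: the first level disagrees, the selected tools agree on 1.
level1 = {"clf": lambda r: 1, "llm": lambda r: 0}
selected = {"clf": lambda r: 1, "fwd": lambda r: 1}
label = predict("CCO>>CC=O", level1, lambda r: selected, lambda r, preds: 0)
print(label)
```

Keeping the later stages behind callables means the cheap consensus check runs first and the more expensive steps are only reached for contested reactions.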

#### 3.2.1 Reaction-specific Tool Utility Assessment

##### Pattern Extraction

ARMOR uses discrete, descriptive patterns to summarize the situations in which each tool is likely to make correct reaction feasibility predictions, and thus to characterize tool utilities. Specifically, a tool-specific pattern is defined as a tuple $P_{t}=(d_{t},e_{t},\mathcal{X}_{t})$, where $d_{t}$ is a succinct description (e.g., “double bond formation”), $e_{t}$ is a textual explanation of the conditions under which the tool $t$ performs well (e.g., “The tool correctly predicts reactions where double bonds are formed between atoms that were previously single-bonded. This includes cases where a double bond is formed between two carbon atoms, or between a carbon atom and another atom like nitrogen or oxygen”), and $\mathcal{X}_{t}$ is a set of representative reaction examples covered by this pattern. Such patterns are extracted from reactions in the validation set for which $\mathcal{T}^{(1)}$ fails to produce consistent predictions; this set of reactions is denoted as $\mathcal{R}^{\neg(1)}_{v}$. The reactions in $\mathcal{R}^{\neg(1)}_{v}$ are particularly informative for characterizing tool utilities, as consistently predicted reactions provide limited discriminative signals for differentiating the complementary strengths and weaknesses of individual tools.

To extract $\mathcal{X}_{t}$ from $\mathcal{R}^{\neg(1)}_{v}$, ARMOR constructs $M$ diagnostic subsets for tool $t$ from $\mathcal{R}^{\neg(1)}_{v}$, denoted as $\{\mathcal{R}_{t}^{(m)}\}_{m=1}^{M}$, with each subset $\mathcal{R}_{t}^{(m)}=\{\{r_{11}\},\{r_{10}\},\{r_{01}\},\{r_{00}\}\}_{t}$, where $\{r_{xy}\}$ represents a set of $N$ reactions with ground-truth feasibility label $x$ and $t$'s predicted feasibility label $y$ ($1/0$ representing feasible/infeasible). These diagnostic subsets provide a comprehensive basis for analyzing $t$'s behavior over informative and discriminative reactions. Meanwhile, using subsets instead of the entire $\mathcal{R}^{\neg(1)}_{v}$ improves the efficiency of extracting many patterns.
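A diagnostic subset groups reactions by the four (ground truth, prediction) combinations. The sketch below shows one way to build $M$ such subsets for a single tool. The text does not say how the $N$ reactions per cell are chosen, so uniform random sampling without replacement is an assumption here, as are the function name and the toy data.

```python
import random

def diagnostic_subsets(reactions, truth, pred, M, N, seed=0):
    """Build M subsets, each holding up to N reactions per (truth, prediction) cell."""
    rng = random.Random(seed)
    # Cells (1,1), (1,0), (0,1), (0,0) correspond to r_11, r_10, r_01, r_00.
    cells = {(x, y): [] for x in (1, 0) for y in (1, 0)}
    for r in reactions:
        cells[(truth[r], pred[r])].append(r)
    subsets = []
    for _ in range(M):
        # Assumption: sample uniformly without replacement within each cell.
        subsets.append({
            cell: rng.sample(members, min(N, len(members)))
            for cell, members in cells.items()
        })
    return subsets

# Toy data: 8 reactions, 2 per cell, for one hypothetical tool.
reactions = [f"r{i}" for i in range(8)]
truth = {r: i % 2 for i, r in enumerate(reactions)}
pred = {r: (i // 2) % 2 for i, r in enumerate(reactions)}
subsets = diagnostic_subsets(reactions, truth, pred, M=3, N=1)
print(len(subsets), sorted(subsets[0]))
```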

ARMOR subsequently extracts patterns from $\mathcal{R}_{t}^{(m)}$ using a large language model $\mathtt{LLM}$, and aggregates the extracted patterns across all $M$ diagnostic subsets to obtain the pattern set $\mathcal{P}_{t}^{0}$ for tool $t$, as follows:

$$\mathcal{P}_{t}^{0}=\bigcup\nolimits_{m=1}^{M}\mathtt{LLM}(\mathcal{R}_{t}^{(m)}). \tag{1}$$

##### Pattern Refinement

$\mathtt{LLM}$ may make mistakes when extracting patterns. To eliminate patterns that $\mathtt{LLM}$ generates mistakenly or inaccurately, ARMOR further refines $\mathcal{P}_{t}^{0}$ from two perspectives: (1) How well pattern $P_{tj}=(d_{tj},e_{tj},\mathcal{X}_{tj})\in\mathcal{P}_{t}^{0}$ truly reflects the behavior of tool $t$, using a score $\texttt{Align}(P_{tj})$, defined as the proportion of reactions in $\mathcal{X}_{tj}$ that are correctly predicted by $t$. A higher $\texttt{Align}(P_{tj})$ indicates that $P_{tj}$ better reflects the utility of $t$. (2) How well pattern $P_{tj}$ covers its representative reaction examples $\mathcal{X}_{tj}$, using a score $\texttt{Cov}(P_{tj})$, defined as the proportion of reactions in $\mathcal{X}_{tj}$ that are covered by $P_{tj}$. Higher values indicate better coverage. Thus, ARMOR retains only patterns that satisfy the following:

$$\mathcal{P}_{t}^{\prime}=\{\,P_{tj}\mid\texttt{Align}(P_{tj})\geq\tau_{1},\;\texttt{Cov}(P_{tj})\geq\tau_{2}\,\}, \tag{2}$$

where $\tau_{1},\tau_{2}\in[0,1]$ are the thresholds that control the quality of the retained pattern set $\mathcal{P}_{t}^{\prime}$.
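The filter in Eq. (2) can be sketched as follows. This is an illustrative sketch under stated assumptions: patterns are plain dictionaries, and `is_correct` and `covers` are hypothetical callables. How coverage is actually judged is not specified in this section, so `covers` is only a stand-in for that check.

```python
def align(pattern, is_correct):
    """Fraction of the pattern's example reactions that tool t predicts correctly."""
    examples = pattern["examples"]
    return sum(is_correct(r) for r in examples) / len(examples)

def cov(pattern, covers):
    """Fraction of the pattern's example reactions that the pattern covers."""
    examples = pattern["examples"]
    return sum(covers(pattern, r) for r in examples) / len(examples)

def refine(patterns, is_correct, covers, tau1, tau2):
    """Keep patterns with Align >= tau1 and Cov >= tau2, as in Eq. (2)."""
    return [
        p for p in patterns
        if p["examples"]  # assumption: drop patterns with no examples
        and align(p, is_correct) >= tau1
        and cov(p, covers) >= tau2
    ]

# Toy data for one hypothetical tool.
patterns = [
    {"desc": "double bond formation", "examples": ["r1", "r2", "r3", "r4"]},
    {"desc": "ring opening", "examples": ["r5", "r6"]},
]
correct = {"r1", "r2", "r3", "r5"}
covered = {("double bond formation", r) for r in ["r1", "r2", "r3", "r4"]}
covered.add(("ring opening", "r5"))
kept = refine(
    patterns,
    is_correct=lambda r: r in correct,
    covers=lambda p, r: (p["desc"], r) in covered,
    tau1=0.75,
    tau2=0.8,
)
print([p["desc"] for p in kept])
```

In this toy run the first pattern has Align 3/4 and Cov 4/4 and is kept, while the second has Align 1/2 and is dropped.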

##### Pattern Consolidation

It is possible that in $\mathcal{P}_{t}^{\prime}$, different patterns extracted from different reactions describe similar tool utilities. To obtain a representative, compact, and non-redundant pattern set, ARMOR consolidates patterns sharing the same succinct description, and retains only one representative pattern among them, selected by $\mathtt{LLM}$. For each pattern $P_{tj}$ remaining in this consolidated pattern set, ARMOR further measures its quality on the entire $\mathcal{R}^{\neg(1)}_{v}$ (note that all the patterns are extracted from subsets of $\mathcal{R}^{\neg(1)}_{v}$), using a confidence score defined as follows:

$$\texttt{Conf}(P_{tj})=\frac{\left|\{\,r\in\mathcal{R}^{\neg(1)}_{v}\mid r\text{ is covered by }P_{tj},\;f_{t}(r)\text{ is correct}\,\}\right|}{\left|\{\,r\in\mathcal{R}^{\neg(1)}_{v}\mid r\text{ is covered by }P_{tj}\,\}\right|}, \tag{3}$$

where $f_{t}(r)$ is the predicted feasibility of $r$ by the tool $t$ associated with $P_{tj}$. Compared to $\texttt{Align}(P_{tj})$, which is measured on $\mathcal{X}_{tj}$, a subset of $\mathcal{R}^{\neg(1)}_{v}$, $\texttt{Conf}(P_{tj})$ offers a more holistic assessment over the entire $\mathcal{R}^{\neg(1)}_{v}$, thus enabling more reliable tool selection based on the consolidated pattern set. For each tool $t$, ARMOR retains its top-5 patterns in the consolidated pattern set with $\texttt{Conf}(P_{tj})$ above $\tau_{3}$ ($\tau_{3}\in[0,1]$), and thus constructs the final pattern set, denoted as $\mathcal{P}_{t}$, that will be used for tool prioritization and conflict resolution.
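Eq. (3) and the top-5 retention step can be sketched as below. The sketch rests on several assumptions: `covers` and `is_correct` are hypothetical callables, "above $\tau_3$" is read as a strict inequality, and a pattern that covers no reaction is given a score of 0.

```python
def conf(pattern, reactions, covers, is_correct):
    """Conf score of Eq. (3): accuracy of tool t on the reactions the pattern covers."""
    covered = [r for r in reactions if covers(pattern, r)]
    if not covered:
        return 0.0  # assumption: an uncovered pattern gets the lowest score
    return sum(is_correct(r) for r in covered) / len(covered)

def final_patterns(patterns, reactions, covers, is_correct, tau3, top_k=5):
    """Return up to top_k (score, pattern) pairs with score above tau3, best first."""
    scored = [(conf(p, reactions, covers, is_correct), p) for p in patterns]
    # Assumption: "above tau3" is treated as strictly greater than tau3.
    kept = [(s, p) for s, p in scored if s > tau3]
    kept.sort(key=lambda sp: sp[0], reverse=True)
    return kept[:top_k]

# Toy data: which patterns cover which reactions, for one hypothetical tool.
tags = {"r1": {"p1"}, "r2": {"p1", "p2"}, "r3": {"p2"}, "r4": {"p1"}}
correct = {"r1", "r2", "r4"}
result = final_patterns(
    ["p1", "p2"],
    reactions=list(tags),
    covers=lambda p, r: p in tags[r],
    is_correct=lambda r: r in correct,
    tau3=0.6,
)
print(result)
```

Here `p1` covers three reactions, all predicted correctly, and is kept. `p2` covers two reactions with one correct, so its score of 0.5 falls below the threshold.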

#### 3.2.2 Pattern-based Tool Selection

To accurately predict the feasibility of a new reaction $r^{\prime}$ at inference time, ARMOR first applies the $\mathcal{T}^{(1)}$ tools. If the $\mathcal{T}^{(1)}$ tools do not reach a consensus, ARMOR proceeds to select the most suitable tools from both $\mathcal{T}^{(1)}$ and $\mathcal{T}^{(2)}$, based on the $r^{\prime}$-specific tool utilities captured in $\{\mathcal{P}_{t}\}$. Specifically, ARMOR first identifies the patterns in $\{\mathcal{P}_{t}\}$ that cover $r^{\prime}$. From those covering patterns, ARMOR identifies their associated tools and, for each such tool $t$, its most confident pattern (in terms of $\texttt{Conf}$ score) that covers $r^{\prime}$ (a tool may also have patterns that do not cover $r^{\prime}$). This pattern is denoted as $P^{*}_{t}(r^{\prime})$. Based on the $\texttt{Conf}$ scores of $P^{*}_{t}(r^{\prime})$, the top-$L$ tools are selected to predict $r^{\prime}$. This set of selected tools is denoted as $\mathcal{T}^{(s)}(r^{\prime})$. This tool selection, grounded in historical patterns and tool performance over the patterns, enables ARMOR to identify reaction-specific tool strengths, leading to more accurate predictions. Meanwhile, leveraging multiple tools allows ARMOR to consider diverse patterns covering the reaction, leading to more robust predictions. If these tools produce consistent predictions for $r^{\prime}$, ARMOR outputs the consensus as the final prediction. Otherwise, it proceeds to the conflict resolution stage.
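The selection of $\mathcal{T}^{(s)}(r^{\prime})$ can be sketched as follows: for each tool, take its most confident pattern covering the reaction, then keep the $L$ tools whose best patterns score highest. The data layout (a dictionary from tool name to `(conf, pattern)` pairs) and the `covers` callable are assumptions for illustration.

```python
def select_tools(reaction, pattern_sets, covers, L):
    """Return up to L (tool, conf, pattern) triples ranked by best covering pattern."""
    best = {}
    for tool, scored_patterns in pattern_sets.items():
        # Keep only this tool's patterns that cover the reaction.
        covering = [(c, p) for c, p in scored_patterns if covers(p, reaction)]
        if covering:
            # P*_t(r'): the tool's most confident covering pattern.
            best[tool] = max(covering, key=lambda cp: cp[0])
    ranked = sorted(best.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(tool, c, p) for tool, (c, p) in ranked[:L]]

# Toy final pattern sets; tool and pattern names are hypothetical.
pattern_sets = {
    "clf": [(0.9, "p1"), (0.7, "p2")],
    "llm": [(0.8, "p2")],
    "fwd": [(0.95, "p3")],
}
reaction = {"p1", "p2"}  # stand-in: the set of patterns that cover this reaction
chosen = select_tools(reaction, pattern_sets, lambda p, r: p in r, L=2)
print(chosen)
```

Note that `fwd` is skipped even though it owns the highest-scoring pattern overall, because that pattern does not cover the reaction.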

### 3.3 Tool Conflict Resolution

##### Tool Conflict Memory

ARMOR learns to resolve conflicts among tool predictions via a novel tool conflict memory $\mathcal{M}$. To construct $\mathcal{M}$, ARMOR builds structured contrastive instances $x$ for each reaction $r\in\mathcal{R}^{\neg(1)}_{v}$ as follows:

$$x_{r,t^{+}}=\{r,t^{+},P^{*}_{t^{+}}(r),\{t^{-}\},\{P^{*}_{t^{-}}(r)\}\}, \tag{4}$$

where $t^{+}$ is a tool that accurately predicts $r$, with $P^{*}_{t^{+}}(r)$ its most confident pattern covering $r$; $\{t^{-}\}$ is a set of tools that fail to accurately predict $r$, with $\{P^{*}_{t^{-}}(r)\}$ the set of most confident patterns of $\{t^{-}\}$ covering $r$. The $\mathtt{LLM}$ is employed to generate a rationale $G_{r,t^{+}}$ explaining why $t^{+}$ is more suitable for reaction $r$ than $\{t^{-}\}$ based on their associated patterns. $G_{r,t^{+}}$ is further incorporated into $x_{r,t^{+}}$, yielding the final contrastive instance stored in $\mathcal{M}$. These contrastive instances highlight the performance differences among tools on the same reaction, as well as the patterns underlying their predictions. The memory $\mathcal{M}$ stores such instances as a source of contrastive demonstrations, enabling the $\mathtt{LLM}$ to learn how to resolve conflicts among tools.
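One possible in-memory layout for the contrastive instances of Eq. (4) is sketched below. The class and field names, the `best_pattern` lookup, and the decision to skip reactions without both a correct and an incorrect tool are assumptions made for illustration. The rationale $G_{r,t^{+}}$ is left as an empty placeholder because it is produced by an LLM call that is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class ContrastiveInstance:
    reaction: str
    good_tool: str        # t+: a tool that predicts the reaction correctly
    good_pattern: object  # P*_{t+}(r)
    bad_tools: list       # {t-}: tools that predict the reaction incorrectly
    bad_patterns: list    # {P*_{t-}(r)}
    rationale: str = ""   # G_{r,t+}; to be filled in by an LLM (not shown)

def build_instances(reaction, truth, preds, best_pattern):
    """Create one contrastive instance per correct tool for a single reaction."""
    good = [t for t, y in preds.items() if y == truth]
    bad = [t for t, y in preds.items() if y != truth]
    if not good or not bad:
        return []  # assumption: no contrast is available, so nothing is stored
    return [
        ContrastiveInstance(
            reaction=reaction,
            good_tool=t,
            good_pattern=best_pattern(t, reaction),
            bad_tools=list(bad),
            bad_patterns=[best_pattern(b, reaction) for b in bad],
        )
        for t in good
    ]

# Toy example with hypothetical tools and a hypothetical pattern lookup.
preds = {"clf": 1, "llm": 0, "fwd": 1}
instances = build_instances("r1", 1, preds, lambda t, r: f"{t}-pattern")
print(len(instances), instances[0].good_tool, instances[0].bad_tools)
```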

##### Memory\-augmented Conflict Resolution

To address conflicts among tool predictions for a reaction $r$ at inference time, ARMOR adopts a novel conflict resolution agent based on memory-augmented reasoning over $\mathcal{M}$. Specifically, ARMOR retrieves the top-$K$ most similar reactions to $r$ from $\mathcal{M}$ based on the DRFP representation (Probst et al., [2022](https://arxiv.org/html/2605.07103#bib.bib37)), a reaction fingerprint that captures reaction-level structural transformations. For each of these similar reactions, ARMOR randomly selects one of its contrastive reasoning instances in $\mathcal{M}$. These instances are used as few-shot demonstrations, together with the selected tool set $\mathcal{T}^{(s)}(r)$ and their associated patterns, to prompt the $\mathtt{LLM}$ to determine the optimal tool for $r$, and thus produce the final prediction. The conflict resolution process leverages the reasoning capability of the $\mathtt{LLM}$ by learning from historical contrastive instances of reactions similar to $r$, thereby providing informative signals for distinguishing reliable tools under conflicts.
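The retrieval step can be sketched as a nearest-neighbour lookup over binary fingerprints. The text names DRFP as the representation but does not state the similarity measure, so the Tanimoto similarity used here is an assumption, and fingerprints are represented simply as sets of on-bit indices rather than computed with a DRFP encoder. The memory layout and function names are likewise hypothetical.

```python
import random

def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def retrieve_demonstrations(query_fp, memory, K, seed=0):
    """Pick one stored instance from each of the K reactions most similar to the query.

    memory maps a reaction id to (fingerprint, list of contrastive instances).
    """
    rng = random.Random(seed)
    ranked = sorted(
        memory,
        key=lambda rid: tanimoto(query_fp, memory[rid][0]),
        reverse=True,
    )
    # One randomly chosen instance per retrieved reaction, as described in the text.
    return [rng.choice(memory[rid][1]) for rid in ranked[:K]]

# Toy memory with hand-written fingerprints and placeholder instances.
memory = {
    "r1": ({1, 2, 3}, ["inst_r1"]),
    "r2": ({1, 2}, ["inst_r2"]),
    "r3": ({7}, ["inst_r3"]),
}
demos = retrieve_demonstrations({1, 2, 3}, memory, K=2)
print(demos)
```

The retrieved instances would then be formatted as few-shot demonstrations in the prompt. That prompting step is omitted because its format is not given in this section.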

## 4 Experimental Setting

##### Dataset

Table 1: Dataset statistics.

| Split | #Reactions | #Feasible | #Infeasible |
|---|---|---|---|
| Validation | 6,000 | 3,000 | 3,000 |
| Test | 6,000 | 3,000 | 3,000 |

We conduct experiments on the FREA dataset (Yu et al., [2026](https://arxiv.org/html/2605.07103#bib.bib56)), which is constructed from the U.S. Patent & Trademark Office (USPTO). The dataset contains real-world feasible reactions from USPTO, and infeasible reactions generated through multiple chemically motivated perturbation strategies validated with expert evaluations, providing a rigorous testbed for reaction feasibility prediction. Specifically, we randomly select 12,000 reactions and split them into a validation set and a test set (Table [1](https://arxiv.org/html/2605.07103#S4.T1)). Since our objective is to leverage multiple tools for reaction feasibility prediction, rather than train individual tools, we do not involve model training on this dataset, and thus do not introduce a training split. The validation set is used for tool hierarchy construction, tool utility assessment, and tool conflict memory construction, while the test set is used for evaluating the performance of ARMOR.

##### Reaction Feasibility Tool Set

The tool set consists of 13 individual tools, including 3 classification-based feasibility predictors, 2 forward-generative models, and 8 LLM-based feasibility reasoners. Details on the construction of the tool set are provided in Appendix [A](https://arxiv.org/html/2605.07103#A1). Importantly, we strictly ensure that there is no overlap between the dataset (Table [1](https://arxiv.org/html/2605.07103#S4.T1)) and the training data for individual tools (Table [A1](https://arxiv.org/html/2605.07103#A1.T1) in the Appendix). This setup ensures that the reported results of our ARMOR framework reflect the effectiveness of leveraging multiple tools, rather than being influenced by data leakage or model memorization.

##### Evaluation Metrics

We evaluate the reaction feasibility prediction performance using Accuracy (ACC), F1 score, and Matthews Correlation Coefficient (MCC) (Chicco and Jurman, [2020](https://arxiv.org/html/2605.07103#bib.bib53)). We report overall accuracy as well as class-wise accuracy for feasible and infeasible reactions. For F1 score, we report class-wise performance by treating each class (feasible/infeasible) as the positive class in turn. MCC is further adopted as a balanced metric to provide a comprehensive evaluation.
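
The metrics above can all be derived from the four confusion-matrix counts. The sketch below treats "feasible" as the positive class; the function name `binary_metrics` is hypothetical. The example counts are not reported in the paper: they are inferred from the class-wise accuracies of the BERT row in Table 2 (80.30% of 3,000 feasible and 95.50% of 3,000 infeasible test reactions).

```python
from math import sqrt


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """ACC, class-wise accuracy, class-wise F1 and MCC ('feasible' = positive)."""
    total = tp + fn + tn + fp
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "acc": (tp + tn) / total,
        "acc_feasible": tp / (tp + fn),      # recall on the feasible class
        "acc_infeasible": tn / (tn + fp),    # recall on the infeasible class
        "f1_feasible": 2 * tp / (2 * tp + fp + fn),
        "f1_infeasible": 2 * tn / (2 * tn + fn + fp),
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }


# Counts inferred from reported class-wise accuracies (an assumption, see above).
m = binary_metrics(tp=2409, fn=591, tn=2865, fp=135)
print({k: round(v, 4) for k, v in m.items()})
```

With these counts the function yields an overall accuracy of 87.90%, F1 scores of 86.90% (feasible) and 88.75% (infeasible), and an MCC of 0.7669, matching the BERT row of Table 2. This also shows why a tool with strongly imbalanced class-wise accuracy can still report a high overall accuracy on a balanced test set.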

##### Tool Selection Baselines

We compare ARMOR with a diverse set of baselines, grouped into four categories: (1) Single-Tool Methods, which evaluate each individual tool independently; (2) Statistical Methods, which aggregate tool predictions based on predefined statistical rules, including majority voting, weighted voting, random selection, and StaticSel (Britto Jr et al., [2014](https://arxiv.org/html/2605.07103#bib.bib48)); (3) Dynamic Methods, which select tools for each instance by estimating tool competence based on its local neighborhoods or input characteristics, including DES-KNN and DES-Clustering (Soares et al., [2006](https://arxiv.org/html/2605.07103#bib.bib54)), KNORA-E and KNORA-U (Ko et al., [2008](https://arxiv.org/html/2605.07103#bib.bib49)), and HarderMoE (Huang et al., [2024](https://arxiv.org/html/2605.07103#bib.bib51)); and (4) LLM-based Methods, which leverage LLMs to assess tool utilities and perform tool selection through semantic reasoning, including ToolEyes (Ye et al., [2025](https://arxiv.org/html/2605.07103#bib.bib55)) and several closed-source LLMs (e.g., GPT-5.4-mini (Singh et al., [2025](https://arxiv.org/html/2605.07103#bib.bib57)), DeepSeek-v4-flash (DeepSeek-AI, [2026](https://arxiv.org/html/2605.07103#bib.bib59)), and Claude-Sonnet-4.6 (Anthropic, [2026](https://arxiv.org/html/2605.07103#bib.bib58))). Detailed descriptions of all baselines are provided in Appendix [B](https://arxiv.org/html/2605.07103#A2).
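
To make the difference between the two simplest statistical baselines concrete, the sketch below contrasts majority voting with weighted voting on binary predictions (1 = feasible, 0 = infeasible). The exact weighting scheme used in the paper is described in its Appendix B and is not reproduced here; using per-tool validation accuracy as the weight, breaking ties toward "infeasible", and the tool names are all assumptions.

```python
from typing import Dict


def majority_vote(preds: Dict[str, int]) -> int:
    """Return 1 (feasible) if strictly more than half of the tools predict 1."""
    return int(sum(preds.values()) * 2 > len(preds))


def weighted_vote(preds: Dict[str, int], weights: Dict[str, float]) -> int:
    """Return the label whose supporting tools carry the larger total weight."""
    w_pos = sum(weights[t] for t, y in preds.items() if y == 1)
    w_neg = sum(weights[t] for t, y in preds.items() if y == 0)
    return int(w_pos > w_neg)  # ties fall back to 0 (assumption)


preds = {"tool_a": 0, "tool_b": 1, "tool_c": 1}
weights = {"tool_a": 0.9, "tool_b": 0.4, "tool_c": 0.4}  # hypothetical accuracies

print(majority_vote(preds))           # two of three tools say feasible
print(weighted_vote(preds, weights))  # the single strong tool outweighs them
```

In this toy case the two rules disagree: majority voting follows the two weaker tools, while weighted voting follows the one stronger tool. Neither rule looks at the reaction itself, which is the limitation that instance-level tool selection is meant to address.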

##### Implementation Details

In ARMOR, we adopt T3Q-Qwen-14B (JungZoona, [2025](https://arxiv.org/html/2605.07103#bib.bib40)) as the backbone LLM, with default parameter settings and $\mathtt{do\_sample=False}$ to eliminate sampling randomness. For tool hierarchy construction (Section [3.1](https://arxiv.org/html/2605.07103#S3.SS1)), we set the proportion $\rho=25$ and use accuracy on the validation set as the ranking metric. For pattern extraction (Section [3.2.1](https://arxiv.org/html/2605.07103#S3.SS2.SSS1.Px1)), we set $M=100$ and $N\in\{5,10,25,45\}$ to construct diagnostic subsets. For pattern refinement (Eq. [2](https://arxiv.org/html/2605.07103#S3.E2)) and pattern consolidation, we set $\tau_{1}=1$, $\tau_{2}=1$ and $\tau_{3}=0.5$, respectively. For pattern-based tool prioritization (Section [3.2.2](https://arxiv.org/html/2605.07103#S3.SS2.SSS2)), we set $L=5$ to control the number of selected tools for prediction. For memory-augmented conflict resolution (Section [3.3](https://arxiv.org/html/2605.07103#S3.SS3)), we set the number of retrieved similar reactions to $K=8$. All experiments are conducted on a Linux server with two Tesla A100 GPUs.

## 5 Experimental Results

### 5.1 Overall Performance

Table 2: Overall experimental results. <u>Underlined</u> results indicate the best performance among single-tool methods, and **bold** values highlight the best performance among tool selection methods. "Overall", "Feasible" and "Infeasible" indicate that the evaluation is on the entire set, only feasible, and only infeasible reactions, respectively. ↑ indicates higher values are better.

| Category | Method | ACC (%) ↑ Overall | ACC (%) ↑ Feasible | ACC (%) ↑ Infeasible | F1 (%) ↑ Feasible | F1 (%) ↑ Infeasible | MCC ↑ |
|---|---|---|---|---|---|---|---|
| Single-Tool Methods | BERT | <u>87.90</u> | 80.30 | 95.50 | <u>86.90</u> | <u>88.75</u> | <u>0.7669</u> |
| | DRFP | 80.62 | 66.13 | 95.10 | 77.33 | 83.07 | 0.6398 |
| | Dora_xgb | 50.80 | 22.00 | 79.60 | 30.90 | 61.80 | 0.0196 |
| | Molecular | 82.30 | 71.33 | 93.27 | 80.12 | 84.05 | 0.6621 |
| | Chemformer | 81.88 | 68.23 | <u>95.53</u> | 79.02 | 84.06 | 0.6628 |
| | Prompting$_{\texttt{T3Q}}$ | 58.07 | 54.30 | 61.83 | 56.43 | 59.59 | 0.1618 |
| | Prompting$_{\texttt{Llama}}$ | 54.02 | 23.37 | 84.67 | 33.69 | 64.80 | 0.1017 |
| | ICL$_{\texttt{T3Q}}$ | 63.38 | 76.27 | 50.50 | 67.56 | 57.97 | 0.2770 |
| | ICL$_{\texttt{Llama}}$ | 55.12 | 89.93 | 20.30 | 66.71 | 31.14 | 0.1426 |
| | Molecular$_{\texttt{T3Q}}$ | 71.93 | 79.77 | 64.10 | 73.97 | 69.55 | 0.4442 |
| | Molecular$_{\texttt{Llama}}$ | 60.93 | 84.97 | 36.90 | 68.50 | 48.57 | 0.2494 |
| | Chemformer$_{\texttt{T3Q}}$ | 69.98 | 80.33 | 59.63 | 72.80 | 66.52 | 0.4085 |
| | Chemformer$_{\texttt{Llama}}$ | 60.55 | <u>90.70</u> | 30.40 | 69.69 | 43.52 | 0.2645 |
| | Theoretical Upper Bound | 99.90 | 99.80 | 100.00 | 99.90 | 99.90 | 0.9980 |
| Statistical Methods | StaticSel | 74.12 | 81.87 | 66.37 | 75.98 | 71.94 | 0.4882 |
| | Majority Voting | 82.88 | 82.53 | 83.23 | 82.82 | 82.94 | 0.6577 |
| | Weighted Voting | 86.35 | 81.50 | 91.20 | 85.65 | 86.98 | 0.7304 |
| | Random Selection | 67.48 | 68.67 | 66.29 | 67.86 | 67.09 | 0.3497 |
| Dynamic Methods | DES-KNN | 75.80 | 76.57 | 75.03 | 75.98 | 75.61 | 0.5161 |
| | DES-Clustering | 72.20 | 84.90 | 59.50 | 75.33 | 68.16 | 0.4591 |
| | KNORA-E | 82.88 | 82.53 | 83.23 | 82.82 | 82.94 | 0.6577 |
| | KNORA-U | 81.85 | 80.33 | 83.37 | 81.57 | 82.12 | 0.6373 |
| | HarderMoE | 89.50 | 86.03 | 92.97 | 89.12 | 89.85 | 0.7919 |
| LLM-based Methods | ToolEyes | 87.25 | 79.73 | 94.77 | 86.21 | 88.14 | 0.7536 |
| | GPT-5.4-mini | 75.98 | 87.93 | 64.03 | 78.55 | 72.72 | 0.5352 |
| | DeepSeek-v4-flash | 80.90 | 77.13 | 84.67 | 80.15 | 81.59 | 0.6198 |
| | Claude-Sonnet-4.6 | 85.55 | 74.57 | **96.53** | 83.77 | 86.98 | 0.7288 |
| | ARMOR (ours) | **91.62** | **91.57** | 91.67 | **91.61** | **91.62** | **0.8323** |

Table [2](https://arxiv.org/html/2605.07103#S5.T2) reports the performance of individual tools, the theoretical upper bound of leveraging all tools, various tool aggregation and tool selection baselines, and our ARMOR framework. We highlight several key findings from the results as follows:

Individual tools exhibit limited performance and substantial variation. No single tool consistently performs best across all reactions. For example, BERT achieves the best overall performance among all individual tools (87.90% in overall accuracy), while Chemformer$_{\texttt{Llama}}$ performs best on feasible reactions, and Chemformer achieves the strongest performance on infeasible reactions. Such diversity leads to a very high theoretical upper bound for all tools combined (i.e., the performance obtained when a reaction is considered correctly predicted if any tool predicts it correctly), reaching 99.90% overall accuracy and indicating that different tools capture complementary aspects of the task. This observation validates our motivation for leveraging multiple tools and highlights the significant potential of selecting appropriate tools for each reaction.
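
The theoretical upper bound described above is an oracle: a reaction counts as correct if at least one tool gets it right. A minimal sketch, with hypothetical tool names and toy predictions:

```python
from typing import Dict, List


def oracle_accuracy(predictions: Dict[str, List[int]], labels: List[int]) -> float:
    """Fraction of reactions that at least one tool predicts correctly."""
    hits = sum(
        any(tool_preds[i] == y for tool_preds in predictions.values())
        for i, y in enumerate(labels)
    )
    return hits / len(labels)


labels = [1, 1, 0, 0]
predictions = {
    "tool_a": [1, 0, 0, 1],  # correct on reactions 0 and 2
    "tool_b": [0, 1, 0, 1],  # correct on reactions 1 and 2
}

print(oracle_accuracy(predictions, labels))
```

Each toy tool is correct on only half of the reactions, yet the oracle reaches 0.75 because the tools fail on different reactions. The bound is only attainable by a selector that always knows which tool to trust, so it measures complementarity rather than an achievable target.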

ARMOR consistently outperforms all baselines with more balanced performance across both feasible and infeasible reactions. ARMOR achieves the best performance on most metrics, except for accuracy on the infeasible set. In terms of overall accuracy, it surpasses the strongest individual tool (BERT, 87.90%) and reaches 91.62%. ARMOR also consistently improves over existing tool selection baselines, including HarderMoE (89.50%), which leverages implicit expert routing to assign more tools to harder cases and fewer to easier ones. Unlike such implicit routing strategies, ARMOR explicitly reasons over the utilities of different tools for each reaction and further resolves conflicts among selected tools, allowing it to better leverage the complementary strengths of different tools. In addition, ARMOR achieves strong and balanced accuracy on both feasible and infeasible reactions, resulting in superior F1 and MCC scores that reflect a more balanced predictive behavior across different reaction classes.

LLM-based methods achieve competitive performance but often suffer from imbalanced behavior. Claude-Sonnet-4.6 and ToolEyes rank among the top-performing tool selection methods in terms of overall accuracy. However, their performance varies substantially across reaction classes. For example, Claude-Sonnet-4.6 and ToolEyes achieve strong accuracy on the infeasible set (96.53% and 94.77%, respectively), but their accuracy on the feasible set is significantly lower (74.57% and 79.73%), leading to suboptimal overall results. This further demonstrates the advantage of ARMOR in achieving balanced performance across both feasible and infeasible reactions.

Statistical methods fail to outperform the strongest individual tool. Statistical methods such as majority voting and weighted voting combine predictions using predefined rules without considering input-specific characteristics or tool strengths. As a result, they are unable to correct systematic errors shared by a majority of tools, leading to inferior performance.

Most dynamic methods show limited improvement over statistical baselines. KNORA-E achieves performance comparable to majority voting (82.88%), suggesting limited benefit from its neighborhood-based selection. Furthermore, KNORA-U and the DES variants underperform majority voting, indicating that their selection strategies may even introduce suboptimal decisions in this setting.

Additional experiments show that, in ARMOR, incorporating more demonstrations improves conflict resolution and leads to more accurate predictions. In addition, the tool selection distribution analysis reveals that generally strong $\mathcal{T}^{(1)}$ tools are not frequently selected in conflict resolution, whereas some specialized $\mathcal{T}^{(2)}$ tools are chosen substantially more often. These observations further highlight the importance of tool utility modeling. More details can be found in Appendix [C](https://arxiv.org/html/2605.07103#A3) and Appendix [D](https://arxiv.org/html/2605.07103#A4).

### 5.2 Ablation Study

![Refer to caption](https://arxiv.org/html/2605.07103v1/x2.png)

![Refer to caption](https://arxiv.org/html/2605.07103v1/x3.png)

![Refer to caption](https://arxiv.org/html/2605.07103v1/x4.png)

Figure 2: Ablation study of ARMOR.

In this subsection, we conduct ablation experiments to assess the effectiveness of different components in the ARMOR framework. We progressively remove the tool conflict resolution module (Section [3.3](https://arxiv.org/html/2605.07103#S3.SS3)), the utility-aware tool prioritization module (Section [3.2](https://arxiv.org/html/2605.07103#S3.SS2)) and the tool hierarchy construction module (Section [3.1](https://arxiv.org/html/2605.07103#S3.SS1)), resulting in three variants: -w/o Conflict, -w/o Utility, and -w/o Hierarchy, respectively. Specifically, in -w/o Conflict, the final prediction degrades to majority voting over the selected tools $\mathcal{T}^{(s)}$ in Section [3.2.2](https://arxiv.org/html/2605.07103#S3.SS2.SSS2). In -w/o Utility, predictions are obtained via majority voting across all tools for reactions that cannot be consistently resolved by $\mathcal{T}^{(1)}$. In -w/o Hierarchy, the framework is reduced to majority voting across all tools for all reactions.

From the results in Figure [2](https://arxiv.org/html/2605.07103#S5.F2), we observe a clear performance degradation as components are progressively removed, indicating that each component contributes to the overall effectiveness of ARMOR. In particular, removing tool conflict resolution leads to a noticeable drop, highlighting the effectiveness of our memory-augmented design in resolving tool conflicts. Further removing utility-aware tool prioritization causes additional degradation, demonstrating the necessity of explicitly modeling tool-specific utilities. The largest performance drop occurs when removing the tool hierarchy, highlighting its importance in prioritizing top-performing tools for straightforward reactions and deferring the use of the remaining tools to more challenging cases.

### 5.3 Performance Gains across Reaction Categories

In this subsection, we analyze the improvement of ARMOR over the strongest baseline, HarderMoE, across three reaction categories: (1) reactions where $\mathcal{T}^{(1)}$ produces consistent predictions, denoted as $\mathcal{T}^{(1)}$; (2) reactions where the selected tools $\mathcal{T}^{(s)}$ (Section [3.2.2](https://arxiv.org/html/2605.07103#S3.SS2.SSS2)) produce consistent predictions, denoted as $\mathcal{T}^{(s)}$; and (3) reactions where $\mathcal{T}^{(s)}$ still produce conflicting predictions, denoted as Conflict.
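
The three categories follow the order in which the framework escalates a reaction. The sketch below assigns a reaction to a category from its tool predictions; `categorize` and `consistent` are hypothetical helper names, the `select_tools` callable stands in for pattern-based tool selection, and the selected tool names are made up for illustration.

```python
from typing import Callable, Dict, List


def consistent(preds: Dict[str, int], tools: List[str]) -> bool:
    """True if all listed tools give the same prediction."""
    return len({preds[t] for t in tools}) == 1


def categorize(
    preds: Dict[str, int],
    top_tools: List[str],
    select_tools: Callable[[Dict[str, int]], List[str]],
) -> str:
    """Assign a reaction to 'T1', 'Ts' or 'Conflict'.

    select_tools is only invoked when the top-tier tools disagree.
    """
    if consistent(preds, top_tools):
        return "T1"
    selected = select_tools(preds)
    return "Ts" if consistent(preds, selected) else "Conflict"


top_tools = ["BERT", "DRFP", "Molecular"]
preds = {"BERT": 1, "DRFP": 0, "Molecular": 1, "tool_x": 1, "tool_y": 1, "tool_z": 0}

agree = categorize(preds, top_tools, lambda p: ["tool_x", "tool_y"])
clash = categorize(preds, top_tools, lambda p: ["tool_x", "tool_z"])
```

In this toy example the top-tier tools disagree, so the category depends on the selected tools: `agree` is `"Ts"` because the two selected tools give the same prediction, while `clash` is `"Conflict"` and would be passed on to conflict resolution.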

Table 3: Improvement of ARMOR over HarderMoE on the overall accuracy (ACC (%)).

| Category | $N$ | Proportion (%) | ARMOR | HarderMoE | $\Delta$ |
|---|---|---|---|---|---|
| $\mathcal{T}^{(1)}$ | 4,328 | 72.13 | 94.41 | 94.15 | +0.26 |
| $\mathcal{T}^{(s)}$ | 569 | 9.48 | 93.67 | 86.64 | +7.03 |
| Conflict | 1,103 | 18.38 | 79.60 | 72.71 | +6.89 |
| Total | 6,000 | 100.00 | 91.62 | 89.50 | +2.12 |

As shown in Table [3](https://arxiv.org/html/2605.07103#S5.T3), the improvement is not uniformly distributed across the three categories, but is mainly concentrated on cases where tools produce inconsistent predictions. Specifically, for reactions where $\mathcal{T}^{(1)}$ reach consistent predictions, ARMOR achieves only a marginal improvement (+0.26 percentage points), indicating that these cases can be well handled by strong baselines. For reactions where $\mathcal{T}^{(s)}$ produce consistent predictions, ARMOR achieves a much larger improvement (+7.03 points). For these reactions, $\mathcal{T}^{(1)}$ initially produce inconsistent predictions, but after pattern-based tool selection, the selected tools ($\mathcal{T}^{(s)}$) reach agreement, suggesting that ARMOR's utility-aware tool prioritization module effectively identifies appropriate tools for each reaction. Finally, for reactions where tools in $\mathcal{T}^{(s)}$ still produce conflicting predictions, which correspond to the most challenging cases, ARMOR also achieves a substantial improvement (+6.89 points). This demonstrates its effectiveness in resolving persistent tool conflicts under difficult scenarios. Overall, ARMOR consistently improves over HarderMoE (+2.12 points), with the gains mainly arising from its ability to identify appropriate tools through tool utility modeling and to resolve remaining conflicts in difficult cases.
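
As a consistency check on Table 3, the overall accuracies can be recovered as the size-weighted average of the per-category accuracies, and the overall gain can be split into per-category contributions. The decomposition below is an illustrative calculation on the reported numbers, not an analysis from the paper.

```python
# (category, number of reactions, ARMOR ACC %, HarderMoE ACC %) from Table 3
rows = [
    ("T1", 4328, 94.41, 94.15),
    ("Ts", 569, 93.67, 86.64),
    ("Conflict", 1103, 79.60, 72.71),
]

total = sum(n for _, n, _, _ in rows)
armor = sum(n * a for _, n, a, _ in rows) / total
harder = sum(n * h for _, n, _, h in rows) / total

# Contribution of each category to the overall gain, in percentage points
contrib = {name: n * (a - h) / total for name, n, a, h in rows}

print(round(armor, 2), round(harder, 2))
print({name: round(c, 2) for name, c in contrib.items()})
```

The weighted averages reproduce the reported totals (91.62 and 89.50). Of the roughly 2.12-point overall gain, about 0.19 points come from the $\mathcal{T}^{(1)}$ category, 0.67 from $\mathcal{T}^{(s)}$, and 1.27 from Conflict, so the two categories with disagreeing top-tier tools account for most of the improvement even though they cover under 28% of the test reactions.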

### 5.4 Case Study

![Refer to caption](https://arxiv.org/html/2605.07103v1/x5.png)

Figure 3: Case study.

To further illustrate the reasoning process of ARMOR, we present a representative case study in Figure [3](https://arxiv.org/html/2605.07103#S5.F3). For the given reaction $r$, $\mathcal{T}^{(1)}$ (BERT, DRFP and Molecular) produce inconsistent predictions, indicating that generally strong tools alone are insufficient for this reaction. ARMOR therefore conducts pattern-based tool selection, where $\mathcal{T}^{(s)}$ are selected together with their associated patterns. Although these patterns all cover the given reaction, the selected tools still produce inconsistent predictions. To resolve the conflict, ARMOR retrieves contrastive reasoning instances from the tool conflict memory $\mathcal{M}$. One of the retrieved instances is shown in Figure [3](https://arxiv.org/html/2605.07103#S5.F3). It involves a reaction similar to $r$ and contains three tools, including one correctly predicting tool (ICL$_{\texttt{Llama}}$) and two incorrectly predicting tools (Chemformer$_{\texttt{Llama}}$, Molecular$_{\texttt{Llama}}$), with patterns overlapping those of the current reaction $r$. According to the rationale in this instance, ICL$_{\texttt{Llama}}$ is preferred because it specializes in reactions involving the formation of new C–N bonds in aromatic rings. Since the current reaction $r$ exhibits the same transformation pattern, ARMOR selects ICL$_{\texttt{Llama}}$ and uses its prediction as the final prediction. This case demonstrates how ARMOR transfers reasoning signals from historical contrastive instances to identify the most appropriate tool under conflicting predictions, ultimately deriving the correct prediction.

## 6 Conclusions

In this work, we studied how to effectively leverage multiple tools for reaction feasibility prediction, where different tools exhibit varying strengths across reactions. We developed ARMOR, an agent-based framework that integrates tool hierarchy construction, utility-aware tool prioritization, and tool conflict resolution to guide tool usage and derive the final prediction for each reaction. By explicitly modeling when different tools succeed, ARMOR effectively leverages their strengths, particularly in challenging cases with persistent conflicting predictions. Extensive experiments demonstrate that ARMOR consistently outperforms strong baselines, highlighting the importance of utility-aware tool prioritization over direct aggregation strategies.

## References

- Llama 3 model card. External Links: [Link](https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md).
- S. Aithal and D. Upadhyay (2012). Feasibility study of the potential use of chemistry based emission predictions for real-time control of modern diesel engines. Applied Energy 91(1), pp. 475–482.
- Anthropic (2026). Claude sonnet 4.6 system card. Note: https://www.anthropic.com, [https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7a0b4484f84.pdf](https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7a0b4484f84.pdf).
- A. S. Britto Jr, R. Sabourin, and L. E. Oliveira (2014). Dynamic selection of classifiers—a comprehensive review. Pattern recognition 47(11), pp. 3665–3680.
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020). Language models are few-shot learners. Advances in neural information processing systems 33, pp. 1877–1901.
- Y. Chainani, Z. Ni, K. M. Shebek, L. J. Broadbelt, and K. E. Tyo (2025). DORA-xgb: an improved enzymatic reaction feasibility classifier trained using a novel synthetic data approach. Molecular Systems Design & Engineering 10(2), pp. 129–142.
- D. Chicco and G. Jurman (2020). The advantages of the matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation. BMC genomics 21(1), pp. 6.
- R. M. Cruz, L. G. Hafemann, R. Sabourin, and G. D. Cavalcanti (2020). DESlib: a dynamic ensemble selection library in python. Journal of Machine Learning Research 21(8), pp. 1–5.
- DeepSeek-AI (2026). DeepSeek-v4: towards highly efficient million-token context intelligence. Note: [https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf). Accessed: 2026-04-29.
- J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019). Bert: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pp. 4171–4186.
- T. G. Dietterich (2000). Ensemble methods in machine learning. In International workshop on multiple classifier systems, pp. 1–15.
- D. Fooshee, A. Mood, E. Gutman, M. Tavakoli, G. Urban, F. Liu, N. Huynh, D. Van Vranken, and P. Baldi (2018). Deep learning for chemical reaction prediction. Molecular Systems Design & Engineering 3(3), pp. 442–452.
- Q. Huang, Z. An, N. Zhuang, M. Tao, C. Zhang, Y. Jin, K. Xu, L. Chen, S. Huang, and Y. Feng (2024). Harder task needs more experts: dynamic routing in moe models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12883–12895.
- R. Irwin, S. Dimitriadis, J. He, and E. J. Bjerrum (2022). Chemformer: a pre-trained transformer for computational chemistry. Machine Learning: Science and Technology 3(1), pp. 015022.
- W. L. Jorgensen, E. R. Laird, A. J. Gushurst, J. M. Fleischer, S. A. Gothe, H. E. Helson, G. D. Paderes, and S. Sinclair (1990). CAMEO: a program for the logical prediction of the products of organic reactions. Pure and Applied Chemistry 62(10), pp. 1921–1932.
- JungZoona (2025). T3Q-qwen2.5-14b-v1.0-e3. Note: [https://huggingface.co/JungZoona/T3Q-qwen2.5-14b-v1.0-e3](https://huggingface.co/JungZoona/T3Q-qwen2.5-14b-v1.0-e3). Accessed: 2026-04-15.
- A. H. Ko, R. Sabourin, and A. S. Britto Jr (2008). From dynamic classifier selection to dynamic ensemble selection. Pattern recognition 41(5), pp. 1718–1731.
- T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa (2022). Large language models are zero-shot reasoners. Advances in neural information processing systems 35, pp. 22199–22213.
- B. N. Krishnan, A. Heydarabadipour, and H. Sauro (2026). BioModelsRAG: a biological modeling assistant using rag (retrieval augmented generation). arXiv preprint arXiv:2601.22684.
- K. Murakumo, N. Yoshikawa, K. Rikimaru, S. Nakamura, K. Furui, T. Suzuki, H. Yamasaki, Y. Nishigaya, Y. Takagi, and M. Ohue (2023). LLM drug discovery challenge: a contest as a feasibility study on the utilization of large language models in medicinal chemistry. In AI for Accelerated Materials Design-NeurIPS 2023 Workshop.
- S. Park, H. Han, H. Kim, and S. Choi (2022). Machine learning applications for chemical reactions. Chemistry–An Asian Journal 17(14), pp. e202200203.
- D. Probst, P. Schwaller, and J. Reymond (2022). Reaction classification and yield prediction using the differential reaction fingerprint drfp. Digital discovery 1(2), pp. 91–97.
- Y. Qin, S. Hu, Y. Lin, W. Chen, N. Ding, G. Cui, Z. Zeng, X. Zhou, Y. Huang, C. Xiao, et al. (2024a). Tool learning with foundation models. ACM Computing Surveys 57(4), pp. 1–40.
- Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, et al. (2024b). ToolLLM: facilitating large language models to master 16000+ real-world apis. In The Twelfth International Conference on Learning Representations.
- C. Qu, S. Dai, X. Wei, H. Cai, S. Wang, D. Yin, J. Xu, and J. Wen (2025). Tool learning with large language models: a survey. Frontiers of Computer Science 19(8), pp. 198343.
- L. Rokach (2010). Ensemble-based classifiers. Artificial intelligence review 33(1), pp. 1–39.
- O. Rubin, J. Herzig, and J. Berant (2022). Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 conference of the North American chapter of the association for computational linguistics: human language technologies, pp. 2655–2671.
- P. Schwaller, T. Laino, T. Gaudin, P. Bolgar, C. A. Hunter, C. Bekas, and A. A. Lee (2019). Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS central science 5(9), pp. 1572–1583.
- N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean (2017). Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations.
- A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, et al. (2025). Openai gpt-5 system card. arXiv preprint arXiv:2601.03267.
- R. G. Soares, A. Santana, A. M. Canuto, and M. C. P. de Souto (2006). Using accuracy and diversity to select classifiers to build ensembles. In The 2006 IEEE international joint conference on neural network proceedings, pp. 1310–1316.
- W. A. Warr (2014). A short review of chemical reaction database systems, computer-aided synthesis design, reaction prediction and synthetic feasibility. Molecular informatics 33(6-7), pp. 469–476.
- T. Woloszynski, M. Kurzynski, P. Podsiadlo, and G. W. Stachowiak (2012). A measure of competence based on random classification for dynamic ensemble selection. Information Fusion 13(3), pp. 207–213.
- F. Yang, J. Liu, Q. Zhang, Z. Yang, J. Liu, and G. Wu (2024). Subgraph-based self-supervised learning framework for enzymatic reaction feasibility prediction. In 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 779–784.
- J. Ye, G. Li, S. Gao, C. Huang, Y. Wu, S. Li, X. Fan, S. Dou, T. Ji, Q. Zhang, et al. (2025). Tooleyes: fine-grained evaluation for tool learning capabilities of large language models in real-world scenarios. In Proceedings of the 31st international conference on computational linguistics, pp. 156–187.
- B. Yu, D. Adu-Ampratwum, F. N. Baker, B. Zhou, Z. Chen, R. Averly, Y. Liu, W. Gao, X. Ning, and H. Sun (2026). FREA: benchmarking chemical reaction feasibility with systematic negatives. GitHub. Note: [https://github.com/OSU-NLP-Group/FREA](https://github.com/OSU-NLP-Group/FREA).
- H. Zhong, Y. Liu, H. Sun, Y. Liu, R. Zhang, B. Li, Y. Yang, Y. Huang, F. Yang, F. S. Mak, et al. (2025). Towards global reaction feasibility and robustness prediction with high throughput data and bayesian deep learning. Nature Communications 16(1), pp. 4522.

## Appendix A Tool Set Construction and Training Details

Table A1: Dataset for constructing the tool set\.

| Split | \#Reactions | \#Feasible | \#Infeasible |
| --- | --- | --- | --- |
| Train | 200,000 | 40,000 | 160,000 |
| Validation | 25,000 | 5,000 | 20,000 |

##### Training Data

The individual tools are trained using the FREA dataset \(Yu et al\., [2026](https://arxiv.org/html/2605.07103#bib.bib56)\), which is derived from the U\.S\. Patent & Trademark Office \(USPTO; [https://figshare\.com/articles/dataset/Chemical\_reactions\_from\_US\_patents\_1976\-Sep2016\_/5104873](https://figshare.com/articles/dataset/Chemical_reactions_from_US_patents_1976-Sep2016_/5104873)\)\. The detailed data statistics are summarized in Table [A1](https://arxiv.org/html/2605.07103#A1.T1)\. Following the evaluation protocol in Section [4](https://arxiv.org/html/2605.07103#S4.SS0.SSS0.Px1), the reactions used for training individual tools are fully disjoint from those used in validation and testing \(Table [1](https://arxiv.org/html/2605.07103#S4.T1)\)\. This separation ensures that the performance of ARMOR reflects its ability to leverage multiple tools, rather than gains from overlapping training data\.

##### Tool Set

As introduced in Section [4](https://arxiv.org/html/2605.07103#S4.SS0.SSS0.Px2), the tool set contains 13 tools: 3 classification\-based feasibility predictors, 2 forward\-generation models, and 8 LLM\-based feasibility reasoners\. For LLM\-based methods, the subscripts Llama and T3Q indicate that the method uses Llama\-3\.1\-8B\-Instruct \(AI@Meta, [2024](https://arxiv.org/html/2605.07103#bib.bib4)\) and T3Q\-Qwen\-14B \(JungZoona, [2025](https://arxiv.org/html/2605.07103#bib.bib40)\), respectively, as the base LLM\.

##### \(1\) Classification\-based Feasibility Predictors

- BERT \(Devlin et al\., [2019](https://arxiv.org/html/2605.07103#bib.bib38)\)\. We fine\-tune bert\-base\-uncased as a sequence classification baseline for reaction feasibility prediction\. Each reaction is represented as a plain text string by concatenating the reactant and product SMILES with a \>\> separator \(i\.e\., reactants\>\>product\), which is then tokenized using BERT’s native WordPiece tokenizer and truncated to a maximum length of 128 tokens\. The model is trained for 50 epochs using AdamW \(weight decay = 0\.01\) with a learning rate of 2e\-5, a linear warmup over the first 10% of training steps, and mixed\-precision \(fp16\) training\. The checkpoint with the highest F1 score on the validation set is selected as the final model\.
- DRFP \(Probst et al\., [2022](https://arxiv.org/html/2605.07103#bib.bib37)\)\. Instead of text\-based tokenization, we encode each reaction using the Differential Reaction Fingerprint \(DRFP\), which produces a 2048\-dimensional binary vector capturing the structural changes between reactants and products\. The fingerprint is fed into a two\-layer MLP \(2048 → 256 → 2\) with ReLU activation and dropout \(0\.1\)\. All other training settings follow those of BERT above\.
- Dora\_xgb \(Chainani et al\., [2025](https://arxiv.org/html/2605.07103#bib.bib11)\)\. We directly deploy the pre\-trained DORA\-XGB model \(https://github\.com/tyo\-nu/Dora\_xgb\) without any fine\-tuning\. The model encodes each reaction using ECFP4 fingerprints \(2048 bits\), where molecular fingerprints of individual species are arranged by descending molecular weight to form a fixed\-length reaction fingerprint, which is then fed into a pre\-trained XGBoost classifier to predict feasibility\.
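The BERT input format and learning-rate warmup described above can be sketched as follows. This is a minimal illustration, not the authors' training code: the function names are hypothetical, and the linear decay to zero after warmup is an assumption, since the text specifies only the warmup phase.

```python
def reaction_text(reactants: str, product: str) -> str:
    """Concatenate reactant and product SMILES with the '>>' separator."""
    return f"{reactants}>>{product}"


def learning_rate(step: int, total_steps: int,
                  base_lr: float = 2e-5, warmup_frac: float = 0.1) -> float:
    """Linear warmup over the first `warmup_frac` of steps.

    The linear decay to zero afterwards is an assumption; the paper
    states only the warmup.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / remaining)


if __name__ == "__main__":
    print(reaction_text("CCO.CC(=O)O", "CC(=O)OCC"))
    for s in (0, 50, 100, 550, 1000):
        print(s, learning_rate(s, total_steps=1000))
```

With 1,000 total steps, the rate climbs linearly during the first 100 steps, peaks at 2e-5, and then falls under the assumed decay.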

##### \(2\) Forward\-Generation Models

- Chemformer \(Irwin et al\., [2022](https://arxiv.org/html/2605.07103#bib.bib13)\)\. We deploy the pre\-trained Chemformer forward synthesis model \(https://github\.com/MolecularAI/Chemformer\) without fine\-tuning\. Given the reactant SMILES, the model performs beam search to generate the top\-10 most likely products along with their log\-likelihoods\. A reaction is predicted as feasible if the given product appears among the top\-5 candidates \(after canonicalization and removal of atom mapping\)\.
- Molecular \(Schwaller et al\., [2019](https://arxiv.org/html/2605.07103#bib.bib12)\)\. Following the same inference strategy as Chemformer, we deploy the pre\-trained Molecular Transformer \(https://github\.com/pschwllr/MolecularTransformer\) for forward reaction prediction\. The reactant SMILES is first canonicalized and then tokenized at the atom level before being fed into the model, which generates the top\-5 predicted products\. Feasibility is determined by whether the given product is present in the top\-5 predictions\.
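The top-k membership rule shared by both forward-generation tools can be sketched as below. The canonicalizer here is a placeholder (an assumption): a real pipeline would canonicalize SMILES and strip atom mapping with a cheminformatics toolkit, whereas this stand-in only trims whitespace so that the example stays self-contained.

```python
from typing import Callable, Sequence


def strip_ws(smiles: str) -> str:
    """Placeholder canonicalizer (assumption): trims whitespace only."""
    return smiles.strip()


def is_feasible_by_topk(candidates: Sequence[str], product: str, k: int = 5,
                        canonicalize: Callable[[str], str] = strip_ws) -> bool:
    """Predict feasible iff the given product is among the top-k candidates.

    `candidates` is assumed to be ordered from most to least likely,
    e.g. the beam-search output of a forward-synthesis model.
    """
    target = canonicalize(product)
    return any(canonicalize(c) == target for c in candidates[:k])


if __name__ == "__main__":
    beams = ["CCO", "CCN", "CC(=O)OCC", "CCC", "CCCl",
             "CCBr", "c1ccccc1", "CO", "CN", "CC"]
    print(is_feasible_by_topk(beams, "CC(=O)OCC"))   # rank 3, inside top-5
    print(is_feasible_by_topk(beams, "c1ccccc1"))    # rank 7, outside top-5
```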

##### \(3\) LLM\-based Feasibility Reasoners

- Prompt$_{\texttt{Llama}}$ / Prompt$_{\texttt{T3Q}}$\. We directly prompt two instruction\-tuned LLMs, Llama\-3\.1\-8B\-Instruct \(AI@Meta, [2024](https://arxiv.org/html/2605.07103#bib.bib4)\) and T3Q\-Qwen\-14B \(JungZoona, [2025](https://arxiv.org/html/2605.07103#bib.bib40)\), for zero\-shot feasibility prediction\. Given only the reaction SMILES \(reactants\>\>product\), each LLM is asked to evaluate whether the reaction is chemically feasible under plausible laboratory conditions, based solely on its internalized chemistry knowledge\.
- ICL$_{\texttt{Llama}}$ / ICL$_{\texttt{T3Q}}$\. We augment the same two LLMs with retrieved in\-context examples\. For each query reaction, we use its DRFP fingerprint \(Probst et al\., [2022](https://arxiv.org/html/2605.07103#bib.bib37)\) to retrieve the top\-3 most similar feasible reactions and the top\-3 most similar infeasible reactions from the training set via a FAISS binary index with Hamming distance\. These six retrieved reactions are provided to the LLM alongside the query, and the model is asked to make a feasibility judgment informed by the retrieved evidence\.
- Chemformer$_{\texttt{Llama}}$ / Chemformer$_{\texttt{T3Q}}$\. We augment the same two LLMs with forward synthesis predictions from Chemformer \(Irwin et al\., [2022](https://arxiv.org/html/2605.07103#bib.bib13)\)\. For each query reaction, Chemformer generates the top\-10 candidate products via beam search, and the top\-5 \(after canonicalization and removal of atom mapping\) are provided to the LLM as prior evidence\. The LLM then makes a final feasibility judgment by reasoning over whether the given product is consistent with the predicted candidates\.
- Molecular$_{\texttt{Llama}}$ / Molecular$_{\texttt{T3Q}}$\. Following the same strategy as above, we replace Chemformer with the Molecular Transformer \(Schwaller et al\., [2019](https://arxiv.org/html/2605.07103#bib.bib12)\) to generate the top\-5 candidate products, which are likewise provided to the LLM as contextual evidence for feasibility judgment\.
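The demonstration retrieval used by the ICL tools can be illustrated with a brute-force Hamming search. This is a stand-in for the FAISS binary index named above, not its API: fingerprints are modelled as Python integers whose bits are the fingerprint bits, and the `Entry` layout and function names are hypothetical.

```python
from typing import List, Tuple

# (reaction idx, fingerprint bits packed into an int, label: 1 feasible / 0 infeasible)
Entry = Tuple[int, int, int]


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two packed binary fingerprints."""
    return bin(a ^ b).count("1")


def retrieve_demos(query_fp: int, train: List[Entry], k: int = 3) -> List[Entry]:
    """Return the k nearest feasible and k nearest infeasible training reactions.

    Ties are broken by reaction idx (an assumption made for determinism).
    """
    demos: List[Entry] = []
    for label in (1, 0):
        pool = [e for e in train if e[2] == label]
        pool.sort(key=lambda e: (hamming(query_fp, e[1]), e[0]))
        demos.extend(pool[:k])
    return demos


if __name__ == "__main__":
    train = [(0, 0b1111, 1), (1, 0b1110, 1), (2, 0b0000, 1),
             (3, 0b1100, 1), (4, 0b1111, 0), (5, 0b0001, 0)]
    for idx, fp, label in retrieve_demos(0b1111, train):
        print(idx, format(fp, "04b"), label)
```

Retrieving per class keeps the prompt balanced (three feasible and three infeasible examples) even when the nearest neighbours are dominated by one label.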

## Appendix B Baselines

To demonstrate the effectiveness of ARMOR, we compare it with a range of tool selection baselines, which can be grouped into four categories:

- Single\-Tool Methods\. We report the performance of each individual tool in the tool set constructed in Appendix [A](https://arxiv.org/html/2605.07103#A1), serving as fundamental baselines\.
- Statistical Methods\. StaticSel \(Britto Jr et al\., [2014](https://arxiv.org/html/2605.07103#bib.bib48)\) performs majority voting over a fixed subset of tools selected based on development performance\. In contrast, we also include simple aggregation baselines applied to all tools, including majority voting, weighted voting, and random selection\.
- Dynamic Methods\. These methods estimate tool competence based on the validation set and select tools dynamically for each instance during testing\. KNORA\-E \(Ko et al\., [2008](https://arxiv.org/html/2605.07103#bib.bib49)\) selects tools that correctly classify all samples in the $k$\-nearest neighborhood of a test reaction, while KNORA\-U relaxes this requirement to at least one sample\. DES \(Soares et al\., [2006](https://arxiv.org/html/2605.07103#bib.bib54)\) selects tools by jointly considering tool accuracy and diversity\. DES\-KNN estimates these criteria within the $k$\-nearest neighborhood of each test reaction, whereas DES\-Clustering evaluates them within cluster\-defined regions\. In addition, we include HarderMoE \(Huang et al\., [2024](https://arxiv.org/html/2605.07103#bib.bib51)\), which performs dynamic expert routing based on input difficulty\.
- LLM\-based Methods\. ToolEyes \(Ye et al\., [2025](https://arxiv.org/html/2605.07103#bib.bib55)\) employs LLM\-based scoring to assess tool utility and select tools for each instance, where the same LLM backbone is adopted as in ARMOR for a fair comparison\. We also include closed\-source LLMs, including GPT\-5\.4\-mini \(Singh et al\., [2025](https://arxiv.org/html/2605.07103#bib.bib57)\), DeepSeek\-v4\-flash \(DeepSeek\-AI, [2026](https://arxiv.org/html/2605.07103#bib.bib59)\), and Claude\-Sonnet\-4\.6 \(Anthropic, [2026](https://arxiv.org/html/2605.07103#bib.bib58)\), which are prompted to evaluate tool utilities on the validation set and perform tool selection during testing\.
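The KNORA selection criteria and the majority-vote aggregation mentioned above can be sketched as follows. The input layout, the fallback to all tools when none qualifies, and the tie-breaking rule are assumptions made for illustration; the text specifies only the selection criteria themselves.

```python
from collections import Counter
from typing import Dict, List


def knora_select(neighbor_correct: Dict[str, List[bool]], mode: str = "E") -> List[str]:
    """Select tools from their correctness on the k nearest validation reactions.

    neighbor_correct[tool][j] is True if `tool` was right on the j-th neighbor.
    mode "E": keep tools correct on all neighbors; mode "U": on at least one.
    """
    test = all if mode == "E" else any
    chosen = [tool for tool, hits in neighbor_correct.items() if test(hits)]
    # Fallback when no tool qualifies: use every tool (an assumption).
    return chosen or list(neighbor_correct)


def majority_vote(predictions: Dict[str, int], tools: List[str]) -> int:
    """Majority vote over binary predictions; ties go to 0 (an assumption)."""
    votes = Counter(predictions[t] for t in tools)
    return 1 if votes[1] > votes[0] else 0


if __name__ == "__main__":
    correct = {"BERT": [True, True, True],
               "DRFP": [True, False, True],
               "Dora_xgb": [False, False, False]}
    preds = {"BERT": 1, "DRFP": 0, "Dora_xgb": 0}
    for mode in ("E", "U"):
        tools = knora_select(correct, mode)
        print(mode, tools, majority_vote(preds, tools))
```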

## Appendix C Impact of Demonstrations

Table A2: Impact of the number of demonstrations \($K$\) on accuracy \(ACC, %\)\.

| \# Demonstrations | Overall | Feasible | Infeasible |
| --- | --- | --- | --- |
| $K=0$ | 90\.80 | 92\.53 | 89\.07 |
| $K=2$ | 91\.12 | 92\.57 | 89\.67 |
| $K=4$ | 91\.30 | 92\.50 | 90\.10 |
| $K=8$ | 91\.62 | 91\.57 | 91\.67 |

In this section, we analyze the impact of the number of retrieved demonstrations \($K$\) in the memory\-augmented conflict resolution part \(Section [3\.3](https://arxiv.org/html/2605.07103#S3.SS3)\)\. As shown in Table [A2](https://arxiv.org/html/2605.07103#A3.T2), the overall accuracy improves from 90\.80% to 91\.62% as $K$ increases from 0 to 8, suggesting that incorporating more demonstrations provides more informative evidence for resolving tool conflicts and improving tool selection\. Meanwhile, we also observe a trade\-off between the feasible and infeasible classes: a smaller $K$ \(e\.g\., $K=2$\) achieves the best accuracy on feasible reactions \(92\.57%\), while a larger $K$ \(e\.g\., $K=8$\) raises the accuracy on infeasible reactions from 89\.07% to 91\.67%, leading to more balanced overall results\. This behavior suggests that when $K$ is small, the model relies on limited evidence, which may lead to suboptimal or skewed tool selection decisions\. As $K$ increases, the retrieved instances become more diverse, providing richer evidence for resolving conflicts among tools and thus reducing bias in the final predictions\.
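The class-balance trend in Table A2 can be checked with a short calculation. The numbers are copied from the table; the observation that each overall accuracy equals the mean of the two per-class accuracies is consistent with equally sized feasible and infeasible test subsets, which is an inference here rather than something this appendix states.

```python
# (overall, feasible, infeasible) accuracy in % for each K, from Table A2.
TABLE_A2 = {
    0: (90.80, 92.53, 89.07),
    2: (91.12, 92.57, 89.67),
    4: (91.30, 92.50, 90.10),
    8: (91.62, 91.57, 91.67),
}


def class_gap(feasible: float, infeasible: float) -> float:
    """Absolute accuracy gap between the two classes, in percentage points."""
    return round(abs(feasible - infeasible), 2)


def class_mean(feasible: float, infeasible: float) -> float:
    """Unweighted mean of the two per-class accuracies."""
    return round((feasible + infeasible) / 2, 2)


if __name__ == "__main__":
    for k, (overall, feas, infeas) in TABLE_A2.items():
        print(f"K={k}: overall={overall:.2f} "
              f"mean={class_mean(feas, infeas):.2f} gap={class_gap(feas, infeas):.2f}")
```

The gap shrinks at every step as $K$ grows, which matches the "more balanced" reading of the table.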

## Appendix D Tool Selection Distribution

Table A3: Tool selection distribution and performance on the Conflict set and the full test set\. Level shows the level \($\mathcal{T}^{(1)}$, $\mathcal{T}^{(2)}$\) from which the tool comes\. The Conflict set consists of reactions where the tools in $\mathcal{T}^{(s)}$ produce conflicting predictions\. For each tool, ↗ indicates that the accuracy on the selected conflict cases is higher than that on the full test set, while ↘ indicates the opposite\. “–” denotes that the metric is not applicable\.

| Tool | Level | Proportion \(%\) | Conflict Set Count | Conflict Set ACC \(%\) | Full Test Set Count | Full Test Set ACC \(%\) | Trend |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Chemformer$_{\texttt{Llama}}$ | $\mathcal{T}^{(2)}$ | 44\.42 | 490 | 77\.35 | 6000 | 60\.55 | ↗ |
| BERT | $\mathcal{T}^{(1)}$ | 31\.10 | 343 | 82\.51 | 6000 | 87\.90 | ↘ |
| ICL$_{\texttt{Llama}}$ | $\mathcal{T}^{(2)}$ | 14\.69 | 162 | 80\.25 | 6000 | 55\.12 | ↗ |
| Molecular$_{\texttt{Llama}}$ | $\mathcal{T}^{(2)}$ | 3\.26 | 36 | 83\.33 | 6000 | 60\.93 | ↗ |
| Chemformer$_{\texttt{T3Q}}$ | $\mathcal{T}^{(2)}$ | 2\.45 | 27 | 70\.37 | 6000 | 69\.98 | ↗ |
| ICL$_{\texttt{T3Q}}$ | $\mathcal{T}^{(2)}$ | 1\.54 | 17 | 100\.00 | 6000 | 63\.38 | ↗ |
| Molecular$_{\texttt{T3Q}}$ | $\mathcal{T}^{(2)}$ | 0\.73 | 8 | 37\.50 | 6000 | 71\.93 | ↘ |
| Chemformer | $\mathcal{T}^{(2)}$ | 0\.45 | 5 | 80\.00 | 6000 | 81\.88 | ↘ |
| Molecular | $\mathcal{T}^{(1)}$ | 0\.36 | 4 | 50\.00 | 6000 | 82\.30 | ↘ |
| Unused tools | | | | | | | |
| Dora\_xgb | $\mathcal{T}^{(2)}$ | 0\.00 | 0 | – | 6000 | 50\.80 | |
| DRFP | $\mathcal{T}^{(1)}$ | 0\.00 | 0 | – | 6000 | 80\.62 | |
| Prompting$_{\texttt{Llama}}$ | $\mathcal{T}^{(2)}$ | 0\.00 | 0 | – | 6000 | 54\.02 | |
| Prompting$_{\texttt{T3Q}}$ | $\mathcal{T}^{(2)}$ | 0\.00 | 0 | – | 6000 | 58\.07 | |
| Failure \(no tool selected\) | / | 1\.00 | 11 | – | – | – | |

To better understand the behavior of ARMOR, we focus on the 1,103 cases where tools in $\mathcal{T}^{(s)}$ produce conflicting predictions \(refer to Table [3](https://arxiv.org/html/2605.07103#S5.T3) for the detailed definition of these 1,103 reactions\)\. Table [A3](https://arxiv.org/html/2605.07103#A4.T3) reports the distribution of ARMOR’s final selected tools in memory\-augmented conflict resolution \(Section [3\.3](https://arxiv.org/html/2605.07103#S3.SS3)\)\. It further presents the level \($\mathcal{T}^{(1)}$, $\mathcal{T}^{(2)}$\) from which each tool comes, together with its accuracy on the selected conflict cases and the full test set\.

$\mathcal{T}^{(1)}$ consists of BERT, Molecular, and DRFP, which correspond to the globally strong tools\. However, in the Conflict set, only BERT is frequently selected \(343 times\), while Molecular is rarely selected \(4 times\) and DRFP is never selected\. This indicates that tools with strong global performance are not always suitable for resolving these specific cases\. Instead, several tools in $\mathcal{T}^{(2)}$ are selected more frequently\. In particular, Chemformer$_{\texttt{Llama}}$ is selected 490 times, achieving 77\.35% accuracy on the conflict cases where it is selected, compared to 60\.55% on the full test set, and ICL$_{\texttt{Llama}}$ is selected 162 times, improving accuracy from 55\.12% to 80\.25% on its selected conflict cases\. These substantial improvements demonstrate that ARMOR can effectively identify scenarios where these tools are particularly appropriate, even though their overall performance is not the strongest\.

In addition, some tools exhibit highly specialized behavior\. For example, ICL$_{\texttt{T3Q}}$ is selected only 17 times, but achieves perfect accuracy \(100%\) on these selected cases, compared to 63\.38% on the full test set\. This suggests that certain tools are highly accurate and stable under specific conditions, highlighting the importance of modeling fine\-grained tool utilities\. We also observe that, for some tools, the accuracy on their selected cases is lower than that on the full test set, e\.g\., BERT \(82\.51% vs\. 87\.90%\), which is the strongest tool in terms of accuracy on the full test set\. This indicates that while ARMOR can effectively leverage diverse tools, there remains room for improvement in better utilizing the globally strongest tool in difficult scenarios\. Finally, a small portion of cases \(1\.00%, 11 reactions\) result in failure, where no suitable tool can be identified and the prediction degrades to directly asking the LLM\. This also leaves a limited number of instances for further improvement\.
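The proportions and trend arrows in Table A3 follow directly from the selection counts and accuracies. The sketch below recomputes them from the numbers reported in the table; the dictionary keys are shorthand labels chosen for this example.

```python
# tool -> (times selected on conflict cases, ACC % on those cases, ACC % on full test set)
SELECTED = {
    "Chemformer_Llama": (490, 77.35, 60.55),
    "BERT": (343, 82.51, 87.90),
    "ICL_Llama": (162, 80.25, 55.12),
    "Molecular_Llama": (36, 83.33, 60.93),
    "Chemformer_T3Q": (27, 70.37, 69.98),
    "ICL_T3Q": (17, 100.00, 63.38),
    "Molecular_T3Q": (8, 37.50, 71.93),
    "Chemformer": (5, 80.00, 81.88),
    "Molecular": (4, 50.00, 82.30),
}
FAILURES = 11  # conflict cases where no tool was selected

TOTAL = sum(count for count, _, _ in SELECTED.values()) + FAILURES


def proportion(count: int) -> float:
    """Share of the conflict cases, in percent, rounded as in the table."""
    return round(100 * count / TOTAL, 2)


def trend(tool: str) -> str:
    """'up' if accuracy on selected conflict cases exceeds full-test accuracy."""
    _, conflict_acc, full_acc = SELECTED[tool]
    return "up" if conflict_acc > full_acc else "down"


if __name__ == "__main__":
    print("total conflict cases:", TOTAL)
    for tool, (count, _, _) in SELECTED.items():
        print(f"{tool:18s} {proportion(count):6.2f}% {trend(tool)}")
    print(f"{'failure':18s} {proportion(FAILURES):6.2f}%")
```

The counts sum to the 1,103 conflict cases, so the Proportion column is each count divided by 1,103.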

## Appendix E Limitations & Discussions

In this paper, ARMOR relies on a validation set with ground\-truth labels to construct the tool hierarchy and to characterize tool utilities\. While this enables effective modeling of tool strengths, it assumes access to sufficient labeled data for effective utility characterization, which may not always be available in practical deployment\. A promising direction for future work is to leverage ARMOR’s own outcomes as weak supervision signals, enabling continual refinement of tool utilities during inference\. Such an online adaptation mechanism could allow ARMOR to evolve over time and better capture dynamic tool behaviors\.

Beyond the current experimental setting, the proposed ARMOR framework is not restricted to reaction feasibility prediction\. The key idea of explicitly modeling tool utilities and selecting appropriate tools for each input instance can be naturally extended to other domains, such as AI4Science tasks like molecular retrosynthesis, as well as more general applications such as personalized content generation and recommendation, where different tools exhibit varying performance across inputs\. Future work will explore the generalization of ARMOR to diverse tasks and investigate how to adapt utility modeling to different tool types and data distributions\.

## Appendix F Impact Statement

This work focuses on improving how multiple tools are leveraged for reaction feasibility prediction, which may benefit AI\-assisted scientific research and chemical synthesis planning\. By explicitly modeling tool utilities and providing a transparent decision\-making process, our framework can enhance the reliability and interpretability of AI systems in scientific workflows, potentially facilitating more efficient exploration of chemical reactions and accelerating discovery\.

Potential negative impacts are limited but may arise from incorrect predictions\. Inaccurate feasibility assessments could mislead downstream decision\-making in chemical synthesis, leading to inefficient use of resources or incorrect experimental directions\. In addition, over\-reliance on automated tool selection systems without expert validation may amplify such risks in practical deployment\. To mitigate these risks, our framework emphasizes interpretability and transparency, allowing users to better understand and verify tool selection decisions\. We recommend that such systems be used in a human\-in\-the\-loop manner, particularly in high\-stakes scientific applications, where expert validation remains essential\.

## Appendix G Prompts

We provide all prompts used by ARMOR here\.

### G\.1 Pattern Extraction

We use the following prompt to extract patterns following Eq\. [1](https://arxiv.org/html/2605.07103#S3.E1)\.

You are a careful chemistry scientist\.

Follow the user’s instructions exactly\.

You must output ONLY valid JSON text\.

Do not use markdown, code blocks, or any text outside the JSON object\.

You will be given a single dataset file related to reaction feasibility prediction\.

The file is in JSON Lines format, where each line corresponds to one chemical reaction\.

Each reaction entry includes:

\- idx: an integer identifier for referencing specific reactions;

\- reactants: a SMILES string representing the reactant molecules;

\- product: a SMILES string representing the reaction product;

\- label: a ground\-truth binary label indicating whether the reaction is feasible;

\- prediction from a single tool, where the prediction may be 0, 1, or NA \(missing\)\.

Your task is to analyze predictions from the tool only and provide a concise, pattern\-level summary of this tool’s behavior\.

Focus on recurring trends rather than exhaustive coverage\.

You must complete the task in a single response\.

Do NOT ask to split the task across messages\.

Do NOT request scope confirmation\.

Treat NA, None, or missing predictions as WRONG when computing ACC\. Please compute exact ACC and extract representative indices programmatically\.

Strict rules \(must follow\):

• Output valid JSON only\. Do NOT use code blocks or markdown\. Do NOT include any text outside the JSON object\.

IMPORTANT PRACTICAL CONSTRAINTS:

• Report reaction patterns the tool is often CORRECT on

• Patterns should capture recurring reaction behaviors and be formulated as decision\-relevant categories, such that a given reaction can be judged as either belonging to or not belonging to the pattern based on observable characteristics\.

Output the following structure exactly as a single JSON object:

\{

"tool\_acc": "xx\.xx%",

"often\_correct\_on": \[

\{

"name": "Short, human\-readable, chemistry\-level name\.",

"explanation": "Describe explicit, observable reaction characteristics \(e\.g\., bond changes, functional groups appearing or disappearing, or structural motifs\) that enable an LLM to make a YES/NO decision on whether a given reaction belongs to this pattern\. Where possible, the explanation should specify both inclusion cues \(what must be present\) and exclusion cues \(what would disqualify a reaction from the pattern\)\.",

"examples\_idx": \[0, 0, 0, 0, 0\]

\}

\]

\}

Hard constraints:

• Reaction pattern names must be short and human\-readable\.

• Explanations may include chemical concepts, structural cues, or transformation characteristics, as long as it helps the model make a reliable yes/no decision about whether a given reaction belongs to the pattern\. Explanations should prioritize decisiveness and discriminability over abstract or high\-level chemical generalization\.

• Each examples\_idx must contain exactly 5 integers that appear as idx values in the dataset\.

Analyze ONLY the dataset provided next\. After completing the analysis, internally verify whether the generated reaction patterns and explanations can correctly classify the listed example reactions\. If mismatches are found, refine the pattern definitions and explanations to resolve the inconsistencies\. Perform this verification and refinement silently, and output ONLY the final, consolidated results that conform to the required output structure\. Do NOT output intermediate versions, alternative drafts, or self\-correction steps\.

Dataset input:

\{dataset\_text\}
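The accuracy rule and output constraints stated in this prompt can be checked programmatically. The sketch below is one possible post-processing step, not part of the released code: the function names are hypothetical, and the row layout mirrors the fields listed in the prompt.

```python
import json
from typing import List, Set


def tool_accuracy(rows: List[dict]) -> float:
    """ACC in percent; NA, None, or missing predictions count as wrong."""
    correct = sum(
        1 for r in rows
        if r.get("prediction") in (0, 1) and r["prediction"] == r["label"]
    )
    return 100.0 * correct / len(rows)


def validate_patterns(raw: str, valid_idx: Set[int]) -> dict:
    """Parse the model reply and enforce the prompt's hard constraints."""
    obj = json.loads(raw)
    if not str(obj.get("tool_acc", "")).endswith("%"):
        raise ValueError("tool_acc must be a percentage string")
    for pattern in obj.get("often_correct_on", []):
        if not pattern.get("name") or not pattern.get("explanation"):
            raise ValueError("each pattern needs a name and an explanation")
        idx = pattern.get("examples_idx", [])
        if len(idx) != 5 or not all(isinstance(i, int) and i in valid_idx for i in idx):
            raise ValueError("examples_idx must hold exactly 5 dataset idx values")
    return obj


if __name__ == "__main__":
    rows = [
        {"idx": 0, "label": 1, "prediction": 1},
        {"idx": 1, "label": 0, "prediction": "NA"},
        {"idx": 2, "label": 0, "prediction": 0},
        {"idx": 3, "label": 1, "prediction": None},
    ]
    print(f"{tool_accuracy(rows):.2f}%")
```

Recomputing ACC outside the LLM gives a cheap consistency check against the `tool_acc` value the model reports.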

### G\.2 Pattern Matching

The prompt used in pattern refinement, pattern consolidation, and pattern\-based tool selection is as follows:

You are a careful chemistry reaction\-rule evaluator\.

Follow the user’s instructions exactly\.

You must output ONLY valid JSON \(no markdown, no extra text\)\.

You will be given ONE reaction rule with name, explanation, and an example\.

Your task: for THIS rule ONLY, judge whether THIS example belongs to the rule\.

Instructions:

\- Judge this example independently\.

\- Do NOT assume the example is correct\.

\- Base judgment only on:

\(1\) rule name,

\(2\) rule explanation,

\(3\) example reactants/product SMILES\.

\- When making the judgment, treat the rule explanation as the primary and authoritative criterion for deciding whether an example belongs to the rule; the rule name serves only as a high\-level label\.

Output exactly ONE JSON object with format:

\{

"name": <rule name\>,

"idx": <example\_idx\>,

"belongs\_to\_rule": true/false,

"confidence": "high" \| "medium" \| "low",

"reason": "brief explanation \(1\-2 sentences\)"

\}

Output requirements:

\- Output MUST be valid JSON\.

\- Output ONE JSON object only\.

\- Do NOT include any text outside the JSON object\.

\- Do NOT repeat the input examples verbatim\.

Rule input:

\{

"name": \{rule\_name\},

"explanation": \{rule\_explanation\}

\}

Example:

\{example\_json\}
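A reply to this prompt can be validated against the stated schema before it is used. The sketch below is illustrative only: the function names are hypothetical, and how the `confidence` field is consumed downstream is not specified here, so it is merely checked and passed through.

```python
import json
from typing import Iterable, List

REQUIRED_KEYS = {"name", "idx", "belongs_to_rule", "confidence", "reason"}
ALLOWED_CONFIDENCE = {"high", "medium", "low"}


def parse_verdict(raw: str) -> dict:
    """Parse one pattern-matching reply and check it against the schema."""
    obj = json.loads(raw)
    missing = REQUIRED_KEYS - obj.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(obj["belongs_to_rule"], bool):
        raise ValueError("belongs_to_rule must be true or false")
    if obj["confidence"] not in ALLOWED_CONFIDENCE:
        raise ValueError("confidence must be high, medium, or low")
    return obj


def matched_rules(raw_replies: Iterable[str]) -> List[str]:
    """Names of the rules that the example was judged to belong to."""
    return [v["name"] for v in map(parse_verdict, raw_replies) if v["belongs_to_rule"]]


if __name__ == "__main__":
    replies = [
        '{"name": "Ester hydrolysis", "idx": 7, "belongs_to_rule": true, '
        '"confidence": "high", "reason": "Ester is converted to acid."}',
        '{"name": "Amide coupling", "idx": 7, "belongs_to_rule": false, '
        '"confidence": "medium", "reason": "No amine is present."}',
    ]
    print(matched_rules(replies))
```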

### G\.3 Pattern Consolidation

The prompt used in pattern consolidation is as follows:

You are a careful chemistry scientist\.

Follow the user’s instructions exactly\.

You must output ONLY valid JSON text\.

Do not use markdown, code blocks, or any text outside the JSON object\.

You are given \{n\_rules\} reaction\-pattern rules that all share the same name: "\{rule\_name\}"\.

These rules describe the same concept but were written separately with slightly different wording\.

Your task is to keep exactly ONE best rule and remove all others\.

Criteria for the rule to keep:

\- Most precise and complete chemistry description

\- Clearest language

You MUST remove all rules except the best one\. Do NOT keep more than one\.

Return ONLY valid JSON \(no extra text\) in this exact schema:

\{

"keep\_index": <integer\>,

"reason": "<1\-2 sentences\>"

\}

Here are the candidates:

\{candidates\_json\}
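The consolidation step around this prompt can be sketched as grouping rules by name and keeping the candidate the model points to. This is a hypothetical driver, not the released implementation: `ask_llm` stands in for the actual model call, and treating `keep_index` as a 0-based position in the candidate list is an assumption.

```python
import json
from collections import defaultdict
from typing import Callable, Dict, List


def consolidate(rules: List[dict],
                ask_llm: Callable[[str, List[dict]], str]) -> List[dict]:
    """Keep exactly one rule per name.

    `ask_llm(name, candidates)` must return the raw JSON reply to the
    consolidation prompt; it is only called for names with duplicates.
    """
    groups: Dict[str, List[dict]] = defaultdict(list)
    for rule in rules:
        groups[rule["name"]].append(rule)

    kept: List[dict] = []
    for name, candidates in groups.items():
        if len(candidates) == 1:
            kept.append(candidates[0])
            continue
        reply = json.loads(ask_llm(name, candidates))
        index = reply["keep_index"]  # assumed 0-based
        if not isinstance(index, int) or not 0 <= index < len(candidates):
            raise ValueError(f"keep_index out of range for rule '{name}'")
        kept.append(candidates[index])
    return kept


if __name__ == "__main__":
    rules = [
        {"name": "Amide coupling", "explanation": "Acid and amine form an amide."},
        {"name": "Amide coupling", "explanation": "C(=O)O plus N-H gives C(=O)N; no ester."},
        {"name": "Ester hydrolysis", "explanation": "Ester becomes acid and alcohol."},
    ]
    stub = lambda name, cands: '{"keep_index": 1, "reason": "More precise."}'
    for rule in consolidate(rules, stub):
        print(rule["name"], "|", rule["explanation"])
```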

### G\.4 Memory Building

We use the following prompt to build the tool conflict memory:

You are a careful chemistry scientist\.

Follow the user’s instructions exactly\.

You must output ONLY valid JSON text\.

Do not use markdown, code blocks, or any text outside the JSON object\.

We are constructing a demonstration for tool selection\.

Reaction \(SMILES\):

\- reactants: \{reactants\}

\- product: \{product\}

\{cands\_section\}

In this demonstration, there is EXACTLY ONE trusted tool:

\- trusted tool MUST be: "\{gold\_tool\}"

\- the following tools are NOT trusted in this demonstration: \{neg\_tools\_json\}

Your job:

Write a concise and stable reasoning trace explaining WHY "\{gold\_tool\}" is chosen over the non\-trusted tools\.

Base the reasoning primarily on the transformation implied by reactants → product and the rule explanations\.

The tool\_prediction field can be noisy; do NOT decide purely by majority vote\.

Return ONLY valid JSON with this exact schema:

\{

"tool": "\{gold\_tool\}",

"evidence": \["<2\-4 bullet points tied to the transformation and specific explanations\>"\],

"elimination": \[\{"tool": "<non\_trusted\_tool\>", "why\_not": "<1 sentence\>"\} \.\.\. up to 3 items\],

"final\_reason": "<1\-2 sentences summary\>"

\}

Constraints:

\- "tool" must be exactly "\{gold\_tool\}"\.

\- "elimination" tools must be selected from the non\-trusted tools provided above and must be among: \[\{neg\_str\}\]\.

\- Do NOT mention the words: feasible, label, ground truth, training\.
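Before a generated demonstration is stored in memory, the constraints listed in this prompt can be enforced mechanically. The following is a minimal sketch with hypothetical names; the banned-word check uses a case-insensitive substring match, which is a deliberately strict assumption (it also rejects "infeasible").

```python
import json
from typing import List, Sequence

BANNED_WORDS = ("feasible", "label", "ground truth", "training")


def validate_demo(raw: str, gold_tool: str, neg_tools: Sequence[str]) -> dict:
    """Check a memory-building reply against the prompt's constraints."""
    obj = json.loads(raw)
    if obj.get("tool") != gold_tool:
        raise ValueError("tool must equal the trusted tool")

    elimination = obj.get("elimination", [])
    if len(elimination) > 3:
        raise ValueError("at most 3 elimination items are allowed")
    if any(item.get("tool") not in neg_tools for item in elimination):
        raise ValueError("elimination tools must come from the non-trusted tools")

    # Only the free-text fields are scanned for banned words.
    texts: List[str] = list(obj.get("evidence", []))
    texts += [item.get("why_not", "") for item in elimination]
    texts.append(obj.get("final_reason", ""))
    joined = " ".join(texts).lower()
    for word in BANNED_WORDS:
        if word in joined:
            raise ValueError(f"banned word in reasoning trace: {word}")
    return obj


if __name__ == "__main__":
    reply = json.dumps({
        "tool": "BERT",
        "evidence": ["Acyl transfer matches the amide-formation rule."],
        "elimination": [{"tool": "DRFP", "why_not": "Its rule describes ring closure."}],
        "final_reason": "BERT's matched rule fits the bond change most closely.",
    })
    print(validate_demo(reply, "BERT", ["DRFP", "Molecular"])["tool"])
```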

### G\.5 Tool Selection

We use the following tool selection prompt in memory\-augmented conflict resolution\.

You are a careful chemistry scientist\.

Follow the user’s instructions exactly\.

You must output ONLY valid JSON text\.

Do not use markdown, code blocks, or any text outside the JSON object\.

You are given a chemical reaction:

\- reactants \(SMILES\): \{reactants\}

\- product \(SMILES\): \{product\}

\{cands\_section\}

Selection principles:

\- Choose the tool whose matched rule best explains the actual chemical transformation\.

\- Prefer more specific and chemically plausible rule explanations over vague/generic ones\.

\{conf\_hint\}\{tiebreak\}\- If multiple candidates match well, consider internal consistency across rules\.

\- Output "abstain" only if no candidate provides a convincing match\.

\{demos\_section\}

Return ONLY valid JSON in this exact schema:

\{

"tool": <one of \[\{allowed\_str\}\]\>,

"reason": "<1\-3 sentences\>"

\}
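The reply to the tool selection prompt can be turned into a final prediction as sketched below. The names are hypothetical, and the `fallback` argument stands in for the direct LLM answer that Appendix D describes for cases where no tool is selected; treating unparseable replies as abstentions is an assumption.

```python
import json
from typing import Dict, Optional, Sequence


def parse_selection(raw: str, allowed: Sequence[str]) -> Optional[str]:
    """Return the selected tool, or None for 'abstain' or an invalid reply."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None  # unparseable replies are treated as abstentions (assumption)
    tool = obj.get("tool") if isinstance(obj, dict) else None
    if tool == "abstain" or tool not in allowed:
        return None
    return tool


def final_prediction(raw: str, allowed: Sequence[str],
                     tool_predictions: Dict[str, int], fallback: int) -> int:
    """Use the selected tool's prediction, else the fallback answer."""
    tool = parse_selection(raw, allowed)
    return fallback if tool is None else tool_predictions[tool]


if __name__ == "__main__":
    allowed = ["BERT", "ICL_Llama"]
    preds = {"BERT": 1, "ICL_Llama": 0}
    print(final_prediction('{"tool": "ICL_Llama", "reason": "Best match."}',
                           allowed, preds, fallback=1))
    print(final_prediction('{"tool": "abstain", "reason": "No match."}',
                           allowed, preds, fallback=1))
```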