Learning Complementary Action Modeling from Automotive Maintenance Instructions


Summary

This paper introduces Complementary Action Modeling (CAM), a task that identifies or generates procedural counterparts of automotive maintenance instructions by modifying the action phrase while preserving context. Using a German automotive dataset, the authors examine candidate matching and controlled Seq2Seq generation to model these complementary instructions.

arXiv:2606.27808v1 Announce Type: new Abstract: A minute lexical variation can reverse the procedural meaning of an instruction even when the rest of the sentence remains unchanged. In automotive maintenance instructions, this pattern often appears when an action phrase turns an instruction into its procedural counterpart. The entities, modifiers, and surrounding context remain largely invariant, while the action phrase determines the procedural relation. We define this task as Complementary Action Modeling (CAM). Given a maintenance instruction, the goal is to identify or generate its procedural counterpart by modifying the action phrase while preserving the remaining sentence context. This task focuses on three aspects: distinguishing complementarity from surface similarity, controlling generation at the action-phrase level, and evaluating relational correctness using retrieval, overlap-based, and human evaluation. Using a German automotive maintenance dataset, we examine these questions through candidate matching and controlled Seq2Seq generation. The results show that complementary maintenance instructions are best modeled as procedural associations grounded in subtle lexical cues. They should therefore not be treated as ordinary cases of sentence similarity or synonym-based paraphrasing.


# Learning Complementary Action Modeling from Automotive Maintenance Instructions
Source: [https://arxiv.org/html/2606.27808](https://arxiv.org/html/2606.27808)
Jiaqi Wu (1, 2, \*), Bai Li (3, \*), Jochen Hartmann (2), Martin Gaedke (3), Sander Stuijk (1)

(1) Eindhoven University of Technology, Eindhoven, The Netherlands; (2) BMW Group, Munich, Germany; (3) Chemnitz University of Technology, Chemnitz, Germany

\*Equal contribution. Correspondence: jiaqi.wu@bmw.de, bai.li@informatik.tu-chemnitz.de

###### Abstract

A minute lexical variation can reverse the procedural meaning of an instruction even when the rest of the sentence remains unchanged\. In automotive maintenance instructions, this pattern often appears when an action phrase turns an instruction into its procedural counterpart\. The entities, modifiers, and surrounding context remain largely invariant, while the action phrase determines the procedural relation\. We define this task as Complementary Action Modeling \(CAM\)\. Given a maintenance instruction, the goal is to identify or generate its procedural counterpart by modifying the action phrase while preserving the remaining sentence context\. This task focuses on three aspects: distinguishing complementarity from surface similarity, controlling generation at the action\-phrase level, and evaluating relational correctness using retrieval, overlap\-based, and human evaluation\. Using a German automotive maintenance dataset, we examine these questions through candidate matching and controlled Seq2Seq generation\. The results show that complementary maintenance instructions are best modeled as procedural associations grounded in subtle lexical cues\. They should therefore not be treated as ordinary cases of sentence similarity or synonym\-based paraphrasing\.


## 1 Introduction

Minute lexical variations in maintenance instructions given to human operators and robots can determine the direction and objective of their actions. While the surrounding context may remain virtually unchanged, differences within the action phrases determine whether a specific component is to be installed or removed. Actual operational behaviors must be represented in a form that an intelligent system can reliably interpret. This capability is relevant to embodied intelligence and language-guided systems, which must distinguish similar actions and recognize complementary operations performed on the same object (Ichter et al., [2023](https://arxiv.org/html/2606.27808#bib.bib15); Zitkovich et al., [2023](https://arxiv.org/html/2606.27808#bib.bib16)). We define Complementary Action Modeling (CAM) to address this problem. An illustration of CAM is shown in Fig. [1](https://arxiv.org/html/2606.27808#S1.F1).

![Refer to caption](https://arxiv.org/html/2606.27808v1/x1.png)

Figure 1: Illustration of Complementary Action Modeling (CAM). CAM preserves the non-action context while transforming the action phrase into its complementary counterpart.

Existing procedural text benchmarks do not fully capture this problem scenario. This is particularly evident in benchmark scenarios centered on procedural paragraphs and dynamic world tracking (Dalvi et al., [2018](https://arxiv.org/html/2606.27808#bib.bib1)). Automotive maintenance instructions, by contrast, are highly repetitive. Strict constraints on vocabulary usage often leave room for variation only within a very narrow segment. When the action phrase is substituted to alter the procedural relationship, the remainder of the sentence remains unchanged. This characteristic distinguishes CAM from a standard semantic similarity task: instructions that are most similar at the lexical level may, in terms of actual operation, be entirely incorrect. Nor is it a typical paraphrase generation task, as its objective is not to preserve the overall meaning of the sentence, but rather to achieve a controlled transformation of the operational relationship. Furthermore, it is not a contradiction detection task, as its focus lies on a complementary procedural correspondence embedded within the context of a single workflow, rather than on sentence-level incompatibility in a broader sense.

Another rationale for clearly defining CAM lies at the methodological level\. Beyond evaluating neural models, this work follows a human\-verified workflow for problem formalization, data construction, and result interpretation\. Understanding the structure of maintenance procedures requires domain knowledge, which informs the rule\-based alignment process used to formalize complementary relationships\. Ambiguous actions then undergo selective human verification, while learning\-based models are used to generalize beyond deterministic rule patterns\.

In this sense, CAM reflects a lightweight form of human-machine collaboration: human expertise defines and verifies the procedural relation, while computational models scale this knowledge to larger sets of maintenance instructions. Trustworthy human judgment remains important for ensuring the reliability of downstream modeling pipelines (Bakaev et al., [2020](https://arxiv.org/html/2606.27808#bib.bib17)).

CAM can also be viewed as an auxiliary modeling task for domain-specific procedural documents, where end users and domain experts may benefit from systems that make implicit procedural relations explicit (Esposito et al., [2023](https://arxiv.org/html/2606.27808#bib.bib18)). In this setting, the contrastive model retrieves complementary action pairs that already exist within structured procedures, while the generative model tests whether complementary counterparts can be produced through controlled action-level generation.

The present study asks three research questions. (1) Can a model distinguish complementarity from surface similarity? This question is central because complementary instructions often share most of their tokens, making lexical overlap a poor proxy for the underlying procedural relation. (2) Can a generative model modify only the action phrase instead of freely rewriting the whole sentence? Fluent output is not sufficient: a generated sentence may look well-formed and still fail if it changes the correct context or preserves the source action. (3) How should CAM be evaluated? Overlap-based text generation metrics may reward outputs that remain close to the source even when they do not express the intended complementary relation. Relation-sensitive retrieval metrics and targeted human evaluation are therefore also needed. To answer these questions, we study CAM on a dataset derived from German automotive maintenance manuals,¹ where automotive instructions are organized within hierarchical maintenance procedures. We examine the task through candidate matching and controlled sequence-to-sequence generation. Complementary maintenance instructions should be modeled as procedural associations grounded in subtle lexical cues rather than treated as ordinary instances of paraphrase or contradiction.

¹ The anonymized dataset and experimental resources will be publicly released for research purposes upon acceptance.

## 2 Related Work

CAM relates to three lines of work: procedural text understanding, industrial information extraction and semantic matching, and procedural generation\. Unlike prior work, CAM focuses on complementary action relations in highly repetitive automotive maintenance instructions, where small lexical changes alter the procedural relation while most context remains unchanged\.

Procedural text understanding models actions, entities, and state changes in instructional text. Existing benchmarks focus on tracking entity states in process descriptions (Dalvi et al., [2018](https://arxiv.org/html/2606.27808#bib.bib1)), with transformer-based methods further improving entity tracking (Gupta and Durrett, [2019](https://arxiv.org/html/2606.27808#bib.bib2)). Other work introduces richer structures, such as dynamic knowledge graphs and entity-action-location reasoning graphs (Das et al., [2018](https://arxiv.org/html/2606.27808#bib.bib3); Huang et al., [2021](https://arxiv.org/html/2606.27808#bib.bib4)). These studies mainly target curated domains where state changes and procedure order are often explicit.

Industrial information extraction frequently uses rule-based or hybrid methods because they are interpretable and can encode expert knowledge. Rule-based systems remain useful in enterprise settings with limited supervision (Chiticariu et al., [2013](https://arxiv.org/html/2606.27808#bib.bib5)), but are less robust to linguistic variation and implicit action semantics. Contrastive learning provides a data-driven alternative for semantic matching. InfoNCE-style objectives learn pairwise alignment from positive and negative examples (van den Oord et al., [2018](https://arxiv.org/html/2606.27808#bib.bib6)), although random instance-level evaluation may overlook leakage in structurally repetitive documents (Søgaard et al., [2021](https://arxiv.org/html/2606.27808#bib.bib12)).

Generative and script-based models treat procedural knowledge as action generation or event prediction. Neural checklist, state-aware, and pretrained language models have been used for coherent procedural generation (Kiddon et al., [2016](https://arxiv.org/html/2606.27808#bib.bib7); Dhingra et al., [2018](https://arxiv.org/html/2606.27808#bib.bib8); H. Lee et al., [2020](https://arxiv.org/html/2606.27808#bib.bib9)). Retrieval-augmented and script-based methods further incorporate external steps or event knowledge (Nishimura et al., [2019](https://arxiv.org/html/2606.27808#bib.bib10); Sakaguchi et al., [2021](https://arxiv.org/html/2606.27808#bib.bib11)). However, inverse or complementary action relations are usually addressed only indirectly. Broader event relation studies (Zhou et al., [2022](https://arxiv.org/html/2606.27808#bib.bib13); Zhu et al., [2023](https://arxiv.org/html/2606.27808#bib.bib14)) suggest relevant directions but leave complementary action relations in automotive maintenance instructions underexplored.

## 3 Dataset Construction

This section describes the construction of a dataset for Complementary Action Modeling \(CAM\) from German automotive maintenance manuals\. Our goal is to collect paired maintenance instructions while preserving the procedural structure\. Automotive maintenance manuals describe goal\-directed sequences of actions over objects, tools, and intermediate states, and their interpretation depends on contextual structure\.

### 3.1 Source Maintenance Manuals

Our source documents are automotive maintenance manuals provided by an original equipment manufacturer \(OEM\)\. The manuals describe maintenance and repair procedures as structured textual instructions organized hierarchically into repair tasks, repair processes, and repair steps\.

A repair task specifies a high\-level maintenance objective\. Each task is decomposed into multiple repair processes, and each process consists of fine\-grained repair steps expressed as short imperative instructions\. In this work, we operate at the level of individual repair steps, but we retain process membership as an essential contextual variable\. Complementary relations in maintenance manuals arise within coherent procedural contexts such as disassembly and reassembly workflows\.

To preserve this structure, we represent each repair process as a *bucket* and assign all steps from the same process the same *bucket identifier*. The bucket structure preserves the local procedural context and later supports process-aware evaluation splits.

### 3.2 Rule-Based Construction of Complementary Action Pairs

We construct complementary action pairs with a two-stage rule-based alignment pipeline followed by manual verification. The first stage operates at the process level and identifies candidate process pairs that are likely to stand in a complementary relation, such as a disassembly process and its matching assembly process. The second stage operates within each matched process pair and aligns individual repair steps.

At the core of the rule matching process is a curated German lexicon of complementary action expressions\. Candidate pairs are first identified based on compatible action opposition, such as install/remove, connect/disconnect, or tighten/loosen\. After detecting candidate action pairs, the matched action expressions are removed and the remaining object\- and context\-bearing text is referred to as the action\-stripped skeleton, which is then compared across candidate pairs\. Candidate pairs are retained when their action\-stripped skeletons match exactly or differ only by conservative surface variations\.

To handle such variation, the rule matching process combines approximate string matching with whitelist-style normalization rules. These rules tolerate harmless differences in word order, synonym substitutions, and light modifiers such as *vorsichtig* ("carefully") or *wie abgebildet* ("as illustrated"). At the same time, the rule matching process applies hard procedural constraints to avoid spurious alignments. In particular, side- and location-sensitive expressions are treated as incompatibility signals, so that otherwise similar instructions are not matched if they refer to different physical positions.
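The skeleton comparison described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the toy lexicon, the modifier whitelist, the side-word list, the 0.9 similarity threshold, and the function names are all assumptions introduced here, and `difflib.SequenceMatcher` stands in for whatever approximate string matcher was actually used.

```python
from difflib import SequenceMatcher

# Hypothetical toy lexicon of complementary German action verbs (illustration only).
ACTION_PAIRS = [("ausbauen", "einbauen"), ("lösen", "festziehen"), ("trennen", "verbinden")]
COMPLEMENT = {}
for first, second in ACTION_PAIRS:
    COMPLEMENT[first] = second
    COMPLEMENT[second] = first

# Assumed whitelist of light modifiers that are dropped before comparison.
LIGHT_MODIFIERS = ["wie abgebildet", "vorsichtig"]
# Assumed side/location words that act as hard incompatibility signals.
SIDE_WORDS = {"links", "rechts", "vorne", "hinten"}


def split_action(instruction):
    """Return (action, action-stripped skeleton); action is None if no lexicon verb occurs."""
    text = instruction.lower().strip(" .")
    for phrase in LIGHT_MODIFIERS:
        text = text.replace(phrase, " ")
    tokens = text.split()
    for i, tok in enumerate(tokens):
        if tok in COMPLEMENT:
            return tok, " ".join(tokens[:i] + tokens[i + 1:])
    return None, " ".join(tokens)


def is_candidate_pair(a, b, threshold=0.9):
    """Keep a pair if actions are complementary, sides agree, and skeletons nearly match."""
    act_a, skel_a = split_action(a)
    act_b, skel_b = split_action(b)
    if act_a is None or COMPLEMENT.get(act_a) != act_b:
        return False
    if (SIDE_WORDS & set(skel_a.split())) != (SIDE_WORDS & set(skel_b.split())):
        return False
    return SequenceMatcher(None, skel_a, skel_b).ratio() >= threshold
```

With these assumptions, `is_candidate_pair("Schutzabdeckung vorsichtig ausbauen.", "Schutzabdeckung einbauen.")` is accepted because the modifier is whitelisted, while a pair that differs in `links` versus `rechts` is rejected regardless of how similar the rest of the text is.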

At the step level, the pipeline additionally exploits shared illustration references, which are particularly useful in highly repetitive repair subprocedures where multiple steps differ only in localized object references\.

Although the rule\-based pipeline can detect certain one\-to\-many correspondences, such as grouped screw\-removal operations that correspond to several later fastening steps, these cases are excluded from the final learning benchmark\. We retain only high\-confidence one\-to\-one complementary pairs in order to maintain an unambiguous retrieval and generation setting for downstream learning\-based experiments\.

### 3.3 Manual Verification and Provenance

Manual verification is applied selectively rather than to all automatically proposed pairs. High-confidence cases include strict rule matches and whitelist-supported matches. Candidates that cannot be resolved by these deterministic rules but remain plausible under looser approximate matching are exported for human review, where each ambiguous candidate is confirmed or rejected.

We retain provenance labels indicating whether a pair was obtained through strict rule matching, whitelist\-supported normalization, or approximate matching with manual confirmation\. This provenance distinguishes rule\-resolved cases from more ambiguous semantic cases\. In our corpus\-level audit, the rule\-based matching pipeline resolves approximately 59% of the final complementary pairs before manual intervention, while the remaining retained pairs require human adjudication due to lexical divergence, implicit component references, or context\-dependent formulations\.

This distinction enables a natural separation between rule\-resolved and rule\-unresolved cases\. Rule\-resolved cases are primarily governed by explicit lexical and structural regularities, whereas rule\-unresolved cases represent semantically harder instances that cannot be recovered reliably through deterministic heuristics alone\.

After deduplication and filtering for high\-confidence one\-to\-one alignments, the final benchmark contains 1,459 complementary instruction pairs\. Each pair is associated with its original repair\-process bucket and provenance label\. These annotations later support process\-aware evaluation splits and rule\-resolved versus rule\-unresolved analysis\.

The rule\-based pipeline serves two purposes in this work: it constructs a high\-confidence CAM dataset while also separating rule\-resolved cases from semantically harder ones\. The learning\-based models are motivated by the latter, where surface similarity, local lexical opposition, and handcrafted constraints are not sufficient to recover the correct counterpart\.

## 4 Experimental Design

We evaluate CAM under two experimental paradigms: candidate matching and controlled generation\. Candidate matching formulates CAM as a relation\-aware retrieval problem, whereas controlled generation formulates it as a conditional sequence\-to\-sequence generation problem\. The former evaluates whether a model can retrieve the correct counterpart from a candidate set, while the latter evaluates whether it can generate a complementary instruction under action\-level control\.

### 4.1 Candidate Matching Experiment

In the candidate matching experiment, the model receives a source maintenance instruction and selects the corresponding procedural counterpart from a set of candidates\. This experiment primarily addresses the first research question:*Can a model distinguish complementarity from surface similarity?*

In this experiment, each sample consists of a source instruction together with a candidate pool containing one correct procedural counterpart and other non\-matching instructions from the same evaluation split\.

Given a source instruction $x_i$ and a candidate set $C=\{c_1,c_2,\ldots,c_M\}$, the objective is to identify the procedural counterpart $c_i^{+}$ of $x_i$. The model learns an encoding function $f_\theta(\cdot)$ that maps the source instruction and each candidate instruction to vector representations:

$$z_x = f_\theta(x), \quad z_c = f_\theta(c) \tag{1}$$

Matching scores are then computed using vector similarity:

$$s(x,c) = z_x^{\top} z_c \tag{2}$$

Candidates are ranked by this score, with the gold counterpart expected to receive the highest rank.
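The ranking step in Eqs. (1) and (2) can be illustrated with a small sketch. It assumes the embeddings have already been produced by some encoder, so the vectors below are made-up placeholders and the function names are hypothetical.

```python
def dot(u, v):
    """Dot-product similarity s(x, c) between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_candidates(z_x, candidate_vectors):
    """Return candidate indices ordered from highest to lowest score."""
    scores = [dot(z_x, z_c) for z_c in candidate_vectors]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


# Placeholder embeddings (not real encoder outputs).
source_vector = [1.0, 0.0]
candidates = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]]
ranking = rank_candidates(source_vector, candidates)
```

Because source and candidates are encoded independently, candidate vectors can be computed once and reused for every query.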

#### 4.1.1 Training Objective

We train the model with a symmetric InfoNCE loss using verified CAM pairs as positives. Within each batch, all non-matching target instructions naturally serve as in-batch negatives. For example, *remove the protective cover* and *install the protective cover* form a positive CAM pair. For a batch containing $N$ positive sample pairs:

$$\{(x_i, y_i)\}_{i=1}^{N} \tag{3}$$

where $x_i$ is the source instruction and $y_i$ is its procedural counterpart. The model encodes the source and target separately:

$$z_i^{x} = f_\theta(x_i), \qquad z_i^{y} = f_\theta(y_i) \tag{4}$$

It then computes a similarity matrix across all source-target pairs:

$$S_{ij} = \frac{(z_i^{x})^{\top} z_j^{y}}{\tau} \tag{5}$$

Here, $\tau$ denotes the temperature parameter.

For each $x_i$, the correct match is $y_i$, while all other targets $y_j$ within the batch serve as negative samples. The corresponding loss function can be formulated as:

$$\mathcal{L}_{x\to y} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(S_{ii})}{\sum_{j=1}^{N}\exp(S_{ij})} \tag{6}$$

We apply the same objective in the reverse direction:

$$\mathcal{L}_{y\to x} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(S_{ii})}{\sum_{j=1}^{N}\exp(S_{ji})} \tag{7}$$

The final symmetric contrastive loss is then given by:

$$\mathcal{L}_{\mathrm{match}} = \frac{1}{2}\left(\mathcal{L}_{x\to y} + \mathcal{L}_{y\to x}\right) \tag{8}$$
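A plain-Python sketch of Eqs. (5) to (8) may help make the indexing concrete. It operates on precomputed embedding lists rather than a trainable encoder, the function names are hypothetical, and the temperature is left as a required argument because this section does not state the value used for matching.

```python
import math


def mean_diagonal_nll(S):
    """Average over rows i of -log softmax(S[i])[i], using a stable log-sum-exp."""
    total = 0.0
    for i, row in enumerate(S):
        m = max(row)
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(S)


def symmetric_info_nce(zx, zy, tau):
    """Symmetric InfoNCE over N aligned (source, target) embedding pairs."""
    n = len(zx)
    # Eq. (5): temperature-scaled dot products between every source and target.
    S = [[sum(a * b for a, b in zip(zx[i], zy[j])) / tau for j in range(n)]
         for i in range(n)]
    loss_xy = mean_diagonal_nll(S)                    # Eq. (6): normalize over row i
    S_transposed = [list(col) for col in zip(*S)]
    loss_yx = mean_diagonal_nll(S_transposed)         # Eq. (7): normalize over column i
    return 0.5 * (loss_xy + loss_yx)                  # Eq. (8)


aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], tau=1.0)
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]], tau=1.0)
```

In this toy batch the correctly aligned targets give a lower loss than the swapped ones, which is the behavior the objective is meant to reward.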

#### 4.1.2 Experiment Setup

We use a bi-encoder architecture for the candidate matching experiment. Source and candidate instructions are encoded independently and ranked by dot-product similarity. This architecture supports efficient retrieval over large candidate sets. We compare four encoders, summarized in Table [1](https://arxiv.org/html/2606.27808#S4.T1): mBERT, German BERT, multilingual Sentence-Transformer, and a character-level Transformer. Detailed encoder configurations are provided in Appendix [A](https://arxiv.org/html/2606.27808#A1).

| Encoder | Description |
| --- | --- |
| mBERT | Multilingual Transformer baseline |
| German BERT | German-specific pretrained encoder |
| Multilingual ST | Sentence-transformer retrieval encoder |
| Char Transformer | Character-level encoder trained from scratch |

Table 1: Encoders used in candidate matching.

### 4.2 Controlled Generation Experiment

In the controlled generation experiment, the model receives a source instruction and generates its complementary counterpart\. The output should modify only the action phrase while preserving the surrounding context\. This experiment primarily addresses the second research question:*Can a generative model modify only the action phrase instead of freely rewriting the whole sentence?*

Given a source maintenance instruction $x$, the model generates a target instruction $y$ that expresses its procedural counterpart:

$$P_\theta(y \mid x) \tag{9}$$

Thus, CAM generation requires both action transformation and context preservation.

#### 4.2.1 Training Objective

The primary training objective combines bidirectional sequence-to-sequence loss with an embedding-space contrastive regularization term. For a training pair $(x, y)$, the model maximizes the probability of the target sequence conditioned on the input:

$$P_\theta(y \mid x) = \prod_{t=1}^{T} P_\theta(y_t \mid y_{<t}, x) \tag{10}$$

The corresponding negative log-likelihood loss is:

$$L_{NLL} = -\sum_{t=1}^{T}\log P_\theta(y_t \mid y_{<t}, x) \tag{11}$$

Since token-level loss alone may emphasize surface reproduction, we additionally use an embedding-space contrastive regularizer in the main generation setting. The contrastive regularizer encourages representations of paired source and target instructions to remain close while separating non-matching targets in representation space. This helps the generative model learn not only token-level output, but also sentence-level procedural associations.

The final loss is:

$$\mathcal{L} = \mathcal{L}_{gen} + \lambda \mathcal{L}_{ctr}, \tag{12}$$

where $\mathcal{L}_{gen}$ is the bidirectional sequence-to-sequence generation loss built from the negative log-likelihood in Eq. (11), $\mathcal{L}_{ctr}$ is the embedding-space contrastive loss, and $\lambda$ controls its weight.
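As a worked illustration of Eqs. (11) and (12), the sketch below combines per-token probabilities into a total loss. It makes two assumptions that the text does not confirm: the two generation directions are simply summed, and the contrastive term is supplied as an already computed number. The default $\lambda = 0.5$ follows the value reported in the caption of Table 5; the function names are hypothetical.

```python
import math


def sequence_nll(token_probs):
    """Eq. (11): negative log-likelihood from probabilities P(y_t | y_<t, x)."""
    return -sum(math.log(p) for p in token_probs)


def total_loss(probs_x_to_y, probs_y_to_x, contrastive_loss, lam=0.5):
    """Eq. (12), assuming the generation loss is the sum over both directions."""
    generation_loss = sequence_nll(probs_x_to_y) + sequence_nll(probs_y_to_x)
    return generation_loss + lam * contrastive_loss


# Toy numbers: two target tokens at probability 0.5, a perfectly predicted
# reverse direction, and a contrastive term of 0.2.
example = total_loss([0.5, 0.5], [1.0], contrastive_loss=0.2)
```

Here the generation term contributes $2\ln 2$ and the regularizer contributes $0.5 \times 0.2 = 0.1$.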

#### 4.2.2 Experimental Setup

In our experiment, we use pretrained encoder-decoder models to map a source instruction to its complementary target. We use a row-wise 80/10/10 train/validation/test split. Each pair is used in both directions, so the model is trained to generate $y$ from $x$ and $x$ from $y$.
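The split and the bidirectional expansion can be sketched as below. The order of operations is an assumption: rows are split first and each split is then expanded into both directions, so that the two directions of one pair never land in different splits. The shuffling seed and function name are likewise illustrative.

```python
import random


def split_and_expand(pairs, seed=0):
    """Row-wise 80/10/10 split, then add the reversed direction within each split."""
    rows = list(pairs)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * 8 // 10
    n_val = len(rows) // 10
    parts = {
        "train": rows[:n_train],
        "validation": rows[n_train:n_train + n_val],
        "test": rows[n_train + n_val:],
    }
    # Each (x, y) row yields two directional examples: x -> y and y -> x.
    return {name: [(x, y) for x, y in part] + [(y, x) for x, y in part]
            for name, part in parts.items()}


toy_pairs = [(f"source {i}", f"target {i}") for i in range(20)]
splits = split_and_expand(toy_pairs)
```

With 20 toy rows this yields 32 training, 4 validation, and 4 test examples after expansion.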

We evaluate three multilingual Seq2Seq backbone families, summarized in Table [2](https://arxiv.org/html/2606.27808#S4.T2), to compare the effect of pretraining and model capacity. Decoding uses greedy generation with a maximum length of 128 tokens. The best checkpoint is selected by validation ROUGE-L. Additional reproducibility details and evaluation implementations are summarized in Appendix [A](https://arxiv.org/html/2606.27808#A1).

| Backbone | Description |
| --- | --- |
| mBART | Multilingual encoder-decoder model for sequence-to-sequence generation. |
| mT5 | Multilingual T5 model used to test capacity effects across model sizes. |
| Flan-T5 | Instruction-tuned T5 variant used as a generation baseline. |

Table 2: Seq2Seq backbones used in controlled generation.

| Encoder | R@1 | R@5 | R@10 | MRR |
| --- | --- | --- | --- | --- |
| Char Transformer | 0.7134 ± 0.0313 | 0.9445 ± 0.0271 | 0.9857 ± 0.0133 | 0.8083 ± 0.0227 |
| mBERT | 0.7415 ± 0.0168 | **0.9612 ± 0.0259** | **0.9936 ± 0.0072** | **0.8306 ± 0.0152** |
| German BERT | 0.7321 ± 0.0275 | 0.9588 ± 0.0205 | 0.9924 ± 0.0098 | 0.8234 ± 0.0197 |
| Multilingual ST | **0.7417 ± 0.0216** | 0.9579 ± 0.0211 | 0.9935 ± 0.0053 | 0.8304 ± 0.0163 |

Table 3: Candidate matching results under the `groups_native` setting. Scores are reported as mean ± standard deviation over 10 folds.

## 5 Evaluation and Results

We first describe the evaluation protocol and then report results for candidate matching, controlled generation, and additional diagnostic analyses\.

### 5.1 Evaluation Protocol

We evaluate CAM under two experimental paradigms: candidate matching and controlled generation\. The evaluation protocol is designed to measure both retrieval accuracy and action\-level transformation quality\.

#### 5.1.1 Candidate Matching Evaluation

For candidate matching, we consider three split settings: `pairs`, `groups`, and `groups_native`.

The `pairs` setting performs row-wise splitting over all complementary instruction pairs, regardless of their originating repair process. The `groups` setting uses bucket-level splitting on the rebucketed dataset, while `groups_native` restricts this evaluation to the original repair-process buckets derived from the maintenance manual hierarchy. This prevents highly similar process-local templates from appearing across training and evaluation splits.

Candidate matching is evaluated with 10\-fold cross\-validation\. In each fold, eight folds are used for training, one for validation, and one for testing\.
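A bucket-aware version of this protocol might look like the following sketch. The round-robin assignment of buckets to folds and the choice of the next fold as the validation fold are assumptions made for illustration; the text only states that eight folds train, one validates, and one tests.

```python
from collections import defaultdict


def grouped_folds(bucket_ids, n_folds=10):
    """Assign whole buckets to folds; return a list of row-index lists, one per fold."""
    buckets = sorted(set(bucket_ids))
    fold_of_bucket = {b: k % n_folds for k, b in enumerate(buckets)}
    folds = defaultdict(list)
    for row, bucket in enumerate(bucket_ids):
        folds[fold_of_bucket[bucket]].append(row)
    return [folds[k] for k in range(n_folds)]


def rotation(folds, test_fold):
    """Eight folds for training, one for validation, one for testing."""
    n = len(folds)
    val_fold = (test_fold + 1) % n  # assumed convention for picking the validation fold
    train = [r for k in range(n) if k not in (test_fold, val_fold) for r in folds[k]]
    return train, folds[val_fold], folds[test_fold]


# Toy data: 60 rows grouped into 20 buckets of 3 rows each.
toy_bucket_ids = [i // 3 for i in range(60)]
toy_folds = grouped_folds(toy_bucket_ids)
train_rows, val_rows, test_rows = rotation(toy_folds, test_fold=0)
```

Because every row of a bucket shares a fold, no repair process contributes rows to both the training and the test side of a rotation.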

We report Recall@1, Recall@5, Recall@10, and MRR\. Recall@K measures whether the correct counterpart appears within the top\-K retrieved candidates, while MRR reflects the average ranking quality across the full candidate list\.
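These metrics are computed from the 1-based rank of the gold counterpart in each query's candidate list. The sketch below uses hypothetical function names and toy ranks; tie handling (counting only strictly higher scores) is an assumption.

```python
def gold_rank(scores, gold_index):
    """1-based rank of the gold candidate: one plus the number of strictly higher scores."""
    gold_score = scores[gold_index]
    return 1 + sum(1 for s in scores if s > gold_score)


def recall_at_k(ranks, k):
    """Fraction of queries whose gold counterpart is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


toy_ranks = [1, 3, 1, 12]
r_at_1 = recall_at_k(toy_ranks, 1)
r_at_5 = recall_at_k(toy_ranks, 5)
mrr = mean_reciprocal_rank(toy_ranks)
```

For the toy ranks, half of the queries are solved at rank 1, three quarters within the top 5, and the query at rank 12 still contributes 1/12 to MRR even though it counts as a miss for Recall@10.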

We focus the main text on the conservative `groups_native` setting and report the full comparison in Appendix [C](https://arxiv.org/html/2606.27808#A3).

#### 5.1.2 Controlled Generation Evaluation

For controlled generation, the best checkpoint is selected by validation ROUGE\-L and final automatic metrics are reported on the held\-out test split\.

We report BLEU and ROUGE as automatic overlap-based metrics. BLEU measures n-gram overlap between generated and reference counterparts, while ROUGE-1, ROUGE-2, and ROUGE-L measure unigram, bigram, and longest-common-subsequence overlap with the reference, respectively.
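To see why overlap scores can stay high when the action is wrong, consider a simplified ROUGE-L. The sketch below is an assumption-laden stand-in for the evaluation code in the appendix: it uses whitespace tokenization and a balanced F-measure, and the German example sentences are invented for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        cur = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(reference, hypothesis):
    """Simplified ROUGE-L F1 over whitespace tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "Schutzabdeckung am Motor einbauen"
copied_source = "Schutzabdeckung am Motor ausbauen"  # action left unchanged
score_for_copy = rouge_l_f1(reference, copied_source)
```

An output that merely copies the source still matches three of the four reference tokens here, even though it expresses the opposite action.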

However, overlap\-based metrics alone do not determine whether the action phrase has been transformed correctly\. We therefore additionally conduct human evaluation on 100 directional outputs from the held\-out test split, stratified into 50 rule\-resolved and 50 rule\-unresolved cases\.

The human evaluation focuses on semantic complementarity and component consistency\. Semantic complementarity measures whether the generated instruction expresses the correct complementary action, while component consistency evaluates whether the generated output preserves the correct procedural entity or component\.

#### 5.1.3 Rule-Resolved and Rule-Unresolved Analysis

We additionally report candidate matching results separately for rule\-resolved and rule\-unresolved pairs\. Rule\-resolved pairs are obtained by strict or whitelist\-supported matching, while rule\-unresolved pairs correspond to manually confirmed matches not directly resolved by deterministic rules\.

This split allows us to examine whether learned models remain effective on more challenging cases beyond deterministic rule patterns\.

### 5.2 Candidate Matching Results

Table [3](https://arxiv.org/html/2606.27808#S4.T3) reports candidate matching results on the conservative `groups_native` setting.

The results show that CAM is learnable as a closed\-set matching task: the strongest systems reach approximately 0\.74 R@1 and 0\.83 MRR\. The multilingual sentence\-transformer achieves the best R@1, while mBERT achieves the strongest Recall@5/10 and the best overall MRR\. This difference suggests that CAM performance should not be reduced to a single retrieval score\.

Transformer-based encoders outperform the character-level encoder, although the latter remains competitive. This suggests that surface lexical information is useful for CAM, but pretrained multilingual and German-specific representations provide stronger signals for identifying procedural counterparts, even in highly formulaic maintenance instructions. Extended retrieval results across all split settings are reported in Appendix [C](https://arxiv.org/html/2606.27808#A3).

We further analyze the held-out evaluation results separately on rule-resolved and rule-unresolved subsets, as shown in Table [4](https://arxiv.org/html/2606.27808#S5.T4). As expected, rule-resolved pairs are easier, but performance on rule-unresolved cases remains strong, with the multilingual Sentence-Transformer achieving over 0.82 Recall@1. This indicates that learned representations capture complementary relations beyond deterministic rule patterns.

| Subset | R@1 | R@5 | R@10 | MRR |
| --- | --- | --- | --- | --- |
| Rule-resolved | 0.9542 | 0.9977 | 0.9989 | 0.9739 |
| Rule-unresolved | 0.8237 | 0.9710 | 0.9851 | 0.8856 |

Table 4: Candidate matching results on rule-resolved and rule-unresolved subsets using the multilingual Sentence-Transformer under the `groups_native` protocol.

### 5.3 Controlled Generation Results

Table [5](https://arxiv.org/html/2606.27808#S5.T5) reports controlled generation results for multilingual Seq2Seq backbones trained under the bidirectional full-sentence generation setup.

| Model | BLEU | R-1 | R-2 | R-L |
| --- | --- | --- | --- | --- |
| mBART-large-50 | 62.9603 | 0.8083 | 0.7177 | 0.7970 |
| Flan-T5-large | 61.7858 | 0.8075 | 0.7133 | 0.7955 |
| mT5-large | 61.4361 | 0.7953 | 0.6968 | 0.7842 |
| Flan-T5-base | 57.8672 | 0.7860 | 0.6758 | 0.7729 |
| mT5-base | 48.0914 | 0.7128 | 0.5837 | 0.6979 |
| mT5-small | 0.1488 | 0.0174 | 0.0000 | 0.0169 |

Table 5: Controlled generation results on the held-out test set. The best checkpoint is selected by validation ROUGE-L. All models are trained with bidirectional generation and embedding-space contrastive regularization ($\lambda=0.5$, $\tau=0.05$). R-1, R-2, and R-L denote ROUGE-1, ROUGE-2, and ROUGE-L, respectively.

The generation results show that strong pretrained multilingual Seq2Seq models can produce outputs with high reference overlap. The best checkpoint is selected by validation ROUGE-L, and final results are reported on the held-out test set. mBART-large-50 obtains the strongest test scores, reaching 62.9603 BLEU and 0.7970 ROUGE-L. This suggests that much of the surrounding procedural context can be preserved in the generated counterpart.

However, overlap\-based metrics alone do not prove CAM success\. In CAM generation, copying is both necessary and risky: the model should preserve the component and context, but must still transform the action phrase\. We therefore complement BLEU and ROUGE with human evaluation of semantic complementarity and component consistency\.
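
The tension between necessary copying and the required action change can be made concrete with a small diagnostic\. The sketch below is a hypothetical heuristic, not the authors' evaluation procedure: it assumes whitespace tokenization and treats the tokens shared by the source and the reference as the context that must be preserved\. The function names, the category labels, and the German example sentences are invented for illustration\.

```python
from difflib import SequenceMatcher


def context_tokens(source_tokens, reference_tokens):
    """Tokens shared by source and reference, taken as the context to preserve (assumption)."""
    matcher = SequenceMatcher(a=source_tokens, b=reference_tokens, autojunk=False)
    kept = []
    for block in matcher.get_matching_blocks():
        kept.extend(source_tokens[block.a:block.a + block.size])
    return kept


def diagnose(source: str, generated: str, reference: str) -> str:
    """Coarse label for one generated counterpart (hypothetical categories)."""
    src, gen, ref = source.split(), generated.split(), reference.split()
    if gen == ref:
        return "match"
    if gen == src:
        return "no-op"  # the model copied the input without changing the action
    missing = [tok for tok in context_tokens(src, ref) if tok not in gen]
    # Context lost -> drift; context kept but output differs from reference -> action error.
    return "context-drift" if missing else "wrong-action"


# Hypothetical remove/install pair, not taken from the dataset.
source = "Abdeckung für Batterie ausbauen"
reference = "Abdeckung für Batterie einbauen"
for candidate in (reference, source,
                  "Abdeckung für Batterie prüfen",
                  "Abdeckung für Motor einbauen"):
    print(candidate, "->", diagnose(source, candidate, reference))
```

Such a check only flags candidates for review: a valid counterpart phrased differently from the reference would be labelled `wrong-action`, which is one reason human judgment is still needed\.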

Among 100 directional test outputs, 95 were judged semantically complementary and 5 were incorrect with respect to the intended action transformation\. All sampled outputs preserved the relevant component or procedural entity, suggesting that remaining errors mainly concern action\-level transformation rather than component drift\.

In Table [6](https://arxiv.org/html/2606.27808#S5.T6), Correct and Incorrect refer to semantic complementarity judgments\. Component consistency is not shown as a separate column because all sampled outputs preserved the relevant component or procedural entity\.

| Subset | Correct | Incorrect |
|---|---|---|
| Rule\-resolved | 49 | 1 |
| Rule\-unresolved | 46 | 4 |

Table 6: Human evaluation of semantic complementarity for generated outputs by rule\-resolved and rule\-unresolved subset\.

These findings show that high overlap\-based generation scores should still be interpreted together with action\-level semantic judgments\. Model capacity and pretraining quality also strongly affect generation performance\. mBART\-large\-50 performs best, while the large T5 variants \(Flan\-T5\-large and mT5\-large\) remain competitive\. In contrast, mT5\-small degenerates under the same training setting \(0\.1488 BLEU\), suggesting that CAM generation is more sensitive to model capacity and optimization stability than candidate matching\.

### 5\.4 Cross\-Vehicle Generalization Case Study

Beyond the controlled benchmark, we further examine whether CAM generalizes to maintenance instruction data from an unseen vehicle platform by applying the best\-performing candidate matching checkpoint in inference mode only, without additional fine\-tuning\. The new vehicle data is not used during training, validation, or testing in the main benchmark\.

Because the new vehicle data does not contain a fully annotated benchmark, we evaluate the retrieved counterparts through manual verification\. A prediction is counted as accepted if it expresses a semantically valid complementary action relation for the source instruction\.

Among 1,584 reviewed inference results, 1,022 predictions are accepted as valid complementary matches, corresponding to an acceptance rate of 64\.5%\.
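
The acceptance rate follows directly from the two counts above\. As a rough illustration of the sampling uncertainty, the sketch below also computes a 95% Wilson score interval\. The interval is not reported in the paper, and it assumes the reviewed predictions are independent judgments, which may not hold for instructions drawn from the same document\.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


accepted, reviewed = 1022, 1584  # counts reported in the case study
rate = accepted / reviewed
low, high = wilson_interval(accepted, reviewed)
print(f"acceptance rate = {rate:.1%}")  # acceptance rate = 64.5%
print(f"95% Wilson interval = [{low:.3f}, {high:.3f}]")
```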

The cross\-vehicle case study suggests that CAM captures procedural complementary relations that transfer beyond the original vehicle\-specific benchmark\. Although the evaluation is based on manual verification rather than a fully annotated retrieval benchmark, the results indicate promising generalization behavior on unseen automotive maintenance documentation\.

### 5\.5 Discussion

Taken together, the benchmark results, human evaluation, and cross\-vehicle case study show that CAM requires evaluation from multiple perspectives\. In this work, mixed evaluation refers to combining retrieval metrics for counterpart identification, overlap\-based metrics for contextual adequacy, and human evaluation for action\-level correctness\. The combined results show that candidate matching and controlled generation expose different aspects of CAM\. Candidate matching evaluates whether the correct counterpart can be identified in a closed candidate set, while generation tests whether the model can produce a complementary instruction\. Human evaluation shows that generated outputs usually preserve the component context reliably, while the remaining errors mainly concern action\-level transformation\. Together with the rule\-resolved and rule\-unresolved analysis, these results suggest that CAM is learnable but remains challenging when complementarity cannot be resolved by deterministic lexical patterns alone\.

## 6 Conclusion

This paper introduced Complementary Action Modeling \(CAM\) for automotive maintenance instructions, where the goal is to identify or generate a procedural counterpart by transforming the action phrase while preserving the surrounding context\. Our results show that CAM is learnable as a closed\-set matching task, with strong retrieval performance across multilingual and German\-specific encoders in the `groups_native` setting\. The rule\-resolved and rule\-unresolved analysis further shows that cases beyond deterministic matching rules remain harder but still learnable\.

For controlled generation, mBART\-large\-50 is the strongest backbone, achieving the best BLEU and ROUGE\-L scores among the evaluated models\. Human evaluation shows that generated outputs preserve component context reliably, while the remaining errors mainly concern action\-level transformation\. Together, these findings support mixed evaluation for CAM: retrieval metrics assess counterpart identification, overlap\-based metrics capture contextual adequacy, and human evaluation verifies action\-level correctness\.

Overall, complementary maintenance instructions should be modeled as procedural associations grounded in subtle lexical cues rather than as ordinary sentence similarity or paraphrase\. Future work should extend CAM to broader maintenance domains, develop automatic action\-aware diagnostics, and study more challenging cross\-document and cross\-vehicle generalization settings, including error analysis of no\-op, wrong\-action, context\-drift, and near\-match failures\. In short, CAM succeeds only when a model identifies or generates the right counterpart, preserves the context, and changes the action correctly\.

## Limitations

While our results demonstrate the effectiveness of learning\-based approaches for modeling complementary actions in automotive maintenance instructions, several limitations remain\.

First, the dataset is derived primarily from German automotive maintenance instructions from a single OEM, which may limit generalization to other manufacturers, languages, or manual styles\. Although the proposed methods are not tied to a specific action lexicon, broader validation across domains remains necessary\.

Second, the current benchmark depends on an initial rule\-based alignment stage followed by manual verification\. Although learning substantially improves robustness beyond handcrafted rules, biases or constraints introduced during pair construction may still affect the learned representations\. Fully data\-driven discovery of complementary actions therefore remains an open challenge\.

Third, the formulation emphasizes locally constrained disambiguation within structurally related procedural contexts rather than global retrieval across entire maintenance instructions\. While this reflects common industrial authoring practices, it does not address cases where complementary actions occur in distant or weakly connected sections\.

Fourth, the generative formulation exhibits a strong dependence on model capacity and pretraining quality\. Large pretrained sequence\-to\-sequence models perform reliably, whereas low\-capacity models may show instability or degenerate generation, suggesting that purely generative approaches may be unsuitable for constrained model settings and motivating future work on parameter\-efficient or hybrid contrastive–generative methods\.

Finally, the cross\-vehicle generalization analysis is based on manual verification of inference outputs rather than a fully annotated retrieval benchmark\. It provides evidence of transfer to unseen maintenance documentation, but does not replace a controlled benchmark with exhaustive gold counterparts\.

## Acknowledgments

The authors acknowledge the use of AI\-assisted tools for language refinement and editorial support during manuscript preparation\. All technical content, experimental design, analysis, and conclusions were verified and approved by the authors\.

## References

- Distributional ground truth: non\-redundant crowdsourcing data quality control in UI labeling tasks\. External Links: arXiv:2012\.13546, [Link](https://arxiv.org/abs/2012.13546)\. Cited by: [§1](https://arxiv.org/html/2606.27808#S1.p4.1)\.
- L\. Chiticariu, Y\. Li, and F\. R\. Reiss \(2013\) Rule\-based information extraction is dead\! Long live rule\-based information extraction systems\! In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, D\. Yarowsky, T\. Baldwin, A\. Korhonen, K\. Livescu, and S\. Bethard \(Eds\.\), Seattle, Washington, USA, pp\. 827–832\. External Links: [Link](https://aclanthology.org/D13-1079/)\. Cited by: [§2](https://arxiv.org/html/2606.27808#S2.p3.1)\.
- B\. Dalvi, L\. Huang, N\. Tandon, W\. Yih, and P\. Clark \(2018\) Tracking state changes in procedural text: a challenge dataset and models for process paragraph comprehension\. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long Papers\), M\. Walker, H\. Ji, and A\. Stent \(Eds\.\), New Orleans, Louisiana, pp\. 1595–1604\. External Links: [Link](https://aclanthology.org/N18-1144/), [Document](https://dx.doi.org/10.18653/v1/N18-1144)\. Cited by: [§1](https://arxiv.org/html/2606.27808#S1.p2.1), [§2](https://arxiv.org/html/2606.27808#S2.p2.1)\.
- R\. Das, T\. Munkhdalai, X\. Yuan, A\. Trischler, and A\. McCallum \(2018\) Building dynamic knowledge graphs from text using machine reading comprehension\. In International Conference on Learning Representations\. Cited by: [§2](https://arxiv.org/html/2606.27808#S2.p2.1)\.
- B\. Dhingra, Q\. Jin, Z\. Yang, W\. Cohen, and R\. Salakhutdinov \(2018\) Neural models for reasoning over multiple mentions using coreference\. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 \(Short Papers\), M\. Walker, H\. Ji, and A\. Stent \(Eds\.\), New Orleans, Louisiana, pp\. 42–48\. External Links: [Link](https://aclanthology.org/N18-2007/), [Document](https://dx.doi.org/10.18653/v1/N18-2007)\. Cited by: [§2](https://arxiv.org/html/2606.27808#S2.p4.1)\.
- A\. Esposito, M\. Calvano, A\. Curci, G\. Desolda, R\. Lanzilotti, C\. Lorusso, and A\. Piccinno \(2023\) End\-user development for artificial intelligence: a systematic literature review\. In End\-User Development, Lecture Notes in Computer Science, Vol\. 13917\. External Links: [Document](https://dx.doi.org/10.1007/978-3-031-34433-6%5F2), [Link](https://doi.org/10.1007/978-3-031-34433-6_2)\. Cited by: [§1](https://arxiv.org/html/2606.27808#S1.p5.1)\.
- A\. Gupta and G\. Durrett \(2019\) Effective use of transformer networks for entity tracking\. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\), K\. Inui, J\. Jiang, V\. Ng, and X\. Wan \(Eds\.\), Hong Kong, China, pp\. 759–769\. External Links: [Link](https://aclanthology.org/D19-1070/), [Document](https://dx.doi.org/10.18653/v1/D19-1070)\. Cited by: [§2](https://arxiv.org/html/2606.27808#S2.p2.1)\.
- H\. H\. Lee, K\. Shu, P\. Achananuparp, P\. K\. Prasetyo, Y\. Liu, E\. Lim, and L\. R\. Varshney \(2020\) RecipeGPT: generative pre\-training based cooking recipe generation and evaluation system\. In Companion Proceedings of the Web Conference 2020, WWW ’20, New York, NY, USA, pp\. 181–184\. External Links: ISBN 9781450370240, [Link](https://doi.org/10.1145/3366424.3383536), [Document](https://dx.doi.org/10.1145/3366424.3383536)\. Cited by: [§2](https://arxiv.org/html/2606.27808#S2.p4.1)\.
- H\. Huang, X\. Geng, J\. Pei, G\. Long, and D\. Jiang \(2021\) Reasoning over entity\-action\-location graph for procedural text understanding\. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\), C\. Zong, F\. Xia, W\. Li, and R\. Navigli \(Eds\.\), Online, pp\. 5100–5109\. External Links: [Link](https://aclanthology.org/2021.acl-long.396/), [Document](https://dx.doi.org/10.18653/v1/2021.acl-long.396)\. Cited by: [§2](https://arxiv.org/html/2606.27808#S2.p2.1)\.
- B\. Ichter, A\. Brohan, Y\. Chebotar, C\. Finn, K\. Hausman, A\. Herzog, D\. Ho, J\. Ibarz, A\. Irpan, E\. Jang, R\. Julian, D\. Kalashnikov, S\. Levine, Y\. Lu, C\. Parada, K\. Rao, P\. Sermanet, A\. T\. Toshev, V\. Vanhoucke, F\. Xia, T\. Xiao, P\. Xu, M\. Yan, N\. Brown, M\. Ahn, O\. Cortes, N\. Sievers, C\. Tan, S\. Xu, D\. Reyes, J\. Rettinghouse, J\. Quiambao, P\. Pastor, L\. Luu, K\. Lee, Y\. Kuang, S\. Jesmonth, N\. J\. Joshi, K\. Jeffrey, R\. J\. Ruano, J\. Hsu, K\. Gopalakrishnan, B\. David, A\. Zeng, and C\. K\. Fu \(2023\) Do as I can, not as I say: grounding language in robotic affordances\. In Proceedings of The 6th Conference on Robot Learning, K\. Liu, D\. Kulic, and J\. Ichnowski \(Eds\.\), Proceedings of Machine Learning Research, Vol\. 205, pp\. 287–318\. External Links: [Link](https://proceedings.mlr.press/v205/ichter23a.html)\. Cited by: [§1](https://arxiv.org/html/2606.27808#S1.p1.1)\.
- C\. Kiddon, L\. Zettlemoyer, and Y\. Choi \(2016\) Globally coherent text generation with neural checklist models\. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, J\. Su, K\. Duh, and X\. Carreras \(Eds\.\), Austin, Texas, pp\. 329–339\. External Links: [Link](https://aclanthology.org/D16-1032/), [Document](https://dx.doi.org/10.18653/v1/D16-1032)\. Cited by: [§2](https://arxiv.org/html/2606.27808#S2.p4.1)\.
- T\. Nishimura, A\. Hashimoto, and S\. Mori \(2019\) Procedural text generation from a photo sequence\. In Proceedings of the 12th International Conference on Natural Language Generation, K\. van Deemter, C\. Lin, and H\. Takamura \(Eds\.\), Tokyo, Japan, pp\. 409–414\. External Links: [Link](https://aclanthology.org/W19-8650/), [Document](https://dx.doi.org/10.18653/v1/W19-8650)\. Cited by: [§2](https://arxiv.org/html/2606.27808#S2.p4.1)\.
- K\. Sakaguchi, C\. Bhagavatula, R\. Le Bras, N\. Tandon, P\. Clark, and Y\. Choi \(2021\) ProScript: partially ordered scripts generation\. In Findings of the Association for Computational Linguistics: EMNLP 2021, M\. Moens, X\. Huang, L\. Specia, and S\. W\. Yih \(Eds\.\), Punta Cana, Dominican Republic, pp\. 2138–2149\. External Links: [Link](https://aclanthology.org/2021.findings-emnlp.184/), [Document](https://dx.doi.org/10.18653/v1/2021.findings-emnlp.184)\. Cited by: [§2](https://arxiv.org/html/2606.27808#S2.p4.1)\.
- A\. Søgaard, S\. Ebert, J\. Bastings, and K\. Filippova \(2021\) We need to talk about random splits\. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, P\. Merlo, J\. Tiedemann, and R\. Tsarfaty \(Eds\.\), Online, pp\. 1823–1832\. External Links: [Link](https://aclanthology.org/2021.eacl-main.156/), [Document](https://dx.doi.org/10.18653/v1/2021.eacl-main.156)\. Cited by: [§2](https://arxiv.org/html/2606.27808#S2.p3.1)\.
- A\. van den Oord, Y\. Li, and O\. Vinyals \(2018\) Representation learning with contrastive predictive coding\. arXiv preprint arXiv:1807\.03748\. Cited by: [§2](https://arxiv.org/html/2606.27808#S2.p3.1)\.
- Y\. Zhou, T\. Shen, X\. Geng, G\. Long, and D\. Jiang \(2022\) ClarET: pre\-training a correlation\-aware context\-to\-event transformer for event\-centric generation and classification\. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\), Dublin, Ireland, pp\. 2559–2575\. External Links: [Link](https://aclanthology.org/2022.acl-long.183/), [Document](https://dx.doi.org/10.18653/v1/2022.acl-long.183)\. Cited by: [§2](https://arxiv.org/html/2606.27808#S2.p4.1)\.
- F\. Zhu, J\. Gao, C\. Yu, W\. Wang, C\. Xu, X\. Mu, M\. Yang, and R\. Xu \(2023\) A generative approach for script event prediction via contrastive fine\-tuning\. In Proceedings of the Thirty\-Seventh AAAI Conference on Artificial Intelligence and Thirty\-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence, AAAI’23/IAAI’23/EAAI’23\. External Links: ISBN 978\-1\-57735\-880\-0, [Link](https://doi.org/10.1609/aaai.v37i11.26645), [Document](https://dx.doi.org/10.1609/aaai.v37i11.26645)\. Cited by: [§2](https://arxiv.org/html/2606.27808#S2.p4.1)\.
- B\. Zitkovich, T\. Yu, S\. Xu, P\. Xu, T\. Xiao, F\. Xia, J\. Wu, P\. Wohlhart, S\. Welker, A\. Wahid, Q\. Vuong, V\. Vanhoucke, H\. Tran, R\. Soricut, A\. Singh, J\. Singh, P\. Sermanet, P\. R\. Sanketi, G\. Salazar, M\. S\. Ryoo, K\. Reymann, K\. Rao, K\. Pertsch, I\. Mordatch, H\. Michalewski, Y\. Lu, S\. Levine, L\. Lee, T\. E\. Lee, I\. Leal, Y\. Kuang, D\. Kalashnikov, R\. Julian, N\. J\. Joshi, A\. Irpan, B\. Ichter, J\. Hsu, A\. Herzog, K\. Hausman, K\. Gopalakrishnan, C\. Fu, P\. Florence, C\. Finn, K\. A\. Dubey, D\. Driess, T\. Ding, K\. M\. Choromanski, X\. Chen, Y\. Chebotar, J\. Carbajal, N\. Brown, A\. Brohan, M\. G\. Arenas, and K\. Han \(2023\) RT\-2: vision\-language\-action models transfer web knowledge to robotic control\. In Proceedings of The 7th Conference on Robot Learning, J\. Tan, M\. Toussaint, and K\. Darvish \(Eds\.\), Proceedings of Machine Learning Research, Vol\. 229, pp\. 2165–2183\. External Links: [Link](https://proceedings.mlr.press/v229/zitkovich23a.html)\. Cited by: [§1](https://arxiv.org/html/2606.27808#S1.p1.1)\.

## Appendix A Reproducibility Details

This appendix summarizes additional implementation and reproducibility details for the CAM experiments, including evaluation protocols, metric implementations, computational resources, training configurations, and extended retrieval results\.

### A\.1 Evaluation Metrics and Implementations

Retrieval evaluation uses Recall@1, Recall@5, Recall@10, and Mean Reciprocal Rank \(MRR\) computed from ranked candidate lists produced by the bi\-encoder retrieval models\.
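
As a minimal sketch of how these metrics follow from ranked candidate lists, the Python below computes Recall@k and MRR from 1\-based gold ranks\. It assumes exactly one gold counterpart per query and counts ties against the gold candidate; the function names are illustrative and are not taken from the authors' code\.

```python
from typing import Sequence


def gold_rank(scores: Sequence[float], gold_index: int) -> int:
    """1-based rank of the gold candidate under descending score (ties count against it)."""
    gold = scores[gold_index]
    return 1 + sum(1 for i, s in enumerate(scores) if i != gold_index and s >= gold)


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Share of queries whose gold counterpart appears within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy gold ranks for five queries (made-up values for illustration).
ranks = [1, 1, 3, 2, 12]
print(recall_at_k(ranks, 1))   # 0.4
print(recall_at_k(ranks, 5))   # 0.8
print(round(mean_reciprocal_rank(ranks), 4))  # 0.5833
```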

Generation evaluation uses BLEU and ROUGE\-1/2/L\. BLEU is computed with SacreBLEU, and ROUGE scores are computed with the HuggingFace `evaluate` implementation using standard settings\.
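
For intuition about what ROUGE\-L rewards, the following is a simplified, whitespace\-tokenized ROUGE\-L F1 written from the standard longest\-common\-subsequence definition\. It is not the SacreBLEU or HuggingFace `evaluate` implementation and should not be expected to reproduce the reported scores; the German example pair is hypothetical\.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-L F1 on lowercased whitespace tokens (no stemming)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


source = "Abdeckung für Batterie ausbauen"     # hypothetical input
reference = "Abdeckung für Batterie einbauen"  # hypothetical gold counterpart
print(rouge_l_f1(reference, reference))  # 1.0
print(rouge_l_f1(source, reference))     # 0.75
```

In this toy pair, an unchanged copy of the input still scores 0\.75 because three of four tokens are shared, which illustrates why overlap scores need to be read alongside action\-level judgments\.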

For retrieval experiments, reported values correspond to mean ± std across 10 folds\. For controlled generation, the best checkpoint is selected by validation ROUGE\-L and final results are reported on the held\-out test split\.

### A\.2 Computational Resources

All experiments were conducted on a single NVIDIA A100 GPU\. The overall training budget across retrieval and generation experiments remained below approximately 100 GPU hours\.

We use publicly available pretrained backbones and fine\-tune them under the experimental configurations reported in the following appendix sections\.

## Appendix B Contrastive Retrieval Configuration

This section summarizes the training configurations and encoder settings used in the contrastive retrieval experiments described in Section [4\.1](https://arxiv.org/html/2606.27808#S4.SS1)\. The appendix reports both the shared default configuration and the experiment\-specific overrides used for the `pairs`, `groups`, and `groups_native` evaluation settings\. Table [7](https://arxiv.org/html/2606.27808#A2.T7) lists the default hyperparameters shared across experiments, Table [8](https://arxiv.org/html/2606.27808#A2.T8) summarizes the setting\-specific overrides, and Table [9](https://arxiv.org/html/2606.27808#A2.T9) reports the encoder architectures, pooling strategies, and normalization choices used in retrieval\.

### B\.1 Common Settings

| Setting | Value |
|---|---|
| Random seed | 42 |
| Device | auto |
| Optimizer | AdamW |
| Learning rate | $1\times 10^{-5}$ |
| Weight decay | 0\.01 |
| Epochs \(default\) | 20 |
| Batch size \(default\) | 64 |
| Max length | 256 |
| Grad clip norm | 1\.0 |
| Log interval | 10 steps |
| Mixed precision | True \(CUDA only\) |
| InfoNCE temperature | 0\.05 |
| Eval metrics | R@1, R@5, R@10, MRR |
| Text columns | `aus_text`, `ein_text` |
| Bucket column | `bucket_id` |
| DataLoader workers | 0 |

Table 7: Common hyperparameters used across contrastive experiments \(from `configs/base.yaml`\)\.

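
To show how the InfoNCE temperature in Table 7 enters the objective, the following is a minimal pure\-Python sketch of an InfoNCE loss with in\-batch negatives\. The one\-directional form \(each anchor scored against all positives in the batch\) and the toy 2\-dimensional vectors are assumptions for illustration; the paper's exact loss variant is not specified in this appendix\.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss; positives[i] is the match for anchors[i], the rest are negatives."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    losses = []
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in p]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings: aligned pairs should give a much smaller loss than swapped pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(anchors, [[1.0, 0.1], [0.1, 1.0]])
swapped = info_nce(anchors, [[0.1, 1.0], [1.0, 0.1]])
print(aligned < swapped)  # True
```

A lower temperature sharpens the softmax over candidates, so small differences in cosine similarity translate into larger differences in loss\.
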
### B\.2 Experiment\-Specific Overrides

| Setting | Pairs | Groups | Groups\-native |
|---|---|---|---|
| Dataset | `train_dedup_by_text_pairs.csv` | `rebucketed_singletons.csv` | `rebucketed_singletons.csv` |
| Split type | k\-fold \(row\) | k\-fold \(group\) | k\-fold \(group\) |
| Group column | – | `bucket_id` | `bucket_id` |
| $k$ \(folds\) | 10 | 10 | 10 |
| Fold index | 0 | 0 | 0 |
| Test fold offset | 1 | 1 | 1 |
| Epochs | 20 | 30 | 30 |
| Batch size | 64 | 128 | 128 |
| Temperature $\tau$ | 0\.07 | 0\.05 | 0\.05 |
| Native\-only filter | – | – | `bucket_id` not starting with "new" |

Table 8: Experiment\-specific configuration overrides \(from `configs/pairs.yaml`, `configs/groups.yaml`, and `configs/groups_native.yaml`\)\.
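
The group\-level split and the native\-only filter in Table 8 can be sketched as follows\. This is an interpretation rather than the authors' code: it assumes that "Fold index" selects the validation fold, that "Test fold offset" selects the test fold relative to it, and that buckets are assigned to folds round\-robin instead of by a seeded shuffle\. The function names are hypothetical\.

```python
def native_only(rows):
    """Keep rows whose bucket_id does not start with "new" (the Groups-native filter)."""
    return [r for r in rows if not r["bucket_id"].startswith("new")]


def group_kfold(rows, k=10, fold_index=0, test_offset=1):
    """Split rows so that every bucket_id lands in exactly one of train/val/test."""
    val_fold = fold_index % k
    test_fold = (fold_index + test_offset) % k  # assumed meaning of the offset
    if val_fold == test_fold:
        raise ValueError("validation and test folds must differ")
    buckets = sorted({r["bucket_id"] for r in rows})
    fold_of = {bucket: i % k for i, bucket in enumerate(buckets)}  # round-robin (assumption)
    split = {"train": [], "val": [], "test": []}
    for r in rows:
        fold = fold_of[r["bucket_id"]]
        name = "val" if fold == val_fold else "test" if fold == test_fold else "train"
        split[name].append(r)
    return split


# Toy rows using the column names from Table 7; the texts are placeholders.
rows = [{"bucket_id": f"b{i:02d}", "aus_text": "...", "ein_text": "..."}
        for i in range(20) for _ in range(2)]
rows.append({"bucket_id": "new_0001", "aus_text": "...", "ein_text": "..."})
split = group_kfold(native_only(rows), k=10, fold_index=0, test_offset=1)
print({name: len(part) for name, part in split.items()})
```

Splitting by bucket rather than by row keeps structurally related instructions out of both training and evaluation at once, which is consistent with the lower scores under the `groups` protocols than under `pairs`\.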

### B\.3 Encoder Configurations

| Encoder | Backend | Model / Architecture | Pooling / Normalize |
|---|---|---|---|
| Multilingual BERT | HF | `bert-base-multilingual-cased` | CLS / L2\-norm |
| German BERT | HF | `bert-base-german-cased` | CLS / L2\-norm |
| ST\-MPNet | ST | `paraphrase-multilingual-mpnet-base-v2` | ST / L2\-norm |
| Char Transformer | Char | $V=5000$, $d=512$, $h=8$, $L=6$, $F=2048$, drop=0\.1 | Mean / L2\-norm |

Table 9: Configurations of different encoders used in contrastive experiments \(from `configs/encoders/*.yaml`\)\.
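
The pooling and normalization choices in Table 9 can be illustrated without any model weights\. The sketch below uses made\-up token vectors and assumes that CLS pooling takes the first token vector and that mean pooling averages only unmasked positions; it is not the authors' encoder code\.

```python
import math


def cls_pool(token_vectors):
    """CLS pooling: use the vector of the first token as the sentence embedding."""
    return list(token_vectors[0])


def mean_pool(token_vectors, attention_mask):
    """Mean pooling: average the vectors at positions where the mask is 1 (padding skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def l2_normalize(vec):
    """Unit-length scaling, so that a dot product of two embeddings is their cosine."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Three token vectors; the last position is padding (toy values).
tokens = [[3.0, 4.0], [1.0, 0.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(l2_normalize(cls_pool(tokens)))         # [0.6, 0.8]
print(l2_normalize(mean_pool(tokens, mask)))  # both components equal, about 0.7071
```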

## Appendix C Contrastive Retrieval Results

This section reports the full retrieval results across all split settings discussed in Section [5\.1\.1](https://arxiv.org/html/2606.27808#S5.SS1.SSS1), including the `pairs`, `groups`, and `groups_native` protocols with progressively stricter separation constraints\. All reported values correspond to mean ± std across 10\-fold cross\-validation\.

Pairs \(k=10, mean ± std\)

| Encoder | Recall@1 | Recall@5 | Recall@10 | MRR |
|---|---|---|---|---|
| Multilingual BERT | 0\.9027 ± 0\.0250 | 0\.9863 ± 0\.0115 | 0\.9945 ± 0\.0060 | 0\.9388 ± 0\.0166 |
| German BERT | 0\.8979 ± 0\.0206 | 0\.9808 ± 0\.0114 | 0\.9925 ± 0\.0048 | 0\.9348 ± 0\.0157 |
| ST\-MPNet | 0\.9089 ± 0\.0210 | 0\.9870 ± 0\.0112 | 0\.9966 ± 0\.0063 | 0\.9430 ± 0\.0135 |
| Char Transformer | 0\.8472 ± 0\.0328 | 0\.9548 ± 0\.0157 | 0\.9726 ± 0\.0150 | 0\.8942 ± 0\.0233 |

Groups \(k=10, mean ± std\)

| Encoder | Recall@1 | Recall@5 | Recall@10 | MRR |
|---|---|---|---|---|
| Multilingual BERT | 0\.6820 ± 0\.0204 | 0\.8680 ± 0\.0181 | 0\.9326 ± 0\.0172 | 0\.7667 ± 0\.0174 |
| German BERT | 0\.6812 ± 0\.0214 | 0\.8672 ± 0\.0199 | 0\.9296 ± 0\.0184 | 0\.7672 ± 0\.0186 |
| ST\-MPNet | 0\.6785 ± 0\.0207 | 0\.8698 ± 0\.0203 | 0\.9345 ± 0\.0181 | 0\.7662 ± 0\.0174 |
| Char Transformer | 0\.6656 ± 0\.0267 | 0\.8546 ± 0\.0197 | 0\.9203 ± 0\.0197 | 0\.7543 ± 0\.0218 |

Groups\-native \(k=10, mean ± std\)

| Encoder | Recall@1 | Recall@5 | Recall@10 | MRR |
|---|---|---|---|---|
| Multilingual BERT | 0\.7337 ± 0\.0198 | 0\.9599 ± 0\.0265 | 0\.9935 ± 0\.0072 | 0\.8259 ± 0\.0161 |
| German BERT | 0\.7330 ± 0\.0243 | 0\.9619 ± 0\.0220 | 0\.9946 ± 0\.0073 | 0\.8260 ± 0\.0189 |
| ST\-MPNet | 0\.7384 ± 0\.0221 | 0\.9589 ± 0\.0199 | 0\.9935 ± 0\.0053 | 0\.8289 ± 0\.0166 |
| Char Transformer | 0\.7058 ± 0\.0323 | 0\.9403 ± 0\.0248 | 0\.9867 ± 0\.0158 | 0\.8040 ± 0\.0244 |

Table 10: Contrastive retrieval results for `pairs`, `groups`, and `groups-native`\. All scores are mean ± std over 10\-fold cross\-validation\. Within each fold, the best checkpoint is selected by validation MRR; Recall@1/5/10 and MRR are reported on the validation split\.
