RuleChef: Grounding LLM Task Knowledge in Human-Editable Rules
Summary
RuleChef is a framework that uses LLMs to generate human-editable, executable rules for NLP tasks, iteratively improving them based on examples and human feedback, resulting in fast, deterministic, and inspectable rule systems.
# RuleChef: Grounding LLM Task Knowledge in Human-Editable Rules

Source: [https://arxiv.org/html/2607.01293](https://arxiv.org/html/2607.01293)

Ádám Kovács¹, Nadia Verdha², Gábor Recski¹,²

¹KR Labs, ²TU Wien

Correspondence: kovacs@krlabs.eu

###### Abstract

We present RuleChef, a framework that uses large language models (LLMs) to generate executable rules for NLP tasks such as text classification, Named Entity Recognition (NER), or relation extraction. Rules are generated from a task description and a set of labeled examples, and are then iteratively improved based both on additional examples and on human feedback over existing rules. RuleChef can also be used to bootstrap rules using the observed input-output pairs from any existing model for a given task. LLMs are used only at learning time, synthesizing rules and iteratively patching them based on failures measured on a held-out split. The result of this process is a fast, deterministic, and inspectable rule system. Preliminary evaluation is performed on both classification and NER tasks. We release RuleChef as open-source software under an Apache 2.0 license at [https://github.com/KRLabsOrg/rulechef](https://github.com/KRLabsOrg/rulechef).

## 1 Introduction

Early NLP systems were predominantly rule-based: hand-crafted patterns and template grammars provided transparent, deterministic models that could be inspected, debugged, and explained to end users (Chiticariu et al., [2013](https://arxiv.org/html/2607.01293#bib.bib62); Valenzuela-Escárcega et al., [2020](https://arxiv.org/html/2607.01293#bib.bib27); Kovács et al., [2022](https://arxiv.org/html/2607.01293#bib.bib21)).
The shift to feature-based classifiers and then to pretrained transformers (Devlin et al., [2019](https://arxiv.org/html/2607.01293#bib.bib63)) and large language models (LLMs) brought substantial gains in coverage and flexibility, but also moved decision logic into latent parameters that are difficult to audit or edit directly (Lipton, [2018](https://arxiv.org/html/2607.01293#bib.bib61)). For applications in sensitive and highly regulated domains such as medical, legal, or financial NLP, a prediction that cannot be traced to an explicit pattern cannot be certified.

For many real-world NLP tasks, especially those with recurring lexical or structural patterns, explicit rules remain attractive: cheap to run, deterministic, easy to version, and straightforward to inspect. The main problem with rules is not their accuracy; it is that writing and maintaining them by hand is labor-intensive and requires substantial domain expertise (Chiticariu et al., [2013](https://arxiv.org/html/2607.01293#bib.bib62); Shnarch et al., [2017](https://arxiv.org/html/2607.01293#bib.bib23); Kovács et al., [2022](https://arxiv.org/html/2607.01293#bib.bib21)).

RuleChef addresses this gap by using LLMs for *rule learning*: rather than running neural models at inference time, it uses them to translate supervision into executable symbolic rules. Given a task specification and supervision signal $S$, RuleChef synthesizes a rule set $R=\{r_1,\ldots,r_k\}$, evaluates it on a held-out development split, clusters the failures, and prompts the LLM to improve the rules. A patch is accepted only if it improves quality on the held-out split, and each surviving rule records the precision it reached there. Rules can also be reviewed by human domain experts, whose feedback and explicit corrections can in turn be used to prompt LLMs to make updates.
The main contributions of this paper are the following:

- A system for synthesizing executable rules from task supervision (examples, corrections, free-text feedback, or observed model behavior), separating learning-time LLM use from inference-time deterministic execution.
- A refinement loop that iteratively improves a rule system, measures the impact of changes on held-out data, and resolves conflicts between rules by their held-out precision.
- Evaluation on two tasks, comparing learned rules against prompting the same LLM and against a dedicated neural extractor, with an ablation study isolating the pipeline’s contribution over one-shot rule prompting.
- Additional experiments on the human-in-the-loop repair process and the unsupervised observation mode that learns rules from the behavior of an external model.

The remainder of this paper is structured as follows. Section [2](https://arxiv.org/html/2607.01293#S2) reviews related work on rule synthesis, weak supervision, and LLM-based program generation. Section [3](https://arxiv.org/html/2607.01293#S3) describes the RuleChef framework. Section [4](https://arxiv.org/html/2607.01293#S4) presents our experimental setup. Section [5](https://arxiv.org/html/2607.01293#S5) reports preliminary evaluation results on a simple intent classification dataset and a more challenging NER task. Section [6](https://arxiv.org/html/2607.01293#S6) discusses key findings and presents opportunities for future work.

## 2 Related Work

In this section we review prior work related to RuleChef across four areas: automatic rule and regex synthesis, interactive rule systems, weak supervision, and LLM-based code and rule generation.
Automatic regex synthesis from examples has been studied with evolutionary methods (Bartoli et al., [2016](https://arxiv.org/html/2607.01293#bib.bib3), [2018](https://arxiv.org/html/2607.01293#bib.bib5)), neural sequence-to-sequence models (Locascio et al., [2016](https://arxiv.org/html/2607.01293#bib.bib6); Zhong et al., [2018](https://arxiv.org/html/2607.01293#bib.bib7)), and systems that combine natural language descriptions with positive and negative examples (Chen et al., [2020](https://arxiv.org/html/2607.01293#bib.bib10); Li et al., [2021](https://arxiv.org/html/2607.01293#bib.bib11)). These approaches produce a single rule or small rule set from examples, but do not include iterative validation or a human-in-the-loop component.

Interactive rule systems such as HEIDL (Sen et al., [2019](https://arxiv.org/html/2607.01293#bib.bib22)), GrASP (Shnarch et al., [2017](https://arxiv.org/html/2607.01293#bib.bib23); Lertvittayakumjorn et al., [2022](https://arxiv.org/html/2607.01293#bib.bib24)), Odinson (Valenzuela-Escárcega et al., [2020](https://arxiv.org/html/2607.01293#bib.bib27)), and POTATO (Kovács et al., [2022](https://arxiv.org/html/2607.01293#bib.bib21)) help users build patterns over texts as well as over graph-based representations of semantics and syntax (e.g. Abstract Meaning Representations (Banarescu et al., [2013](https://arxiv.org/html/2607.01293#bib.bib28)) and Universal Dependencies (Nivre et al., [2018](https://arxiv.org/html/2607.01293#bib.bib29))), but still require substantial manual authoring. RuleChef reduces this manual effort by creating an interface between various sources of supervision (examples, corrections, feedback on rules) and the LLM performing updates to the rule system.
Weak supervision systems such as Snorkel (Ratner et al., [2017](https://arxiv.org/html/2607.01293#bib.bib30)) and Snuba (Varma and Ré, [2018](https://arxiv.org/html/2607.01293#bib.bib31)) use labeling functions to create training labels, and recent work also uses LLM prompts as labeling functions (Smith et al., [2024](https://arxiv.org/html/2607.01293#bib.bib35); Yu and Bach, [2023](https://arxiv.org/html/2607.01293#bib.bib34)). Unlike these systems, RuleChef makes the rules the final model rather than a source of noisy labels for a downstream classifier.

LLMs have also been used to synthesize executable programs from examples, including Evaporate-Code+ (Arora et al., [2023](https://arxiv.org/html/2607.01293#bib.bib37)) and Hypothesis Search (Wang et al., [2024](https://arxiv.org/html/2607.01293#bib.bib41)); broader surveys discuss LLM-based rule and hypothesis generation (Sivasothy et al., [2024](https://arxiv.org/html/2607.01293#bib.bib42); He and Chen, [2025](https://arxiv.org/html/2607.01293#bib.bib43)). RuleChef can also be viewed as a symbolic alternative to LLM knowledge distillation (West et al., [2022](https://arxiv.org/html/2607.01293#bib.bib44); Zhou et al., [2024](https://arxiv.org/html/2607.01293#bib.bib45); Hsieh et al., [2023](https://arxiv.org/html/2607.01293#bib.bib46)) that maximizes explainability and minimizes inference cost.

For the two tasks evaluated here, the relevant neural baselines are GLiNER (Zaratiana et al., [2024](https://arxiv.org/html/2607.01293#bib.bib48)) and its schema-driven successor GLiNER2 (Zaratiana et al., [2025](https://arxiv.org/html/2607.01293#bib.bib66)) for zero-shot extraction of text spans, and dual sentence encoders for intent detection (Casanueva et al., [2020](https://arxiv.org/html/2607.01293#bib.bib49); Zhang et al., [2021](https://arxiv.org/html/2607.01293#bib.bib67)).
## 3 The RuleChef Framework

In this section we describe the main functionalities of the RuleChef framework, including task definition, rule synthesis, the refinement loop, conflict resolution, and the observation mode that allows autonomous operation in the presence of a pre-existing model. A high-level overview is presented in Figure [1](https://arxiv.org/html/2607.01293#S3.F1). We draw the examples in this section from the Text Anonymization Benchmark dataset (TAB, see Section [4](https://arxiv.org/html/2607.01293#S4)), our main testbed. The prompt templates corresponding to the various rule learning strategies described here are presented in Appendix [A](https://arxiv.org/html/2607.01293#A1).

[Figure 1 diagram. Learning time (LLM involved): supervision (examples, corrections, feedback, observations) → synthesize (per class, grex hints) → evaluate (on dev split) → cluster (failure modes) → patch (+ critic, audit) → accept iff dev F1 improves → stamp validated precision and support; the loop iterates. Inference: rules only, ≈1 ms/doc.]

Figure 1: The RuleChef pipeline. Orange components call the LLM; blue components are deterministic. The LLM proposes rules and patches, but acceptance is decided by quality measured on a held-out development split, and each rule records its precision there. At inference time only the rules run.

### 3.1 Rule Synthesis

We structure the synthesis prompt in four sections: (1) the task definition with input/output schemas, (2) sampled training examples and corrections, (3) data evidence with pattern suggestions, and (4) format instructions with response schema. For multi-class tasks, we generate one prompt per label with positive examples and counter-examples from other classes, preventing cross-class interference.

The task definition specifies input/output schemas and one of four task types: Classification, NER, Extraction (untyped spans), or Transformation (structured output).
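The per-label prompting described above can be sketched as follows. This is only an illustration of selecting positive examples and counter-examples for one label: the names `TaskType`, `Example`, and `examples_for_label`, the sampling limits, and the toy queries are all assumptions made here, not RuleChef's actual API or data.

```python
import random
from dataclasses import dataclass
from enum import Enum


class TaskType(Enum):
    """The four task types named in the text (identifiers are hypothetical)."""
    CLASSIFICATION = "classification"
    NER = "ner"
    EXTRACTION = "extraction"          # untyped spans
    TRANSFORMATION = "transformation"  # structured output


@dataclass(frozen=True)
class Example:
    text: str
    label: str


def examples_for_label(examples, label, n_pos=5, n_neg=5, seed=0):
    """Pick positives for `label` and counter-examples from the other classes."""
    rng = random.Random(seed)  # fixed seed keeps the sampled prompt reproducible
    pos = [e for e in examples if e.label == label]
    neg = [e for e in examples if e.label != label]
    return (rng.sample(pos, min(n_pos, len(pos))),
            rng.sample(neg, min(n_neg, len(neg))))


# Invented toy data, for illustration only.
data = [
    Example("What is the exchange rate for EUR?", "exchange_rate"),
    Example("How do you calculate exchange rates?", "exchange_rate"),
    Example("My card has not arrived yet", "card_arrival"),
    Example("Why was my transfer declined?", "declined_transfer"),
]
positives, counters = examples_for_label(data, "exchange_rate")
```

Keeping counter-examples from other classes in the same prompt is what lets the LLM see where a candidate pattern must not fire; how many of each to sample is a tuning choice that the text does not specify.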
Rules can be requested in one of three formats: regular expressions (regexes), spaCy rules that rely on the output of part-of-speech tagging and syntactic parsing, and arbitrary Python code. While RuleChef supports all three formats, the examples and evaluation in this paper focus only on regex patterns. Each generated rule is stored along with metadata that records its priority as well as its precision and match count on the held-out development set.

By default the data evidence provided to the LLM consists of labeled examples, but RuleChef can also augment this with regex suggestions obtained from grex (Stahl, [2019](https://arxiv.org/html/2607.01293#bib.bib55)), a regex generator that derives structural patterns from example strings. Such patterns serve as hints, not constraints, helping the LLM identify structural regularities without overfitting.

We validate each generated rule before acceptance: regex patterns must compile, output templates must match the task schema, and patterns that match arbitrary text are rejected by probing them against generic strings.

### 3.2 Refinement

After initial synthesis, a refinement loop evaluates the current rules, identifies failures, and generates *patch rules* targeting missed or misclassified inputs. Data is split into training and development portions (explicit user corrections always stay in train, since they are the highest-value signal for patching). The synthesis step for rule patching can only access training data; the dev split is used to decide whether a newly generated rule is accepted. A patch is kept only if held-out F1 does not degrade or precision improves. This filter prevents memorization: as we show in Section [5.2](https://arxiv.org/html/2607.01293#S5.SS2), the same loop without holdout acceptance drifts toward patterns that fit the failures it was shown rather than ones that generalize.
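The acceptance criterion stated above can be written down in a few lines. The sketch below is a literal reading of "held-out F1 does not degrade or precision improves"; the class `DevMetrics`, the function `accept_patch`, and the tolerance `eps` are hypothetical names introduced here, and the numbers in the usage lines are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DevMetrics:
    """Precision and recall of a rule set measured on the development split."""
    precision: float
    recall: float

    @property
    def f1(self) -> float:
        denom = self.precision + self.recall
        return 0.0 if denom == 0 else 2 * self.precision * self.recall / denom


def accept_patch(before: DevMetrics, after: DevMetrics, eps: float = 1e-9) -> bool:
    """Keep a patch if dev F1 does not degrade or dev precision improves."""
    f1_not_worse = after.f1 >= before.f1 - eps
    precision_better = after.precision > before.precision + eps
    return f1_not_worse or precision_better


before = DevMetrics(precision=0.80, recall=0.50)
improved = DevMetrics(precision=0.82, recall=0.55)   # F1 goes up: accepted
degraded = DevMetrics(precision=0.70, recall=0.45)   # both get worse: rejected
decisions = (accept_patch(before, improved), accept_patch(before, degraded))
```

Read literally, the disjunction also admits a patch that raises precision while lowering F1, which fits the precision-first design discussed in Section 5.1. A stricter variant, closer to the "accept iff dev F1 improves" wording of Figure 1, would drop the second condition and require a strict F1 gain.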
With hundreds of training documents, the failures of an intermediate rule set number in the thousands, far more than fit in a patch prompt. RuleChef therefore clusters failures by their signature. For NER-type tasks, failure modes include missed span, spurious span, and wrong type. The LLM is presented with the full distribution of failure modes plus a sample of failed instances from each cluster. The patch prompt also carries the current rule set with per-rule metrics and all accumulated feedback; an example is shown in Figure [2](https://arxiv.org/html/2607.01293#S3.F2). The LLM can modify existing rules, add new ones, or delete overly broad rules when providing narrower replacements.

> `Rule: "case_and_echr_numbers", P=86% on dev`
>
> `failure_mode: CODE missed_span (84 cases), e.g. "nos. 6210/73 and 6877/75"`

Figure 2: Example input rule for the refinement step. The rule is referred to by its name, followed by its dev-set precision, failure type, and number of cases.

For streaming or batch-wise data, RuleChef can skip the initial synthesis step and update an existing ruleset directly: we commit new examples, corrections, and feedback, evaluate the current rules, and use the resulting failures to drive patch synthesis, preserving the rest of the ruleset. This is the mechanism behind the human-feedback experiment described in Section [5.3](https://arxiv.org/html/2607.01293#S5.SS3).

#### Agentic coordination.

The loop can be driven by a fixed schedule or by an LLM that reasons about the current state. The *simple coordinator* runs a fixed number of iterations; the *agentic coordinator* instead reads the per-class metrics after each iteration to steer the next patch, runs a periodic *critic* that adds rule-level feedback through the same channel as a human, and runs a periodic *audit* that merges redundant rules or removes ineffective ones, reverting any change that lowers measured quality.
Appendix [A](https://arxiv.org/html/2607.01293#A1) shows the complete prompt templates for each agent; Section [5.2](https://arxiv.org/html/2607.01293#S5.SS2) measures their effect.

### 3.3 Conflict Resolution and Pruning

When learning concludes, we run each rule alone on the development split and record its precision there along with the number of matches. When two rules overlap or disagree on a span, the executor keeps the match from the higher-priority rule, breaking ties based on dev-set precision. Low-support estimates are discounted by a Wilson lower bound (Wilson, [1927](https://arxiv.org/html/2607.01293#bib.bib68)), so a rule right twice out of twice does not outrank one right 95 times out of 100. A leave-one-out pass measures each rule’s marginal contribution to ensemble F1 and drops rules whose removal does not hurt it.

Measuring precision on data the rule was not tuned on also separates good rules from memorized ones. On the TAB data, the rule `case_and_echr_numbers`, corresponding to the regex `(?:no\.?\s*)(\d{4,6}/\d{2,4})`, achieves 0.86 precision with 22 true positive matches on the development set. The overly generic rule Quantity, by contrast, matched only two development spans at precision 1.00, which the Wilson bound rightly discounts: the rule also matched fraction-shaped case numbers such as *1432/03* (5.7 F1 on test) before the feedback repair step (see Section [5.3](https://arxiv.org/html/2607.01293#S5.SS3)). Rules that memorized training lexicons show up the same way, with low development-set precision (see Section [5.4](https://arxiv.org/html/2607.01293#S5.SS4)).

#### Observation mode.

For settings where an LLM already handles production traffic, RuleChef can use the model’s behavior as a supervision source rather than requiring labels upfront. Each LLM call is treated as a training example, and RuleChef periodically synthesizes rules from the accumulated observations using the standard pipeline.
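As a rough sketch of how observation mode could be wired around an existing model, the code below logs every call that reaches the LLM as a training example and answers a query from a rule only when a matching rule's measured precision exceeds a threshold. The class `ObservationRouter`, the `(pattern, label, precision)` rule tuples, the 0.9 threshold, and the stand-in `fake_llm` are all assumptions made for illustration and do not reflect RuleChef's API.

```python
import re
from typing import Callable, List, Tuple

Rule = Tuple[str, str, float]  # (regex pattern, label, measured dev precision)


class ObservationRouter:
    """Hypothetical wrapper that records LLM calls and routes to trusted rules."""

    def __init__(self, llm: Callable[[str], str], min_precision: float = 0.9):
        self.llm = llm
        self.min_precision = min_precision
        self.rules: List[Rule] = []
        self.observations: List[Tuple[str, str]] = []  # (input, LLM output)

    def route(self, query: str) -> Tuple[str, str]:
        """Return (label, source), where source is 'rule' or 'llm'."""
        for pattern, label, precision in self.rules:
            if precision > self.min_precision and re.search(pattern, query, re.I):
                return label, "rule"
        label = self.llm(query)
        # Each LLM call becomes a training example for later rule synthesis.
        self.observations.append((query, label))
        return label, "llm"


def fake_llm(query: str) -> str:
    """Stand-in for the production model, only for this sketch."""
    return "exchange_rate" if "rate" in query.lower() else "other"


router = ObservationRouter(fake_llm)
first = router.route("What is your exchange rate?")        # no rules learned yet
router.rules.append((r"\bexchange rates?\b", "exchange_rate", 0.95))
second = router.route("Tell me the exchange rate for USD")  # now served by a rule
```

In this sketch, rule synthesis itself is left out: the appended rule stands in for whatever the standard pipeline would produce from `router.observations`.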
Matched queries are then routed to rules instead of the LLM. A deployment can delegate only to rules whose measured precision exceeds a threshold, leaving the rest to the LLM. Observation mode also supports task discovery, where RuleChef prompts an LLM to discover the full task specification based on the input-output examples. This allows the system to generate an initial rule system based on an external black box model without human input.

## 4 Experimental Setup

We provide preliminary evaluation of the RuleChef methodology introduced in this paper via experiments on two datasets that represent two of the most common NLP tasks, Named Entity Recognition and text classification. In this section we describe the datasets and the experimental setup; results follow in Section [5](https://arxiv.org/html/2607.01293#S5).

### 4.1 TAB: Anonymization of Court Decisions

Our primary experiment uses the Text Anonymization Benchmark (TAB; Pilán et al., [2022](https://arxiv.org/html/2607.01293#bib.bib65)), a corpus of 1,268 decisions of the European Court of Human Rights, with human annotation marking all text spans that represent personal information. We use TAB’s eight official entity types unchanged: Person, Code (case numbers, phone numbers, license plates, etc.), Datetime, Quantity, Org, Loc, Dem (demographic attributes such as nationality or profession), and Misc. For analysis we additionally group them by what governs their surface form: *format types* (Code, Datetime, Quantity), whose mentions follow formal patterns, and *semantic types* (Person, Org, Loc, Dem, Misc), defined by meaning rather than form. Since full documents are too long to fit into the context window of LLMs for rule learning, we segment them into chunks of at most 600 characters. For the experiments described here we generate a sample of 1,000 training and 600 test chunks.
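The text does not say how chunk boundaries are chosen, so the following is only one plausible way to segment a document into chunks of at most 600 characters. The function `chunk_text` and its whitespace-preferring split are assumptions; start offsets are returned so that span annotations could be mapped back to the full document.

```python
from typing import List, Tuple


def chunk_text(text: str, max_chars: int = 600) -> List[Tuple[int, str]]:
    """Split text into (start_offset, chunk) pieces of at most max_chars."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to break right after the last space inside the window,
            # so that words (and most entity mentions) are not cut in half.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut + 1
        chunks.append((start, text[start:end]))
        start = end
    return chunks


doc = "word " * 300  # 1,500 characters of toy input
chunks = chunk_text(doc, max_chars=600)
```

Concatenating the chunks reproduces the input exactly, which keeps character offsets valid. Annotated spans that straddle a chunk boundary would still need a policy (drop, clip, or move the boundary), which this sketch does not attempt.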
Rule systems generated by RuleChef are evaluated on this test set as well as on the official test split, which contains 127 full, unchunked documents. *RuleChef* uses Kimi-K2.6 as the rule-writing LLM with the agentic coordinator, three refinement iterations, and a 20% development holdout; at inference time only the learned rules run. Our main baseline for comparison is *GLiNER2* (Zaratiana et al., [2025](https://arxiv.org/html/2607.01293#bib.bib66)), a 205M-parameter schema-driven extractor.

### 4.2 Banking77: Intent Classification and Observation Mode

The text classification experiments use Banking77 (Casanueva et al., [2020](https://arxiv.org/html/2607.01293#bib.bib49)), a dataset of over 13k customer service queries classified by user intent into 77 categories. Some of these classes are easily detected via keywords (such as `exchange_rate`), while others, such as `beneficiary_not_allowed`, are more challenging. A set of 200 examples spanning 25 intent classes is used as the held-out test set in our experiments. Given the relative simplicity of this task, our baseline for comparison is based on directly prompting the same LLM that we use for rule generation (Kimi-K2-Instruct). This approach achieves over 98% accuracy on this test set and is also used to evaluate RuleChef in observation mode, where system-generated labels provide the only source of supervision for rule generation. At each iteration we measure rule coverage, precision, and the fraction of subsequent LLM calls replaced. As context, fine-tuned dual-encoder and contrastive models reach 86–87% accuracy on the full 77-class task with 10 shots per class (Zhang et al., [2021](https://arxiv.org/html/2607.01293#bib.bib67)).

## 5 Results

In this section we report results on the two tasks introduced in Section [4](https://arxiv.org/html/2607.01293#S4).
Sections [5.1](https://arxiv.org/html/2607.01293#S5.SS1)–[5.3](https://arxiv.org/html/2607.01293#S5.SS3) evaluate on TAB, comparing RuleChef-generated rules against published baselines and measuring the impact of the feedback-repair process. Section [5.4](https://arxiv.org/html/2607.01293#S5.SS4) provides qualitative analysis of the rules learned on the TAB dataset. Section [5.5](https://arxiv.org/html/2607.01293#S5.SS5) uses the Banking77 dataset, measures performance against few-shot classification approaches, and evaluates RuleChef’s observation mode.

### 5.1 Rule system performance

| System | Format P | Format R | Format F1 | Semantic P | Semantic R | Semantic F1 | ms/doc |
|---|---|---|---|---|---|---|---|
| RuleChef (rules only) | 89.1 | 70.5 | 78.7 | 75.7 | 34.9 | 47.8 | 0.6 |
| LLM prompting | 65.1 | 87.9 | 74.8 | 54.6 | 60.5 | 57.4 | ≈1500 |
| GLiNER2 (labels) | 66.5 | 78.0 | 71.8 | 31.7 | 46.2 | 37.6 | 190 |
| GLiNER2 (schema) | 69.6 | 79.9 | 74.4 | 34.9 | 49.0 | 40.8 | 190 |

Table 1: Results on the 600 test chunks of the TAB dataset. The Format group includes the entity types Code, Datetime, and Quantity; the Semantic group includes Person, Org, Loc, Dem, and Misc.

Table [1](https://arxiv.org/html/2607.01293#S5.T1) compares the performance of learned rules against direct LLM prompting and two GLiNER2 baselines. On classes that can typically be detected from the surface form of the entities, learned rules achieve the highest F1 of all systems. On the more challenging semantic types their F1 remains below that of direct LLM prompting, but they still outperform the GLiNER2 baselines and achieve the highest precision among all systems. Higher precision at the cost of lower recall is the intended behavior: RuleChef is designed to generate individual rules with high precision, so that incrementally growing a rule system by adding patterns can eventually lead to a system with both high precision and high recall. A qualitative analysis of learned rules is provided in Section [5.4](https://arxiv.org/html/2607.01293#S5.SS4).
We also measure inference latency for each system to quantify the obvious fact that using regex-based rules means having virtually zero inference costs. Meanwhile, the rule learning process involved fewer than 20 LLM calls and took approx. 12 minutes.

| System | P | $R_{\text{all}}$ | $R_{\text{direct}}$ | supervision |
|---|---|---|---|---|
| RuleChef (22 rules) | .738 | .719 | .830 | 1,000 chunks |
| Longformer† | .836 | .919 | — | fine-tuned, 1,013 docs |
| RoBERTa NER† | .441 | .906 | — | off-the-shelf |

Table 2: Official TAB test split (127 full documents), scored with the benchmark’s own evaluation script: token-level precision weighted by BERT information content, and mention-level recall over annotated spans (all identifiers / direct identifiers only). †: as reported by Pilán et al. ([2022](https://arxiv.org/html/2607.01293#bib.bib65)).

We also run the learned ruleset over the 127 full, unchunked documents of the official TAB test split and score it with the benchmark’s own evaluation script, which measures recall over annotated spans and token-level precision weighted by BERT-estimated information content. Table [2](https://arxiv.org/html/2607.01293#S5.T2) compares results with the two baselines reported under the same protocol by Pilán et al. ([2022](https://arxiv.org/html/2607.01293#bib.bib65)). Performance of the 22 patterns learned by RuleChef remains well below that of the fine-tuned Longformer system on both precision and recall, but compares favorably with the off-the-shelf RoBERTa system, achieving nearly 30 points higher precision at the cost of 19 points lower recall. These results illustrate that RuleChef can generate fully transparent, inspectable and editable rule systems that are competitive with off-the-shelf neural models.
In application scenarios where transparency, explainability, or auditability of decisions is critical, such high-precision systems can serve either as the basis of further development of high-performing rule systems or as the preferred model in a cascade of systems that also include models with higher recall at the cost of precision and/or explainability.

### 5.2 Ablation

| Configuration | Fmt F1 | Sem F1 | rules | calls |
|---|---|---|---|---|
| one-shot rule prompting | 74.2 | 7.4 | 31 | 8 |
| + refinement on train | 63.2 | 38.2 | 38 | 12 |
| + holdout acceptance | 83.5 | 44.4 | 36 | 14 |
| + agentic coord. (3 iter.) | 80.9 | 35.9 | 21 | 19 |
| agentic, 8 iterations | 81.7 | 47.8 | 9 | 33 |

Table 3: Ablation on TAB. Each row adds one component; “calls” counts all learning-time LLM calls (synthesis, patches, critic, audit). One-shot prompting writes plausible format regexes but is nearly useless on semantic types; the validated refinement loop does the work.

A simple ablation study is performed to understand the impact of RuleChef’s iterative learning process on the quality of the final rule system. In particular this involves isolating the baseline performance of prompting an LLM to generate a rule system in a single pass. We observe that in such settings LLMs rely on the task description at least as much as on the initial training examples, and it is the subsequent iterations, guided by additional examples and feedback on failures, that allow LLMs to move from a generic rule system towards one that fits the actual task represented by the training dataset. Table [3](https://arxiv.org/html/2607.01293#S5.T3) shows the effect of various refinement steps on the TAB dataset. Note that the row “+ agentic coord.” corresponds to the main configuration evaluated in earlier sections, but the numbers differ (e.g. Format F1 of 80.9 vs. 78.7) due to the non-determinism of RuleChef’s process that is inherent to the iterative prompting of LLMs (see also the section on Limitations).
Once again we observe a stark contrast between the two label groups Format (FMT) and Semantic (SEM). On the FMT group, which contains entity types easily identified by their format (dates, quantities, various identification numbers), a single pass of an LLM can create a ruleset that is within 5-10 percentage points of the best ones in terms of F-score. In the case of the SEM group, which contains traditional NE types such as PER, LOC, and ORG, an initial rule system is practically useless. Refining the rules by providing more training examples achieves substantial improvement on the SEM group, while the strong baseline on the FMT group deteriorates. However, the additional step of filtering new rules based on their performance on the dev set (holdout acceptance) substantially increases performance on both sets of entity types.

The effect of agentic coordination varies greatly with the number of iterations. In this simple setup we see that 3 iterations lead to a decrease in performance compared to the strongest system on both groups, while 8 iterations produce a superior system on the more challenging SEM group but cannot restore the original top performance on the FMT group. The total number of rules across all entity types is of particular interest. After 3 iterations of the coordinator, the number of patterns is already reduced to 21 from the 36 of the previous system, and after 8 iterations it contains only 9 rules altogether, which still achieve top scores on the SEM group and are within 2 points of the best system on the FMT group.

We believe that these results, while preliminary, already give a clear indication that a guided, iterative rule generation approach can result in robust rule-based systems even for challenging tasks like NER.
Properly understanding how a rule system evolves as a result of such iterative improvement strategies, and how this process varies across types of language processing tasks, will require considerable further experimentation, including qualitative analysis of intermediate rule systems for each task.

### 5.3 Repairing Rules with Feedback

Because RuleChef models are lists of readable rules, their defects can also be addressed explicitly via human-in-the-loop (HITL) feedback. The next set of experiments measures the impact of this process. We inspected the learned rules and targeted three issues: the Quantity rule matched fraction-shaped case numbers like *1432/03*, the Code rule missed bare application numbers, and a Person rule for initials fired on ordinary abbreviations. We attached one sentence of feedback to each rule through the standard interface, e.g.:

> “Never match number/number patterns like ‘1432/03’—those are case numbers, not quantities.”

| Type | before F1 | after F1 | Δ |
|---|---|---|---|
| Quantity (feedback) | 5.7 | 35.6 | +29.8 |
| Code (feedback) | 45.4 | 48.8 | +3.4 |
| Person (feedback) | 53.4 | 54.0 | +0.6 |
| Datetime (untouched) | 85.1 | 85.1 | 0.0 |
| Dem (untouched) | 40.4 | 37.7 | −2.7 |

Table 4: Results of a single round of human-in-the-loop repair using three sentences of rule-level feedback.

Results are shown in Table [4](https://arxiv.org/html/2607.01293#S5.T4). A single repair round (92 seconds, two LLM calls) substantially improves performance on the Quantity class (from 5.7 to 35.6 F1), while on the other two criticized classes the performance improves modestly. This workflow of improving individual rules based on a human’s understanding is one that hand-crafted rule systems always supported and neural models do not.
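To illustrate what the Quantity repair amounts to at the regex level, the sketch below contrasts an over-general number pattern with a variant that refuses number/number shapes. Both patterns and the sample sentence are invented for illustration; the actual Quantity rule learned by RuleChef and the patch produced from the feedback are not shown in this section.

```python
import re

# Hypothetical over-general rule: any standalone number, with optional
# thousands separators and decimals.
QUANTITY_BEFORE = re.compile(r"\b\d+(?:,\d{3})*(?:\.\d+)?\b")

# Hypothetical repaired rule: the same core, but a lookbehind and a lookahead
# reject numbers that touch a slash, as in the case number "1432/03".
QUANTITY_AFTER = re.compile(r"(?<![\d/])\b\d+(?:,\d{3})*(?:\.\d+)?\b(?!/)")

text = "application no. 1432/03 concerned 5,000 euros over 3 years"

before_matches = QUANTITY_BEFORE.findall(text)
after_matches = QUANTITY_AFTER.findall(text)
```

With this input, the first pattern also picks up both halves of the case number, while the second keeps only `5,000` and `3`. A one-sentence instruction like the one quoted above maps naturally onto such a narrow, local edit, which is part of why rule-level feedback is cheap to apply.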
### 5.4 Qualitative Analysis of Learned Rules

| Rule name | Pattern (abridged) | dev P | n |
|---|---|---|---|
| single_date | `\b\d{1,2}\s+(?:January\|...)` | .95 | 362 |
| court_names | `(?:District\|Regional\|Supreme\|...)\s+Court` | .79 | 118 |
| titled_full_name | `\b(?:Mr\|Mrs\|Dr\|Justice)\.?\s+[A-Z]...` | .93 | 69 |
| republic_kingdom_gov. | `(?:Federal Republic of Germany\|...)` | .97 | 37 |
| specific_institutions | `(?:BBC Scotland\|House of Lords\|...)` | .47 | 30 |
| case_and_echr_numbers | `(?:no\.?\s*)(\d{4,6}/\d{2,4})` | .86 | 22 |
| initials_with_period | `(?:[A-Z]\.[\s-]?){1,4}...` | .35 | 17 |
| lives_in_location | `(?:lives\|resides)\s+in\s+([A-Z]...)` | 1.00 | 12 |

Table 5: Examples of rules learned on the TAB dataset, ranked by the number of matches on the dev set.

Table [5](https://arxiv.org/html/2607.01293#S5.T5) shows examples of rules learned on the TAB dataset, together with their precision and match count on the development set. Descriptive rule names have been generated by the LLMs. We observe that high-coverage high-precision rules encode task knowledge such as formatting rules (date formats in the rule `single_date`, honorifics in the rule `titled_full_name`, etc.) and lists of terms (`court_names`, `republic_kingdom_gov`). In contrast, the rule `lives_in_location` encodes patterns about the context in which entities of a certain type occur.

We can also observe that some rules encoding lists of entities or formatting conventions achieve substantially lower precision than others. The list of organizations `specific_institutions` is overly generic, also containing standalone words such as Parliament, and so is the pattern `initials_with_period`, which matches abbreviations that are not person names. These are the kinds of rules that must be the target of subsequent improvement steps that refine them based on examples of false positives presented to the LLM.
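The precision values in Table 5 rest on very different match counts, which is why Section 3.3 discounts low-support estimates with a Wilson lower bound. The sketch below computes that bound for the two cases mentioned there (2 out of 2 versus 95 out of 100) and for a rule that is right 12 times out of 12, as `lives_in_location` is in Table 5. The function name and the choice of `z = 1.96` (a 95% interval) are assumptions; the confidence level RuleChef uses is not stated in the text.

```python
import math


def wilson_lower_bound(correct: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion."""
    if total == 0:
        return 0.0
    p = correct / total
    z2 = z * z
    centre = p + z2 / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total))
    return (centre - margin) / (1 + z2 / total)


two_of_two = wilson_lower_bound(2, 2)        # about 0.34
ninety_five = wilson_lower_bound(95, 100)    # about 0.89
twelve_of_twelve = wilson_lower_bound(12, 12)  # about 0.76
```

Under these assumptions the rule with 95 correct matches out of 100 ranks first, the 12-out-of-12 rule second, and the 2-out-of-2 rule last, even though the raw precisions would order them the other way around.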
### 5.5 Few-Shot Classification and Observation Mode

| System | Rules | Cov. | Repl. | P | R | F1 |
|---|---|---|---|---|---|---|
| LLM prompting | — | 100 | — | 98.0 | 98.0 | 98.0 |
| Zero-shot NLI (DeBERTa-v3) | — | 100 | — | 96.5 | 96.5 | 96.5 |
| LogReg (MiniLM emb.) | — | 100 | — | 95.0 | 95.0 | 95.0 |
| RuleChef (few-shot) | 126 | 62.5 | — | 97.6 | 61.0 | 75.1 |
| *Observation mode (no labels)* | | | | | | |
| after 10 calls | 14 | 20.5 | 19.5 | 92.7 | 19.0 | 31.5 |
| after 25 calls | 24 | 42.5 | 41.0 | 95.3 | 40.5 | 56.8 |
| after 50 calls | 40 | 49.5 | 48.0 | 96.0 | 49.5 | 63.6 |

Table 6: Results on the 200 dev set queries of the Banking77 dataset. Top rows compare rule performance to three baselines: simple LLM prompting, a logistic-regression classifier on MiniLM sentence embeddings, and the DeBERTa-v3 model `MoritzLaurer/deberta-v3-base-zeroshot-v1.1-all-33` used for zero-shot classification, which scores each candidate label as an entailment hypothesis. Bottom rows show the performance of RuleChef’s observation mode, which discovers class labels and learns rules based on a growing number of input-output examples from the top-performing LLM-based approach. Coverage (Cov.) measures the fraction of inputs covered by any learned rule.

We use the Banking77 dataset to test RuleChef on a simple 5-way intent classification task. The results in Table [6](https://arxiv.org/html/2607.01293#S5.T6) show that simple LLM prompting as well as standard supervised learning approaches achieve precision and recall in the .95-.98 range, and that the rule system learned by RuleChef is competitive with these methods in terms of precision but not in terms of recall, a typical outcome for rule-based systems. The table also shows rule performance in various stages of RuleChef’s observation mode, where the system discovers labels and learns rules based on input-output examples from an external model (the top-performing LLM-based approach). Here we can see that as the number of observed examples grows from 10 to 50, precision grows from .93 to .96 and recall increases from .19 to .50.
Regardless of whether our method would be able to further increase recall without sacrificing precision, a system such as the one learned from only 50 examples and without human intervention is already robust enough to be used as the first tier of a hybrid system, replacing black box inference for half of the inputs and offering a clear gain in both transparency and computational cost\.

## 6 Discussion and Future Work

We presented the RuleChef methodology for constructing rule\-based systems for various NLP tasks via iterative refinement based on annotated examples and human feedback\. Using a generic Named Entity Recognition task as our main testbed, we have shown that for classes of entities exhibiting highly regular surface patterns it is possible to learn fully transparent rule\-based systems that are competitive with standard black box approaches\. On the more challenging groups of entities RuleChef can produce a set of high\-precision rules with substantial recall that can form the basis of subsequent improvements and can function as the preferred model in hybrid systems, increasing explainability and reducing computational cost\. Our simple ablation study offers a preliminary evaluation of the various strategies for iterative learning implemented by RuleChef\. In addition to the process of introducing training examples gradually and only accepting refinements that improve overall performance, we also showcase the capabilities of the agentic coordination feature, which allows an external model to orchestrate learning steps based on observed rule performance and perceived rule quality\. RuleChef is released as open\-source software under an Apache 2\.0 license \([https://github\.com/KRLabsOrg/rulechef](https://github.com/KRLabsOrg/rulechef)\), including all code necessary to reproduce the experimental results presented in this paper\.
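The hybrid arrangement referred to above, in which learned rules answer first and a black box model handles only the inputs that no rule covers, can be sketched as follows\. This is an illustrative outline under stated assumptions and not part of RuleChef: the function make\_hybrid, the first\-match\-wins ordering, the two toy intent rules, and the stub fallback model are all hypothetical\.

```python
import re


def make_hybrid(rules, fallback):
    """Build a predictor that tries regex rules first and a fallback model second.

    rules: list of (label, pattern) pairs, tried in order (first match wins,
           an assumed conflict-resolution strategy).
    fallback: any callable mapping text to a label, e.g. a wrapper around an LLM.
    """
    compiled = [(label, re.compile(pattern, re.IGNORECASE)) for label, pattern in rules]

    def predict(text):
        for label, regex in compiled:
            if regex.search(text):
                return label, "rule"       # transparent, deterministic path
        return fallback(text), "fallback"  # black box path for uncovered inputs

    return predict


# Toy rules and a stub model; labels and patterns are illustrative only.
rules = [
    ("card_lost", r"\b(?:lost|stolen)\b.*\bcard\b"),
    ("exchange_rate", r"\bexchange\s+rate\b"),
]


def stub_model(text):
    """Stand-in for an expensive model call."""
    return "other"


predict = make_hybrid(rules, stub_model)
print(predict("I lost my card yesterday"))     # ('card_lost', 'rule')
print(predict("What is your exchange rate?"))  # ('exchange_rate', 'rule')
print(predict("How do I close my account?"))   # ('other', 'fallback')
```

Returning the source of each decision alongside the label makes it easy to measure what fraction of traffic the rule tier actually absorbs\.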
RuleChef is a generic framework implementing a variety of approaches to rule learning for NLP, and the empirical results presented in this paper are strictly preliminary\. We expect that the utility of each RuleChef feature will vary greatly across domains, genres, and datasets\. The main approaches described in Section [3](https://arxiv.org/html/2607.01293#S3) would each benefit from in\-depth experimental evaluation\. The quality of rule synthesis via LLMs depends on the choice of prompts and their ability to separate training signals from pre\-existing model bias\. Iterative improvement by introducing additional training examples depends heavily on chunking and sampling strategies as well as on the acceptance criteria for newly introduced rules\. The use of direct human feedback should be explored under various constraints on the form and content of such user input, while agentic coordination depends on stand\-alone models for assessing and criticizing rules, a complex task in its own right\.

## Limitations

The broad range of methods introduced and evaluated in this paper introduces a number of limitations\. Quantitative results presented are based on single runs and fixed splits\. Repeated runs of selected configurations show variations in F1 performance of approx\. ±3 points even for the more regular FORMAT group of entities, and we report metrics from representative experiments rather than means over multiple runs\. All experiments use English texts as input, and a single LLM family for rule generation \(Kimi\-K2\)\. Each main method involving LLM invocations also introduces the need for the systematic evaluation of how model bias influences the rule learning process and how this may introduce bias into the final rule systems\.

## References

- S\. Arora, B\. Yang, S\. Eyuboglu, A\. Narayan, A\. Hojel, I\. Trummer, and C\.
Ré \(2023\) Language models enable simple systems for generating structured views of heterogeneous data lakes\. Proceedings of the VLDB Endowment 17\(2\), pp\. 92–105\. External Links: [Document](https://dx.doi.org/10.14778/3626292.3626294)\.
- L\. Banarescu, C\. Bonial, S\. Cai, M\. Georgescu, K\. Griffitt, U\. Hermjakob, K\. Knight, P\. Koehn, M\. Palmer, and N\. Schneider \(2013\) Abstract meaning representation for sembanking\. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pp\. 178–186\. External Links: [Link](https://aclanthology.org/W13-2322/)\.
- A\. Bartoli, A\. De Lorenzo, E\. Medvet, and F\. Tarlao \(2016\) Inference of regular expressions for text extraction from examples\. IEEE Transactions on Knowledge and Data Engineering 28\(5\), pp\. 1217–1230\.
- A\. Bartoli, A\. De Lorenzo, E\. Medvet, and F\. Tarlao \(2018\) Active learning of regular expressions for entity extraction\. IEEE Transactions on Cybernetics 48\(3\), pp\. 1067–1080\.
- I\. Casanueva, T\. Temčinas, D\. Gerz, M\. Henderson, and I\. Vulić \(2020\) Efficient intent detection with dual sentence encoders\. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pp\. 38–45\. External Links: [Document](https://dx.doi.org/10.18653/v1/2020.nlp4convai-1.5), [Link](https://aclanthology.org/2020.nlp4convai-1.5/)\.
- Q\. Chen, X\. Wang, X\. Ye, G\. Durrett, and I\. Dillig \(2020\) Multi\-modal synthesis of regular expressions\. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation \(PLDI\), pp\. 487–502\.
- L\.
Chiticariu, Y\. Li, and F\. R\. Reiss \(2013\) Rule\-based information extraction is dead\! long live rule\-based information extraction systems\!\. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), pp\. 827–832\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\) BERT: pre\-training of deep bidirectional transformers for language understanding\. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\), pp\. 4171–4186\. External Links: [Document](https://dx.doi.org/10.18653/v1/N19-1423), [Link](https://aclanthology.org/N19-1423/)\.
- K\. He and Z\. Chen \(2025\) From reasoning to learning: a survey on hypothesis discovery and rule learning with large language models\. Transactions on Machine Learning Research \(TMLR\)\. Note: arXiv:2505\.21935\. External Links: [Link](https://arxiv.org/abs/2505.21935)\.
- C\. Hsieh, C\. Li, C\. Yeh, H\. Nakhost, Y\. Fujii, A\. Ratner, R\. Krishna, C\. Lee, and T\. Pfister \(2023\) Distilling step\-by\-step\! outperforming larger language models with less training data and smaller model sizes\. In Findings of the 61st Annual Meeting of the Association for Computational Linguistics \(ACL\), pp\. 8003–8017\.
- Á\. Kovács, K\. Gémes, E\. Iklódi, and G\. Recski \(2022\) POTATO: exPlainable infOrmation exTrAcTiOn framework\. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management \(CIKM\), pp\. 4897–4901\. External Links: [Document](https://dx.doi.org/10.1145/3511808.3557196)\.
- P\. Lertvittayakumjorn, L\. Choshen, E\.
Shnarch, and F\. Toni \(2022\) GrASP: a library for extracting and exploring human\-interpretable textual patterns\. In Proceedings of the 13th Language Resources and Evaluation Conference \(LREC\), pp\. 6093–6103\.
- Y\. Li, S\. Li, Z\. Xu, J\. Cao, Z\. Chen, Y\. Hu, H\. Chen, and S\. Cheung \(2021\) TransRegex: multi\-modal regular expression synthesis by generate\-and\-repair\. In Proceedings of the 43rd International Conference on Software Engineering \(ICSE\), pp\. 1210–1222\.
- Z\. C\. Lipton \(2018\) The mythos of model interpretability\. Queue 16\(3\), pp\. 31–57\. External Links: [Document](https://dx.doi.org/10.1145/3236386.3241340)\.
- N\. Locascio, K\. Narasimhan, E\. DeLeon, N\. Kushman, and R\. Barzilay \(2016\) Neural generation of regular expressions from natural language with minimal domain knowledge\. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), pp\. 1918–1923\.
- J\. Nivre et al\. \(2018\) Universal dependencies 2\.3\. Note: LINDAT/CLARIAH\-CZ digital library at the Institute of Formal and Applied Linguistics\. External Links: [Link](http://hdl.handle.net/11234/1-2895)\.
- I\. Pilán, P\. Lison, L\. Øvrelid, A\. Papadopoulou, D\. Sánchez, and M\. Batet \(2022\) The text anonymization benchmark \(TAB\): a dedicated corpus and evaluation framework for text anonymization\. Computational Linguistics 48\(4\), pp\. 1053–1101\.
- A\. Ratner, S\. H\. Bach, H\. Ehrenberg, J\. Fries, S\. Wu, and C\.
Ré \(2017\) Snorkel: rapid training data creation with weak supervision\. Proceedings of the VLDB Endowment 11\(3\), pp\. 269–282\.
- P\. Sen, Y\. Li, E\. Kandogan, Y\. Yang, and W\. Lasecki \(2019\) HEIDL: learning linguistic expressions with deep learning and human\-in\-the\-loop\. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations \(ACL\), pp\. 135–140\.
- E\. Shnarch, R\. Levy, V\. Raykar, and N\. Slonim \(2017\) GrASP: rich patterns for argumentation mining\. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), pp\. 1345–1350\.
- S\. Sivasothy, S\. Barnett, R\. Logothetis, M\. Abdelrazek, Z\. Rasool, S\. Thudumu, and Z\. Brannelly \(2024\) Large language models for generating rules, yay or nay?\. arXiv preprint arXiv:2406\.06835\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2406.06835), [Link](https://arxiv.org/abs/2406.06835)\.
- R\. Smith, J\. A\. Fries, B\. Hancock, and S\. H\. Bach \(2024\) Language models in the loop: incorporating prompting into weak supervision\. ACM/IMS Journal of Data Science 1\(2\), pp\. 1–30\. External Links: [Document](https://dx.doi.org/10.1145/3617130)\.
- P\. M\. Stahl \(2019\) Grex – a command\-line tool and rust library for generating regular expressions from user\-provided test cases\. External Links: [Link](https://github.com/pemistahl/grex)\.
- M\. A\. Valenzuela\-Escárcega, G\. Hahn\-Powell, and D\.
Bell \(2020\) Odinson: a fast rule\-based information extraction framework\. In Proceedings of the 12th Language Resources and Evaluation Conference \(LREC\), pp\. 2183–2191\.
- P\. Varma and C\. Ré \(2018\) Snuba: automating weak supervision to label training data\. Proceedings of the VLDB Endowment 12\(3\), pp\. 223–236\.
- R\. Wang, E\. Zelikman, G\. Poesia, Y\. Pu, N\. Haber, and N\. D\. Goodman \(2024\) Hypothesis search: inductive reasoning with language models\. In The Twelfth International Conference on Learning Representations \(ICLR\), Note: arXiv:2309\.05660\.
- P\. West, C\. Bhagavatula, J\. Hessel, J\. D\. Hwang, L\. Jiang, R\. Le Bras, X\. Lu, S\. Welleck, and Y\. Choi \(2022\) Symbolic knowledge distillation: from general language models to commonsense models\. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp\. 4602–4625\. External Links: [Document](https://dx.doi.org/10.18653/v1/2022.naacl-main.341), [Link](https://aclanthology.org/2022.naacl-main.341/)\.
- E\. B\. Wilson \(1927\) Probable inference, the law of succession, and statistical inference\. Journal of the American Statistical Association 22\(158\), pp\. 209–212\.
- P\. Yu and S\. H\. Bach \(2023\) Alfred: a system for prompted weak supervision\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics: System Demonstrations \(ACL\), pp\. 479–488\.
- U\. Zaratiana, G\. Pasternak, O\. Boyd, G\. Hurn\-Maloney, and A\.
Lewis \(2025\) GLiNER2: schema\-driven multi\-task learning for structured information extraction\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp\. 130–140\. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-demos.10), [Link](https://aclanthology.org/2025.emnlp-demos.10/)\.
- U\. Zaratiana, N\. Tomeh, P\. Holat, and T\. Charnois \(2024\) GLiNER: generalist model for named entity recognition using bidirectional transformer\. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(NAACL\-HLT\), pp\. 5364–5376\.
- J\. Zhang, T\. Bui, S\. Yoon, X\. Chen, Z\. Liu, C\. Xia, Q\. H\. Tran, W\. Chang, and P\. Yu \(2021\) Few\-shot intent detection via contrastive pre\-training and fine\-tuning\. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), pp\. 1906–1912\.
- Z\. Zhong, J\. Guo, W\. Yang, J\. Peng, T\. Xie, J\. Lou, T\. Liu, and D\. Zhang \(2018\) SemRegex: a semantics\-based approach for generating regular expressions from natural language specifications\. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), pp\. 1608–1618\.
- W\. Zhou, S\. Zhang, Y\. Gu, M\. Chen, and H\.
Poon \(2024\) UniversalNER: targeted distillation from large language models for open named entity recognition\. In Proceedings of the 12th International Conference on Learning Representations \(ICLR\), Note: arXiv:2308\.03279\. External Links: [Link](https://openreview.net/forum?id=r65xfUb76p)\.

## Appendix A Prompt Templates

RuleChef uses large language models exclusively at learning time, issuing calls through eleven distinct prompt templates that span every phase of rule synthesis, refinement, agentic coordination, and observation\. The prompts fall into four functional groups\.

### A\.1 Learning prompts

The *synthesis prompt* \(Figure [3](https://arxiv.org/html/2607.01293#A1.F3)\) is the primary learning call: it is issued once at the start of a run to generate an initial rule set from the full training dataset\. For multi\-class tasks, the *per\-class synthesis prompt* \(Figure [4](https://arxiv.org/html/2607.01293#A1.F4)\) is used instead, producing one focused call per label with its positive examples and counter\-examples from other classes, preventing cross\-class interference\. The *patch prompt* \(Figure [5](https://arxiv.org/html/2607.01293#A1.F5)\) is issued once per refinement iteration to incrementally update the rule set by fixing observed failures; it receives the current rule set with per\-rule metrics alongside a clustered, sampled set of errors\. Six variants of the patch prompt \(from richest to most compact\) are pre\-built; RuleChef selects the longest one fitting the context window\.
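The variant selection described above, picking the longest pre\-built patch prompt that still fits the context window, can be sketched as follows\. This is a minimal illustration and not RuleChef’s implementation: the function names select\_variant and count\_tokens are hypothetical, and whitespace splitting is a crude stand\-in for a real tokenizer\.

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer: count whitespace-separated words."""
    return len(text.split())


def select_variant(variants, context_budget, counter=count_tokens):
    """Return the longest prompt variant whose token count fits the budget.

    variants: pre-built prompt strings, from richest to most compact.
    context_budget: maximum number of tokens available for the prompt.
    """
    fitting = [v for v in variants if counter(v) <= context_budget]
    if not fitting:
        raise ValueError("no prompt variant fits the context budget")
    return max(fitting, key=counter)


# Toy variants that differ in how many failure snippets they carry.
variants = [
    "rules evidence " + "failure " * 20,  # 22 tokens
    "rules evidence " + "failure " * 10,  # 12 tokens
    "rules " + "failure " * 5,            # 6 tokens
]

chosen = select_variant(variants, context_budget=15)
print(count_tokens(chosen))  # 12
```

Raising an error when nothing fits is one possible design; a real system might instead truncate the most compact variant\.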
Task: \{task\.name\}
Description: \{task\.description\}
Input schema: \{task\.input\_schema\}
Output schema: \{task\.output\_schema\}
\[AVAILABLE ENTITY TYPES: \{labels\} \#\# only for NER/classification tasks
Rules MUST use one of these types in output\_template\.\]
CORRECTIONS \(Learn from failures \- \{n\} shown\): \#\# omitted when dataset has no corrections
Input: \{json\(correction\.input\)\}
Got \(WRONG\): \{json\(correction\.model\_output\)\}
Expected \(CORRECT\): \{json\(correction\.expected\_output\)\}
\[Feedback: \{correction\.feedback\}\] \#\# included only when per\-correction feedback is set
TRAINING EXAMPLES \(\{n\} shown\):
Input: \{json\(example\.input\)\}
Output: \{json\(example\.expected\_output\)\}
\{data\_evidence\} \#\# task\-type\-specific section: entity counts, grex\-derived regex hints,
\#\# class distributions; omitted when evidence is empty
\[USER FEEDBACK: \#\# optional; omitted when no feedback is attached
\- \{feedback\_item\}\]
\[EXISTING RULES \(\{n\} current\): \#\# optional; included on re\-synthesis when rules exist
\- \{rule\.name\} \(priority \{rule\.priority\}, success: \{success\_rate\}\)
Format: \{rule\.format\}
Pattern: \{rule\.content\[:100\]\}\.\.\.
Confidence: \{rule\.confidence\}
CONSIDER:
\- Refine existing high\-performing rules
\- Fix or replace low\-performing rules
\- Keep rules that work well
\- Add new rules for uncovered patterns\]
YOUR TASK:
\{Update and refine \| Synthesize a complete\} rule set \(max \{max\_rules\} rules\) that:
1\. Handles all corrections correctly \(CRITICAL \- these show failure modes\)
2\. Works on all examples
3\. Respects user feedback
4\. Is general and minimal \(avoid redundant rules\)
WHAT MAKES A GOOD RULE:
\- PRECISION OVER RECALL: A rule that matches 10 things correctly beats one that matches 100 with 20 wrong\. Never sacrifice precision for recall\. Missing a match is fixable later; a wrong match poisons results\.
\- GENERALIZE, DON’T MEMORIZE: Rules run on unseen text\. Match the \*structure\* of values, not the exact strings you see in training data\.
\- USE CONTEXT: The same string can mean different things\. Match surrounding words/punctuation and use capture groups \($1\) to extract just the value\.
\- ONE RULE, ONE PATTERN: Multiple focused rules beat one giant regex\. Each rule should have a clear, testable reason to exist\.
\- THINK ADVERSARIALLY: Before writing a rule, ask "what else could this match?" If the answer isn’t empty, narrow the pattern or add context\.
RULES CAN BE:
\[\- Regex patterns \(for structured extraction\)\] \#\# each line included only if the format is allowed
\[\- Python code \(for complex logic\)\]
\[\- spaCy token matcher patterns \(for linguistic/NLP patterns\)\]
IMPORTANT: You must ONLY use the allowed formats listed above\.
\[IMPORTANT: For CODE rules, write standard multi\-line Python functions with proper indentation\.\]
\[IMPORTANT: spaCy rules must be valid JSON arrays of token dicts \(spaCy Matcher format\)\.\]
\[IMPORTANT: spaCy NER is disabled\. Do NOT use ENT\_TYPE or ENT\_ID in spaCy patterns\.\]
\[REGEX SYNTAX REFERENCE: \#\# included when REGEX format is allowed
\- \\b word boundaries, \(?:\.\.\.\) non\-capturing groups, \(a\|b\|c\) alternation
\- \[A\-Z\] \[a\-z\] \\d \\s character classes, \+ \* ? \{n,m\} quantifiers
\- \(?=\.\.\.\) lookahead, \(?<=\.\.\.\) lookbehind \-\- match context without consuming it
CAPTURE GROUPS \($1\) vs FULL MATCH \($0\): Use $0 when the entire match IS the value; use $1 when context is needed\. With "text": "$1", $start/$end auto\-adjust to the group position\.
GOOD vs BAD PATTERNS:
BAD: "\\d\+" \-\- matches every number; most will be wrong type
GOOD: "\(\\d\+\)\\s\*\(?:kg\|lbs\|oz\)" \-\- only numbers followed by weight units
BAD: "Alice\|Bob\|Tokyo" \-\- memorizes training strings, misses unseen values
BAD: "\[A\-Z\]\[a\-z\]\+" \-\- matches any capitalized word \(too broad\)
GOOD: "\(?:in\|from\|at\|near\)\\s\+\(\[A\-Z\]\[a\-z\]\+\(?:\\s\+\[A\-Z\]\[a\-z\]\+\)\*\)" \-\- locations after prepositions
Prefer structural patterns over memorized alternations\.\]
\[SPACY TECHNIQUES: \#\# included when SPACY format is allowed
\- Use POS tags \(PROPN, NOUN, VERB\), DEP labels \(nsubj, dobj, pobj\), SHAPE for structural patterns\.
\- Use IN for value sets; OP for quantifiers \("?" optional, "\+" one or more\)\.\]
\(\.\.\.\) \#\# instructions on return format omitted due to space constraints
\[\{code\_or\_spacy\_format\_examples\}\] \#\# task\-type\-specific format examples, when applicable
Focus on learning from CORRECTIONS \- they show exactly what went wrong\!
IMPORTANT: Return ONLY valid JSON\. Ensure:
\- All strings use double quotes and are properly escaped
\- All braces and brackets are balanced
\- No trailing commas
\- Response is complete \(not truncated\)

Figure 3: Rule synthesis prompt\. Issued once at the start of a learning run to generate an initial rule set from the full training dataset\. Optional sections \(shown in square brackets\) are included only when the corresponding data or configuration is present\. Variable slots are shown as \{name\}; lines beginning with \#\# are annotations in this figure, not part of the prompt\. The data evidence section is filled with task\-type\-specific pattern statistics and grex\-derived regex hints\.

Task: \{task\.name\}
Description: \{task\.description\}
Input schema: \{task\.input\_schema\}
Output schema: \{task\.output\_schema\}
\[AVAILABLE ENTITY TYPES: \{labels\}
Rules MUST use one of these types in output\_template\.\]
FOCUS: Generate rules for class ’\{target\_class\}’ ONLY\.
\#\# \(or: "to extract the ’\{target\_class\}’ field" for TRANSFORMATION tasks\)
POSITIVE EXAMPLES for ’\{target\_class\}’ \(\{n\} total\):
Input: \{json\(example\.input\)\}
Output: \{json\(example\.expected\_output\)\}
\[COUNTER\-EXAMPLES \(these are NOT ’\{target\_class\}’ \-\- your rules must NOT match these\):
\#\# omitted for TRANSFORMATION tasks
Input: \{json\(counter\_example\.input\)\}
Label: \{json\(counter\_example\.expected\_output\)\}\]
\{data\_evidence\}
\{format\_instructions\}
WHAT MAKES A GOOD RULE: \#\# RULE\_QUALITY\_GUIDE constant \(full text in Figure 3\)
\[\.\.\.\]
INSTRUCTIONS:
\- Generate up to \{max\_rules\} rules that match examples of ’\{target\_class\}’\.
\- Rules should generalize to unseen text, not just memorize the examples shown\.
\- Use structural patterns with word boundaries that generalize to unseen text\.
\[\- Rules must NOT match the counter\-examples shown above\.\] \#\# omitted for TRANSFORMATION
\{response\_schema\}
\{format\_examples\}
Focus on learning from CORRECTIONS \- they show exactly what went wrong\!
IMPORTANT: Return ONLY valid JSON\. Ensure:
\- All strings use double quotes and are properly escaped
\- All braces and brackets are balanced
\- No trailing commas
\- Response is complete \(not truncated\)

Figure 4: Per\-class synthesis prompt\. Used in place of the synthesis prompt for multi\-class tasks: one call is issued per class label, showing only that class’s positive examples and counter\-examples from other classes\. Prevents cross\-class interference in the generated rules\. Not used for binary or untyped extraction tasks\.

You are updating an existing rule\-based extractor\. Do NOT rewrite good rules; add or adjust only what is needed\.
Task: \{task\.name\}
Description: \{task\.description\}
Input schema: \{task\.input\_schema\}
Output schema: \{task\.output\_schema\}
\[AVAILABLE ENTITY TYPES: \{labels\}
Rules MUST use one of these types in output\_template\.\]
\{data\_evidence\} \#\# omitted in compact variants to save tokens
\[USER GUIDANCE \(task\-level feedback\):
\- \{feedback\_item\}\]
\[COORDINATOR GUIDANCE \(prioritize this\):
\{guidance\}\] \#\# injected by AgenticCoordinator\.guide\_refinement\(\)
\[PER\-CLASS METRICS \(current performance\):
\{class\}: P=\{p\} R=\{r\} F1=\{f1\} \(TP=\{tp\} FP=\{fp\} FN=\{fn\}\)\]
CURRENT RULES \(full details, note any user\_feedback on specific rules\):
\{json\(rules\_detail\)\} \#\# each entry: name, format, pattern, metrics, user\_feedback
\[OTHER CURRENT RULES \(compact index; avoid duplicating these names/labels\):
\{json\(compact\_rules\)\}\] \#\# non\-relevant rules shown compact in the "relevant" variant
\{failure\_summary\} \#\# mode\-based count of all failures, not just the sampled set below
FAILURES TO FIX \(sampled, corrections are high priority\):
\{json\(failure\_snippets\)\} \#\# up to 20/10/5 items depending on variant; corrections: is\_correction=true
\[FALSE POSITIVES \(rules are incorrectly matching these \-\- tighten the responsible rules\):
Predicted "\{predicted\_text\}" as \{predicted\_type\} \-\- \{not an entity \| should be \{correct\_type\}\}\]
Instructions:
\- Add, tweak, or DELETE rules to fix the shown failures and reduce false positives\.
\- Pay close attention to user\_feedback on rules AND task\-level USER GUIDANCE \-\- these are direct instructions from the user and MUST be addressed even if there are no failures\.
\- If a rule has user\_feedback, modify or replace that rule to address the feedback\.
\- IMPORTANT: When updating an existing rule, you MUST reuse the EXACT same "name" as the original\. Do NOT add suffixes like "\_fixed", "\_v2", "\_updated", etc\. The merge system uses name\-matching to replace the old version \-\- a different name creates a duplicate instead of replacing\.
\- If a rule is fundamentally too broad \(FP \>\> TP\) and you’re providing better, narrower replacements, list the old rule’s exact name in "deleted\_rules"\. Only delete if you’re providing replacements in "rules"\.
\- If a rule has high false positives, TIGHTEN its pattern or DELETE it and add narrower replacements\. Adding context or narrowing the match is better than piling on new rules\.
\- Use structural patterns with word boundaries that generalize to unseen text\.
\- Keep total new/updated rules <= \{max\_rules\}\.
\- Use formats: \{allowed\_formats\}
\- Avoid touching unrelated behaviors\.
WHAT MAKES A GOOD RULE: \#\# RULE\_QUALITY\_GUIDE constant \(full text in Figure 3\)
\[\.\.\.\]
\{format\_instructions\}
\{response\_schema\}

Figure 5: Patch prompt\. Issued once per refinement iteration to incrementally fix failures\. Six variants are pre\-built and differ in the number of failure snippets shown \(20, 10, or 5\), whether the data evidence section is included, and whether non\-relevant rules are shown in compact or full form\. RuleChef selects the longest variant that fits the context window\. Lines beginning with \#\# are annotations in this figure only\.

### A\.2 Agentic coordinator prompts

When the agentic coordinator is enabled, four additional prompts drive autonomous decision\-making\.
After each refinement iteration the *refinement coordinator prompt* \(Figure [6](https://arxiv.org/html/2607.01293#A1.F6)\) analyses per\-class metrics and returns strategic guidance for the next patch call\. Periodically, the *rule critic prompt* \(Figure [8](https://arxiv.org/html/2607.01293#A1.F8)\) performs a holistic review of the rule set and produces actionable per\-rule feedback, which re\-enters the system through the standard feedback interface and is visible in subsequent patch prompts\. The *rule auditor prompt* \(Figure [7](https://arxiv.org/html/2607.01293#A1.F7)\) compares rules for overlap or redundancy and proposes merge, remove, or tighten actions; any structural change is verified on the held\-out split before being accepted\. When RuleChef is configured to learn from a live data stream, the *learning trigger decision prompt* \(Figure [9](https://arxiv.org/html/2607.01293#A1.F9)\) decides whether accumulated new examples warrant starting a retraining loop\.

You are the Refinement Coordinator for a rule learning system\. After each refinement iteration, you analyze per\-class performance and guide the next patch\.
ITERATION: \{iteration\+1\}/\{max\_iterations\}
OVERALL: accuracy=\{exact\_match\}, micro\_F1=\{micro\_f1\}, macro\_F1=\{macro\_f1\}
PER\-CLASS PERFORMANCE \(sorted worst to best\):
\{class\}: F1=\{f1\} P=\{p\} R=\{r\} \(TP=\{tp\} FP=\{fp\} FN=\{fn\}\)
\[\.\.\.\]
Return JSON:
\{
"focus\_classes": \["class names needing most improvement"\],
"guidance": "Specific advice for the rule generator \-\- which classes to prioritize, what patterns to try, what to avoid\. 2\-3 sentences max\.",
"should\_continue": boolean
\}

Figure 6: Refinement coordinator prompt\. Issued after each iteration when the agentic coordinator is enabled\. The returned guidance string is injected verbatim into the next patch prompt as coordinator guidance; should\_continue = false terminates the refinement loop early\.

You are a Rule Auditor for a rule\-based extraction/classification system\.
Analyze these \{n\} rules and their per\-rule metrics\.
RULES:
\{json\(rule\_entries\)\} \#\# each entry: id, name, format, pattern\[:300\], priority,
\#\# output\_key, output\_template, precision, recall, F1, TP, FP
Your job is to CONSOLIDATE and CLEAN the rule set\.
ACTIONS \(in priority order\):
1\. MERGE: Two\+ rules with similar/overlapping patterns targeting the same output/label\. Combine into one rule\. Only merge rules of the same format and same output\_template/output\_key\.
2\. REMOVE rules that hurt more than they help:
\- precision=0 AND matches\>0 \(pure noise \-\- every match is wrong\)
\- false\_positives \> 2x true\_positives \(rule causes more harm than good\)
\- Memorized exact strings from training data that won’t generalize
3\. TIGHTEN: If a rule has high FP, return it as a merge\-with\-self \-\- same rule\_id but narrower pattern\.
IMPORTANT \-\- do NOT remove:
\- The only rule for a class/label \-\- even if it looks weak, tighten it instead
\- Rules with 0 matches \-\- the training set may be small, they could help on unseen data
LOOK FOR:
\- Near\-duplicate rules \(same type, similar regex\) \-\- merge them
\- Overly broad patterns \(like bare \\d\+ or \[A\-Z\]\[a\-z\]\+\) with high FP \-\- tighten or remove
\- Rules that are subsets of other rules \(one pattern already covered by another\)
Return JSON:
\{
"analysis": "Brief summary \(1\-2 sentences\)",
"actions": \[
\{"action": "merge", "rule\_ids": \["id1", "id2"\], "merged\_pattern": "new regex", "merged\_name": "Combined rule name", "reason": "why"\},
\{"action": "remove", "rule\_ids": \["id"\], "reason": "why"\}
\]
\}
Return \{"analysis": "All rules are useful", "actions": \[\]\} if no changes needed\.

Figure 7: Rule auditor prompt\. Issued periodically by the agentic coordinator to merge redundant rules, remove pure\-noise rules, and tighten over\-broad patterns\. Any structural change proposed here is applied and then re\-evaluated on the held\-out split; changes that lower F1 are reverted\. Only fires when there are at least two rules\.

You are an expert Rule Critic acting as a human domain expert\. You are reviewing a rule\-based \{task\.type\} system and providing actionable feedback\.
TASK: \{task\.name\}
\{task\.description\}
Type: \{task\.type\}
Input schema: \{json\(task\.input\_schema\)\}
Output schema: \{json\(task\.output\_schema\)\}
OVERALL PERFORMANCE:
micro F1=\{micro\_f1\}, P=\{micro\_precision\}, R=\{micro\_recall\}
exact\_match=\{exact\_match\} \(\{total\_docs\} documents\)
PER\-CLASS PERFORMANCE \(sorted worst\-first\):
\{class\}: P=\{p\} R=\{r\} F1=\{f1\} \(TP=\{tp\} FP=\{fp\} FN=\{fn\}\)
RULES \(\{n\} total\):
Rule: "\{rule\.name\}" \(id=\{rule\.id\}\)
Format: \{rule\.format\}, Priority: \{rule\.priority\}
Pattern: \{rule\.content\}
\[Output template: \{json\(rule\.output\_template\)\}\]
\[Output key: \{rule\.output\_key\}\]
Metrics: P=\{p\} R=\{r\} F1=\{f1\} \(TP=\{tp\} FP=\{fp\}, \{total\} total matches\)
\[FP examples from this rule:
Input: "\{input\_context\[:150\]\}"
Rule produced: \{json\(rule\_output\[:3\]\)\}
Expected: \{json\(expected\[:3\]\)\}\]
\[FALSE POSITIVES \(system\-level, \{n\} examples\):
Predicted "\{predicted\_text\}" as \{predicted\_type\} \-\- \{not an entity \| should be \{correct\_type\}\} \(context: "\{context\[:80\]\}"\)\]
\[MISSED ENTITIES \(sample documents with errors\):
Input: "\{input\_text\[:150\]\}"
Expected: \{json\(expected\)\}
Got: \{json\(got\)\}\]
ANALYZE HOLISTICALLY:
1\. Which rules cause the most harm and WHY? Show your reasoning\.
2\. Are there inter\-class conflicts? \(same text matched by rules for different types\)
3\. Are priority assignments correct? \(higher priority runs first, wins conflicts\)
4\. What patterns are MISSING for classes with low recall?
5\. What would a human regex expert change about these patterns?
PROVIDE FEEDBACK:
\- rule\_feedback: For EACH problematic rule, provide SPECIFIC, ACTIONABLE advice\.
Bad: "This rule is too broad" \(vague\)
Good: "Narrow \\d\+ by adding word\-boundary context: use \(\\d\+\)\\s\*\(?:million\|billion\) for large numbers, and let MONEY/PERCENT rules handle $\-prefixed numbers by higher priority"
\- task\_guidance: Strategic advice about the ENTIRE rule set\.
Return JSON:
\{
"analysis": "1\-2 sentence summary of the main issues",
"rule\_feedback": \{
"\{rule\_id\}": "Specific actionable advice\.\.\."
  \},
  "task\_guidance": "Strategic guidance about the full ruleset\.\.\."
\}

Figure 8: Rule critic prompt\. Issued periodically by the agentic coordinator to perform holistic expert review of the full rule set\. The returned rule\_feedback entries are attached to the named rules as feedback items; task\_guidance is stored as task\-level feedback\. Both re\-enter the system through the same interface as human feedback and are visible in subsequent patch prompts\.

You are the Coordinator for a rule learning system\. Decide if we should trigger a retraining loop NOW based on new data\.

STATUS:
\- New examples: \{n\_examples\}
\- New corrections \(high priority\): \{n\_corrections\}
\- Current rules: \{n\_rules\}

NEW DATA SAMPLES \(up to 10\):
\- \[CORRECTION\] Input: \{json\(ex\.input\)\} \-\> Output: \{json\(ex\.output\)\}
\- \[EXAMPLE\] Input: \{json\(ex\.input\)\} \-\> Output: \{json\(ex\.output\)\}

DECISION CRITERIA:
1\. TRIGGER if we have corrections \(users fixing mistakes\)\.
2\. TRIGGER if we have a significant batch of new examples \(5\+\)\.
3\. WAIT if data looks sparse or redundant\.

STRATEGIES:
\- ’balanced’: Standard mix \(default\)
\- ’corrections\_first’: If we have corrections
\- ’diversity’: If we have many similar examples
\- ’uncertain’: If examples look ambiguous

Return JSON:
\{
  "should\_learn": boolean,
  "strategy": "balanced" \| "corrections\_first" \| "diversity" \| "uncertain",
  "max\_iterations": integer \(1\-3\),
  "reasoning": "Short explanation"
\}

Figure 9: Learning trigger decision prompt\. Issued when RuleChef operates in streaming or batch\-wise mode and a new buffer of data has accumulated\. The coordinator decides whether to start a retraining loop immediately or wait for more data, and selects the training strategy and iteration budget\.

### A\.3 Observation mode prompts

In observation mode, where RuleChef bootstraps rules from an existing LLM’s input\-output pairs rather than from manually labelled data, two prompts are used\.
The *task discovery prompt* \(Figure [10](https://arxiv.org/html/2607.01293#A1.F10)\) infers the task schema \(type, input/output fields\) from a sample of raw API call logs\. The *observation mapping prompt* \(Figure [11](https://arxiv.org/html/2607.01293#A1.F11)\) then maps each batch of up to ten raw observations to the discovered schema, filtering irrelevant calls\.

You are analyzing LLM API calls to discover the underlying task pattern\.

Here are \{n\} sample LLM interactions:
\{obs\_text\} \#\# formatted raw API logs:
\#\# system prompt, user message, assistant response

Based on these interactions, identify:
1\. What task is being performed? \(name and description\)
2\. What type of task? Choose ONE: extraction, ner, classification, transformation
\- extraction: finding text spans \(untyped\)
\- ner: finding typed entities with labels
\- classification: assigning a label to input text
\- transformation: extracting structured fields from text
3\. What are the input fields and their types?
4\. What are the output fields and their types?
5\. Which input field contains the main text?

Return ONLY valid JSON:
\{
  "name": "task\_name",
  "description": "one sentence description",
  "type": "classification",
  "input\_schema": \{"text": "str"\},
  "output\_schema": \{"label": "str"\},
  "text\_field": "text"
\}

Figure 10: Task discovery prompt\. Issued once in observation mode when no task schema is provided upfront\. Requires a minimum of accumulated raw observations \(default: 5\) before firing\. The discovered schema is used for all subsequent observation mapping calls\.

You are extracting structured training data from LLM interactions\.

Task definition:
Name: \{task\.name\}
Description: \{task\.description\}
Type: \{task\.type\}
Input schema: \{json\(task\.input\_schema\)\}
Output schema: \{json\(task\.output\_schema\)\}

Here are \{n\} LLM interactions to analyze:
\{obs\_text\} \#\# up to 10 raw API call logs per batch

For EACH interaction, determine:
1\. Is it relevant to the task above? \(relevant: true/false\)
2\. If relevant, extract the input \(matching input\_schema keys\) and output \(matching output\_schema keys\)\.
Return ONLY a JSON array with exactly \{n\} objects:
\[
  \{"relevant": true, "input": \{\.\.\.\}, "output": \{\.\.\.\}\},
  \{"relevant": false, "input": null, "output": null\},
  \.\.\.
\]

Figure 11: Observation mapping prompt\. Issued once per batch of up to ten raw API call logs\. Filters irrelevant calls and extracts structured input\-output pairs matching the task schema\. Mapped examples are added to the training buffer and consumed by subsequent synthesis or patch calls\.

### A\.4 Utility prompts

Two lightweight prompts cover auxiliary functions\. The *synthetic example generation prompt* \(Figure [12](https://arxiv.org/html/2607.01293#A1.F12)\) generates a single realistic training input on demand, used to augment sparse training sets\. The *LLM fallback prompt* \(Figure [13](https://arxiv.org/html/2607.01293#A1.F13)\) is the only prompt active at inference time: it is issued as a last resort when no rule fires and the executor is configured to delegate to an LLM rather than abstain\.

Generate a realistic training example for this task:
Task: \{task\.name\}
Description: \{task\.description\}
Input schema: \{task\.input\_schema\}
Return JSON with input fields only\.
Example \#\{seed\+1\}:

Figure 12: Synthetic example generation prompt, generates a single realistic training input on demand\. The seed parameter drives diversity across calls\.

Task: \{task\.name\}
Description: \{task\.description\}
Input: \{json\(input\_data\)\}
Output schema: \{task\.output\_schema\}
Return ONLY valid JSON matching the output schema, no explanation\.

Figure 13: LLM fallback execution prompt: the only prompt active at inference time, issued only when no rule fires and the executor is not configured to abstain\.
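The inference-time behaviour described for the fallback prompt can be sketched as follows. This is a minimal illustration for a classification task, not RuleChef's actual API: the `Rule` class, `execute`, `build_fallback_prompt`, the example task, and `fake_llm` are all hypothetical names introduced here. It assumes that rules are regular expressions, that the highest-priority matching rule wins (as the critic prompt states), and that an unparseable LLM reply is treated as an abstention.

```python
import json
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Rule:
    """Hypothetical regex rule for a classification task."""
    name: str
    pattern: str
    label: str
    priority: int = 0


def build_fallback_prompt(task: dict, input_data: dict) -> str:
    """Fill in the Figure 13 template for a single input."""
    return (
        f"Task: {task['name']}\n"
        f"Description: {task['description']}\n"
        f"Input: {json.dumps(input_data)}\n"
        f"Output schema: {task['output_schema']}\n"
        "Return ONLY valid JSON matching the output schema, no explanation."
    )


def execute(
    rules: List[Rule],
    task: dict,
    input_data: dict,
    llm: Optional[Callable[[str], str]] = None,
) -> Optional[dict]:
    """Try rules in descending priority; otherwise fall back or abstain."""
    text = input_data["text"]
    for rule in sorted(rules, key=lambda r: r.priority, reverse=True):
        if re.search(rule.pattern, text):
            return {"label": rule.label, "source": f"rule:{rule.name}"}
    if llm is None:
        return None  # no rule fired and the executor is set to abstain
    try:
        output = json.loads(llm(build_fallback_prompt(task, input_data)))
    except json.JSONDecodeError:
        return None  # assumption: an unparseable reply counts as abstaining
    if not isinstance(output, dict):
        return None  # assumption: only JSON objects are accepted
    return {**output, "source": "llm_fallback"}


# Hypothetical task and rules, for illustration only.
TASK = {
    "name": "ticket_routing",
    "description": "Route a support ticket to a team.",
    "output_schema": {"label": "str"},
}
RULES = [
    Rule("refund", r"\brefund\b", "billing", priority=2),
    Rule("generic_money", r"\$\d+", "sales", priority=1),
]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM client; always returns the same JSON."""
    return '{"label": "other"}'


print(execute(RULES, TASK, {"text": "I want a refund of $20"}))
print(execute(RULES, TASK, {"text": "Hello there"}))
print(execute(RULES, TASK, {"text": "Hello there"}, llm=fake_llm))
```

In this sketch the first call is answered by the higher-priority rule even though both patterns match, the second abstains because no rule fires and no LLM is configured, and only the third reaches the fallback prompt. Tagging each result with its source keeps LLM-produced outputs distinguishable from deterministic rule outputs.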