Counterexample Guided Learning in the Large using Reasoning Agents

arXiv cs.LG 06/11/26, 04:00 AM Papers
Summary
This paper proposes using counterexample-guided learning for LLMs to perform regular-expression induction, where a verifier provides counterexamples to refine candidate expressions. The method significantly improves sample efficiency and success rates on challenging tasks, demonstrating that LLMs can benefit from structured feedback beyond treating it as additional data.
arXiv:2606.11521v1 Announce Type: new Abstract: LLMs and LLM agents should improve when given feedback, but identifying when they are able to do so is difficult: feedback is heterogeneous, domain-specific, and difficult to control. We approach this challenge by asking LLMs to perform regular-expression induction, a classical symbolic learning problem where precise mechanisms for feedback exist in the form of counterexamples. In counterexample-guided learning, a learner (LLM) proposes candidate regular expressions from positive/negative-labeled strings, and the teacher (verifier) returns counterexamples showcasing the difference between the candidate and target languages. We identify novel counterexample-guided refinement strategies that enable effective regex learning, such as regularization and symbolic counterexample clusters. We also explore agentic strategies such as reflection and repair loops. Empirically, we find that verifier feedback substantially improves sample efficiency on challenging regex-induction tasks, reducing the number of labeled examples required and enabling learning of complex target expressions where standard prompting fails. For example, on the hardest task groups, our counterexample-guided framework improves success from 3.2% to 38.1% and from 38.9% to 74.1% on two different regex domains. These results suggest that LLMs can benefit from rich feedback beyond treating it as additional data, opening the door for robust verifier-guided methods for LLM-based program synthesis and formal reasoning.
Original Article
View Cached Full Text
Cached at: 06/11/26, 01:48 PM
# Counterexample Guided Learning in the Large using Reasoning Agents
Source: [https://arxiv.org/html/2606.11521](https://arxiv.org/html/2606.11521)
Frederic SalaDepartment of Computer Sciences, University of Wisconsin\-MadisonThomas RepsDepartment of Computer Sciences, University of Wisconsin\-MadisonAdithya MuraliDepartment of Computer Sciences, University of Wisconsin\-Madison

###### Abstract

LLMs and LLM agents should improve when given feedback, but identifying when they are able to do so is difficult: feedback is heterogeneous, domain\-specific, and difficult to control\. We approach this challenge by asking LLMs to perform regular\-expression induction, a classical symbolic learning problem where precise mechanisms for feedback exist in the form of counterexamples\. In counterexample\-guided learning, a learner \(LLM\) proposes candidate regular expressions from positive/negative\-labeled strings, and the teacher \(verifier\) returns counterexamples showcasing the difference between the candidate and target languages\. We identify novel counterexample\-guided refinement strategies that enable effective regex learning, such as regularization and symbolic counterexample clusters\. We also explore agentic strategies such as reflection and repair loops\. Empirically, we find that verifier feedback substantially improves sample efficiency on challenging regex\-induction tasks, reducing the number of labeled examples required and enabling learning of complex target expressions where standard prompting fails\. For example, on the hardest task groups, our counterexample\-guided framework improves success from3\.2%3\.2\\%to38\.1%38\.1\\%and from38\.9%38\.9\\%to74\.1%74\.1\\%on two different regex domains\. These results suggest that LLMs can benefit from rich feedback beyond treating it as additional data, opening the door for robust verifier\-guided methods for LLM\-based program synthesis and formal reasoning\.

## 1Introduction

Models, agents, and systems are increasingly required to learn and adapt from feedback\. For example, coding agents must respond to errors and unit test failures, tool\-using agents must revise their plans after failed API calls, and personalized assistants need to adapt their responses over time based on user corrections and preferences\. Despite the importance of this capability, measuring how well models use feedback is surprisingly challenging\. Feedback is diverse, heterogeneous, often closely tied to the task or domain, and difficult to control systematically\.

Is there a simple controllable way to measure models and agents’ ability to exploit feedback? We make the case for*regular\-expression induction*as a clean testbed for measuring this capability\. Given a set of labeled strings, the goal is to infer a description of the underlying language in the form of a regular expression by generalizing from the examples\. This task is a canonical problem in symbolic learning, and a highly challenging one: many candidate expressions can fit a small set of labeled strings, and naive in\-context learning can easily overfit to superficial patterns\. The exploration of LLM capabilities in this setting revisits a long line of work in automata learning and program synthesis through a modern lens\.

This problem, and in fact our larger goal, are both not fully addressed by prior art\. Traditional neural sequence models such as RNNs can treat regex recognition as a classification problem, but they do not naturally expose symbolic hypotheses and do not clearly benefit from counterexample\-only learning\. Conversely, LLMs can generate explicit regexes, but current usage is often prompt\-driven and heuristic, with limited structure for how verifier feedback should be incorporated across rounds\. Prior works also largely focus on synthesizing regexes from natural\-language descriptions, as opposed to pure example\-based learningTanget al\.\([2026](https://arxiv.org/html/2606.11521#bib.bib7)\); Siddiqet al\.\([2024](https://arxiv.org/html/2606.11521#bib.bib8)\)\. On the other hand, while counterexample\-guided methods are well established in symbolic synthesisSolar\-Lezama \([2013](https://arxiv.org/html/2606.11521#bib.bib10)\); Abateet al\.\([2018](https://arxiv.org/html/2606.11521#bib.bib9)\); Aluret al\.\([2013](https://arxiv.org/html/2606.11521#bib.bib11)\), it remains unclear how to adapt them to modern LLM\-based inference in a way that is sample\-efficient, stable, and robust to overfitting\. Furthermore, traditional program\-synthesis techniques can stall as the number of examples becomes large\. Our work contributes a much needed synergy between the two kinds of approaches, framing regex induction as an oracle\-guided symbolic generation problem where the model iteratively refines its hypothesis through verifier feedback in the form of \(potentially a large number of\) counterexamples\.

Concretely, in this work we present a counterexample\-guided framework for LLM\-based regex learning\. Our approach uses LLMs and agents to generate candidate regexes from labeled strings, checks them against a target expression, and synthesizes informative counterexamples from the symmetric difference between the predicted and target languages\. We further study \(a\) regularization strategies that bias the model toward shorter, simpler, and more plausible symbolic hypotheses, \(b\) richer variants of counterexamples that cluster several individual counterexamples with common failure information, and \(c\) agentic strategies that use reflection to extract the structural implications of counterexamples and repair loops to iteratively revise failed regex hypotheses\. Empirically, we find that counterexample\-guided learning is especially helpful in the symbolic generation setting: on several challenging regexes, it substantially reduces the number of training examples required and, in some cases, enables successful recovery where standard prompting fails, as shown in Figure[1](https://arxiv.org/html/2606.11521#S1.F1)\.

Our Contributions\.The main contributions of this paper are:

- •An oracle\-guided formulation of LLM\-based regex induction:We formulate regex induction with LLMs as an iterative symbolic learning problem in which candidate regexes are refined using feedback from an oracle in the form of counterexamples\.
- •An active learning methodology for regex induction using LLMs:To build effective LLM\-based active learners for regex induction, we lift the classic teacher\-learner learning framework to use LLM\-based learners, investigating techniques such as regularization to prioritize simpler hypotheses and to cluster counterexamples symbolically\.
- •An agentic workflow for counterexample\-guided learning in the large:We build novel agentic frameworks that learn regular expressions by reasoning about counterexamples across many rounds of learning, using techniques such as agentic reflection and iterative expression repair\.
- •A rigorous evaluation of the dynamics of LLM learning with verifier feedback:We study the performance of several counterexample\-based learners across two datasets for regex induction\. We show that our agentic framework improves learning performance compared to standard prompting from3\.2%3\.2\\%to38\.1%38\.1\\%and38\.9%38\.9\\%to74\.1%74\.1\\%on the hardest instances in the two benchmark suites\. We further show that our technique improves sample efficiency and that these gains persist across various base model families\. We also compare our technique against baselines and perform ablation studies, demonstrating that each component of our agentic framework contributes significantly\.

![Refer to caption](https://arxiv.org/html/2606.11521v1/figs/current_guess_diff_ratio_sample_budget_composite_rerun0.png)Figure 1:Learning dynamics of our agentic counterexample\-guided learner \(orange\) versus standard prompting without counterexamples \(blue\)\. As the number of examples increases, the hypothesis returned by the agentic learner converges closer222Distance is measured using Diff Ratio, which is the fraction of strings up to a fixed length \(3232in the figure\) on which the hypothesis regex and target regex disagree, normalized by the strings accepted by either regex\.and faster to the target concept, showing the value of leveraging rich feedback\. Points show the mean distance and shaded regions show variation\.
## 2Related Work

Regular\-Language Learning or Regex Induction\.Learning regular languages from examples and queries has a long history in the study of formal languages\. A classic example is Angluin’sL∗L^\{\\ast\}algorithmAngluin \([1987](https://arxiv.org/html/2606.11521#bib.bib17)\), which learns regular sets through membership and equivalence queries with counterexamples; this oracle setting closely matches ours\. Beyond query\-based learning, prior work has also studied direct regex inference from examples, including learning from positive dataFernau \([2009](https://arxiv.org/html/2606.11521#bib.bib18)\), search\-based synthesis from labeled examplesLeeet al\.\([2016](https://arxiv.org/html/2606.11521#bib.bib19)\); Bartoliet al\.\([2014](https://arxiv.org/html/2606.11521#bib.bib20)\), and constraint\- or optimization\-based approachesGao and Zhang \([2020](https://arxiv.org/html/2606.11521#bib.bib21)\)\. Prior to LLMs, neural methods also explored regular\-language learning with recurrent models trained on labeled strings, sometimes followed by automata extraction or analysis of learned finite\-state structureCohenet al\.\([2018](https://arxiv.org/html/2606.11521#bib.bib22)\); Oliva and Lago\-Fernández \([2021](https://arxiv.org/html/2606.11521#bib.bib23)\)\. These approaches provide important background, but they do not study regex generation with verifier\-guided refinement in an LLM\-centered setting\.

Counterexample\-Guided Symbolic Synthesis\.Counterexample\-guided refinement is also central to symbolic synthesis\. Counterexample\-Guided Inductive Synthesis \(CEGIS\) combines candidate generation with automated validation and counterexample generation, and has been highly successful in program synthesis and related symbolic\-reasoning tasksSolar\-Lezama \([2013](https://arxiv.org/html/2606.11521#bib.bib10)\); Abateet al\.\([2018](https://arxiv.org/html/2606.11521#bib.bib9)\)\. Beyond program generation, counterexample\-based supervision has also been used in symbolic learning problems, such as invariant synthesisGarget al\.\([2014](https://arxiv.org/html/2606.11521#bib.bib24),[2016](https://arxiv.org/html/2606.11521#bib.bib25)\)and automata\-related reasoningWeisset al\.\([2020](https://arxiv.org/html/2606.11521#bib.bib26)\), where verifier feedback helps expose mistakes that are difficult to identify from positive and negative examples alone\. Our work is aligned with this line of research, but adapts it to an LLM\-centered setting in which the learner is not a symbolic search procedure, but a language model that must interpret counterexamples and revise its own symbolic hypothesis\.

LLMs for Symbolic Reasoning, Program Synthesis, and Regex Induction\.Recent work has shown that LLMs can support symbolic reasoning and program synthesis by generating structured outputs such as code, logical forms, and symbolic rulesChenet al\.\([2021](https://arxiv.org/html/2606.11521#bib.bib27)\); Austinet al\.\([2021](https://arxiv.org/html/2606.11521#bib.bib28)\); Shinet al\.\([2021](https://arxiv.org/html/2606.11521#bib.bib29)\)\. In regex\-related tasks, prior work has shown that language models can generate explicit regular expressions from natural\-language descriptions or prompts, often combined with examples, repair heuristics, or multimodal and sketch\-based synthesis mechanisms rather than pure induction from labeled strings aloneTanget al\.\([2026](https://arxiv.org/html/2606.11521#bib.bib7)\); Siddiqet al\.\([2024](https://arxiv.org/html/2606.11521#bib.bib8)\); Chenet al\.\([2020](https://arxiv.org/html/2606.11521#bib.bib30)\); Yeet al\.\([2020](https://arxiv.org/html/2606.11521#bib.bib31)\)\. Furthermore, LLMs are increasingly used in settings with external feedback, including process\-supervision methods that verify intermediate reasoning stepsLightmanet al\.\([2023](https://arxiv.org/html/2606.11521#bib.bib32)\)and reinforcement learning with verifiable rewardsLiuet al\.\([2025](https://arxiv.org/html/2606.11521#bib.bib33)\)\. Our work differs from these directions by studying LLM\-based regex induction from labeled strings, and by asking not only whether verifier\-provided feedback helps, but also how it can be incorporated\.

## 3Problem Statement and Motivation

Our study is motivated by the long history of work on active learning where a learner actively solicits feedback from a teacher, and uses the feedback to ‘tease out’ the hypothesis over several rounds of interaction\. Specifically, in the realm of symbolic learning/reasoning tasks, the capability frontier of LLMs can be spiky, and there is little consensus on broad approaches for harnessing these abilities well\. The setting of oracle\-aware learning through iterative feedback offers a richer framework to elicit symbolic learning capabilities from LLMs\. Simultaneously, traditional search\-based approaches for counterexample\-guided learning can suffer when challenging problem instances accrue many counterexamples over many rounds, and machine\-learning\-based approaches provide a compelling alternative to address this issue\. Together, these facts motivate the high\-level research question of this paper: what can LLMs and Agents bring to counterexample\-guided learning in the large?

Oracle\-Guided Learning\.We formulate the problem we study as an active\-learning problem guided by an oracle/teacher\. Let𝒳\\mathcal\{X\}be a domain of elements\. We refer to subsets of this domain as*languages*\. LetL⋆⊆𝒳L^\{\\star\}\\subseteq\\mathcal\{X\}denote the target language we wish to learn\. The learner observes finite evidence about the target language in the form of labeled elements from the domain, and synthesizes a candidate language\. The teacher then inspects the proposed candidate and provides feedback in the form of labeled elements that witness the incorrectness of the proposed candidate, namely elements that should be included in the target language but are not included in the candidate language, and vice versa\. Note that in a concrete realization of this framework, the teacher does not necessarily need access to the target concept to provide such feedback: it can analyze the candidate with respect to test cases, constraints, or other properties it knows to be true about the target, and return witnesses that showcase violations of these properties\.

Problem Statement: Regular\-Language Induction\.We instantiate the above framework for regular\-language induction\. The instance space𝒳\\mathcal\{X\}is a set of finite stringsΣ⋆\\Sigma^\{\\star\}, whereΣ\\Sigmais a finite alphabet\. The target languages we study are those that can be represented by a regular expression or, equivalently, a finite\-state automaton\. Given a regular expressionrr, we denote the language it represents byL\(r\)L\(r\)\. Given labeled examples, a learner must synthesize a regular expressionr^\\hat\{r\}such thatL\(r^\)L\(\\hat\{r\}\)matches the target languageL⋆L^\{\\star\}\. This task is both predictive and symbolic: the output must classify strings correctly, but it must also be an interpretable expression that can be compiled, verified, and compared against the target\. We construct the teacher by sampling from the symmetric differenceL\(r^\)△L⋆L\(\\hat\{r\}\)\\triangle L^\{\\star\}\. We can build such a teacher in this setting because regular languages are closed under Boolean operations, and there exist efficient off\-the\-shelf libraries for constructing a symmetric\-difference automatonRomero \([2021](https://arxiv.org/html/2606.11521#bib.bib34)\)\.

## 4Reasoning Agents for Counterexample\-Guided Inductive Learning

![Refer to caption](https://arxiv.org/html/2606.11521v1/x1.png)Figure 2:Overview of our iterative counterexample\-guided agentic refinement framework with a teacher and a learner\. The learner uses agentic reflection and repair along with counterexamples \(see Section[4](https://arxiv.org/html/2606.11521#S4)\)\. The teacher constructs counterexamples by checking equivalence between the candidate and target regexes\.We describe the various LLM\-based learners that we built for counterexample\-guided learning\. Figure[2](https://arxiv.org/html/2606.11521#S4.F2)describes the overall construction\. We first present the teacher\-learner interaction, and then describe the key parts of our learning framework\. We provide full prompt templates in Appendix[F](https://arxiv.org/html/2606.11521#A6)\.

Teacher\-Learner Iterative Framework\.The learning framework is built around the teacher\-learner setting described earlier\. The learning proceeds in several*rounds*\. In each round, the learner proposes a candidate regular expression\. The teacher analyzes it, and determines if it is equivalent to the target language\. Note that the equivalence of regular languages is decidable, and can be performed efficiently by existing libraries\.

If the proposed candidate is not equivalent, the teacher returns both positive and negative counterexamples, as applicable\. The learner adds these counterexamples to the existing set \(provided as part of the prompt\) and attempts to synthesize a better candidate\. The process continues until the learner finds the target or reaches a predetermined maximum attempt bound\. Note that this process is a closed\-loop refinement procedure\. Unlike learning paradigms where the training set is fixed in advance, supervision here is adaptive: the teacher selects feedback based on the learner’s current errors\. We argue that this approach is especially useful for symbolic synthesis, where a hypothesis may fit a limited sample dataset but may fail on unseen instances\.

Improving Effectiveness of Counterexamples\.We synthesize counterexamples as follows: given the candidate regular expressionrrand the targetr⋆r^\{\\star\}, we construct the corresponding minimal finite automataAAandA⋆A^\{\\star\}\. Positive counterexamples \(i\.e\., strings that must be included in the target language but are not a member of the candidate language\) are sampled from the differenceA⋆∖AA^\{\\star\}\\setminus A, and negative counterexamples are sampled fromA∖A⋆A\\setminus A^\{\\star\}\. The sampling procedure traverses the structure of the difference automata and returns a diverse set of strings that cover different paths, representing a varied set of “mistakes” in the proposed concept\.

However, the number of paths in an automaton can grow rapidly\. Multiple symbols may induce the same state transition, and loops—corresponding to Kleene\-stars in the regular expression—generate infinitely many counterexamples\. We therefore propose the idea ofclusteredcounterexamples, which are symbolic counterexamples that capture many similar individual counterexample instances\.

Clustered Counterexamples\.We construct clustered counterexamples as follows\. For a stateqqin the minimized difference automaton, we group outgoing symbols by their destination state

Γ\(q,q′\)=\{a∈Σ:δ\(q,a\)=q′\}\.\\Gamma\(q,q^\{\\prime\}\)=\\\{a\\in\\Sigma:\\delta\(q,a\)=q^\{\\prime\}\\\}\.Symbols in the same group are behaviorally equivalent at stateqq, because they induce the same transition\. We represent each group with a new super\-character symbol that we add to the alphabet\. For example, we represent the set \{A,…,Z,a,…,z\} as a new super\-character \[A\-Za\-z\]\. We may also use negations to represent the class when convenient, e\.g\.\[^0\-9\]\[\\hat\{\\ \}0\\text\{\-\}9\]\. Singleton groups are of course identified with their literal characters\.

Formally, a clustered counterexample is a path from the start state of the difference DFA to an accepting state, where each edge contributes one super\-character\. A clustered counterexample compactly represents many concrete strings that follow the same DFA trajectory and share the same disagreement label\. For example,S\[0\-9\]\#runrepresents strings such asS2\#runandS8\#run\. Clustered counterexamples are interpreted universally during learning: a clustered example is accepted only if every string represented by it is accepted\. We enumerate loop\-free accepting paths and use a finite frontier and expansion budget to keep counterexample generation bounded\.

Agentic Workflow\.We build on the above ideas by lifting the teacher\-learner framework into an agentic workflow\. Recent work has shown that LLMs can benefit from agentic workflows where they iteratively evaluate and improve their own hypotheses\(Madaanet al\.,[2023](https://arxiv.org/html/2606.11521#bib.bib1); Shinnet al\.,[2023](https://arxiv.org/html/2606.11521#bib.bib2); Xionget al\.,[2024](https://arxiv.org/html/2606.11521#bib.bib4); Songet al\.,[2025](https://arxiv.org/html/2606.11521#bib.bib5)\)\. We develop such an iterative agentic approach to help LLMs use counterexamples more effectively\. Our approach has an outer loop and an inner loop \(see Figure[2](https://arxiv.org/html/2606.11521#S4.F2)\)\. The outer loop accumulates feedback from the teacher and usesreflectionto augment the native reasoning capabilities of the underlying model\. The inner loop refines the process of proposing candidate regular expressions by performingrepairover a smaller set of counterexamples chosen from the full set of counterexamples available to the learner\. The detailed algorithm is in Appendix[B](https://arxiv.org/html/2606.11521#A2)\.

We now describe the reflection mechanism and the repair loop in detail\.

Reflection\.The key idea behind reflection is to ask the LLM to identify specific conceptual errors in previous hypotheses and use them to improve future proposals\. We therefore instruct the LLM to respond not only with a candidate regular expression, but also a short reasoning trace explaining how the synthesized expression was chosen\. We then provide this reasoning trace back to the model along with counterexamples generated by the teacher\. The learner is then instructed to understand the previous attempt, explain what failed, and revise the hypothesis according to the teacher’s feedback\. Concretely, we construct a reflection prompt in each epoch of the outer loop consisting of the reflection instruction, the previous hypothesis, the reasoning trace corresponding to the previous hypothesis, and the new set of counterexamples provided by the teacher\.

Repair Loop\.The mere presence of examples in the context does not ensure that LLMs will return candidates that are consistent with all of them\. In fact, we observe that LLMs may return candidates that are not consistent with provided instructions, are syntactically invalid, or most importantly fail to fit the accumulated training examples\. However, the hypotheses they return are not completely wrong or irrelevant either\. Therefore, we introduce a repair loop that repeatedly queries the LLM for candidates, providing pointed feedback on the failures\. Specifically, we require that the following conditions are satisfied:

- •Syntax Check: the expression must be syntactically valid;
- •Satisfaction Check: the regex must correctly label all counterexamples seen so far \(𝒞\\mathcal\{C\}\)\.

We retry repair by providing the feedback from all previous tries until a candidate satisfying both requirements is synthesized, or a predetermined retry budget is exhausted\. We score the candidates obtained in this manner as follows:

RepairScore\(m,e,𝒞\)=12Compiles\(m\)\+12\(1−\|e\|\|𝒞\|\),\\displaystyle\\textsc\{RepairScore\}\(m,e,\\mathcal\{C\}\)=\\frac\{1\}\{2\}\\mathrm\{Compiles\}\(m\)\+\\frac\{1\}\{2\}\\left\(1\-\\frac\{\|e\|\}\{\|\\mathcal\{C\}\|\}\\right\),wheremmis syntax error message,Compiles\(m\)∈\{0,1\}\\mathrm\{Compiles\}\(m\)\\in\\\{0,1\\\}indicates whether the regex compiles,eeis the set of misclassified examples, and𝒞\\mathcal\{C\}is the accumulated counterexample set\. The repair loop returns the highest scoring candidate synthesized across retries\.

Regularization\.In preliminary investigations, we observed that LLMs often fit the given examples with overly specific or unnecessarily complex regular expressions\. To address this issue we add prompt\-level regularization: we instruct the learner to prefer simpler hypotheses while remaining consistent with the examples\. During inference, we limit the maximum length of the expression, bound the nesting depth of complex operators, and restrict the alphabet to valid symbols for the task\.

## 5Experiments

We study the following research questions:

- •RQ1: Can feedback be used by models and agents, and what affects its usefulness?We ask whether counterexample feedback improves regex learning relative to learning from randomly sampled examples alone\. We study how this effect changes with task complexity\.
- •RQ2: Can agentic setups make better use of feedback, and at what cost?We compare single\-shot approaches with agentic workflows that perform reflection and repair, studying whether additional reasoning effort improves the use of feedback, and whether those gains justify the increased token cost\.
- •RQ3: How does the choice of backbone model affect feedback\-guided induction?We evaluate whether the benefits of feedback\-guided induction persist across different model families and reasoning configurations\.
- •RQ4: How robust is counterexample\-guided learning to prompt design?We vary the prompt components used to specify the input & output formats, reasoning style, and strategy\.
- •RQ5: How do different ways of providing counterexamples affect performance?We compare clustered and non\-clustered counterexamples to test whether feedback quality matters beyond the number of examples provided\.

We also perform ablations to gauge the importance of each component in our pipeline\.

Data\.We evaluate our methods mainly on two datasets with different regular\-expression variants\. A summary of the syntax for each variant and how we collect the datasets are provided in Appendix[C](https://arxiv.org/html/2606.11521#A3)\.

- •Simple Regex\.The simple regex language uses the standard textbook regular\-expression operators—concatenation, union, Kleene star, and epsilon—over a fixed three\-symbol alphabet\{a,b,c\}\\\{a,b,c\\\}\.
- •Extended Regex\.The extended regular language is closer to real\-world usage: beyond the simple setting, it additionally supports one\-or\-more, optional items, counted repetition, intersection, and negation, yielding a richer and more flexible syntax\. The extended regexes we use are adapted from KB13Kushman and Barzilay \([2013](https://arxiv.org/html/2606.11521#bib.bib6)\)\.

We measure regex complexity using two criteria: the number of states in the minimized DFA \(*\#States*\) and the maximum Kleene\-star depth \(*StarDepth*\)\. We select three representative languages for each\(*\#States*,*StarDepth*\)\(\\emph\{\\\#States\},\\emph\{StarDepth\}\)combination, with*\#States*ranging from33to99,*StarDepth*from0to44for Simple Regex, and from0to22for Extended Regex due to limitations of the original KB13 dataset\.

Setup\.Detailed configuration of LLM and hyperparameters can be found in Appendix[D](https://arxiv.org/html/2606.11521#A4)\.

### 5\.1RQ1: Can feedback be used by models and agents, and what affects its usefulness?

Setup\.We evaluate three settings\.Standarduses no counterexamples; instead, it constructs the training set with randomly sampled labeled strings at logarithmically increasing scales, together with regularization\.Singleuses clustered counterexamples for training data and applies regularization, but excludes the agentic workflow\.Agenticuses the full proposed framework\. All counterexamples are clustered as described in Section[2](https://arxiv.org/html/2606.11521#S4.F2)\.

Results\.

Simple Regex Run\-Level Success Rate↑\\uparrowMethodSD=0SD=1SD=2SD=3SD=4Standard100\.0%50\.8%23\.8%27\.0%3\.2%Single100\.0%92\.1%73\.0%61\.9%23\.8%Agentic100\.0%100\.0%82\.5%77\.8%38\.1%
Extended Regex Run\-Level Success Rate↑\\uparrowMethodSD=0SD=1SD=2Standard100\.0%66\.7%38\.9%Single100\.0%77\.8%68\.5%Agentic100\.0%84\.1%74\.1%

Table 1:Run\-level success rate by regex complexity \(aggregated by*StarDepth*\(SD\)\) on the Simple Regex and Extended Regex\. Each cell reports success ratio over 21 regexes and 3 runs per regex\.Table[1](https://arxiv.org/html/2606.11521#S5.T1)shows that feedback in the form of counterexamples substantially improve regex induction on both datasets\. On Simple Regex,Standardfalls sharply as*StarDepth*increases, reaching only3\.2%3\.2\\%success at*StarDepth 4*, whileSingleandAgenticperform much better at23\.8%23\.8\\%and38\.1%38\.1\\%, respectively\. The same pattern appears on Extended Regex: at*StarDepth 2*, counterexample\-guided methods are better thanStandardby nearly 30 percentage points, withAgenticachieving the best result\. The benefit of counterexamples becomes more significant as tasks become harder\. As*StarDepth*increases,Standardprompting degrades quickly, while counterexample\-guided settings remain substantially more reliable\. This result suggests that counterexamples are most useful when the initial labeled examples underdetermine the target language and the model must recover more complex symbolic structure\.

### 5\.2RQ2: Can agentic setups make better use of feedback, and at what cost?

Setup\.To isolate the value of the agentic workflow, we compareSingleandAgentic\. Both settings use clustered counterexamples and regularization, but onlyAgenticperforms additional reflection and repair steps\. We measure both final success rate and inference cost, where cost is approximated by total token usage\.

Results\.TheAgenticworkflow makes more effective use of counterexamples\. Across both datasets and all nontrivial*StarDepth*levels,Agenticconsistently outperformsSingle\. This observation indicates that reflection and repair loops help the model extract useful information from counterexamples and translate it into improved regex hypotheses\.

![Refer to caption](https://arxiv.org/html/2606.11521v1/figs/pareto_depth_overlay_simplyrx.png)\(a\)Main Simple Regex Dataset
![Refer to caption](https://arxiv.org/html/2606.11521v1/figs/pareto_depth_overlay_extrx.png)\(b\)Main Extended Regex Dataset

Figure 3:Pareto frontiers of the main results on Simple and Extended Regex\. Points at the same*StarDepth*share the same color, while marker shapes distinguish learning settings\. Gray dashed lines connect the same setting across difficulty levels, showing how the cost–accuracy trade\-off shifts with task difficulty\. Colored solid segments with black dashed overlays indicate Pareto\-optimal trade\-offs\.The Pareto plots in Figures[3\(a\)](https://arxiv.org/html/2606.11521#S5.F3.sf1)and[3\(b\)](https://arxiv.org/html/2606.11521#S5.F3.sf2)further show that counterexample\-guided learning is more cost\-efficient than theStandardbaseline\. For most nontrivial*StarDepth*levels,Singleachieves higher success rates with fewer total tokens thanStandard\. The plots also indicate the cost of theAgenticworkflow: it typically achieves the highest success rate, but requires more tokens due to additional reflection and repair rounds\. Counterexamples improve both success and sample efficiency, while the full agentic workflow further increases success but uses more tokens\.

### 5\.3RQ3: How does the choice of backbone model affect feedback\-guided induction?

Setup\.We compare different backbone models on Simple Regex while holding the task, evaluation protocol, and maximum output\-token budget fixed\. We evaluateStandardandAgenticacross Qwen3\-235BTeam \([2025](https://arxiv.org/html/2606.11521#bib.bib13)\), GPT\-5\.4 Medium, GPT\-5\.4 xHighOpenAI \([2026](https://arxiv.org/html/2606.11521#bib.bib14)\), and GPT\-oss\-120BOpenAIet al\.\([2025](https://arxiv.org/html/2606.11521#bib.bib12)\)\.

Results\.Table[2](https://arxiv.org/html/2606.11521#S5.T2)shows that for every tested model and every nontrivial*StarDepth*level,Agenticconsistently improves overStandard\. This result indicates that the proposed framework can broadly use counterexamples to improve regular\-language induction, rather than depending on a single backbone model\. However, the absolute performance still varies across models\. GPT\-5\.4 xHigh performs best overall, while Qwen3\-235B performs worst, with the other two models in between\.

Because GPT\-5\.4 xHigh achieves the best overall performance, we further explored its behavior on the harder*StarDepth=4*split\. Under the agentic setting, GPT\-5\.4 xHigh substantially outperforms GPT\-oss\-120B, achieving85\.7%85\.7\\%success, compared with38\.1%38\.1\\%for GPT\-oss\-120B\.

Table 2:Performance of different backbone models on Simply Regex aggregated by*StarDepth*\. Each cell reports run\-level success ratio over 21 regexes \(3 per*\#States*\), one run per regex\. Qwen3\-235B denotesQwen/Qwen3\-235B\-A22B\-Instruct\-2507\-tput\. GPT\-5\.4 Medium and GPT\-5\.4 xHigh both useopenai/gpt\-5\.4, withreasoning\_effortset tomediumandxhigh, respectively\.
### 5\.4RQ4: How robust is counterexample\-guided learning to prompt design?

Setup\.We ablate the prompt design on Simple Regex\. All prompt variants include a short task instruction that asks the model to infer a regular language from labeled strings\. We then vary four additional components: an*input instruction*that explains the format of labeled examples, an*output instruction*that specifies the regex syntax and answer format, a*CoT\-style output instruction*Weiet al\.\([2023](https://arxiv.org/html/2606.11521#bib.bib16)\)that asks for intermediate reasoning before the final answer, and a*strategy*component obtained from GEPAAgrawalet al\.\([2026](https://arxiv.org/html/2606.11521#bib.bib15)\)optimization\.

Results\.The most important component is the*output instruction*\(see Appendix[E\.3](https://arxiv.org/html/2606.11521#A5.SS3)\), which suggests that models need a clear specification of the output language and answer format; otherwise, many failures come from producing invalid or unusable regexes rather than from the induction problem alone\. In contrast, explicitly prompting for*CoT*does not consistently improve final success\. The most likely reason for the lack of improvement is because GPT\-oss already separates reasoning from final answers, while many recent reasoning models perform latent reasoning internally\. As a result, prompt\-level*CoT*guidance offers limited additional benefit once the model already uses a reasoning\-style decoding interface\. The GEPA\-optimized*strategy*component provides some benefit, especially when regexes are easier at lower*StarDepth*\. However, the gain is limited and becomes less clear on harder regexes\. This result may indicate mild overfitting of the optimized strategy to simpler regex structures, and does not seem to provide a general solution for more difficult tasks\.

Overall, the results suggest that prompt design is relatively robust once the input and output formats, and the syntax of the desired output regex are specified clearly\. The exact wording of auxiliary reasoning or strategy instructions has a smaller effect\.

### 5\.5RQ5: How Do Different Ways of Providing Counterexamples Affect Performance?

Setup\.We study whether clustered counterexamples affect learning\.

Results\.Table[3](https://arxiv.org/html/2606.11521#S5.T3)shows that across both datasets, clustered counterexamples consistently improve success rates\. This result suggests that the benefit of counterexamples depends not only on adding more labeled strings, but also on how those strings cover the disagreement between the predicted and target languages\. Clustered counterexamples provide a more systematic view of the error region, making the revealed mismatch clearer and less redundant\. As a result, they give the model a stronger signal for diagnosing the failed hypothesis and revising the regex\.

Table 3:Comparison between non\-clustered and clustered counterexamples on both datasets\. Each row uses regularization and the full agentic workflow\. Each*StarDepth*column aggregates 21 regexes, with 3 regexes from every*\#States*, one run per regex\.
### 5\.6Ablations

We ablate the main components of our workflow in Table[4](https://arxiv.org/html/2606.11521#S5.T4)\.

Table 4:Algorithm\-components ablation on Simply Regex Dataset\.*C\-CE*,*Reg\.*,*Refl\.*,*Repair*indicates whether clustered counterexamples, regularization, reflection, repair loops are used\. Each*StarDepth*column aggregates 21 runs, with 3 regexes from every*\#States*, one run per regex\.Regularization has only a small effect on final success\. From the trajectories, we observe that without regularization the model is more likely to overfit early, when only a few examples are available, by producing long and overly specific regexes\. As more counterexamples are added, this issue is naturally reduced: the larger and more diverse training set makes it harder for a highly specific expression to fit all observations, and the model is pushed toward more general patterns\. Consequently, removing regularization does not substantially hurt final performance\. Reflection and repair loops both contribute to the performance, but in different ways\. Empirically, repair loops provide the larger gain, which is expected because they use additional inference tokens to perform multiple rounds of regex generation and verification\. The Extended Regex ablation and Pareto plots are provided in Appendix[E\.4](https://arxiv.org/html/2606.11521#A5.SS4)\. The Pareto plot in Figure[8\(a\)](https://arxiv.org/html/2606.11521#A5.F8.sf1)shows the same trade\-off:Singlelies on a strong efficiency frontier, while Agentic variants usually achieve higher success, but require more tokens\. We also evaluate a multi\-candidate repair variant, where1010regexes are proposed simultaneously each round, and the one with the best repair score is selected\. This variant does not outperformSingle, achieving81\.0%81\.0\\%,61\.9%61\.9\\%, and61\.9%61\.9\\%success atSD=1,2,3\\mathrm\{SD\}=1,2,3, versus85\.7%85\.7\\%,71\.4%71\.4\\%, and61\.9%61\.9\\%forSingle\. This result suggests that broader candidate exploration alone is insufficient to improve refinement\.

## 6Conclusion

Counterexample\-guided feedback provides a controllable way to study how LLMs use feedback for symbolic learning\. We showed that verifier\-generated counterexamples—especially clustered counterexamples—substantially improve regex induction\. More broadly, our findings suggest that in verifiable domains, for which informative failures can be synthesized, LLM agents can use structured feedback to move beyond merely fitting examples and toward reliable iterative refinement\.

## References

- A\. Abate, C\. David, P\. Kesseli, D\. Kroening, and E\. Polgreen \(2018\)Counterexample guided inductive synthesis modulo theories\.InComputer Aided Verification,Lecture Notes in Computer Science, Vol\.10981,pp\. 270–288\.External Links:[Document](https://dx.doi.org/10.1007/978-3-319-96145-3%5F15)Cited by:[§1](https://arxiv.org/html/2606.11521#S1.p3.1),[§2](https://arxiv.org/html/2606.11521#S2.p2.1)\.
- L\. A\. Agrawal, S\. Tan, D\. Soylu, N\. Ziems, R\. Khare, K\. Opsahl\-Ong, A\. Singhvi, H\. Shandilya, M\. J\. Ryan, M\. Jiang, C\. Potts, K\. Sen, A\. G\. Dimakis, I\. Stoica, D\. Klein, M\. Zaharia, and O\. Khattab \(2026\)GEPA: reflective prompt evolution can outperform reinforcement learning\.External Links:2507\.19457,[Link](https://arxiv.org/abs/2507.19457)Cited by:[§5\.4](https://arxiv.org/html/2606.11521#S5.SS4.p1.1)\.
- R\. Alur, R\. Bodik, G\. Juniwal, M\. M\. K\. Martin, M\. Raghothaman, S\. A\. Seshia, R\. Singh, A\. Solar\-Lezama, E\. Torlak, and A\. Udupa \(2013\)Syntax\-guided synthesis\.In2013 Formal Methods in Computer\-Aided Design,pp\. 1–8\.Cited by:[§1](https://arxiv.org/html/2606.11521#S1.p3.1)\.
- D\. Angluin \(1987\)Learning regular sets from queries and counterexamples\.Information and Computation75\(2\),pp\. 87–106\.External Links:ISSN 0890\-5401,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/0890-5401%2887%2990052-6),[Link](https://www.sciencedirect.com/science/article/pii/0890540187900526)Cited by:[§2](https://arxiv.org/html/2606.11521#S2.p1.1)\.
- J\. Austin, A\. Odena, M\. Nye, M\. Bosma, H\. Michalewski, D\. Dohan, E\. Jiang, C\. Cai, M\. Terry, Q\. Le, and C\. Sutton \(2021\)Program synthesis with large language models\.External Links:2108\.07732,[Link](https://arxiv.org/abs/2108.07732)Cited by:[§2](https://arxiv.org/html/2606.11521#S2.p3.1)\.
- A\. Bartoli, G\. Davanzo, A\. De Lorenzo, E\. Medvet, and E\. Sorio \(2014\)Automatic synthesis of regular expressions from examples\.Computer47\(12\),pp\. 72–80\.External Links:ISSN 0018\-9162,[Link](https://doi.org/10.1109/MC.2014.344),[Document](https://dx.doi.org/10.1109/MC.2014.344)Cited by:[§2](https://arxiv.org/html/2606.11521#S2.p1.1)\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. de Oliveira Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman, A\. Ray, R\. Puri, G\. Krueger, M\. Petrov, H\. Khlaaf, G\. Sastry, P\. Mishkin, B\. Chan, S\. Gray, N\. Ryder, M\. Pavlov, A\. Power, L\. Kaiser, M\. Bavarian, C\. Winter, P\. Tillet, F\. P\. Such, D\. Cummings, M\. Plappert, F\. Chantzis, E\. Barnes, A\. Herbert\-Voss, W\. H\. Guss, A\. Nichol, A\. Paino, N\. Tezak, J\. Tang, I\. Babuschkin, S\. Balaji, S\. Jain, W\. Saunders, C\. Hesse, A\. N\. Carr, J\. Leike, J\. Achiam, V\. Misra, E\. Morikawa, A\. Radford, M\. Knight, M\. Brundage, M\. Murati, K\. Mayer, P\. Welinder, B\. McGrew, D\. Amodei, S\. McCandlish, I\. Sutskever, and W\. Zaremba \(2021\)Evaluating large language models trained on code\.External Links:2107\.03374,[Link](https://arxiv.org/abs/2107.03374)Cited by:[§2](https://arxiv.org/html/2606.11521#S2.p3.1)\.
- Q\. Chen, X\. Wang, X\. Ye, G\. Durrett, and I\. Dillig \(2020\)Multi\-modal synthesis of regular expressions\.InProceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation,PLDI 2020,New York, NY, USA,pp\. 487–502\.External Links:ISBN 9781450376136,[Link](https://doi.org/10.1145/3385412.3385988),[Document](https://dx.doi.org/10.1145/3385412.3385988)Cited by:[§2](https://arxiv.org/html/2606.11521#S2.p3.1)\.
- M\. Cohen, A\. Caciularu, I\. Rejwan, and J\. Berant \(2018\)Inducing regular grammars using recurrent neural networks\.External Links:1710\.10453,[Link](https://arxiv.org/abs/1710.10453)Cited by:[§2](https://arxiv.org/html/2606.11521#S2.p1.1)\.
- H\. Fernau \(2009\)Algorithms for learning regular expressions from positive data\.Information and Computation207\(4\),pp\. 521–541\.External Links:ISSN 0890\-5401,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.ic.2008.12.008),[Link](https://www.sciencedirect.com/science/article/pii/S0890540109000169)Cited by:[§2](https://arxiv.org/html/2606.11521#S2.p1.1)\.
- J\. Gao and Y\. Zhang \(2020\)Regular expression learning from positive examples based on integer programming\.International Journal of Software Engineering and Knowledge Engineering30\(10\),pp\. 1443–1479\.External Links:[Document](https://dx.doi.org/10.1142/S0218194020400203)Cited by:[§2](https://arxiv.org/html/2606.11521#S2.p1.1)\.
- P\. Garg, C\. Löding, P\. Madhusudan, and D\. Neider \(2014\)ICE: a robust framework for learning invariants\.InComputer Aided Verification,Lecture Notes in Computer Science, Vol\.8559,pp\. 69–87\.External Links:[Document](https://dx.doi.org/10.1007/978-3-319-08867-9%5F5)Cited by:[§2](https://arxiv.org/html/2606.11521#S2.p2.1)\.
- P\. Garg, D\. Neider, P\. Madhusudan, and D\. Roth \(2016\)Learning invariants using decision trees and implication counterexamples\.InProceedings of the 43rd Annual ACM SIGPLAN\-SIGACT Symposium on Principles of Programming Languages,pp\. 499–512\.External Links:ISBN 9781450335492,[Link](https://doi.org/10.1145/2837614.2837664),[Document](https://dx.doi.org/10.1145/2837614.2837664)Cited by:[§2](https://arxiv.org/html/2606.11521#S2.p2.1)\.
- N\. Kushman and R\. Barzilay \(2013\)Using semantic unification to generate regular expressions from natural language\.InProceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,L\. Vanderwende, H\. Daumé III, and K\. Kirchhoff \(Eds\.\),Atlanta, Georgia,pp\. 826–836\.External Links:[Link](https://aclanthology.org/N13-1103/)Cited by:[Appendix C](https://arxiv.org/html/2606.11521#A3.p3.1),[2nd item](https://arxiv.org/html/2606.11521#S5.I2.i2.p1.1)\.
- M\. Lee, S\. So, and H\. Oh \(2016\)Synthesizing regular expressions from examples for introductory automata assignments\.SIGPLAN Not\.52\(3\),pp\. 70–80\.External Links:ISSN 0362\-1340,[Link](https://doi.org/10.1145/3093335.2993244),[Document](https://dx.doi.org/10.1145/3093335.2993244)Cited by:[§2](https://arxiv.org/html/2606.11521#S2.p1.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2023\)Let’s verify step by step\.External Links:2305\.20050,[Link](https://arxiv.org/abs/2305.20050)Cited by:[§2](https://arxiv.org/html/2606.11521#S2.p3.1)\.
- X\. Liu, T\. Liang, Z\. He, J\. Xu, W\. Wang, P\. He, Z\. Tu, H\. Mi, and D\. Yu \(2025\)Trust, but verify: a self\-verification approach to reinforcement learning with verifiable rewards\.External Links:2505\.13445,[Link](https://arxiv.org/abs/2505.13445)Cited by:[§2](https://arxiv.org/html/2606.11521#S2.p3.1)\.
- A\. Madaan, N\. Tandon, P\. Gupta, S\. Hallinan, L\. Gao, S\. Wiegreffe, U\. Alon, N\. Dziri, S\. Prabhumoye, Y\. Yang, S\. Gupta, B\. P\. Majumder, K\. Hermann, S\. Welleck, A\. Yazdanbakhsh, and P\. Clark \(2023\)Self\-refine: iterative refinement with self\-feedback\.External Links:2303\.17651,[Link](https://arxiv.org/abs/2303.17651)Cited by:[§4](https://arxiv.org/html/2606.11521#S4.p8.1)\.
- C\. Oliva and L\. F\. Lago\-Fernández \(2021\)Stability of internal states in recurrent neural networks trained on regular languages\.Neurocomputing452,pp\. 212–223\.External Links:ISSN 0925\-2312,[Link](http://dx.doi.org/10.1016/j.neucom.2021.04.058),[Document](https://dx.doi.org/10.1016/j.neucom.2021.04.058)Cited by:[§2](https://arxiv.org/html/2606.11521#S2.p1.1)\.
- OpenAI, :, S\. Agarwal, L\. Ahmad, J\. Ai, S\. Altman, A\. Applebaum, E\. Arbus, R\. K\. Arora, Y\. Bai, B\. Baker, H\. Bao, B\. Barak, A\. Bennett, T\. Bertao, N\. Brett, E\. Brevdo, G\. Brockman, S\. Bubeck, C\. Chang, K\. Chen, M\. Chen, E\. Cheung, A\. Clark, D\. Cook, M\. Dukhan, C\. Dvorak, K\. Fives, V\. Fomenko, T\. Garipov, K\. Georgiev, M\. Glaese, T\. Gogineni, A\. Goucher, L\. Gross, K\. G\. Guzman, J\. Hallman, J\. Hehir, J\. Heidecke, A\. Helyar, H\. Hu, R\. Huet, J\. Huh, S\. Jain, Z\. Johnson, C\. Koch, I\. Kofman, D\. Kundel, J\. Kwon, V\. Kyrylov, E\. Y\. Le, G\. Leclerc, J\. P\. Lennon, S\. Lessans, M\. Lezcano\-Casado, Y\. Li, Z\. Li, J\. Lin, J\. Liss, Lily, Liu, J\. Liu, K\. Lu, C\. Lu, Z\. Martinovic, L\. McCallum, J\. McGrath, S\. McKinney, A\. McLaughlin, S\. Mei, S\. Mostovoy, T\. Mu, G\. Myles, A\. Neitz, A\. Nichol, J\. Pachocki, A\. Paino, D\. Palmie, A\. Pantuliano, G\. Parascandolo, J\. Park, L\. Pathak, C\. Paz, L\. Peran, D\. Pimenov, M\. Pokrass, E\. Proehl, H\. Qiu, G\. Raila, F\. Raso, H\. Ren, K\. Richardson, D\. Robinson, B\. Rotsted, H\. Salman, S\. Sanjeev, M\. Schwarzer, D\. Sculley, H\. Sikchi, K\. Simon, K\. Singhal, Y\. Song, D\. Stuckey, Z\. Sun, P\. Tillet, S\. Toizer, F\. Tsimpourlas, N\. Vyas, E\. Wallace, X\. Wang, M\. Wang, O\. Watkins, K\. Weil, A\. Wendling, K\. Whinnery, C\. Whitney, H\. Wong, L\. Yang, Y\. Yang, M\. Yasunaga, K\. Ying, W\. Zaremba, W\. Zhan, C\. Zhang, B\. Zhang, E\. Zhang, and S\. Zhao \(2025\)Gpt\-oss\-120b & gpt\-oss\-20b model card\.External Links:2508\.10925,[Link](https://arxiv.org/abs/2508.10925)Cited by:[Appendix D](https://arxiv.org/html/2606.11521#A4.p1.3),[§5\.3](https://arxiv.org/html/2606.11521#S5.SS3.p1.1)\.
- OpenAI \(2026\)ChatGPT \(GPT\-5\.4\) \[Large language model\]\.Note:Accessed: 2026\-04External Links:[Link](https://chat.openai.com/)Cited by:[Appendix D](https://arxiv.org/html/2606.11521#A4.p1.3),[§5\.3](https://arxiv.org/html/2606.11521#S5.SS3.p1.1)\.
- J\. Romero \(2021\)Pyformlang: an educational library for formal language manipulation\.InProceedings of the 52nd ACM Technical Symposium on Computer Science Education,SIGCSE ’21,New York, NY, USA,pp\. 576–582\.External Links:ISBN 9781450380621,[Link](https://doi.org/10.1145/3408877.3432464),[Document](https://dx.doi.org/10.1145/3408877.3432464)Cited by:[Appendix D](https://arxiv.org/html/2606.11521#A4.p3.7),[§3](https://arxiv.org/html/2606.11521#S3.p3.9)\.
- R\. Shin, C\. H\. Lin, S\. Thomson, C\. Chen, S\. Roy, E\. A\. Platanios, A\. Pauls, D\. Klein, J\. Eisner, and B\. V\. Durme \(2021\)Constrained language models yield few\-shot semantic parsers\.External Links:2104\.08768,[Link](https://arxiv.org/abs/2104.08768)Cited by:[§2](https://arxiv.org/html/2606.11521#S2.p3.1)\.
- N\. Shinn, F\. Cassano, E\. Berman, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.External Links:2303\.11366,[Link](https://arxiv.org/abs/2303.11366)Cited by:[§4](https://arxiv.org/html/2606.11521#S4.p8.1)\.
- M\. L\. Siddiq, J\. Zhang, L\. Roney, and J\. C\. S\. Santos \(2024\)Re\(gex\|dos\)eval: evaluating generated regular expressions and their proneness to dos attacks\.InProceedings of the 46th International Conference on Software Engineering, NIER Track \(ICSE\-NIER ’24\),External Links:[Document](https://dx.doi.org/10.1145/3639476.3639757)Cited by:[§1](https://arxiv.org/html/2606.11521#S1.p3.1),[§2](https://arxiv.org/html/2606.11521#S2.p3.1)\.
- A\. Solar\-Lezama \(2013\)Program sketching\.International Journal on Software Tools for Technology Transfer15\(5\),pp\. 475–495\.External Links:[Document](https://dx.doi.org/10.1007/s10009-012-0249-7),[Link](https://doi.org/10.1007/s10009-012-0249-7),ISSN 1433\-2787Cited by:[§1](https://arxiv.org/html/2606.11521#S1.p3.1),[§2](https://arxiv.org/html/2606.11521#S2.p2.1)\.
- X\. Song, Y\. Wu, W\. Wang, J\. Liu, W\. Su, and B\. Zheng \(2025\)ProgCo: program helps self\-correction of large language models\.External Links:2501\.01264,[Link](https://arxiv.org/abs/2501.01264)Cited by:[§4](https://arxiv.org/html/2606.11521#S4.p8.1)\.
- Z\. Tang, Y\. Yan, R\. Li, H\. Dong, B\. Sun, H\. Chen, and H\. Gao \(2026\)Multi\-modal regular expression synthesis method based on large language models and semantics\.Journal of Systems Architecture175,pp\. 103762\.External Links:ISSN 1383\-7621,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.sysarc.2026.103762),[Link](https://www.sciencedirect.com/science/article/pii/S1383762126000809)Cited by:[§1](https://arxiv.org/html/2606.11521#S1.p3.1),[§2](https://arxiv.org/html/2606.11521#S2.p3.1)\.
- Q\. Team \(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[Appendix D](https://arxiv.org/html/2606.11521#A4.p1.3),[§5\.3](https://arxiv.org/html/2606.11521#S5.SS3.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. Chi, Q\. Le, and D\. Zhou \(2023\)Chain\-of\-thought prompting elicits reasoning in large language models\.External Links:2201\.11903,[Link](https://arxiv.org/abs/2201.11903)Cited by:[§5\.4](https://arxiv.org/html/2606.11521#S5.SS4.p1.1)\.
- G\. Weiss, Y\. Goldberg, and E\. Yahav \(2020\)Extracting automata from recurrent neural networks using queries and counterexamples\.External Links:1711\.09576,[Link](https://arxiv.org/abs/1711.09576)Cited by:[§2](https://arxiv.org/html/2606.11521#S2.p2.1)\.
- W\. Xiong, Y\. Song, X\. Zhao, W\. Wu, X\. Wang, K\. Wang, C\. Li, W\. Peng, and S\. Li \(2024\)Watch every step\! llm agent learning via iterative step\-level process refinement\.External Links:2406\.11176,[Link](https://arxiv.org/abs/2406.11176)Cited by:[§4](https://arxiv.org/html/2606.11521#S4.p8.1)\.
- X\. Ye, Q\. Chen, X\. Wang, I\. Dillig, and G\. Durrett \(2020\)Sketch\-driven regular expression generation from natural language and examples\.Transactions of the Association for Computational Linguistics8,pp\. 679–694\.External Links:[Link](https://aclanthology.org/2020.tacl-1.44/),[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00339)Cited by:[§2](https://arxiv.org/html/2606.11521#S2.p3.1)\.

## Appendix ABackground Details

We give formal definition of membership oracle queries and equivalence oracle queries:

#### Membership Oracle Queries\.

A membership oracle labels individual instances\. For a queried instancex∈𝒳x\\in\\mathcal\{X\}, it returns

𝒪mem\(x\)=𝟏\[x∈L⋆\]\.\\mathcal\{O\}\_\{\\mathrm\{mem\}\}\(x\)=\\mathbf\{1\}\[x\\in L^\{\\star\}\]\.Membership queries provide local supervision about the behavior of the target language on specific examples\. In an active setting, these examples may be selected to be informative\. However, finitely many membership labels are generally insufficient to identify the full target language\.

#### Equivalence Oracle Queries\.

An equivalence oracle evaluates an entire hypothesis\. Given a proposed hypothesishh, the oracle checks whetherL\(h\)=L⋆L\(h\)=L^\{\\star\}\. If the two languages are equivalent, learning succeeds\. Otherwise, the oracle returns counterexamples

C⊆L\(h\)△L⋆,C\\subseteq L\(h\)\\triangle L^\{\\star\},where△\\triangledenotes symmetric difference\. Each counterexample exposes a concrete disagreement between the hypothesis and the target, indicating either that the hypothesis is too restrictive or that it over\-generalizes\.

## Appendix BPseudo Codes for Agentic Workflow

Algorithm[1](https://arxiv.org/html/2606.11521#alg1)describes the outer loop\. At each epoch, the teacher checks the current hypothesis and returns counterexamples when the hypothesis is incorrect\. These counterexamples are added to the aggregate training set\. The learner is then prompted to reflect on the previous hypothesis, its reasoning, and the newly revealed failures before producing an updated hypothesis\.

Algorithm[2](https://arxiv.org/html/2606.11521#alg2)describes the inner repair loop\. For each retry, the learner proposes a reasoning trace and a regex\. The teacher evaluates whether the regex compiles, whether it fits the accumulated examples, and whether it is equivalent to the target\. If the candidate fails, the resulting feedback is converted into a repair prompt for the next retry\.

## Appendix CDetailed information of Regex Datasets

We summarize the main syntax feature differences between Simple Regex and Extended Regex in Table[5](https://arxiv.org/html/2606.11521#A3.T5)\.

For Simple Regex dataset, in order to synthesize regular expressions at scale, we recursively sample syntax trees over union, concatenation, and Kleene\-star under size and star\-depth budgets\. We use randomness to efficiently cover a diverse range of structures without exhaustively enumerating the search space\. Afterwards, we simplify duplicated Kleene\-stars, for instance, reducing “\(a∗\)∗\(a\*\)\*” to “a∗a\*”, and then construct the actual regular expressions out of those syntax trees\.

The extended regexes used in our experiments come from the KB13 datasetKushman and Barzilay \[[2013](https://arxiv.org/html/2606.11521#bib.bib6)\]\. Because the expressions from KB13 have a slightly different syntax, we first parse and then covert them into our desired form\. We then group these regexes by the complexity criteria described in Section[5](https://arxiv.org/html/2606.11521#S5), which guides the selection of our final Extended dataset\.

Table 5:Comparison of the syntax features for Simple Regex and Extended Regex
## Appendix DDetailed Configuration and Setup

The learner is implemented by LLMs\. We use the open\-source model “openai/gpt\-oss\-120b”OpenAIet al\.\[[2025](https://arxiv.org/html/2606.11521#bib.bib12)\]\(also “GPT”OpenAI \[[2026](https://arxiv.org/html/2606.11521#bib.bib14)\]and “Qwen”Team \[[2025](https://arxiv.org/html/2606.11521#bib.bib13)\]models for model ablation\)\. For the main experiments, we perform three trials per regex for robustness and reliability\. And from RQ3 to RQ5 in Section[5](https://arxiv.org/html/2606.11521#S5), we perform only one trial per regex for relevant experiments\. For all the experiments, we set the maximum context\-window length to be6553665536, the maximum output\-token length to be3276832768\(including both potential reasoning tokens and revealed output tokens\), and the temperature to be0\.00\.0\.

For “openai/gpt\-oss\-120b” experiments, we use 2 NVIDIA L40S GPUs to load the model and do inference\. The CPU memory used is less than6464GB, and disk usage is less than200200GB\. For other models \(“GPT\-5\.4” and “Qwen3\-235B”\), we use API calls for inference\.

For implementation of regex, we use the off\-the\-shelf python library “pyformlang”Romero \[[2021](https://arxiv.org/html/2606.11521#bib.bib34)\]\. The maximum length of the training examples is set to be3232\. ForStandardsetting, we start with33training examples, and scale logarithmically up to maximally30003000before reporting failure\. For settings involving counterexamples, we maximally starts with88initial labeled examples, and run maximally1212epochs with number of counterexamples added each round clipped to250250maximally\. For settings involving repair loops, the repair budget is set to be33\.

The main metric that we use to evaluate the performance is run\-level success ratio\. For example, for the main results in Table[1](https://arxiv.org/html/2606.11521#S5.T1), we evaluate 21 regexes per cell, and run 3 times per regex, yielding 63 runs in total, and we report the ratio of the number of successful runs over the number of total runs\.

## Appendix EMore Results

### E\.1Heatmaps of Main Evaluation

Figures[4](https://arxiv.org/html/2606.11521#A5.F4)and[5](https://arxiv.org/html/2606.11521#A5.F5)show the full main\-evaluation heatmaps over both*StarDepth*and the number of DFA states\. These plots complement the aggregated results in Table[1](https://arxiv.org/html/2606.11521#S5.T1), where we report success rates grouped by*StarDepth*\.

The heatmaps show that*StarDepth*is a stronger indicator of learning difficulty than the number of states\. Within the same*StarDepth*level, performance varies with the number of states, but the overall degradation is more clearly aligned with increasing*StarDepth*\. This motivates our choice to aggregate the main results by*StarDepth*in the main text\.

The heatmaps also confirm the main trend: counterexample\-guided settings improve success rates across a broad range of complexity levels\. The improvement is most visible in higher\-*StarDepth*regions, whereStandardprompting often fails but counterexample\-guided methods recover a larger fraction of target regexes\.

![Refer to caption](https://arxiv.org/html/2606.11521v1/figs/solve_rate_heatmap_compare_simplyrx.png)Figure 4:Success rate heatmaps across regex complexity for main Simple Regex experiments\. Rows correspond to*StarDepth*and columns correspond to the number of DFA states\. Darker colors indicate higher success rates\.![Refer to caption](https://arxiv.org/html/2606.11521v1/figs/solve_rate_heatmap_compare_extrx.png)Figure 5:Success rate heatmaps across regex complexity for main Extended Regex experiments\. Rows correspond to*StarDepth*and columns correspond to the number of DFA states\. Darker colors indicate higher success rates\.
### E\.2Sample Efficiency Comparison

Figures[6](https://arxiv.org/html/2606.11521#A5.F6)and[7](https://arxiv.org/html/2606.11521#A5.F7)compareStandardandAgenticas a function of sample budget\. For each budget, we take the last hypothesis in the learning trajectory whose cumulative number of labeled training examples does not exceed that budget, and evaluate this hypothesis on the corresponding evaluation set, which contains sufficiently800800labeled strings\. The curves report mean evaluation accuracy across all regexes in the corresponding*StarDepth*group, and the shaded regions indicate variation across regexes\.

Across both datasets,Agenticgenerally reaches higher evaluation accuracy at the same sample budget, and it tends to improve more steadily as more examples are observed\. The advantage is most visible at nontrivial difficulty levels, whereStandardoften plateaus earlier or fluctuates more, whileAgenticcontinues to refine stronger hypotheses\. This pattern indicates that verifier\-guided counterexamples help the learner extract more information from each additional example\.

Taken together, these curves show that our method is more sample\-efficient: it typically achieves better intermediate hypotheses with fewer labeled examples, and this advantage persists across a broad range of budgets and difficulty levels\.

![Refer to caption](https://arxiv.org/html/2606.11521v1/figs/current_guess_eval_accuracy_sample_budget_stardepth_0_1_2_3_4.png)Figure 6:Evaluation accuracy of hypothesis selected under sample budget on each*StarDepth*of Simple Regex Dataset\. For each budget, we take the last hypothesis in the learning trajectory whose cumulative number of labeled training examples does not exceed that budget, and evaluate it on the corresponding evaluation set\. Each point is the mean evaluation accuracy, and the shaded region indicates variation\.![Refer to caption](https://arxiv.org/html/2606.11521v1/figs/current_guess_eval_accuracy_sample_budget_stardepth_0_1_2.png)Figure 7:Evaluation accuracy of hypothesis selected under sample budget on each*StarDepth*of Extended Regex Dataset\. For each budget, we take the last hypothesis in the learning trajectory whose cumulative number of labeled training examples does not exceed that budget, and evaluate it on the corresponding evaluation set\. Each point is the mean evaluation accuracy, and the shaded region indicates variation\.
### E\.3Details on Robustness of Prompt Design

Table[6](https://arxiv.org/html/2606.11521#A5.T6)reports a prompt ablation for theSinglesetting on the Simple Regex dataset\. The largest improvement comes from adding output\-side instructions \(syntax and output format requirements\): compared with naive prompting, specifying the regex syntax and answer format raises success from14\.3%/0\.0%/4\.8%14\.3\\%/0\.0\\%/4\.8\\%to71\.4%/33\.3%/23\.8%71\.4\\%/33\.3\\%/23\.8\\%across*StarDepth*1–3\. Adding input\-format instructions provides smaller gains, while explicit*CoT*prompting and the GEPA\-optimized*strategy*yield only modest additional improvements\. These results suggest that prompt design is relatively robust once the output structure is clearly specified\.

Table 6:Prompt ablation on Simply Regex varying included prompt components\. For simplicity, we directly use non\-clustered counterexamples without regularization and without agentic refinements\. Each*StarDepth*column aggregates 21 runs, with 3 regexes from every*\#States*, one run per regex\.
### E\.4More Ablation Study

This section provides additional ablation results for the algorithmic components\. Table[7](https://arxiv.org/html/2606.11521#A5.T7)reports the component ablation on the Extended Regex dataset, complementing the Simple Regex ablation in the main text\. The Extended Regex results are broadly consistent with the main findings\. Using counterexamples improves over theStandardsetting, and the largest gains appear on harder tasks\. The agentic variants further improve performance relative to using counterexamples only as additional training data, although the effect of each individual component is not strictly monotonic\. In particular, removing regularization or reflection can sometimes achieve comparable or higher success on this dataset, suggesting that these components affect the balance between exploration, correction, and overfitting rather than acting as uniformly beneficial switches\.

Table 7:Algorithm\-components ablation on Extended Regex Dataset\.*C\-CE*,*Reg\.*,*Refl\.*,*Repair*indicates whether clustered counterexamples, regularization, reflection, repair loops are used\. Each*StarDepth*column aggregates 21 runs, with 3 regexes from every*\#States*, one run per regex\.Figure[8](https://arxiv.org/html/2606.11521#A5.F8)shows the corresponding Pareto frontiers over success rate and total token cost for both Simple and Extended Regex\. The plots illustrate the main trade\-off across ablation settings: simpler counterexample\-guided settings are more token\-efficient, while repair\-based agentic variants often achieve higher success at a larger inference cost\. Overall, these results support the conclusion that counterexamples are the primary driver of improvement, while the remaining workflow components mainly control how effectively and how expensively the model uses this feedback\.

![Refer to caption](https://arxiv.org/html/2606.11521v1/figs/pareto_depth_overlay_all_settings_depth_1_2_3.png)\(a\)Component Ablation on Simple Regex Dataset
![Refer to caption](https://arxiv.org/html/2606.11521v1/figs/pareto_depth_overlay_all_settings_depth_0_1_2.png)\(b\)Component Ablation on Extended Regex Dataset

Figure 8:Pareto frontiers across full ablation settings on Simple and Extended Regex Datasets\.

## Appendix FPrompt Templates

This appendix documents the prompt templates used in our experiments\. Across all settings, the learner is prompted with a task description, a description of the input format, domain\-specific output\-syntax requirements, and an output\-format instruction\. Additional modules are optionally appended for regularization, clustered counterexample interpretation, reflection, repair, and multi\-candidate repair\.

#### Prompt Families\.

We use separate base prompt families forSimple RegexandExtended Regex\. Both ask the model to infer a single regex consistent with labeled examples, but they differ in the target syntax\. TheSimple Regexprompt requires output inpyformlang\.regular\_expression\.Regexsyntax over a small alphabet, while theExtended Regexprompt uses an extended regex syntax that includes conjunction, negation, and additional quantifiers over the fixed alphabetΣ=\[A\-Za\-z0\-9\#\]\\Sigma=\\texttt\{\[A\-Za\-z0\-9\\\#\]\}\.

#### Regularization and Clustered Counterexamples\.

When regularization is enabled, we prepend a constraints block encouraging concise, general regexes\. ForSimple Regex, this constrains the total regex length and Kleene\-star depth; forExtended Regex, the length bound is larger and the allowed star nesting is smaller\. When clustered counterexamples are used, the input\-format section is augmented with an explanation of grouped character classes, since some counterexamples are presented as character\-class abstractions rather than only literal strings\.

#### Agentic Update Prompts\.

In agentic settings, the base prompt is augmented with additional instructions\. Reflection prompts summarize the previous reasoning, previous regex, and new counterexamples, and ask the model to explain what failed and what should change\. Repair prompts ask the model to revise the previous regex using failure feedback and misclassified examples while preserving correct structure from the earlier attempt\. In the multi\-candidate repair ablation, the model is instead asked to generate several repaired regexes, each wrapped in its own<ans\>block, after which the system selects the best one externally\.

#### Training\-Data Formatting\.

In all settings, the labeled examples are appended under a fixed header:

> Training Data \(Each line has one input\-output pair separated by comma\):

Each subsequent line has the form<string\>, <label\>\. Empty strings are represented by leaving the string field blank before the comma\.

#### Base Prompt: Simple Regex\.

```
TASK
You will be given labeled strings and must infer a single regular language
that matches all positives (label 1) and rejects all negatives (label 0).
Output a concise regex in pyformlang.regular_expression.Regex syntax.

INPUT FORMAT
- You receive a block titled Training Data (Each line has one input-output
  pair separated by comma):.
- Each line is "<string>, <label>" where label in {1, 0}. The string may be
  empty; an empty string appears as nothing before the comma (", 1") and
  represents epsilon.
- The alphabet is exactly the set of characters appearing in the data
  (typically a, b, c). Do not introduce other symbols.
{clustered_ce_instr}

PYFORMLANG REGEX SYNTAX
- Union: +
- Concatenation: space-separated symbols.
- Kleene star: *
- Parentheses are allowed for grouping.
- Use the literal epsilon when needed.
- Do NOT use: | . ? [] {} anchors/lookarounds, multi-character tokens,
  or any symbol not present in the training data.

INFERENCE STRATEGY
1) Check start/end constraints.
2) Look for block structure and repetition.
3) Choose between star-of-union and union-of-stars.
4) Prefer compact factorizations.
5) Handle epsilon only when supported.
6) Avoid over-generalization.
7) Verify all positives and negatives before finalizing.

{regularization_instr}
{agentic_reflection_instr}

OUTPUT FORMAT
- First, provide 1-3 concise sentences wrapped in <reasoning>...</reasoning>.
- Then output ONLY the final regex wrapped in <ans>...</ans>.

Training Data (Each line has one input-output pair separated by comma):
{0}
```

#### Base Prompt: Extended Regex\.

```
TASK
You will be given labeled strings and must infer a single regular language
that matches all positives (label 1) and rejects all negatives (label 0).
Output a concise regex in our specified syntax (extended from
pyformlang.regular_expression.PythonRegex).

INPUT FORMAT
- You receive a block titled Training Data (Each line has one input-output
  pair separated by comma):.
- Each line is "<string>, <label>" where label in {1, 0}. The string may be
  empty; an empty string appears as nothing before the comma (", 1") and
  represents epsilon.
- The alphabet is fixed. Do not introduce other symbols.
{clustered_ce_instr}

EXT REGEX SYNTAX (Extended PythonRegex)
- Alphabet is fixed: Sigma = {sigma}
- Concatenation is implicit by adjacency.
- Union: |
- Grouping: ( ... )
- Conjunction / intersection: &
- Negation / complement: ~( ... )
- Quantifiers: *, +, ?, {n}, {n,m}, {n,}
- Do not use dot, negated character classes, anchors, word boundaries,
  lookarounds, or backreferences.

INFERENCE STRATEGY
1) Check prefix and suffix constraints.
2) Look for repeated substrings and block structure.
3) Choose between star-of-union and union-of-stars.
4) Factor common prefixes/suffixes and use character classes when appropriate.
5) Handle epsilon only when required.
6) Avoid over-generalization.
7) Verify positives, negatives, and boundary cases.

{regularization_instr}
{agentic_reflection_instr}

OUTPUT FORMAT
- First, provide 1-3 concise sentences wrapped in <reasoning>...</reasoning>.
- Then output ONLY the final regex wrapped in <ans>...</ans>.

Training Data (Each line has one input-output pair separated by comma):
{0}
```

#### Regularization Blocks\.

```
Simple Regex regularization:
CONSTRAINTS
- Prefer simpler, more general regexes while staying consistent with all
  datapoints.
- Total regex length (ignoring spaces) must be <= 60 characters.
- Nesting depth of Kleene stars must be <= 4.
- Use only symbols that appear in the training data.

Extended Regex regularization:
CONSTRAINTS
- Prefer simpler, more general regexes while staying consistent with all
  datapoints.
- Total regex length (ignoring spaces) must be <= 150 characters.
- Nesting depth of Kleene stars (*, +, ?) must be <= 2.
- Use only symbols that appear in the alphabet (except metacharacters).
```

#### Reflection Prompt\.

```
AGENTIC REFLECTION UPDATE
- You will receive the previous attempt’s reasoning and regex, plus new
  counterexamples.
- First, briefly revise the previous reasoning to explain what failed and
  what should be changed.
- Then output an updated regex consistent with all training data and the
  counterexamples.
- Keep reasoning concise (1-3 sentences) and directly tied to the regex
  revision.

Reasoning of previous epoch:
<previous reasoning>
Regex of previous epoch:
<previous regex>
What failed in the previous epoch:
<feedback note, if available>
New counterexamples (string, label):
<counterexample lines>
```

#### Repair Prompt\.

```
AGENTIC REPAIR UPDATE
- You are repairing the previous attempt using the failure feedback and
  repair examples below.
- Repair goals:
  1) Produce a regex that compiles under the required regex syntax.
  2) Fix the specific mistakes exposed by the counterexamples, disagreement
     witness, and any reported errors.
  3) Preserve the parts of the previous solution that still fit the
     training data.
- If the previous regex is invalid, first correct the syntax.
- If a witness or repair example is handled incorrectly, revise the regex
  so that string gets the correct label.
- Keep the reasoning concise and focused on what changed.
- Return the repaired regex in the required output format.

Reasoning of previous attempt:
<previous reasoning>
Regex of previous attempt:
<previous regex>
Repair feedback:
<feedback note, if available>
Repair examples (string, label):
<repair-example lines>
```

#### Multi\-Candidate Repair Prompt\.

```
CANDIDATE REGEX GENERATION INSTRUCTION
- Do not return only one repaired regex. Generate exactly K diverse
  candidate regexes.
- Wrap every candidate regex in its own <ans> and </ans> block.
- You may include one concise shared <reasoning>...</reasoning> block
  before the candidates.
- Do not include any text after the final </ans> block.
```

#### Direct Output Format\.

```
OUTPUT FORMAT
- Output ONLY the final regex wrapped in <ans> and </ans>.
```

Algorithm 1Outer Agentic Learning Loop1:Initial labeled examples

𝒟0\\mathcal\{D\}\_\{0\}, task instructions

II, reflection instructions

I𝑟𝑒𝑓𝑙I\_\{\\mathit\{refl\}\}, teacher

TT, learner

MM, maximum epochs

EE, repair budget

RR
2:Final hypothesis regex

r^\\hat\{r\}
3:Initialize aggregate training set

𝒞←∅\\mathcal\{C\}\\leftarrow\\emptyset
4:Initialize current best hypothesis

r^←None\\hat\{r\}\\leftarrow\\textsc\{None\}
5:Initialize current best reasoning

z^←None\\hat\{z\}\\leftarrow\\textsc\{None\}
6:for

e=1,…,Ee=1,\\ldots,Edo

7:if

e=1e=1then

8:

c←𝒟0c\\leftarrow\\mathcal\{D\}\_\{0\}
9:

p𝑟𝑒𝑓𝑙←∅p\_\{\\mathit\{refl\}\}\\leftarrow\\emptyset
10:else

11:Teacher

TTchecks

r^\\hat\{r\}against the target language

12:Gather counterexamples from Teacher

TTfeedback:

c=\{\(x,y\):r^\(x\)≠y\}⊆L\(r\)△L⋆c=\\\{\(x,y\):\\hat\{r\}\(x\)\\neq y\\\}\\subseteq L\(r\)\\triangle L^\{\\star\}
13:Build reflection prompt

p𝑟𝑒𝑓𝑙←BuildReflectionPrompt\(I𝑟𝑒𝑓𝑙,r^,z,c\)p\_\{\\mathit\{refl\}\}\\leftarrow\\textsc\{BuildReflectionPrompt\}\(I\_\{\\mathit\{refl\}\},\\hat\{r\},z,c\)
14:endif

15:Update aggregate training set:

𝒞←𝒞∪c\\mathcal\{C\}\\leftarrow\\mathcal\{C\}\\cup c
16:Construct base prompt

pep\_\{e\}from:

\(I,𝒞,p𝑟𝑒𝑓𝑙\)\(I,\\mathcal\{C\},p\_\{\\mathit\{refl\}\}\)
17:Run inner repair loop:

\(r,z,a\)←RepairLoop\(pe,𝒞,T,M,R\)\(r,z,a\)\\leftarrow\\textsc\{RepairLoop\}\(p\_\{e\},\\mathcal\{C\},T,M,R\)
18:Update current best hypothesis:

r^←r\\hat\{r\}\\leftarrow r,

z^←z\\hat\{z\}\\leftarrow z
19:if

a=1a=1then

20:return

r^\\hat\{r\}
21:endif

22:endfor

23:return

r^prev\\hat\{r\}\_\{\\mathrm\{prev\}\}

Algorithm 2Inner Reflection–Repair Loop1:Epoch prompt

pep\_\{e\}, examples

𝒞\\mathcal\{C\}, repair instructions

I𝑟𝑒𝑝I\_\{\\mathit\{rep\}\}, teacher

TT, learner

MM, retry budget

RR
2:Best regex

r^\\hat\{r\}, best reasoning

z^\\hat\{z\}, solved flag

aa
3:Initialize the repair prompt,

p𝑟𝑒𝑝←∅p\_\{\\mathit\{rep\}\}\\leftarrow\\emptyset\.

4:Initialize the best score,

bestScore←−∞\\mathrm\{bestScore\}\\leftarrow\-\\infty\.

5:Initialize the best candidate,

\(r^,z^\)←\(None,None\)\(\\hat\{r\},\\hat\{z\}\)\\leftarrow\(\\textsc\{None\},\\textsc\{None\}\)\.

6:for

t=1,…,Rt=1,\\ldots,Rdo

7:Query the LLM\-learner:

\(r,z\)←M\(pe⊕p𝑟𝑒𝑝\)\(r,z\)\\leftarrow M\(p\_\{e\}\\oplus p\_\{\\mathit\{rep\}\}\)
8:Syntax error message from compilation

m←compile\(r\)m\\leftarrow\\text\{compile\}\(r\)
9:Equivalence flag from equivalence queries:

a←T\.Equiv\(r\)a\\leftarrow T\.\\mathrm\{Equiv\}\(r\)
10:Gather misclassified exmaples

e←\{\(x,y\)∈𝒞:r\(x\)≠y\}e\\leftarrow\\\{\(x,y\)\\in\\mathcal\{C\}:r\(x\)\\neq y\\\}
11:if

a=1a=1then

12:

return\(r,z,1\)\\textbf\{return\}\{\}\(r,z,1\)
13:endif

14:

score←RepairScore\(c,e,𝒞\)\\mathrm\{score\}\\leftarrow\\textsc\{RepairScore\}\(c,e,\\mathcal\{C\}\)
15:if

score\>bestScore\\mathrm\{score\}\>\\mathrm\{bestScore\}then

16:

bestScore←score\\mathrm\{bestScore\}\\leftarrow\\mathrm\{score\}
17:

\(r^,z^\)←\(r,z\)\(\\hat\{r\},\\hat\{z\}\)\\leftarrow\(r,z\)
18:endif

19:Build the next repair prompt

p𝑟𝑒𝑝←BuildRepairPrompt\(I𝑟𝑒𝑝,r,z,m,e\)p\_\{\\mathit\{rep\}\}\\leftarrow\\textsc\{BuildRepairPrompt\}\(I\_\{\\mathit\{rep\}\},r,z,m,e\)
20:endfor

21:

return\(r^,z^,0\)\\textbf\{return\}\{\}\(\\hat\{r\},\\hat\{z\},0\)
Counterexample Guided Learning in the Large using Reasoning Agents

Similar Articles

Learning to reason with LLMs

Logic-Regularized Verifier Elicits Reasoning from LLMs

Reinforcement Learning for Evidence-Seeking Diagnostic Reasoning with Large Language Models

@burny_tech: A Survey on Latent Reasoning "Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, especia…

@Kevin_GuoweiXu: How should LLMs sample on hard reasoning problems during post-training and inference where direct rollouts rarely produ…

Submit Feedback

Similar Articles

Logic-Regularized Verifier Elicits Reasoning from LLMs
Reinforcement Learning for Evidence-Seeking Diagnostic Reasoning with Large Language Models
@burny_tech: A Survey on Latent Reasoning "Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, especia…
@Kevin_GuoweiXu: How should LLMs sample on hard reasoning problems during post-training and inference where direct rollouts rarely produ…