What Drives Interactive Improvement from Feedback?
Summary
This paper investigates whether natural-language feedback leads to improvement beyond repeated attempts alone in multi-turn language agent settings. Using a controlled student-teacher protocol across multiple benchmarks, the authors find that self-generated feedback adds little, while strong external teachers yield larger gains, and that the student's ability to act on feedback is a key bottleneck.
View Cached Full Text
Cached at: 07/01/26, 05:36 AM
# What Drives Interactive Improvement from Feedback?
Source: [https://arxiv.org/html/2606.30774](https://arxiv.org/html/2606.30774)
Bartłomiej Cupiał1,2,\*Jan Łojek1Mikołaj Garstecki1 Szymon Pobłocki1Alicja Ziarko1,2,3Piotr Miłoś1,4,5
###### Abstract
We study when natural\-language feedback produces improvement beyond the gains obtainable from repeated attempts alone\. In multi\-turn language agent setting, higher final accuracy can reflect useful feedback, but it can also arise from resampling, format correction, or additional test\-time computation\. To separate these effects, we introduce a controlled student\-teacher protocol across Omni\-MATH, Codeforces, BBEH Linguini, and ARC\-AGI1, evaluating thirteen open\-weight models in both student and teacher roles\. We compare external feedback, self\-feedback, and unguided self\-refinement, while varying interaction history, task difficulty, and teacher access to privileged task information\. Across settings, we find that multi\-turn improvement is often not evidence of feedback use: self\-generated feedback adds little beyond unguided self\-refinement, whereas the strongest external teachers produce substantially larger feedback\-specific gains, suggesting that useful feedback must provide guidance beyond generic retry\. Dense student\-teacher interaction matrices further show that interactive gains are driven more by the student’s ability to use feedback than by the teacher’s identity, although teacher choice remains important for a fixed student\. These results suggest that feedback\-based agents should be evaluated against repeated\-attempt baselines, and that ability to act on feedback, not merely feedback availability, is a central bottleneck for interactive improvement\. We release our controlled student\-teacher evaluation framework at https://j\-lojek\.github\.io/feedback\-generation\-is\-a\-bottleneck/\.
00footnotetext:Accepted to “RLxF: Reinforcement Learning from World Feedback”\.00footnotetext:\*Corresponding author:bartlomiej\.cupial@gmail\.com\.1University of Warsaw2AKCES NCBR3Princeton University4Mistral AI5Institute of Mathematics, Polish Academy of Sciences## 1Introduction
Figure 1:Multi\-turn trajectories separate mostly by student\.While stronger teachers like Gemma4\-31B provide noticeable boosts for highly capable students, the dominant vertical separation between the curves is driven by student rather than teacher\.Language models \(LMs\) are increasingly deployed as agents that act, observe, and revise rather than as systems that produce a single isolated answer\. Coding agents react to tests and error traces, computer\-use agents observe changing interface states, and assistants revise plans in response to user corrections or clarifications\. Across these settings, feedback can do more than mark success or failure: it can explain what went wrong, identify missing constraints, and suggest a path to recovery\. Improving from feedback is therefore becoming an important capability in its own right: an agent must recognize which parts of the feedback are actionable, preserve the parts of its attempt that are already correct, and revise the parts that caused failure\.
Studying improvement from feedback requires separating several mechanisms that are usually entangled\. A model may improve over turns because it receives useful corrective information, but it may also improve simply because it samples another attempt with more computation\. When feedback does help, the limiting factor may lie on either side of the interaction: the teacher may fail to identify the relevant error, or the student may fail to translate a correct diagnosis into a better solution\. We therefore distinguish feedback generation from improving from feedback by evaluating separate student and teacher models under controlled interaction histories, teacher\-information conditions, and repeated\-attempt baselines\. This lets us ask not only whether feedback improves performance, but when it improves performance beyond self\-refinement, and which part of the feedback system constrains that improvement\.
We answer these questions in a student\-teacher protocol on verifiable tasks\. In each episode, a student model attempts a problem\. If the attempt is incorrect, a teacher model provides natural\-language feedback, and the student revises its answer\. The verifier makes every attempt measurable, while the student\-teacher split lets us vary the student, the teacher, the interaction budget, the visible history, and the teacher’s access to privileged task information\. We instantiate this evaluation across four verifiable reasoning environments: Omni\-MATH, Codeforces, BBEH Linguini, and ARC\-AGI1, and run student\-teacher interaction grids in which each model acts in both roles\.
Across these settings, we find that multi\-turn improvement is often not evidence of feedback use: repeated attempts explain a large share of the gain, self\-feedback improves over generic retry inconsistently, the strongest feedback\-specific improvements come from high\-quality external teachers, and the student’s ability to use feedback explains more variation than the teacher’s identity\.
In summary, our contributions are as follows:
- •We introduce a controlled zero\-shot student\-teacher evaluation for natural\-language feedback across four verifiable reasoning environments\.
- •We show that multi\-turn improvement is not necessarily feedback use: self\-feedback adds little beyond self\-refinement, while the best external teachers add substantial feedback\-specific gains\.
- •We find that interactive performance is primarily student\-dependent: student identity explains substantially more gain variation than teacher identity, although teacher choice can matter for a fixed student\.
- •We show that longer histories and privileged teacher information are selective interventions rather than default improvements, with benefits depending on task and model capability\.
## 2Related Work
#### Self\-refinement and repeated sampling\.
Large language models are increasingly used as interactive agents that solve tasks over multiple turns rather than through a single generation\(Wanget al\.,[2024](https://arxiv.org/html/2606.30774#bib.bib19)\)\. In such settings, models must follow sequential instructions and incorporate new feedback across turns\(Labanet al\.,[2025](https://arxiv.org/html/2606.30774#bib.bib18)\)\. However, multi\-turn improvement does not necessarily imply that the model has used feedback: later attempts also receive more test\-time computation and additional samples\. This distinction is especially important because additional computation is known to improve performance through mechanisms such as chain\-of\-thought prompting\(Weiet al\.,[2023](https://arxiv.org/html/2606.30774#bib.bib27)\), and Best\-of\-NNsampling\(Wanget al\.,[2023](https://arxiv.org/html/2606.30774#bib.bib28)\)\. Related agent systems such as Reflexion use verbal feedback or memories of past failures to improve future attempts\(Shinnet al\.,[2023](https://arxiv.org/html/2606.30774#bib.bib16)\)\. Our work builds on this line of work by explicitly comparing feedback\-conditioned improvement against unguided self\-refinement, allowing us to estimate how much of the observed gain is feedback\-specific rather than a consequence of repeated attempts\.
#### Post\-training LMs with teacher feedback\.
Recent post\-training methods increasingly frame model improvement as a student\-teacher problem, where a student model learns from signals produced by a stronger teacher, verifier, or environment\. Standard approaches include supervised fine\-tuning, preference optimization, reinforcement learning from AI feedback, and reinforcement learning with verifiable rewards\(Ouyanget al\.,[2022](https://arxiv.org/html/2606.30774#bib.bib26); Leeet al\.,[2024](https://arxiv.org/html/2606.30774#bib.bib25); Rafailovet al\.,[2024](https://arxiv.org/html/2606.30774#bib.bib24); Guoet al\.,[2025](https://arxiv.org/html/2606.30774#bib.bib29); Shaoet al\.,[2024](https://arxiv.org/html/2606.30774#bib.bib30)\)\. These methods are effective, but their feedback signals are often demonstrations, pairwise preferences, scalar rewards, or outcome\-level verification\. Such signals can improve behavior, but they provide limited information about which part of a student’s attempted solution
Figure 2:\(Top\) Single Didactic Interaction\.We transform single\-turn problems into multi\-turn didactic interactions\. A teacher model, conditioned on optionally privileged information \(e\.g\., the ground\-truth solution\), provides natural language feedback to a student model without revealing the final answer, guiding it to correct its errors\.\(Bottom\) Multi\-turn feedback protocol\.Each failed student attempt is converted into teacher feedback for the next attempt\. This cycle repeats until the predefined maximum number of turns is reached or the student correctly solves the problem\.caused failure\. This limitation has motivated recent work on denser teacher supervision, including on\-policy distillation, where trajectories are sampled from the student and a teacher provides supervision on the student\-generated states\(Lu and Lab,[2025](https://arxiv.org/html/2606.30774#bib.bib31)\)\. It has also motivated work on synthetic and interactive data generation, especially as high\-quality human\-generated training data becomes increasingly limited\(Villaloboset al\.,[2024](https://arxiv.org/html/2606.30774#bib.bib1); Kessleret al\.,[2026](https://arxiv.org/html/2606.30774#bib.bib20)\)\. In verifiable domains, natural\-language feedback can be used to construct multi\-turn training trajectories or improve interactive learning from feedback\(Klissarovet al\.,[2026](https://arxiv.org/html/2606.30774#bib.bib2); Cooket al\.,[2026](https://arxiv.org/html/2606.30774#bib.bib3)\)\. Related work has also used natural\-language feedback as a fine\-tuning signal or as part of reinforcement\-learning\-style updates\(Choudhury and Sodhi,[2024](https://arxiv.org/html/2606.30774#bib.bib21); Fenget al\.,[2024](https://arxiv.org/html/2606.30774#bib.bib22)\)\. Our work studies the test\-time analogue of this post\-training shift: a teacher observes a student’s failed attempt and provides diagnostic feedback, and we ask when this feedback improves performance beyond repeated attempts, whether stronger teachers provide more useful corrections, and whether the main bottleneck lies in feedback generation or student uptake\.
#### Feedback generation and tutoring\.
Several works study the quality of feedback itself\.\(Xuet al\.,[2025](https://arxiv.org/html/2606.30774#bib.bib17)\)identify conditions under which language feedback can support no\-regret learning, including unbiased feedback, the agent’s ability to translate feedback into evaluations of candidate hypotheses, and sufficient representational coverage of the true environment\. In mathematical reasoning,\(Liet al\.,[2024](https://arxiv.org/html/2606.30774#bib.bib14)\)evaluate whether models can detect and correct errors in flawed reasoning traces, while\(Guptaet al\.,[2025](https://arxiv.org/html/2606.30774#bib.bib13)\)analyze the tutoring behavior of GPT\-family models beyond final\-answer correctness\. These works focus on whether feedback is valid, helpful, or pedagogically appropriate\. Our work complements them by evaluating the full student\-teacher interaction: a teacher may generate useful feedback, but improvement also depends on whether the student can act on it\.
## 3Experimental Setup
In this section we describe our experimental setup\.
#### Environments\.
We restrict our evaluation to environments, where for each problem and solution pair, it is possible to verify whether the given solution is correct\. This assumption is often made in the literature \. To make our results robust, we consider four challenging environments, that focus on different domains and test for different abilities:Omni\-MATH\(Gao and others,[2024](https://arxiv.org/html/2606.30774#bib.bib8)\)Olympiad\-level mathematical reasoning benchmark\.Codeforces\(juvi21,[2024](https://arxiv.org/html/2606.30774#bib.bib9)\), a competitive programming benchmark, theLinguinitask fromBIG\-Bench Extra Hard \(BBEH\)\(Kazemi and others,[2025](https://arxiv.org/html/2606.30774#bib.bib10)\), which specifically challenges the model’s ability to learn new rules andARC\-AGI\(Chollet,[2019](https://arxiv.org/html/2606.30774#bib.bib12)\), which presents grid transformation puzzles solvable only by inferring abstract rules from a few examples\.
#### Interaction protocol\.
For investigating the influence of feedback, we consider a realistic scenario of a back\-and\-forth conversation between a model and an entity capable of giving it feedback \(for simplicity, we call them student and teacher\)\. Each episode is a trajectoryτ=\(problem formulation,s1,a1,t1,…,sn,an\)\\tau=\(\\text\{problem formulation\},s\_\{1\},a\_\{1\},t\_\{1\},\\dots,s\_\{n\},a\_\{n\}\), where the initial model input contains the problem formulation\. In turnii,sis\_\{i\}is the student’s output,aia\_\{i\}is the student’s answer\. In the case where the task\-specific verifier recognizedaia\_\{i\}as the wrong answer,tit\_\{i\}is the teacher’s feedback on the student’s attempt\. Then, the next turni\+1i\+1starts, where the student model attempts to solve the problem again, with the added information of the teacher’s feedback\. The conversation continues until a maximalnnis reached or the student’s answer is correct\. Our interaction protocol is visualized in[Figure˜2](https://arxiv.org/html/2606.30774#S2.F2)\.
#### Student’s input\.
Lead by the intuition that the recent feedback is more important than old feedback, we restrict the student’s input to onlyhhhistorical turns, wherehhis a hyperparameter\. Therefore, the student’s input is\(problem formulation,sk,ak,tk,…,sn,an\)\(\\text\{problem formulation\},s\_\{k\},a\_\{k\},t\_\{k\},\\dots,s\_\{n\},a\_\{n\}\), wherek=max\(n−h,0\)k=\\text\{max\}\(n\-h,0\)\.
#### Teacher’s input\.
In previous works\(Klissarovet al\.,[2026](https://arxiv.org/html/2606.30774#bib.bib2); Cooket al\.,[2026](https://arxiv.org/html/2606.30774#bib.bib3)\), the teacher model was granted access to specific task knowledge, denoted asktk\_\{t\}\. Due to the varying nature of the evaluation environments we have chosen, this additional information in this work have taken different forms depending on dataset used: an in\-depth step\-by\-step solution, a plain final answer, or an error stack trace derived from the student’s generated code\. Because smaller models are prone to inadvertently leaking the direct solution when given such context, we establish a strict baseline where no additional information is provided to the teacher \(kt=nonek\_\{t\}=\\text\{none\}\)\. However, to better understand the impact of privileged context, we also include ablations that selectively provide this knowledge based on the specific environment\. The configurations are as follows: for the Math environment, the teacher receives either no information, the final answer, or the full solution; for the BBEH environment, the knowledge is restricted to either no information or the final answer; and for the Code environment, the teacher receives either no additional information or the complete solution\.
![[Uncaptioned image]](https://arxiv.org/html/2606.30774v1/x2.png)
Figure 3:Dense matrices separate zero\-shot ability from feedback\-mediated performance\.Rows are students and columns are teachers for Omni\-MATH\. The left panel reports acc@1 before feedback, so values are constant across each student row; the right panel reports acc@10 after up to ten attempts with teacher feedback\. Comparing the panels shows that feedback changes final performance substantially, some brighter columns highlight strong teachers, with the strongest structure remaining row\-wise, indicating that the student model is a major driver of interactive outcomes\.
#### Models considered\.
To evaluate the performance of both teacher and student, we choose from thirteen models from four open\-weight families, spanning a parameter range of 1B to 120B, both dense and Mixture\-of\-Experts \(MoE\) architectures, and models with and without dedicated reasoning modes\. These are series of Gemma4\-\{E2B, E4B, 26B\-A4B, 31B\}\(Gemma Team,[2026](https://arxiv.org/html/2606.30774#bib.bib5)\); the collection of Llama\-3\.x\-\{1B, 3B, 8B\}\(Dubeyet al\.,[2024](https://arxiv.org/html/2606.30774#bib.bib4)\); the family of Qwen3\.5\-\{2B, 4B, 9B, 27B\}\(Team,[2026](https://arxiv.org/html/2606.30774#bib.bib6)\); and the GPT\-OSS models: GPTOSS\-\{20B, 120B\}\(OpenAI,[2025](https://arxiv.org/html/2606.30774#bib.bib7)\)\.
Figure 4:Most feedback gains arrive in the first few turns\.Each panel reports the marginal improvement from one attempt to the next, averaged over dense\-matrix student\-teacher pairs\. The first feedback\-conditioned attempt produces the largest gain, and later turns contribute progressively smaller improvements\. This front\-loaded pattern suggests that short feedback trajectories capture most of the benefit, while longer rollouts are mainly useful for harder cases with slower recovery\.
#### Ablation tier\.
To further deepen our research, we conducted ablation studies to isolate the effect of different parameters on multi\-turn performance\. Due to the large search space and computational limitations, we focused solely on the Gemma\-4 family, resulting in a4×44\\times 4student\-teacher matrix for each ablation\. This provides us with a relatively exhaustive set of interactions between smaller and larger models\. Because of the conversation setup, we test whether access to information from previous turns yields any benefits\. For this purpose, we vary the maximum visible history length in\{1,2,3,4,5\}\\\{1,2,3,4,5\\\}\. Additionally, we test how the number of turns, also called the interaction budget, affects the final accuracy\. We evaluate it forK∈\{1,5,10\}K\\in\\\{1,5,10\\\}\. Since the conversation progresses sequentially, we can generate the full sequence forK=10K=10and subsequently extract the states forK=5K=5andK=1K=1during post\-processing\. Finally, to decompose the gains from additional test\-time compute, we added a self\-refinement baseline for each model\. This approach maintains the student’s history but does not receive feedback from the teacher\.
#### Metrics\.
To evaluate the different configurations and teacher\-student pairs, we measure the accuracy at various steps of the interaction trajectory, denoted asacc@K\\text\{acc\}@K\. For a more comprehensive analysis, we introduce additional metrics: raw gain \(acc@K−acc@1\\text\{acc\}@K\-\\text\{acc\}@1\), normalized gain \(acc@K−acc@11−acc@1\\frac\{\\text\{acc\}@K\-\\text\{acc\}@1\}\{1\-\\text\{acc\}@1\}\), and the Area Under the Curve computed as the mean cumulative accuracy over all turns \(1T∑k=1Tacc@k\\frac\{1\}\{T\}\\sum\_\{k=1\}^\{T\}\\text\{acc\}@k\)\. Bothacc@K\\text\{acc\}@Kand AUC serve as primary measures of the system’s overall problem\-solving capability, additionally reflecting how quickly the student improves across turns\. Raw and normalized gain instead isolate how much of that improvement is attributable to the teacher’s feedback and iterative reasoning, independent of the student’s starting level\. Notably, normalized gain provides a saturation\-adjusted comparison across different model scales by accounting for benchmark saturation, ensuring that larger models are not penalized for possessing a higher initial baseline accuracy \(acc@1\\text\{acc\}@1\)\.
Table 1:Disentangling interaction, feedback, and feedback quality\.Self\-refinement measures improvement from additional attempts without external feedback; self\-feedback adds feedback generated by the same model; and best feedback selects the strongest available teacher for each student\. Cells report meanacc@10\\mathrm\{acc\}@10with meangain@10\\mathrm\{gain\}@10in parentheses, in percentage points, averaged over students with all three conditions in the same environment\. Full per\-student results are reported in[table˜7](https://arxiv.org/html/2606.30774#A2.T7)\.
## 4What matters for improvement from feedback?
In this section we present our experimental results\. We organize them around the following six empirical questions, which we answer through research findings \(RFs\):
1. 1\.How much of multi\-turn improvement comes from feedback rather than repeated attempts? \(RF1\)
2. 2\.Can a student that is weaker on the first attempt become stronger after receiving feedback? \(RF2\)
3. 3\.Is interactive performance driven primarily by the student receiving feedback or by the teacher generating it? \(RF3\)
4. 4\.Is a teacher’s single\-turn task performance a good predictor of its competence as an interactive teacher? \(RF4\)
5. 5\.How much improvement do longer interaction histories buy, given their cost and context\-length tradeoffs? \(RF5\)
6. 6\.When does teacher access to privileged information improve interactive outcomes? \(RF6\)
![[Uncaptioned image]](https://arxiv.org/html/2606.30774v1/x4.png)
Figure 5:Student identity explains more gain variation than teacher identity\.Points show role means at K = 10, averaging over the opposite role; segments show the min\-to\-max spread\. Across environments, student means span a much wider range than teacher means\.
### 4\.1Feedback\-specific gains and interaction budget
#### RF1: Feedback\-specific gains are limited unless the teacher provides substantially better guidance than generic retry or self\-feedback\.
The feedback\-control comparison in[table˜1](https://arxiv.org/html/2606.30774#S3.T1)shows that multi\-turn improvement is not, by itself, evidence that a student benefits from feedback\. Averaged over matched students, self\-feedback is inconsistent relative to the self\-refinement baseline: it changesacc@10\\mathrm\{acc\}@10by−1\.2\-1\.2to\+8\.7\+8\.7points andgain@10\\mathrm\{gain\}@10by only\+0\.1\+0\.1to\+6\.0\+6\.0points across environments\. The largest self\-feedback lift appears on ARC\-AGI1, while BBEH Linguini slightly favors unguided self\-refinement in final accuracy\. Thus, simply adding a feedback message, even one generated by the same model, does not guarantee feedback\-specific improvement\. In contrast, selecting the best available teacher for each student adds9\.29\.2to16\.616\.6points inacc@10\\mathrm\{acc\}@10and9\.99\.9to17\.317\.3points ingain@10\\mathrm\{gain\}@10over self\-refinement\. This suggests that gains depend not just on the presence of a feedback message, but on whether the feedback provides corrective information that the student can use\.
#### Interaction budget also matters\.
The marginal\-gain curves in[Figure˜4](https://arxiv.org/html/2606.30774#S3.F4)show that the benefits of additional turns are front\-loaded\. Attempt 2, the first with feedback, accounts for30\.5%30\.5\\%–58\.1%58\.1\\%of totalgain@10\\mathrm\{gain\}@10, and attempts throughK=5K=5recover70\.8%70\.8\\%–86\.8%86\.8\\%\. Later turns can still matter for hard long\-horizon cases, especially ARC\-AGI1, but their average marginal return is small\. For data collection, this favors shorter trajectories with informative feedback over long rollouts whose gains mostly come from repeated attempts\.
### 4\.2Dense model matrix: separating student and teacher effects
We next analyze the dense model matrix described in[section˜3](https://arxiv.org/html/2606.30774#S3): each model is evaluated both as a student and as a teacher across the four environments, with episodes run for up toK=10K=10student attempts\. This cross\-product lets us ask whether interactive outcomes are explained more by the model receiving feedback or by the model generating it\.[Figure˜3](https://arxiv.org/html/2606.30774#S3.F3)shows a concrete slice of this matrix for Omni\-MATH, while the full dense matrices for all environments and metrics are reported in[section˜B\.3](https://arxiv.org/html/2606.30774#A2.SS3)\.
Figure 6:First\-turn accuracy and feedback uptake are separate capabilities\.Each panel shows one environment, and each point is a student model averaged over teachers\. Color denotes model family\. The y\-axis reportsngain@10=\(acc@10−acc@1\)/\(1−acc@1\)\\mathrm\{ngain\}@10=\(\\mathrm\{acc\}@10\-\\mathrm\{acc\}@1\)/\(1\-\\mathrm\{acc\}@1\), the fraction of initially unsolved problems recovered by the interaction\. Dashed curves mark equal finalacc@10\\mathrm\{acc\}@10values\.#### RF2: First\-turn accuracy and feedback adherence are different capabilities\.
A high initial solve rate does not imply that a model makes the best use of feedback\.[Figure˜6](https://arxiv.org/html/2606.30774#S4.F6)shows that this distinction is not a one\-off comparison: students with similaracc@1\\mathrm\{acc\}@1can have very differentngain\\mathrm\{ngain\}\. A qualitative example can be seen in[A\.9](https://arxiv.org/html/2606.30774#A1.SS9)\. It showcases how teacher and student cooperate to get to correct solution\. A concrete comparison is given in[table˜6](https://arxiv.org/html/2606.30774#A2.T6)of[section˜B\.2](https://arxiv.org/html/2606.30774#A2.SS2): GPT\-OSS\-120B starts higher on Omni\-MATH, Codeforces, and BBEH Linguini, but Gemma4\-31B usually has much largergain\\mathrm\{gain\}and reaches the better final outcome\. ARC\-AGI1 is the complementary case: GPT\-OSS\-120B has slightly largergain\\mathrm\{gain\}, but Gemma4\-31B starts much higher and remains the best final student\. Thusacc@1\\mathrm\{acc\}@1measures what the student can solve immediately, whilengain\\mathrm\{ngain\}measures how efficiently it converts feedback rounds into recovery opportunities\.
#### RF3: Interactive performance is mostly student\-dependent\.
We use the dense matrix to marginalize over roles: for each student, we average performance over teachers; for each teacher, we average performance over students\.[Figure˜5](https://arxiv.org/html/2606.30774#S4.F5)shows the resulting role means forgain\\mathrm\{gain\}atK=10K=10\. Across all environments, the spread across students is much larger than the spread across teachers, indicating that the model receiving feedback explains more of the interactive gain than the model generating it\. The appendix fixed\-effect analysis in[table˜3](https://arxiv.org/html/2606.30774#A2.T3)supports the same conclusion\. Student fixed effects explain77\.1%77\.1\\%–96\.5%96\.5\\%of pair\-levelgain\\mathrm\{gain\}variation, while teacher fixed effects explain only1\.3%1\.3\\%–12\.4%12\.4\\%\. Adding teacher identity after student identity gives a small incremental improvement, at most0\.1240\.124R2R^\{2\}\. This does not mean that teacher identity is irrelevant\. For a fixed student, the gap between its best and worst teacher can be substantial, as shown by the appendix teacher\-range analysis in[table˜4](https://arxiv.org/html/2606.30774#A2.T4)\. The dense\-matrix results suggest that teacher effects are smaller, but can become meaningful once we condition on a particular student\.
#### RF4: Teacher task competence only partially predicts interactive teacher competence\.
We use each model’sacc@1\\mathrm\{acc\}@1as a student as a noisy proxy for single\-turn task competence, and ask whether this ranking predicts its performance as a teacher for a fixed student\. If models that solve more problems on the first turn were always stronger teachers, teacher rank and interactiveauc\\mathrm\{auc\}would be monotonic within each student\.[Table˜5](https://arxiv.org/html/2606.30774#A2.T5)shows that this monotonicity only partially holds\. Codeforces has the strongest relation, with mean Spearmanρ=0\.56±0\.02\\rho=0\.56\\pm 0\.02, but the other environments are weaker \(0\.21±0\.060\.21\\pm 0\.06to0\.39±0\.060\.39\\pm 0\.06\)\. Inversions are also common: in28\.9%±1\.1%28\.9\\%\\pm 1\.1\\%–41\.4%±1\.9%41\.4\\%\\pm 1\.9\\%of comparable teacher pairs, the lower\-acc@1\\mathrm\{acc\}@1teacher produces higher interactiveauc\\mathrm\{auc\}\. This suggests that single\-turn solving ability and teaching ability are related but not equivalent\. A stronger solver may be more likely to know the correction, but interactive performance also depends on diagnosing the student’s specific error, phrasing actionable feedback, and matching the student’s ability to use it\.
### 4\.3Gemma4 ablations: interaction history and privileged teacher information
The dense matrix uses a relatively cheap and scalable protocol: short visible history and no ground\-truth answer or solution access for the teacher\. We use this as the main setting because both alternatives create tradeoffs\. Longer histories may help models diagnose repeated failures, but they increase cost and can run into context limits\. Privileged information may make feedback more accurate, but answers or solutions are not always available, may be costly to obtain, and can leak into the teacher’s feedback\. The Gemma4 ablations measure how much performance these more expensive sources of information buy\.
Figure 7:History and privileged teacher information help, but for different reasons\.Each panel reportsaucaucatK=10K=10as we vary the maximum number of previous interaction turns retained in the student and teacher prompts for Gemma4 student\-teacher pairs, with line color indicating teacher reference access\.#### RF5: Longer visible history is a capability\-dependent tradeoff\.
Longer histories expose repeated failure modes to the teacher and give the student access to earlier corrections\. This information is not free: every additional visible turn increases cost and makes context overflow more likely\. It is also not automatically useful, the models must be able to identify which parts of the history matter and use them to revise the next attempt\. The max\-history ablation shows little change from history 1 to 2, followed by a larger jump at history 3\. Averaged over teacher\-access settings and Gemma4 pair cells, moving from history 1 to 5 increasesauc\\mathrm\{auc\}by0\.0940\.094on Omni\-MATH\. In the full grid, Codeforces shows the largest average history benefit, withauc\\mathrm\{auc\}increasing by0\.1230\.123\. BBEH Linguini shows a smaller but positive gain, rising from0\.2100\.210at history 1 to0\.2480\.248at history 3\. These averages suggest that history can help, but the benefit should be interpreted as conditional on model capacity rather than as a free improvement from longer context\.
This capability dependence is visible when we split by teacher\-student strength\. In the matched history\-5 minus history\-1 diagnostic \([figure˜19](https://arxiv.org/html/2606.30774#A2.F19)\), Codeforces gains about0\.210\.21auc\\mathrm\{auc\}for pairs where the teacher’s single\-turnacc@1\\mathrm\{acc\}@1exceeds the student’s, compared with about0\.050\.05for the remaining pairs\. This is consistent with stronger teachers using history to diagnose repeated failures, although the ablation still does not isolate teacher adaptation from student\-side conditioning\.
#### RF6: Privileged information helps selectively, not uniformly\.
We ask whether giving the teacher additional task\-side information reliably improves the interaction\. The answer is task\-dependent\. At history 5, BBEH Linguini shows the largest effect: answer access increasesauc\\mathrm\{auc\}from0\.1510\.151to0\.3310\.331andacc@10\\mathrm\{acc\}@10from0\.1950\.195to0\.4630\.463\. This suggests that privileged information is most useful when the teacher would otherwise struggle to infer the target correction\. This large BBEH separation is visible in[figure˜7](https://arxiv.org/html/2606.30774#S4.F7)The effect is much smaller elsewhere\. On Omni\-MATH, answer access improvesauc\\mathrm\{auc\}from0\.4280\.428to0\.4750\.475, while full solution access reaches only0\.4710\.471\. On Codeforces, where the teacher already receives structured execution feedback, adding the reference solution changesauc\\mathrm\{auc\}only from0\.5470\.547to0\.5560\.556\. ARC\-AGI1 shows a smaller but measurable answer\-access effect:auc\\mathrm\{auc\}increases from0\.3030\.303to0\.3280\.328, whileacc@10\\mathrm\{acc\}@10increases from0\.3620\.362to0\.4080\.408\. Thus, additional teacher\-side information is not automatically useful; it helps when it resolves a real teacher uncertainty and can be converted into actionable feedback\. The teacher\-level lift diagnostic in[figure˜20](https://arxiv.org/html/2606.30774#A2.F20)shows the same non\-uniformity across teachers and environments, and the full access summary is shown in[figure˜18](https://arxiv.org/html/2606.30774#A2.F18)\. Because answers and solutions are not always available and may leak into the feedback, we treat these conditions as upper\-bound ablations rather than the default scalable protocol\. To quantify how privileged information affects solution leakage from teacher to student, we conduct a leakage audit in[figure˜8](https://arxiv.org/html/2606.30774#A1.F8)of[section˜A\.6](https://arxiv.org/html/2606.30774#A1.SS6)\.
## 5Discussion
#### Summary
Our findings have practical consequences for evaluating and building feedback\-based agents\. First, repeated\-attempt and self\-feedback controls should be included whenever multi\-turn gains are reported; otherwise, extra inference compute can be mistaken for feedback use\. Second, teacher selection should not be based only on single\-turn task accuracy\. The teacher monotonicity analysis shows that stronger solvers are not always stronger teachers for a particular student\. Third, longer histories and privileged teacher information should be treated as targeted interventions rather than default improvements: they help when they resolve a real diagnostic bottleneck, but they also increase cost, context pressure and leakage risk\.
#### Limitations
Our study has several limitations\. First, the current ARC\-AGI1 analysis does not yet include the same ablations as the other environments\. Although we provided preliminary qualitative analysis of the feedback trajectories, conducting a broader analysis in the future will be necessary to achieve more conclusive results\. These analyses are important for understanding whether improvements come from actionable feedback rather than formatting artifacts, repeated sampling, or direct leakage from the teacher context\. Second, our experiments focus on a zero\-shot setting with short\-horizon verifiable tasks and open\-weight models\. This makes the protocol controlled and reproducible, but it leaves open how the conclusions transfer to models trained to use feedback, longer\-horizon interactive domains, and closed\-source systems\.
#### Future Work
A natural direction for future work is to connect this diagnostic protocol to post\-training\. In supervised fine\-tuning or reinforcement\-learning settings, feedback trajectories could be used not only to measure whether a model improves within an episode, but also to train models that become better feedback users over time\. Extending the study to longer\-horizon domains such as computer usage would test whether the same student\-side bottleneck appears when feedback must be integrated across many actions, persistent state, and delayed consequences\. Finally, comparing open\-weight and closed\-source models would clarify which findings reflect general properties of current language models and which depend on the model families studied here\.
## References
- On the measure of intelligence\.arXiv preprint arXiv:1911\.01547\.Cited by:[§3](https://arxiv.org/html/2606.30774#S3.SS0.SSS0.Px1.p1.1)\.
- S\. Choudhury and P\. Sodhi \(2024\)Better than your teacher: llm agents that learn from privileged ai feedback\.External Links:2410\.05434,[Link](https://arxiv.org/abs/2410.05434)Cited by:[§2](https://arxiv.org/html/2606.30774#S2.SS0.SSS0.Px2.p2.1)\.
- J\. Cook, D\. Antognini, M\. Klissarov, C\. Musat, and E\. Grefenstette \(2026\)Learning to learn from language feedback with social meta\-learning\.External Links:2602\.16488,[Link](https://arxiv.org/abs/2602.16488)Cited by:[§2](https://arxiv.org/html/2606.30774#S2.SS0.SSS0.Px2.p2.1),[§3](https://arxiv.org/html/2606.30774#S3.SS0.SSS0.Px4.p1.2)\.
- A\. Dubey, A\. Jauhri, A\. Pandey,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§3](https://arxiv.org/html/2606.30774#S3.SS0.SSS0.Px5.p1.1)\.
- X\. Feng, Z\. Wan, M\. Yang, Z\. Wang, G\. A\. Koushik, Y\. Du, Y\. Wen, and J\. Wang \(2024\)Natural language reinforcement learning\.External Links:2402\.07157,[Link](https://arxiv.org/abs/2402.07157)Cited by:[§2](https://arxiv.org/html/2606.30774#S2.SS0.SSS0.Px2.p2.1)\.
- Y\. Gaoet al\.\(2024\)Omni\-math: a universal olympiad level mathematics benchmark for large language models\.External Links:2410\.07985Cited by:[§3](https://arxiv.org/html/2606.30774#S3.SS0.SSS0.Px1.p1.1)\.
- G\. D\. Gemma Team \(2026\)Gemma 4: open models for advanced reasoning and efficiency\.Technical reportGoogle DeepMind\.Cited by:[§3](https://arxiv.org/html/2606.30774#S3.SS0.SSS0.Px5.p1.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi, X\. Zhang, X\. Yu, Y\. Wu, Z\. F\. Wu, Z\. Gou, Z\. Shao, Z\. Li, Z\. Gao, A\. Liu, B\. Xue, B\. Wang, B\. Wu, B\. Feng, C\. Lu, C\. Zhao, C\. Deng, C\. Ruan, D\. Dai, D\. Chen, D\. Ji, E\. Li, F\. Lin, F\. Dai, F\. Luo, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Xu, H\. Ding, H\. Gao, H\. Qu, H\. Li, J\. Guo, J\. Li, J\. Chen, J\. Yuan, J\. Tu, J\. Qiu, J\. Li, J\. L\. Cai, J\. Ni, J\. Liang, J\. Chen, K\. Dong, K\. Hu, K\. You, K\. Gao, K\. Guan, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Zhao, L\. Wang, L\. Zhang, L\. Xu, L\. Xia, M\. Zhang, M\. Zhang, M\. Tang, M\. Zhou, M\. Li, M\. Wang, M\. Li, N\. Tian, P\. Huang, P\. Zhang, Q\. Wang, Q\. Chen, Q\. Du, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. J\. Chen, R\. L\. Jin, R\. Chen, S\. Lu, S\. Zhou, S\. Chen, S\. Ye, S\. Wang, S\. Yu, S\. Zhou, S\. Pan, S\. S\. Li, S\. Zhou, S\. Wu, T\. Yun, T\. Pei, T\. Sun, T\. Wang, W\. Zeng, W\. Liu, W\. Liang, W\. Gao, W\. Yu, W\. Zhang, W\. L\. Xiao, W\. An, X\. Liu, X\. Wang, X\. Chen, X\. Nie, X\. Cheng, X\. Liu, X\. Xie, X\. Liu, X\. Yang, X\. Li, X\. Su, X\. Lin, X\. Q\. Li, X\. Jin, X\. Shen, X\. Chen, X\. Sun, X\. Wang, X\. Song, X\. Zhou, X\. Wang, X\. Shan, Y\. K\. Li, Y\. Q\. Wang, Y\. X\. Wei, Y\. Zhang, Y\. Xu, Y\. Li, Y\. Zhao, Y\. Sun, Y\. Wang, Y\. Yu, Y\. Zhang, Y\. Shi, Y\. Xiong, Y\. He, Y\. Piao, Y\. Wang, Y\. Tan, Y\. Ma, Y\. Liu, Y\. Guo, Y\. Ou, Y\. Wang, Y\. Gong, Y\. Zou, Y\. He, Y\. Xiong, Y\. Luo, Y\. You, Y\. Liu, Y\. Zhou, Y\. X\. Zhu, Y\. Huang, Y\. Li, Y\. Zheng, Y\. Zhu, Y\. Ma, Y\. Tang, Y\. Zha, Y\. Yan, Z\. Z\. Ren, Z\. Ren, Z\. Sha, Z\. Fu, Z\. Xu, Z\. Xie, Z\. Zhang, Z\. Hao, Z\. Ma, Z\. Yan, Z\. Wu, Z\. Gu, Z\. Zhu, Z\. Liu, Z\. Li, Z\. Xie, Z\. Song, Z\. Pan, Z\. Huang, Z\. Xu, Z\. Zhang, and Z\. Zhang \(2025\)DeepSeek\-r1 incentivizes reasoning in llms through reinforcement learning\.Nature645\(8081\),pp\. 633–638\.External Links:ISSN 1476\-4687,[Link](http://dx.doi.org/10.1038/s41586-025-09422-z),[Document](https://dx.doi.org/10.1038/s41586-025-09422-z)Cited by:[§2](https://arxiv.org/html/2606.30774#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Gupta, J\. Reddig, T\. Calo, D\. Weitekamp, and C\. J\. MacLellan \(2025\)Beyond final answers: evaluating large language models for math tutoring\.External Links:2503\.16460,[Link](https://arxiv.org/abs/2503.16460)Cited by:[§2](https://arxiv.org/html/2606.30774#S2.SS0.SSS0.Px3.p1.1)\.
- juvi21 \(2024\)Competitive\-coding\-benchmark\.Note:[https://github\.com/juvi21/Competitive\-Coding\-Benchmark/](https://github.com/juvi21/Competitive-Coding-Benchmark/)Cited by:[§3](https://arxiv.org/html/2606.30774#S3.SS0.SSS0.Px1.p1.1)\.
- S\. M\. Kazemiet al\.\(2025\)BIG\-bench extra hard\.arXiv preprint arXiv:2502\.19187\.Cited by:[§3](https://arxiv.org/html/2606.30774#S3.SS0.SSS0.Px1.p1.1)\.
- S\. Kessler, M\. Xia, D\. M\. Diaz, D\. Han, H\. Heshemi, S\. Rajmohan, V\. Ruehle, and J\. T\. Ash \(2026\)Towards active synthetic data generation for finetuning language models\.External Links:2512\.00884,[Link](https://arxiv.org/abs/2512.00884)Cited by:[§2](https://arxiv.org/html/2606.30774#S2.SS0.SSS0.Px2.p2.1)\.
- M\. Klissarov, J\. Cook, D\. Antognini, H\. Sun, J\. Li, N\. Jaques, C\. Musat, and E\. Grefenstette \(2026\)Improving interactive in\-context learning from natural language feedback\.External Links:2602\.16066,[Link](https://arxiv.org/abs/2602.16066)Cited by:[§A\.1](https://arxiv.org/html/2606.30774#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.30774#S2.SS0.SSS0.Px2.p2.1),[§3](https://arxiv.org/html/2606.30774#S3.SS0.SSS0.Px4.p1.2)\.
- P\. Laban, H\. Hayashi, Y\. Zhou, and J\. Neville \(2025\)LLMs get lost in multi\-turn conversation\.External Links:2505\.06120,[Link](https://arxiv.org/abs/2505.06120)Cited by:[§2](https://arxiv.org/html/2606.30774#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Lee, S\. Phatale, H\. Mansoor, T\. Mesnard, J\. Ferret, K\. Lu, C\. Bishop, E\. Hall, V\. Carbune, A\. Rastogi, and S\. Prakash \(2024\)RLAIF vs\. rlhf: scaling reinforcement learning from human feedback with ai feedback\.External Links:2309\.00267,[Link](https://arxiv.org/abs/2309.00267)Cited by:[§2](https://arxiv.org/html/2606.30774#S2.SS0.SSS0.Px2.p1.1)\.
- X\. Li, W\. Wang, M\. Li, J\. Guo, Y\. Zhang, and F\. Feng \(2024\)Evaluating mathematical reasoning of large language models: a focus on error identification and correction\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand\.External Links:[Link](https://aclanthology.org/2024.findings-acl.673/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.673)Cited by:[§2](https://arxiv.org/html/2606.30774#S2.SS0.SSS0.Px3.p1.1)\.
- K\. Lu and T\. M\. Lab \(2025\)On\-policy distillation\.Thinking Machines Lab: Connectionism\.Note:https://thinkingmachines\.ai/blog/on\-policy\-distillationExternal Links:[Document](https://dx.doi.org/10.64434/tml.20251026)Cited by:[§2](https://arxiv.org/html/2606.30774#S2.SS0.SSS0.Px2.p2.1)\.
- OpenAI \(2025\)GPT\-oss: open\-weight reasoning models\.Technical reportOpenAI\.External Links:[Link](https://openai.com/research/gpt-oss)Cited by:[§3](https://arxiv.org/html/2606.30774#S3.SS0.SSS0.Px5.p1.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. L\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray, J\. Schulman, J\. Hilton, F\. Kelton, L\. Miller, M\. Simens, A\. Askell, P\. Welinder, P\. Christiano, J\. Leike, and R\. Lowe \(2022\)Training language models to follow instructions with human feedback\.External Links:2203\.02155,[Link](https://arxiv.org/abs/2203.02155)Cited by:[§2](https://arxiv.org/html/2606.30774#S2.SS0.SSS0.Px2.p1.1)\.
- R\. Rafailov, A\. Sharma, E\. Mitchell, S\. Ermon, C\. D\. Manning, and C\. Finn \(2024\)Direct preference optimization: your language model is secretly a reward model\.External Links:2305\.18290,[Link](https://arxiv.org/abs/2305.18290)Cited by:[§2](https://arxiv.org/html/2606.30774#S2.SS0.SSS0.Px2.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.External Links:2402\.03300,[Link](https://arxiv.org/abs/2402.03300)Cited by:[§2](https://arxiv.org/html/2606.30774#S2.SS0.SSS0.Px2.p1.1)\.
- N\. Shinn, F\. Cassano, E\. Berman, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.External Links:2303\.11366,[Link](https://arxiv.org/abs/2303.11366)Cited by:[§2](https://arxiv.org/html/2606.30774#S2.SS0.SSS0.Px1.p1.1)\.
- Q\. Team \(2026\)Qwen 3\.5 technical report: advancements in multilingual and reasoning capabilities\.arXiv preprint arXiv:2602\.09123\.Cited by:[§3](https://arxiv.org/html/2606.30774#S3.SS0.SSS0.Px5.p1.1)\.
- P\. Villalobos, A\. Ho, J\. Sevilla, T\. Besiroglu, L\. Heim, and M\. Hobbhahn \(2024\)Position: will we run out of data? Limits of LLM scaling based on human\-generated data\.External Links:2211\.04325,[Link](https://arxiv.org/abs/2211.04325)Cited by:[§2](https://arxiv.org/html/2606.30774#S2.SS0.SSS0.Px2.p2.1)\.
- X\. Wang, Z\. Wang, J\. Liu, Y\. Chen, L\. Yuan, H\. Peng, and H\. Ji \(2024\)MINT: evaluating llms in multi\-turn interaction with tools and language feedback\.External Links:2309\.10691,[Link](https://arxiv.org/abs/2309.10691)Cited by:[§2](https://arxiv.org/html/2606.30774#S2.SS0.SSS0.Px1.p1.1)\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. Le, E\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2023\)Self\-consistency improves chain of thought reasoning in language models\.External Links:2203\.11171,[Link](https://arxiv.org/abs/2203.11171)Cited by:[§2](https://arxiv.org/html/2606.30774#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. Chi, Q\. Le, and D\. Zhou \(2023\)Chain\-of\-thought prompting elicits reasoning in large language models\.External Links:2201\.11903,[Link](https://arxiv.org/abs/2201.11903)Cited by:[§2](https://arxiv.org/html/2606.30774#S2.SS0.SSS0.Px1.p1.1)\.
- W\. Xu, A\. Nie, R\. Zheng, A\. Modi, A\. Swaminathan, and C\. Cheng \(2025\)Provably learning from language feedback\.External Links:2506\.10341,[Link](https://arxiv.org/abs/2506.10341)Cited by:[§2](https://arxiv.org/html/2606.30774#S2.SS0.SSS0.Px3.p1.1)\.
## Appendix AExperimental Setup Details
### A\.1Protocol Definition
Our evaluation protocol follows the didactic feedback loop ofKlissarovet al\.\[[2026](https://arxiv.org/html/2606.30774#bib.bib2)\], but uses it only as a zero\-shot evaluation environment\. A single episode contains a task instance, a student model, a teacher model, and a maximum number of student attempts\. The first student message contains the task but no teacher feedback\. After each student attempt, the environment evaluates correctness\. Correct attempts terminate the episode immediately\. Incorrect attempts trigger a teacher turn unless the attempt budget has already been exhausted\.
Letxxbe the public task prompt,zzbe optional privileged task information, andhth\_\{t\}be the prompt\-visible interaction history before thett\-th student attempt\. The student generates
at∼πS\(⋅∣x,ht,ft−1\),a\_\{t\}\\sim\\pi\_\{S\}\(\\cdot\\mid x,h\_\{t\},f\_\{t\-1\}\),whereft−1f\_\{t\-1\}is empty on the first turn\. The verifier returnsct=1c\_\{t\}=1ifata\_\{t\}is correct\. Ifct=0c\_\{t\}=0andt<Kmaxt<K\_\{\\max\}, the teacher generates
ft∼πT\(⋅∣x,at,ht,zT\),f\_\{t\}\\sim\\pi\_\{T\}\(\\cdot\\mid x,a\_\{t\},h\_\{t\},z\_\{T\}\),wherezTz\_\{T\}is determined by the teacher\-information condition\. In the dense model matrix,zTz\_\{T\}excludes the final answer and full solution for math and BBEH, and excludes the trusted reference solution for code\. In code, the teacher can still see execution feedback unless explicitly ablated\.
The teacher is always instructed to provide diagnostic feedback rather than the final answer\. In the main prompt style, the teacher may reason in<think\>tags, but the feedback passed to the student is parsed from the final<feedback\>block\. If parsing fails, the raw teacher text is passed through\.
#### History truncation\.
The implementation maintains separate student\-side and teacher\-side chat histories\. The parametermax\_historycontrols how many completed interaction turns are kept in the prompt\. When the history window is full, older refinement turns are removed before adding the next user message\. The original user prompt containing the task description is preserved, so every later student or teacher call still has the problem context and the chat remains a natural continuation of the same task\.
### A\.2Run Configuration
Table 2:Current dense\-matrix run settings\.The current dense\-matrix runs use the following common settings:
- •Agent:hierarchical\_separate, with separate vLLM actor instances for student and teacher\.
- •Interaction budget:Kmax=10K\_\{\\max\}=10student attempts\.
- •History:max\_history=1\.
- •Student sampling: temperature0\.70\.7, top\-p=0\.95p=0\.95, max tokens81928192\.
- •Teacher sampling: temperature1\.01\.0, top\-p=0\.95p=0\.95, max tokens81928192\.
- •Context length:3276832768tokens\.
- •Max output tokens:81928192tokens\.
- •Number of workers: 16 evaluation workers\.
ARC\-AGI uses max output tokens1600016000for both student and teacher because the task JSON and output grids can be long\.
### A\.3Model Set
The dense matrix uses four model families totaling to thirteen model identifiers:
- •Gemma4: - –gemma\-4\-E2B\-it, - –gemma\-4\-E4B\-it, - –gemma\-4\-26B\-A4B\-it, - –gemma\-4\-31B\-it,
- •Qwen3\.5: - –Qwen3\.5\-2B, - –Qwen3\.5\-4B, - –Qwen3\.5\-9B, - –Qwen3\.5\-27B,
- •Llama: - –Llama\-3\.2\-1B\-Instruct, - –Llama\-3\.2\-3B\-Instruct, - –Llama\-3\.1\-8B\-Instruct,
- •GPT\-OSS: - –gpt\-oss\-20b, - –gpt\-oss\-120b\.
Each model appears once as the student and once as the teacher, for 169 pairs per dense\-matrix environment\.
### A\.4Datasets and Verifiers
#### Omni\-MATH\.
Each example contains a problem, final answer, ground\-truth solution, and source\. The student is asked to end with one boxed final answer\. Correctness is evaluated by two math answer checkers: a PRIME\-style normalization and symbolic equivalence checker, followed by amath\_verify\-based checker\. The environment returns reward 1 if one checker accepts the answer\.
#### Codeforces\.
Each example contains a prompt, optional starter code, trusted reference solution, test specification, source, and difficulty metadata\. The student returns a complete Python solution in a fenced code block\. Candidate code is run in a subprocess sandbox on at most 16 test cases\. A task is solved only when all tests pass\. Current timeouts use dataset\-aware scaling: per\-case timeout defaults to 10 seconds, is multiplied by 10 when a dataset time limit is available, and is clamped between 5 and 20 seconds; the full\-task timeout is derived from the case timeout and number of cases\.
#### BBEH \(Linguini\)\.
Each example contains a task name, problem text, reference answer, source, and deterministic problem id derived from the task, problem, and answer\. We filter the BBEH dataset to the Linguini task\. Correctness is evaluated by lowercasing and normalizing the prediction and reference, stripping simple LaTeX wrappers, handling common answer prefixes, and applying a small set of fuzzy matches for parenthesized choices, numeric strings, quotes, brackets, and trailing question marks\. The environment returns reward 1 if the normalized prediction matches the reference under this evaluator\.
#### ARC\-AGI1\.
Each example contains a task id, training input/output grid pairs, test input grids, and hidden target output grids used only by the verifier\. The student must return a JSON object, using the schema\{"outputs": \[\[\[\.\.\.\]\], \.\.\.\]\}\. Correctness is evaluated by parsing the JSON, validating that each predicted output is a well\-formed ARC grid with the expected number of test outputs, and comparing the predicted grids to the hidden target grids\. The environment returns reward 1 only when all test outputs match exactly\. Structured feedback reports parsing failures, shape errors, or grid mismatch diagnostics, truncated to 2000 characters\.
### A\.5Teacher Information Conditions
We use the following teacher reference modes:
- •none: the teacher receives no ground\-truth answer or solution\.
- •answer: the teacher receives only the final answer or reference target\.
- •solution: the teacher receives the full ground\-truth solution or trusted reference solution\.
For code, reference mode is crossed with execution\-context mode:
- •structured: number of tests passed, pass rate, timeout status, and first\-failure details when available\.
- •feedback: the environment’s compact textual execution feedback\.
- •traceback: only stderr or traceback text when available\.
- •none: no execution\-side signal beyond the student code and problem prompt\.
### A\.6Leakage audit
For the purpose of evaluating we have chosen two out of four environments, specifically Omni\-MATH and BBEH Linguini\. To assess whether the models leak information, we prompted an external large language model judge with a triplet containing the \(solution, student answer, teacher response\)\. The judge model was instructed to return True if the teacher leaked confidential information that trivially helped the student solve the problem\. Fixing the student model as Gemma 4 26B, we iterated across all teacher models to quantify the extent of information leakage\. We calculated the leak rate as the ratio of identified leaks to the total number of judge responses\.
Figure 8:Absolute leak rates across various teacher models and teacher access types, evaluated with a fixed student model \(Gemma 4 26B\)\.We can see in[Figure˜8](https://arxiv.org/html/2606.30774#A1.F8)that no more than 5% of teacher answers contained a leak information\. Consistently no additional teacher access scores lower leak rates compared to both answer access and solution access\. The differences mostly vary in smaller models\.
### A\.7Metrics from Episode Logs
Each completed episode log contains at leastsolved,attempts\_used,max\_turns, and, when available,problem\_id\. For each cutoffkk, the postprocessing code marks an episode as solved bykkifsolved=1andattempts\_used <= k\. If multiple episodes share a problem id, the code averages within problem first and then averages across problems\. Otherwise each episode is treated as its own problem\.
The main reported metrics are:
acc@k=1N∑i𝟏\[problemisolved byk\]\\mathrm\{acc\}@k=\\frac\{1\}\{N\}\\sum\_\{i\}\\mathbf\{1\}\[\\text\{problem \}i\\text\{ solved by \}k\]\(1\)
gain@K=acc@K−acc@1,\\mathrm\{gain\}@K=\\mathrm\{acc\}@K\-\\mathrm\{acc\}@1,\(2\)
ngain@K=\{\(acc@K−acc@1\)/\(1−acc@1\),acc@1<10,acc@1=1\\mathrm\{ngain\}@K=\\begin\{cases\}\(\\mathrm\{acc\}@K\-\\mathrm\{acc\}@1\)/\(1\-\\mathrm\{acc\}@1\),&\\mathrm\{acc\}@1<1\\\\ 0,&\\mathrm\{acc\}@1=1\\end\{cases\}\(3\)
auc=1Kmax∑k=1Kmaxacc@k\\mathrm\{auc\}=\\frac\{1\}\{K\_\{\\max\}\}\\sum\_\{k=1\}^\{K\_\{\\max\}\}\\mathrm\{acc\}@k\(4\)
We also reportacc@2−acc@1\\mathrm\{acc\}@2\-\\mathrm\{acc\}@1because a very large first refinement jump can indicate a formatting or first\-turn protocol artifact rather than steady feedback use\.
### A\.8Prompt Templates
This section records the prompt templates used by the current implementation\. Braced names such as\{problem\}denote task\-specific fields inserted at runtime\.
#### Shared teacher system prompt\.
```
You are a careful tutor. Read the student’s latest attempt, identify the most
important mistake, and give targeted feedback without revealing the final answer.
```
#### Shared teacher output format\.
```
Pinpoint the step in which I am making a mistake.
Provide the most informative piece of information for me to succeed on the next
try, without telling me the final answer.
You may reason about the mistake first inside <think>...</think> tags.
Your final feedback message must appear exactly once inside
<feedback>...</feedback> tags.
Return only:
<feedback>
your feedback
</feedback>
```
#### Omni\-MATH student prompts\.
```
System:
You are a mathematician solving competition-level math problems.
Initial user prompt:
Problem:
{problem}
Solve the problem carefully.
You may reason in <think>...</think>. End with exactly one final line in this
format:
\boxed{your answer}
Refinement user prompt, feedback condition:
Feedback:
{teacher_feedback}
Revise your previous solution so it uses the feedback.
You may reason in <think>...</think>. End with exactly one final line in this
format:
\boxed{your answer}
Refinement user prompt, self-refinement/no-feedback condition:
Revise your previous solution and produce a corrected solution.
You may reason in <think>...</think>. End with exactly one final line in this
format:
\boxed{your answer}
```
#### Omni\-MATH teacher prompts\.
```
Initial teacher user prompt:
I am trying to solve this question:
{problem}
Here is my current attempt:
{previous_attempt}
{reference_context}
{teacher_output_rules}
Follow-up teacher user prompt:
Here is my current attempt:
{previous_attempt}
{teacher_output_rules}
```
The reference context is one of:
```
You do not have access to the ground truth solution or final answer.
Here is the ground-truth final answer:
{answer}
Here is the ground-truth solution:
{solution}
```
### A\.9Example of interaction between student and teacher
This section provides an example of a complete dialogue between teacher gpt\-oss 120B and student Gemma\-4\-31B\. To make it more parsable outputs of student have been summarized\. Teacher in this trajectory provided student with necessary refinement to complete the task\.
Agent Interaction: Student\-Teacher Correction
Student \[Step 0\] — Reasoning & Initial Hypothesis: Analyzes the input grids and identifies the bounding box of the active shape \(color 4\)\. Hypothesizes that the target output is a3×33\\times 3grid of blocks forming a cross shape\. Attempts to deduce the geometric transformations for each arm of the cross, but incorrectly formulates the policy: assuming the right arm is a horizontal reflection and the bottom arm is a vertical reflection, rather than pure rotations\.
Student \[Step 0\] — Action: Submits a17×1717\\times 17output grid\. \(Result: Incorrect, Reward: 0\.0\)
Teacher \[Step 1\] — Feedback: “You correctly noticed that the output is built from four copies of the input’s colored shape arranged in a cross, but the transformations you applied to the arms are not the ones used in the examples\. In the training pairs, the top arm is the shape rotated90∘90^\{\\circ\}clockwise, the left arm is the original shape, the right arm is the shape rotated180∘180^\{\\circ\}\(i\.e\., both axes flipped\), and the bottom arm is the shape rotated270∘270^\{\\circ\}clockwise \(or equivalently a90∘90^\{\\circ\}counter\-clockwise rotation\)\.
For non\-square shapes, the rotated arms swap height and width, so the top and bottom blocks have the dimensions of the inputW×HW\\times H, while the left and right blocks keep the originalH×WH\\times W\. Because you used a horizontal flip… several cells are placed in the wrong positions\. Re\-examine the transformations: left = original, top =90∘90^\{\\circ\}cw, right =180∘180^\{\\circ\}, bottom =270∘270^\{\\circ\}cw, and assemble the output accordingly\.”
Student \[Step 1\] — Reasoning & Correction: Integrates the teacher’s feedback to update the transformation policy\. Re\-evaluates the active bounding box, explicitly noting it is non\-square \(H=5,W=6H=5,W=6\)\. Calculates the necessary dimensional swaps for the top and bottom arms \(W=5,H=6W=5,H=6\)\. Methodically maps the pixel coordinates forS90S\_\{90\},S180S\_\{180\}, andS270S\_\{270\}applied to the base shapeSS, and reconstructs the spatial layout of the final grid using the corrected rotational rules\.
Student \[Step 1\] — Action: Submits the corrected17×1717\\times 17output grid\. \(Result: Success, Reward: 1\.0\)
## Appendix BAdditional Results
The initial sections present detailed dense\-matrix tables and performance visualizations for all model pairings across the tested environments, including Omni\-MATH, Codeforces, BBEH Linguini, and ARC\-AGI1\. They do it for all types of our metrices we used: accuracy, cumulative accuracy, and performance gains\.
Subsequent sections focus on specific ablation studies and deeper mechanistic insights, In them we are using gemma 4 dense matrices\. We isolate the impact of relative teacher strength, information access and maximum history, providing a thorough analysis of the variables that govern reasoning improvements in multi\-turn feedback settings\.
### B\.1Supplementary Dense\-Matrix Tables
Table 3:Most pair\-level gain variation is explained by the student\.Metrics are averaged over student\-teacher pair cells atK=10K=10\. Student FE and teacher FE are single\-factor fixed\-effectR2R^\{2\}values for pair\-levelgain=acc@10−acc@1\\mathrm\{gain\}=\\mathrm\{acc\}@10\-\\mathrm\{acc\}@1; the last column is the incrementalR2R^\{2\}from adding teacher identity after student identity\.Table 4:Teacher effects are smaller globally but still consequential for a fixed student\.For each student, we compute the gap between the best and worst teacher atK=10K=10, then average these within\-student gaps over students\.Table 5:Teacher task competence is only partially predictive of interactive teacher competence\.We rank teachers by empiricalacc@1\\mathrm\{acc\}@1when acting as a student in the same environment, using it as a noisy proxy for single\-turn task competence\. Spearman correlations are computed within each fixed student between this teacher\-rank proxy and interactiveauc\\mathrm\{auc\}\. Inversion rate is the fraction of comparable teacher pairs where the lower\-acc@1\\mathrm\{acc\}@1teacher gives higher interactiveauc\\mathrm\{auc\}\. Values are means over fixed\-student groups±\\pmone standard error\.
### B\.2Supplementary Plasticity Table
Table 6:High acc@1 can lose to better feedback adherence\.The table compares GPT\-OSS\-120B and Gemma4\-31B, averaging metrics over teachers\. GPT\-OSS\-120B often starts from a stronger first\-turn baseline, but Gemma4\-31B achieves higher interactiveauc\\mathrm\{auc\}across environments because it gains more from feedback\. This supports the claim that feedback adherence is distinct from initial task competence\. Grey terms report standard errors over teachers\.
### B\.3Supplementary Model\-Matrix Figures
Figure 9:Omni\-MATH dense model matrices\. Rows are students and columns are teachers\. We annotate all four metrics:acc@1\\mathrm\{acc\}@1,acc@10\\mathrm\{acc\}@10,gain\\mathrm\{gain\}, and cumulative accuracyauc\\mathrm\{auc\}\.Figure 10:Codeforces dense model matrices\. Rows are students and columns are teachers\. We annotate all four metrics:acc@1\\mathrm\{acc\}@1,acc@10\\mathrm\{acc\}@10,gain\\mathrm\{gain\}, and cumulative accuracyauc\\mathrm\{auc\}\.Figure 11:BBEH Linguini dense model matrices\. Rows are students and columns are teachers\. We annotate all four metrics:acc@1\\mathrm\{acc\}@1,acc@10\\mathrm\{acc\}@10,gain\\mathrm\{gain\}, and cumulative accuracyauc\\mathrm\{auc\}\.Figure 12:ARC\-AGI1 dense model matrices\. Rows are students and columns are teachers\. We annotate all four metrics:acc@1\\mathrm\{acc\}@1,acc@10\\mathrm\{acc\}@10,gain\\mathrm\{gain\}, and cumulative accuracyauc\\mathrm\{auc\}\.
### B\.4Supplementary Trajectory Figure
Figure 13:Cumulative accuracy curves are separated primarily by the student model\.Each panel shows one environment, and each curve averages across teachers for one student\. The broad vertical separation between students is larger than the within\-student variation induced by changing teachers\.
### B\.5Supplementary Gain\-by\-Role Figures
Figure 14:Omni\-MATH\. We decompose gain by student\(left\) and teacher\(right\) for each model we compute meangain\\mathrm\{gain\}atK=10K=10across all teachers\(left\) and students\(right\)\.Figure 15:Codeforces\. We decompose gain by student\(left\) and teacher\(right\) for each model we compute meangain\\mathrm\{gain\}atK=10K=10across all teachers\(left\) and students\(right\)\.Figure 16:BBEH Linguini\. We decompose gain by student\(left\) and teacher\(right\) for each model we compute meangain\\mathrm\{gain\}atK=10K=10across all teachers\(left\) and students\(right\)\.Figure 17:ARC\-AGI1\. We decompose gain by student \(left\) and teacher \(right\) for each model we compute meangain\\mathrm\{gain\}atK=10K=10across all teachers\(left\) and students\(right\)\.
### B\.6Supplementary Gemma4 Ablation Figures
Figure 18:Gemma4 teacher\-information access ablation across environments\.Each bar reports cumulative accuracyauc\\mathrm\{auc\}atK=10K=10with max history 5, averaged over Gemma4 student\-teacher pair cells\. Omni\-MATH varies final\-answer and solution access, Codeforces varies execution\-context access, and BBEH Linguini and ARC\-AGI1 compare no access against answer access\.Figure 19:History lift by relative teacher capability\.Matchedauc\\mathrm\{auc\}lift from history 1 to history 5 is largest when the teacher is stronger than the student\.Figure 20:Information\-access lift by teacher\.Matched information\-access gains vary by teacher and environment, rather than appearing as a uniform benefit of revealing the answer or solution\.
### B\.7Feedback\-Control Tables
Table 7:Per\-student controls disentangling interaction, feedback, and feedback quality\.Cells reportacc@10\\mathrm\{acc\}@10withgain@10\\mathrm\{gain\}@10in parentheses, in percentage points\. Self\-refinement is the generic retry condition, self\-feedback is the dense\-matrix diagonal, and best feedback selects the highest\-acc@10\\mathrm\{acc\}@10teacher available for the student and environment\.Similar Articles
Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most
This paper benchmarks seven LLM feedback agents in propositional logic tutoring, finding they perform well on optimal steps but systematically fail to correctly diagnose valid suboptimal and incorrect solutions, highlighting limitations for adaptive tutoring.
@Mnilax: Google and Stanford engineers just dropped a 39-page PDF on what actually makes an AI agent self-improve. input → outpu…
A 39-page paper from Google and Stanford engineers analyzes the key factors that enable AI agents to self-improve through feedback loops, noting that only 9% of agents actually run a real loop.
@zachlloydtweets: https://x.com/zachlloydtweets/status/2069428152338665622
This post explains how to create an automated feedback loop for AI agents to iteratively improve their skills, using computer use and an observer skill to evaluate and update the skill code.
What Makes Interaction Trajectories Effective for Training Terminal Agents?
This paper investigates what makes interaction trajectories effective for training terminal-based AI agents, introducing the Terminal-Lego pipeline and revealing a pedagogical paradox where weaker agents can produce better training data. It finds that environment-grounded supervision, rather than teacher performance, is key for student generalization.
@petradonka: https://x.com/petradonka/status/2054897826149101588
The article argues that AI agents performing judgment-heavy tasks need feedback loops to improve over time, rather than relying on static prompts, using the example of Buzz, an agent developed by Warp to monitor and respond to social mentions.