ACIL: Auto Chain of Thoughts for In-Context Learning

arXiv cs.CL Papers

Summary

This paper introduces ACIL, an automatic Chain-of-Thought framework to enhance In-Context Learning by generating and pruning reasoning chains, improving LLM performance on complex tasks.

arXiv:2605.17088v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have shown that Chain-of-Thought (CoT) reasoning can substantially improve performance on complex reasoning tasks. At the same time, In-Context Learning (ICL) has become an important mechanism for adapting LLMs to new tasks without updating model parameters, using only examples provided in the prompt. However, standard ICL often struggles on tasks that require multi-step reasoning, because the demonstrations usually contain only input-output pairs and lack explicit intermediate reasoning steps. This paper introduces an Automatic Chain-of-Thought (Auto-CoT) framework to improve ICL by automatically constructing reasoning-enhanced demonstrations. Auto-CoT generates reasoning chains for input-output examples, augments the prompt context with structured intermediate explanations, and removes irrelevant or low-quality demonstrations through a systematic selection process. By incorporating high-quality reasoning examples into the ICL prompt, Auto-CoT guides the model toward more reliable reasoning and improves prediction accuracy. Experiments across multiple reasoning tasks demonstrate that the proposed framework improves ICL performance by providing explicit intermediate reasoning guidance.

Cached at: 05/19/26, 06:37 AM

# ACIL: Auto Chain of Thoughts for In-Context Learning
Source: [https://arxiv.org/html/2605.17088](https://arxiv.org/html/2605.17088)
###### Abstract

Recent advancements in Large Language Models \(LLMs\) have highlighted the critical role of Chain of Thought \(CoT\) reasoning in improving model performance across complex reasoning tasks\. In parallel, In\-Context Learning \(ICL\) has emerged as a vital mechanism that allows models to adapt to new tasks without parameter updates by leveraging examples provided in the prompt\. However, traditional ICL approaches struggle to generalize well on tasks requiring intricate reasoning due to the lack of explicit intermediate reasoning steps\.

This paper introduces an Automatic Chain of Thought \(Auto\-CoT\) framework designed to enhance the performance of ICL\. Auto\-CoT automatically generates reasoning chains for input\-output pairs, augments the context with these structured explanations, and prunes irrelevant or low\-quality demonstrations through a systematic selection process\. By integrating high\-quality reasoning examples into the ICL prompt, Auto\-CoT improves the model’s reasoning ability and prediction accuracy\. The approach is validated across various tasks, demonstrating its efficacy in optimizing ICL by guiding the model with intermediate reasoning steps\.

## 1 Introduction

### 1.1 Background

Recent advances in Large Language Models (LLMs) have demonstrated the significant impact of Chain of Thought (CoT) reasoning on model performance. Notably, OpenAI's o1 model has showcased the importance of CoT in enhancing LLMs' problem-solving capabilities across various complex tasks. Concurrently, In-Context Learning (ICL) has been recognized as a crucial part of LLM inference, enabling models to adapt to and learn from contextual information. However, current ICL methods are limited when dealing with intricate reasoning tasks.

Research Objective: This study aims to integrate Automatic Chain of Thought (Auto-CoT) generation with In-Context Learning to improve LLMs' performance on complex reasoning tasks. The primary goal is to develop a methodology that automatically generates high-quality CoT examples and incorporates them into the ICL process, thereby enhancing the model's reasoning capabilities and adaptability.

### 1.2 Research Problem Description

We ask whether CoT can enhance In-Context Learning performance. Although CoT has been studied in few-shot settings, to the best of our knowledge there is little research on improving In-Context Learning performance through zero-shot CoT. Referring to Fig. [1](https://arxiv.org/html/2605.17088#S1.F1), suppose we treat In-Context Learning as a function. Given accurate demonstrations $(x_1, y_1)$ and $(x_2, y_2)$, we want the predicted points (shown in yellow), whether obtained by bi-directional or next-word prediction, to move onto the red line by minimizing the loss. In a real text scenario with Large Language Models, we aim to improve the accuracy of next-word prediction. Here is an example:

- "The movie was fantastic!" → positive
- "I absolutely loved the storyline." → positive
- "The plot was a bit predictable." → neutral
- "The acting was mediocre but the visuals were stunning." → (prompt) neutral

**Hypothesis.** Thus, we have:

$H_0$: Chain of Thought cannot improve the performance of LLMs' In-Context Learning through zero-shot Auto-CoT.

$H_1$: Chain of Thought can improve the performance of LLMs' In-Context Learning through zero-shot Auto-CoT.

**Scope.** This project focuses mainly on linear and non-linear regression functions with a simple Transformer or GPT-2, and extends some of the analysis to real LLMs with a financial classification dataset. This research is a statistical study of the causal inference performance of language models, so no LLM is fine-tuned.

**Contribution and Novelty.** To the best of our knowledge, this is the first attempt to use CoT to enhance ICL performance in LLMs.

Figure 1: Illustration of CoT enhancement for In-Context Learning. The plot shows the demonstration points $(x_1, y_1)$ and $(x_2, y_2)$, the query point $(x_j, ?)$, and the line $w_{\text{test}}^{T} x$.

## 2 Prior Works

**In-Context Learning** first drew wide attention through few-shot learning Brown et al. ([2020b](https://arxiv.org/html/2605.17088#bib.bib3)) and was later formalized in terms of mathematical functions to study and explain its behavior. Garg et al. ([2022](https://arxiv.org/html/2605.17088#bib.bib4)); Xie et al. ([2021](https://arxiv.org/html/2605.17088#bib.bib8)) presented a systematic investigation into transformers' in-context learning capabilities through the lens of simple function classes. They demonstrated that standard transformers can be trained from scratch to perform in-context learning of linear functions with performance comparable to optimal least squares estimation. Their work showed that in-context learning is possible even under distribution shifts between training and inference-time prompts, as well as between in-context examples and query inputs. The study provided valuable insights into transformers' ability to learn and generalize from in-context examples by examining how they handle sparse linear functions, two-layer neural networks, and decision trees.

**Auto-CoT.** Zhang et al. ([2022](https://arxiv.org/html/2605.17088#bib.bib9)) proposed an approach to automate chain-of-thought prompting in large language models. Their method eliminates the need for manual demonstration design by leveraging diversity-based question sampling and automatic reasoning chain generation. The authors demonstrated that Auto-CoT consistently matches or exceeds the performance of manual chain-of-thought prompting across ten public benchmark reasoning tasks. Their analysis revealed that diversity in demonstration selection is crucial for mitigating the effects of reasoning mistakes, and their approach handles both arithmetic and commonsense reasoning tasks while maintaining robustness in streaming settings.

**Automatic prompt augmentation.** On the other hand, Shum et al. ([2023](https://arxiv.org/html/2605.17088#bib.bib6)) proposed an approach for automatic chain-of-thought prompt engineering in large language models. The method addresses the limitations of manual demonstration design through a three-stage framework: (1) augmenting rationale chains from labeled data, (2) pruning low-quality chains based on answer consistency, and (3) selecting optimal chain combinations via variance-reduced policy gradient optimization. The authors demonstrate that Automate-CoT achieves superior performance across multiple reasoning tasks, with improvements in arithmetic reasoning (+2.7%), commonsense reasoning (+3.4%), symbolic reasoning (+3.2%), and non-reasoning tasks (+2.5%). Their analysis reveals that the method handles various sensitivity issues in prompt engineering, including order sensitivity, complexity-diversity trade-offs, and linguistic style variations, while maintaining computational efficiency by requiring only 100 training examples.

**CoT-ICL.** Huang et al. ([2024](https://arxiv.org/html/2605.17088#bib.bib5)) provided a theoretical framework for understanding chain-of-thought prompting by examining how transformers learn multi-layer perceptrons in-context. They decomposed chain-of-thought into two distinct phases: filtering relevant information from prompts and in-context learning of individual computation steps. Their work established that CoT-I/O can learn MLPs with input dimension $d$ and $k$ neurons using $O(\max(k, d))$ in-context samples, significantly improving upon the $\Omega(kd)$ lower bound of standard in-context learning. The study also demonstrated how CoT accelerates pretraining by enabling the model to learn compositional shortcuts, offering valuable insights into the mechanics underlying chain-of-thought reasoning.

## 3 Methodology

### 3.1 Dataset

**Non-linear regression function.** For functional data, we sample $x$ from a Gaussian distribution and generate the corresponding $y$ through a two-layer ReLU network (ReLU-2NN).

**Financial classification dataset.** FinBERT is a pre-trained NLP model for analyzing the sentiment of financial text. It is built by further training the BERT language model on a large financial corpus and fine-tuning it for financial sentiment classification. Araci ([2019](https://arxiv.org/html/2605.17088#bib.bib1))

**LAMBADA dataset.** Following early few-shot learning research Brown et al. ([2020a](https://arxiv.org/html/2605.17088#bib.bib2)), we also chose the LAMBADA dataset to test the next-word prediction performance of ICL with CoT.

### 3.2 In-Context Learning Loss Formulation, Optimization, and Testing

We optimize performance by minimizing the loss function, which we take to be the Mean Squared Error (MSE).

We first trained a Transformer on linear functions sampled from the class $\mathcal{F} = \{ f \mid f(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x},\ \mathbf{w} \in \mathbb{R}^{d} \}$. The training prompts are the prefixes $P^{i} = (\mathbf{x}_1, f(\mathbf{x}_1), \mathbf{x}_2, f(\mathbf{x}_2), \ldots, \mathbf{x}_i, f(\mathbf{x}_i), \mathbf{x}_{i+1})$, and we minimize the Mean Squared Error:

$$\min_{\theta}\ \mathbb{E}_{P}\left[\frac{1}{k+1}\sum_{i=0}^{k}\ell\left(M_{\theta}\left(P^{i}\right), f\left(\mathbf{x}_{i+1}\right)\right)\right]$$

using a decoder-only Transformer architecture consisting of 12 layers, 8 attention heads, and a 256-dimensional embedding space (22.4M parameters).
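The prefix-averaged objective above can be sketched in a few lines. This is only an illustration: the trained Transformer $M_\theta$ is replaced by a hypothetical least-squares stand-in (`least_squares_predictor`), and the function names, the zero prediction for the empty prefix, and the sizes `d = 20`, `k = 40` are assumptions chosen to match the dimensions mentioned in this section.

```python
import numpy as np

def least_squares_predictor(xs, ys, x_next):
    """Stand-in for M_theta (an assumption): minimum-norm least squares on the prefix."""
    if len(xs) == 0:
        return 0.0  # no demonstrations yet, so predict zero
    w_hat, *_ = np.linalg.lstsq(xs, ys, rcond=None)
    return float(x_next @ w_hat)

def prefix_averaged_loss(xs, ys, predictor):
    """Average squared error over the prefixes P^0 .. P^k.

    xs has shape (k + 1, d) and ys has shape (k + 1,). Prefix P^i holds the
    first i labelled pairs followed by the unlabelled input x_{i+1}.
    """
    k = len(xs) - 1
    losses = [
        (predictor(xs[:i], ys[:i], xs[i]) - ys[i]) ** 2 for i in range(k + 1)
    ]
    return float(np.mean(losses))

rng = np.random.default_rng(0)
d, k = 20, 40
w = rng.standard_normal(d)            # one function f(x) = w^T x drawn from F
xs = rng.standard_normal((k + 1, d))  # inputs x_1 .. x_{k+1}
ys = xs @ w
loss = prefix_averaged_loss(xs, ys, least_squares_predictor)
print(f"prefix-averaged squared error: {loss:.4f}")
```

With a real model, `predictor` would wrap a forward pass of the Transformer on the tokenized prefix; the averaging over prefixes stays the same.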

Second, during the inference stage, we use the prompt $P = (\mathbf{x}_1, f(\mathbf{x}_1), \mathbf{x}_2, f(\mathbf{x}_2), \ldots, \mathbf{x}_k, f(\mathbf{x}_k), \mathbf{x}_{k+1})$ generated from $f(\mathbf{x}) = \mathbf{w}_{\text{ICL}}^{\top}\mathbf{x}$, where $\mathbf{w}_{\text{ICL}}$ differs from the weights of the functions drawn from $\mathcal{F}$ during training. Our input size is (40, 20) dimensions. For ICL testing, we evaluate the loss $\left(M(P) - \mathbf{w}^{\top}\mathbf{x}_{\text{query}}\right)^{2}/d$. The goal is for the ICL process to make $\hat{f}_{\mathbf{w}, x_{1:k}}(\mathbf{x}_{\text{query}})$ approximate $\mathbf{w}^{\top}\mathbf{x}_{\text{query}}$, minimizing the loss. In our case, the number of queries is 41. We repeat the process 64 times and report the average performance.
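A minimal sketch of this evaluation protocol follows, assuming a fresh $\mathbf{w}_{\text{ICL}}$ per repetition and Gaussian inputs. The helper names (`icl_eval_error`, `least_squares_predictor`, `zero_predictor`) are hypothetical, and the two predictors are simple stand-ins for the trained model $M$, included only to show how the normalized squared error is averaged over 64 repetitions.

```python
import numpy as np

def icl_eval_error(predictor, d=20, k=40, repeats=64, seed=0):
    """Mean of (M(P) - w^T x_query)^2 / d over `repeats` freshly sampled tasks."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(repeats):
        w_icl = rng.standard_normal(d)      # test-time function, unseen in training
        xs = rng.standard_normal((k, d))    # k in-context inputs
        ys = xs @ w_icl                     # their labels f(x) = w_icl^T x
        x_query = rng.standard_normal(d)    # the (k + 1)-th input
        pred = predictor(xs, ys, x_query)
        errors.append((pred - w_icl @ x_query) ** 2 / d)
    return float(np.mean(errors))

def least_squares_predictor(xs, ys, x_query):
    """Stand-in for M: fit w by least squares on the demonstrations."""
    w_hat, *_ = np.linalg.lstsq(xs, ys, rcond=None)
    return float(x_query @ w_hat)

def zero_predictor(xs, ys, x_query):
    """Trivial reference predictor that ignores the context."""
    return 0.0

ls_error = icl_eval_error(least_squares_predictor)
zero_error = icl_eval_error(zero_predictor)
print(f"least squares: {ls_error:.3e}, zero predictor: {zero_error:.3f}")
```

Because the labels here are noiseless and $k > d$, the least-squares stand-in recovers the function almost exactly, which makes it a useful reference point when reading the error of a learned model.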

The In-Context inference step is illustrated in Fig. [2](https://arxiv.org/html/2605.17088#S3.F2).

![Refer to caption](https://arxiv.org/html/2605.17088v1/ICLscenarioNumeral.png)

Figure 2: In-Context Learning step scenario

### 3.3 Auto-Chain-of-Thought Implementation

We seek $\arg\min_{\mathbf{y}} \ell(\mathbf{y}, \mathbf{x}_{k+1})$, where $\ell$ is our MSE loss comparing the perturbed output with the ground truth at $\mathbf{y}_{41} \mid \mathbf{x}_{k+1}$:

$$\mathcal{L}(\boldsymbol{\delta}) = \ell\left(M_{\theta}(P + \boldsymbol{\delta}), f(\mathbf{x}_{k+1})\right)$$

with an Auto-CoT strategy:

First, we augment the training pool by generating $k$ different reasoning chains for each input-output pair in our linear function:

$$\mathcal{P} = \{P_1, P_2, \ldots, P_k\}, \text{ where } P_i = \{(\mathbf{x}_j, \mathbf{y}_j, \mathbf{r}_j)\}_{j=1}^{41}$$

where $\mathbf{r}_j$ represents the reasoning chain for the $j$-th sample. The augmented prompts are generated through:

$$\mathbf{r}_j = G(\mathbf{x}_j, \mathbf{y}_j; \theta_G)$$

where $G$ is our large language model generating step-by-step reasoning.

Then, we prune low-quality chains based on the consistency between generated answers and ground truth:

$$\mathcal{P}' = \left\{P_i \in \mathcal{P} \mid \|\hat{\mathbf{y}}_{41} - \mathbf{w}^{\top}\mathbf{x}_{41}\|^{2} \leq \epsilon\right\}$$

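The pruning rule can be sketched as a simple filter. The function `prune_prompts`, the dictionary layout of a prompt, and the toy numbers below are illustrative assumptions; the optional `d` argument covers the variant that divides the squared error by the dimension, as the loss in Algorithm 1 does.

```python
def prune_prompts(prompts, predict, target, eps, d=1):
    """Keep the prompts whose squared error at the query is at most eps.

    predict(prompt) returns the model output for that prompt; target is the
    ground-truth value w^T x_41. Set d to the input dimension to normalise.
    """
    kept = []
    for prompt in prompts:
        loss = (predict(prompt) - target) ** 2 / d
        if loss <= eps:
            kept.append(prompt)
    return kept

# Toy pool: each hypothetical prompt already carries the prediction it led to.
pool = [
    {"id": 0, "y_hat": 2.1},
    {"id": 1, "y_hat": 3.0},
    {"id": 2, "y_hat": 1.8},
]
kept = prune_prompts(pool, predict=lambda p: p["y_hat"], target=2.0, eps=0.05)
print([p["id"] for p in kept])
```

In practice `predict` would run the Transformer on the prompt with its reasoning chain attached, so the cost of pruning is one forward pass per candidate.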
Finally, we optimize the selection of reasoning chains through a variance-reduced policy gradient strategy:

$$\nabla_{\pi}\mathcal{L} = \frac{1}{N-1}\sum_{i=1}^{N}\left(\mathcal{L}(P_i) - \frac{1}{N}\sum_{j=1}^{N}\mathcal{L}(P_j)\right)\nabla_{\pi}\log p(P_i)$$

where $\pi$ represents our selection policy and $p(P_i)$ is the probability of selecting the $i$-th prompt.
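As a rough illustration of this estimator, the sketch below assumes the selection policy is a softmax over one logit per candidate prompt; that parameterization, and the names `softmax` and `policy_gradient`, are assumptions made for the example rather than details given in the text.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def policy_gradient(logits, sampled_idx, losses):
    """Mean-baseline policy gradient estimate with respect to the logits.

    logits: (M,) scores of a softmax policy over M candidate prompts
    sampled_idx: (N,) indices of the N prompts drawn from the policy
    losses: (N,) loss L(P_i) observed for each drawn prompt
    """
    probs = softmax(logits)
    n = len(sampled_idx)
    advantages = losses - losses.mean()  # L(P_i) - (1/N) sum_j L(P_j)
    grad = np.zeros_like(logits)
    for idx, adv in zip(sampled_idx, advantages):
        grad_log_p = -probs.copy()
        grad_log_p[idx] += 1.0           # gradient of log softmax at idx
        grad += adv * grad_log_p
    return grad / (n - 1)

# Two candidate prompts, one draw of each: prompt 0 has the higher loss.
logits = np.zeros(2)
grad = policy_gradient(logits, np.array([0, 1]), np.array([1.0, 0.0]))
updated = logits - 0.1 * grad            # one gradient-descent step
print(grad, softmax(updated))
```

A descent step on the logits lowers the probability of the high-loss prompt and raises that of the low-loss one; when all sampled losses are equal the estimate is zero, which is the effect of subtracting the mean.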

The final loss for our ICL with Auto-CoT becomes:

$$\mathcal{L}_{\text{Auto-CoT}} = \mathbb{E}_{P\sim\pi}\left[\frac{1}{d}\left(M(P) - \mathbf{w}^{\top}\mathbf{x}_{41}\right)^{2}\right]$$

This approach enables our model to learn better reasoning patterns by leveraging diverse, high-quality reasoning chains, reducing the prediction error at $\mathbf{y}_{41}$ while maintaining computational efficiency through our 64-times repeated evaluation process.

The detailed steps are given in Algorithm [1](https://arxiv.org/html/2605.17088#algorithm1).

**Algorithm 1:** In-Context Learning with Auto-CoT for Linear Function Approximation

**Input:** Training data $\mathcal{D}$ with dimension (40, 20); query set $\mathbf{x}_{query}$

**Output:** Predicted value $\hat{\mathbf{y}}_{41}$

**Step 1: Augment Stage**

1. Initialize prompt pool $\mathcal{P} = \{\}$.
2. For $i = 1$ to $K$:
   - Sample linear function $f_i(\mathbf{x}) = \mathbf{w}_i^{\top}\mathbf{x}$ from $\mathcal{F}$.
   - Generate sequence $P^{i} = (\mathbf{x}_1, f_i(\mathbf{x}_1), \ldots, \mathbf{x}_k, f_i(\mathbf{x}_k))$.
   - Generate reasoning chain $\mathbf{r}_i$ using the LLM: $\mathbf{r}_i = G(P^{i})$.
   - Add $(P^{i}, \mathbf{r}_i)$ to $\mathcal{P}$.

**Step 2: Prune Stage**

1. Initialize pruned pool $\mathcal{P}' = \{\}$.
2. For each $(P^{i}, \mathbf{r}_i) \in \mathcal{P}$:
   - Compute predicted output $\hat{\mathbf{y}}_i = M(P^{i})$.
   - Compute loss $\ell_i = \|\hat{\mathbf{y}}_i - \mathbf{w}^{\top}\mathbf{x}_{41}\|^{2}/d$.
   - If $\ell_i \leq \epsilon$, add $(P^{i}, \mathbf{r}_i)$ to $\mathcal{P}'$.

**Step 3: Select Stage**

1. Initialize selection policy $\pi_{\theta}$.
2. For epoch $= 1$ to $N$:
   - Sample a batch of prompts from $\mathcal{P}'$ using $\pi_{\theta}$.
   - Compute the policy gradient using $\nabla_{\theta}\mathcal{L} = \frac{1}{B-1}\sum_{i=1}^{B}(\ell_i - \bar{\ell})\nabla_{\theta}\log\pi_{\theta}(P^{i})$.
   - Update $\pi_{\theta}$ using the computed gradient.
3. Select the best performing prompts according to $\pi_{\theta}$.

**Step 4: Inference Stage**

1. Initialize results array $R = []$.
2. For $i = 1$ to $64$:
   - Construct the final prompt $P_{final}$ using the selected examples.
   - Predict $\hat{\mathbf{y}}_{41} = M(P_{final})$.
   - Append $\hat{\mathbf{y}}_{41}$ to $R$.
3. Compute the average prediction $\bar{\mathbf{y}}_{41} = \text{mean}(R)$.
4. Return $\bar{\mathbf{y}}_{41}$.

## 4 Results Evaluation

### 4.1 Dataset and Environment Settings

All experiments were run on either Google Colab or an HPC cluster with L40S or A100 GPUs.

**Numeral Data.** Since recent work typically formulates In-Context Learning as a numeral-data problem, I mostly use numeral data for the experiments.

For the numeral data experiments, I generate data from a two-layer ReLU network as follows:

$$y = \text{ReLU}(W_2 \cdot \text{ReLU}(W_1 x + b_1) + b_2)$$

where $W_1$ and $W_2$ are generated and held fixed within one epoch, and each $x$ is drawn from a Gaussian distribution.
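A minimal sketch of this data-generation step is shown below. The function name `sample_relu_2nn_task`, the hidden width of 100, the Gaussian initialization of the weights and biases, and the scalar output are assumptions for illustration; the text only fixes the functional form and the Gaussian inputs.

```python
import numpy as np

def sample_relu_2nn_task(rng, d=20, hidden=100, n_points=41):
    """Draw one two-layer ReLU function and evaluate it on Gaussian inputs."""
    # Weights are sampled once and then held fixed for all points of the task.
    W1 = rng.standard_normal((hidden, d))
    b1 = rng.standard_normal(hidden)
    W2 = rng.standard_normal((1, hidden))
    b2 = rng.standard_normal(1)

    xs = rng.standard_normal((n_points, d))       # each x ~ N(0, I)
    h = np.maximum(0.0, xs @ W1.T + b1)           # ReLU(W1 x + b1)
    ys = np.maximum(0.0, h @ W2.T + b2).ravel()   # ReLU(W2 h + b2)
    return xs, ys

xs, ys = sample_relu_2nn_task(np.random.default_rng(0))
print(xs.shape, ys.shape)
```

The first 40 pairs would serve as in-context demonstrations and the 41st input as the query, matching the (40, 20) input size used earlier.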

**Text Data.** To validate the conclusion and extend it to real Large Language Models, I also extend the results to text data from SST-2 Socher et al. ([2013](https://arxiv.org/html/2605.17088#bib.bib7)) as well as the FinBERT dataset.

### 4.2 Benchmarks

For the non-linear functions, we mainly use the Mean Squared Error (MSE) for evaluation.

#### 4.2.1 Numeral Data Scenario

**GPT-2 Testing.** The baseline is the numeral In-Context Learning setting, with data generated from the ReLU-2NN and without any Auto-CoT processing.

Table 1: Performance comparison between baseline and Auto-CoT

![Refer to caption](https://arxiv.org/html/2605.17088v1/NumeralMSE.png)

Figure 3: Numeral MSE comparison

Performance metrics for different context lengths in Auto-CoT enhanced ICL. Results show that longer context lengths generally lead to better performance, with the 40-length context achieving the best results across all metrics; details are shown in Fig. [3](https://arxiv.org/html/2605.17088#S4.F3).

**Error Analysis**

The experimental results reveal distinct trends in model performance across varying context lengths. For Mean Squared Error (MSE), Auto-CoT consistently outperforms the baseline across all context lengths, with the most significant reduction observed at the 4-length context (Baseline: 676.819 → Auto-CoT: 535.041). This indicates that Auto-CoT enhances model accuracy by incorporating reasoning chains and selecting high-quality demonstrations.
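As a quick worked check of the size of this improvement, the relative MSE reduction at the 4-length context follows directly from the two reported values:

```latex
\frac{\text{MSE}_{\text{baseline}} - \text{MSE}_{\text{Auto-CoT}}}{\text{MSE}_{\text{baseline}}}
  = \frac{676.819 - 535.041}{676.819}
  = \frac{141.778}{676.819}
  \approx 0.209
```

That is, the reported numbers correspond to roughly a 20.9% lower MSE than the baseline at this context length.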

However, the relationship between context length and AUC shows non-linear behavior. Notably, Auto-CoT achieves its highest AUC (0.607) at the 33-length context, suggesting that intermediate context lengths provide the optimal balance between input information and reasoning complexity. At shorter context lengths, such as 1 and 4, the AUC fluctuates, reflecting the model's difficulty in extracting robust patterns with limited demonstrations. The AUC drop to 0.337 at the 40-length context further highlights a potential saturation effect, where excessive context introduces noise that diminishes discriminative performance.

Overall, the consistent MSE improvements across all context lengths confirm Auto\-CoT’s ability to reduce prediction error, while the irregular AUC trends indicate sensitivity to context complexity and noise\. These findings suggest that optimal performance requires careful balancing of context length to leverage Auto\-CoT’s enhanced reasoning capabilities effectively\.

#### 4.2.2 Text Data Scenario

Table 2: Performance comparison of Auto-CoT and Baseline on the LAMBADA dataset

We also extend the evaluation to the text scenario discussed above, using the LAMBADA dataset. Details are given in Algorithm [2](https://arxiv.org/html/2605.17088#algorithm2).

The results in Table [2](https://arxiv.org/html/2605.17088#S4.T2) demonstrate the performance of Auto-CoT and the baseline ICL approach on the LAMBADA dataset for varying context lengths.

For a context length of 1, the Auto\-CoT loss is 1\.9734, which is significantly lower than the baseline loss of 4\.2728\. This indicates that Auto\-CoT achieves better performance even with minimal context information\. As the context length increases to 3, the Auto\-CoT loss increases slightly to 2\.0998, while the baseline loss decreases to 4\.1614\. Similarly, for a context length of 5, the Auto\-CoT loss further increases to 2\.2141, with the baseline loss reducing to 4\.0404\.
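To put these losses on a common scale, the short sketch below computes the relative loss reduction at each context length, using only the values quoted in the paragraph above. The helper name `relative_reduction` is introduced here for illustration.

```python
# Losses quoted in the text for context lengths 1, 3 and 5.
reported = {
    1: {"baseline": 4.2728, "auto_cot": 1.9734},
    3: {"baseline": 4.1614, "auto_cot": 2.0998},
    5: {"baseline": 4.0404, "auto_cot": 2.2141},
}

def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline

reductions = {
    k: relative_reduction(v["baseline"], v["auto_cot"]) for k, v in reported.items()
}
for k, r in reductions.items():
    print(f"context length {k}: {100 * r:.1f}% lower loss")
# context length 1: 53.8% lower loss
# context length 3: 49.5% lower loss
# context length 5: 45.2% lower loss
```

The gap narrows as the context grows, because the baseline loss falls while the Auto-CoT loss rises slightly.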

Across all context lengths, Auto\-CoT consistently outperforms the baseline, achieving lower losses\. The baseline approach exhibits a steady decrease in loss as the context length increases, suggesting that the additional context improves its performance\. In contrast, the Auto\-CoT loss increases marginally as the context length grows, though it remains significantly lower than the baseline across all settings\.

This analysis highlights the robustness of Auto\-CoT in achieving lower prediction error compared to the baseline, regardless of the context length\. However, the slight increase in Auto\-CoT loss with longer contexts may indicate a diminishing benefit from additional context information\.

## 5 Conclusion

In this work, we presented the Auto\-Chain\-of\-Thought \(Auto\-CoT\) framework to enhance In\-Context Learning \(ICL\) performance by leveraging reasoning chain augmentation, demonstration pruning, and optimized selection mechanisms\.

Through experiments on both **numerical tasks** and **textual tasks** (e.g., the LAMBADA dataset), we demonstrated that Auto-CoT improves ICL performance across varying context lengths compared to the baseline.

- For **numerical function approximation tasks**, Auto-CoT consistently reduced Mean Squared Error (MSE) by generating and integrating step-by-step reasoning chains that align more closely with ground truth patterns. Notable improvements were observed as the reasoning quality was iteratively refined through pruning and selection.
- For **language modeling tasks** using the LAMBADA dataset, Auto-CoT achieved a substantial reduction in loss compared to baseline ICL, especially for shorter context lengths (e.g., $k = 1$). The structured reasoning chains mitigated the ambiguity of incomplete textual prompts, leading to better predictions.

Auto\-CoT’s advantage stems from its ability to:

1. Augment demonstrations with structured reasoning chains automatically generated by pre-trained language models (e.g., GPT-2).
2. Prune low-quality demonstrations based on empirical prediction error, ensuring only the most relevant examples are retained.
3. Optimize the selection policy via a variance-reduced policy gradient, identifying the most informative prompts for the inference stage.

Our experiments further highlighted that Auto\-CoT performs robustly across context lengths, achieving lower error and higher consistency compared to baseline ICL approaches\. These results suggest that integrating structured reasoning significantly enhances the model’s ability to generalize and solve complex prediction tasks\.

Future work will explore the scalability of Auto\-CoT to larger datasets and pre\-trained models, as well as its applicability to multimodal learning scenarios\.

Key Findings:

- Auto-CoT improves ICL accuracy by reducing MSE/loss across numerical and textual datasets.
- Shorter contexts benefit most from Auto-CoT due to the augmentation of informative reasoning.
- The proposed pruning and selection mechanisms ensure computational efficiency and improved inference quality.

In summary, Auto\-CoT provides a systematic and scalable approach to enhancing ICL by combining reasoning generation, pruning, and selection strategies, paving the way for more robust few\-shot learning solutions\.

## References

- Araci (2019) Dogu Araci. 2019. FinBERT: Financial sentiment analysis with pre-trained language models. *arXiv preprint arXiv:1908.10063*.
- Brown et al. (2020a) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020a. [Language models are few-shot learners](https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf). In *Advances in Neural Information Processing Systems*, volume 33, pages 1877–1901. Curran Associates, Inc.
- Brown et al. (2020b) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020b. Language models are few-shot learners. *Advances in Neural Information Processing Systems*, 33:1877–1901.
- Garg et al. (2022) Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. 2022. What can transformers learn in-context? A case study of simple function classes. *Advances in Neural Information Processing Systems*, 35:30583–30598.
- Huang et al. (2024) Sili Huang, Jifeng Hu, Hechang Chen, Lichao Sun, and Bo Yang. 2024. In-context decision transformer: Reinforcement learning via hierarchical chain-of-thought. *arXiv preprint arXiv:2405.20692*.
- Shum et al. (2023) KaShun Shum, Shizhe Diao, and Tong Zhang. 2023. Automatic prompt augmentation and selection with chain-of-thought from labeled data. *arXiv preprint arXiv:2302.12822*.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In *Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing*, pages 1631–1642.
- Xie et al. (2021) Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. 2021. An explanation of in-context learning as implicit Bayesian inference. *arXiv preprint arXiv:2111.02080*.
- Zhang et al. (2022) Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2022. Automatic chain of thought prompting in large language models. *arXiv preprint arXiv:2210.03493*.

## Appendix A

### A.1 Acknowledgement

It is my honor to take the NLP course and I am very glad I learned a lot from Professor Shuo as well as by discussing with the group and the classmates\. With all the ideas and suggestions by the Professor as well as by the classmates, I finished the project by my own effort\.

### A.2 Time estimation

how long do you plan to do literature review: 1 week

how long do you plan to do data collection: 1 week

how long do you plan to write code: 1 week

### A.3 Illustrative Example of Auto-CoT Enhanced ICL and Visualization

To demonstrate how Auto-CoT enhances ICL performance, we present a numerical example where the context length is $k = 4$ and the goal is to predict the query output $y_{\text{query}}$ at $x_{\text{query}}$.

#### A.3.1 Data Generation Stage

The input-output pairs $(x_i, y_i)$ are generated using a two-layer ReLU neural network transformation:

$$\mathbf{y}_i = \text{ReLU}(W_2 \cdot \text{ReLU}(W_1\mathbf{x}_i + b_1) + b_2). \tag{1}$$

Given the parameters:

$$W_1 = [1.2, -0.8],\ b_1 = 0.5, \quad W_2 = [0.9, 1.3],\ b_2 = -0.2,$$

and the following input values:

$$x_1 = 1.0,\ x_2 = 2.0,\ x_3 = 3.0,\ x_4 = 4.0,$$

the corresponding outputs $y_i$ are computed as:

$$y_1 = 2.5,\ y_2 = 4.8,\ y_3 = 7.2,\ y_4 = 9.1.$$

#### A.3.2 Reasoning Chain Generation

For each input-output pair, we generate reasoning chains using a pre-trained GPT-2 model. For example:

> Reasoning Chain for $(x_1, y_1)$:
>
> Input: $x_1 = 1.0$, Output: $y_1 = 2.5$
>
> Reasoning:
>
> 1. Apply first layer: $f_1(1.0) = \max(0, 1.2 \cdot 1.0 - 0.8 + 0.5) = 0.9$,
> 2. Apply second layer: $f_2(0.9) = \max(0, 0.9 \cdot 0.9 + 1.3 - 0.2) = 2.5$,
> 3. Result: $y_1 = 2.5$ is obtained through this transformation.

The full set of reasoning chains is as follows:

$$\mathcal{P} = \{(x_1, y_1, r_1), (x_2, y_2, r_2), (x_3, y_3, r_3), (x_4, y_4, r_4)\}.$$

#### A.3.3 Pruning Stage

At the pruning stage, we evaluate the prediction quality of each demonstration using the Transformer model. Given the query input:

$$x_{\text{query}} = 5.0,$$

the model predicts the output:

$$\hat{y}_{\text{query}} = 11.0,$$

while the ground truth is:

$$y_{\text{true}} = 11.3.$$

The Mean Squared Error (MSE) is computed as:

$$\text{MSE} = \frac{1}{d}\|\hat{\mathbf{y}}_{\text{query}} - \mathbf{y}_{\text{true}}\|^{2} = \frac{1}{20}(11.0 - 11.3)^{2} = 0.045. \tag{2}$$

Since the error is below the threshold $\epsilon = 0.1$, the demonstration is retained in the pruned pool:

$$\mathcal{P}' = \{(x_1, y_1, r_1), (x_2, y_2, r_2), (x_3, y_3, r_3), (x_4, y_4, r_4)\}.$$

#### A.3.4 Selection Stage

A selection policy $\pi_{\theta}$, parameterized by a neural network, assigns probabilities to demonstrations in the pruned pool. For example:

$$\pi_{\theta}(x_1) = 0.6, \quad \pi_{\theta}(x_2) = 0.7, \quad \pi_{\theta}(x_3) = 0.5, \quad \pi_{\theta}(x_4) = 0.8.$$

Using the variance-reduced policy gradient, the policy is optimized to minimize prediction error.

#### A\.3\.5 Inference Stage

The final prompt is constructed by selecting demonstrations based on the selection policy:

$$\mathcal{P}_{\text{final}}=\{(x_1,y_1,r_1),(x_2,y_2,r_2),(x_4,y_4,r_4)\}\cup\{x_{\text{query}}\}. \tag{3}$$

The Transformer model predicts the query output $\hat{y}_{\text{query}}$ multiple times (64 runs) to reduce variance:

$$\hat{y}_{\text{final}}=\frac{1}{64}\sum_{i=1}^{64}M_{\theta}(\mathcal{P}_{\text{final}}). \tag{4}$$

The final averaged prediction is:

$$\hat{y}_{\text{final}}=11.2.$$
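
The averaging step can be sketched as follows. The predictor here is a hypothetical stand-in (`make_noisy_model`) that adds bounded random noise around a fixed value; it is not the paper's Transformer, and only the 64-run average mirrors the text.

```python
import random


def average_prediction(model, prompt, runs=64):
    """Average `runs` stochastic predictions of `model` on the same prompt."""
    outputs = [model(prompt) for _ in range(runs)]
    return sum(outputs) / len(outputs)


def make_noisy_model(center, spread, seed=0):
    """Hypothetical stand-in for a stochastic predictor M_theta."""
    rng = random.Random(seed)
    return lambda prompt: center + rng.uniform(-spread, spread)


model = make_noisy_model(center=11.2, spread=0.5)
y_final = average_prediction(model, prompt="<final prompt>", runs=64)
```

Averaging only helps when individual runs differ, for example through sampling temperature or dropout; for a fully deterministic model all 64 runs would coincide.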

#### A\.3\.6 Performance Improvement

The Auto\-CoT approach reduces the prediction error compared to standard ICL\. For instance:

$$\text{Baseline ICL Error: }\epsilon_{\text{base}}=0.15,\quad\text{Auto-CoT Error: }\epsilon_{\text{enhanced}}=0.045. \tag{5}$$
This demonstrates the effectiveness of Auto\-CoT in improving ICL performance through reasoning chain augmentation, pruning, and selection\.
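
For reference, the relative error reduction implied by these two illustrative values can be worked out directly:

$$\frac{\epsilon_{\text{base}}-\epsilon_{\text{enhanced}}}{\epsilon_{\text{base}}}=\frac{0.15-0.045}{0.15}=\frac{0.105}{0.15}=0.70,$$

that is, a 70% lower error in this single worked example.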

Details are shown in Fig. [4](https://arxiv.org/html/2605.17088#A1.F4).

This example demonstrates how Auto\-CoT systematically enhances ICL by:

- Automatically generating interpretable reasoning chains
- Pruning inconsistent or low\-quality demonstrations
- Selecting optimal combinations for the final prompt
- Maintaining the underlying mathematical structure while adding explanatory power

The effectiveness of this approach is particularly evident in cases where the numerical pattern exhibits non\-linear characteristics, as demonstrated by our two\-layer ReLU network setting\.

![Refer to caption](https://arxiv.org/html/2605.17088v1/NumeralWorkflow.png)

Figure 4: Numeral Workflow Visualization

### A\.4 Text data scenario

Input: Training data $\mathcal{D}$ with dimension $(40, 20)$, query set $\mathbf{x}_{query}$

Output: Predicted value $\hat{\mathbf{y}}_{41}$

Step 1: Augment Stage

- Initialize prompt pool $\mathcal{P}=\{\}$;
- for $i=1$ to $K$ do
  - Sample linear function $f_i(\mathbf{x})=\mathbf{w}_i^{\top}\mathbf{x}$ from $\mathcal{F}$;
  - Generate sequence $P^i=(\mathbf{x}_1,f_i(\mathbf{x}_1),\dots,\mathbf{x}_k,f_i(\mathbf{x}_k))$;
  - Generate reasoning chain $\mathbf{r}_i$ using LLM: $\mathbf{r}_i=G(P^i)$;
  - Add $(P^i,\mathbf{r}_i)$ to $\mathcal{P}$;

Step 2: Prune Stage

- Initialize pruned pool $\mathcal{P}'=\{\}$;
- for each $(P^i,\mathbf{r}_i)\in\mathcal{P}$ do
  - Compute predicted output $\hat{\mathbf{y}}_i=M(P^i)$;
  - Compute loss $\ell_i=\|\hat{\mathbf{y}}_i-\mathbf{w}^{\top}\mathbf{x}_{41}\|^{2}/d$;
  - if $\ell_i\leq\epsilon$ then
    - Add $(P^i,\mathbf{r}_i)$ to $\mathcal{P}'$;

Step 3: Select Stage

- Initialize selection policy $\pi_{\theta}$;
- for epoch $=1$ to $N$ do
  - Sample batch of prompts from $\mathcal{P}'$ using $\pi_{\theta}$;
  - Compute policy gradient using: $\nabla_{\theta}\mathcal{L}=\frac{1}{B-1}\sum_{i=1}^{B}(\ell_i-\bar{\ell})\nabla_{\theta}\log\pi_{\theta}(P^i)$;
  - Update $\pi_{\theta}$ using computed gradient;
- Select best performing prompts according to $\pi_{\theta}$;

Step 4: Inference Stage

- Initialize results array $R=[]$;
- for $i=1$ to $64$ do
  - Construct final prompt $P_{final}$ using selected examples;
  - Predict $\hat{\mathbf{y}}_{41}=M(P_{final})$;
  - Append $\hat{\mathbf{y}}_{41}$ to $R$;
- Compute average prediction $\bar{\mathbf{y}}_{41}=\text{mean}(R)$;
- return $\bar{\mathbf{y}}_{41}$

Algorithm 2: In\-Context Learning with Auto\-CoT for Linear Function Approximation. Example for understanding the text scenario.
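
The gradient estimator in Step 3 can be sketched in plain Python. This is a minimal sketch under stated assumptions: the policy is a softmax over one logit per prompt (the paper only says the policy is parameterized by a neural network), each prompt has a fixed hypothetical loss, and the names `softmax`, `policy_gradient`, and `train` are illustrative.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient(logits, actions, losses):
    """Estimate (1/(B-1)) * sum_i (loss_i - mean_loss) * grad log pi(a_i)."""
    probs = softmax(logits)
    batch = len(actions)
    baseline = sum(losses) / batch  # mean loss acts as the variance-reducing baseline
    grad = [0.0] * len(logits)
    for a, loss in zip(actions, losses):
        advantage = loss - baseline
        for j, p in enumerate(probs):
            # d/d logit_j of log softmax(logits)[a] is 1[j == a] - p_j
            grad[j] += advantage * ((1.0 if j == a else 0.0) - p)
    return [g / (batch - 1) for g in grad]


def train(prompt_losses, epochs=200, batch=8, lr=0.5, seed=0):
    """Gradient descent on the expected loss; hyperparameters are assumptions."""
    rng = random.Random(seed)
    logits = [0.0] * len(prompt_losses)
    for _ in range(epochs):
        probs = softmax(logits)
        actions = rng.choices(range(len(logits)), weights=probs, k=batch)
        losses = [prompt_losses[a] for a in actions]
        grad = policy_gradient(logits, actions, losses)
        logits = [t - lr * g for t, g in zip(logits, grad)]
    return softmax(logits)


# Two prompts, one sample of each: the lower-loss prompt gets a negative gradient,
# so a descent step raises its logit.
grad = policy_gradient([0.0, 0.0], actions=[0, 1], losses=[1.0, 3.0])
```

Subtracting the batch mean $\bar{\ell}$ leaves the direction of the expected update unchanged while reducing its variance, and the $B-1$ denominator requires a batch of at least two prompts.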

### A\.5 Auto\-CoT Example: Text Completion Task

To demonstrate how Auto-CoT enhances ICL performance, we present a text-based example where the context length is $k=3$ and the goal is to predict the query completion $y_{\text{query}}$.

#### A\.5\.1 Data Generation Stage

We sample three context sentences $(c_1,c_2,c_3)$ and one query sentence $q_{\text{target}}$ with its target completion $y_{\text{true}}$ from the LAMBADA dataset:

- $c_1$: “The boy picked up the book and started reading.”
- $c_2$: “He turned to the next page, fascinated by the story.”
- $c_3$: “The plot twist revealed a shocking truth.”
- Query: “In the end, the villain was revealed to be …”
- Target Completion: “his own brother.”

#### A\.5\.2 Reasoning Chain Generation

For each input context, we generate reasoning chains using a pre\-trained GPT\-2 model\. For example:

> Reasoning Chain for Query Completion:
>
> Context: “The boy picked up the book and started reading. He turned to the next page, fascinated by the story. The plot twist revealed a shocking truth.”
>
> Query: “In the end, the villain was revealed to be …”
>
> Reasoning:
>
> 1. “The plot twist suggests a close connection to the protagonist.”
> 2. “The villain’s reveal is likely someone unexpected but familiar.”
> 3. “It could be a family member, which adds emotional impact to the story.”
>
> Completion: “his own brother.”

The augmented set of demonstrations with reasoning chains is as follows:

$$\mathcal{P}=\{(c_1,r_1),(c_2,r_2),(c_3,r_3)\}.$$

#### A\.5\.3 Pruning Stage

At the pruning stage, we evaluate the quality of augmented demonstrations based on the negative log\-likelihood \(NLL\) loss\. Given the query sentence:

$$q_{\text{target}}=\text{“In the end, the villain was revealed to be …”}$$

and the predicted completion $\hat{y}_{\text{query}}$, the NLL loss is computed as:

$$\ell=-\log p(\hat{y}_{\text{query}}\mid\mathcal{P},q_{\text{target}}). \tag{6}$$

For example, if the model predicts

$$\hat{y}_{\text{query}}=\text{“his own father.”}$$

with the ground truth $y_{\text{true}}=\text{“his own brother.”}$, the loss is calculated and compared to a threshold $\epsilon$. If $\ell\leq\epsilon$, the demonstration is retained in the pruned pool $\mathcal{P}'$.
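
For an autoregressive language model, the probability of a completion factorizes over its tokens, so the loss in Eq. (6) is a sum of per-token negative log-probabilities. The sketch below assumes natural logarithms and uses made-up token probabilities; `sequence_nll` and `keep_demo` are hypothetical helper names.

```python
import math


def sequence_nll(token_probs):
    """Negative log-likelihood of a completion from its per-token probabilities."""
    return -sum(math.log(p) for p in token_probs)


def keep_demo(token_probs, eps):
    """Pruning rule: retain the demonstration when the NLL is at most eps."""
    return sequence_nll(token_probs) <= eps


# Hypothetical per-token probabilities for a three-token completion.
probs = [0.5, 0.25, 0.5]
loss = sequence_nll(probs)  # equals log(16), since the product of probs is 1/16
```

Because the loss grows with the number of tokens, a threshold $\epsilon$ chosen for short completions may be too strict for longer ones unless the loss is normalized by length.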

#### A\.5\.4 Selection Stage

A selection policy $\pi_{\theta}$, parameterized by a neural network, assigns probabilities to demonstrations in the pruned pool. For example:

$$\pi_{\theta}(c_1)=0.7,\quad\pi_{\theta}(c_2)=0.6,\quad\pi_{\theta}(c_3)=0.8.$$

Using the policy gradient method, the policy is optimized to select demonstrations that minimize the prediction error.

#### A\.5\.5 Inference Stage

The final prompt is constructed by selecting demonstrations based on the selection policy:

$$\mathcal{P}_{\text{final}}=\{(c_1,r_1),(c_3,r_3)\}\cup\{q_{\text{target}}\}. \tag{7}$$

The language model generates predictions for the query sentence multiple times (64 runs) to reduce variance:

$$\hat{y}_{\text{final}}=\frac{1}{64}\sum_{i=1}^{64}M_{\theta}(\mathcal{P}_{\text{final}},q_{\text{target}}). \tag{8}$$

For example, the averaged prediction may be:

$$\hat{y}_{\text{final}}=\text{“his own brother.”}$$
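
An arithmetic mean is not defined for strings, so the average over runs needs an interpretation when the outputs are text. One possible reading, assumed here and not confirmed by the paper, is a majority vote over the sampled completions; `majority_vote` and the sample list are hypothetical.

```python
from collections import Counter


def majority_vote(completions):
    """Return the most frequent completion; ties go to the earliest seen."""
    return Counter(completions).most_common(1)[0][0]


# Hypothetical samples standing in for repeated model runs.
samples = ["his own brother.", "his own father.", "his own brother."]
final = majority_vote(samples)  # "his own brother."
```

Another reading is to average the model's output distributions or losses across runs and then decode once, which would match Eq. (8) more literally.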

#### A\.5\.6 Performance Improvement

The Auto\-CoT approach reduces the prediction error compared to the baseline ICL method\. For instance:

$$\text{Baseline Loss: }\ell_{\text{base}}=4.2728,\quad\text{Auto-CoT Loss: }\ell_{\text{enhanced}}=1.9734. \tag{9}$$

This demonstrates the effectiveness of Auto-CoT in improving ICL performance by augmenting with reasoning chains, pruning low-quality demonstrations, and optimizing selection.
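
To give the two loss values a more intuitive scale, a difference in negative log-likelihood can be converted into a likelihood ratio. The snippet below assumes both numbers are natural-log NLLs of the same target completion, which the text does not state explicitly.

```python
import math

loss_base = 4.2728      # baseline loss quoted in Eq. (9)
loss_enhanced = 1.9734  # Auto-CoT loss quoted in Eq. (9)

# exp(l_base - l_enhanced) = p_enhanced / p_base when both are natural-log NLLs
# of the same target; this interpretation is an assumption.
likelihood_ratio = math.exp(loss_base - loss_enhanced)
```

Under that assumption, the Auto-CoT prompt makes the target completion roughly ten times more likely than the baseline prompt does.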
