Derivation Prompting: A Logic-Based Method for Improving Retrieval-Augmented Generation

arXiv cs.CL Papers

Summary

Derivation Prompting introduces a logic-inspired prompting method for Retrieval-Augmented Generation that constructs interpretable derivation trees, improving reasoning and reducing hallucinations in knowledge-intensive QA tasks.

arXiv:2605.14053v1 Announce Type: new Abstract: The application of Large Language Models to Question Answering has shown great promise, but important challenges such as hallucinations and erroneous reasoning arise when using these models, particularly in knowledge-intensive, domain-specific tasks. To address these issues, we introduce Derivation Prompting, a novel prompting technique for the generation step of the Retrieval-Augmented Generation framework. Inspired by logic derivations, this method involves deriving conclusions from initial hypotheses through the systematic application of predefined rules. It constructs a derivation tree that is interpretable and adds control over the generation process. We applied this method in a specific case study, significantly reducing unacceptable answers compared to traditional RAG and long-context window methods.


# Derivation Prompting: A Logic-Based Method for Improving Retrieval-Augmented Generation
Source: [https://arxiv.org/html/2605.14053](https://arxiv.org/html/2605.14053)
Instituto de Computación, Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay

\{isastre,gmonce,aialar\}@fing\.edu\.uy

###### Abstract

The application of Large Language Models to Question Answering has shown great promise, but important challenges such as hallucinations and erroneous reasoning arise when using these models, particularly in knowledge\-intensive, domain\-specific tasks\. To address these issues, we introduce Derivation Prompting, a novel prompting technique for the generation step of the Retrieval\-Augmented Generation framework\. Inspired by logic derivations, this method involves deriving conclusions from initial hypotheses through the systematic application of predefined rules\. It constructs a derivation tree that is interpretable and adds control over the generation process\. We applied this method in a specific case study, significantly reducing unacceptable answers compared to traditional RAG and long\-context window methods\.

Repo with all prompts: [https://github\.com/nsuruguay05/derivation\-prompting](https://github.com/nsuruguay05/derivation-prompting)

## 1 Introduction

Question Answering \(QA\) has improved substantially with the advent of Large Language Models \(LLMs\)\. However, these models face important challenges such as hallucinations and faulty reasoning, particularly in knowledge\-intensive, domain\-specific tasks \[[13](https://arxiv.org/html/2605.14053#bib.bib21), [5](https://arxiv.org/html/2605.14053#bib.bib20)\]\. The Retrieval\-Augmented Generation \(RAG\) framework addresses these limitations by retrieving the most relevant document chunks from a trusted domain\-specific document base and grounding the LLM’s generation on the retrieved information \[[2](https://arxiv.org/html/2605.14053#bib.bib13)\]\.

![Refer to caption](https://arxiv.org/html/2605.14053v1/img/derivation_prompting_2.png)

Figure 1: Schematic illustration of a derivation tree constructed using Derivation Prompting\.

Substantial work has been done to improve the reasoning ability of LLMs \[[3](https://arxiv.org/html/2605.14053#bib.bib14)\]\. Techniques like Chain\-of\-Thought \(CoT\) \[[15](https://arxiv.org/html/2605.14053#bib.bib15)\] have reliably improved performance across various tasks, including QA\. However, these techniques do not explicitly define *how* the model should reason, as there are no restrictions on how the intermediate reasoning steps should be constructed\.

In this work, we propose Derivation Prompting, an alternative approach for the generation step in the RAG framework inspired by logical derivations\. In this method, a conclusion is inferred from initial hypotheses by applying well\-defined rules to transform and/or combine these hypotheses\. This novel approach offers some advantages over existing methods, mainly:

- Interpretability: The method not only generates a final answer but also produces a tree structure, referred to as a derivation \(see Figure [1](https://arxiv.org/html/2605.14053#S1.F1)\)\. Each node in the derivation represents the application of an easily interpretable rule that transforms some of its children\. This structure provides a straightforward way to identify errors the model could have made and to understand how it arrived at the final answer\.
- Controlled generation: By generating the answer through the sequential application of predefined rules, this method provides a clearer reasoning path for the model to follow\. This is intended to reduce hallucinations and faulty reasoning and to keep the generated answers grounded in the information from the documents\.

The paper is structured as follows: Section [2](https://arxiv.org/html/2605.14053#S2) presents related work\. Section [3](https://arxiv.org/html/2605.14053#S3) provides a detailed explanation of Derivation Prompting\. Section [4](https://arxiv.org/html/2605.14053#S4) describes the case study conducted\. Section [5](https://arxiv.org/html/2605.14053#S5) explains the evaluation method used\. Section [6](https://arxiv.org/html/2605.14053#S6) presents the results and analysis\. Finally, Section [7](https://arxiv.org/html/2605.14053#S7) contains the conclusions and outlines future work\.

## 2 Related work

Retrieval Augmented Generation

Derivation Prompting, as proposed in this work, is a prompting technique applied within the context of Retrieval\-Augmented Generation \(RAG\)\. RAG augments LLMs with a retrieval component that recovers the most relevant information from an external knowledge base \[[8](https://arxiv.org/html/2605.14053#bib.bib16), [4](https://arxiv.org/html/2605.14053#bib.bib17), [11](https://arxiv.org/html/2605.14053#bib.bib18)\]\.

The naive RAG paradigm consists of two main steps: retrieval and generation\. First, documents are segmented into smaller units, referred to as chunks, which are subsequently converted into vector representations\. When a user submits a query, the retrieval step converts the query to its vector representation and computes similarity scores between the query and the indexed chunks\. The top $k$ chunks are retrieved and used as context in the prompt for the generation step \[[2](https://arxiv.org/html/2605.14053#bib.bib13)\]\.
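The retrieval step described above can be sketched in a few lines. This is a minimal illustration, assuming cosine similarity as the scoring function and using made-up three-dimensional vectors in place of real sentence embeddings; the function names are hypothetical and not taken from the paper's code.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query_vec: Sequence[float],
    chunk_vecs: List[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (chunk_index, score) pairs for the k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors, invented for illustration only.
chunk_vectors = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query_vector = [1.0, 0.1, 0.0]
top_chunks = retrieve_top_k(query_vector, chunk_vectors, k=2)
print([index for index, _ in top_chunks])  # [0, 1]
```

The chunk vectors are computed once at indexing time, so only the query needs to be embedded when a question arrives.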

An alternative to using vector representations is to employ Cross\-Encoder models that directly process each chunk with the query and return a similarity score \[[9](https://arxiv.org/html/2605.14053#bib.bib24), [10](https://arxiv.org/html/2605.14053#bib.bib23)\]\. While this approach tends to yield better results, it is significantly less compute\-efficient: it requires one model inference per chunk for every query, compared to a single inference per query when using sentence embeddings\.
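The cost difference can be made concrete with a small sketch. Here `score_pair` is a hypothetical stand-in for a Cross-Encoder forward pass (a toy word-overlap scorer, not a real model), and a counter records how many times it is invoked for a single query.

```python
from typing import Callable, Dict, List


def rerank_with_cross_encoder(
    query: str,
    chunks: List[str],
    score_pair: Callable[[str, str], float],
) -> List[int]:
    """Return chunk indices sorted by descending pair score.

    score_pair is called once per (query, chunk) pair, so the number of
    model inferences per query equals the number of chunks.
    """
    scores = [score_pair(query, chunk) for chunk in chunks]
    return sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)


call_counter: Dict[str, int] = {"n": 0}


def toy_score(query: str, chunk: str) -> float:
    """Hypothetical scorer: number of words shared by query and chunk."""
    call_counter["n"] += 1
    query_words = set(query.lower().split())
    return float(len(query_words & set(chunk.lower().split())))


toy_chunks = [
    "enrollment deadlines for courses",
    "library opening hours",
    "exam enrollment rules",
]
order = rerank_with_cross_encoder("exam enrollment", toy_chunks, toy_score)
print(order, call_counter["n"])  # [2, 0, 1] 3
```

With a few dozen chunks, as in the case study later in the paper, this per-chunk cost stays manageable; with millions of chunks it would not.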

Prompting techniques for enhancing reasoning

Substantial work has been done to improve the reasoning ability of LLMs \[[3](https://arxiv.org/html/2605.14053#bib.bib14)\]\. Chain\-of\-Thought \(CoT\) \[[15](https://arxiv.org/html/2605.14053#bib.bib15)\] involves prompting the model to generate a coherent series of intermediate reasoning steps that lead to the final answer\. Few\-Shot prompting \[[1](https://arxiv.org/html/2605.14053#bib.bib4)\] is applied, with a chain of thought added to each example\. Wei et al\. \[[15](https://arxiv.org/html/2605.14053#bib.bib15)\] show that sufficiently large LLMs can generate these reasoning chains, yielding promising results in arithmetic, commonsense, and symbolic reasoning tasks\.

The Tree of Thoughts \(ToT\) framework \[[17](https://arxiv.org/html/2605.14053#bib.bib19)\] is an evolution of CoT that enables models to explore several different reasoning paths\. In this approach, reasoning is conceptualized as searching through a tree, where each node represents a thought\. This framework addresses some limitations of CoT, such as the inability to explore different continuations within the same reasoning chain or to backtrack when incorrect conclusions are reached\.

While these methods significantly improve performance on various tasks, there is no control over how each thought is generated in the chain, as there is no systematic methodology the model has to follow\. This lack of control can lead to erroneous reasoning and susceptibility to hallucinations, which are well\-known problems when working with LLMs \[[13](https://arxiv.org/html/2605.14053#bib.bib21), [5](https://arxiv.org/html/2605.14053#bib.bib20)\]\.

Logic and LLMs

Similar to Derivation Prompting, some works explore combining classical logic with prompting techniques to improve reasoning\. The Logical Thoughts \(LoT\) prompting framework \[[18](https://arxiv.org/html/2605.14053#bib.bib1)\] uses logical equivalence, expressing premises in various logically equivalent forms to encourage the exploration of different solutions\. This is achieved by incorporating a verification step for each thought, where an explanation is generated for both the thought as\-is and its logical negation\. The LLM is then tasked to decide between the two\.

Symbolic CoT \(SymbCoT\) \[[16](https://arxiv.org/html/2605.14053#bib.bib22)\] is another proposed method that involves four LLM modules: \(i\) Translator: translating premises and the question to First\-Order Logic formulas, \(ii\) Planner: dividing the original problem into smaller subproblems and developing a step\-by\-step plan, \(iii\) Solver: deriving the answer through a logical inference process, and \(iv\) Verifier: validating the correctness of the translations and the Solver’s output\.

## 3 Derivation prompting

This technique focuses on the generation step of the Retrieval\-Augmented Generation \(RAG\) framework\. It relies on the premise that the expected answer to a given query must be obtained by combining and/or transforming the most relevant information extracted from a document base, since our objective is to rely only on the information available in the documents and not on information the model could have learned during its training phase\.

The idea for this technique is inspired by how a derivation tree in propositional logic is constructed\. In this context, a conclusion $\varphi$ is derived from a set of premises or hypotheses $\Gamma = \{\delta_1, \dots, \delta_n\}$\. We write $\Gamma \vdash \varphi$ if such a derivation exists\. The class of derivations forms an inductively defined set characterized by a list of inference rules that explicitly state how to derive new conclusions from existing ones\. These rules operate by systematically applying logical operations to the premises to construct a tree where each node represents an application of a rule, culminating in the conclusion at the root \[[14](https://arxiv.org/html/2605.14053#bib.bib3)\]\. Figure [2](https://arxiv.org/html/2605.14053#S3.F2) shows an example of a logic derivation\.
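As a worked instance, the derivation of $p_1 \wedge p_2 \vdash p_2 \wedge p_1$ can be typeset as nested fractions. This is a sketch of the standard natural-deduction tree, using the rule labels $E\wedge$ (elimination) and $I\wedge$ (introduction): the hypothesis $p_1 \wedge p_2$ appears at both leaves, each elimination extracts one conjunct, and the introduction rule recombines them in the opposite order at the root.

```latex
\[
\dfrac{
  \dfrac{p_1 \wedge p_2}{p_2}\; E\wedge
  \qquad
  \dfrac{p_1 \wedge p_2}{p_1}\; E\wedge
}{
  p_2 \wedge p_1
}\; I\wedge
\]
```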

![Refer to caption](https://arxiv.org/html/2605.14053v1/img/derivation_example.png)

Figure 2: Example of a derivation proving the statement $p_1 \wedge p_2 \vdash p_2 \wedge p_1$, where $p_1$ and $p_2$ are proposition symbols and $E\wedge$ and $I\wedge$ are the elimination and introduction rules, respectively, for $\wedge$, as defined in \[[14](https://arxiv.org/html/2605.14053#bib.bib3)\]\.

In a typical RAG framework, documents are divided into smaller units called chunks\. Given a query, the $n$ most relevant chunks are selected and used as context for generating the answer\. Following the analogy with logic derivations, in Derivation Prompting, we consider these most relevant chunks as a set of hypotheses $\{h_1, \dots, h_n\}$\. The objective is to construct a derivation tree using predefined natural language rules, ultimately deriving a conclusion $c$ such that $h_1, \dots, h_n \vdash c$, as depicted in Figure [1](https://arxiv.org/html/2605.14053#S1.F1)\.

In contrast to logic derivations, where we usually start from a candidate conclusion and seek to construct the proof, in this case, the conclusion is not known beforehand\. Therefore, a query $q$ is needed to guide the construction of the derivation tree, with the goal that the resulting conclusion serves as the answer to the query $q$\.

For each step in the construction of the derivation, the task the LLM has to follow involves deciding which rule to apply, selecting the appropriate hypotheses, and generating the conclusion that arises from the application of the chosen rule\. Although it may seem counter\-intuitive to let the LLM decide which rule to apply, this is a key aspect for making this method viable due to the LLM’s ability to disambiguate natural language in both the rule explanation and the hypotheses\.

### 3\.1 Rules

For Derivation Prompting, a set of derivation rules must be defined\. These rules are specified in natural language and used by the language model to construct a derivation tree\. We define a set of derivation rules that are convenient for our use case \(Section [4](https://arxiv.org/html/2605.14053#S4)\)\. It is important to note that these rules are specific to this problem; a different set of rules can be defined to best fit the combinations and/or transformations required by other use cases\. Table [1](https://arxiv.org/html/2605.14053#S3.T1) presents each rule with a description, and Figure [3](https://arxiv.org/html/2605.14053#S3.F3) shows a toy example for each rule\.

Table 1: List of defined rules with a brief description\.

![Refer to caption](https://arxiv.org/html/2605.14053v1/img/rules.png)

Figure 3: Toy examples of application for each rule\. Examples \(E\) and \(F\) include information from the query for better understanding\.

### 3\.2 Algorithm

Algorithm [1](https://arxiv.org/html/2605.14053#alg1) presents the pseudo\-code for constructing a derivation\. When looking at the algorithm in detail, it is important to note that lines 3, 4, and 5 correspond to steps that the LLM should execute\. The responsibilities of the LLM are to decide which rule to apply and which hypotheses to use, as well as to construct the conclusion that arises from the application of the chosen rule\. Additionally, the LLM is used to determine whether the conclusion serves as the final answer to the user’s query\.

Algorithm 1: Derivation Prompting pseudo\-code

Input: $\mathit{hypotheses\_list} = \{h_1, \dots, h_n\}$: list of hypotheses; $q$: query

1: $\mathit{final\_answer} \leftarrow \textbf{False}$

2: while not $\mathit{final\_answer}$ do

3: Decide which rule $r$ to apply

4: Decide which hypotheses $\{h_i, \dots, h_k\}$ to apply $r$ to

5: $\mathit{conclusion} \leftarrow$ apply rule $r$ over $\{h_i, \dots, h_k\}$ and query $q$

6: if $\mathit{conclusion}$ is the final answer then

7: $\mathit{final\_answer} \leftarrow \textbf{True}$

8: else

9: $\mathit{hypotheses\_list}.\mathit{append}(\mathit{conclusion})$

10: end if

11: end while

12: return $\mathit{conclusion}$

The rest of the algorithm is straightforward\. If the conclusion is considered the final answer, the derivation is complete, and the last conclusion is used as the answer\. If not, the conclusion is added to the list of hypotheses and can be used in subsequent rule applications \(though it might potentially never be used\)\. Optionally, each rule application can be stored with pointers to its arguments and conclusion to later reconstruct the derivation tree\.
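The loop can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the LLM is abstracted as a hypothetical `step_fn` callable that returns the chosen rule, the indices of the hypotheses used, the conclusion, and a final-answer flag. The `max_steps` bound is a safety addition that is not part of Algorithm 1. Of the rule names in the toy run, "Refine" is mentioned in the paper, while "Combine" is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One rule application, with pointers into the hypotheses list."""
    rule: str
    premise_ids: List[int]  # indices of the hypotheses the rule was applied to
    conclusion: str
    is_final: bool


# Hypothetical interface for the LLM-backed step: (hypotheses, query) -> Step.
StepFn = Callable[[List[str], str], Step]


def derive(
    hypotheses: List[str],
    query: str,
    step_fn: StepFn,
    max_steps: int = 10,
) -> Tuple[str, List[Step]]:
    """Run the derivation loop and return (answer, recorded steps)."""
    hypotheses = list(hypotheses)  # do not mutate the caller's list
    steps: List[Step] = []
    for _ in range(max_steps):  # safety bound, not part of Algorithm 1
        step = step_fn(hypotheses, query)
        steps.append(step)  # stored so the tree can be rebuilt later
        if step.is_final:
            return step.conclusion, steps
        hypotheses.append(step.conclusion)  # conclusion becomes a hypothesis
    raise RuntimeError("no final answer within max_steps")


def scripted_step(hypotheses: List[str], query: str) -> Step:
    """Stand-in for the LLM: two hard-coded rule applications."""
    if len(hypotheses) == 2:
        merged = hypotheses[0] + " " + hypotheses[1]
        return Step("Combine", [0, 1], merged, False)
    return Step("Refine", [2], "Answer: " + hypotheses[2], True)


answer, steps = derive(["Fact A.", "Fact B."], "toy query", scripted_step)
print(answer)  # Answer: Fact A. Fact B.
```

Keeping `premise_ids` as indices into the growing hypotheses list is one way to store the pointers needed to reconstruct the tree afterwards.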

We explored different ways of implementing the aforementioned algorithm and considered two main alternatives:

1. One\-step prompt: This approach isolates each rule application as an independent LLM call\. Given the list of hypotheses in the middle of a derivation, the model is prompted to produce, in a single inference, the rule to apply, the hypotheses to use, the resulting conclusion, and whether it is the final answer\. The algorithm is implemented similarly to Algorithm [1](https://arxiv.org/html/2605.14053#alg1), with lines 3, 4, and 5 replaced by this single call to the LLM, followed by parsing the result\.
2. Whole derivation prompt: In contrast to the previous alternative, this approach allows the LLM to construct the entire derivation in one inference call, effectively emulating the execution of Algorithm [1](https://arxiv.org/html/2605.14053#alg1)\. To achieve this, we applied a Few\-Shot strategy \[[1](https://arxiv.org/html/2605.14053#bib.bib4)\], crafting six complete examples of manual executions of the algorithm to create different derivations using all the rules \(Appendix [0\.A](https://arxiv.org/html/2605.14053#Pt0.A1) shows one of these examples\)\. The model is then prompted to follow the same steps with a new query and initial hypotheses\. The result is then parsed to obtain each rule execution and intermediate hypotheses\.
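Both alternatives end with a parsing step. The sketch below shows what such a parser could look like for a whole-derivation output. The one-line-per-step text format is purely hypothetical, invented here for illustration; the actual output format is defined by the prompts in the authors' repository and may differ. The sample conclusions are made up as well.

```python
import re
from typing import Dict, List

# Hypothetical step format (an assumption, not the paper's):
#   Rule: <name> | Hypotheses: h1, h2 | Conclusion: <text> | Final: yes/no
STEP_PATTERN = re.compile(
    r"^Rule:\s*(?P<rule>\w+)\s*\|\s*"
    r"Hypotheses:\s*(?P<hyps>[h\d,\s]*?)\s*\|\s*"
    r"Conclusion:\s*(?P<conclusion>.*?)\s*\|\s*"
    r"Final:\s*(?P<final>yes|no)\s*$"
)


def parse_derivation(text: str) -> List[Dict[str, object]]:
    """Extract each rule application from the model's raw output."""
    steps: List[Dict[str, object]] = []
    for line in text.splitlines():
        match = STEP_PATTERN.match(line.strip())
        if match is None:
            continue  # skip anything that is not a step line
        hyp_ids = [int(h) for h in re.findall(r"h(\d+)", match.group("hyps"))]
        steps.append({
            "rule": match.group("rule"),
            "hypotheses": hyp_ids,
            "conclusion": match.group("conclusion"),
            "final": match.group("final") == "yes",
        })
    return steps


sample_output = "\n".join([
    "Rule: Refine | Hypotheses: h1 | Conclusion: Enrollment closes in March. | Final: no",
    "Some free text the model added.",
    "Rule: Refine | Hypotheses: h2, h4 | Conclusion: You can still enroll. | Final: yes",
])
parsed = parse_derivation(sample_output)
print(len(parsed), parsed[-1]["hypotheses"])  # 2 [2, 4]
```

Skipping lines that do not match is a design choice that tolerates stray text; a stricter parser could instead reject the whole output and retry the call.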

In our experiments, the second approach yielded results as good as the first one but was much faster and computationally cheaper, as it replaces $n$ LLM inferences with just one\. Therefore, we decided to use the whole derivation approach extensively\.

## 4 Case study

We investigated this idea in the context of a specific use case: developing a platform for question answering in the domain of administrative information for the School of Engineering at Universidad de la República \(UDELAR\), for the Spanish language\. Currently, the school operates the Orientation and Consultation Space \(OCS\), where students can ask questions via email or in person, and OCS staff provide answers\. We explored the feasibility of building a tool to assist with this work by offering an automated system for students to obtain answers to their questions\.

We gathered a small set of documents available on the school’s webpage\. Specifically, 17 websites were scraped and converted to markdown format using LangChain’s `Html2TextTransformer` class\.

For evaluation purposes, we constructed a QA dataset consisting of 135 real user queries\. These queries were derived from past emails sent to the OCS over the last few years\. Each student email was preprocessed using the Llama 2 7B model \[[12](https://arxiv.org/html/2605.14053#bib.bib5)\] to remove irrelevant information typical of email communication \(e\.g\., greetings, apologies\) and, most importantly, personal information such as names, identification numbers, and phone numbers\.

In the context of this project, we explored several methods using LLMs which are explained below:

Retrieval Augmented Generation \(RAG\)

For the retrieval step, we explored using sentence embeddings generated with `intfloat/multilingual-e5-large` as well as the `BAAI/bge-reranker-large` Cross\-Encoder model\. Given that the use case involves fewer than a hundred chunks, using a Cross\-Encoder was feasible and, as expected, consistently yielded better results than using sentence embeddings\.

For the generation step, we experimented with models from Anthropic’s Claude 3 family \([https://www\.anthropic\.com/news/claude\-3\-family](https://www.anthropic.com/news/claude-3-family)\), specifically Haiku, which is faster but less capable, and Opus, which is the best performing and competitive with OpenAI’s GPT\-4\. The $k$ most relevant chunks to the user’s query were added as context, and a prompt was crafted to explicitly instruct the model to use them for generating the answer\.

Using Long Context Windows

Another approach we explored was leveraging the long context windows of closed models, specifically the Claude models, which support up to 200k tokens\. In this method, we inserted all the full documents as context, thereby avoiding the retrieval step\.

Derivation Prompting

Using the retrieval method described in the RAG experiment, we explored the use of Derivation Prompting for the generation step, as detailed in Section [3](https://arxiv.org/html/2605.14053#S3)\. For this method, we used three initial hypotheses, corresponding to the three most relevant chunks obtained in the retrieval step\. We did not explore using more hypotheses because of how the Few\-Shot examples were designed, but this is an area for future work that we plan to investigate\.

## 5 Evaluation

Evaluating Open\-Domain Question Answering, especially when using LLMs, remains an open problem, and human evaluation still appears to have no substitute \[[6](https://arxiv.org/html/2605.14053#bib.bib10)\]\. Nevertheless, it has been shown that state\-of\-the\-art LLMs tend to exhibit a high degree of agreement with human evaluation when used as judges \[[19](https://arxiv.org/html/2605.14053#bib.bib12)\]\. Therefore, we decided to follow this approach for evaluating each experiment separately\. We are currently conducting human evaluation on the best performing experiments and results will be presented in future work\.

We designed an evaluation prompt following the format defined for the Feedback Collection dataset \[[7](https://arxiv.org/html/2605.14053#bib.bib11)\], which encompasses four components:

1. Instruction to evaluate: The instruction for the task to evaluate\. In our case, this is the particular question that the answer addresses\.
2. Response to evaluate: The response to the question that the LLM has to evaluate \(with a score on a scale from 1 to 5\)\.
3. Reference answer: A reference answer that corresponds to a score of 5\.
4. Customized score rubric: Specific criteria defined for our use case, specifying what the evaluator should focus on\. This includes a description of the criteria and a detailed explanation for each possible score \(1 to 5\)\.
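The four components can be assembled into a single judge prompt along these lines. This is an illustrative sketch only: the section labels and overall layout are assumptions rather than the exact wording of the Feedback Collection format, and the score descriptions paraphrase the paper's rubric in shortened form.

```python
from typing import Dict


def build_evaluation_prompt(
    instruction: str,
    response: str,
    reference_answer: str,
    criteria: str,
    score_descriptions: Dict[int, str],
) -> str:
    """Assemble the four components into one evaluation prompt."""
    if sorted(score_descriptions) != [1, 2, 3, 4, 5]:
        raise ValueError("score_descriptions must cover scores 1 to 5")
    rubric_lines = [f"Score {s}: {score_descriptions[s]}" for s in range(1, 6)]
    return "\n".join([
        "###Instruction to evaluate:", instruction, "",
        "###Response to evaluate:", response, "",
        "###Reference answer (score 5):", reference_answer, "",
        "###Score rubric:", criteria, *rubric_lines,
    ])


# Shortened paraphrase of the rubric; the exact texts are an assumption.
rubric = {
    1: "The answer fully contradicts the reference answer.",
    2: "The answer partially contradicts the reference answer.",
    3: "The answer provides no information, correct or incorrect.",
    4: "The answer is partially correct and contains nothing false.",
    5: "The answer is fully correct.",
}
prompt = build_evaluation_prompt(
    instruction="Toy question?",
    response="Toy answer.",
    reference_answer="Toy reference.",
    criteria="Is the generated answer correct and truthful?",
    score_descriptions=rubric,
)
print(prompt)
```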

We defined the score rubric criteria as determining whether the generated answer is correct and truthful\. This is clearly specified in each score description\. Table [2](https://arxiv.org/html/2605.14053#S5.T2) presents a brief explanation of each score\.

Table 2: Score rubric criteria defined for each score, for evaluating generated answers\.

Additionally, we classified scores 1 and 2 as unacceptable and scores 3 to 5 as acceptable, thereby obtaining an aggregated metric for evaluation\. Scores 1 and 2 correspond to answers that fully or partially contradict the reference answer and are therefore considered unacceptable\. Answers with scores 3 to 5 may provide no information at all \(and are therefore not incorrect\), provide part of the correct information, or be fully correct\. In all these cases, the answers do not provide false or contradictory information and are considered acceptable\.
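The aggregated metric is a simple threshold on the 1 to 5 score. A minimal sketch follows; the function name is hypothetical and the scores in the example are invented for illustration, not results from the paper.

```python
from typing import Iterable


def acceptable_rate(scores: Iterable[int]) -> float:
    """Fraction of answers scored 3, 4 or 5; scores 1 and 2 are unacceptable."""
    scores = list(scores)
    if not scores:
        raise ValueError("no scores given")
    if any(s not in (1, 2, 3, 4, 5) for s in scores):
        raise ValueError("scores must be integers from 1 to 5")
    return sum(1 for s in scores if s >= 3) / len(scores)


# Made-up scores for illustration only; these are not results from the paper.
example_scores = [5, 3, 1, 4, 2, 3, 5, 3]
print(f"{acceptable_rate(example_scores):.0%}")  # 75%
```

Note that an answer with score 3 counts as acceptable even though it conveys no information, which is why this metric tracks the avoidance of wrong answers rather than overall helpfulness.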

## 6 Results

The evaluation was carried out using Claude Opus as the evaluator\. Table [3](https://arxiv.org/html/2605.14053#S6.T3) shows the percentage of acceptable answers and the distribution for each score from 1 to 5 for the best performing experiments\. These experiments use Claude Opus and Claude Haiku and, where applicable, the Cross\-Encoder for the retrieval step, with the number of chunks used as context set to $k=3$\.

Table 3: Percentage of acceptable answers, distribution of scores, average and standard deviation metrics for each experiment\. CH is Claude Haiku and CO is Claude Opus\.

As can be observed in Table [3](https://arxiv.org/html/2605.14053#S6.T3), Derivation Prompting with Claude Opus significantly reduces the number of unacceptable answers compared to the other experiments\. However, it does not necessarily increase the number of answers with scores of 4 and 5\. Many unacceptable answers from the other experiments receive a score of 3 in Derivation Prompting\. There are two primary reasons for this: \(1\) The NoInfo rule has a more direct impact than simply prompting the model not to answer questions when the information is not available, as done in both the RAG and Long Context experiments; \(2\) Generating answers through the application of explicitly defined and constrained rules reduces the likelihood of hallucinations or misinterpretations of the context chunks and minimizes the potential for faulty reasoning\. These results suggest that while there is minimal impact on recall \(i\.e\., answering as much as possible\), there is a significant improvement in precision \(i\.e\., avoiding incorrect answers\)\.

It is important to note that while Derivation Prompting with Claude Haiku also reduces unacceptable answers, it does have an impact on the number of answers with scores of 4 and 5, resulting in fewer such answers compared to RAG\. This suggests that the size of the model is an important factor\. Larger and more powerful models, such as Claude Opus, have a better understanding of the task of constructing the derivation and are more capable of applying the rules effectively, yielding better results\.

Although unacceptable answers have been reduced, there are still some examples that scored 1 and 2\. A significant advantage of Derivation Prompting is that the resulting derivation is interpretable, and it is easy to identify mistakes in the application of rules\. This, when compared to simple RAG, is a notable advantage for users, as it often eliminates the need to verify answers directly from the source\. Instead, users can follow the reasoning in the derivation and identify faulty steps\. Figure [4](https://arxiv.org/html/2605.14053#S6.F4) presents a real example of an incorrect derivation\. In the application of the Refine rule, it is clear that the model has hallucinated facts not present in the hypotheses\.

![Refer to caption](https://arxiv.org/html/2605.14053v1/img/incorrect-example-enhanced.png)

Figure 4: Example of an incorrect derivation \(translated from Spanish\)\. In the application of the Refine rule, the model hallucinates that having completed 5th year of high school in biology fulfills the required pre\-university studies \(hallucination is underlined in red\)\.
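To make this kind of inspection concrete, a recorded derivation can be printed as an indented tree so that each conclusion sits directly above the premises it was derived from. The sketch below assumes a hypothetical record format, a tuple of rule name, premise indices and conclusion text, and uses placeholder strings rather than real chunks.

```python
from typing import List, Tuple

# (rule name, indices of the premises it used, conclusion text)
StepRecord = Tuple[str, List[int], str]


def render_tree(initial: List[str], steps: List[StepRecord]) -> str:
    """Render the derivation that ends at the last step as an indented tree.

    Indices below len(initial) refer to retrieved chunks; index
    len(initial) + j refers to the conclusion of steps[j].
    """
    lines: List[str] = []

    def visit(index: int, depth: int) -> None:
        pad = "  " * depth
        if index < len(initial):
            lines.append(f"{pad}[h{index + 1}] {initial[index]}")
            return
        rule, premises, conclusion = steps[index - len(initial)]
        lines.append(f"{pad}({rule}) {conclusion}")
        for p in premises:
            if p >= index:  # a step may only use earlier hypotheses
                raise ValueError("premise index must precede its conclusion")
            visit(p, depth + 1)

    visit(len(initial) + len(steps) - 1, 0)
    return "\n".join(lines)


# Placeholder content; chunk C is retrieved but never used.
chunks = ["chunk A", "chunk B", "chunk C"]
records = [
    ("Refine", [0], "refined A"),
    ("Refine", [3, 1], "final answer from refined A and chunk B"),
]
tree = render_tree(chunks, records)
print(tree)
```

Reading such a tree top-down, a reviewer can compare each conclusion with the lines indented beneath it and spot a step whose conclusion contains facts absent from its premises.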

## 7 Conclusions

In this paper we introduced Derivation Prompting, a new prompting technique inspired by logic derivations, to improve the generation step in the Retrieval\-Augmented Generation framework for open\-domain question answering\. Our experiments showed that Derivation Prompting significantly reduces the occurrence of unacceptable answers compared to traditional RAG and long\-context window approaches\.

However, the performance of Derivation Prompting is influenced by the size and capability of the underlying LLM\. While Claude Opus exhibited robust performance, smaller models like Claude Haiku showed a decrease in useful answers \(though unacceptable answers were reduced\), indicating the importance of model capacity when constructing effective derivations\.

Future work will focus on refining Derivation Prompting by experimenting with different sets of rules and adjusting the number of initial hypotheses\. We are also working on formalizing the underlying formal language behind the application of the rules, and using this to add further verification methods to ensure the correctness of the resulting derivation\.

We believe that this method can be applied to additional use cases and may be generalized to non\-RAG scenarios with different sets of rules\. However, it is important to evaluate this method further to ensure its utility in such cases\.

## References

- \[1\] T\. B\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. Kaplan, et al\. \(2020\) Language models are few\-shot learners\. In Advances in Neural Information Processing Systems, Vol\. 33, pp\. 1877–1901\. [Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf)
- \[2\] Y\. Gao, Y\. Xiong, X\. Gao, K\. Jia, J\. Pan, Y\. Bi, Y\. Dai, J\. Sun, M\. Wang, and H\. Wang \(2024\) Retrieval\-augmented generation for large language models: a survey\. arXiv:2312\.10997\. [Link](https://arxiv.org/abs/2312.10997)
- \[3\] J\. Huang and K\. C\. Chang \(2023\-07\) Towards reasoning in large language models: a survey\. In Findings of the Association for Computational Linguistics: ACL 2023, A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\), Toronto, Canada, pp\. 1049–1065\. [Link](https://aclanthology.org/2023.findings-acl.67), [Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.67)
- \[4\] G\. Izacard, P\. Lewis, M\. Lomeli, L\. Hosseini, F\. Petroni, T\. Schick, J\. Dwivedi\-Yu, A\. Joulin, S\. Riedel, and E\. Grave \(2024\-03\) Atlas: few\-shot learning with retrieval augmented language models\. J\. Mach\. Learn\. Res\. 24\(1\)\. ISSN 1532\-4435\.
- \[5\] Z\. Ji, N\. Lee, R\. Frieske, T\. Yu, D\. Su, Y\. Xu, E\. Ishii, Y\. J\. Bang, A\. Madotto, and P\. Fung \(2023\-03\) Survey of hallucination in natural language generation\. ACM Comput\. Surv\. 55\(12\)\. ISSN 0360\-0300\. [Link](https://doi.org/10.1145/3571730), [Document](https://dx.doi.org/10.1145/3571730)
- \[6\] E\. Kamalloo, N\. Dziri, C\. Clarke, and D\. Rafiei \(2023\-07\) Evaluating open\-domain question answering in the era of large language models\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\), Toronto, Canada, pp\. 5591–5606\. [Link](https://aclanthology.org/2023.acl-long.307), [Document](https://dx.doi.org/10.18653/v1/2023.acl-long.307)
- \[7\] S\. Kim, J\. Shin, Y\. Cho, J\. Jang, S\. Longpre, H\. Lee, S\. Yun, S\. Shin, S\. Kim, J\. Thorne, and M\. Seo \(2024\) Prometheus: inducing fine\-grained evaluation capability in language models\. In The Twelfth International Conference on Learning Representations\. [Link](https://openreview.net/forum?id=8euJaTveKw)
- \[8\] P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel, S\. Riedel, and D\. Kiela \(2020\) Retrieval\-Augmented Generation for Knowledge\-Intensive NLP Tasks\. In Advances in Neural Information Processing Systems, Vol\. 33, pp\. 9459–9474\. [Link](https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html)
- \[9\] R\. F\. Nogueira and K\. Cho \(2019\) Passage re\-ranking with BERT\. CoRR abs/1901\.04085\. [Link](http://arxiv.org/abs/1901.04085)
- \[10\] N\. Reimers and I\. Gurevych \(2019\-11\) Sentence\-BERT: Sentence Embeddings using Siamese BERT\-Networks\. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\), K\. Inui, J\. Jiang, V\. Ng, and X\. Wan \(Eds\.\), Hong Kong, China, pp\. 3982–3992\. [Link](https://aclanthology.org/D19-1410), [Document](https://dx.doi.org/10.18653/v1/D19-1410)
- \[11\] W\. Shi, S\. Min, M\. Yasunaga, M\. Seo, R\. James, M\. Lewis, L\. Zettlemoyer, and W\. Yih \(2024\-06\) REPLUG: retrieval\-augmented black\-box language models\. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\), Mexico City, Mexico, pp\. 8371–8384\. [Link](https://aclanthology.org/2024.naacl-long.463)
- \[12\] H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, et al\. \(2023\-07\) Llama 2: Open Foundation and Fine\-Tuned Chat Models\. arXiv:2307\.09288 \[cs\]\. [Link](http://arxiv.org/abs/2307.09288), [Document](https://dx.doi.org/10.48550/arXiv.2307.09288)
- \[13\] K\. Valmeekam, A\. Olmo, S\. Sreedharan, and S\. Kambhampati \(2022\) Large language models still can’t plan \(a benchmark for LLMs on planning and reasoning about change\)\. In NeurIPS 2022 Foundation Models for Decision Making Workshop\. [Link](https://openreview.net/forum?id=wUU-7XTL5XO)
- \[14\] D\. Van Dalen \(2013\) Logic and Structure\. Universitext, Springer, London\. ISBN 978\-1\-4471\-4557\-8, 978\-1\-4471\-4558\-5\. [Link](https://link.springer.com/10.1007/978-1-4471-4558-5), [Document](https://dx.doi.org/10.1007/978-1-4471-4558-5)
- \[15\] J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. H\. Chi, Q\. V\. Le, and D\. Zhou \(2024\) Chain\-of\-thought prompting elicits reasoning in large language models\. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA\. ISBN 9781713871088\.
- \[16\] J\. Xu, H\. Fei, L\. Pan, Q\. Liu, M\. Lee, and W\. Hsu \(2024\) Faithful logical reasoning via symbolic chain\-of\-thought\. arXiv:2405\.18357\. [Link](https://arxiv.org/abs/2405.18357)
- \[17\] S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. L\. Griffiths, Y\. Cao, and K\. R\. Narasimhan \(2023\) Tree of thoughts: deliberate problem solving with large language models\. In Thirty\-seventh Conference on Neural Information Processing Systems\. [Link](https://openreview.net/forum?id=5Xc1ecxO1h)
- \[18\]X\. Zhao, M\. Li, W\. Lu, C\. Weber, J\. H\. Lee, K\. Chu, and S\. Wermter\(2024\-05\)Enhancing zero\-shot chain\-of\-thought reasoning in large language models through logic\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\),N\. Calzolari, M\. Kan, V\. Hoste, A\. Lenci, S\. Sakti, and N\. Xue \(Eds\.\),Torino, Italia,pp\. 6144–6166\.External Links:[Link](https://aclanthology.org/2024.lrec-main.543)Cited by:[§2](https://arxiv.org/html/2605.14053#S2.p10.1)\.
- \[19\]L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica\(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.InAdvances in Neural Information Processing Systems,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),Vol\.36,pp\. 46595–46623\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2023/file/91f18a1287b398d378ef22505bf41832-Paper-Datasets_and_Benchmarks.pdf)Cited by:[§5](https://arxiv.org/html/2605.14053#S5.p1.1)\.

## Appendix A: Few\-shot example

Hypotheses: \{retrieved chunks removed for brevity, enumerated from 1 to 3\}

User question: What is the meaning of a credit and how many hours of weekly study should be dedicated to a 13\-credit course?

Extract \| 2 \| A credit is a measure of the dedication required for a subject\. It is assumed that if a subject has more credits, it requires more hours of dedication\. \| Not a final answer

New hypothesis: a\. \{Previous conclusion is repeated, removed for brevity\}

Extract \| 2 \| A credit approximately equals one hour of weekly study, throughout an entire semester\. \| Not a final answer

New hypothesis: b\. \{Previous conclusion is repeated, removed for brevity\}

Instantiate \| b \| A credit approximately equals one hour of weekly study, throughout an entire semester\. Therefore, a subject with 13 credits implies an approximate dedication of 13 hours per week\. \| Not a final answer

New hypothesis: c\. \{Previous conclusion is repeated, removed for brevity\}

Concat \| a,c \| A credit is a measure of the dedication required for a subject\. It is assumed that if a subject has more credits, it requires more hours of dedication\. A credit approximately equals one hour of weekly study, throughout an entire semester\. Therefore, a subject with 13 credits implies an approximate dedication of 13 hours per week\. \| Final answer
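The trace above follows a regular layout: each step reads `Rule | premises | conclusion | status`, and each non-final conclusion becomes a new hypothesis labelled a, b, c, and so on. The following is a minimal sketch of how such a trace could be parsed and checked, so that every rule is applied only to hypotheses that already exist. It is an illustration based on the layout of this example, not the authors' implementation. The names `Step`, `parse_step`, and `build_derivation` are hypothetical, the conclusions are shortened paraphrases of the ones above, and the sketch assumes a conclusion never contains a `|` character.

```python
from dataclasses import dataclass


@dataclass
class Step:
    rule: str            # e.g. "Extract", "Instantiate", "Concat"
    premises: list[str]  # labels of the hypotheses the rule is applied to
    conclusion: str
    is_final: bool


def parse_step(line: str) -> Step:
    """Parse one 'Rule | premises | conclusion | status' line (assumed layout)."""
    rule, premises, conclusion, status = (part.strip() for part in line.split("|"))
    return Step(
        rule=rule,
        premises=[p.strip() for p in premises.split(",")],
        conclusion=conclusion,
        is_final=status.lower() == "final answer",
    )


def build_derivation(lines: list[str], n_chunks: int) -> dict[str, Step]:
    """Label non-final conclusions a, b, c, ... and check that premises exist."""
    known = {str(i) for i in range(1, n_chunks + 1)}  # retrieved chunks 1..n
    derived: dict[str, Step] = {}
    for line in lines:
        step = parse_step(line)
        missing = [p for p in step.premises if p not in known]
        if missing:
            raise ValueError(f"{step.rule} uses unknown hypotheses: {missing}")
        label = "answer" if step.is_final else chr(ord("a") + len(derived))
        derived[label] = step
        known.add(label)
        if step.is_final:
            break  # the derivation ends at the first final answer
    return derived


# Shortened version of the trace in this appendix (conclusions abbreviated).
steps = [
    "Extract | 2 | A credit measures required dedication. | Not a final answer",
    "Extract | 2 | A credit is about one weekly study hour. | Not a final answer",
    "Instantiate | b | 13 credits imply about 13 weekly hours. | Not a final answer",
    "Concat | a,c | A credit measures required dedication. "
    "13 credits imply about 13 weekly hours. | Final answer",
]
tree = build_derivation(steps, n_chunks=3)
for label, step in tree.items():
    print(label, step.rule, step.premises)
```

Because each step records the labels it was derived from, the result can be read as a tree rooted at the final answer, and a step that cites a nonexistent hypothesis is rejected before it can contribute to the answer.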
