EverydayGPT: Confidence-Gated Routing for Efficient and Safe Hybrid GPT-RAG Conversational QA

arXiv cs.CL Papers

Summary

EverydayGPT introduces Confidence-Gated Routing (CGR), a mechanism that decides per query whether to answer via RAG extraction, direct GPT generation, or refusal. It resolves 85% of queries on the fast RAG path, a more than 120x latency reduction on those queries, while maintaining answer quality on a 500-question benchmark.

arXiv:2606.11212v1 Announce Type: new Abstract: Standard Retrieval-Augmented Generation (RAG) pipelines route every query through retrieval and generation unconditionally, incurring unnecessary computation and propagating low-quality context to the generator. We introduce EverydayGPT, a lightweight conversational QA system built around a Confidence-Gated Routing (CGR) mechanism that formalises the routing decision as a joint policy over retrieval distance and extraction adequacy. The backbone is a 205M-parameter GPT trained from scratch on 10B tokens of FineWeb-Edu. CGR avoids invoking the costly GPT pathway (~5.9s) for 85 percent of queries by resolving them via fast RAG extraction (~45 ms), yielding over 120x latency reduction on the majority of queries while maintaining answer quality. On a 500-question in-domain benchmark, the system achieves F1 = 0.226 +/- 0.004 compared to 0.171 for GPT-only and 0.210 for unconditional RAG. Gains over strong baselines are modest but consistent, while efficiency improvements are substantial (6.3x mean latency reduction). A structured grounding audit finds no unsupported claims in the sampled set, with explicit scope limitations. We position this work as a study of routing strategies under resource constraints rather than a claim of state-of-the-art performance.

Source: [https://arxiv.org/html/2606.11212](https://arxiv.org/html/2606.11212)
###### Abstract

Standard Retrieval-Augmented Generation (RAG) pipelines route every query through retrieval and generation unconditionally, incurring unnecessary computation and propagating low-quality context to the generator. We introduce EverydayGPT, a lightweight conversational QA system built around a Confidence-Gated Routing (CGR) mechanism that formalises the routing decision as a joint policy over retrieval distance and extraction adequacy: $\pi\colon \mathcal{Q}\times\mathcal{D}\to\{\textsc{rag},\textsc{gpt},\textsc{refuse}\}$. This is strictly distinct from output-level abstention methods, which defer *after* a full forward pass, and from distance-only RAG filtering, which ignores answer responsiveness. The backbone is a 205 M-parameter GPT trained from scratch on 10 B tokens of FineWeb-Edu (pretraining loss $4.21 \to 2.84$), avoiding dependence on proprietary weights. The primary contribution of this work is the routing architecture itself: CGR avoids invoking the costly GPT pathway ($\approx 5.9$ s) for 85 % of queries by resolving them via fast RAG extraction ($\approx 45$ ms), yielding a more than $120\times$ latency reduction on that majority while maintaining answer quality. On a 500-question in-domain benchmark, CGRAG achieves F1 $= 0.226 \pm 0.004$ vs. F1 $= 0.171$ for GPT-only and F1 $= 0.198$ for unconditional dense RAG. Gains over GPT-only are large and significant ($+0.055$, $p < 0.001$, Wilcoxon signed-rank). Gains over the strongest comparable baseline, LangChain unconditional RAG (F1 $= 0.210$), are modest but consistent ($+0.016$). A structured grounding audit on 300 in-domain samples finds no responses containing claims unsupported by retrieved context under a five-category annotation protocol ($\kappa = 0.81$); scope limitations of this result are discussed explicitly.

The full system runs at sub-6 s mean latency on consumer CPU with $<2$ GB memory. All code and evaluation scripts are publicly released. We position this work as a study of routing strategies under resource constraints rather than a claim of state-of-the-art performance.

## 1\. Introduction

Retrieval-Augmented Generation (RAG) \[[9](https://arxiv.org/html/2606.11212#bib.bib9)\] has become the dominant paradigm for grounding generative language models in external knowledge, substantially reducing hallucination compared to purely parametric generation \[[7](https://arxiv.org/html/2606.11212#bib.bib7),[15](https://arxiv.org/html/2606.11212#bib.bib15)\]. Despite this success, standard RAG deployments share a critical architectural assumption: retrieval and generation are applied *unconditionally* for every query, regardless of whether the retrieved context is informative or whether the extracted answer is adequate. This assumption has two practical consequences:

- Wasted computation. Invoking a generative model for queries that a simple extraction step would answer correctly is expensive, especially under CPU inference constraints.
- Quality degradation. Passing low-quality retrieved context to the generator without a quality gate can produce worse outputs than refusing or routing differently.

We address both problems by introducing Confidence-Gated Routing (CGR), a routing policy that makes an explicit decision at inference time, before expensive generation is committed, based on the joint quality of retrieval and extraction. Our system, EverydayGPT, implements CGR over a custom-trained 205 M-parameter GPT and a FAISS-based dense retrieval index.

#### The central claim\.

The primary contribution of this work is *not* a large accuracy gain over strong large-model baselines; we do not claim to surpass systems with orders-of-magnitude more parameters. Instead, the contribution is an efficiency-safety architecture: a formally defined routing policy that achieves answer quality comparable to or better than unconditional RAG while avoiding GPT inference cost for 85 % of queries ($120\times$ latency reduction on those queries), providing an explicit safe-refusal pathway for out-of-domain inputs, and running entirely on consumer CPU hardware. We believe this is a practically useful contribution for resource-constrained deployment settings that the NLP community has not fully addressed.

#### Contributions\.

- C1: A formally defined three-way routing policy $\pi\colon \mathcal{Q}\times\mathcal{D}\to\{\textsc{rag},\textsc{gpt},\textsc{refuse}\}$ conditioned on *joint* retrieval distance and extraction confidence, distinct from output-level abstention and distance-only filtering.
- C2: A 205 M-parameter GPT trained end-to-end without pretrained weights, with pretraining loss curves confirming stable convergence. The base model is evaluated against GPT-2 Small on WikiText-103 and PTB.
- C3: Empirical evaluation against eight baselines with bootstrap confidence intervals, Wilcoxon significance tests, threshold sensitivity analysis, a structured grounding audit, and out-of-domain evaluation on Natural Questions and TriviaQA.
- C4: A fully deployed CPU-runnable system ($<2$ GB, sub-6 s latency) with public release of all code and evaluation infrastructure.

## 2\. Related Work

#### Retrieval\-Augmented Generation\.

RAG \[[9](https://arxiv.org/html/2606.11212#bib.bib9)\] substantially reduces hallucination on knowledge-intensive tasks \[[15](https://arxiv.org/html/2606.11212#bib.bib15)\]. Dense Passage Retrieval \[[8](https://arxiv.org/html/2606.11212#bib.bib8)\] improves recall via bi-encoder retrieval; hybrid sparse-dense pipelines further improve coverage. Fusion-in-Decoder (FiD) \[[6](https://arxiv.org/html/2606.11212#bib.bib6)\] encodes retrieved passages independently and fuses them in the decoder. REALM \[[4](https://arxiv.org/html/2606.11212#bib.bib4)\] jointly trains retrieval and generation. Production frameworks such as LangChain and Haystack provide RAG pipelines but apply retrieval unconditionally. CGR is architecturally distinct from all of these: it treats the routing decision as a first-class operation conditioned on joint uncertainty *before* generation is invoked.

#### Confidence, Abstention, and Selective Prediction\.

Calibration research \[[3](https://arxiv.org/html/2606.11212#bib.bib3)\] motivates models that express reliable uncertainty. SQuAD 2.0 \[[13](https://arxiv.org/html/2606.11212#bib.bib13)\] introduced unanswerable questions, prompting output-level abstention. Selective prediction \[[2](https://arxiv.org/html/2606.11212#bib.bib2)\] defers when model output confidence is low. These methods condition the abstention decision on $P(y \mid x)$, which requires a full forward pass, and operate only at the output level. CGR extends this principle *upstream*: the routing decision is made before generation, conditioning on retrieval quality, and avoids committing compute to an unreliable generation path. This is the key architectural distinction.

#### Autoregressive Language Models\.

The transformer architecture \[[16](https://arxiv.org/html/2606.11212#bib.bib16)\] and GPT-family models \[[11](https://arxiv.org/html/2606.11212#bib.bib11),[1](https://arxiv.org/html/2606.11212#bib.bib1)\] established the autoregressive paradigm. Our backbone occupies the same parameter regime as GPT-2 (117–345 M), but is trained on FineWeb-Edu \[[10](https://arxiv.org/html/2606.11212#bib.bib10)\], a curated educational corpus better aligned with our target domain than web-crawl data. We compare directly to GPT-2 Small on standard benchmarks to contextualise base model quality.

#### Scope relative to large models\.

We do not compare against GPT\-4, Llama 2/3, or Mistral, as these require GPU inference infrastructure incompatible with our CPU deployment constraint\. This is an explicit limitation, not an oversight\. Our system is designed for the resource\-constrained setting where large models are inaccessible, a practically important but understudied scenario\. The comparison set is intentionally matched to our scale and deployment context\.

## 3\. System Architecture

EverydayGPT integrates three modules (a GPT backbone, a FAISS retrieval pipeline, and the CGR) into a unified inference stack. Figure [1](https://arxiv.org/html/2606.11212#S3.F1) illustrates the routing flow with per-block latency annotations.

Figure 1: EverydayGPT inference pipeline. User Query → Query Embedding (all-MiniLM-L6-v2, ~8 ms) → FAISS Search (top-$k = 10$, L2, ~12 ms) → gate $d_{\min} \leq \delta$? (no: Safe Refusal, out-of-domain) → Context Assembly (dedup, 800 tok) → Extraction + Confidence $c$ → gate $c \geq \tau$? (yes: RAG Answer, ~45 ms total; no: GPT Answer, ~5.9 s) → Final Response. The CGR gate at each diamond makes an explicit routing decision *before* generation is committed. On 85 % of queries the RAG path resolves the query at ~45 ms, avoiding the ~5.9 s GPT forward pass entirely.

## 4\. GPT Model

### 4\.1 Architecture

The backbone is a standard causal GPT with Pre-LN layer normalisation \[[17](https://arxiv.org/html/2606.11212#bib.bib17)\], GELU activations, and $4\times$ FFN expansion. Configuration is in Table [1](https://arxiv.org/html/2606.11212#S4.T1).

Table 1: GPT model configuration.

### 4\.2 Training

The model is pretrained on FineWeb-Edu \[[10](https://arxiv.org/html/2606.11212#bib.bib10)\] using AdamW ($\mathrm{lr} = 10^{-4}$, cosine decay, 500 warmup steps), batch size 32 with gradient accumulation ($\times 4$), on an NVIDIA Tesla P4 GPU (8 GB VRAM) for 48–72 h across Kaggle sessions. Selective loss masking during instruction fine-tuning computes gradients only over response tokens, preventing template memorisation.
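The selective loss masking described above can be illustrated with a minimal sketch. It is written in plain Python rather than the training framework, and the function name `masked_nll`, the mask encoding (1 for response tokens, 0 for template tokens), and the example values are all assumptions for illustration, not the released training code.

```python
import math

def masked_nll(token_logprobs, response_mask):
    """Mean negative log-likelihood over response tokens only.

    token_logprobs: log-probability assigned to each target token.
    response_mask:  1 for response tokens, 0 for prompt/template tokens
                    (this encoding is an assumption of the sketch).
    """
    if len(token_logprobs) != len(response_mask):
        raise ValueError("logprobs and mask must have the same length")
    n_response = sum(response_mask)
    if n_response == 0:
        return 0.0  # no response tokens, so nothing contributes to the loss
    total = -sum(lp * m for lp, m in zip(token_logprobs, response_mask))
    return total / n_response

# Hypothetical example: 3 template tokens followed by 2 response tokens.
logprobs = [math.log(0.9), math.log(0.8), math.log(0.7),
            math.log(0.5), math.log(0.25)]
mask = [0, 0, 0, 1, 1]
loss = masked_nll(logprobs, mask)
```

Because the template positions are multiplied by zero, changing the model's predictions on those tokens leaves the loss unchanged, which is the property that discourages template memorisation.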

#### Loss convergence\.

Figure [2](https://arxiv.org/html/2606.11212#S4.F2) shows pretraining loss decreasing from 4.21 to 2.84 over 10 B tokens without divergence, confirming stable training despite session-based checkpointing.

Figure 2 (training loss against tokens seen, in billions): Pretraining loss converges from 4.21 to 2.84 over 10 B tokens, confirming stable training on consumer-grade GPU hardware.

#### Base model quality\.

Table [2](https://arxiv.org/html/2606.11212#S4.T2) compares perplexity against GPT-2 Small (117 M). Our model achieves lower perplexity on both benchmarks, consistent with its larger size and domain-specialised pretraining corpus.

Table 2: Perplexity vs. GPT-2 Small. Lower is better.

### 4\.3 Inference

Generation uses top-$k$ sampling ($k = 50$, $\tau = 0.4$) with a sliding-window 3-gram repetition detector \[[5](https://arxiv.org/html/2606.11212#bib.bib5)\].
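As an illustration of what a sliding-window 3-gram repetition detector might look like, here is a minimal sketch. The window size of 64 tokens and the function name `has_repeated_trigram` are assumptions; the paper does not state the window length or what the decoder does once a repeat is found.

```python
def has_repeated_trigram(token_ids, window=64):
    """Return True if any 3-gram occurs twice within the last `window` tokens.

    `window=64` is an illustrative default, not a value from the paper.
    """
    recent = token_ids[-window:]
    seen = set()
    for i in range(len(recent) - 2):
        gram = tuple(recent[i:i + 3])
        if gram in seen:
            return True
        seen.add(gram)
    return False
```

A decoding loop could call this after each sampled token and, for example, resample or stop when it returns `True`; which reaction the system uses is not specified in the text.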

## 5\. Retrieval Pipeline

Documents are encoded offline with all-MiniLM-L6-v2 \[[14](https://arxiv.org/html/2606.11212#bib.bib14)\] into 384-dim embeddings indexed in FAISS IndexFlatL2 ($\mathcal{O}(Nd)$ retrieval). At inference, the top-$k = 10$ neighbours are retrieved (~12 ms), filtered by distance and token count, deduplicated by 120-character prefix fingerprinting, and truncated to 800 tokens. A rule-guided sentence ranker classifies question type (factoid, definitional, temporal, causal, yes/no) and scores candidates by keyword overlap and type-specific signals, running in $\mathcal{O}(S \cdot |q|)$.
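The deduplication and truncation steps can be sketched as follows. This is a simplified illustration under stated assumptions: passages arrive already sorted by ascending distance, tokens are approximated by whitespace splitting, and the fingerprint is the lower-cased, stripped 120-character prefix. The function name `assemble_context` is hypothetical.

```python
def assemble_context(passages, max_tokens=800, prefix_len=120):
    """Deduplicate passages by prefix fingerprint, then cap the token budget."""
    seen = set()
    kept_tokens = []
    for text in passages:
        # Fingerprint normalisation (strip + lower) is an assumption.
        fingerprint = text[:prefix_len].strip().lower()
        if fingerprint in seen:
            continue  # near-duplicate of an earlier, closer passage
        seen.add(fingerprint)
        kept_tokens.extend(text.split())  # whitespace tokens: an assumption
        if len(kept_tokens) >= max_tokens:
            break
    return " ".join(kept_tokens[:max_tokens])
```

Prefix fingerprinting is cheap because it never compares full passages, at the cost of missing duplicates that differ only in their first 120 characters.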

## 6\. Confidence\-Gated Routing

### 6\.1 Formal Routing Policy

###### Definition 1 (Routing Policy).

Let $\mathcal{Q}$ be the query space and $\mathcal{D}$ the retrieved document space. The CGR policy is:

$$\pi\colon \mathcal{Q}\times\mathcal{D} \;\longrightarrow\; \{\textsc{rag},\; \textsc{gpt},\; \textsc{refuse}\}$$

parameterised by retrieval distance $d_{\min} = \min_{i} d_{i}$ and extraction confidence $c \in [0,1]$.

###### Definition 2 (Decision Rule).

Given distance ceiling $\delta$ and confidence floor $\tau$:

$$\pi(q,D) = \begin{cases} \textsc{refuse} & d_{\min} > \delta \\ \textsc{rag} & d_{\min} \leq \delta \;\wedge\; c \geq \tau \\ \textsc{gpt} & d_{\min} \leq 1.0 \;\wedge\; c < \tau \\ \textsc{refuse} & \text{otherwise} \end{cases}$$
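Definition 2 translates directly into a small pure function. The sketch below uses $\tau = 0.5$ and $\delta = 1.5$ as defaults because those values appear elsewhere in the paper; the names `Route`, `route`, and `gpt_ceiling` (for the fixed 1.0 bound in the third case) are illustrative.

```python
from enum import Enum

class Route(Enum):
    RAG = "rag"
    GPT = "gpt"
    REFUSE = "refuse"

def route(d_min, c, tau=0.5, delta=1.5, gpt_ceiling=1.0):
    """Three-way decision rule of Definition 2, evaluated case by case."""
    if d_min > delta:
        return Route.REFUSE   # out-of-domain: nothing close enough in the index
    if c >= tau:
        return Route.RAG      # extraction is adequate, skip generation
    if d_min <= gpt_ceiling:
        return Route.GPT      # close context but weak extraction
    return Route.REFUSE       # 1.0 < d_min <= delta with low confidence
```

Note the band $1.0 < d_{\min} \leq \delta$: a query there can still be answered by extraction when confidence is high, but it is refused rather than sent to GPT when confidence is low.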

#### What makes CGR novel\.

We formalise routing as a *joint decision over retrieval and answer adequacy*, rather than treating retrieval and generation independently as in all prior RAG systems. Output-level abstention \[[13](https://arxiv.org/html/2606.11212#bib.bib13),[2](https://arxiv.org/html/2606.11212#bib.bib2)\] conditions on $P(y \mid x)$ after a full forward pass. Distance-only RAG filtering \[[9](https://arxiv.org/html/2606.11212#bib.bib9)\] uses $d_{\min}$ alone, ignoring whether the extracted answer is responsive. To our knowledge, CGR is among the first to condition the routing decision on the *joint signal* $(d_{\min}, c)$, enabling early termination before generation and finer-grained discrimination between out-of-domain queries (high $d_{\min}$), adequate extraction (high $c$), and inadequate extraction (low $c$, fallback to GPT). The practical effect is that generation cost is paid only when actually needed.

### 6\.2 Confidence Score

$$c = \min\!\left(1.0,\; \frac{|w|}{25} \cdot 0.3 + \mathrm{ovlp}(q,a) \cdot 0.4 + \eta \cdot 0.3\right) \tag{1}$$

where $|w|$ is answer word count, $\mathrm{ovlp}(q,a)$ is keyword overlap, and $\eta \in \{0.3, 1.0, 1.5\}$ is a type-correctness bonus. The feature weights were selected by grid search over $\tau \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$ on a held-out 50-question development set.

We acknowledge that Eq. [1](https://arxiv.org/html/2606.11212#S6.E1) is a weighted heuristic, not a probabilistically calibrated score \[[3](https://arxiv.org/html/2606.11212#bib.bib3)\]. This is an intentional design choice under the constraint that the routing decision must run in $<1$ ms (the RAG pathway latency budget). A learned confidence estimator would be more principled and is identified as the most important direction for future work.
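A minimal sketch of Eq. 1 follows. The paper does not define $\mathrm{ovlp}(q,a)$ precisely, so this sketch assumes it is the fraction of distinct question terms that also appear in the answer, after lower-casing and stripping basic punctuation; the caller supplies $\eta$ because the type-correctness logic is not specified either.

```python
def confidence(answer, question, eta):
    """Heuristic confidence of Eq. 1; the overlap definition is an assumption."""
    words = answer.split()
    q_terms = {w.lower().strip(".,?!") for w in question.split()}
    a_terms = {w.lower().strip(".,?!") for w in words}
    q_terms.discard("")  # drop tokens that were pure punctuation
    a_terms.discard("")
    overlap = len(q_terms & a_terms) / len(q_terms) if q_terms else 0.0
    score = (len(words) / 25) * 0.3 + overlap * 0.4 + eta * 0.3
    return min(1.0, score)
```

With these weights, an empty answer with the lowest bonus ($\eta = 0.3$) scores $0.09$, while a long, fully overlapping answer with $\eta = 1.5$ exceeds 1 before clipping, which is why the outer $\min$ is needed.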

### 6\.3 Efficiency Analysis

The efficiency gain from routing is the central practical benefit of CGR. For a batch of $Q$ queries:

$$\text{Cost}_{\textsc{cgrag}} = Q \cdot T_{\text{RAG}} + \alpha Q \cdot T_{\text{GPT}} \tag{2}$$

where $T_{\text{RAG}} \approx 45$ ms, $T_{\text{GPT}} \approx 5900$ ms, and $\alpha = 0.15$ is the fraction of queries routed to GPT. This gives:

$$\text{Cost}_{\textsc{cgrag}} \approx Q \cdot (45 + 0.15 \times 5900) = Q \cdot 930\,\text{ms}$$

compared to $Q \cdot 5900$ ms for unconditional generation, a $\mathbf{6.3\times}$ mean latency reduction while maintaining the quality ceiling of GPT generation where it is needed.
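The arithmetic behind Eq. 2 can be checked in a few lines. The constants are the paper's reported values; the variable and function names are illustrative.

```python
T_RAG_MS = 45.0    # retrieval + extraction, paid by every query
T_GPT_MS = 5900.0  # GPT generation, paid only by the routed fraction
ALPHA = 0.15       # fraction of queries routed to GPT

def mean_latency_ms(alpha=ALPHA, t_rag=T_RAG_MS, t_gpt=T_GPT_MS):
    """Per-query mean cost from Eq. 2 (the batch cost divided by Q)."""
    return t_rag + alpha * t_gpt

mean_ms = mean_latency_ms()              # 45 + 0.15 * 5900 = 930 ms
mean_speedup = T_GPT_MS / mean_ms        # about 6.3x vs. unconditional GPT
rag_path_speedup = T_GPT_MS / T_RAG_MS   # about 131x on RAG-resolved queries
```

The two speedups answer different questions: the roughly 131-fold figure applies only to the 85 % of queries resolved by extraction (consistent with the "more than 120×" claim), whereas the 6.3-fold figure is the average over all queries, dominated by the 15 % that still pay for generation.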

### 6\.4 Routing Algorithm

Algorithm 1: Confidence-Gated Routing (CGR)

Input: query $q$, threshold $\tau$, ceiling $\delta = 1.5$

1. $\mathcal{D} \leftarrow \textsc{FaissSearch}(q, k = 10)$ {$\mathcal{O}(Nd)$, ~12 ms}
2. if $\min_{i} d_{i} > \delta$ then
3. return Refuse {out-of-domain}
4. end if
5. $\mathrm{ctx} \leftarrow \textsc{Assemble}(\mathcal{D})$
6. $a \leftarrow \textsc{Extract}(q, \mathrm{ctx})$ {$\mathcal{O}(S|q|)$, ~20 ms}
7. $c \leftarrow \textsc{Confidence}(a, q)$ {Eq. [1](https://arxiv.org/html/2606.11212#S6.E1)}
8. if $a \neq \emptyset$ and $c \geq \tau$ then
9. return $a$ {RAG path, total ~45 ms}
10. else if $\min_{i} d_{i} \leq 1.0$ then
11. return $\textsc{GptGenerate}(\mathrm{ctx}, q)$ {GPT path, ~5.9 s}
12. else
13. return Refuse
14. end if
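Algorithm 1 can be sketched as an orchestration function whose expensive steps are injected as callables, which makes the early-termination property easy to see and to test with stubs. The function name `cgr_answer`, the `(distance, passage)` hit format, the plain string join standing in for context assembly, and the choice to refuse on an empty result list are assumptions of this sketch, not details from the released code.

```python
def cgr_answer(query, search, extract, score, generate, tau=0.5, delta=1.5):
    """Follow Algorithm 1; returns a (route, answer) pair.

    search(query)        -> list of (distance, passage) pairs
    extract(query, ctx)  -> extracted answer string, or "" if none
    score(answer, query) -> confidence in [0, 1], e.g. Eq. 1
    generate(ctx, query) -> generated answer (the expensive path)
    """
    hits = search(query)
    if not hits:
        return ("refuse", None)          # assumption: empty result means refuse
    d_min = min(d for d, _ in hits)
    if d_min > delta:
        return ("refuse", None)          # out-of-domain
    ctx = " ".join(p for _, p in hits)   # stand-in for dedup + truncation
    answer = extract(query, ctx)
    c = score(answer, query) if answer else 0.0
    if answer and c >= tau:
        return ("rag", answer)           # generate() is never called here
    if d_min <= 1.0:
        return ("gpt", generate(ctx, query))
    return ("refuse", None)
```

Because `generate` is only reached on the third branch, a stub that records its calls is enough to verify that the RAG and refusal paths never pay generation cost.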

## 7\. Experiments

### 7\.1 Benchmark and Metrics

We evaluate on a 500-question in-domain SQuAD-derived benchmark spanning six categories aligned with our pretraining corpus: Computer Science (125), Mathematics (125), General Science (63), Machine Learning (63), RAG/IR (62), and NLP (62). We report token-level F1 \[[12](https://arxiv.org/html/2606.11212#bib.bib12)\] and ROUGE-L as primary metrics, with bootstrap 95 % CIs (1000 resamples) and Wilcoxon signed-rank tests. Exact Match (EM) is reported for completeness only: as a generative system producing full-sentence responses, EM = 0 throughout is expected and does not indicate factual incorrectness; F1 is the appropriate primary metric.
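To show why a full-sentence answer can have EM = 0 yet non-zero F1, here is a simplified sketch of SQuAD-style token-level F1. It normalises only by lower-casing and whitespace splitting; the official SQuAD script additionally strips punctuation and articles, so scores from this sketch are not directly comparable to the paper's numbers.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 with simplified normalisation (lower-case + split)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction, reference):
    """1.0 only when the normalised token sequences are identical."""
    return float(prediction.lower().split() == reference.lower().split())

# Hypothetical example: a correct full-sentence answer vs. a one-word span.
f1 = token_f1("the capital of france is paris", "paris")
em = exact_match("the capital of france is paris", "paris")
```

In this example recall is 1 but precision is 1/6, so F1 is 2/7 while EM is 0. This also illustrates why F1 stays low for verbose but correct answers.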

### 7\.2 Baselines

All baselines share the same retrieval index and GPT checkpoint:

1. GPT-only: Pure parametric generation, no retrieval.
2. GPT-2 Small (117M): Same-scale public model \[[11](https://arxiv.org/html/2606.11212#bib.bib11)\].
3. BM25: Okapi BM25 sparse retrieval.
4. FAISS dense (unconditional): Dense retrieval, no routing.
5. BM25+Dense hybrid: Score interpolation ($\lambda = 0.5$).
6. LangChain RAG: Unconditional retrieve-and-generate using the same index and GPT backbone; the strongest directly comparable baseline.
7. RAG-Only ($\tau = 1.0$): Never invokes GPT.
8. GPT-Dominant ($\tau = 0.1$): Almost always invokes GPT.

We explicitly do not compare against large language models (GPT-4, Llama, Mistral) because they require GPU inference infrastructure incompatible with our CPU deployment setting. This is a stated hardware constraint, not selective avoidance of stronger baselines. Our work targets the resource-constrained deployment scenario specifically; large model comparisons are orthogonal to this research question.

## 8\. Results

### 8\.1 Aggregate Performance

Table 3: CGRAG Hybrid aggregate results ($\tau = 0.50$).

### 8\.2 Baseline Comparison

Table 4: Full baseline comparison. $\dagger$: $p < 0.05$; $\ddagger$: $p < 0.001$ vs. CGRAG, Wilcoxon signed-rank, bootstrap 95 % CI.

#### Interpreting the margins\.

CGRAG achieves the best F1 and ROUGE-L across all baselines. We distinguish two regimes of improvement:

- Large, significant gains: vs. GPT-only ($+0.055$, $p < 0.001$), GPT-2 Small ($+0.068$), BM25 ($+0.037$), and unconditional FAISS dense RAG ($+0.028$). These gaps are large relative to CI width and confirm that retrieval grounding and routing together substantially outperform generation-only and simpler retrieval approaches.
- Modest, consistent gains: vs. LangChain RAG ($+0.016$) and RAG-Only ($+0.002$). We report these conservatively: the gains are statistically significant but small. Their practical value is not the F1 delta itself; it is that CGRAG achieves this quality at $6.3\times$ lower mean latency than unconditional generation (Eq. [2](https://arxiv.org/html/2606.11212#S6.E2)), with an explicit safety valve for out-of-domain queries that LangChain and RAG-Only lack entirely.

### 8\.3 Efficiency and Routing Benefit

The efficiency argument is the primary practical contribution and deserves direct quantification. Figure [3](https://arxiv.org/html/2606.11212#S8.F3) shows the latency decomposition across routing pathways.

Figure 3: Latency comparison (log scale, in ms; bars for GPT-only, GPT path, Mean (CGRAG), and RAG path). The RAG path resolves 85 % of queries at ~45 ms. Mean CGRAG latency is 930 ms, a $6.3\times$ reduction vs. unconditional GPT at 5900 ms. Full GPT path (15 % of queries) matches unconditional latency.

### 8\.4 Ablation Study

Table 5: Ablation study. $^{*}p < 0.05$, Wilcoxon signed-rank.

The ablation confirms that hybrid routing consistently outperforms both single-modality extremes. The gains over GPT-Dominant are modest but reliable ($p < 0.05$). The more important observation is that CGRAG achieves the quality of RAG-Only at substantially lower latency whenever the RAG path is sufficient, and falls back to GPT generation only when extraction confidence is genuinely low.

### 8\.5 Per\-Category Analysis

Table 6: Per-category F1 and ROUGE-L (CGRAG).

Computer Science achieves the highest F1 (0.330), reflecting alignment between FineWeb-Edu and CS terminology. NLP and RAG/IR score lowest (0.076 and 0.141), as these domains require precise technical vocabulary that the model paraphrases rather than reproduces exactly.

Figure 4 (bar chart of F1 and ROUGE-L scores for CS, Sci, Math, ML, RAG, and NLP): Per-category F1 and ROUGE-L for CGRAG Hybrid.

### 8\.6 Threshold Sensitivity

Figure 5: Threshold sensitivity (F1, refusal rate, and latency in seconds against threshold $\tau$): $\tau^{*} \approx 0.5$ is the stable operating point, with peak F1 and near-zero refusal rate. Refusal rises sharply for $\tau > 0.7$.

The sensitivity curve shows a stable operating region at $\tau^{*} \approx 0.4$–$0.5$. The F1 variation across the full range $[0.1, 0.9]$ is modest (0.213–0.226), indicating the system is not brittle to threshold choice in the in-domain setting.

### 8\.7 Grounding Audit

#### Protocol\.

We sampled 300 responses uniformly from the evaluation set. Two annotators, blind to system configuration, independently classified each response across five error categories: (1) unsupported factual claim; (2) fabricated named entity; (3) wrong number or date; (4) fabricated citation; (5) semantic distortion relative to retrieved context. Inter-annotator agreement: $\kappa = 0.81$ (substantial).

Table 7: Grounding audit results (300 in-domain samples, $\kappa = 0.81$).

#### Scope and limitations of this result\.

No grounding errors were observed in this sampled set; however, given the limited sample size (300 questions), this should not be interpreted as zero-error behaviour in general. Three important limitations bound this result: (1) the annotated set is *in-domain*, meaning retrieved context closely matches query topics, so unsupported claims are inherently less likely than in open-domain settings; (2) the annotation taxonomy operationalises grounding in a specific way; other definitions may yield different rates; and (3) 300 samples provide limited statistical power to detect rare events. We interpret this as evidence that CGR grounding is effective within this in-domain protocol, and explicitly do not generalise it as a universal grounding guarantee. Out-of-domain grounding is a critical open question addressed in §[8.8](https://arxiv.org/html/2606.11212#S8.SS8).
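The limited-power caveat can be made concrete with a short calculation. Assuming the 300 audited responses are independent draws with a common error rate (an assumption of this sketch, not a claim from the paper), observing zero errors is still compatible at the 95 % level with any true error rate up to the bound computed below. The function name is illustrative.

```python
def zero_event_upper_bound(n, confidence=0.95):
    """Upper bound on the error rate given 0 errors in n independent samples.

    Solves (1 - p)^n = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n)

bound = zero_event_upper_bound(300)  # just under 0.01, i.e. about 1 %
```

This matches the familiar "rule of three" approximation $3/n = 3/300 = 1\,\%$: the audit cannot distinguish a truly error-free system from one that makes a grounding error on roughly one response in a hundred.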

### 8\.8 Out\-of\-Domain Evaluation

Table 8: Out-of-domain evaluation on NQ and TriviaQA (200 questions each). Distribution differs from FineWeb-Edu pretraining corpus.

The routing advantage persists on both OOD datasets, with reduced margin relative to in-domain performance as expected given index-distribution mismatch. The refusal mechanism correctly escalates for OOD queries (6–12 % refusal rate vs. 0 % in-domain), demonstrating that the distance gate generalises as intended. Full OOD generalisation requires index expansion, identified as a primary future direction.

### 8\.9 Error Analysis

Table 9: Representative failure cases.

Wrong retrievals near the distance boundary suggest a secondary re-ranking step would help. False refusals indicate $\delta = 1.5$ is slightly aggressive; joint tuning of $(\delta, \tau)$ is a near-term improvement. The bulk of EM $= 0$ cases are GPT paraphrase outputs that are semantically correct but not verbatim spans.

## 9\. Discussion

#### The efficiency\-safety framing\.

The primary contribution of CGRAG is better understood as an efficiency-safety architecture than as an accuracy improvement. Compared to unconditional generation, it reduces mean latency by $6.3\times$ while maintaining quality (F1 0.226 vs. 0.171 for GPT-only). Compared to unconditional RAG pipelines (LangChain), it adds an explicit routing policy that provides a principled refusal pathway and avoids passing low-quality context to the generator, something no standard RAG framework provides. These properties are valuable in production settings regardless of whether the F1 delta is large.

#### Honest characterisation of gains\.

F1 gains over LangChain RAG ($+0.016$) and RAG-Only ($+0.002$) are modest. We report these transparently rather than overstating them. The statistical significance ($p < 0.05$) is meaningful, but practitioners should weight the efficiency and safety properties as the primary reasons to adopt CGR over a simpler RAG pipeline, not the accuracy margin alone.

#### Limitations\.

(1) *Scale*: no comparison against large models due to hardware constraints. (2) *Confidence heuristic*: Eq. [1](https://arxiv.org/html/2606.11212#S6.E1) uses fixed weights; a learned calibrator is the most important single improvement. (3) *Multi-hop*: synthesis across multiple documents is not supported. (4) *Grounding audit scope*: in-domain only; OOD grounding not measured. (5) *Evaluation scope*: OOD F1 margins are small; wider evaluation is needed.

## 10\. Conclusion

We presented EverydayGPT, a hybrid GPT–RAG system unified under a formally defined Confidence-Gated Routing policy. The core contribution is the routing architecture: by conditioning the inference decision jointly on retrieval distance and extraction confidence, before generation is committed, CGR avoids GPT inference cost for 85 % of queries ($6.3\times$ mean latency reduction), provides an explicit refusal pathway for out-of-domain inputs, and maintains or improves answer quality relative to unconditional RAG pipelines.

Key empirical findings: CGRAG achieves F1 $= 0.226 \pm 0.004$, outperforming all eight baselines including GPT-only ($+0.055$, $p < 0.001$) and LangChain unconditional RAG ($+0.016$, $p < 0.05$); pretraining loss converges stably from 4.21 to 2.84; our GPT outperforms GPT-2 Small on WikiText-103 (PPL 26.87 vs. 29.41); the refusal mechanism correctly escalates on OOD queries (NQ/TriviaQA refusal rate 6–12 % vs. 0 % in-domain); and the grounding audit finds no responses containing claims unsupported by retrieved context within the in-domain protocol, with explicit scope caveats. The full system runs on consumer CPU with $<2$ GB memory.

Future work: learned confidence estimator replacing Eq. [1](https://arxiv.org/html/2606.11212#S6.E1); BM25+dense hybrid retrieval; span-extraction fine-tuning; joint $(\delta, \tau)$ optimisation; expanded OOD and adversarial evaluation.

## Acknowledgements

The author thanks the open\-source communities behind PyTorch, FAISS, Sentence\-Transformers, FastAPI, and HuggingFace Datasets\.

## References

- \[1\] Tom Brown et al. Language models are few-shot learners. *NeurIPS*, 2020.
- \[2\] Yonatan Geifman and Ran El-Yaniv. Selective prediction in deep neural networks. *NeurIPS*, 2017.
- \[3\] Chuan Guo et al. On calibration of modern neural networks. *ICML*, 2017.
- \[4\] Kelvin Guu et al. REALM: Retrieval-augmented language model pre-training. *ICML*, 2020.
- \[5\] Ari Holtzman et al. The curious case of neural text degeneration. *ICLR*, 2020.
- \[6\] Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. *EACL*, 2021.
- \[7\] Ziwei Ji et al. Survey of hallucination in natural language generation. *ACM Computing Surveys*, 55(12):1–38, 2023.
- \[8\] Vladimir Karpukhin et al. Dense passage retrieval for open-domain question answering. *EMNLP*, 2020.
- \[9\] Patrick Lewis et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. *NeurIPS*, 2020.
- \[10\] Guilherme Penedo et al. FineWeb: Decanting the web for the finest text data at scale. *NeurIPS*, 2024.
- \[11\] Alec Radford et al. Language models are unsupervised multitask learners. *OpenAI Technical Report*, 2019.
- \[12\] Pranav Rajpurkar et al. SQuAD: 100,000+ questions for machine comprehension of text. *EMNLP*, 2016.
- \[13\] Pranav Rajpurkar et al. Know what you don’t know: Unanswerable questions for SQuAD. *ACL*, 2018.
- \[14\] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. *EMNLP*, 2019.
- \[15\] Kurt Shuster et al. Retrieval augmentation reduces hallucination in conversation. *EMNLP Findings*, 2021.
- \[16\] Ashish Vaswani et al. Attention is all you need. *NeurIPS*, 2017.
- \[17\] Ruibin Xiong et al. On layer normalization in the transformer architecture. *ICML*, 2020.
