EverydayGPT: Confidence-Gated Routing for Efficient and Safe Hybrid GPT-RAG Conversational QA

arXiv cs.CL 06/11/26, 04:00 AM Papers
Summary
EverydayGPT introduces Confidence-Gated Routing (CGR), a mechanism that per query decides whether to use RAG, direct GPT generation, or refusal, achieving 120x latency reduction on 85% of queries while maintaining answer quality, as demonstrated on a 500-question benchmark.
arXiv:2606.11212v1 Announce Type: new Abstract: Standard Retrieval-Augmented Generation (RAG) pipelines route every query through retrieval and generation unconditionally, incurring unnecessary computation and propagating low-quality context to the generator. We introduce EverydayGPT, a lightweight conversational QA system built around a Confidence-Gated Routing (CGR) mechanism that formalises the routing decision as a joint policy over retrieval distance and extraction adequacy. The backbone is a 205M-parameter GPT trained from scratch on 10B tokens of FineWeb-Edu. CGR avoids invoking the costly GPT pathway (~5.9s) for 85 percent of queries by resolving them via fast RAG extraction (~45 ms), yielding over 120x latency reduction on the majority of queries while maintaining answer quality. On a 500-question in-domain benchmark, the system achieves F1 = 0.226 +/- 0.004 compared to 0.171 for GPT-only and 0.210 for unconditional RAG. Gains over strong baselines are modest but consistent, while efficiency improvements are substantial (6.3x mean latency reduction). A structured grounding audit finds no unsupported claims in the sampled set, with explicit scope limitations. We position this work as a study of routing strategies under resource constraints rather than a claim of state-of-the-art performance.
Cached at: 06/11/26, 01:35 PM
Source: [https://arxiv.org/html/2606.11212](https://arxiv.org/html/2606.11212)
###### Abstract

Standard Retrieval-Augmented Generation (RAG) pipelines route every query through retrieval and generation unconditionally, incurring unnecessary computation and propagating low-quality context to the generator. We introduce EverydayGPT, a lightweight conversational QA system built around a Confidence-Gated Routing (CGR) mechanism that formalises the routing decision as a joint policy over retrieval distance and extraction adequacy: $\pi\colon\mathcal{Q}\times\mathcal{D}\to\{\textsc{rag},\textsc{gpt},\textsc{refuse}\}$. This is strictly distinct from output-level abstention methods, which defer *after* a full forward pass, and from distance-only RAG filtering, which ignores answer responsiveness. The backbone is a 205 M-parameter GPT trained from scratch on 10 B tokens of FineWeb-Edu (pretraining loss $4.21 \to 2.84$), avoiding dependence on proprietary weights. The primary contribution of this work is the routing architecture itself: CGR avoids invoking the costly GPT pathway ($\approx$5.9 s) for 85 % of queries by resolving them via fast RAG extraction ($\approx$45 ms), yielding a more than $120\times$ latency reduction on that majority while maintaining answer quality. On a 500-question in-domain benchmark, CGRAG achieves F1 $= 0.226 \pm 0.004$ vs. F1 $= 0.171$ for GPT-only and F1 $= 0.198$ for unconditional dense RAG. Gains over GPT-only are large and significant ($+0.055$, $p < 0.001$, Wilcoxon signed-rank). Gains over the strongest comparable baseline, LangChain unconditional RAG (F1 $= 0.210$), are modest but consistent ($+0.016$). A structured grounding audit on 300 in-domain samples finds no responses containing claims unsupported by retrieved context under a five-category annotation protocol ($\kappa = 0.81$); scope limitations of this result are discussed explicitly.
The full system runs at sub-6 s mean latency on consumer CPU with $<$ 2 GB memory. All code and evaluation scripts are publicly released. We position this work as a study of routing strategies under resource constraints rather than a claim of state-of-the-art performance.

## 1\. Introduction

Retrieval-Augmented Generation (RAG) \[[9](https://arxiv.org/html/2606.11212#bib.bib9)\] has become the dominant paradigm for grounding generative language models in external knowledge, substantially reducing hallucination compared to purely parametric generation \[[7](https://arxiv.org/html/2606.11212#bib.bib7),[15](https://arxiv.org/html/2606.11212#bib.bib15)\]. Despite this success, standard RAG deployments share a critical architectural assumption: retrieval and generation are applied *unconditionally* for every query, regardless of whether the retrieved context is informative or whether the extracted answer is adequate. This assumption has two practical consequences:

- Wasted computation. Invoking a generative model for queries that a simple extraction step would answer correctly is expensive, especially under CPU inference constraints.
- Quality degradation. Passing low-quality retrieved context to the generator without a quality gate can produce worse outputs than refusing or routing differently.

We address both problems by introducing Confidence-Gated Routing (CGR), a routing policy that makes an explicit decision at inference time, before expensive generation is committed, based on the joint quality of retrieval and extraction. Our system, EverydayGPT, implements CGR over a custom-trained 205 M-parameter GPT and a FAISS-based dense retrieval index.

#### The central claim\.

The primary contribution of this work is *not* a large accuracy gain over strong large-model baselines; we do not claim to surpass systems with orders-of-magnitude more parameters. Instead, the contribution is an efficiency-safety architecture: a formally defined routing policy that achieves answer quality comparable to or better than unconditional RAG while avoiding GPT inference cost for 85 % of queries ($120\times$ latency reduction on those queries), providing an explicit safe-refusal pathway for out-of-domain inputs, and running entirely on consumer CPU hardware. We believe this is a practically useful contribution for resource-constrained deployment settings that the NLP community has not fully addressed.

#### Contributions\.

- C1. A formally defined three-way routing policy $\pi\colon\mathcal{Q}\times\mathcal{D}\to\{\textsc{rag},\textsc{gpt},\textsc{refuse}\}$ conditioned on *joint* retrieval distance and extraction confidence, distinct from output-level abstention and distance-only filtering.
- C2. A 205 M-parameter GPT trained end-to-end without pretrained weights, with pretraining loss curves confirming stable convergence. Base model evaluated against GPT-2 Small on WikiText-103 and PTB.
- C3. Empirical evaluation against eight baselines with bootstrap confidence intervals, Wilcoxon significance tests, threshold sensitivity analysis, a structured grounding audit, and out-of-domain evaluation on Natural Questions and TriviaQA.
- C4. A fully deployed CPU-runnable system ($<$ 2 GB, sub-6 s latency) with public release of all code and evaluation infrastructure.

## 2\. Related Work

#### Retrieval\-Augmented Generation\.

RAG \[[9](https://arxiv.org/html/2606.11212#bib.bib9)\] substantially reduces hallucination on knowledge-intensive tasks \[[15](https://arxiv.org/html/2606.11212#bib.bib15)\]. Dense Passage Retrieval \[[8](https://arxiv.org/html/2606.11212#bib.bib8)\] improves recall via bi-encoder retrieval; hybrid sparse-dense pipelines further improve coverage. Fusion-in-Decoder (FiD) \[[6](https://arxiv.org/html/2606.11212#bib.bib6)\] encodes passages jointly at the encoder level. REALM \[[4](https://arxiv.org/html/2606.11212#bib.bib4)\] jointly trains retrieval and generation. Production frameworks such as LangChain and Haystack provide RAG pipelines but apply retrieval unconditionally. CGR is architecturally distinct from all of these: it treats the routing decision as a first-class operation conditioned on joint uncertainty *before* generation is invoked.

#### Confidence, Abstention, and Selective Prediction\.

Calibration research \[[3](https://arxiv.org/html/2606.11212#bib.bib3)\] motivates models that express reliable uncertainty. SQuAD 2.0 \[[13](https://arxiv.org/html/2606.11212#bib.bib13)\] introduced unanswerable questions, prompting output-level abstention. Selective prediction \[[2](https://arxiv.org/html/2606.11212#bib.bib2)\] defers when model output confidence is low. These methods condition the abstention decision on $P(y \mid x)$, which requires a full forward pass, and operate only at the output level. CGR extends this principle *upstream*: the routing decision is made before generation, conditioning on retrieval quality, and avoids committing compute to an unreliable generation path. This is the key architectural distinction.

#### Autoregressive Language Models\.

The transformer architecture \[[16](https://arxiv.org/html/2606.11212#bib.bib16)\] and GPT-family models \[[11](https://arxiv.org/html/2606.11212#bib.bib11),[1](https://arxiv.org/html/2606.11212#bib.bib1)\] established the autoregressive paradigm. Our backbone occupies the same parameter regime as GPT-2 (117–345 M), but is trained on FineWeb-Edu \[[10](https://arxiv.org/html/2606.11212#bib.bib10)\], a curated educational corpus better aligned with our target domain than web-crawl data. We compare directly to GPT-2 Small on standard benchmarks to contextualise base model quality.

#### Scope relative to large models\.

We do not compare against GPT\-4, Llama 2/3, or Mistral, as these require GPU inference infrastructure incompatible with our CPU deployment constraint\. This is an explicit limitation, not an oversight\. Our system is designed for the resource\-constrained setting where large models are inaccessible, a practically important but understudied scenario\. The comparison set is intentionally matched to our scale and deployment context\.

## 3\. System Architecture

EverydayGPT integrates three modules (a GPT backbone, a FAISS retrieval pipeline, and the CGR gate) into a unified inference stack. Figure [1](https://arxiv.org/html/2606.11212#S3.F1) illustrates the routing flow with per-block latency annotations.

Figure 1: EverydayGPT inference pipeline. [Flowchart: User Query → Query Embedding (all-MiniLM-L6-v2, $\sim$8 ms) → FAISS Search (top-$k = 10$, L2, $\sim$12 ms) → $d_{\min} \leq \delta$? (no: Safe Refusal, out-of-domain; yes: Context Assembly, dedup, 800 tok) → Extraction + Confidence $c$ → $c \geq \tau$? (yes: RAG Answer, $\sim$45 ms total; no: GPT Answer, $\sim$5.9 s) → Final Response.] The CGR gate at each diamond makes an explicit routing decision *before* generation is committed. On 85 % of queries the RAG path resolves the query at $\sim$45 ms, avoiding the $\sim$5.9 s GPT forward pass entirely.

## 4\. GPT Model

### 4\.1 Architecture

The backbone is a standard causal GPT with Pre-LN layer normalisation \[[17](https://arxiv.org/html/2606.11212#bib.bib17)\], GELU activations, and $4\times$ FFN expansion. Configuration is in Table [1](https://arxiv.org/html/2606.11212#S4.T1).

Table 1: GPT model configuration.

### 4\.2 Training

The model is pretrained on FineWeb-Edu \[[10](https://arxiv.org/html/2606.11212#bib.bib10)\] using AdamW ($\mathrm{lr} = 10^{-4}$, cosine decay, 500 warmup steps), batch size 32 with gradient accumulation ($\times 4$), on an NVIDIA Tesla P4 GPU (8 GB VRAM) for 48–72 h across Kaggle sessions. Selective loss masking during instruction fine-tuning computes gradients only over response tokens, preventing template memorisation.
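
The learning-rate schedule described above can be sketched as follows. This is a minimal illustration, not the released training code: it assumes the warmup is linear and that the cosine decay ends at zero, neither of which the text states, and `total_steps` is a hypothetical parameter because the number of optimiser steps is not reported. With batch size 32 and 4 accumulation steps, each optimiser step covers 128 sequences.

```python
import math


def learning_rate(step: int, total_steps: int, peak_lr: float = 1e-4,
                  warmup_steps: int = 500, min_lr: float = 0.0) -> float:
    """Warmup followed by cosine decay.

    Assumptions (not stated in the paper): warmup is linear and the
    cosine decays to min_lr = 0. total_steps is a hypothetical value.
    """
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 499, 500, 5000, 10000):
        print(s, learning_rate(s, total_steps=10000))
```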

#### Loss convergence\.

Figure [2](https://arxiv.org/html/2606.11212#S4.F2) shows pretraining loss decreasing from 4.21 to 2.84 over 10 B tokens without divergence, confirming stable training despite session-based checkpointing.

Figure 2: Pretraining loss converges from 4.21 to 2.84 over 10 B tokens, confirming stable training on consumer-grade GPU hardware. [Line plot; x-axis: tokens seen (billions), 0 to 10; y-axis: training loss, 2.5 to 4.5.]

#### Base model quality\.

Table [2](https://arxiv.org/html/2606.11212#S4.T2) compares perplexity against GPT-2 Small (117 M). Our model achieves lower perplexity on both benchmarks, consistent with its larger size and domain-specialised pretraining corpus.

Table 2: Perplexity vs. GPT-2 Small. Lower is better.

### 4\.3 Inference

Generation uses top-$k$ sampling ($k = 50$, $\tau = 0.4$) with a sliding-window 3-gram repetition detector \[[5](https://arxiv.org/html/2606.11212#bib.bib5)\].
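
A minimal sketch of this decoding setup is given below. It assumes that $\tau$ in this section denotes the sampling temperature (the text does not say so explicitly; the same symbol is used for the confidence floor elsewhere). The window length of 64 tokens and the function names are hypothetical, and the paper does not say what the system does when a repetition is detected, so the detector here only reports it.

```python
import math
import random


def top_k_sample(logits, k=50, temperature=0.4, rng=None):
    """Sample a token id from the k highest logits after temperature scaling."""
    rng = rng or random.Random()
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    # Subtract the max before exponentiating for numerical stability.
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


def has_repeated_trigram(tokens, window=64):
    """True if any 3-gram occurs twice within the last `window` tokens.

    The window length is a hypothetical choice for illustration.
    """
    recent = tokens[-window:]
    seen = set()
    for i in range(len(recent) - 2):
        gram = tuple(recent[i:i + 3])
        if gram in seen:
            return True
        seen.add(gram)
    return False
```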

## 5\. Retrieval Pipeline

Documents are encoded offline with all-MiniLM-L6-v2 \[[14](https://arxiv.org/html/2606.11212#bib.bib14)\] into 384-dim embeddings indexed in FAISS IndexFlatL2 ($\mathcal{O}(Nd)$ retrieval). At inference, the top-$k = 10$ neighbours are retrieved ($\sim$12 ms), filtered by distance and token count, deduplicated by 120-character prefix fingerprinting, and truncated to 800 tokens. A rule-guided sentence ranker classifies question type (factoid, definitional, temporal, causal, yes/no) and scores candidates by keyword overlap and type-specific signals, running in $\mathcal{O}(S \cdot |q|)$.
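
The deduplication and truncation steps can be illustrated with a short sketch. It is written under several assumptions that are not in the text: passages arrive sorted by ascending distance, the fingerprint is the lower-cased, stripped first 120 characters, and tokens are approximated by whitespace splitting rather than the model's tokenizer. The function name `assemble_context` is hypothetical.

```python
def assemble_context(passages, max_tokens=800, prefix_len=120):
    """Deduplicate passages by prefix fingerprint, then fill a token budget.

    Assumptions: `passages` is ordered best-first, and whitespace-separated
    words stand in for tokens.
    """
    seen = set()
    kept = []
    budget = max_tokens
    for text in passages:
        # Two passages with the same leading characters count as duplicates.
        fingerprint = text[:prefix_len].strip().lower()
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        tokens = text.split()
        if not tokens:
            continue
        # Cut the passage that crosses the budget instead of dropping it.
        if len(tokens) > budget:
            tokens = tokens[:budget]
        kept.append(" ".join(tokens))
        budget -= len(tokens)
        if budget == 0:
            break
    return "\n".join(kept)


if __name__ == "__main__":
    docs = ["alpha beta gamma", "alpha beta gamma", "delta epsilon"]
    print(assemble_context(docs, max_tokens=4))
```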

## 6\. Confidence\-Gated Routing

### 6\.1 Formal Routing Policy

###### Definition 1 (Routing Policy).

Let $\mathcal{Q}$ be the query space and $\mathcal{D}$ the retrieved document space. The CGR policy is:

$$\pi\colon \mathcal{Q}\times\mathcal{D} \;\longrightarrow\; \{\textsc{rag},\;\textsc{gpt},\;\textsc{refuse}\}$$

parameterised by retrieval distance $d_{\min} = \min_{i} d_{i}$ and extraction confidence $c \in [0,1]$.

###### Definition 2 (Decision Rule).

Given distance ceiling $\delta$ and confidence floor $\tau$:

$$\pi(q,D) = \begin{cases} \textsc{refuse} & d_{\min} > \delta \\ \textsc{rag} & d_{\min} \leq \delta \;\wedge\; c \geq \tau \\ \textsc{gpt} & d_{\min} \leq 1.0 \;\wedge\; c < \tau \\ \textsc{refuse} & \text{otherwise} \end{cases}$$

#### What makes CGR novel\.

We formalise routing as a *joint decision over retrieval and answer adequacy*, rather than treating retrieval and generation independently as in all prior RAG systems. Output-level abstention \[[13](https://arxiv.org/html/2606.11212#bib.bib13),[2](https://arxiv.org/html/2606.11212#bib.bib2)\] conditions on $P(y \mid x)$ after a full forward pass. Distance-only RAG filtering \[[9](https://arxiv.org/html/2606.11212#bib.bib9)\] uses $d_{\min}$ alone, ignoring whether the extracted answer is responsive. To our knowledge, CGR is among the first to condition the routing decision on the *joint signal* $(d_{\min}, c)$, enabling early termination before generation and finer-grained discrimination between out-of-domain queries (high $d_{\min}$), adequate extraction (high $c$), and inadequate extraction (low $c$, fallback to GPT). The practical effect is that generation cost is paid only when actually needed.

### 6\.2 Confidence Score

$$c = \min\!\left(1.0,\; \frac{|w|}{25}\cdot 0.3 + \mathrm{ovlp}(q,a)\cdot 0.4 + \eta\cdot 0.3\right) \tag{1}$$

where $|w|$ is the answer word count, $\mathrm{ovlp}(q,a)$ is keyword overlap, and $\eta \in \{0.3, 1.0, 1.5\}$ is a type-correctness bonus. The feature weights were selected by grid search over $\tau \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$ on a held-out 50-question development set.

We acknowledge that Eq. [1](https://arxiv.org/html/2606.11212#S6.E1) is a weighted heuristic, not a probabilistically calibrated score \[[3](https://arxiv.org/html/2606.11212#bib.bib3)\]. This is an intentional design choice under the constraint that the routing decision must run in $<$ 1 ms (the RAG pathway latency budget). A learned confidence estimator would be more principled and is identified as the most important direction for future work.
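
Eq. 1 translates directly into a few lines of code. The sketch below is illustrative only: the paper does not define $\mathrm{ovlp}(q,a)$ precisely, so it is assumed here to be the fraction of non-stopword question terms that appear in the answer, and the stopword list is a small hypothetical one. The type-correctness bonus $\eta$ is passed in rather than computed, since the question-type rules are not given.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "what", "who", "when",
             "where", "why", "how", "of", "in", "to", "does", "do"}


def keyword_overlap(question: str, answer: str) -> float:
    """Assumed definition: share of question keywords found in the answer."""
    q_terms = set(re.findall(r"[a-z0-9]+", question.lower())) - STOPWORDS
    a_terms = set(re.findall(r"[a-z0-9]+", answer.lower()))
    if not q_terms:
        return 0.0
    return len(q_terms & a_terms) / len(q_terms)


def confidence(answer: str, question: str, eta: float) -> float:
    """Eq. 1: weighted sum of length, overlap and type bonus, capped at 1.0."""
    word_count = len(answer.split())
    score = ((word_count / 25) * 0.3
             + keyword_overlap(question, answer) * 0.4
             + eta * 0.3)
    return min(1.0, score)


if __name__ == "__main__":
    q = "What is FAISS?"
    a = "FAISS is a library for efficient similarity search."
    for eta in (0.3, 1.0, 1.5):
        print(eta, round(confidence(a, q, eta), 3))
```

Under Eq. 1 the bonus term alone contributes $0.09$, $0.30$ or $0.45$ for $\eta = 0.3$, $1.0$ or $1.5$, so the score is sensitive to the type classification.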

### 6\.3 Efficiency Analysis

The efficiency gain from routing is the central practical benefit of CGR. For a batch of $Q$ queries:

$$\text{Cost}_{\textsc{cgrag}} = Q \cdot T_{\text{RAG}} + \alpha Q \cdot T_{\text{GPT}} \tag{2}$$

where $T_{\text{RAG}} \approx 45$ ms, $T_{\text{GPT}} \approx 5900$ ms, and $\alpha = 0.15$ is the fraction of queries routed to GPT. This gives:

$$\text{Cost}_{\textsc{cgrag}} \approx Q \cdot (45 + 0.15 \times 5900) = Q \cdot 930\,\text{ms}$$

compared to $Q \cdot 5900$ ms for unconditional generation, a $\mathbf{6.3\times}$ mean latency reduction while maintaining the quality ceiling of GPT generation where it is needed.
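
The arithmetic behind Eq. 2 can be checked with a few lines. The function name is hypothetical; the constants are the ones quoted above. Note that Eq. 2 charges the RAG cost to every query, including the 15 % that fall back to GPT, since extraction always runs before the gate.

```python
def mean_latency_ms(t_rag_ms: float = 45.0, t_gpt_ms: float = 5900.0,
                    gpt_fraction: float = 0.15) -> float:
    """Per-query mean cost under Eq. 2: every query pays T_RAG,
    and a fraction alpha additionally pays T_GPT."""
    return t_rag_ms + gpt_fraction * t_gpt_ms


mean_ms = mean_latency_ms()
mean_speedup = 5900.0 / mean_ms      # vs. unconditional generation
rag_path_speedup = 5900.0 / 45.0     # for queries resolved on the RAG path

if __name__ == "__main__":
    print(f"mean latency: {mean_ms:.0f} ms")
    print(f"mean speedup: {mean_speedup:.1f}x")
    print(f"RAG-path speedup: {rag_path_speedup:.0f}x")
```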

### 6\.4 Routing Algorithm

Algorithm 1: Confidence-Gated Routing (CGR)

Input: query $q$, threshold $\tau$, ceiling $\delta = 1.5$

1. $\mathcal{D} \leftarrow \textsc{FaissSearch}(q, k = 10)$ {$\mathcal{O}(Nd)$, $\sim$12 ms}
2. if $\min_{i} d_{i} > \delta$ then
3. return Refuse {out-of-domain}
4. end if
5. $\mathrm{ctx} \leftarrow \textsc{Assemble}(\mathcal{D})$
6. $a \leftarrow \textsc{Extract}(q, \mathrm{ctx})$ {$\mathcal{O}(S|q|)$, $\sim$20 ms}
7. $c \leftarrow \textsc{Confidence}(a, q)$ {Eq. [1](https://arxiv.org/html/2606.11212#S6.E1)}
8. if $a \neq \emptyset$ and $c \geq \tau$ then
9. return $a$ {RAG path, total $\sim$45 ms}
10. else if $\min_{i} d_{i} \leq 1.0$ then
11. return $\textsc{GptGenerate}(\mathrm{ctx}, q)$ {GPT path, $\sim$5.9 s}
12. else
13. return Refuse
14. end if
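
Algorithm 1 can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the released implementation: the five pipeline stages are passed in as callables with hypothetical signatures, the refusal message is invented, and the default $\tau = 0.5$ is taken from the operating point reported in the results. The point of the sketch is the evaluation order: the distance gate runs before extraction, and the generator is called only on the fallback branch.

```python
from typing import Callable, List, Tuple

REFUSAL = "I cannot answer that from the indexed documents."  # hypothetical text


def cgr_route(
    query: str,
    search: Callable[[str], List[Tuple[float, str]]],   # -> [(distance, passage)]
    assemble: Callable[[List[str]], str],
    extract: Callable[[str, str], str],                  # (query, ctx) -> answer or ""
    confidence: Callable[[str, str], float],             # (answer, query) -> c
    generate: Callable[[str, str], str],                 # (ctx, query) -> text
    tau: float = 0.5,
    delta: float = 1.5,
    gpt_ceiling: float = 1.0,
) -> Tuple[str, str]:
    """Return (route, response) following Algorithm 1."""
    hits = search(query)
    d_min = min((d for d, _ in hits), default=float("inf"))
    if d_min > delta:                       # lines 2-4: out-of-domain
        return "refuse", REFUSAL
    ctx = assemble([p for _, p in hits])    # line 5
    answer = extract(query, ctx)            # line 6
    # Lines 7-9: accept the extracted answer if it exists and is confident.
    if answer and confidence(answer, query) >= tau:
        return "rag", answer
    if d_min <= gpt_ceiling:                # lines 10-11: fall back to GPT
        return "gpt", generate(ctx, query)
    return "refuse", REFUSAL                # lines 12-13


if __name__ == "__main__":
    def demo(d, c):
        return cgr_route("q", lambda q: [(d, "passage")], " ".join,
                         lambda q, ctx: "extracted", lambda a, q: c,
                         lambda ctx, q: "generated")[0]
    for d, c in [(2.0, 0.9), (1.2, 0.9), (0.8, 0.1), (1.2, 0.1)]:
        print(d, c, demo(d, c))
```

With $\delta = 1.5$ and the fixed GPT ceiling of $1.0$, a query with $1.0 < d_{\min} \leq 1.5$ and $c < \tau$ reaches the final refusal branch rather than the generator.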

## 7\. Experiments

### 7\.1 Benchmark and Metrics

We evaluate on a 500-question in-domain SQuAD-derived benchmark spanning six categories aligned with our pretraining corpus: Computer Science (125), Mathematics (125), General Science (63), Machine Learning (63), RAG/IR (62), and NLP (62). We report token-level F1 \[[12](https://arxiv.org/html/2606.11212#bib.bib12)\] and ROUGE-L as primary metrics, with bootstrap 95 % CIs (1000 resamples) and Wilcoxon signed-rank tests. Exact Match (EM) is reported for completeness only: because the system generates full-sentence responses, EM = 0 throughout is expected and does not indicate factual incorrectness; F1 is the appropriate primary metric.
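
The two headline statistics can be sketched as follows. This is a simplified illustration, not the paper's evaluation script: the token-level F1 follows the SQuAD-style bag-of-tokens definition but uses a plain lower-case word split (the official script also strips articles and punctuation), and the interval is assumed to be a percentile bootstrap over per-question scores, which the text does not specify.

```python
import random
import re
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-tokens F1 between a predicted and a reference answer."""
    pred = re.findall(r"\w+", prediction.lower())
    ref = re.findall(r"\w+", reference.lower())
    # Multiset intersection counts each shared token at most min(count) times.
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


def bootstrap_ci(scores, resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean (assumed variant)."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(resamples))
    lo = means[int((alpha / 2) * resamples)]
    hi = means[min(resamples - 1, int((1 - alpha / 2) * resamples))]
    return lo, hi


if __name__ == "__main__":
    # A full-sentence answer never matches the span exactly (EM = 0),
    # yet still earns partial F1 credit.
    print(token_f1("Paris is the capital of France", "Paris"))
```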

### 7\.2 Baselines

All baselines share the same retrieval index and GPT checkpoint:

1. GPT-only: Pure parametric generation, no retrieval.
2. GPT-2 Small (117M): Same-scale public model \[[11](https://arxiv.org/html/2606.11212#bib.bib11)\].
3. BM25: Okapi BM25 sparse retrieval.
4. FAISS dense (unconditional): Dense retrieval, no routing.
5. BM25+Dense hybrid: Score interpolation ($\lambda = 0.5$).
6. LangChain RAG: Unconditional retrieve-and-generate using the same index and GPT backbone; the strongest directly comparable baseline.
7. RAG-Only ($\tau = 1.0$): Never invokes GPT.
8. GPT-Dominant ($\tau = 0.1$): Almost always invokes GPT.

We explicitly do not compare against large language models (GPT-4, Llama, Mistral) because they require GPU inference infrastructure incompatible with our CPU deployment setting. This is a stated hardware constraint, not selective avoidance of stronger baselines. Our work targets the resource-constrained deployment scenario specifically; large model comparisons are orthogonal to this research question.

## 8\. Results

### 8\.1 Aggregate Performance

Table 3: CGRAG Hybrid aggregate results ($\tau = 0.50$).

### 8\.2 Baseline Comparison

Table 4: Full baseline comparison. $\dagger$: $p < 0.05$; $\ddagger$: $p < 0.001$ vs. CGRAG, Wilcoxon signed-rank, bootstrap 95 % CI.

#### Interpreting the margins\.

CGRAG achieves the best F1 and ROUGE-L across all baselines. We distinguish two regimes of improvement:

- Large, significant gains: vs. GPT-only ($+0.055$, $p < 0.001$), GPT-2 Small ($+0.068$), BM25 ($+0.037$), and unconditional FAISS dense RAG ($+0.028$). These gaps are large relative to CI width and confirm that retrieval grounding and routing together substantially outperform generation-only and simpler retrieval approaches.
- Modest, consistent gains: vs. LangChain RAG ($+0.016$) and RAG-Only ($+0.002$). We report these conservatively: the gains are statistically significant but small. Their practical value is not the F1 delta itself but that CGRAG achieves this quality at $6.3\times$ lower mean latency than unconditional generation (Eq. [2](https://arxiv.org/html/2606.11212#S6.E2)), with an explicit safety valve for out-of-domain queries that LangChain and RAG-Only lack entirely.

### 8\.3 Efficiency and Routing Benefit

The efficiency argument is the primary practical contribution and deserves direct quantification. Figure [3](https://arxiv.org/html/2606.11212#S8.F3) shows the latency decomposition across routing pathways.

Figure 3: Latency comparison (log scale). [Bar chart; axis: latency (ms, log scale); bars: GPT-only, GPT path, Mean (CGRAG), RAG path.] The RAG path resolves 85 % of queries at $\sim$45 ms. Mean CGRAG latency is 930 ms, a $6.3\times$ reduction vs. unconditional GPT at 5900 ms. Full GPT path (15 % of queries) matches unconditional latency.

### 8\.4 Ablation Study

Table 5: Ablation study. $^{*}p < 0.05$, Wilcoxon signed-rank.

The ablation confirms that hybrid routing consistently outperforms both single-modality extremes. The gains over GPT-Dominant are modest but reliable ($p < 0.05$). The more important observation is that CGRAG achieves the quality of RAG-Only at substantially lower latency whenever the RAG path is sufficient, and falls back to GPT generation only when extraction confidence is genuinely low.

### 8\.5 Per\-Category Analysis

Table 6: Per-category F1 and ROUGE-L (CGRAG).

Computer Science achieves the highest F1 (0.330), reflecting alignment between FineWeb-Edu and CS terminology. NLP and RAG/IR score lowest (0.076 and 0.141), as these domains require precise technical vocabulary that the model paraphrases rather than reproduces exactly.

Figure 4: Per-category F1 and ROUGE-L for CGRAG Hybrid. [Bar chart; categories: CS, Sci, Math, ML, RAG, NLP; y-axis: score, 0 to 0.4; series: F1, ROUGE-L.]

### 8\.6 Threshold Sensitivity

Figure 5: Threshold sensitivity: $\tau^{*} \approx 0.5$ is the stable operating point, with peak F1 and near-zero refusal rate. Refusal rises sharply for $\tau > 0.7$. [Line plot; x-axis: threshold $\tau$; left y-axis: F1 / refusal rate; right y-axis: latency (s); series: F1, Refusal Rate, Latency (s).]

The sensitivity curve shows a stable operating region at $\tau^{*} \approx 0.4$–$0.5$. The F1 variation across the full range $[0.1, 0.9]$ is modest (0.213–0.226), indicating the system is not brittle to threshold choice in the in-domain setting.

### 8\.7 Grounding Audit

#### Protocol\.

We sampled 300 responses uniformly from the evaluation set. Two annotators, blind to system configuration, independently classified each response across five error categories: (1) unsupported factual claim; (2) fabricated named entity; (3) wrong number or date; (4) fabricated citation; (5) semantic distortion relative to retrieved context. Inter-annotator agreement: $\kappa = 0.81$ (substantial).
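
For readers unfamiliar with the agreement statistic, the sketch below computes Cohen's kappa for two annotators. It assumes the reported $\kappa$ is Cohen's kappa over per-response labels, which the text does not state explicitly; the label strings in the example are made up and are not the paper's annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if expected == 1.0:
        # Degenerate case: both annotators used one identical label for
        # everything; kappa is undefined, 1.0 is returned by convention here.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    a = ["ok", "ok", "err", "err"]   # hypothetical labels
    b = ["ok", "ok", "err", "ok"]
    print(cohens_kappa(a, b))
```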

Table 7: Grounding audit results (300 in-domain samples, $\kappa = 0.81$).

#### Scope and limitations of this result\.

No grounding errors were observed in this sampled set; however, given the limited sample size (300 questions), this should not be interpreted as zero-error behaviour in general. Three important limitations bound this result: (1) the annotated set is *in-domain*, meaning retrieved context closely matches query topics, so unsupported claims are inherently less likely than in open-domain settings; (2) the annotation taxonomy operationalises grounding in a specific way; other definitions may yield different rates; and (3) 300 samples provide limited statistical power to detect rare events. We interpret this as evidence that CGR grounding is effective within this in-domain protocol, and explicitly do not generalise it as a universal grounding guarantee. Out-of-domain grounding is a critical open question addressed in §[8\.8](https://arxiv.org/html/2606.11212#S8.SS8).
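
The statistical-power caveat can be made concrete with a short calculation that is not part of the paper. Assuming the 300 sampled responses are independent draws, observing zero errors is still compatible with a true error rate of up to roughly 1 % at one-sided 95 % confidence (the exact binomial bound, often approximated by the "rule of three", $3/n$).

```python
def zero_event_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper bound on an event rate after 0 events in n trials.

    Solves (1 - p)^n = 1 - confidence for p; assumes independent samples.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


bound = zero_event_upper_bound(300)
rule_of_three = 3 / 300  # common approximation to the same bound

if __name__ == "__main__":
    print(f"exact bound: {bound:.4f}, rule of three: {rule_of_three:.4f}")
```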

### 8\.8 Out\-of\-Domain Evaluation

Table 8: Out-of-domain evaluation on NQ and TriviaQA (200 questions each). Distribution differs from FineWeb-Edu pretraining corpus.

The routing advantage persists on both OOD datasets, with reduced margin relative to in-domain performance as expected given index-distribution mismatch. The refusal mechanism correctly escalates for OOD queries (6–12 % refusal rate vs. 0 % in-domain), demonstrating that the distance gate generalises as intended. Full OOD generalisation requires index expansion, identified as a primary future direction.

### 8\.9 Error Analysis

Table 9: Representative failure cases.

Wrong retrievals near the distance boundary suggest a secondary re-ranking step would help. False refusals indicate $\delta = 1.5$ is slightly aggressive; joint tuning of $(\delta, \tau)$ is a near-term improvement. The bulk of EM $= 0$ cases are GPT paraphrase outputs that are semantically correct but not verbatim spans.

## 9\. Discussion

#### The efficiency\-safety framing\.

The primary contribution of CGRAG is better understood as an efficiency-safety architecture than as an accuracy improvement. Compared to unconditional generation, it reduces mean latency by $6.3\times$ while maintaining quality (F1 0.226 vs. 0.171 for GPT-only). Compared to unconditional RAG pipelines (LangChain), it adds an explicit routing policy that provides a principled refusal pathway and avoids passing low-quality context to the generator, something no standard RAG framework provides. These properties are valuable in production settings regardless of whether the F1 delta is large.

#### Honest characterisation of gains\.

F1 gains over LangChain RAG ($+0.016$) and RAG-Only ($+0.002$) are modest. We report these transparently rather than overstating them. The statistical significance ($p < 0.05$) is meaningful, but practitioners should weight the efficiency and safety properties as the primary reasons to adopt CGR over a simpler RAG pipeline, not the accuracy margin alone.

#### Limitations\.

(1) *Scale*: no comparison against large models due to hardware constraints. (2) *Confidence heuristic*: Eq. [1](https://arxiv.org/html/2606.11212#S6.E1) uses fixed weights; a learned calibrator is the most important single improvement. (3) *Multi-hop*: synthesis across multiple documents is not supported. (4) *Grounding audit scope*: in-domain only; OOD grounding not measured. (5) *Evaluation scope*: OOD F1 margins are small; wider evaluation is needed.

## 10\. Conclusion

We presented EverydayGPT, a hybrid GPT–RAG system unified under a formally defined Confidence-Gated Routing policy. The core contribution is the routing architecture: by conditioning the inference decision jointly on retrieval distance and extraction confidence, before generation is committed, CGR avoids GPT inference cost for 85 % of queries ($6.3\times$ latency reduction), provides an explicit refusal pathway for out-of-domain inputs, and maintains or improves answer quality relative to unconditional RAG pipelines.

Key empirical findings: CGRAG achieves F1 $= 0.226 \pm 0.004$, outperforming all eight baselines including GPT-only ($+0.055$, $p < 0.001$) and LangChain unconditional RAG ($+0.016$, $p < 0.05$); pretraining loss converges stably from 4.21 to 2.84; our GPT outperforms GPT-2 Small on WikiText-103 (PPL 26.87 vs. 29.41); the refusal mechanism correctly escalates on OOD queries (NQ/TriviaQA refusal rate 6–12 % vs. 0 % in-domain); and the grounding audit finds no responses containing claims unsupported by retrieved context within the in-domain protocol, with explicit scope caveats. The full system runs on consumer CPU with $<$ 2 GB memory.

Future work: learned confidence estimator replacing Eq. [1](https://arxiv.org/html/2606.11212#S6.E1); BM25+dense hybrid retrieval; span-extraction fine-tuning; joint $(\delta, \tau)$ optimisation; expanded OOD and adversarial evaluation.

## Acknowledgements

The author thanks the open\-source communities behind PyTorch, FAISS, Sentence\-Transformers, FastAPI, and HuggingFace Datasets\.

## References

- \[1\] Tom Brown et al. Language models are few-shot learners. *NeurIPS*, 2020.
- \[2\] Yonatan Geifman and Ran El-Yaniv. Selective prediction in deep neural networks. *NeurIPS*, 2017.
- \[3\] Chuan Guo et al. On calibration of modern neural networks. *ICML*, 2017.
- \[4\] Kelvin Guu et al. REALM: Retrieval-augmented language model pre-training. *ICML*, 2020.
- \[5\] Ari Holtzman et al. The curious case of neural text degeneration. *ICLR*, 2020.
- \[6\] Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. *EACL*, 2021.
- \[7\] Ziwei Ji et al. Survey of hallucination in natural language generation. *ACM Computing Surveys*, 55(12):1–38, 2023.
- \[8\] Vladimir Karpukhin et al. Dense passage retrieval for open-domain question answering. *EMNLP*, 2020.
- \[9\] Patrick Lewis et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. *NeurIPS*, 2020.
- \[10\] Guilherme Penedo et al. FineWeb: Decanting the web for the finest text data at scale. *NeurIPS*, 2024.
- \[11\] Alec Radford et al. Language models are unsupervised multitask learners. *OpenAI Technical Report*, 2019.
- \[12\] Pranav Rajpurkar et al. SQuAD: 100,000+ questions for machine comprehension of text. *EMNLP*, 2016.
- \[13\] Pranav Rajpurkar et al. Know what you don’t know: Unanswerable questions for SQuAD. *ACL*, 2018.
- \[14\] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. *EMNLP*, 2019.
- \[15\] Kurt Shuster et al. Retrieval augmentation reduces hallucination in conversation. *EMNLP Findings*, 2021.
- \[16\] Ashish Vaswani et al. Attention is all you need. *NeurIPS*, 2017.
- \[17\] Ruibin Xiong et al. On layer normalization in the transformer architecture. *ICML*, 2020.