Grounded Optimization: A Layered Engineering Framework for Reducing LLM Hallucination in Automated Personal Document Rewriting

arXiv cs.CL Papers

Summary

This paper presents Grounded Optimization, a five-layer framework to reduce LLM hallucination in automated personal document rewriting. Experiments show significant reduction in hallucination rates across various models and temperatures.

arXiv:2607.01457v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly applied to resume optimization for applicant tracking systems, introducing hallucination failures distinct from general text generation: anachronistic technology injection, cross-domain terminology contamination, structural mutation, and content fabrication. We present Grounded Optimization, a five-layer framework combining temporal context validation, deterministic contamination detection, structural invariant enforcement, prompt-level grounding, and an evaluator agent. In ablation experiments across three LLMs, four temperature settings, and six layer configurations on 25 synthetic resumes spanning 14 industries, undefended baselines produce 2.48-5.36 detected hallucinations per resume. Among detectors independent of the active defenses, temporal hallucinations are reduced by 50-95% across all conditions; overall detected hallucination rate falls to 0.04-0.24. Prompt-level grounding alone achieves zero detected hallucinations at low temperature with a capable instruction-following model; higher temperatures and weaker models reveal the need for the deterministic layers as a complement. We release the contamination taxonomy, evaluation code, and raw data.

Cached at: 07/03/26, 05:40 AM

# Grounded Optimization: A Layered Engineering Framework for Reducing LLM Hallucination in Automated Personal Document Rewriting
Source: [https://arxiv.org/html/2607.01457](https://arxiv.org/html/2607.01457)
Shashank Indukuri \(sinduku1@depaul\.edu\), Adarsh Agrawal \(adagrawal@cs\.stonybrook\.edu\)

###### Abstract

Large language models \(LLMs\) are increasingly applied to resume optimization for applicant tracking systems, introducing hallucination failures distinct from general text generation: anachronistic technology injection, cross\-domain terminology contamination, structural mutation, and content fabrication\. We present Grounded Optimization, a five\-layer framework combining temporal context validation, deterministic contamination detection, structural invariant enforcement, prompt\-level grounding, and an evaluator agent\.

In ablation experiments across three LLMs, four temperature settings, and six layer configurations on 25 synthetic resumes spanning 14 industries, undefended baselines produce 2\.48–5\.36 detected hallucinations per resume\. Among detectors independent of the active defenses, temporal hallucinations are reduced by 50–95% across all conditions; overall detected hallucination rate falls to 0\.04–0\.24\. Prompt\-level grounding alone achieves zero detected hallucinations at low temperature with a capable instruction\-following model; higher temperatures and weaker models reveal the need for the deterministic layers as a complement\. We release the contamination taxonomy, evaluation code, and raw data\.

## 1 Introduction

The use of large language models for document optimization has grown rapidly, with resume tailoring representing one of the most commercially active applications\. Services that rewrite resumes to improve alignment with job descriptions and ATS scoring algorithms now process large volumes of documents\. Yet the hallucination behaviors documented in general LLM text generation \[[1](https://arxiv.org/html/2607.01457#bib.bib1),[2](https://arxiv.org/html/2607.01457#bib.bib2)\] manifest in particularly harmful ways when applied to personal documents:

1. Temporal fabrication: An LLM optimizing a 2018 software engineering role may inject references to LangChain \(released late 2022\) or Mixtral \(released December 2023\), creating verifiably false claims about the candidate’s experience timeline\.
2. Cross\-domain contamination: When rewriting a role at an AWS\-centric company, the model may introduce Azure or GCP terminology to match job description keywords, adding multi\-cloud terminology absent from the original role\.
3. Structural mutation: The model may silently merge, delete, or condense bullet points to reduce output length, removing genuine achievements in the process\.
4. Content fabrication: The model may invent company names, inflate metrics, or add certifications the candidate never earned\.

These failures carry concrete consequences: candidates may unknowingly submit resumes containing false claims, exposing them to disqualification or termination\. Unlike hallucination in chatbots or search summaries, where users can verify outputs interactively, resume optimization typically operates in batch mode with minimal human review\.

Hallucination mitigation has been studied extensively in open\-domain question answering\[[3](https://arxiv.org/html/2607.01457#bib.bib3)\], summarization\[[4](https://arxiv.org/html/2607.01457#bib.bib4)\], and retrieval\-augmented generation\[[5](https://arxiv.org/html/2607.01457#bib.bib5)\]\. Prior work on hallucination in personal document optimization specifically is more limited\. Concurrent system\-level work has begun integrating anti\-hallucination mechanisms into resume\-tailoring pipelines \(e\.g\.,\[[6](https://arxiv.org/html/2607.01457#bib.bib6)\]\), but to our knowledge no published work characterizes the underlying hallucination modes as a taxonomy or systematically isolates the contribution of individual defense layers\.

The ground truth in this domain is not an external knowledge base but the candidate’s own career history, which the LLM receives as input and must *enhance without distorting*\.

We present Grounded Optimization, a five\-layer defense\-in\-depth framework that addresses each hallucination mode through a distinct mechanism\. The first two layers address the most common failures we observed: temporal validation \([Section 3\.1](https://arxiv.org/html/2607.01457#S3.SS1)\) prevents the model from injecting later\-released technologies into historical roles by embedding release\-date constraints in every prompt, and a deterministic contamination detector \([Section 3\.2](https://arxiv.org/html/2607.01457#S3.SS2)\) catches cloud\-provider bleeding using a 257\-service regex taxonomy without involving another LLM \(which would introduce an additional hallucination surface\)\. Structural enforcement \([Section 3\.3](https://arxiv.org/html/2607.01457#S3.SS3)\) handles bullet compression: it counts roles and bullet points before and after optimization and rejects outputs that drop a role or lose more than one bullet from any role\. Prompt\-level grounding \([Section 3\.4](https://arxiv.org/html/2607.01457#S3.SS4)\) embeds explicit immutability rules for education, certifications, and company names directly in the agent prompts, providing a first\-line defense before deterministic checks are applied\. Finally, an evaluator agent \([Section 3\.5](https://arxiv.org/html/2607.01457#S3.SS5)\) deploys a separate LLM instance as a quality gate that can reject and re\-trigger the pipeline \(partially independent; see [Section 6\.1](https://arxiv.org/html/2607.01457#S6.SS1) for an H2\-specific coupling caveat\)\.

Our framework is implemented as a multi\-agent system built on LangGraph \[[7](https://arxiv.org/html/2607.01457#bib.bib7)\] that processes resumes through five parallel specialized agents, with the full defense stack applied to the Experience agent\. The system includes a fallback\-merge mechanism that combines the best LLM output with preserved originals to retain all original content \([Section 3\.6](https://arxiv.org/html/2607.01457#S3.SS6)\)\.

The contributions of this paper are:

1. A taxonomy of hallucination modes specific to personal document optimization, distinguishing temporal, cross\-domain, structural, and content fabrication failures \([Section 2](https://arxiv.org/html/2607.01457#S2)\)\.
2. A five\-layer engineering framework combining deterministic validation, prompt engineering, and multi\-agent adversarial checking, implemented and evaluated as a functional multi\-agent system \([Section 3](https://arxiv.org/html/2607.01457#S3)\)\.
3. A deterministic cloud\-provider contamination detector covering 257 services across AWS, GCP, Azure, and on\-premise stacks with two\-tier confidence scoring \([Section 3\.2](https://arxiv.org/html/2607.01457#S3.SS2)\)\.
4. An ablation and sensitivity analysis across 16 experimental conditions \(three LLMs, four temperatures, six layer configurations, 680 LLM invocations\) characterizing per\-layer contributions, with documented evaluation limitations \([Section 4](https://arxiv.org/html/2607.01457#S4), [Section 6\.1](https://arxiv.org/html/2607.01457#S6.SS1)\)\.

## 2 Hallucination Taxonomy for Personal Documents

We identify four distinct hallucination modes in personal document optimization, each with unique detection requirements and consequences\.

### 2\.1 Temporal Fabrication \(H1\)

The LLM inserts references to technologies that did not exist during the claimed time period\. In the technology sector, where new tools emerge rapidly and carry strong ATS keyword signals, this is a frequent failure mode in our experiments\. A role from January 2019 to March 2021 gets rewritten to include “Implemented RAG pipelines using LangChain and vector databases,” even though LangChain was released in late 2022 and the RAG paradigm \[[5](https://arxiv.org/html/2607.01457#bib.bib5)\], although introduced in 2020, was not widely adopted until late 2022\. We attribute this to a training\-data artifact: the model has no mechanism to learn which tools existed in which year relative to a particular person’s employment dates\.

### 2\.2 Cross\-Domain Contamination \(H2\)

Cross\-domain contamination proved to be the dominant failure mode in our experiments \(79–89% of baseline incidents\)\. The model introduces terminology from a technology ecosystem not present in the original role: an AWS\-focused position acquires Azure or GCP references because the job description mentions multi\-cloud\. In one test, a role at an AWS\-only company \(“Managed data pipelines using AWS Glue and Athena”\) was rewritten as “Orchestrated ETL workflows using Azure Data Factory and Synapse Analytics,” introducing Azure terminology absent from the original role\. The model treats cloud services as interchangeable synonyms when optimizing for keyword coverage and has no awareness of organizational technology constraints\.

### 2\.3 Structural Mutation \(H3\)

Structural mutation is a subtler failure mode in which the model does not fabricate information but instead abbreviates it\. A role with 8 bullet points may return with 4 or 5 “enhanced” entries that cover similar ground at a higher level of abstraction, while the most distinctive accomplishments \(those that differentiate one candidate from another\) are silently folded into generic summaries such as “Maintained and optimized production systems\.” Unlike the other hallucination modes, structural mutation *removes* truth rather than adding falsehood, making it harder to detect through surface\-level review\. The root cause appears to be that LLMs internalize conciseness as a quality signal, causing “optimize” to become “condense” without explicit instruction\.

### 2\.4 Content Fabrication \(H4\)

Content fabrication is the most straightforward failure mode: the model invents concrete details such as company names, inflated metrics \(“Reduced API latency by 90%” in a role that mentioned no performance numbers\), and non\-existent certifications\. This occurs less frequently than contamination or temporal fabrication in our data but is the hardest to detect post\-hoc, as fabricated metrics resemble plausible candidate accomplishments and require access to the candidate’s actual work history to verify\.

## 3 Defense Framework

Our layered defense addresses each hallucination mode through a distinct layer, as shown in [Figure 1](https://arxiv.org/html/2607.01457#S3.F1)\. Two layers act on the prompt at generation time: Layer 4 embeds immutability constraints directly in the agent prompts *before* the LLM call, making it the first defense to act on any given optimization cycle, and Layer 1 serializes its technology timeline into the same prompts \([Section 3\.1](https://arxiv.org/html/2607.01457#S3.SS1)\)\. The checks in Layers 1–3 and Layer 5 operate post\-generation, validating and potentially reverting the LLM’s output before it is accepted\. The layers are numbered by their role in the validation pipeline; the execution order within a single cycle is L4 \(prompt injection\) → LLM call → L1–L3 \(output validation\) → L5 \(evaluator gate\)\. Failures at any post\-generation layer trigger retry with augmented constraints or fallback to original content\.
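The execution order described above can be sketched as a small control loop\. This is a minimal illustration under stated assumptions, not the released implementation: `run_cycle`, `build_prompt`, `call_llm`, `validators`, `evaluator`, and `fallback_merge` are hypothetical names, the attempt limit of 3 is taken from the retry description in this section, and "best LLM output" is simplified to the last candidate\.

```python
from typing import Callable, List, Tuple

MAX_ATTEMPTS = 3  # assumption: the text mentions a fallback after 3 failed attempts


def run_cycle(
    original: str,
    build_prompt: Callable[[str, List[str]], str],
    call_llm: Callable[[str], str],
    validators: List[Callable[[str, str], List[str]]],
    evaluator: Callable[[str, str], Tuple[bool, str]],
    fallback_merge: Callable[[str, str], str],
) -> str:
    """One optimization cycle: prompt -> LLM -> deterministic checks -> evaluator."""
    warnings: List[str] = []
    last_candidate = original
    for _ in range(MAX_ATTEMPTS):
        # Generation time: grounding rules and accumulated warnings go into the prompt.
        candidate = call_llm(build_prompt(original, warnings))
        last_candidate = candidate
        # Post-generation: every validator returns a list of problem descriptions.
        problems = [p for check in validators for p in check(original, candidate)]
        if not problems:
            accepted, feedback = evaluator(original, candidate)
            if accepted:
                return candidate
            problems = [feedback]
        warnings.extend(problems)  # augmented constraints for the retry
    # All attempts failed: merge so that no original content is lost.
    return fallback_merge(original, last_candidate)
```

Passing the stages in as callables keeps the loop independent of any particular LLM client, so the deterministic checks can be tested with stub generators\.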

Figure 1 \(diagram not reproduced\)\. The layers shown are:

1. Layer 1: Temporal Context Validation \(technology timeline embedded in agent prompts\)
2. Layer 2: Cross\-Domain Contamination Detection \(deterministic taxonomy \+ word\-boundary matching\)
3. Layer 3: Structural Invariant Enforcement \(role count \+ bullet count validation\)
4. Layer 4: Prompt\-Level Content Grounding \(immutability rules for education, certs, companies\)
5. Layer 5: Evaluator Agent QA Gate \(independent LLM adversarial validation\)

A feedback edge in the diagram is labeled “Retry with augmented constraints\.”

Figure 1: Five\-layer defense\-in\-depth architecture\. Each layer addresses a distinct hallucination mode\. Failed validation at Layer 5 triggers a retry cycle with contamination warnings and structural constraints injected into the prompt\. After 3 failed retries, the system falls back to a merge of the best LLM output with original content\.

### 3\.1 Layer 1: Temporal Context Validation

The temporal context layer prevents anachronistic technology injection \(H1\) by building a per\-resume timeline and embedding it as a constraint in every agent prompt\.

Given a resume $R$ with experience entries $E=\{e_1,\ldots,e_n\}$, each with start/end dates, we construct a temporal context $\text{TC}(R)$ containing the career span, a technology\-to\-year\-range mapping derived from bullet\-point scanning, and the current year\. We maintain a curated mapping of technology release dates \(e\.g\., LangChain → 2022, Vertex AI → 2021\) that constrain which technologies may appear in which roles\. The full timeline construction algorithm and release\-date table are in [Appendix A](https://arxiv.org/html/2607.01457#A1)\.

The temporal context is serialized and injected into every agent prompt, instructing the LLM to verify technology existence during each role’s time period\.
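A minimal sketch of this layer, assuming a simplified timeline: it serializes a per\-role constraint line for the prompt and flags technologies whose release year is later than the role's end year\. The `Role` dataclass and the function names are hypothetical, the technology\-to\-year\-range scan of existing bullets is omitted, and only the two release years quoted above \(LangChain 2022, Vertex AI 2021\) are used; the paper's actual algorithm is in its Appendix A\.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# Only the two release years quoted in the text; the real table is larger.
RELEASE_YEAR: Dict[str, int] = {"langchain": 2022, "vertex ai": 2021}


@dataclass
class Role:
    title: str
    start_year: int
    end_year: int
    bullets: List[str]


def temporal_context(roles: List[Role], current_year: int) -> str:
    """Serialize a per-resume timeline that can be injected into a prompt."""
    first = min(r.start_year for r in roles)
    last = max(r.end_year for r in roles)
    lines = [f"Career span: {first}-{last}", f"Current year: {current_year}"]
    for r in roles:
        too_new = sorted(t for t, y in RELEASE_YEAR.items() if y > r.end_year)
        rule = f"must not mention {', '.join(too_new)}" if too_new else "no restrictions from this table"
        lines.append(f"{r.title} ({r.start_year}-{r.end_year}): {rule}")
    return "\n".join(lines)


def anachronisms(role: Role, text: str) -> List[str]:
    """Technologies mentioned in `text` that were released after the role ended."""
    found = []
    for tech, year in RELEASE_YEAR.items():
        pattern = rf"\b{re.escape(tech)}\b"
        if year > role.end_year and re.search(pattern, text, re.IGNORECASE):
            found.append(tech)
    return sorted(found)
```

For the 2019–2021 role from Section 2\.1, the rewritten bullet mentioning LangChain would be flagged, while a Vertex AI mention would pass because its release year does not exceed the role's end year\.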

### 3\.2 Layer 2: Cross\-Domain Contamination Detection

The contamination detection layer addresses cross\-domain bleeding \(H2\) through a fully deterministic, LLM\-free mechanism\. An initial LLM\-based approach \(asking the model to verify its own output for foreign cloud services\) proved functional but added latency and cost per invocation\. Because cloud service names form a finite, enumerable set, a deterministic regex\-based approach is both sufficient and more efficient\.

We construct a taxonomy $\mathcal{T}$ of 257 cloud services across four ecosystems \(AWS: 76, GCP: 53, Azure: 64, On\-Premise: 64\), plus a cloud\-agnostic set of 69 provider\-independent technologies\. Each ecosystem entry consists of explicit provider keywords \(e\.g\., “aws”\) and service names \(e\.g\., “sagemaker”\)\. Detection uses two\-tier word\-boundary regex matching: Tier 1 attributes on a single explicit\-keyword match; Tier 2 requires $\geq 2$ service\-name matches to handle ambiguity \(e\.g\., “lambda” as AWS Lambda vs\. the Python keyword\)\. The full detection algorithm and ambiguity resolution are in [Appendix C](https://arxiv.org/html/2607.01457#A3)\.

The key design decision: we compare each role’s cloud signature*before and after*optimization, flagging only*newly introduced*providers:

$$\text{Contaminated}(e_i) = \text{Clouds}(e_i^{\text{updated}}) \setminus \text{Clouds}(e_i^{\text{original}}) \neq \emptyset \tag{1}$$

When contamination is detected, the role’s responsibilities are reverted to originals, a contamination warning is injected into the retry prompt, and optimization is retried with augmented constraints\.
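The two\-tier matching and the set difference in Equation 1 can be sketched as follows\. This is an illustration under stated assumptions: the `TAXONOMY` dictionary is a tiny hypothetical slice rather than the 257\-service taxonomy, and `clouds` and `newly_introduced` are names invented for this example, not the released `detect_role_contamination` function\.

```python
import re
from typing import Dict, List, Set

# Hypothetical slice of a taxonomy: provider keywords (Tier 1) and service names (Tier 2).
TAXONOMY: Dict[str, Dict[str, List[str]]] = {
    "aws": {"keywords": ["aws", "amazon web services"],
            "services": ["glue", "athena", "sagemaker", "lambda"]},
    "azure": {"keywords": ["azure"],
              "services": ["data factory", "synapse analytics"]},
    "gcp": {"keywords": ["gcp", "google cloud"],
            "services": ["bigquery", "dataflow"]},
}


def _hits(terms: List[str], text: str) -> int:
    """Count how many terms occur in text as whole words (case-insensitive)."""
    return sum(
        1 for t in terms
        if re.search(rf"\b{re.escape(t)}\b", text, re.IGNORECASE)
    )


def clouds(text: str) -> Set[str]:
    """Providers attributed to text: one keyword hit, or at least two service hits."""
    found: Set[str] = set()
    for provider, entry in TAXONOMY.items():
        tier1 = _hits(entry["keywords"], text) >= 1
        tier2 = _hits(entry["services"], text) >= 2
        if tier1 or tier2:
            found.add(provider)
    return found


def newly_introduced(original: str, updated: str) -> Set[str]:
    """Equation 1: providers present after optimization but not before."""
    return clouds(updated) - clouds(original)
```

Comparing signatures before and after, instead of checking the output in isolation, means a role that legitimately used two providers is not penalized for keeping both\.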

### 3\.3 Layer 3: Structural Invariant Enforcement

Structural mutation \(H3\) is addressed through pre/post counting of semantic units with tolerance\-aware validation\.

Before optimization, we record a structural signature $\text{Sig}(R)=(|E|,\{|b_i|\})$, the number of experience entries and bullet counts per entry\. After optimization, we validate that $|E'|\geq|E|$ and $|b'_i|\geq|b_i|-1$ for each entry, accommodating minor restructuring while preventing significant content loss\. When validation fails, the retry prompt includes explicit structural targets\. After 3 failed attempts, a deterministic fallback merge ensures all original content is retained \([Appendix D](https://arxiv.org/html/2607.01457#A4)\)\.
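A short sketch of the signature check, assuming a resume is represented as a list of roles, each a list of bullet strings, and that roles keep their order across optimization\. The names `signature` and `structural_violations` are hypothetical; the returned messages stand in for the "explicit structural targets" that would be added to a retry prompt\.

```python
from typing import List, Tuple

Signature = Tuple[int, List[int]]  # (number of roles, bullet count per role)


def signature(resume: List[List[str]]) -> Signature:
    """Sig(R): the role count and the bullet count of each role."""
    return len(resume), [len(bullets) for bullets in resume]


def structural_violations(before: Signature, after: Signature, tolerance: int = 1) -> List[str]:
    """Check |E'| >= |E| and |b'_i| >= |b_i| - tolerance for every role i."""
    n_before, bullets_before = before
    n_after, bullets_after = after
    problems: List[str] = []
    if n_after < n_before:
        problems.append(f"roles: expected >= {n_before}, got {n_after}")
    # Assumption: roles are aligned by position; zip stops at the shorter list.
    for i, (b, b_new) in enumerate(zip(bullets_before, bullets_after)):
        if b_new < b - tolerance:
            problems.append(f"role {i}: expected >= {b - tolerance} bullets, got {b_new}")
    return problems
```

With an 8\-bullet role, an output with 7 bullets passes and an output with 5 bullets produces one violation message\.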

### 3\.4 Layer 4: Prompt\-Level Content Grounding

Content fabrication \(H4\) is addressed through explicit immutability declarations embedded in every agent prompt\. While prompt\-level constraints alone are insufficient at higher temperatures and on weaker models \(Experiments 2–3\), they serve as a strong first line of defense that significantly reduces the frequency of violations the subsequent layers must catch\.

The grounding constraints are organized into four categories:

1. Content Preservation: “Preserve the exact number of bullet points for each entry\. DO NOT reduce or condense them\.”
2. Factual Immutability: “DO NOT hallucinate, add, or modify educational details \(institution name, location, degree information\)\.”
3. Entity Integrity: “DO NOT create a new company or use placeholder names\.”
4. Metric Realism: “Ensure metrics and numbers are realistic for the time period\.”
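One way the four rule categories could be assembled into a prompt block is sketched below\. The rule texts are quoted from the list above; the `grounding_block` function, the block layout, and the idea of appending warnings from a failed attempt in the same block are assumptions made for illustration\.

```python
from typing import Dict, Iterable

# Rule texts quoted from the four categories listed above.
GROUNDING_RULES: Dict[str, str] = {
    "Content Preservation": "Preserve the exact number of bullet points for each entry. "
                            "DO NOT reduce or condense them.",
    "Factual Immutability": "DO NOT hallucinate, add, or modify educational details "
                            "(institution name, location, degree information).",
    "Entity Integrity": "DO NOT create a new company or use placeholder names.",
    "Metric Realism": "Ensure metrics and numbers are realistic for the time period.",
}


def grounding_block(retry_warnings: Iterable[str] = ()) -> str:
    """Render the rules, plus any warnings from a failed attempt, as prompt text."""
    lines = ["GROUNDING RULES:"]
    lines += [f"- [{name}] {rule}" for name, rule in GROUNDING_RULES.items()]
    warnings = list(retry_warnings)
    if warnings:
        lines.append("ISSUES FROM THE PREVIOUS ATTEMPT:")
        lines += [f"- {w}" for w in warnings]
    return "\n".join(lines)
```

Keeping the rules in one table makes it straightforward to enable or disable a category when running an ablation\.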

### 3\.5 Layer 5: Evaluator Agent QA Gate

The final layer deploys an independent LLM instance as an adversarial quality\-control agent, implementing a generator\-critic architecture \[[8](https://arxiv.org/html/2607.01457#bib.bib8)\] specialized for personal document validation\.

The evaluator receives the original resume, the rewritten resume, and the target job description, and returns $(\text{is\_acceptable}\in\{0,1\},\ \text{feedback})$\. It checks for content removal, JD alignment, and plausible ATS improvement\. As a distinct model instance, it avoids generator\-bias transfer; on rejection, feedback is injected into the next rewrite attempt\. If the evaluator itself fails \(timeout or malformed output\), it defaults to rejection rather than silently accepting the candidate output\.
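The fail\-closed behavior can be illustrated with a small wrapper\. This is a sketch under assumptions: it supposes the evaluator replies with a JSON object carrying `is_acceptable` and `feedback` fields, which the text does not specify, and `gate` and `call_evaluator` are hypothetical names\.

```python
import json
from typing import Callable, Tuple


def gate(call_evaluator: Callable[[], str]) -> Tuple[bool, str]:
    """Return (is_acceptable, feedback); any evaluator failure counts as rejection."""
    try:
        raw = call_evaluator()
        data = json.loads(raw)          # malformed JSON raises a ValueError subclass
        accepted = data["is_acceptable"]
        if not isinstance(accepted, bool):
            raise ValueError("is_acceptable must be a boolean")
        return accepted, str(data.get("feedback", ""))
    except (TimeoutError, ValueError, KeyError, TypeError) as exc:
        # Fail closed: never accept a candidate the evaluator could not judge.
        return False, f"evaluator failure: {exc!r}"
```

Defaulting to rejection costs an extra retry when the evaluator misbehaves, but it avoids passing an unchecked rewrite downstream\.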

### 3\.6 Implementation

The framework is implemented as a multi\-agent pipeline on LangGraph \[[7](https://arxiv.org/html/2607.01457#bib.bib7)\]\. The system processes resumes through four stages \(parse, score, rewrite, re\-score\) with up to 5 optimization cycles\. Five specialized agents \(Summary, Skills, Experience, Projects, Education\) run in parallel; the Experience Agent receives the full defense stack because professional experience is where most hallucinations occur\. A LangGraph `AgentState` preserves original data alongside optimized versions throughout, enabling fallback merge at any point\. Full pipeline details are in [Appendix E](https://arxiv.org/html/2607.01457#A5)\.

## 4 Evaluation

We evaluate the framework through three complementary experiments following evaluation methodology from recent hallucination benchmarks \[[9](https://arxiv.org/html/2607.01457#bib.bib9),[10](https://arxiv.org/html/2607.01457#bib.bib10)\]: \(1\) an ablation study measuring each defense layer’s contribution, \(2\) a multi\-model generalization study across three LLMs, and \(3\) a temperature sensitivity analysis\. All experiments use 25 synthetic resumes \(42 roles, 188 bullet points\) and 5 job descriptions, with seed=42 for reproducibility\.

### 4\.1 Dataset

We construct a corpus of 25 synthetic resumes spanning 14 industries \(technology, finance, healthcare, manufacturing, consulting, retail, education, energy, government, media, logistics, telecom, insurance, and real estate\)\. Resumes contain 42 professional roles totaling 188 bullet points, ranging from 1 to 6 roles per resume and covering career histories from 2013 to 2026\. Five adversarial job descriptions are designed to induce hallucination: a multi\-cloud AI position requiring both AWS and Azure, a GCP ML role requesting RAG experience, an AWS full\-stack role mentioning generative AI, an Azure data analytics role, and a generic senior role\. Each resume is paired with one job description in round\-robin assignment\.

### 4\.2 Evaluation Protocol

For each experiment, every resume–JD pair is processed under the specified configuration and the output is evaluated by four deterministic hallucination detectors \(H1–H4\) that compare each optimized role against its original:

1. H1 Temporal detector: Checks for technologies released after the role’s end date, using a curated mapping of technology release years\.
2. H2 Contamination detector: Uses the cloud\-provider taxonomy \([Section 3\.2](https://arxiv.org/html/2607.01457#S3.SS2)\) to identify newly introduced cloud services not in the original role\.
3. H3 Structural detector: Compares bullet\-point counts, flagging any loss of more than 1 bullet\.
4. H4 Fabrication detector: Checks for company name changes and title mutations exceeding 50% word overlap\.

Known detector–defense coupling \(H2\)\. The H2 detector and the Layer 2 defense share the same underlying `detect_role_contamination` function from the cloud taxonomy module\. When Layer 2 is active in a configuration, any contamination it detects is reverted *before* the H2 detector evaluates the output; the detector and the defense therefore cannot disagree by construction\. H2 counts in L2\-active configurations are thus a tautological consequence of L2’s revert behavior and should not be interpreted as independent empirical measurements\. We retain these counts in the tables for completeness but discuss the implication in [Section 6\.1](https://arxiv.org/html/2607.01457#S6.SS1) and treat them accordingly when interpreting results\.

### 4\.3 Metrics

We report Hallucination Rate \(HR\): mean detected hallucination incidents per resume, with standard deviation \($\sigma$\) and 95% confidence interval\.
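A sketch of how these quantities could be computed from per\-resume incident counts\. The text does not state the interval formula, so the normal\-approximation half\-width $1.96\,\sigma/\sqrt{n}$ is an assumption; it is at least consistent with the reported baseline, where $\sigma=3.84$ and $n=25$ give about 1\.51 against the reported ±1\.50 in Table 1\. The function names and the use of the sample standard deviation are likewise assumptions\.

```python
import math
from statistics import mean, stdev
from typing import List, Tuple


def ci_half_width(sigma: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation half-width of a 95% confidence interval for the mean."""
    return z * sigma / math.sqrt(n)


def hallucination_rate(counts: List[int]) -> Tuple[float, float, float]:
    """Return (HR, sigma, CI half-width) for per-resume incident counts."""
    hr = mean(counts)
    sigma = stdev(counts)  # sample std; the population std (pstdev) is also plausible
    return hr, sigma, ci_half_width(sigma, len(counts))
```

Because the counts are zero\-inflated and heavy\-tailed, a bootstrap interval may describe the uncertainty better than this approximation\.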

### 4\.4 Experiment 1: Ablation Study

We test six defense configurations with GPT\-4\.1\-nano at temperature=0 to isolate each layer’s contribution:

| Defense Configuration | Detect\. Rate ↓ | Std Dev | 95% CI | Temporal | Contam\.† | Structural | Fabrication |
|---|---|---|---|---|---|---|---|
| No defense \(baseline\) | 2\.48 | 3\.84 | ±1\.50 | 7 | 55 | 0 | 0 |
| L4 only \(prompt grounding\) | 0\.00 | 0\.00 | ±0\.00 | 0 | 0 | 0 | 0 |
| L1\+L4 \(\+ temporal\) | 0\.12 | 0\.43 | ±0\.17 | 2 | 0 | 0 | 1 |
| L1\+L2\+L4 \(\+ contamination\) | 0\.08 | 0\.27 | ±0\.11 | 1 | 0 | 1 | 0 |
| L1\+L2\+L3\+L4 \(\+ structural\) | 0\.16 | 0\.37 | ±0\.14 | 1 | 0 | 0 | 3 |
| Full \(L1\+L2\+L3\+L4\+L5\) | 0\.12 | 0\.33 | ±0\.13 | 1 | 0 | 0 | 2 |

The last four columns give detected incidents by type\. †Contamination counts under L2\-active configs are mechanically zero by construction \([Section 6\.1](https://arxiv.org/html/2607.01457#S6.SS1)\)\.

Table 1: Ablation study on 25 resumes \(GPT\-4\.1\-nano, $t=0$\)\. Contamination \(†\) counts when Layer 2 is active are mechanically zero by construction \(see [Section 6\.1](https://arxiv.org/html/2607.01457#S6.SS1)\) and are not independent measurements\. The undefended baseline produces 62 detected hallucination incidents \(2\.48 per resume\)\. Prompt\-level grounding alone \(L4\) achieves zero detected hallucinations in this single \(model, temperature\) configuration; the temperature and multi\-model experiments demonstrate this does not generalize\.

Observation: The L4\-only result is informative: with a strong instruction\-following model at $t=0$, prompt grounding alone produces zero detected hallucinations\. This is consistent with the view that modern LLMs can respect explicit behavioral constraints at low temperature\. Experiments 2 and 3 show the result does not generalize across models or temperatures\. L4\-only \(HR=0\.00\) also outperforms the Full framework \(HR=0\.12\) at this single configuration\. Inspection of the 3 residual incidents under Full reveals 1 H1 \(a 2019 role received a post\-2022 technology reference despite Layer 1 constraints\) and 2 H4 \(title reformulations such as “Senior Data Analyst” → “Lead Data Analyst” crossing the 50% word\-overlap threshold\)\. These H4 cases are likely false positives of our coarse fabrication detector\. Excluding them, Full achieves HR=0\.04, consistent with L4\-only\. The L4\-vs\-Full gap is therefore most likely an artifact of H4 detector sensitivity rather than evidence that additional layers harm performance\.

### 4\.5 Experiment 2: Multi\-Model Generalization

We test the baseline and full framework across three LLMs of varying capability at $t=0$:

| Model | Defense | Detect\. Rate ↓ | Std Dev | Temporal | Contam\.† | Structural | Fabrication | Reduction \(%\)† |
|---|---|---|---|---|---|---|---|---|
| GPT\-4\.1\-nano | Baseline | 2\.48 | 3\.84 | 7 | 55 | 0 | 0 | — |
| GPT\-4\.1\-nano | Full | 0\.12 | 0\.33 | 1 | 0 | 0 | 2 | 95\.2 |
| GPT\-4o\-mini | Baseline | 5\.36 | 5\.62 | 20 | 106 | 0 | 8 | — |
| GPT\-4o\-mini | Full | 0\.04 | 0\.20 | 1 | 0 | 0 | 0 | 99\.3 |
| Llama\-3\.1\-8B | Baseline | 4\.44 | 5\.61 | 19 | 88 | 0 | 4 | — |
| Llama\-3\.1\-8B | Full | 0\.12 | 0\.33 | 1 | 0 | 0 | 2 | 97\.3 |

The Temporal through Fabrication columns give detected incidents by type\. †Contamination and reduction figures inherit the H2 detector–defense coupling \([Section 6\.1](https://arxiv.org/html/2607.01457#S6.SS1)\)\.

Table 2: Multi\-model evaluation at $t=0$\. Contamination \(†\) counts under the Full configuration are mechanically zero by construction \(see [Section 6\.1](https://arxiv.org/html/2607.01457#S6.SS1)\)\. GPT\-4o\-mini and Llama\-3\.1\-8B produce roughly 1\.8–2\.2× the baseline detected hallucinations of GPT\-4\.1\-nano\. Reduction percentages are computed against detected\-HR and inherit the H2 caveat; we report them for engineering reference but do not claim elimination in an independent\-evaluator sense\.

Observation: The framework is applicable across model families \(OpenAI, Meta/Groq\)\. GPT\-4o\-mini produces 2\.2× as many baseline detected hallucinations as GPT\-4\.1\-nano, and Llama\-3\.1\-8B 1\.8× as many\. Under the Full configuration, detected\-HR falls to near zero on our current metrics\. H2 counts of zero under Full are structurally guaranteed \([Section 6\.1](https://arxiv.org/html/2607.01457#S6.SS1)\); the non\-tautological observations are \(i\) the large baseline H2 counts across all three models, which show the defense target is real, and \(ii\) the reductions in H1, H3, and H4, which are measured by detectors distinct from any active defense component\.
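The reduction percentages in Table 2 can be reproduced from the detected\-HR columns, assuming they are computed as the relative drop from baseline to Full\. The function name `reduction_pct` is hypothetical\.

```python
def reduction_pct(baseline_hr: float, defended_hr: float) -> float:
    """Relative drop in detected hallucination rate, in percent, to one decimal."""
    return round(100.0 * (1.0 - defended_hr / baseline_hr), 1)


# Detected-HR pairs (baseline, Full) taken from Table 2.
TABLE_2 = {
    "GPT-4.1-nano": (2.48, 0.12),
    "GPT-4o-mini": (5.36, 0.04),
    "Llama-3.1-8B": (4.44, 0.12),
}

for model, (baseline, full) in TABLE_2.items():
    print(model, reduction_pct(baseline, full))
```

Since the H2 counts under Full are zero by construction, these percentages describe detected incidents only and carry the same coupling caveat as the table\.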

### 4\.6 Experiment 3: Temperature Sensitivity

We vary the sampling temperature from 0 to 1\.0 with GPT\-4\.1\-nano:

| Temp\. | Defense | Detect\. Rate ↓ | Std Dev | Temporal | Contam\.† | Structural | Fabrication | Reduction \(%\)† |
|---|---|---|---|---|---|---|---|---|
| 0\.0 | Baseline | 2\.48 | 3\.84 | 7 | 55 | 0 | 0 | — |
| 0\.0 | Full | 0\.12 | 0\.33 | 1 | 0 | 0 | 2 | 95\.2 |
| 0\.3 | Baseline | 2\.12 | 2\.89 | 7 | 45 | 0 | 1 | — |
| 0\.3 | Full | 0\.16 | 0\.46 | 1 | 0 | 0 | 3 | 92\.5 |
| 0\.7 | Baseline | 1\.72 | 2\.91 | 8 | 34 | 0 | 1 | — |
| 0\.7 | Full | 0\.16 | 0\.37 | 2 | 0 | 0 | 2 | 90\.7 |
| 1\.0 | Baseline | 1\.80 | 3\.63 | 8 | 36 | 0 | 1 | — |
| 1\.0 | Full | 0\.24 | 0\.43 | 4 | 0 | 0 | 2 | 86\.7 |

The Temporal through Fabrication columns give detected incidents by type\. †Contamination and reduction figures inherit the H2 detector–defense coupling \([Section 6\.1](https://arxiv.org/html/2607.01457#S6.SS1)\)\.

Table 3: Temperature sensitivity \(GPT\-4\.1\-nano\)\. Contamination \(†\) counts under Full are mechanically zero by construction \(see [Section 6\.1](https://arxiv.org/html/2607.01457#S6.SS1)\)\. Baseline detected hallucinations decrease slightly at higher temperatures in this data; we are cautious interpreting this trend given $\sigma>\mu$ on all baselines\. Residual violations under Full are H1 \(temporal\) and H4 \(minor fabrication\); H1 residuals grow from 1 at $t=0$ to 4 at $t=1.0$, consistent with reduced prompt compliance under higher stochasticity\.

Observation: Detected\-HR under Full increases from 0\.12 at $t=0$ to 0\.24 at $t=1.0$, driven almost entirely by H1 \(temporal\) residuals \(1 → 4\), which are measured by a detector independent of any active defense layer\. The H1 trend is the most interpretable signal in this experiment because it is free of the detector–defense coupling that affects H2\. The graceful degradation of H1 detected\-count under increasing stochasticity suggests prompt compliance weakens with temperature, motivating deterministic layers as a complement rather than a replacement for prompt\-based grounding\.

### 4\.7 Cross\-Experiment Summary

Taken together, the three experiments suggest an interaction: prompt\-level grounding \(L4\) alone is sufficient at low temperature with a strong model, but prompt compliance weakens with increasing temperature \(H1 residuals under Full: 1 → 4\), and the other two models hallucinate more at baseline \(baseline HR: 2\.48 → 5\.36 across models\)\. The deterministic layers \(L1–L3\) therefore provide the most value where prompt compliance is weakest, at high temperature and with weaker models, rather than as a uniform improvement over prompt grounding alone\. The remaining residuals under defended configurations are almost entirely H1 \(temporal\) and likely\-false\-positive H4 \(minor title reformulations\), suggesting that further gains require either a stronger temporal enforcement mechanism or a semantics\-aware fabrication detector\.

## 5 Related Work

#### LLM hallucination\.

Hallucination has been extensively studied across summarization, translation, and dialogue \[[1](https://arxiv.org/html/2607.01457#bib.bib1),[2](https://arxiv.org/html/2607.01457#bib.bib2),[11](https://arxiv.org/html/2607.01457#bib.bib11)\]\. Detection methods include sampling consistency \(SelfCheckGPT \[[3](https://arxiv.org/html/2607.01457#bib.bib3)\]\) and fine\-grained factuality scoring \(FActScore \[[9](https://arxiv.org/html/2607.01457#bib.bib9)\]\)\. These approaches target *general knowledge* hallucination where ground truth exists in external corpora\. Personal document optimization is different in kind: the ground truth is the input document itself, and hallucination manifests as distortion of the user’s own data\.

#### Constrained generation and multi\-agent systems\.

Constrained decoding \[[12](https://arxiv.org/html/2607.01457#bib.bib12),[13](https://arxiv.org/html/2607.01457#bib.bib13)\] enforces token\-level constraints; our structural enforcement operates at the semantic\-unit level \(roles, bullet points\)\. Multi\-agent debate \[[14](https://arxiv.org/html/2607.01457#bib.bib14)\], self\-reflection \[[15](https://arxiv.org/html/2607.01457#bib.bib15)\], and critic\-generator frameworks \[[8](https://arxiv.org/html/2607.01457#bib.bib8)\] improve LLM reliability through adversarial checking\. Our evaluator agent extends this paradigm to personal document QA, where the critic checks content preservation rather than general quality\. A related pattern appears in code generation, where deterministic catalog selection and access\-control gating before LLM SQL generation reduce execution errors \[[16](https://arxiv.org/html/2607.01457#bib.bib16)\]; both settings suggest deterministic layers around generation can complement prompt\-level approaches\.

#### Taxonomy\-driven LLM evaluation\.

Recent benchmarks structure LLM behavioral evaluation around explicit hazard or failure\-mode taxonomies\. The MLCommons AI Safety Benchmark v0\.5 \[[17](https://arxiv.org/html/2607.01457#bib.bib17)\] introduces a 13\-hazard taxonomy for general\-purpose chat assistants\. Our four\-mode hallucination taxonomy \(H1–H4\) is narrower in scope but follows a similar methodological pattern: structured failure\-mode definitions paired with category\-specific detectors and per\-category reporting\.

#### Resume processing\.

Prior work focuses on parsing and screening \[[18](https://arxiv.org/html/2607.01457#bib.bib18)\], matching \[[19](https://arxiv.org/html/2607.01457#bib.bib19)\], scoring \[[20](https://arxiv.org/html/2607.01457#bib.bib20)\], and end\-to\-end LLM\-based resume generation \[[21](https://arxiv.org/html/2607.01457#bib.bib21)\]\. Recent concurrent work on resume\-tailoring systems has begun incorporating anti\-hallucination guardrails as a system component \[[6](https://arxiv.org/html/2607.01457#bib.bib6)\]\. Our work differs in framing: rather than building a single tailored system, we characterize the hallucination behaviors specific to this domain as a taxonomy and isolate the empirical contribution of individual defense layers\.

## 6 Limitations and Discussion

### 6\.1 H2 Detector–Defense Coupling

The H2 detector and Layer 2 defense share the same `detect_role_contamination` function\. Because Layer 2 reverts contaminated output *before* the detector runs, H2 counts under L2\-active configurations are mechanically zero by construction; we flag these with $\dagger$ throughout\. The large baseline H2 counts \(measured without any active defense\) confirm the defense target is real, but we cannot independently verify that Layer 2 *eliminates* contamination rather than merely hiding it from our own detector\. An independent NLI\-based evaluator on the existing 680 outputs is the highest\-priority extension\.

Other limitations are more conventional: we do not compare against SelfCheckGPT, FActScore, or CRITIC \(our H1–H4 detectors are task\-specific\); per\-resume counts are zero\-inflated and heavy\-tailed \($\sigma > \mu$ on all baselines\); the 25\-resume synthetic dataset and 257\-service taxonomy miss real\-world distributions and long\-tail platforms \(Salesforce, SAP\); and the framework detects presence but not magnitude hallucinations\.

#### Future work\.

An independent NLI\-based H2 evaluator on the existing 680 outputs is the highest\-priority extension\. Beyond that: comparison with external hallucination detectors \(SelfCheckGPT, FActScore\) and commodity guardrail frameworks \(e\.g\., NeMo Guardrails, Guardrails AI\) as alternative enforcement backends, human annotation on real resumes, and extension to other personal documents\. All code, taxonomy, and raw data are available at [https://github\.com/shashank\-indukuri/grounded\-optimization](https://github.com/shashank-indukuri/grounded-optimization)\.

## References

- Ji et al\. \[2023\] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung\. Survey of hallucination in natural language generation\. *ACM Computing Surveys*, 55\(12\):1–38, 2023\.
- Zhang et al\. \[2023\] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al\. Siren’s song in the AI ocean: A survey on hallucination in large language models\. *arXiv preprint arXiv:2309\.01219*, 2023\.
- Manakul et al\. \[2023\] Potsawee Manakul, Adian Liusie, and Mark JF Gales\. SelfCheckGPT: Zero\-resource black\-box hallucination detection for generative large language models\. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 9004–9017, 2023\.
- Kryściński et al\. \[2020\] Wojciech Kryściński, Bryan McCann, Caiming Xiong, and Richard Socher\. Evaluating the factual consistency of abstractive text summarization\. In *Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing*, pages 9332–9346, 2020\.
- Lewis et al\. \[2020\] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen\-tau Yih, Tim Rocktäschel, et al\. Retrieval\-augmented generation for knowledge\-intensive NLP tasks\. *Advances in Neural Information Processing Systems*, 33:9459–9474, 2020\.
- Abhinav \[2026\] Kumar Abhinav\. Career\-aware resume tailoring via multi\-source retrieval\-augmented generation with provenance tracking: A case study\. *arXiv preprint arXiv:2605\.05257*, 2026\.
- LangChain \[2024\] LangChain\. LangGraph: Building stateful, multi\-actor applications with LLMs\. [https://github\.com/langchain\-ai/langgraph](https://github.com/langchain-ai/langgraph), 2024\.
- Gou et al\. \[2024\] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen\. CRITIC: Large language models can self\-correct with tool\-interactive critiquing\. In *Proceedings of the Twelfth International Conference on Learning Representations*, 2024\.
- Min et al\. \[2023\] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen\-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi\. FActScore: Fine\-grained atomic evaluation of factual precision in long form text generation\. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 12076–12100, 2023\.
- Li et al\. \[2023\] Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian\-Yun Nie, and Ji\-Rong Wen\. HaluEval: A large\-scale hallucination evaluation benchmark for large language models\. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 6449–6464, 2023\.
- Huang et al\. \[2024\] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu\. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions\. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics*, 2024\.
- Hu et al\. \[2019\] J Edward Hu, Huda Khayrallah, Ryan Culkin, Patrick Xia, Tongfei Chen, Matt Post, and Benjamin Van Durme\. Improved lexically constrained decoding for translation and monolingual rewriting\. In *Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics*, pages 839–850, 2019\.
- Lu et al\. \[2021\] Ximing Lu, Peter West, Rowan Zellers, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi\. NeuroLogic decoding: \(Un\)supervised neural text generation with predicate logic constraints\. In *Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics*, pages 4288–4299, 2021\.
- Du et al\. \[2023\] Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch\. Improving factuality and reasoning in language models through multiagent debate\. *arXiv preprint arXiv:2305\.14325*, 2023\.
- Shinn et al\. \[2023\] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao\. Reflexion: Language agents with verbal reinforcement learning\. *Advances in Neural Information Processing Systems*, 36, 2023\.
- Agrawal and Indukuri \[2026\] Adarsh Agrawal and Shashank Indukuri\. Schema\-first retrieval: Embedding catalogs for natural language analytics\. *arXiv preprint arXiv:2606\.28387*, 2026\.
- Vidgen et al\. \[2024\] Bertie Vidgen, Adarsh Agrawal, Ahmed M\. Ahmed, Victor Akinwande, Namir Al\-Nuaimi, Najla Alfaraj, et al\. Introducing v0\.5 of the AI safety benchmark from MLCommons\. *arXiv preprint arXiv:2404\.12241*, 2024\.
- Sinha et al\. \[2021\] Ankit Kumar Sinha, M Amir Khusru Akhtar, and Anand Kumar\. Resume screening using natural language processing and machine learning: A systematic review\. In *Machine Learning and Information Processing*, pages 207–218\. Springer, 2021\.
- Deng et al\. \[2018\] Yao Deng, Hang Lei, Xiao Li, and Yihong Lin\. An improved deep neural network model for job matching\. In *2018 International Conference on Algorithms and Architectures for Parallel Processing*, pages 86–96\. Springer, 2018\.
- Mittal et al\. \[2020\] Vikas Mittal, Palak Mehta, Devesh Relan, and Garima Shakhla\. Methodology for resume parsing and job domain prediction\. *Journal of Statistics and Management Systems*, 23\(7\):1263–1274, 2020\.
- Zinjad et al\. \[2024\] Saurabh Bhausaheb Zinjad, Amrita Bhattacharjee, Amey Bhilegaonkar, and Huan Liu\. ResumeFlow: An LLM\-facilitated pipeline for personalized resume generation and refinement\. *arXiv preprint arXiv:2402\.06221*, 2024\.

## Appendix A Temporal Context Validation Details

### A\.1 Timeline Construction

Given a resume $R$ with professional experience entries $E=\{e_1,\ldots,e_n\}$, where each entry $e_i$ has start date $s_i$ and end date $t_i$, we construct:

$$\text{TC}(R)=\left\{\begin{aligned} &\text{career\_span}:[s_{\min},t_{\max}]\\ &\text{tech\_timeline}:\{(\tau,[\text{first\_used},\text{last\_used}])\}\\ &\text{current\_year}:y_{\text{now}}\end{aligned}\right\}\tag{2}$$

where $\tau$ ranges over technologies mentioned in existing bullet points, and first/last used years are derived from the dates of roles containing $\tau$\.

### A\.2 Release Date Constraints

Table 4: Example technology release date constraints embedded in temporal context\.

### A\.3 Timeline Construction Algorithm

Algorithm 1: Temporal Context Construction

1. Input: resume $R$ with experience entries $E$
2. Output: temporal context $\text{TC}(R)$
3. $\text{total\_months} \leftarrow 0$
4. $\text{tech\_timeline} \leftarrow \{\}$
5. for $e_i \in E$ do
6. Parse $s_i, t_i$ from $e_i.\text{dates}$
7. $\text{total\_months} \mathrel{+}= (t_i - s_i)$ in months
8. for responsibility $r \in e_i.\text{bullets}$ do
9. for technology $\tau$ detected in $r$ do
10. Update $\text{tech\_timeline}[\tau].\text{first\_used}$
11. Update $\text{tech\_timeline}[\tau].\text{last\_used}$
12. end for
13. end for
14. end for
15. return $\text{TC}(R)=\{\text{total\_months},\text{tech\_timeline},y_{\text{now}}\}$
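Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the released implementation: the `Entry` dataclass, the `KNOWN_TECH` keyword tuple, the whole-month arithmetic, and the naive substring matching are all hypothetical stand-ins for the paper's parser and technology detector.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical keyword list; the paper's actual technology detector is not shown.
KNOWN_TECH = ("python", "kubernetes", "spark")


@dataclass
class Entry:
    start: date
    end: date
    bullets: list[str] = field(default_factory=list)


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end (assumed granularity)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def build_temporal_context(entries: list[Entry], current_year: int) -> dict:
    total_months = 0
    tech_timeline: dict[str, dict[str, int]] = {}
    for e in entries:
        total_months += months_between(e.start, e.end)
        for bullet in e.bullets:
            lowered = bullet.lower()
            for tech in KNOWN_TECH:
                if tech in lowered:  # naive substring detection, illustration only
                    span = tech_timeline.setdefault(
                        tech, {"first_used": e.start.year, "last_used": e.end.year}
                    )
                    # Widen the span to cover every role that mentions the technology.
                    span["first_used"] = min(span["first_used"], e.start.year)
                    span["last_used"] = max(span["last_used"], e.end.year)
    return {
        "total_months": total_months,
        "tech_timeline": tech_timeline,
        "current_year": current_year,
    }
```

The returned dictionary mirrors the three fields in the algorithm's return statement; a technology's span widens as more roles mention it.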

## Appendix B Recency\-Bounded Optimization

An additional grounding mechanism limits the optimization scope: only roles within the most recent 7 years are processed\. The 7\-year threshold was determined empirically: beyond this point, the marginal benefit of optimization diminished while hallucination risk for older roles increased\. Older roles are passed through untouched, removing them from the optimization scope and therefore from this framework’s hallucination risk\.

$$E_{\text{process}}=\{e_i\in E:\text{months}(e_i)\leq 84\},\quad E_{\text{preserve}}=E\setminus E_{\text{process}}\tag{3}$$

The recency split operates chronologically from the most recent role, accumulating months until the 7\-year \(84\-month\) threshold is reached, so $\text{months}(e_i)$ is best read as the running total at role $e_i$ rather than the duration of that role alone\.
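A minimal sketch of this split is shown below. The tuple layout and the boundary rule (a role is processed only while the running total, including that role, stays within 84 months) are assumptions; the text does not specify how a role that straddles the threshold is handled.

```python
def recency_split(roles, threshold_months=84):
    """Split roles (most recent first) into (process, preserve) lists.

    Each role is a (label, duration_months) tuple; this layout is hypothetical.
    """
    process, preserve = [], []
    accumulated = 0
    for label, duration in roles:
        accumulated += duration
        # Assumed boundary rule: process a role only if the running total,
        # including this role, is still within the threshold.
        if accumulated <= threshold_months:
            process.append(label)
        else:
            preserve.append(label)
    return process, preserve
```

With roles of 36, 40, 24, and 12 months listed newest first, the running totals are 36, 76, 100, and 112, so the first two roles are processed and the last two pass through untouched.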

## Appendix C Contamination Detection Details

### C\.1 Two\-Tier Detection Algorithm

Algorithm 2: Two\-Tier Cloud Provider Detection

1. Input: text $x$, taxonomy $\mathcal{T}$
2. Output: detected providers $P$
3. $P \leftarrow \emptyset$
4. for $(j, K_j, S_j) \in \mathcal{T}$ do
5. if $\exists k \in K_j : \textsc{WordMatch}(k, x)$ then
6. $P \leftarrow P \cup \{j\}$; continue
7. end if
8. $m \leftarrow |\{s \in S_j : \textsc{WordMatch}(s, x)\}|$
9. if $m \geq 2$ then
10. $P \leftarrow P \cup \{j\}$
11. end if
12. end for
13. return $P$ if $P \neq \emptyset$ else $\{\text{Cloud-Agnostic}\}$

The `WordMatch` function uses compiled word\-boundary regex patterns \(`\bterm\b`\) with case\-insensitive matching, ensuring that “S3” matches the AWS service as a standalone token but not inside longer strings like “MS365\.”
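The two tiers can be sketched in Python as below. The sketch reads $K_j$ as provider\-level keywords and $S_j$ as service names, which is an interpretation of the algorithm rather than something the text states; the `TAXONOMY` excerpt is hypothetical and far smaller than the released 257\-service taxonomy, and `re.search` relies on the `re` module's internal pattern cache instead of explicitly precompiled patterns.

```python
import re

# Hypothetical two-provider excerpt, for illustration only.
TAXONOMY = {
    "AWS": {
        "keywords": ["aws", "amazon web services"],
        "services": ["s3", "lambda", "redshift"],
    },
    "GCP": {
        "keywords": ["gcp", "google cloud"],
        "services": ["bigquery", "vertex ai", "cloud run"],
    },
}


def word_match(term: str, text: str) -> bool:
    """Case-insensitive match of term as a whole word (or phrase) in text."""
    pattern = rf"\b{re.escape(term)}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def detect_providers(text: str, taxonomy: dict = TAXONOMY) -> set:
    providers = set()
    for name, entry in taxonomy.items():
        # Tier 1: a single provider-level keyword is sufficient.
        if any(word_match(k, text) for k in entry["keywords"]):
            providers.add(name)
            continue
        # Tier 2: require at least two distinct service names.
        hits = sum(word_match(s, text) for s in entry["services"])
        if hits >= 2:
            providers.add(name)
    return providers or {"Cloud-Agnostic"}
```

Requiring two service hits in the second tier keeps a single generic word such as "Lambda" from attributing a role to a provider on its own.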

### C\.2 Ambiguity Resolution

Certain terms require contextual disambiguation\. For example, “Glue” could refer to AWS Glue \(an ETL service\) or general adhesive\. We handle ambiguous terms by requiring co\-occurrence with provider context:

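```python
# Fragment from inside the per-service matching loop: skip an ambiguous
# service name unless the provider context also appears in the text.
if service in ["glue", "powerbi", "databricks"]:
    if service == "glue" and "aws" not in text_lower:
        continue
    if service == "databricks" and provider == "Azure" \
            and "azure" not in text_lower:
        continue
```

Listing 1: Ambiguity resolution for context\-dependent terms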

## Appendix D Structural Enforcement Details

### D\.1 Retry Prompt Template

When structural validation fails, the retry prompt includes explicit targets:

```text
CRITICAL RETRY INSTRUCTION - ATTEMPT {attempt}:
You MUST return EXACTLY {role_count} roles:
- role_0: Senior Engineer at Company A: 6 bullets
- role_1: Engineer at Company B: 5 bullets
DO NOT skip any roles. Process ALL roles from input.
```

Listing 2: Structural retry injection
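One way such a prompt could be assembled from the original role list is sketched below. The function name `build_retry_prompt` and the `(title, company, bullet_count)` tuple layout are hypothetical; only the prompt wording follows Listing 2.

```python
def build_retry_prompt(attempt, roles):
    """Build the retry instruction from the original resume's roles.

    roles: list of (title, company, bullet_count) tuples (assumed layout).
    """
    lines = [
        f"CRITICAL RETRY INSTRUCTION - ATTEMPT {attempt}:",
        f"You MUST return EXACTLY {len(roles)} roles:",
    ]
    # One explicit target per role, so the model sees the expected structure.
    for i, (title, company, bullet_count) in enumerate(roles):
        lines.append(f"- role_{i}: {title} at {company}: {bullet_count} bullets")
    lines.append("DO NOT skip any roles. Process ALL roles from input.")
    return "\n".join(lines)
```

Deriving the targets from the original resume rather than from the failed output keeps the expected role and bullet counts fixed across attempts.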

### D\.2 Fallback Merge Algorithm

After 3 failed validation attempts, the system executes a deterministic merge:

Algorithm 3: Fallback Merge Strategy

1. Input: original roles $E$, best LLM output roles $E'$
2. Output: merged roles $M$ with zero content loss
3. $\text{map} \leftarrow \{(\text{title}_i, \text{company}_i) \to e'_i : e'_i \in E'\}$
4. $M \leftarrow [\,]$
5. for $e_i \in E$ do
6. $\text{key} \leftarrow (\text{title}_i, \text{company}_i)$
7. if $\text{key} \in \text{map}$ then
8. $e'_i \leftarrow \text{map}[\text{key}]$
9. if $|b'_i| < |b_i|$ then
10. $b'_i \leftarrow b'_i + b_i[|b'_i|:]$
11. end if
12. $M.\text{append}(e'_i)$
13. else
14. $M.\text{append}(e_i)$
15. end if
16. end for
17. return $M$

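Here $b_i$ and $b'_i$ denote the bullet lists of $e_i$ and $e'_i$\. By construction, the merged output contains every original role and at least as many bullet points per role as the original, even when the LLM consistently fails structural validation\.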
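The merge can be sketched in Python as follows. The dictionary schema with `title`, `company`, and `bullets` keys is an assumption made for illustration; the released code may represent roles differently.

```python
def fallback_merge(original, llm_output):
    """Merge LLM roles into the original role list without dropping content.

    Roles are dicts with 'title', 'company', and 'bullets' keys (assumed schema).
    """
    by_key = {(r["title"], r["company"]): r for r in llm_output}
    merged = []
    for role in original:
        candidate = by_key.get((role["title"], role["company"]))
        if candidate is None:
            # The LLM dropped or renamed this role: keep the original as-is.
            merged.append(role)
            continue
        bullets = list(candidate["bullets"])
        if len(bullets) < len(role["bullets"]):
            # Pad with the original bullets the LLM output did not cover.
            bullets += role["bullets"][len(bullets):]
        merged.append({**candidate, "bullets": bullets})
    return merged
```

Because the loop iterates over the original roles, the output length always equals the input length, and any role the LLM invented that has no matching original key is ignored.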

## Appendix E System Architecture Details

### E\.1 Pipeline Stages

The system processes resumes through a four\-stage state machine:

1. Parse: LLM\-based PDF\-to\-JSON conversion, producing structured resume data with typed fields \(contact info, experience entries with dates and bullet points, education, skills, projects, certifications\)\.
2. Score: ATS scoring against the target job description, producing section\-level feedback and an aggregate score\.
3. Rewrite: Multi\-agent parallel optimization with all five defense layers active\.
4. Re\-Score: The optimized resume is scored again; if the score has not improved sufficiently, the rewrite stage is repeated \(up to 5 cycles\)\.
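The rewrite and re\-score loop in stages 3 and 4 might look like the sketch below. The `score` and `rewrite` callables, the `min_gain` threshold, and the choice to rewrite the latest version rather than the original are all assumptions; the text gives only the 5\-cycle cap and does not define "improved sufficiently".

```python
MAX_CYCLES = 5  # from the text: the rewrite stage repeats up to 5 cycles


def optimize(resume, job_description, score, rewrite, min_gain=1.0):
    """Rewrite until the score gain reaches min_gain or the cycle cap is hit.

    score(resume, jd) -> float and rewrite(resume, jd) -> resume are
    caller-supplied callables (hypothetical interfaces).
    """
    baseline = score(resume, job_description)
    current = resume
    for cycle in range(1, MAX_CYCLES + 1):
        current = rewrite(current, job_description)
        if score(current, job_description) - baseline >= min_gain:
            break
    return current, cycle
```

The function returns both the final resume and the number of cycles used, which makes it easy to log how often the cap is reached.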

### E\.2 Agent Specialization

Five specialized agents run in parallel:

- Summary Agent: Optimizes the professional summary
- Skills Agent: Aligns skills with job requirements
- Experience Agent: Rewrites professional experience \(full defense stack\)
- Projects Agent: Enhances project descriptions
- Education Agent: Validates \(but does not modify\) education entries

The Experience Agent receives the heaviest defense treatment because professional experience is where most hallucinations occur\. It alone executes the retry\-validation\-fallback loop described in [Section 3\.3](https://arxiv.org/html/2607.01457#S3.SS3)\.

### E\.3 State Management

The system maintains a LangGraph `AgentState` that carries the resume through all stages, preserving the original data alongside optimized versions\. This enables the fallback merge at any point and provides full diff\-based auditability of every change\.
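To illustrate how keeping the original next to the optimized version supports diff\-based auditing, here is a minimal sketch. The `ResumeState` fields are hypothetical and do not reproduce the paper's `AgentState` schema; a plain `TypedDict` is used so the example runs without LangGraph installed, and the diff comes from the standard library's `difflib`.

```python
import difflib
from typing import TypedDict


class ResumeState(TypedDict):
    # Hypothetical fields, not the paper's actual AgentState schema.
    original_bullets: list[str]   # kept unchanged after parsing
    optimized_bullets: list[str]  # latest rewrite output


def audit_diff(state: ResumeState) -> list[str]:
    """Return a unified diff between original and optimized bullets."""
    return list(
        difflib.unified_diff(
            state["original_bullets"],
            state["optimized_bullets"],
            fromfile="original",
            tofile="optimized",
            lineterm="",
        )
    )
```

Because the original bullets are never overwritten, the same state object can feed both the audit diff and a fallback merge.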

## Appendix F Qualitative Examples

Cross\-model contamination \(GPT\-4o\-mini\): Optimizing a GCP\-only ML Engineer role for a multi\-cloud job description, GPT\-4o\-mini injected 7 Azure terms \(including “Azure ML Studio,” “Cosmos DB,” and “Azure Monitor”\) and 4 AWS terms \(“SageMaker,” “Glue,” “Redshift,” “CloudWatch”\) into a single role, none of which appeared in the original\. The deterministic detector identified all 11 foreign terms and reverted the output\.

Temporal fabrication at high temperature: At $t=1.0$, GPT\-4\.1\-nano rewrote a 2017–2019 bank analyst role to include “leveraged vector databases for semantic search,” a technology paradigm that emerged in 2022\. At $t=0$, the same model respected the temporal constraint\. This illustrates the temperature\-dependent reliability of prompt\-level constraints\.

Intern hallucination across models: A software intern \(Python Flask, pytest, 3 months\) was optimized for a GCP ML Engineer position\. All three models injected cloud services \(GCP: “Cloud Run,” “Vertex AI”; AWS: “Lambda,” “SageMaker”\) at baseline, fabricating cloud expertise for a candidate with zero cloud experience\. The framework correctly identified and reverted all contamination\.
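The temporal fabrication example suggests a simple post\-hoc check, sketched below. This is not the paper's H1 detector: the `RELEASE_YEAR` table and the substring matching are hypothetical, and only the 2022 figure for vector databases is taken from the example above.

```python
# Hypothetical release-year table; only the vector database entry
# reflects a date mentioned in the text.
RELEASE_YEAR = {"vector databases": 2022}


def anachronisms(bullet, role_end_year, release_year=RELEASE_YEAR):
    """Return terms in bullet whose release year is after the role ended."""
    lowered = bullet.lower()
    return [
        term
        for term, year in release_year.items()
        if term in lowered and year > role_end_year
    ]
```

For the 2017–2019 analyst role, the rewritten bullet is flagged because 2022 is later than 2019, while the same bullet on a role ending in 2023 passes.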
