PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection

arXiv cs.LG Papers

Summary

PromptAudit is a controlled evaluation framework that isolates the effects of prompt formulations on LLM-based vulnerability detection, finding that chain-of-thought prompting achieves the best overall performance while prompt sensitivity must be treated as a first-class system property.

arXiv:2605.24171v1 Announce Type: new Abstract: Large language models are increasingly used for vulnerability detection, yet their reliability under different prompt formulations remains uncharacterized. We present PromptAudit, a controlled evaluation framework that isolates prompt effects by fixing the dataset, decoding, and parsing while varying only the prompting strategy. Using five prompting strategies across five open-weight models on 1,000 CVEs (6,074 code samples spanning 16 programming languages), we evaluate accuracy, recall, abstention, coverage, and effective F1. We find that standard chain-of-thought prompting achieves the strongest overall operational performance, while few-shot prompting provides model-dependent benefits that are most pronounced for prompt-sensitive models. In contrast, adaptive chain-of-thought frequently suppresses recall and self-consistency induces excessive abstention, sharply reducing effective performance. These results show that vulnerability detection behavior is jointly determined by the model and the prompt, and that prompt sensitivity is a first-class system property that must be explicitly characterized in evaluation and deployment.


# PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection
Source: [https://arxiv.org/html/2605.24171](https://arxiv.org/html/2605.24171)
Yahya Hmaiti \(University of Central Florida, Orlando, Florida, USA; yohan\.hmaiti@ucf\.edu\), Mandana Ghadamian \(University of Central Florida, Orlando, Florida, USA; mandana\.ghadamian@ucf\.edu\), and David Mohaisen \(University of Central Florida, Orlando, Florida, USA; mohaisen@ucf\.edu\)

###### Abstract\.

Large language models are increasingly used for vulnerability detection, yet their reliability under different prompt formulations remains uncharacterized\. We present PromptAudit, a controlled evaluation framework that isolates prompt effects by fixing the dataset, decoding, and parsing while varying only the prompting strategy\. Using five prompting strategies across five open\-weight models on 1,000 CVEs \(6,074 code samples spanning 16 programming languages\), we evaluate accuracy, recall, abstention, coverage, and effective F1\. We find that standard chain\-of\-thought prompting achieves the strongest overall operational performance, while few\-shot prompting provides model\-dependent benefits that are most pronounced for prompt\-sensitive models\. In contrast, adaptive chain\-of\-thought frequently suppresses recall and self\-consistency induces excessive abstention, sharply reducing effective performance\. These results show that vulnerability detection behavior is jointly determined by the model and the prompt, and that prompt sensitivity is a first\-class system property that must be explicitly characterized in evaluation and deployment\.

Keywords: large language models, vulnerability detection, prompt sensitivity, software security, CVEfixes, empirical evaluation

## 1\.Introduction

Large language models \(LLMs\) are appearing more frequently in software\-security tasks, including vulnerability detection, vulnerability repair, and security\-aware code generation \(Lin and Mohaisen, [2025](https://arxiv.org/html/2605.24171#bib.bib31); Jiang et al\., [2025](https://arxiv.org/html/2605.24171#bib.bib23); Basic and Giaretta, [2024](https://arxiv.org/html/2605.24171#bib.bib5); Kharma et al\., [2026](https://arxiv.org/html/2605.24171#bib.bib25)\)\. In this work, we focus specifically on general\-purpose, instruction\-tuned models applied to vulnerability detection through prompting alone, without any security\-specific fine\-tuning\. Such models can read code, identify familiar weakness patterns, and provide rapid judgments that often appear more flexible than traditional static or dynamic analysis tools\. Whereas those tools rely on fixed rules and signatures, LLMs operate over learned representations that generalize across programming languages and vulnerability classes \(He and Vechev, [2023](https://arxiv.org/html/2605.24171#bib.bib20); Ullah et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib43)\)\. However, prior work also consistently reports that LLM\-based vulnerability detection exhibits substantial variability, with performance strongly dependent on prompt formulation, model configuration, dataset properties, and evaluation protocol \(Sclar et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib39); Chatterjee et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib8); Lin and Mohaisen, [2025](https://arxiv.org/html/2605.24171#bib.bib31)\)\.

Existing evaluations typically attribute this variability to model behavior alone\. Yet, as recent analyses indicate, prompt sensitivity reflects not only model uncertainty but also how evaluation design mediates inference \(Sclar et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib39); Liang et al\., [2023](https://arxiv.org/html/2605.24171#bib.bib30)\)\. Semantically equivalent prompts can yield different reasoning traces and different vulnerability labels, indicating that reported performance depends on prompt structure, output constraints, and parsing rules in addition to the model itself \(Chatterjee et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib8); Lin and Mohaisen, [2025](https://arxiv.org/html/2605.24171#bib.bib31)\)\. In binary classification settings such as labeling code as SAFE or VULNERABLE, this instability complicates reproducibility and obscures comparison across studies\. It also carries security\-specific consequences that go beyond generic classification error\. Vulnerability detection is not a symmetric problem: a false negative, which is a vulnerable snippet classified as safe, may allow exploitable code to proceed undetected, whereas a false positive adds reviewer workload but does not introduce risk\. Abstention, where the model declines to produce a verdict, poses a similar operational concern: code that receives no verdict is effectively unreviewed unless a downstream fallback is in place\. Prompt\-induced shifts in recall or abstention rate can therefore alter the operational security behavior of a detector without any change to the underlying model, making prompt design a first\-order deployment concern rather than an evaluation detail\.

Related work further shows that evaluation artifacts can amplify apparent instability\. Scoring methods, prompt templates, and output extraction procedures may penalize semantically correct outputs or shift decision thresholds without improving underlying reasoning \(Hua et al\., [\[n\. d\.\]](https://arxiv.org/html/2605.24171#bib.bib22)\)\. In vulnerability detection, these effects are compounded by dataset characteristics, including class imbalance and commit\-level labeling, where ground truth is often ambiguous or context\-dependent \(Lin and Mohaisen, [2025](https://arxiv.org/html/2605.24171#bib.bib31); Ullah et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib43)\)\. As a result, observed accuracy may conflate model capability, prompt\-induced variance, and dataset uncertainty, limiting the interpretability of empirical conclusions \(Liang et al\., [2023](https://arxiv.org/html/2605.24171#bib.bib30)\)\.

This work reframes prompt sensitivity as an evaluation problem rather than a property of individual prompts or models\. Building on prior analyses of prompting strategies, sensitivity, and benchmarking practices, we introduce PromptAudit, a controlled evaluation framework that isolates the effect of prompt structure by fixing the dataset, decoding parameters, and evaluation pipeline while varying only the prompting strategy\. This study does not propose a new vulnerability detection model, introduce new benchmarks, or claim improvements in detection accuracy\. Instead, it provides a systematic measurement analysis of how inference\-time prompting strategies affect the stability and reliability of LLM\-based vulnerability classification\. By disentangling prompt\-induced variance from other experimental factors, PromptAudit enables more reliable interpretation of prior results and supports principled comparison across vulnerability detection studies\.

Research Questions\. This study addresses four research questions\. RQ1: How do inference\-time prompting strategies affect vulnerability detection performance, coverage, and abstention under controlled evaluation conditions? RQ2: Do prompting strategies primarily improve conditional detection performance, or do they shift the precision\-recall\-coverage tradeoff? RQ3: What recurring failure modes emerge under different prompting strategies, and how do they affect operational reliability? RQ4: How do model\-strategy interactions shape operational performance, and which combinations provide the most reliable behavior under our evaluation setup? Sections [4](https://arxiv.org/html/2605.24171#S4) and [5](https://arxiv.org/html/2605.24171#S5) address these questions\.

Contributions\. We make the following contributions: \(1\) Prompt Sensitivity Controlled Evaluation\. We present a controlled measurement study of prompt sensitivity in LLM vulnerability detection that isolates prompting strategy as the primary experimental variable\. By fixing the dataset, decoding parameters, and evaluation pipeline, we disentangle prompt\-induced variance from model configuration and evaluation artifacts\. \(2\) Systematic Analysis of Inference\-Time Prompting Strategies\. We compare five concrete prompt templates corresponding to zero\-shot, few\-shot, chain\-of\-thought, adaptive chain\-of\-thought, and self\-consistency under a unified setup, showing how prompt\-template choice affects vulnerability classification across models\. \(3\) Coverage\-Aware Analysis of Operational Reliability\. We show that prompting strategies affect not only standard F1, but also abstention, coverage, and effective F1, revealing failure modes such as recall collapse and coverage collapse that are obscured by accuracy\-only evaluation\. \(4\) PromptAudit: A Reproducible Evaluation Framework\. We introduce PromptAudit, a reusable evaluation framework for auditing prompt\-induced variance in LLM\-based vulnerability detection\. PromptAudit enables reproducible measurement studies and principled comparison across prompting strategies without introducing new detection models or benchmarks\.

Scope of Claims\. This study makes comparative measurement claims rather than claims about absolute vulnerability\-detection capability\. We evaluate how fixed prompt\-template instances change model behavior under one CVEfixes\-derived label source, one decoding configuration, and one default parser policy\. The labels provide a shared experimental substrate, but do not establish that every isolated snippet is independently exploitable or safe\. Similarly, each prompting strategy is represented by one concrete template instance; results should therefore be interpreted as template\-level evidence rather than universal claims about all possible zero\-shot, few\-shot, CoT, adaptive CoT, or self\-consistency prompts\.

## 2\.Related Work

This section situates our work within prior research on learning\-based vulnerability detection, inference\-time prompting strategies, and the robustness of LLM evaluation\. Rather than exhaustively surveying all related systems, we focus on work most relevant to understanding how prompting, dataset construction, and evaluation design interact to shape reported performance\. This perspective clarifies the limitations of existing evaluations and motivates our controlled analysis of prompt\-induced variance in LLM\-based detection\.

### 2\.1\.Vulnerability Detection and Prompting

Prior work on automated vulnerability detection spans static and dynamic analysis tools \(Chess and West, [2007](https://arxiv.org/html/2605.24171#bib.bib11); Viega et al\., [2000](https://arxiv.org/html/2605.24171#bib.bib44); Serebryany et al\., [2012](https://arxiv.org/html/2605.24171#bib.bib40)\), learning\-based approaches operating over token sequences and program graphs \(Li et al\., [2018](https://arxiv.org/html/2605.24171#bib.bib29); Russell et al\., [2018](https://arxiv.org/html/2605.24171#bib.bib38); Zhou et al\., [2019](https://arxiv.org/html/2605.24171#bib.bib51); Chakraborty et al\., [2021](https://arxiv.org/html/2605.24171#bib.bib7)\), and pre\-trained code models such as CodeBERT and GraphCodeBERT \(Feng, [2020](https://arxiv.org/html/2605.24171#bib.bib15); Guo et al\., [2020](https://arxiv.org/html/2605.24171#bib.bib17)\)\. Security\-oriented variants, including VulBERTa and LineVul, demonstrate that large\-scale pretraining yields representations transferable to vulnerability detection \(Hanif and Maffeis, [2022](https://arxiv.org/html/2605.24171#bib.bib19); Fu and Tantithamthavorn, [2022](https://arxiv.org/html/2605.24171#bib.bib16)\)\.

Building on these advances, instruction\-tuned LLMs extend this line of work by enabling vulnerability classification through natural language prompts without task\-specific fine\-tuning\. Empirical studies on GPT\-3\.5, GPT\-4 \(Pearce et al\., [2023](https://arxiv.org/html/2605.24171#bib.bib35); Siddiq et al\., [2022](https://arxiv.org/html/2605.24171#bib.bib41)\), and code models such as Code Llama, StarCoder, and DeepSeek\-Coder \(Roziere et al\., [2023](https://arxiv.org/html/2605.24171#bib.bib37); Li et al\., [2023](https://arxiv.org/html/2605.24171#bib.bib27); Guo et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib18)\) show that LLMs can recognize vulnerability patterns, but with performance highly dependent on vulnerability type, code complexity, and prompting strategy\. Recent comparative evaluations further demonstrate substantial variation across models, languages, and prompt settings, with some configurations performing near chance \(Lin and Mohaisen, [2025](https://arxiv.org/html/2605.24171#bib.bib31); Jiang et al\., [2025](https://arxiv.org/html/2605.24171#bib.bib23)\)\.

A recurring limitation of LLM\-based detection is high sensitivity to prompt formulation and inference parameters \(White et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib49); Liu et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib32); Wei et al\., [2022](https://arxiv.org/html/2605.24171#bib.bib48); Sclar et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib39)\)\. Unlike traditional analyzers with fixed decision procedures \(Chess and West, [2007](https://arxiv.org/html/2605.24171#bib.bib11); Johnson et al\., [2013](https://arxiv.org/html/2605.24171#bib.bib24)\), LLMs perform probabilistic inference over multi\-step reasoning paths, making predictions sensitive to instruction framing, example selection, and output constraints\. While techniques such as self\-consistency can reduce variance across generations \(Wang et al\., [2023](https://arxiv.org/html/2605.24171#bib.bib47)\), prompt\-induced instability remains a major obstacle to reliable evaluation\.

Despite this growing interest, existing studies largely focus on proprietary models, limiting systematic comparison of open\-weight alternatives \(Fan et al\., [2023](https://arxiv.org/html/2605.24171#bib.bib13)\)\. Moreover, while prompt engineering is known to affect LLM behavior \(Wei et al\., [2022](https://arxiv.org/html/2605.24171#bib.bib48); Kojima et al\., [2022](https://arxiv.org/html/2605.24171#bib.bib26)\), most evaluations consider a narrow set of prompt variants or conflate prompt effects with model, dataset, or decoding choices\. As a result, reported improvements often obscure whether gains reflect improved reasoning or prompt\-induced shifts on specific benchmarks, such as CVEfixes and Big\-Vul \(Bhandari et al\., [2021](https://arxiv.org/html/2605.24171#bib.bib6); Fan et al\., [2020](https://arxiv.org/html/2605.24171#bib.bib14)\)\.

### 2\.2\.Prompt Sensitivity and Evaluation Design

To better understand these limitations, recent work directly quantifies prompt sensitivity\. Sclar et al\. \(Sclar et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib39)\) show that prompt *format* alone can induce accuracy swings of up to 76%, and that neither scaling nor instruction tuning reliably mitigates this variance\. Chatterjee et al\. \(Chatterjee et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib8)\) introduce POSIX, a sensitivity index based on log\-likelihood changes under intent\-preserving substitutions, and show that sensitivity is structured rather than random: template\-level changes dominate for multiple\-choice tasks, while paraphrasing has stronger effects on open\-ended generation\. Beyond quantification, other work focuses on mitigation strategies, including diversifying prompt formats \(Ngweta et al\., [2025](https://arxiv.org/html/2605.24171#bib.bib33)\) and adopting semantics\-aware evaluation methods to reduce artifacts introduced by rigid scoring and parsing rules \(Hua et al\., [\[n\. d\.\]](https://arxiv.org/html/2605.24171#bib.bib22)\)\. These studies suggest that prompt sensitivity is partly entangled with the evaluation harness itself rather than reflecting model behavior alone\.

In the specific context of vulnerability detection, another line of work shows that prompt sensitivity is further compounded by dataset construction\. Common benchmarks derive labels from vulnerability\-fixing commits linked to CVE records \(Bhandari et al\., [2021](https://arxiv.org/html/2605.24171#bib.bib6); Fan et al\., [2020](https://arxiv.org/html/2605.24171#bib.bib14); Chen et al\., [2023](https://arxiv.org/html/2605.24171#bib.bib10)\)\. This construction supports large\-scale studies, but it can mislabel isolated snippets when exploitability depends on calling context, configuration, or interprocedural data flow \(Chakraborty et al\., [2021](https://arxiv.org/html/2605.24171#bib.bib7); Arp et al\., [2022](https://arxiv.org/html/2605.24171#bib.bib4); Steenhoek et al\., [2023](https://arxiv.org/html/2605.24171#bib.bib42)\)\. Recent LLM evaluations show that these label and context limitations interact with prompt choice, language, and evaluation protocol, with performance often degrading under stricter task definitions \(Lin and Mohaisen, [2025](https://arxiv.org/html/2605.24171#bib.bib31); Ullah et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib43); Jiang et al\., [2025](https://arxiv.org/html/2605.24171#bib.bib23)\)\.

Table 1\. Comparison of prior work along evaluation design dimensions\. C1–C6 denote: use of LLMs, prompt variation, dataset control, decoding control, evaluation control, and prompt isolation\. Prior work typically varies prompts alongside other factors, obscuring attribution, whereas our design isolates prompt effects under controlled conditions\.

Overall, the prior work shows that prompt sensitivity, dataset ambiguity, and evaluation design jointly shape reported LLM vulnerability detection performance\. Existing studies typically vary these factors together or treat prompts as fixed inputs, as summarized in Table [1](https://arxiv.org/html/2605.24171#S2.T1) and compared with our work, making it difficult to attribute observed performance differences to any single source\. This work addresses that gap by isolating prompt structure as a variable while fixing the dataset, inference parameters, and evaluation pipeline, enabling a controlled analysis of how prompting alone influences detection outcomes\.

## 3\.Methodology

We describe the PromptAudit framework, including the evaluation datasets, prompting strategies, and pipeline architecture, along with the system’s operational mechanics and the rationale behind each design decision\. Figure [1](https://arxiv.org/html/2605.24171#S3.F1) summarizes the evaluation pipeline, while Figure [2](https://arxiv.org/html/2605.24171#S3.F2) summarizes the prompt\-design spectrum\.

\[Figure 1 diagram: Loader \(samples and true labels\) → Prompt \(one of five strategies; the varying component\) → Execution \(LLM generates output\) → Parser \(extract verdict\) → Metrics Engine \(accuracy, recall, abstention\) → Output\. Dataset⋆, Decoding Configuration⋆, and Parser Configuration⋆ are fixed\. Cartesian product of experimental factors: D\(1\) × M\(5\) × P\(5\) × O\(3\) × R\(2\)\.\]

Figure 1\. PromptAudit evaluation pipeline\. The main evaluation fixes dataset, decoding, output protocol, and parser mode while varying the prompt template and is evaluated against dataset \(D\), models \(M\), prompts \(P\), outputs \(O\), and reranking \(R\)\. ⋆ indicates a fixed component\.

### 3\.1\.System Overview

PromptAudit is a modular evaluation framework that keeps dataset loading, prompt formatting, model execution, output parsing, and metric computation separate\. This structure keeps the experimental substrate fixed while the prompt\-template instance changes, reducing the chance that observed variation is caused by execution or reporting differences rather than prompt structure\. At a high level, the runner loads a labeled code sample, formats it with the selected prompt template and output protocol, queries an open\-weight model through the Ollama backend, parses the response into SAFE, VULNERABLE, or UNKNOWN, and computes accuracy, precision, recall, abstention, coverage, F1, and effective F1\. Parser\-mode and output\-protocol ablations use the same pipeline to test whether extraction or verdict\-placement choices materially affect the measured prompt sensitivity\.
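The metric stage at the end of this pipeline can be sketched as follows\. This is a minimal illustration, not the PromptAudit implementation: the function name `compute_metrics` is hypothetical, conditional metrics are computed over answered samples only, coverage is taken to be one minus the abstention rate, and effective F1 is assumed here to be F1 scaled by coverage, since this section does not state its formula\.

```python
SAFE, VULNERABLE, UNKNOWN = "SAFE", "VULNERABLE", "UNKNOWN"


def compute_metrics(y_true, y_pred):
    """Coverage-aware metrics for a binary detector that may abstain (UNKNOWN)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    total = len(y_true)
    # Abstentions are excluded from the confusion matrix but lower coverage.
    answered = [(t, p) for t, p in zip(y_true, y_pred) if p != UNKNOWN]

    tp = sum(1 for t, p in answered if t == VULNERABLE and p == VULNERABLE)
    fp = sum(1 for t, p in answered if t == SAFE and p == VULNERABLE)
    fn = sum(1 for t, p in answered if t == VULNERABLE and p == SAFE)
    tn = sum(1 for t, p in answered if t == SAFE and p == SAFE)

    def ratio(num, den):
        return num / den if den else 0.0

    coverage = ratio(len(answered), total)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)

    return {
        "accuracy": ratio(tp + tn, len(answered)),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "coverage": coverage,
        "abstention": 1.0 - coverage if total else 0.0,
        # ASSUMPTION: effective F1 = F1 * coverage (not defined in this section).
        "effective_f1": f1 * coverage,
    }
```

Under this assumed definition, a strategy that abstains often is penalized even when its answered predictions are accurate, which is the kind of coverage collapse that accuracy\-only reporting would hide\.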

### 3\.2\.Dataset Construction

To evaluate vulnerability detection under conditions aligned with real\-world practice, we derive our dataset from CVEfixes \(Bhandari et al\., [2021](https://arxiv.org/html/2605.24171#bib.bib6)\), a public repository linking CVE records to vulnerability\-fixing commits in open\-source projects\. CVEfixes contains 9,386 CVE records with 75,950 code samples across multiple languages\. We randomly sampled 1,000 CVEs from the full corpus, with the sample proportionally reflecting the year\-of\-disclosure distribution present in CVEfixes across 1999–2024\. This provides natural temporal coverage without overrepresenting any single disclosure period, toolchain, or coding convention\. This design prioritizes breadth and reproducibility rather than completeness\. Each sampled CVE maps to multiple code samples because CVEfixes links each vulnerability to all modified files in the fixing commit\. After filtering duplicates and non\-code artifacts, the 1,000 CVEs yielded 6,074 code samples for evaluation\. The 1,000\-CVE subset size was validated empirically: we evaluated subsets at 50, 100, 500, and 1,000 CVE levels and found that performance trends across prompting strategies stabilized at 1,000 CVEs, representing the smallest tractable configuration for controlled, repeatable experimentation\. The curated subset will be released publicly alongside the artifacts\.
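One way to draw a sample whose year\-of\-disclosure distribution mirrors the full corpus is proportional stratified sampling\. The sketch below is an illustration under stated assumptions: the paper does not say whether explicit per\-year quotas were used, and the function name `sample_cves_by_year`, the input mapping, and the largest\-remainder rounding are all hypothetical choices\.

```python
import random
from collections import defaultdict


def sample_cves_by_year(cve_years, k, seed=0):
    """Sample k CVE ids with per-year quotas proportional to the corpus.

    cve_years maps a CVE id to its disclosure year (hypothetical input format).
    """
    total = len(cve_years)
    if k > total:
        raise ValueError("cannot sample more CVEs than the corpus contains")

    by_year = defaultdict(list)
    for cve_id, year in cve_years.items():
        by_year[year].append(cve_id)

    # Largest-remainder allocation: floor each exact quota, then hand the
    # leftover slots to the years with the largest fractional parts.
    quotas, remainders = {}, []
    for year, ids in by_year.items():
        exact = k * len(ids) / total
        quotas[year] = int(exact)
        remainders.append((exact - int(exact), year))
    leftover = k - sum(quotas.values())
    for _, year in sorted(remainders, reverse=True)[:leftover]:
        quotas[year] += 1

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    sample = []
    for year in sorted(by_year):
        sample.extend(rng.sample(sorted(by_year[year]), quotas[year]))
    return sample
```

Fixing the seed and sorting before sampling makes the subset reproducible across runs, which matters when the same samples must be reused for every model and prompt template\.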

The resulting 6,074 samples span 16 programming languages, with no single language exceeding 26\.54% of the corpus, reflecting the natural distribution of CVEfixes rather than deliberate language\-level balancing\. The samples are balanced by construction: each CVE contributes both a pre\-fix \(VULNERABLE\) and a post\-fix \(SAFE\) version from the same commit, yielding a near\-equal class distribution after deduplication and filtering\. All samples are drawn from standardized CVE documentation and linked open\-source repositories, preserving alignment with public security disclosures while supporting controlled evaluation\. As with other commit\- and disclosure\-derived datasets, the presence of a CVE does not imply that every associated code snippet independently exhibits exploitable behavior in isolation\.

Each dataset entry includes the CVE identifier, disclosure year, programming language, and vulnerability metadata extracted from the corresponding CVE record\. Entries further include one or more associated code snippets reflecting the reported vulnerability and its surrounding context, as available in the referenced repository\. This structure supports consistent evaluation across samples while making explicit the limitations inherent in snippet\-level representations of context\-dependent vulnerabilities\.

Data Cleaning and Limitations\. Samples from CVEfixes were scrubbed before integration into the PromptAudit pipeline to address data quality issues\. We removed: \(1\) non\-code artifacts \(e\.g\., configuration files, documentation, README\), \(2\) files failing programming language validation, \(3\) exact and near\-duplicates \(\>98% similarity\), and \(4\) snippets shorter than 10 lines \(all manually verified\)\. The runtime loader then retains only before/after file pairs whose original filename can be recovered from the CVE folder metadata and whose extension maps to a supported programming language; files with unknown language are excluded\.
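The length and near\-duplicate filters can be sketched as below\. The paper does not name its similarity measure, so this example assumes the character\-level ratio from Python's standard `difflib.SequenceMatcher`; the function names and the pairwise comparison strategy are illustrative only\.

```python
from difflib import SequenceMatcher


def is_near_duplicate(a, b, threshold=0.98):
    """True when two snippets exceed the similarity threshold.

    ASSUMPTION: similarity is difflib's ratio; the paper's metric is unspecified.
    """
    return SequenceMatcher(None, a, b, autojunk=False).ratio() > threshold


def clean_snippets(snippets, min_lines=10, threshold=0.98):
    """Drop snippets shorter than min_lines and exact or near-duplicates."""
    kept = []
    for code in snippets:
        if len(code.splitlines()) < min_lines:
            continue
        if any(is_near_duplicate(code, other, threshold) for other in kept):
            continue
        kept.append(code)
    return kept
```

The pairwise comparison is quadratic in the number of snippets, so a corpus\-scale run would likely need hashing or blocking first; the sketch favors clarity over speed\.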

We note that CVEfixes exhibits limitations inherent to commit\-based datasets\. These issues appear as: \(1\) Labeling ambiguity, where commits may include non\-security changes alongside the fix, mitigated by retaining only files directly modified in the fixing commit and excluding non\-code artifacts; \(2\) Context dependency, where some snippets require broader program context to assess exploitability, acknowledged as an inherent constraint of snippet\-level evaluation; and \(3\) Incomplete fixes, where CVEs map to multi\-stage patches, addressed by deduplication and near\-duplicate filtering at the \>98% similarity threshold\. These label\-quality constraints apply uniformly across all evaluated prompting strategies and models, so any residual noise is unlikely to fully explain the differential prompt effects observed in Section [4](https://arxiv.org/html/2605.24171#S4), which are the primary claims of this work\.

Note that we do not interpret CVEfixes labels as proof that each isolated snippet is independently exploitable\. The labels provide a fixed experimental substrate for comparing prompt\-induced variation under the same data source\. Our claims therefore concern relative changes across prompt templates under a shared label construction procedure, not absolute vulnerability\-detection accuracy\.

Dataset Preprocessing\. Prior to prompting, all samples are normalized into a unified representation to ensure consistency across experimental conditions\. Each sample is structured as a CVE\-linked record containing a unique identifier, inferred programming language, raw code snippet, and a binary label \(SAFE or VULNERABLE\)\. Retained samples were also standardized in terms of whitespace and indentation\. This normalization is intended to standardize model inputs rather than to encode complete vulnerability semantics\.

Vulnerable and fixed code samples are derived from standardized before/after file structures associated with vulnerability\-fixing commits\. Code appearing prior to the fix is labeled VULNERABLE, while the patched version is labeled SAFE\. Programming language is inferred from file extensions, and non\-code artifacts \(e\.g\., license or documentation\) are excluded to reduce ambiguity during analysis\.
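The record construction described above can be illustrated with a short sketch\. The `Sample` fields follow the record description in this section, but the class name, the `build_pair` helper, the identifier format, and the extension table \(a small illustrative subset, not the 16 supported languages\) are assumptions\.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Illustrative subset only; the framework's real extension map is not given here.
EXTENSION_TO_LANGUAGE = {".c": "C", ".py": "Python", ".java": "Java", ".php": "PHP"}


@dataclass(frozen=True)
class Sample:
    sample_id: str
    language: str
    code: str
    label: str  # "SAFE" or "VULNERABLE"


def build_pair(cve_id, filename, before_code, after_code):
    """Turn one before/after file pair into two labeled samples.

    Returns an empty list when the extension does not map to a known
    language, mirroring the exclusion of unknown-language files.
    """
    language = EXTENSION_TO_LANGUAGE.get(PurePosixPath(filename).suffix.lower())
    if language is None:
        return []
    return [
        Sample(f"{cve_id}:{filename}:before", language, before_code, "VULNERABLE"),
        Sample(f"{cve_id}:{filename}:after", language, after_code, "SAFE"),
    ]
```

Because every retained pair contributes exactly one sample per class, this construction yields the near\-equal class distribution noted earlier\.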

We adopt a binary SAFE/VULNERABLE labeling scheme as a controlled abstraction for studying prompt\-induced variance under fixed conditions\. This labeling scheme is not intended to capture the full semantics of real\-world vulnerabilities, which may depend on execution context, configuration, or interprocedural behavior beyond what is represented in isolated code snippets\. Moreover, when a prompt does not produce a SAFE/VULNERABLE label, the output is treated as an abstention and recorded as UNKNOWN for evaluation purposes\.

### 3\.3\.Prompting Strategies

PromptAudit evaluates five prompt templates corresponding to common prompting strategies drawn from practice and recent prompt\-engineering research \(Chen et al\., [2025](https://arxiv.org/html/2605.24171#bib.bib9)\)\. The templates vary in imposed structure, encouragement of reasoning, and model freedom in answer generation\. Since prompting is the variable under study, each strategy is implemented as a configurable template via the prompt loader\. Representative excerpts are included in Appendix [A](https://arxiv.org/html/2605.24171#A1)\. The strategies we used span a spectrum from minimal guidance to highly structured reasoning, enabling PromptAudit to isolate how instruction level influences model behavior\. Figure [2](https://arxiv.org/html/2605.24171#S3.F2) summarizes this progression from minimal prompting to increasingly structured and aggregated reasoning\.

\[Figure 2 diagram: five panels ordered from low structure \(minimal guidance\) to high structure \(strong guidance and aggregation\)\. ❶ ZS: instruction only \("Is the following code vulnerable? SAFE or VULNERABLE"\)\. ❷ FS: instruction plus labeled examples \(Code 1 → SAFE, Code 2 → VULNERABLE, …\)\. ❸ CoT: instruction plus a step\-by\-step cue \(understand, identify, analyze, conclude, then answer SAFE/VULNERABLE\)\. ❹ A\-CoT: instruction plus an adaptive cue \(assess complexity; continue reasoning if more is needed, otherwise answer\)\. ❺ S\-C: instruction plus multiple reasoning paths aggregated by majority vote\.\]

Figure 2\. Prompt strategy design spectrum\. The five strategies, ZS, FS, CoT, A\-CoT, and S\-C, form a continuum from minimal guidance to structured, aggregated reasoning\.

- Zero\-Shot \(ZS\)\. The model receives only the task instruction and direct classification guidance\. No examples or step\-by\-step reasoning are provided\. This serves as the baseline condition, with exact verdict placement controlled separately by the selected output protocol\.
- Few\-Shot \(FS\)\. A small set of labeled example classifications is provided before the target snippet\. In the current implementation, these are synthetic CVE\-style before/after function examples\. Few\-shot prompting is widely used because it can reduce misclassification in low\-context tasks without explicitly requesting step\-by\-step reasoning\.
- Chain\-of\-Thought \(CoT\)\. The model is prompted to reason step by step about factors such as input validation, memory safety, race conditions, and injection or trust\-boundary risks\. The selected output protocol determines whether the decisive label appears on the first line or on the final line after the explanation\.
- Adaptive Chain\-of\-Thought \(A\-CoT\)\. The model is instructed to reason only when needed\. Trivially safe or vulnerable code receives minimal explanation, while complex cases \(e\.g\., pointer arithmetic, raw memory manipulation, manual resource management, or complex input handling\) elicit more detailed reasoning\. As with CoT, the selected output protocol determines whether the final verdict is emitted first or last\.
- Self\-Consistency \(S\-C\)\. The model produces multiple independent predictions using the A\-CoT template under the active output protocol\. Each vote is parsed using the selected parser mode and aggregated by majority voting\. Thus, the S\-C results evaluate majority voting over A\-CoT\-style generations, not self\-consistency over all possible base prompts\. When no label receives a true majority across the requested samples, the final output is treated as an abstention \(UNKNOWN\)\.
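The self\-consistency aggregation rule can be sketched in a few lines\. The function name `majority_vote` is hypothetical, and the sketch reads "a true majority across the requested samples" as strictly more than half of all votes, with UNKNOWN votes counted in the denominator; that reading is an assumption\.

```python
from collections import Counter


def majority_vote(votes):
    """Aggregate parsed votes; abstain unless one label has a strict majority.

    ASSUMPTION: UNKNOWN votes count toward the total, so a label must win
    more than half of all requested samples, not just of the answered ones.
    """
    if not votes:
        return "UNKNOWN"
    counts = Counter(votes)
    for label in ("SAFE", "VULNERABLE"):
        if counts[label] > len(votes) / 2:
            return label
    return "UNKNOWN"
```

Under this rule an even split, or a plurality diluted by unparsed votes, becomes an abstention, which is one plausible route to the excessive abstention the abstract attributes to self\-consistency\.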

Output\-Protocol and Parser Ablations\. Beyond prompt family, PromptAudit exposes two controlled ablation axes\. The output protocol toggles whether the model must emit the verdict on the first line \(verdict\_first\) or on the final line after any explanation \(verdict\_last\)\. The parser mode toggles whether label extraction is restricted to the exact protocol location \(strict\), broadened to explicit verdict phrases \(structured\), or further extended with lexical fallback \(full\)\. These ablations allow us to measure whether reported prompt effects depend on verdict placement or extraction heuristics rather than on prompt family alone\. Unless otherwise stated, the prompt\-strategy results reported in Section [4](https://arxiv.org/html/2605.24171#S4) correspond to the default verdict\_first \+ full configuration\.

Prompt Input Construction\. Prior to prompting, no semantic rewriting, truncation, or vulnerability\-specific annotation is applied to the code\. Snippets are passed verbatim to preserve real\-world complexity and prevent cue leakage that could confound the experiment\. Each prompt is constructed by inserting a single code sample into a configurable prompt template corresponding to the evaluated prompting strategy\. The template specifies the instruction framing, output constraints, and optional reasoning structure \(when the prompt strategy involves reasoning\), while the underlying code content remains unchanged across conditions\. This process allows observed differences to be attributed to the prompting strategy rather than data transformation artifacts\.

### 3\.4\. Evaluation Pipeline Architecture

The pipeline structure helps keep the evaluations consistent and repeatable across all models and prompting strategies\. The pipeline begins with the dataset loader reading the dataset and preparing each snippet with its ground\-truth label\. The prompt loader constructs the strategy\-specific prompt, after which the runner appends the active output\-protocol suffix that governs verdict placement\. The model loader initializes each open\-weight architecture under identical decoding settings\. By abstracting model invocation behind a common generate\(\) method, PromptAudit ensures model differences do not arise from backend inconsistencies\.
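A minimal sketch of the prompt-construction step might look like the following. The suffix wording, the `{code}` placeholder, and the function name `build_prompt` are all assumptions made for illustration; the section only states that the snippet is inserted verbatim and that a protocol suffix is appended.

```python
# Hypothetical suffix wording; the paper's exact instructions are not quoted here.
PROTOCOL_SUFFIXES = {
    "verdict_first": (
        "Put exactly one word, SAFE or VULNERABLE, on the first line. "
        "Any explanation must follow."
    ),
    "verdict_last": (
        "Give any explanation first. "
        "The final line must contain only SAFE or VULNERABLE."
    ),
}


def build_prompt(template, code, protocol):
    """Insert the snippet verbatim, then append the output-protocol suffix."""
    if protocol not in PROTOCOL_SUFFIXES:
        raise ValueError(f"unknown output protocol: {protocol!r}")
    # str.replace (not str.format) so braces inside the code are left untouched.
    body = template.replace("{code}", code)
    return body + "\n\n" + PROTOCOL_SUFFIXES[protocol]
```

Keeping the suffix separate from the strategy template is what lets verdict placement vary independently of the prompt family.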

The experiment runner cycles through every combination of dataset, model, prompting strategy, output protocol, parser mode, and code sample, corresponding to the Cartesian product $\mathcal{D}\times\mathcal{M}\times\mathcal{P}\times\mathcal{O}\times\mathcal{R}$\. Per configuration, it builds the strategy prompt, appends the selected output\-protocol instruction, sends the resulting prompt to the model, and records the model’s raw output\. Each response is forwarded to the label parser, which applies the selected parser mode \(Section [3\.7](https://arxiv.org/html/2605.24171#S3.SS7)\)\. Afterward, the output is assigned one of three labels: SAFE, VULNERABLE, or UNKNOWN, with the last category used when the response is ambiguous, inconsistent, or cannot be mapped to a valid verdict under the chosen parser mode\.
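The enumeration described above can be sketched with `itertools.product`. The function name `enumerate_runs` and the dictionary layout are hypothetical; the sketch assumes each dataset is a list of samples, so that code samples are iterated inside each configuration.

```python
from itertools import product


def enumerate_runs(datasets, models, strategies, protocols, parser_modes):
    """Yield one run description per configuration and code sample.

    `datasets` maps a dataset name to its list of samples (an assumed layout).
    """
    for dataset_name, model, strategy, protocol, parser_mode in product(
        datasets, models, strategies, protocols, parser_modes
    ):
        for sample in datasets[dataset_name]:
            yield {
                "dataset": dataset_name,
                "model": model,
                "strategy": strategy,
                "protocol": protocol,
                "parser_mode": parser_mode,
                "sample": sample,
            }
```

In practice a runner would likely cache the raw model output per (model, strategy, protocol, sample) and re-parse it under each parser mode, since the parser mode does not change what the model generates; that optimization is not claimed by the section.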

Parsed predictions are passed to the metrics engine, which computes accuracy, precision, recall, abstention rate, coverage, standard F1, and effective F1 for each model\-prompt combination\. The reporting module then aggregates results into CSV summaries and produces an interactive HTML dashboard that visualizes trends, model comparisons, and the effects of prompting\.

### 3\.5\. Model Settings and Backend Configuration

PromptAudit relies on five open\-weight models served through the local Ollama backend: Mistral \(mistral:latest\), Gemma 7B \(gemma:7b\), CodeLlama 7B Instruct \(codellama:7b\-instruct\), Falcon 7B Instruct \(falcon:7b\-instruct\), and DeepSeek\-Coder 6\.7B Instruct \(deepseek\-coder:6\.7b\-instruct\)\. These models represent a mix of general\-purpose and code\-oriented architectures commonly used for local analysis tasks; model size is constrained to roughly 7B parameters or smaller due to hardware limitations\. To avoid introducing variation unrelated to prompting, all models run under identical decoding parameters: temperature 0\.2, top\-p 0\.9, top\-k 40, a 250\-token limit, and fixed sampling penalties\. These settings reduce sampling randomness and help prevent the formatting drift that often occurs at higher temperatures\. Appendix [C](https://arxiv.org/html/2605.24171#A3) lists the full configuration\. By holding decoding constant, PromptAudit isolates prompt structure as the primary source of output variation\.
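As a rough illustration, the shared decoding settings could be expressed as a single options dictionary reused for every model. The sketch assumes the request body shape of Ollama's `/api/generate` endpoint and assumes the 250-token limit maps to the `num_predict` option; the penalty values are omitted because this section does not state them. The helper name `build_generate_request` is hypothetical, and nothing is sent over the network.

```python
MODEL_TAGS = [
    "mistral:latest",
    "gemma:7b",
    "codellama:7b-instruct",
    "falcon:7b-instruct",
    "deepseek-coder:6.7b-instruct",
]

# Values from the section; the option names follow Ollama's API (assumed mapping).
DECODING_OPTIONS = {
    "temperature": 0.2,
    "top_p": 0.9,
    "top_k": 40,
    "num_predict": 250,  # assumed to correspond to the 250-token limit
}


def build_generate_request(model, prompt):
    """Build a non-streaming request body with the shared decoding options."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": dict(DECODING_OPTIONS),  # copy so callers cannot mutate the shared dict
    }
```

Building every request from one dictionary makes it harder for a per-model override to slip in unnoticed.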

### 3\.6\. Output\-Protocol Ablations

Because LLMs often embed labels within explanations or defer decisions, PromptAudit defines two output\-protocol variants\. Under verdict\_first, the model must place exactly one word, SAFE or VULNERABLE, on the first non\-empty line, with any explanation following\. Under verdict\_last, the explanation precedes the verdict, and the final non\-empty line must contain only SAFE or VULNERABLE\. This separation isolates formatting effects from prompt design: prompting controls reasoning, while the protocol controls label placement\. Unless otherwise stated, results use the verdict\_first configuration\. For tractability, the protocol ablation is evaluated on a stratified 1,232\-sample subset of the 6,074\-sample dataset, preserving language and prompt distributions\. As shown in Section [4\.6](https://arxiv.org/html/2605.24171#S4.SS6), verdict\_first yields stronger operational performance and is therefore used as the default\.
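The section does not describe how the stratified subset was drawn. One common approach, shown here purely as an assumption, is proportional allocation per stratum with largest-remainder rounding. The function name `stratified_subset`, the `key` callable, and the seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_subset(samples, key, target_size, seed=0):
    """Draw `target_size` samples while preserving the share of each stratum.

    `key` maps a sample to its stratum (for example, its programming language).
    """
    if target_size > len(samples):
        raise ValueError("target_size exceeds the number of samples")
    groups = defaultdict(list)
    for sample in samples:
        groups[key(sample)].append(sample)

    # Proportional quota per stratum, rounded down first.
    quotas = {g: len(items) * target_size / len(samples) for g, items in groups.items()}
    alloc = {g: int(q) for g, q in quotas.items()}
    # Hand the leftover slots to the strata with the largest fractional remainders.
    leftover = target_size - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda g: quotas[g] - alloc[g], reverse=True)
    for g in by_remainder[:leftover]:
        alloc[g] += 1

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    subset = []
    for g, items in groups.items():
        subset.extend(rng.sample(items, alloc[g]))
    return subset
```

To preserve more than one distribution at once, `key` can return a tuple such as `(language, label)`, at the cost of smaller strata.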

### 3\.7\. Label Parsing Logic

Even with controlled output protocols, models occasionally place the label in the wrong location, phrase it in a longer sentence, or embed it in freer reasoning text\. PromptAudit therefore exposes three parser modes built on the same layered parser\.

Strict parser\. The parser checks only the protocol\-defined verdict location: the first non\-empty line for verdict\_first or the last non\-empty line for verdict\_last\. The target line must contain a single SAFE/VULNERABLE token after minor normalization of punctuation\.

Structured parser\. If strict parsing fails, the parser scans non\-empty lines from bottom to top for explicit verdict markers such as "Final answer: SAFE", "The final answer is SAFE", "In conclusion, the code is VULNERABLE", or "The code is SAFE"\.

Full parser\. If neither of the above succeeds, the parser performs a whole\-response lexical fallback over safety and vulnerability cues, including unsafe, vulnerable, vulnerabilities, exploitable, at risk, safe, secure, not vulnerable, and no vulnerabilities\. Negations are handled explicitly, and mixed or conflicting evidence yields an UNKNOWN classification\.

The reported study configuration corresponds most closely to verdict\_first with the full parser\. In the updated framework, strict and structured modes are exposed as explicit ablations so parser sensitivity can be measured rather than hidden\. Full excerpts are shown in Appendix [B](https://arxiv.org/html/2605.24171#A2)\.
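The three layers can be sketched as one function that stops at the layer permitted by the selected mode. This is an illustrative reconstruction, not the framework's parser: the regular expressions, case-insensitive matching, the punctuation set, and the extra handling of "not safe" and "not secure" are assumptions.

```python
import re

LABELS = {"SAFE", "VULNERABLE"}
EDGE_PUNCT = " \t*_`.:;!\"'"  # assumed "minor normalization" of punctuation

VERDICT_PHRASE = re.compile(
    r"\b(?:final answer(?:\s+is)?|the code is)\s*:?\s*(safe|vulnerable)\b",
    re.IGNORECASE,
)
NEGATED_VULN = re.compile(r"\b(?:not vulnerable|no vulnerabilities)\b", re.IGNORECASE)
NEGATED_SAFE = re.compile(r"\bnot (?:safe|secure)\b", re.IGNORECASE)  # assumption
VULN_CUES = re.compile(
    r"\b(?:unsafe|vulnerable|vulnerabilities|exploitable|at risk)\b", re.IGNORECASE
)
SAFE_CUES = re.compile(r"\b(?:safe|secure)\b", re.IGNORECASE)


def lexical_fallback(text):
    """Whole-response cue search; conflicting or missing evidence gives UNKNOWN."""
    safe = bool(NEGATED_VULN.search(text))
    vuln = bool(NEGATED_SAFE.search(text))
    # Remove negated phrases so their inner words are not counted a second time.
    rest = NEGATED_SAFE.sub(" ", NEGATED_VULN.sub(" ", text))
    vuln = vuln or bool(VULN_CUES.search(rest))
    safe = safe or bool(SAFE_CUES.search(rest))
    if safe == vuln:  # both present (conflict) or both absent (no cue)
        return "UNKNOWN"
    return "VULNERABLE" if vuln else "SAFE"


def parse_label(response, protocol="verdict_first", mode="full"):
    """Layered parser: strict location check, verdict phrases, lexical fallback."""
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    if not lines:
        return "UNKNOWN"
    # Layer 1 (strict): only the protocol-defined line is inspected.
    target = lines[0] if protocol == "verdict_first" else lines[-1]
    token = target.strip(EDGE_PUNCT).upper()
    if token in LABELS:
        return token
    if mode == "strict":
        return "UNKNOWN"
    # Layer 2 (structured): explicit verdict phrases, scanned bottom to top.
    for line in reversed(lines):
        match = VERDICT_PHRASE.search(line)
        if match:
            return match.group(1).upper()
    if mode == "structured":
        return "UNKNOWN"
    # Layer 3 (full): lexical cues over the whole response.
    return lexical_fallback(response)
```

Because each mode is a prefix of the next, any response resolved under strict resolves identically under structured and full; the modes differ only in how many otherwise-UNKNOWN responses they recover.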

The main study fixes parser mode across all model–prompt comparisons\. This controls parser behavior when comparing prompting strategies, but it does not make the absolute scores independent of extraction choices: the full parser still includes lexical fallback\. We therefore treat parsing as part of the evaluation setup and expose strict, structured, and full modes as ablation controls\. The reported prompt\-strategy comparisons should be read under that fixed parsing configuration\.

![Refer to caption](https://arxiv.org/html/2605.24171v1/x1.png)

Figure 3\. Effective F1 across model–prompt combinations\. Performance improves with moderate structure \(FS, CoT\) but degrades under higher\-structure prompts \(A\-CoT, S\-C\), indicating a non\-monotonic trade\-off\.

### 3\.8\. Evaluation Metrics

We evaluate each model–prompt configuration using accuracy, precision, recall, abstention rate, coverage, standard F1, and effective F1\. Because vulnerability detection is an asymmetric classification task, accuracy and F1 alone are insufficient: a model may perform well on answered samples while abstaining on a substantial portion of inputs, reducing its practical utility\.

Each prediction is mapped to one of six outcome categories: true positive \(TP\), true negative \(TN\), false positive \(FP\), false negative \(FN\), incorrect, and unknown false negative \(UnFN\)\. TP and TN denote correct VULNERABLE and SAFE predictions, respectively\. FP denotes non\-vulnerable code incorrectly classified as VULNERABLE, and FN denotes vulnerable code incorrectly classified as SAFE\. UnFN captures vulnerable samples for which the model fails to produce a definitive VULNERABLE verdict, including abstentions and unresolved outputs\. Incorrect captures unresolved or invalid outputs on non\-vulnerable samples, including protocol violations, refusals, or responses that cannot be mapped to either SAFE or VULNERABLE\.

Accuracy is defined as $\frac{TP+TN}{TP+TN+FP+FN+UnFN}$, precision as $\frac{TP}{TP+FP}$, and recall as $\frac{TP}{TP+FN+UnFN}$\. Incorrect outputs are excluded from the accuracy denominator because they reflect protocol non\-compliance rather than a committed SAFE/VULNERABLE classification; their impact is instead captured through abstention, coverage, and effective F1\. Recall is defined conservatively, as abstentions on vulnerable samples correspond to missed detections\. Standard F1 is the harmonic mean of precision and recall\. We define abstention rate as $\frac{Incorrect+UnFN}{TP+TN+FP+FN+UnFN+Incorrect}$ and coverage as one minus abstention rate\. Our primary operational metric is effective F1, defined as the product of standard F1 and coverage, which penalizes configurations that achieve strong conditional performance only by abstaining on many inputs\.
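The definitions above translate directly into code. The sketch below follows the stated formulas; the function name `compute_metrics`, the zero-denominator convention (returning 0.0), and the example counts are assumptions chosen for illustration.

```python
def _ratio(numerator, denominator):
    """Safe division; an empty denominator is reported as 0.0 (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def compute_metrics(tp, tn, fp, fn, unfn, incorrect):
    """Compute the section's metrics from the six outcome counts."""
    committed_or_missed = tp + tn + fp + fn + unfn  # Incorrect is excluded here
    total = committed_or_missed + incorrect

    accuracy = _ratio(tp + tn, committed_or_missed)
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn + unfn)  # abstentions on vulnerable code count as misses
    f1 = _ratio(2 * precision * recall, precision + recall)
    abstention = _ratio(incorrect + unfn, total)
    coverage = 1.0 - abstention
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "abstention": abstention,
        "coverage": coverage,
        "effective_f1": f1 * coverage,
    }
```

With the made-up counts `tp=30, tn=40, fp=10, fn=10, unfn=10, incorrect=20`, precision is 0.75 and recall is 0.60, giving a standard F1 of about 0.667; abstention is 30/120 = 0.25, so effective F1 is about 0.667 × 0.75 = 0.50.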

Effective F1 is not intended as a complete deployment cost model\. It scales F1 by coverage to penalize unresolved outputs, while recall separately treats abstentions on vulnerable samples as missed detections via UnFN\. In practice, abstained samples may be routed to fallback analyzers or human review, whereas false negatives may pass silently\. Different deployments may therefore weight false positives, false negatives, and abstentions differently\. We use effective F1 as a consistent reporting measure across configurations\.

## 4\. Evaluation and Results

### 4\.1\. Aggregate Performance

Figure [3](https://arxiv.org/html/2605.24171#S3.F3) summarizes effective\-F1 differences across all model and template combinations; the full standard\-F1 and effective\-F1 table is provided in Appendix [E](https://arxiv.org/html/2605.24171#A5) \(Table [4](https://arxiv.org/html/2605.24171#A4.T4)\)\. The absolute scores should not be interpreted as evidence that these models are deployment\-ready vulnerability detectors\. Many configurations operate near chance precision, and even the strongest configurations remain limited by snippet\-level labels and missing program context\. The central result is not high detection accuracy, but the magnitude and direction of behavioral changes induced by prompt\-template choice under fixed evaluation conditions\.

Three structural patterns emerge\. First, standard and effective F1 diverge substantially only when abstention is elevated: Falcon’s mean standard F1 \(0\.430\) matches Mistral’s, but its mean effective F1 \(0\.265\) is 0\.119 points lower than Mistral’s, driven entirely by a mean abstention rate of 42\.26% that persists across all five strategies\. Second, Mistral serves as the stability baseline: it achieves the smallest cross\-strategy F1 range \(0\.103\), indicating low sensitivity to prompt formulation\. Third, Gemma serves as the prompt\-sensitivity baseline: it exhibits the largest cross\-strategy F1 range \(0\.398\), with standard F1 spanning from 0\.102 under A\-CoT to 0\.499 under self\-consistency, showing that prompt choice can fundamentally determine whether a model detects vulnerabilities at all\. We report deterministic point estimates under a fixed seed and decoding configuration\. Because this study focuses on controlled comparisons rather than repeated stochastic sampling, we emphasize large and directionally consistent differences across prompt templates\. Estimating run\-to\-run, template\-to\-template, and backend\-level uncertainty is left for future work\.

### 4\.2\. Effect of Prompting Strategy

![Refer to caption](https://arxiv.org/html/2605.24171v1/x2.png)

Figure 4\. F1, coverage, and effective F1 by prompt template\. High coverage is consistent across strategies, but variation in F1 drives overall effectiveness, which peaks for FS and CoT and declines for A\-CoT and S\-C\.

In the following, we treat the evaluated prompt\-template instance as the primary independent variable and examine its impact on recall, abstention, and effective F1 across models\. Figure [4](https://arxiv.org/html/2605.24171#S4.F4) complements the tabular results by decomposing mean performance into standard F1, coverage, and effective F1 across the evaluated prompt\-template instances\. Results are drawn from Figure [4](https://arxiv.org/html/2605.24171#S4.F4), Table [3](https://arxiv.org/html/2605.24171#S4.T3), and Table [4](https://arxiv.org/html/2605.24171#A4.T4); per\-cell accuracy, precision, and recall appear in Appendix [D](https://arxiv.org/html/2605.24171#A4) \(Table [8](https://arxiv.org/html/2605.24171#A5.T8)\), and abstention rates are in Table [2](https://arxiv.org/html/2605.24171#S4.T2)\. The output\-protocol ablation in Section [4\.6](https://arxiv.org/html/2605.24171#S4.SS6) shows that these strategy\-level differences are not explained solely by verdict placement\.

Table 2\. Abstention rate \(%\) by model and prompt template\. Means and ranges \(max–min\) are reported across templates for each model and across models for each template\.

Figure [4](https://arxiv.org/html/2605.24171#S4.F4) reveals that variation in effective performance is driven primarily by changes in standard F1 rather than coverage\. Across ZS, FS, CoT, and A\-CoT, coverage remains high and relatively stable, while F1 varies substantially, causing effective F1 to track F1 closely in most cases\. The main exception arises under configurations that induce higher abstention, where coverage drops and effective F1 diverges from standard F1\. This decomposition also shows a non\-monotonic effect of prompt structure: moderate structure \(few\-shot and CoT\) yields the highest F1 and effective F1, whereas more complex strategies \(A\-CoT and our A\-CoT\-based self\-consistency configuration\) degrade performance, either through recall reduction or increased abstention\. These results indicate that increasing prompt complexity does not consistently improve detection quality, but instead shifts the balance between recall and coverage\.

Quantitative Analysis\. The evaluated CoT template achieves the highest mean F1 \(0\.465\) and mean effective F1 \(0\.427\), with a low mean abstention rate of 7\.28%\. Its cross\-model recall range \(0\.458\) is substantially lower than that of zero\-shot \(0\.805\) and few\-shot \(0\.641\), indicating that it combines strong average performance with moderate cross\-model stability under our setup\.

The evaluated few\-shot template yields the second\-highest mean F1 \(0\.430\) and effective F1 \(0\.371\), but exhibits high cross\-model recall variability \(0\.641\), suggesting uneven benefit across architectures\. The evaluated zero\-shot template further reduces structure, producing a mean F1 of 0\.401 with the second\-lowest abstention rate \(7\.82%\) but the highest cross\-model recall variability \(0\.805\), reflecting strong dependence on model\-specific behavior\.

For more complex prompting, our A\-CoT\-based self\-consistency configuration achieves a mean standard F1 of 0\.420 but the lowest mean effective F1 \(0\.197\), a gap of 0\.223 points driven by a high mean abstention rate of 55\.23%\. The A\-CoT template similarly underperforms, producing the lowest mean standard F1 \(0\.287\) despite a moderate abstention rate \(10\.69%\), indicating that its degradation is driven primarily by recall collapse rather than coverage loss\. Recall under A\-CoT falls below the corresponding CoT value for every model, with the largest drops in CodeLlama \(0\.685 → 0\.279\) and Falcon \(0\.541 → 0\.223\)\. Implications are discussed in Section [5\.1](https://arxiv.org/html/2605.24171#S5.SS1)\.

AQ1: Prompt\-template choice substantially affects performance, coverage, and abstention\. CoT provides the best overall balance, while A\-CoT reduces recall and our A\-CoT\-based self\-consistency configuration sharply reduces coverage through abstention\. Overall, prompt effects are non\-monotonic: moderate structure improves performance, but additional complexity degrades reliability via recall suppression or abstention\.

### 4\.3\. Precision–Recall Tradeoff

![Refer to caption](https://arxiv.org/html/2605.24171v1/x3.png)

Figure 5\. Coverage vs\. F1 across model–prompt combinations\. High coverage does not guarantee high F1; CoT provides stronger trade\-offs, while S\-C is unstable and some settings collapse to high\-coverage, low\-F1 regimes\.

Figure [5](https://arxiv.org/html/2605.24171#S4.F5) places all 25 model–prompt combinations in coverage–F1 space\. Selected examples illustrating the extremes of recall and abstention behavior are in Appendix [E](https://arxiv.org/html/2605.24171#A5) \(Table [5](https://arxiv.org/html/2605.24171#A5.T5)\), with additional results in Appendix [E\.1](https://arxiv.org/html/2605.24171#A5.SS1) and full results in Appendix [D](https://arxiv.org/html/2605.24171#A4) \(Table [8](https://arxiv.org/html/2605.24171#A5.T8)\)\.

A consistent pattern across all models is that recall varies substantially while precision remains confined to a narrow range\. In CodeLlama, precision holds at 0\.492 across zero\-shot and A\-CoT while recall swings from 0\.865 to 0\.279, which is a difference of 0\.586 points with accuracy remaining nearly flat \(0\.472 vs\. 0\.484\)\. In Mistral, the same dynamic appears under CoT and self\-consistency: precision stays within 0\.003 points while recall drops from 0\.423 to 0\.375 and effective F1 falls from 0\.465 to 0\.212 due to abstention\.

Table 3\. Recall performance by model and prompting strategy, with precision ranges\. The bottom row shows variability \(max–min\) across models, highlighting sensitivity to prompt\.

Gemma exhibits the strongest form of this pattern\. Recall under zero\-shot and A\-CoT falls below 0\.063, rising to 0\.420 under CoT and 0\.486 under self\-consistency, while precision remains bounded within a 0\.158\-point range \(Table [3](https://arxiv.org/html/2605.24171#S4.T3)\)\. The same mechanism holds across all models: recall varies substantially across prompting strategies, whereas precision changes only marginally\. This decoupling indicates that prompt\-template choice primarily affects the model’s willingness to assign the VULNERABLE label, rather than its ability to distinguish between correct and incorrect predictions once a decision is made\. In effect, prompting shifts the operating point of the detector by altering recall, not by improving precision\. The implications for interpreting reported performance gains are discussed in Section [5\.1](https://arxiv.org/html/2605.24171#S5.SS1)\.

This pattern is further reflected in the variability row of Table [3](https://arxiv.org/html/2605.24171#S4.T3)\. Prompting strategies such as zero\-shot and few\-shot exhibit large cross\-model recall ranges \(0\.805 and 0\.641, respectively\), indicating that they amplify differences between model architectures\. In contrast, A\-CoT and S\-C show reduced variability \(0\.223 and 0\.270\), but this compression arises from uniformly lower recall rather than improved consistency\. Thus, reduced variability does not imply more reliable detection, but often reflects a conservative operating regime where vulnerabilities are systematically missed\.
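The variability statistic is simply max minus min recall across models, and it is worth reading next to the mean. The sketch below uses only the CoT and A-CoT recall values quoted in this section for four models; DeepSeek is omitted because its per-template recall is not quoted here, so the outputs are illustrative rather than a reproduction of Table 3. The function name `recall_summary` is hypothetical.

```python
def recall_summary(recall_by_model):
    """Return (range, mean) of recall across models, rounded to three decimals."""
    values = list(recall_by_model.values())
    spread = round(max(values) - min(values), 3)
    mean = round(sum(values) / len(values), 3)
    return spread, mean


# Recall values quoted in this section (four of the five models).
COT_RECALL = {"CodeLlama": 0.685, "Falcon": 0.541, "Gemma": 0.420, "Mistral": 0.423}
A_COT_RECALL = {"CodeLlama": 0.279, "Falcon": 0.223, "Gemma": 0.057, "Mistral": 0.280}

cot_spread, cot_mean = recall_summary(COT_RECALL)
acot_spread, acot_mean = recall_summary(A_COT_RECALL)
```

On these four models the A-CoT spread comes out at 0.223, which matches the reported A-CoT range and suggests, without confirming, that DeepSeek's A-CoT recall lies inside that interval. The A-CoT mean is less than half the CoT mean over the same models, consistent with the compression arising from uniformly lower recall.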

Abstention compounds this effect under our A\-CoT\-based self\-consistency configuration\. Falcon maintains precision within 0\.009 points across the evaluated CoT template and this self\-consistency configuration, but effective F1 collapses from 0\.380 to 0\.073 as abstention rises from 27\.49% to 75\.96%\. This reflects a shift toward a more conservative operating regime in which both recall and coverage are suppressed, reducing operational utility despite stable precision\.

Beyond individual configurations, Figure [5](https://arxiv.org/html/2605.24171#S4.F5) reveals three distinct operational regimes\. First, several configurations cluster in a high\-coverage but low\-F1 region, indicating that the model produces predictions consistently but fails to identify vulnerabilities reliably; this corresponds to *recall\-limited behavior*, where predictions are made but are often incorrect\. Second, the strongest configurations lie in the upper\-right region, achieving both high coverage and high F1, reflecting a *balanced operating point* where the model both commits to predictions and maintains reasonable detection performance\. Third, configurations based on our A\-CoT\-based self\-consistency setup shift sharply leftward, indicating *coverage collapse due to abstention*, even when conditional precision remains stable\.

This structure highlights that the trade\-off between coverage and effectiveness is non\-linear\. Improvements in F1 do not arise from uniformly better predictions, but from shifts in the model’s operating point along the precision–recall spectrum, coupled with varying abstention behavior\. In particular, movement along the horizontal axis reflects changes in abstention, while vertical variation reflects changes in recall\-driven effectiveness\. As a result, two configurations with similar precision can occupy very different regions of the plot and exhibit substantially different operational utility\. Importantly, no configuration dominates the space, reinforcing that prompt design does not uniformly improve performance but repositions the detector within a constrained trade\-off surface\.

AQ2: Prompting primarily shifts the precision–recall tradeoff by altering recall rather than improving accuracy\. Precision remains stable across models, while recall varies substantially, so F1 gains reflect shifts in the operating point rather than better reasoning\. Accuracy and precision alone are therefore insufficient, as they miss recall\- and coverage\-driven effects\.

### 4\.4\. Failure Modes

1. Recall Collapse\. Over\-constrained prompts label real vulnerabilities as SAFE\. Impact: vulnerable code unreviewed\. Example: strcpy\(buf, user\_input\); \.\.\. Outcome: false negative\.
2. Abstention Explosion\. Complex prompts increase uncertainty and return no verdict\. Impact: code remains unreviewed\. Example: if \(x<len\) buf\[x\]=val; \.\.\. Outcome: ABSTAIN\.
3. Reasoning Drift\. Long reasoning wanders and contradicts the final verdict\. Impact: lowers reliability and trust\. Example: free\(ptr\); ptr\[0\]=0; \.\.\. Outcome: wrong label\.
4. Format\-Driven Abstention\. Parser\-unreadable outputs appear even under simple prompts\. Impact: lowers coverage, compounds failures\. Example: strcpy\(buf, user\_input\); \.\.\. Outcome: ABSTAIN\.

Figure 6\. Common failure modes introduced by prompt\-template and output\-interface choices\.

Across model–template combinations, four recurring failure modes emerge\. Figure [6](https://arxiv.org/html/2605.24171#S4.F6) summarizes these patterns and their operational effect\. Full confusion statistics appear in Appendix [D](https://arxiv.org/html/2605.24171#A4) \(Table [7](https://arxiv.org/html/2605.24171#A5.T7)\)\.

Mode 1: Recall Collapse under A\-CoT\. The A\-CoT template reduces recall relative to CoT for every model, with the largest drops in CodeLlama \(0\.685 → 0\.279\), Falcon \(0\.541 → 0\.223\), Gemma \(0\.420 → 0\.057\), and Mistral \(0\.423 → 0\.280\)\. Accuracy changes are comparatively small in several cases, so this failure is most visible through recall and effective F1 rather than accuracy alone\.

Mode 2: Explanation Reliability\. Despite producing longer and more structured responses, reasoning\-based strategies do not consistently improve recall or effective F1 relative to zero\-shot \(Table [4](https://arxiv.org/html/2605.24171#A4.T4)\)\. Rationale length is therefore not evidence of improved security judgment\. Systematic rationale–verdict consistency analysis requires raw model outputs and is left for future work\.

Mode 3: Coverage Collapse under Self\-Consistency\. Under our A\-CoT\-based self\-consistency, coverage drops across all models when sampled outputs disagree, falling to 51\.00% for CodeLlama, 48\.93% for Mistral, 60\.85% for Gemma, 39\.04% for DeepSeek, and 24\.04% for Falcon \(Table [2](https://arxiv.org/html/2605.24171#S4.T2)\)\. Unlike recall collapse, abstention is explicit, but the operational effect is similar: large portions of code receive no verdict\. This reflects the behavior of this majority\-vote policy rather than a general property of all self\-consistency variants\.

Mode 4: Format\-Driven Abstention\. Falcon exhibits a persistently high mean abstention rate \(42\.26%\) across all prompting strategies \(Table [2](https://arxiv.org/html/2605.24171#S4.T2)\), including zero\-shot \(31\.66%\)\. This indicates that failure arises not from prompt complexity but from inconsistent instruction\-following that produces outputs the parser cannot resolve, compounding other failure modes by reducing usable predictions independently of strategy choice\.

AQ3: Four failure modes emerge\. A\-CoT induces recall collapse, causing vulnerable code to be missed despite stable accuracy\. Reasoning\-based strategies produce longer rationales without consistently improving recall or effective F1\. Our A\-CoT\-based self\-consistency configuration causes coverage collapse, leaving large portions of code unreviewed due to abstention\. Falcon exhibits persistent format\-driven abstention across prompts, reducing usable predictions\. Together, these show that operational reliability depends on both recall and coverage and cannot be inferred from accuracy alone\.

### 4\.5\. Model\-Level Differences

![Refer to caption](https://arxiv.org/html/2605.24171v1/x4.png)

Figure 7\. Effective F1 by model and prompt\. FS and CoT typically outperform others, while A\-CoT and S\-C degrade performance; results remain strongly model\-dependent\.

This section treats model architecture as the primary variable and analyzes how different models respond to prompt variation in terms of detection capability, abstention behavior, and cross\-prompt stability\. Figure [7](https://arxiv.org/html/2605.24171#S4.F7) shows how effective F1 varies across models under each prompting strategy\. Results are drawn from Figure [7](https://arxiv.org/html/2605.24171#S4.F7) and Table [4](https://arxiv.org/html/2605.24171#A4.T4); per\-cell abstention rates appear in Table [2](https://arxiv.org/html/2605.24171#S4.T2)\.

Models separate into three behavioral groups\. CodeLlama and Mistral form the high\-performance tier, achieving mean F1 scores of 0\.524 and 0\.430, respectively, but differing in stability: CodeLlama exhibits a wider cross\-strategy range \(0\.271\), whereas Mistral shows the narrowest \(0\.103\)\. Falcon occupies a distinct position, matching Mistral on mean standard F1 \(0\.430\) but dropping to an effective F1 of 0\.265 due to a persistently high abstention rate \(42\.26%\)\. DeepSeek and Gemma form a prompt\-dependent tier, with performance varying substantially across strategies; Gemma in particular exhibits the largest cross\-strategy F1 range \(0\.398\)\.

Across all models, two distinct failure patterns emerge: recall\-limited models maintain coverage but miss vulnerabilities, while coverage\-limited models achieve reasonable conditional accuracy but abstain excessively\. The interaction between these patterns and prompt design is discussed in Section [5\.3](https://arxiv.org/html/2605.24171#S5.SS3)\.

Figure [7](https://arxiv.org/html/2605.24171#S4.F7) also reveals a consistent cross\-model pattern in how prompts affect performance\. Across most models, few\-shot and CoT prompting yield the strongest effective F1, while A\-CoT and our A\-CoT\-based self\-consistency consistently degrade performance, either through recall reduction or increased abstention\. This indicates that the effect of prompt structure is *directionally consistent* even though its magnitude remains model\-dependent\. In addition, cross\-strategy variability reflects differences in how models respond to prompt formulation\. Models such as Gemma and DeepSeek exhibit large performance ranges, indicating that their detection capability is highly sensitive to prompt design and can be substantially activated or suppressed depending on the template\. In contrast, Mistral shows limited variation across strategies, suggesting more stable behavior under prompt changes\. Falcon represents a distinct case where performance is constrained primarily by persistent abstention rather than prompt sensitivity, leading to consistently lower effective F1 despite comparable standard F1\. These patterns show that prompt sensitivity is not uniform across models, but instead reflects underlying differences in how models balance recall, confidence, and abstention under varying instruction structures\.

AQ4: Model–template interactions are structured and model\-dependent, and no prompt template is optimal across all models\. CoT and few\-shot prompting yield the strongest performance across most models, while A\-CoT and self\-consistency consistently degrade performance through recall reduction or abstention\. Models separate into distinct regimes: CodeLlama and Mistral achieve the strongest performance, with Mistral the most stable across strategies; DeepSeek and Gemma are highly prompt\-sensitive; and Falcon is constrained by persistent abstention\. Prompt\-template effects are therefore not transferable and must be validated per model family\.

### 4\.6\. Output\-Protocol Ablation

Appendix [E](https://arxiv.org/html/2605.24171#A5) \(Table [6](https://arxiv.org/html/2605.24171#A5.T6)\) reports F1, effective F1, and abstention for all 25 model–template combinations under both output protocols on the 1,232\-sample ablation subset\. The protocol design is described in Section [3\.6](https://arxiv.org/html/2605.24171#S3.SS6)\. *Overall, verdict\-first outperforms verdict\-last\.* In aggregate, verdict\-first yields mean F1 0\.403, mean effective F1 0\.347, and mean abstention 9\.2%, while verdict\-last yields mean F1 0\.344, mean effective F1 0\.248, and mean abstention 32\.6%\. Across all models except Falcon, abstention increases in every configuration, with the largest shifts in Mistral \(e\.g\., CoT: 0\.2% → 50\.0%\) and CodeLlama \(e\.g\., A\-CoT: 6\.3% → 30\.0%\)\.

*There are, however, a small number of deviations\.* CodeLlama Few\-Shot and DeepSeek Zero\-Shot show higher effective F1 under verdict\-last \(0\.464 vs\. 0\.288 and 0\.346 vs\. 0\.080, respectively\)\. In both cases, the gain is driven by recall recovery rather than reduced abstention, indicating a shift in operating point rather than improved reasoning\. On the other hand, *Falcon represents a distinct regime* and shows comparatively small differences across protocols \(mean effective F1: 0\.281 vs\. 0\.224\), consistent with its persistent format\-driven abstention \(Section [4\.4](https://arxiv.org/html/2605.24171#S4.SS4)\)\. Here, performance is constrained by instruction\-following rather than protocol choice\. *Overall, these results support using verdict\_first as the default protocol* because verdict\_last increases unresolved outputs without improving conditional performance, reducing operational utility\. While this does not fully resolve the relationship between reasoning and final decisions, it does not support the view that verdict\-first CoT results are merely post\-hoc rationalizations\.

## 5\. Discussion

Prior work has shown that prompt wording and format can produce large performance swings in LLMs, and that vulnerability\-detection performance varies substantially across models, datasets, and prompting setups \(Sclar et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib39); Chatterjee et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib8); Lin and Mohaisen, [2025](https://arxiv.org/html/2605.24171#bib.bib31); Jiang et al\., [2025](https://arxiv.org/html/2605.24171#bib.bib23); Hua et al\., [\[n\. d\.\]](https://arxiv.org/html/2605.24171#bib.bib22)\)\. Our results sharpen this picture in a more controlled and operational setting\. Rather than treating prompt sensitivity as a secondary observation within broader benchmarks, we examine concrete prompt\-template instances under a fixed dataset, decoding configuration, and evaluation pipeline\. This setup allows us to isolate shifts in recall, abstention, and effective coverage from changes in conditional performance, and to show that prompt choice alters operational behavior even when the underlying model and task remain unchanged\. More broadly, this framing aligns with recent work arguing that robustness and reliability should be assessed using standardized, multi\-metric evaluations rather than a single aggregate score \(Liang et al\., [2023](https://arxiv.org/html/2605.24171#bib.bib30)\)\.

### 5\.1\. Prompting as a Hidden Decision Threshold

The results show that the evaluated prompt\-template choices systematically alter model behavior in a way that resembles adjusting a decision threshold rather than improving underlying vulnerability understanding, consistent with prior work showing that semantically minor prompt changes can materially shift model outputs and measured performance \(Sclar et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib39); Chatterjee et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib8); Hua et al\., [\[n\. d\.\]](https://arxiv.org/html/2605.24171#bib.bib22)\)\. Across all models, precision remains relatively stable while recall varies substantially, indicating that prompting shifts the model’s effective decision threshold for assigning the VULNERABLE label\.

This distinction is critical for interpreting performance gains\. Prompt design can change the operating point of the detector without necessarily improving the quality of the underlying security reasoning \(Sclar et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib39); Chatterjee et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib8); Liang et al\., [2023](https://arxiv.org/html/2605.24171#bib.bib30)\)\. Improvements in F1 therefore often reflect a more aggressive labeling policy rather than better reasoning about vulnerabilities\. As a result, two prompt\-template instances with similar accuracy may operate at very different points on the precision–recall curve, leading to substantially different security outcomes under the same evaluation policy\.
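As a rough illustration of this threshold effect, the sketch below holds precision fixed and varies only recall, then reports the resulting F1. The precision and recall values are illustrative placeholders, not results from the paper, and `f1_score` is a helper name introduced here for the example.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative operating points only (not values reported in the paper):
# precision is held fixed while the prompt is assumed to move recall.
FIXED_PRECISION = 0.50
operating_points = {"conservative prompt": 0.10, "aggressive prompt": 0.70}

for name, recall in operating_points.items():
    f1 = f1_score(FIXED_PRECISION, recall)
    print(f"{name}: precision={FIXED_PRECISION:.2f} recall={recall:.2f} F1={f1:.3f}")
```

Under these assumed values, F1 rises sharply even though precision is unchanged, which is the pattern one would expect from a shifted labeling threshold rather than from better discrimination.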

The behavior of the A\-CoT template is particularly instructive: although designed to encourage selective reasoning, it reduces recall relative to the CoT template for every model, with the largest drops in CodeLlama \(0\.685 → 0\.279\), Falcon \(0\.541 → 0\.223\), Gemma \(0\.420 → 0\.057\), and Mistral \(0\.423 → 0\.280\)\. This suggests that the instruction to reason selectively is interpreted as a signal to default to SAFE unless strong evidence is present\. One possible explanation is that A\-CoT introduces an additional latent decision \(whether reasoning is needed at all\) which, when resolved conservatively, may bypass deeper analysis\. Gemma exhibits a strong form of this pattern: under zero\-shot and A\-CoT, recall falls below 0\.063, rising to 0\.420 under CoT, reflecting a shift in operating point rather than improved semantic understanding\. The output\-protocol ablation in Section [4\.6](https://arxiv.org/html/2605.24171#S4.SS6) supports this interpretation, indicating that these effects are not explained solely by verdict placement: strategy\-level differences in recall and abstention persist across both protocols\.

The evaluated CoT template achieves the highest mean F1 \(0\.465\) and mean effective F1 \(0\.427\) across models while maintaining a low abstention rate \(7\.28%\), suggesting that this structured\-reasoning template stabilizes outputs without substantially increasing abstention\. This aligns with prior work using CoT and related methods to structure intermediate reasoning rather than change the task itself \(Wei et al\., [2022](https://arxiv.org/html/2605.24171#bib.bib48); Wang et al\., [2023](https://arxiv.org/html/2605.24171#bib.bib47); Chen et al\., [2025](https://arxiv.org/html/2605.24171#bib.bib9)\)\. At the same time, the results do not establish that longer reasoning traces cause better vulnerability understanding: reasoning\-oriented prompts sometimes improve final metrics and sometimes do not\. Structured reasoning should therefore be treated as a controllable inductive bias rather than a reliability guarantee, particularly in security settings where prior work reports brittle reasoning and missed vulnerabilities despite fluent explanations \(Ullah et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib43); Basic and Giaretta, [2024](https://arxiv.org/html/2605.24171#bib.bib5); Lin and Mohaisen, [2025](https://arxiv.org/html/2605.24171#bib.bib31)\)\. Systematic rationale–verdict consistency analysis requires raw\-output inspection and remains future work\.

The failure of A\-CoT is instructive rather than anomalous\. More broadly, adaptive prompting has been proposed as a way to vary reasoning effort with task complexity, but the literature does not establish that such adaptation is reliable in code\-security settings, especially when the model must decide whether deeper analysis is warranted \(Wan et al\., [2023a](https://arxiv.org/html/2605.24171#bib.bib45), [b](https://arxiv.org/html/2605.24171#bib.bib46); R, [2024](https://arxiv.org/html/2605.24171#bib.bib36); Nong et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib34)\)\. The evaluated A\-CoT template was designed to mimic human review behavior by encouraging deeper reasoning on complex code, yet it underperformed the evaluated standard CoT template on both mean F1 \(0\.287 vs\. 0\.465\) and mean effective F1 \(0\.254 vs\. 0\.427\)\. Brittleness here arises not from prompting itself, but from delegating the depth\-of\-analysis decision to the model without explicit supervision\. More robust adaptation would require explicit, verifiable triggers based on syntactic or semantic code properties, rather than relying on the model to regulate reasoning depth through natural language instruction alone\.
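To make the idea of an explicit trigger concrete, the following is a minimal sketch in which the choice of template is made by a deterministic check on the code rather than by the model. Everything here is hypothetical and not part of PromptAudit: the list of flagged calls, the line-count cutoff, and the template names `"cot"` and `"zero_shot"` are assumptions chosen only for illustration.

```python
import re

# Hypothetical trigger list; a real system would more plausibly take these
# signals from a static analyzer than from a hand-written pattern.
RISKY_CALLS = ("strcpy", "strcat", "sprintf", "gets", "system", "eval", "exec")
_RISKY_PATTERN = re.compile(r"\b(?:" + "|".join(RISKY_CALLS) + r")\s*\(")


def needs_deep_reasoning(code: str, max_simple_lines: int = 20) -> bool:
    """Return True if a syntactic check says the snippet deserves full reasoning."""
    long_snippet = len(code.splitlines()) > max_simple_lines
    return long_snippet or bool(_RISKY_PATTERN.search(code))


def choose_template(code: str) -> str:
    """Pick a prompt template name from verifiable code properties only."""
    return "cot" if needs_deep_reasoning(code) else "zero_shot"


print(choose_template("char buf[8]; strcpy(buf, user_input);"))
print(choose_template("int add(int a, int b) { return a + b; }"))
```

The point of the design is that the trigger can be inspected and tested independently of the model; whether such a trigger actually recovers the recall lost under A\-CoT is not something this sketch or the source establishes.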

### 5\.2\. Abstention as a Security Risk

Building on the threshold\-shift behavior discussed above, abstention introduces a distinct and often underappreciated failure mode in vulnerability detection\. More generally, evaluation work on language models has argued that system utility cannot be judged by accuracy alone when robustness, reliability, and actionability matter \(Liang et al\., [2023](https://arxiv.org/html/2605.24171#bib.bib30)\)\. In code\-security workflows, an abstained sample corresponds to code that receives no actionable verdict unless a downstream fallback is explicitly defined \(Chess and West, [2007](https://arxiv.org/html/2605.24171#bib.bib11); Johnson et al\., [2013](https://arxiv.org/html/2605.24171#bib.bib24)\)\.

Across models, abstention rates are substantial and highly dependent on prompt\-template choice\. In particular, our A\-CoT\-based self\-consistency configuration produces widespread abstention due to disagreement across sampled outputs, reducing effective coverage even when conditional accuracy remains reasonable\. This creates a silent failure mode in which the system appears accurate on evaluated samples while leaving large portions of input unprocessed\. These findings indicate that abstention is not a secondary metric but a core part of system behavior\. As discussed above, accuracy alone does not capture operational reliability: evaluations that omit coverage or effective F1 risk overstating model utility and obscuring risk\. Falcon illustrates this concretely: despite maintaining precision within 0\.009 points across the evaluated CoT template and our A\-CoT\-based self\-consistency configuration, its effective F1 collapses from 0\.380 to 0\.073 as abstention rises from 27\.49% to 75\.96%, reflecting a conservative decision process that trades coverage for confidence\. Our A\-CoT\-based self\-consistency configuration amplifies this across all models, with coverage dropping to 24\.04% for Falcon and 39\.04% for DeepSeek, meaning that in the worst case fewer than one in four code samples receives any verdict\.
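The gap between conditional and operational performance can be sketched as follows. The function reports coverage, abstention, and F1 computed only on answered samples. The final field, `coverage_weighted_f1`, is an assumed stand-in that scales conditional F1 by coverage; the paper's formal definition of effective F1 is given in its methodology and may differ. The function name, the use of `None` for abstention, and the toy labels are all introduced here for illustration.

```python
from typing import Optional, Sequence


def coverage_report(y_true: Sequence[str], y_pred: Sequence[Optional[str]]) -> dict:
    """Summarize coverage and F1 when some predictions are abstentions (None)."""
    answered = [(t, p) for t, p in zip(y_true, y_pred) if p is not None]
    coverage = len(answered) / len(y_true) if y_true else 0.0

    tp = sum(1 for t, p in answered if t == "VULNERABLE" and p == "VULNERABLE")
    fp = sum(1 for t, p in answered if t == "SAFE" and p == "VULNERABLE")
    fn = sum(1 for t, p in answered if t == "VULNERABLE" and p == "SAFE")

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    return {
        "coverage": coverage,
        "abstention": 1.0 - coverage,
        "conditional_f1": f1,
        # Assumption: one simple way to penalize abstention, not the paper's formula.
        "coverage_weighted_f1": f1 * coverage,
    }


# Toy example: every answered sample is correct, but half receive no verdict.
truth = ["VULNERABLE", "VULNERABLE", "SAFE", "SAFE"]
preds = ["VULNERABLE", None, "SAFE", None]
print(coverage_report(truth, preds))
```

In this toy case the conditional F1 is perfect while half of the inputs go unprocessed, which mirrors the silent failure mode described above.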

In our implementation, self\-consistency converts inter\-sample disagreement into abstention when sampled outputs fail to produce a clear majority label\. When models are uncertain, outputs frequently split across SAFE and VULNERABLE without converging, causing majority voting to return no verdict rather than a potentially unreliable one\. These observations support treating vulnerability detection as an operational reliability problem rather than a closed\-world classification task, where a complete assessment requires explicit policies defining when abstention is acceptable and how abstained cases should be handled, such as escalation to static analysis tools or human review \(Liang et al\., [2023](https://arxiv.org/html/2605.24171#bib.bib30); Chess and West, [2007](https://arxiv.org/html/2605.24171#bib.bib11); Johnson et al\., [2013](https://arxiv.org/html/2605.24171#bib.bib24); Li et al\., [2025](https://arxiv.org/html/2605.24171#bib.bib28); Yang et al\., [2025](https://arxiv.org/html/2605.24171#bib.bib50)\)\.
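The voting behavior described above can be sketched as a majority vote that abstains when no label wins clearly. This is an illustrative reconstruction, not the PromptAudit implementation: the function name, the `min_share` threshold of 0.6, and the choice to count unparseable samples (`None`) against the majority are all assumptions.

```python
from collections import Counter
from typing import Optional, Sequence


def vote_or_abstain(labels: Sequence[Optional[str]], min_share: float = 0.6) -> Optional[str]:
    """Return the majority label, or None (abstain) if no label is clearly ahead.

    `labels` holds one parsed verdict per sampled output; None marks a sample
    whose verdict could not be parsed. `min_share` is a hypothetical threshold.
    """
    votes = Counter(label for label in labels if label is not None)
    if not votes:
        return None
    top_label, top_count = votes.most_common(1)[0]
    # Share is taken over all samples, so unparseable outputs weaken the majority.
    return top_label if top_count / len(labels) >= min_share else None


print(vote_or_abstain(["SAFE", "SAFE", "SAFE", "VULNERABLE", "SAFE"]))
print(vote_or_abstain(["SAFE", "VULNERABLE", "SAFE", "VULNERABLE", None]))
```

With a stricter threshold, more split votes become abstentions, so the threshold itself acts as a coverage control and should be reported with the results.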

### 5\.3\. Model and Strategy Interaction

Building on the abstention and threshold effects discussed above, prompt\-template choice and model architecture interact in a non\-uniform manner\. No single evaluated template is optimal across all models, and the effectiveness of a given template depends strongly on the model’s underlying capabilities, consistent with prior work showing that prompt sensitivity is structured, model\-dependent, and not eliminated by scale or instruction tuning \(Sclar et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib39); Chatterjee et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib8); Lin and Mohaisen, [2025](https://arxiv.org/html/2605.24171#bib.bib31); Jiang et al\., [2025](https://arxiv.org/html/2605.24171#bib.bib23)\)\.

CodeLlama achieves the strongest mean F1 \(0\.524\) even under zero\-shot prompting, consistent with the idea that code\-specialized pretraining provides stronger priors for program understanding and vulnerability\-pattern recognition than general\-purpose instruction tuning alone \(Roziere et al\., [2023](https://arxiv.org/html/2605.24171#bib.bib37); Fan et al\., [2023](https://arxiv.org/html/2605.24171#bib.bib13); Hou et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib21)\)\. Mistral exhibits the most consistent behavior across strategies \(F1 range: 0\.103\), making it less sensitive to prompt formulation and more suitable for deployment scenarios requiring predictable coverage\. In contrast, DeepSeek and Gemma show strong prompt dependence: DeepSeek’s F1 increases from 0\.308 under zero\-shot to 0\.507 under few\-shot, while Gemma’s F1 spans 0\.398 across strategies, indicating that its capability is present but inconsistently activated depending on task framing\.

These differences indicate that prompt sensitivity is systematic under our evaluation setup rather than random evaluation noise\. Reporting a single model–prompt result therefore provides an incomplete picture: relative rankings shift with strategy choice, and a configuration that appears optimal under one prompt may underperform under another\. Aggregate comparisons that fix a single strategy obscure model\-specific behavior and risk overinterpreting prompt\-specific results as general capability differences, especially when prompt format, scoring, and extraction choices can also influence apparent rankings \(Liang et al\., [2023](https://arxiv.org/html/2605.24171#bib.bib30); Hua et al\., [\[n\. d\.\]](https://arxiv.org/html/2605.24171#bib.bib22); Sclar et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib39)\)\.

### 5\.4\. Reproducibility Implications

Building on the model\-dependent prompt effects discussed above, the sensitivity of model performance to prompting strategy has direct implications for reproducibility\. When the same model evaluated on the same dataset produces substantially different results under different prompts, a single reported metric does not fully characterize its behavior \(Sclar et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib39); Chatterjee et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib8); Hua et al\., [\[n\. d\.\]](https://arxiv.org/html/2605.24171#bib.bib22)\)\.

This variability complicates comparisons across studies and makes it difficult to disentangle methodological gains from prompt\-induced effects unless evaluation conditions are explicitly standardized and reported \(Liang et al\., [2023](https://arxiv.org/html/2605.24171#bib.bib30); Sclar et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib39); Chatterjee et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib8)\)\. Reported improvements may reflect differences in prompting, scoring, or output handling rather than genuine advances in vulnerability detection capability \(Hua et al\., [\[n\. d\.\]](https://arxiv.org/html/2605.24171#bib.bib22)\)\. As shown earlier, prompt\-template choice can materially alter recall and coverage, so studies that report a single configuration risk overinterpreting prompt\-specific behavior as model\-level capability\.

Reliable evaluation therefore requires treating prompting as a controlled variable\. The cross\-strategy F1 ranges observed in Table [4](https://arxiv.org/html/2605.24171#A4.T4), spanning 0\.398 for Gemma and 0\.271 for CodeLlama, demonstrate that a single reported metric does not characterize model behavior\. Two studies evaluating the same model on the same dataset with different prompts can therefore reach opposite conclusions\. Multi\-strategy evaluation provides a more complete characterization and reduces the risk of overinterpreting single\-configuration results\. The output\-protocol ablation in Section [4\.6](https://arxiv.org/html/2605.24171#S4.SS6) further indicates that the strategy\-level effects are not explained solely by verdict placement or parser heuristics, supporting the validity of cross\-strategy comparisons under the PromptAudit framework\.
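A cross\-strategy range of this kind is simple to compute once per\-strategy scores are available. The sketch below uses placeholder scores for two hypothetical models, not values from the paper's tables, and the helper name `f1_range` is introduced here for illustration.

```python
from typing import Mapping


def f1_range(scores: Mapping[str, float]) -> float:
    """Spread between the best and worst F1 across prompting strategies."""
    return max(scores.values()) - min(scores.values())


# Placeholder scores for two hypothetical models (not the paper's results).
results = {
    "model_a": {"zero_shot": 0.50, "few_shot": 0.48, "cot": 0.52},
    "model_b": {"zero_shot": 0.20, "few_shot": 0.45, "cot": 0.55},
}

for model, scores in results.items():
    best = max(scores, key=scores.get)
    print(f"{model}: range={f1_range(scores):.3f}, best strategy={best}")
```

In this toy data, `model_a` leads under zero\-shot while `model_b` leads under CoT, so a study that reports only one strategy would rank the two models in opposite orders depending on which strategy it happened to pick.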

Although PromptAudit holds datasets, decoding parameters, and parsing rules constant to isolate prompt effects, the results show that evaluation design choices themselves influence measured performance, consistent with broader work emphasizing that benchmark conclusions depend materially on how scenarios, metrics, and scoring procedures are defined \(Liang et al\., [2023](https://arxiv.org/html/2605.24171#bib.bib30); Hua et al\., [\[n\. d\.\]](https://arxiv.org/html/2605.24171#bib.bib22)\)\. Strict output constraints reduce ambiguity but increase abstention when models deviate from the expected format, as evidenced by Falcon’s mean abstention of 42\.26% persisting across all five strategies regardless of prompt complexity \(Appendix [D](https://arxiv.org/html/2605.24171#A4), Table [2](https://arxiv.org/html/2605.24171#S4.T2)\)\. Aggregation under self\-consistency improves conditional F1 \(0\.420\) while collapsing mean effective F1 to 0\.197, showing that aggregation strategies can improve standard metrics while reducing operational coverage\. These dependencies indicate that the evaluation harness is part of the causal pathway by which prompting affects outcomes\. Making these design choices explicit, as PromptAudit does through its fixed pipeline and ablation controls, is essential for interpreting results and enabling meaningful cross\-study comparison\.

### 5\.5\. Evaluation Implications

Building on the reproducibility and interaction effects discussed above, the 25 model\-template combinations separate into three reliability tiers under this evaluation setup\. *High\-performing combinations:* CodeLlama and Mistral paired with the evaluated zero\-shot or CoT templates provide the strongest and most stable operational performance\. *Moderately reliable combinations:* Falcon under the evaluated few\-shot or CoT templates, and DeepSeek under the evaluated few\-shot template, are usable but sensitive to coverage changes\. *Unstable combinations:* any model under our A\-CoT\-based self\-consistency configuration, DeepSeek under the evaluated zero\-shot or A\-CoT templates, and Gemma under the evaluated zero\-shot or A\-CoT templates all exhibit large performance swings or severe abstention\.

These tiers reinforce that prompt\-template selection must be validated per model family rather than uniformly\. The findings also translate into several practical considerations for LLM\-based vulnerability detection systems\. First, evaluation should report coverage and abstention alongside F1, because standard metrics do not capture operational reliability; effective F1 and abstention rates are necessary to assess real\-world performance\. Second, certain prompt\-template designs should be used with caution: the evaluated adaptive CoT template and our A\-CoT\-based self\-consistency configuration introduce failure modes such as recall suppression and coverage collapse, making similar designs unsuitable for smaller models without careful validation\. This aligns with broader findings that reasoning\-oriented prompts can behave unpredictably in code\-security tasks despite their success on generic reasoning benchmarks \(Wang et al\., [2023](https://arxiv.org/html/2605.24171#bib.bib47); Ullah et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib43); Basic and Giaretta, [2024](https://arxiv.org/html/2605.24171#bib.bib5)\)\.

Prompt\-template selection should therefore be model\-specific rather than uniform, as prior work and our results show that prompt sensitivity is structured and architecture\-dependent rather than a generic nuisance variable \(Sclar et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib39); Chatterjee et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib8); Lin and Mohaisen, [2025](https://arxiv.org/html/2605.24171#bib.bib31)\)\. Code\-oriented models often perform well under simpler prompts, while less specialized models benefit from structured reasoning\. Given this sensitivity, each model–prompt combination should be evaluated on representative data prior to deployment through a systematic prompt audit\. In addition, deployment pipelines must define explicit handling policies for abstention\. Samples that receive no verdict should be routed to fallback systems or human review, as treating them as safe introduces systematic risk equivalent to missed detections \(Chess and West, [2007](https://arxiv.org/html/2605.24171#bib.bib11); Johnson et al\., [2013](https://arxiv.org/html/2605.24171#bib.bib24); Li et al\., [2025](https://arxiv.org/html/2605.24171#bib.bib28); Yang et al\., [2025](https://arxiv.org/html/2605.24171#bib.bib50)\)\.
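An explicit abstention policy of this kind can be as small as a routing function. The sketch below is a hypothetical example, not part of PromptAudit: the `Route` enum, its member names, and the use of `None` to represent a missing verdict are assumptions. The one property it is meant to show is that anything other than a recognized verdict is escalated rather than silently treated as safe.

```python
from enum import Enum
from typing import Optional


class Route(Enum):
    """Hypothetical downstream actions for a single code sample."""
    ACCEPT_SAFE = "accept_safe"
    FLAG_VULNERABLE = "flag_vulnerable"
    ESCALATE = "escalate"  # e.g. static analysis or human review


def route_verdict(verdict: Optional[str]) -> Route:
    """Map a parsed verdict to an action; abstentions never default to safe."""
    if verdict == "VULNERABLE":
        return Route.FLAG_VULNERABLE
    if verdict == "SAFE":
        return Route.ACCEPT_SAFE
    # None (abstention) or any unexpected string goes to the fallback path.
    return Route.ESCALATE


for verdict in ("SAFE", "VULNERABLE", None):
    print(verdict, "->", route_verdict(verdict).value)
```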

Finally, evaluation and optimization should prioritize recall and effective F1 over precision\. As shown earlier, precision remains relatively stable across configurations, while recall and coverage determine security outcomes\.

### 5\.6\. Limitations and Future Work

This study examines prompt sensitivity using open\-weight language models with fewer than 7B parameters, reflecting practical hardware constraints and an emphasis on fully reproducible, local experiments\. While this enables controlled comparisons across strategies, it limits generalization to larger or proprietary models, which may exhibit different instruction\-following behavior due to scale or extensive post\-training\. The goal is not to establish absolute performance ceilings, but to analyze how prompt formulation influences model behavior when other factors are held constant\. Within this regime, the observed differences reflect prompt\-induced effects, even if their magnitude may vary at larger scales\.

The task is framed as binary SAFE/VULNERABLE classification over code snippets from real CVE records\. This abstraction simplifies security judgments that, in practice, depend on broader program context, exploitability conditions, and severity considerations\. For instance, some snippets lack sufficient information for a definitive decision, and model abstentions or inconsistencies may reflect legitimate semantic uncertainty rather than failure\. Thus, binary classification should be viewed as a controlled proxy for studying prompt sensitivity rather than a complete model of vulnerability analysis\.

Each prompting strategy is instantiated with a single prompt template, so the results should be interpreted as strategy–template measurements rather than universal claims about all possible zero\-shot, few\-shot, CoT, A\-CoT, or self\-consistency prompts\. Alternative templates may shift absolute scores and, in some cases, cross\-strategy rankings, although the output\-protocol ablation in Section [4\.6](https://arxiv.org/html/2605.24171#S4.SS6) suggests that the main trends are not driven solely by verdict placement\. CVEfixes commit\-level labels also carry residual noise from multi\-stage patches and context\-dependent vulnerabilities; these constraints apply uniformly across all conditions and are unlikely to fully explain the differential prompt effects observed\. Finally, all experiments use temperature 0\.2, reflecting near\-deterministic behavior; systems deployed at higher temperatures may exhibit additional variance not captured here\.

Taken together, these considerations highlight broader evaluation risks that extend beyond prompt design\. The CVEfixes dataset is publicly available and may have been included in the training data of some models\. Although this cannot be verified, it introduces a risk of data leakage and potential bias in the evaluation, a concern widely acknowledged in benchmark\-based studies where overlap between training and test data is difficult to fully rule out\.

Future Work\. We will strengthen PromptAudit along three axes: realism, task coverage, and robustness\. We plan to validate our findings on additional real\-world datasets \(e\.g\., Big\-Vul \(Fan et al\., [2020](https://arxiv.org/html/2605.24171#bib.bib14)\)\) to assess whether prompt\-driven behaviors persist across diverse projects, languages, and labels\. We will extend beyond coarse classification to support *vulnerability\-type/CWE*, *localization* to vulnerable functions or lines, and *fix\-oriented suggestions* in structured outputs aligned with security review workflows \(Li et al\., [2018](https://arxiv.org/html/2605.24171#bib.bib29); Fu and Tantithamthavorn, [2022](https://arxiv.org/html/2605.24171#bib.bib16); Pearce et al\., [2023](https://arxiv.org/html/2605.24171#bib.bib35)\)\. We also plan to evaluate whether model explanations can be converted into normalized vulnerability\-report artifacts \(Althebeiti et al\., [2025b](https://arxiv.org/html/2605.24171#bib.bib3), [a](https://arxiv.org/html/2605.24171#bib.bib2)\)\.

To improve reliability, we will incorporate lightweight static\-analysis cues \(e\.g\., unsafe API flags and simple taint or dataflow indicators\) and introduce a controlled prompt\-perturbation suite to quantify stability across templates, formatting variations, and added context, reporting variance rather than single\-prompt outcomes \(Sclar et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib39); Chatterjee et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib8)\)\. We will also evaluate deployment relevance through *human\-in\-the\-loop* studies \(e\.g\., triage speed and error reduction\), complemented by scalable *LLM\-as\-a\-judge* rubrics for explanation, localization, and fix quality, and benchmark prompting against stronger alternatives, including *fine\-tuned* models and *tool\-augmented* pipelines that combine static analyzers with LLM reasoning \(Du et al\., [2024](https://arxiv.org/html/2605.24171#bib.bib12); Li et al\., [2025](https://arxiv.org/html/2605.24171#bib.bib28); Yang et al\., [2025](https://arxiv.org/html/2605.24171#bib.bib50); Johnson et al\., [2013](https://arxiv.org/html/2605.24171#bib.bib24)\)\.

## 6\. Conclusion

We studied how prompt formulation affects LLM\-based vulnerability classification and introduced PromptAudit, a framework that fixes the dataset, decoding configuration, output protocol, parser, and metrics while varying prompt structure\. Across five prompt templates and five open\-weight models on a CVE\-derived binary SAFE/VULNERABLE task, prompt choice changed recall, abstention, and effective F1 enough to change model rankings and evaluation conclusions\.

The main lesson is that prompt\-template choice behaves like part of the decision policy\. It can shift a model toward higher recall, higher abstention, or more conservative labeling without changing the underlying model\. Single\-prompt scores therefore give an incomplete account of reliability\. PromptAudit provides a controlled way to measure this variance and to compare models under the same data, parsing, and reporting assumptions\. We conclude that prompt sensitivity should be reported as part of LLM\-based security evaluation\. Coverage, abstention, and prompt\-induced variability are necessary context for judging whether a classifier can provide consistent, actionable outputs in security\-critical workflows\.

## References

- Althebeiti et al\. \(2025a\) Hattan Althebeiti, Mohammed Alkinoon, Manar Mohaisen, Saeed Salem, DaeHun Nyang, and David Mohaisen\. 2025a\. Enhancing Vulnerability Reports with Automated and Augmented Description Summarization\. *IEEE Transactions on Big Data* \(2025\)\.
- Althebeiti et al\. \(2025b\) Hattan Althebeiti, Brett Fazio, William Chen, Jamen Park, and David Mohaisen\. 2025b\. Mujaz: A Summarization\-based Approach for Normalized Vulnerability Description\. *IEEE Transactions on Dependable and Secure Computing* \(2025\)\.
- Arp et al\. \(2022\) Daniel Arp, Erwin Quiring, Feargus Pendlebury, Alexander Warnecke, Fabio Pierazzi, Christian Wressnegger, Lorenzo Cavallaro, and Konrad Rieck\. 2022\. Dos and don’ts of machine learning in computer security\. In *31st USENIX Security Symposium \(USENIX Security 22\)*\. 3971–3988\.
- Basic and Giaretta \(2024\) Enna Basic and Alberto Giaretta\. 2024\. *From Vulnerabilities to Remediation: A Systematic Literature Review of LLMs in Code Security*\. [https://doi\.org/10\.48550/arXiv\.2412\.15004](https://doi.org/10.48550/arXiv.2412.15004)
- Bhandari et al\. \(2021\) Guru Bhandari, Amara Naseer, and Leon Moonen\. 2021\. CVEfixes: automated collection of vulnerabilities and their fixes from open\-source software\. In *Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering \(PROMISE 2021\)*\. Association for Computing Machinery, New York, NY, USA, 30–39\. [https://doi\.org/10\.1145/3475960\.3475985](https://doi.org/10.1145/3475960.3475985)
- Chakraborty et al\. \(2021\) Saikat Chakraborty, Rahul Krishna, Yangruibo Ding, and Baishakhi Ray\. 2021\. Deep learning based vulnerability detection: Are we there yet? *IEEE Transactions on Software Engineering* 48, 9 \(2021\), 3280–3296\.
- Chatterjee et al\. \(2024\) Anwoy Chatterjee, H S V N S Kowndinya Renduchintala, Sumit Bhatia, and Tanmoy Chakraborty\. 2024\. POSIX: A Prompt Sensitivity Index For Large Language Models\. In *Findings of the Association for Computational Linguistics: EMNLP 2024*\. Association for Computational Linguistics, Miami, Florida, USA, 14550–14565\. [https://doi\.org/10\.18653/v1/2024\.findings\-emnlp\.852](https://doi.org/10.18653/v1/2024.findings-emnlp.852)
- Chen et al\. \(2025\) Banghao Chen, Zhaofeng Zhang, Nicolas Langrené, and Shengxin Zhu\. 2025\. Unleashing the Potential of Prompt Engineering for Large Language Models\. *arXiv preprint arXiv:2310\.14735* \(2025\), 25 pages\. [https://arxiv\.org/abs/2310\.14735v6](https://arxiv.org/abs/2310.14735v6) Version 6, February 2025\.
- Chen et al\. \(2023\) Yizheng Chen, Zhoujie Ding, Lamya Alowain, Xinyun Chen, and David Wagner\. 2023\. DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection\. In *Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses*\. ACM, Hong Kong, China, 654–668\. [https://doi\.org/10\.1145/3607199\.3607242](https://doi.org/10.1145/3607199.3607242)
- Chess and West \(2007\) Brian Chess and Jacob West\. 2007\. *Secure programming with static analysis*\. Pearson Education\.
- Du et al\. \(2024\) Xiaohu Du, Ming Wen, Jiahao Zhu, Zifan Xie, Bin Ji, Huijun Liu, Xuanhua Shi, and Hai Jin\. 2024\. Generalization\-Enhanced Code Vulnerability Detection via Multi\-Task Instruction Fine\-Tuning\. In *Findings of the Association for Computational Linguistics: ACL 2024*\. Association for Computational Linguistics, Bangkok, Thailand and virtual meeting, 10507–10521\. [https://doi\.org/10\.18653/v1/2024\.findings\-acl\.625](https://doi.org/10.18653/v1/2024.findings-acl.625)
- Fan et al\. \(2023\) Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang\. 2023\. Large language models for software engineering: Survey and open problems\. In *2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering \(ICSE\-FoSE\)*\. IEEE, 31–53\.
- Fan et al\. \(2020\) Jiahao Fan, Yi Li, Shaohua Wang, and Tien N\. Nguyen\. 2020\. A C/C\+\+ Code Vulnerability Dataset with Code Changes and CVE Summaries\. In *Proceedings of the 17th International Conference on Mining Software Repositories \(MSR ’20\)*\. Association for Computing Machinery, New York, NY, USA, 508–512\. [https://doi\.org/10\.1145/3379597\.3387501](https://doi.org/10.1145/3379597.3387501)
- Feng \(2020\) Z Feng\. 2020\. Codebert: A pre\-trained model for programming and natural languages\. *arXiv preprint arXiv:2002\.08155* \(2020\)\.
- Fu and Tantithamthavorn \(2022\) Michael Fu and Chakkrit Tantithamthavorn\. 2022\. LineVul: a transformer\-based line\-level vulnerability prediction\. In *Proceedings of the 19th International Conference on Mining Software Repositories*\. ACM, Pittsburgh, Pennsylvania, 608–620\. [https://doi\.org/10\.1145/3524842\.3528452](https://doi.org/10.1145/3524842.3528452)
- Guo et al\. \(2020\) Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al\. 2020\. Graphcodebert: Pre\-training code representations with data flow\. *arXiv preprint arXiv:2009\.08366* \(2020\)\.
- Guo et al\. \(2024\) Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al\. 2024\. DeepSeek\-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence\. *arXiv preprint arXiv:2401\.14196* \(2024\)\.
- Hanif and Maffeis \(2022\) Hazim Hanif and Sergio Maffeis\. 2022\. Vulberta: Simplified source code pre\-training for vulnerability detection\. In *2022 International joint conference on neural networks \(IJCNN\)*\. IEEE, 1–8\.
- He and Vechev \(2023\) Jingxuan He and Martin Vechev\. 2023\. Large Language Models for Code: Security Hardening and Adversarial Testing\. In *Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security \(CCS ’23\)*\. Association for Computing Machinery, New York, NY, USA, 1865–1879\. [https://doi\.org/10\.1145/3576915\.3623175](https://doi.org/10.1145/3576915.3623175)
- Hou et al\. \(2024\) Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang\. 2024\. Large language models for software engineering: A systematic literature review\. *ACM Transactions on Software Engineering and Methodology* 33, 8 \(2024\), 1–79\.
- Hua et al\. \(\[n\. d\.\]\) Andong Hua, Kenan Tang, Chenhe Gu, Jindong Gu, Eric Wong, and Yao Qin\. \[n\. d\.\]\. Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs\. \(\[n\. d\.\]\)\.
- Jiang et al\. \(2025\) Xuefeng Jiang, Lvhua Wu, Sheng Sun, Jia Li, Jingjing Xue, Yuwei Wang, Tingting Wu, and Min Liu\. 2025\. Investigating Large Language Models for Code Vulnerability Detection: An Experimental Study\. [https://doi\.org/10\.48550/arXiv\.2412\.18260](https://doi.org/10.48550/arXiv.2412.18260) arXiv:2412\.18260 \[cs\]\.
- Johnson et al\. \(2013\) Brittany Johnson, Yoonki Song, Emerson Murphy\-Hill, and Robert Bowdidge\. 2013\. Why don’t software developers use static analysis tools to find bugs? In *2013 35th International Conference on Software Engineering \(ICSE\)*\. IEEE, 672–681\.
- Kharma et al\. \(2026\) Mohammed F Kharma, Soohyeon Choi, Mohammad Alkhanafseh, and David Mohaisen\. 2026\. Security and quality in LLM\-generated code: A multi\-language, multi\-model analysis\. *IEEE Transactions on Dependable and Secure Computing* \(2026\)\.
- Kojima et al\. \(2022\) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa\. 2022\. Large language models are zero\-shot reasoners\. *Advances in neural information processing systems* 35 \(2022\), 22199–22213\.
- Li et al\. \(2023\) Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al\. 2023\. Starcoder: may the source be with you\! *arXiv preprint arXiv:2305\.06161* \(2023\)\.
- Li et al\. \(2025\) Ziyang Li, Saikat Dutta, and Mayur Naik\. 2025\. IRIS: LLM\-Assisted Static Analysis for Detecting Security Vulnerabilities\. *International Conference on Representation Learning* 2025 \(May 2025\), 35735–35758\.
- Li et al\. \(2018\) Zhen Li, Deqing Zou, Shouhuai Xu, Xinyu Ou, Hai Jin, Sujuan Wang, Zhijun Deng, and Yuyi Zhong\. 2018\. VulDeePecker: A Deep Learning\-Based System for Vulnerability Detection\. In *Proceedings 2018 Network and Distributed System Security Symposium*\. [https://doi\.org/10\.14722/ndss\.2018\.23158](https://doi.org/10.14722/ndss.2018.23158) arXiv:1801\.01681 \[cs\]\.
- Liang et al\.\(2023\)Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D\. Manning, Christopher Ré, Diana Acosta\-Navas, Drew A\. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda\. 2023\.Holistic Evaluation of Language Models\.[https://doi\.org/10\.48550/arXiv\.2211\.09110](https://doi.org/10.48550/arXiv.2211.09110)arXiv:2211\.09110 \[cs\]\.
- Lin and Mohaisen \(2025\)Jie Lin and David Mohaisen\. 2025\.From Large to Mammoth: A Comparative Evaluation of Large Language Models in Zero\-Shot Vulnerability Detection\. In*Proceedings 2025 Network and Distributed System Security Symposium*\. Internet Society, San Diego, CA, USA\.[https://doi\.org/10\.14722/ndss\.2025\.241491](https://doi.org/10.14722/ndss.2025.241491)
- Liu et al\.\(2024\)Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang\. 2024\.Lost in the middle: How language models use long contexts\.*Transactions of the Association for Computational Linguistics*12 \(2024\), 157–173\.
- Ngweta et al\.\(2025\)Lilian Ngweta, Kiran Kate, Jason Tsay, and Yara Rizk\. 2025\.Towards LLMs Robustness to Changes in Prompt Format Styles\.[https://doi\.org/10\.48550/arXiv\.2504\.06969](https://doi.org/10.48550/arXiv.2504.06969)arXiv:2504\.06969 \[cs\]\.
- Nong et al\.\(2024\)Yu Nong, Mohammed Aldeen, Long Cheng, Hongxin Hu, Feng Chen, and Haipeng Cai\. 2024\.Chain\-of\-thought prompting of large language models for discovering and fixing software vulnerabilities\.*arXiv preprint arXiv:2402\.17230*\(2024\)\.
- Pearce et al\.\(2023\)Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and Brendan Dolan\-Gavitt\. 2023\.Examining zero\-shot vulnerability repair with large language models\. In*2023 IEEE Symposium on Security and Privacy \(SP\)*\. IEEE, 2339–2356\.
- R \(2024\)Kamesh R\. 2024\.Think Beyond Size: Adaptive Prompting for More Effective Reasoning\.arXiv:2410\.08130 \[cs\.LG\][https://arxiv\.org/abs/2410\.08130](https://arxiv.org/abs/2410.08130)
- Roziere et al\.\(2023\)Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al\.2023\.Code llama: Open foundation models for code\.*arXiv preprint arXiv:2308\.12950*\(2023\)\.
- Russell et al\.\(2018\)Rebecca Russell, Louis Kim, Lei Hamilton, Tomo Lazovich, Jacob Harer, Onur Ozdemir, Paul Ellingwood, and Marc McConley\. 2018\.Automated vulnerability detection in source code using deep representation learning\. In*2018 17th IEEE international conference on machine learning and applications \(ICMLA\)*\. IEEE, 757–762\.
- Sclar et al\.\(2024\)Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr\. 2024\.Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting\.*International Conference on Representation Learning*2024 \(May 2024\), 25055–25083\.
- Serebryany et al\.\(2012\)Konstantin Serebryany, Derek Bruening, Alexander Potapenko, and Dmitriy Vyukov\. 2012\.AddressSanitizer: A fast address sanity checker\. In*2012 USENIX annual technical conference \(USENIX ATC 12\)*\. 309–318\.
- Siddiq et al\.\(2022\)Mohammed Latif Siddiq, Shafayat H Majumder, Maisha R Mim, Sourov Jajodia, and Joanna CS Santos\. 2022\.An empirical study of code smells in transformer\-based code generation techniques\. In*2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation \(SCAM\)*\. IEEE, 71–82\.
- Steenhoek et al\.\(2023\)Benjamin Steenhoek, Md Mahbubur Rahman, Richard Jiles, and Wei Le\. 2023\.An Empirical Study of Deep Learning Models for Vulnerability Detection\. In*Proceedings of the 45th International Conference on Software Engineering**\(ICSE ’23\)*\. IEEE Press, Melbourne, Victoria, Australia, 2237–2248\.[https://doi\.org/10\.1109/ICSE48619\.2023\.00188](https://doi.org/10.1109/ICSE48619.2023.00188)
- Ullah et al\.\(2024\)Saad Ullah, Mingji Han, Saurabh Pujar, Hammond Pearce, Ayse Coskun, and Gianluca Stringhini\. 2024\.Llms cannot reliably identify and reason about security vulnerabilities \(yet?\): A comprehensive evaluation, framework, and benchmarks\. In*2024 IEEE symposium on security and privacy \(SP\)*\. IEEE, 862–880\.
- Viega et al\.\(2000\)John Viega, Jon\-Thomas Bloch, Yoshi Kohno, and Gary McGraw\. 2000\.ITS4: A static vulnerability scanner for C and C\+\+ code\. In*Proceedings 16th Annual Computer Security Applications Conference \(ACSAC’00\)*\. IEEE, 257–267\.
- Wan et al\.\(2023a\)Xingchen Wan, Ruoxi Sun, Hanjun Dai, Sercan O\. Arik, and Tomas Pfister\. 2023a\.Better Zero\-Shot Reasoning with Self\-Adaptive Prompting\.arXiv:2305\.14106 \[cs\.CL\][https://arxiv\.org/abs/2305\.14106](https://arxiv.org/abs/2305.14106)
- Wan et al\.\(2023b\)Xingchen Wan, Ruoxi Sun, Hootan Nakhost, Hanjun Dai, Julian Martin Eisenschlos, Sercan O\. Arik, and Tomas Pfister\. 2023b\.Universal Self\-Adaptive Prompting\.arXiv:2305\.14926 \[cs\.CL\][https://arxiv\.org/abs/2305\.14926](https://arxiv.org/abs/2305.14926)
- Wang et al\.\(2023\)Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou\. 2023\.Self\-Consistency Improves Chain of Thought Reasoning in Language Models\.[https://doi\.org/10\.48550/arXiv\.2203\.11171](https://doi.org/10.48550/arXiv.2203.11171)arXiv:2203\.11171 \[cs\]\.
- Wei et al\.\(2022\)Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al\.2022\.Chain\-of\-thought prompting elicits reasoning in large language models\.*Advances in neural information processing systems*35 \(2022\), 24824–24837\.
- White et al\.\(2024\)Jules White, Sam Hays, Quchen Fu, Jesse Spencer\-Smith, and Douglas C Schmidt\. 2024\.Chatgpt prompt patterns for improving code quality, refactoring, requirements elicitation, and software design\.In*Generative AI for Effective Software Development*\. Springer, 71–108\.
- Yang et al\.\(2025\)Chenyuan Yang, Zijie Zhao, Zichen Xie, Haoyu Li, and Lingming Zhang\. 2025\.KNighter: Transforming Static Analysis with LLM\-Synthesized Checkers\. In*Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles*\. 655–669\.[https://doi\.org/10\.1145/3731569\.3764827](https://doi.org/10.1145/3731569.3764827)arXiv:2503\.09002 \[cs\]\.
- Zhou et al\.\(2019\)Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu\. 2019\.Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks\.*Advances in neural information processing systems*32 \(2019\)\.

## Ethical Considerations

This research evaluates prompt sensitivity in LLM\-based vulnerability detection using publicly disclosed CVE records\. Stakeholders include the research team, security practitioners, model developers, open\-source maintainers whose code appears in CVE records, and the broader security community\.

The primary ethical benefit is exposing reliability limitations in LLM\-based detection tools before deployment, enabling defenders to make informed decisions about operational constraints and failure modes\. While adversaries could exploit knowledge of model weaknesses to select prompts that suppress detection, this information asymmetry already exists—adversaries can conduct similar experiments privately\. Publication enables transparent evaluation and responsible deployment practices, with benefits to defenders substantially outweighing potential adversarial advantages\.

No human subjects participated in this research and all analyzed code derives from publicly disclosed CVE records already part of security disclosure processes\. We also perform no novel vulnerability discovery or responsible disclosure\. The framework and findings are publicly available, focusing on open\-weight models to ensure reproducibility without proprietary access or substantial computational resources\. We conclude that publication serves the public interest by documenting systematic weaknesses that could lead to missed vulnerabilities if these tools are deployed without understanding their limitations\. Transparent documentation of coverage\-accuracy tradeoffs and failure modes enables more responsible development and deployment of LLM\-based security tools\.

## Appendix

## Appendix A Prompt Templates and Output Protocol

We provide abbreviated versions of the prompt templates used in PromptAudit\. These excerpts document the exact interface between the experiment runner and the models\.

### A\.1\. Output Protocol Instructions

All prompting strategies share the same task\-specific verdict instructions appended by the experiment runner after the strategy prompt\. The selected output protocol controls only where the final verdict must appear; the prompting strategy controls whether and how reasoning is encouraged\. Representative protocol suffixes are shown below\.

Verdict\-First Protocol

TASK: Classify the code’s security\. On the FIRST LINE ONLY, output exactly one of these words: SAFE or VULNERABLE\. Do not add any other words, punctuation, or symbols on that first line\. If you include any explanation, it must begin on the SECOND line\.

Verdict\-Last Protocol

TASK: Classify the code’s security\. If you include any explanation, it must appear before the final verdict\. On the FINAL LINE ONLY, output exactly one of these words: SAFE or VULNERABLE\. Do not add any other words, punctuation, or symbols on that final line\.

The label parser described in Section [3\.7](https://arxiv.org/html/2605.24171#S3.SS7) treats the protocol\-defined verdict line as the primary target: the first non\-empty line for `verdict_first` and the last non\-empty line for `verdict_last`\.
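The interaction between the strategy prompt, the protocol suffix, and the verdict line can be sketched as follows. This is a minimal illustration, not the PromptAudit runner: the names `PROTOCOL_SUFFIXES`, `build_prompt`, and `verdict_line` are hypothetical, and the suffix strings are shortened versions of the protocol text quoted above.

```python
from typing import Optional

# Hypothetical, shortened protocol suffixes; the full wording appears in the excerpts above.
PROTOCOL_SUFFIXES = {
    "verdict_first": (
        "TASK: Classify the code's security. On the FIRST LINE ONLY, "
        "output exactly one of these words: SAFE or VULNERABLE."
    ),
    "verdict_last": (
        "TASK: Classify the code's security. On the FINAL LINE ONLY, "
        "output exactly one of these words: SAFE or VULNERABLE."
    ),
}


def build_prompt(strategy_text: str, output_protocol: str) -> str:
    """Append the selected protocol suffix after the strategy prompt."""
    return f"{strategy_text}\n\n{PROTOCOL_SUFFIXES[output_protocol]}"


def verdict_line(response: str, output_protocol: str) -> Optional[str]:
    """Return the first or last non-empty line, depending on the protocol."""
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    if not lines:
        return None
    return lines[0] if output_protocol == "verdict_first" else lines[-1]
```

Under this sketch, the same response can yield different verdict lines under the two protocols, which is why the protocol is fixed before parsing.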

### A\.2\. Zero\-Shot Prompt \(Excerpt\)

The zero\-shot template adds a minimal instruction on how to approach the classification, without examples or explicit reasoning steps\. The runner appends the selected output\-protocol suffix after this strategy text:

You are a secure code auditor\. Examine the following code and determine whether it is secure or vulnerable to potential exploits\. Do not use step\-by\-step reasoning unless the selected prompt strategy explicitly asks for it\. If you include any explanation, keep it brief\. Briefly consider issues such as buffer overflows, injections, improper validation, and other common weaknesses\. <CODE\_SNIPPET\_HERE\>

### A\.3\. Few\-Shot Prompt \(Excerpt\)

The few\-shot strategy extends the zero\-shot template with compact labeled examples\. In the current implementation, these are synthetic CVE\-style before/after function snippets designed to resemble function\-level patch data without copying exact dataset rows:

```text
Use the examples below as prior knowledge for classifying code security.
First, review the examples and their labels. Then analyze the new code snippet.
Do not use step-by-step reasoning unless the selected prompt strategy explicitly asks for it.
If you include any explanation, keep it brief.

Examples:

Example 1 (before patch):
Code:
int parse_message(const unsigned char *buf, size_t len) {
    unsigned int msg_len;
    char header[256];
    if (len < 2) {
        return -1;
    }
    msg_len = ((unsigned int)buf[0] << 8) | buf[1];
    memcpy(header, buf + 2, msg_len);
    return 0;
}
Label: VULNERABLE

Example 2 (after patch):
Code:
int parse_message(const unsigned char *buf, size_t len) {
    unsigned int msg_len;
    char header[256];
    if (len < 2) {
        return -1;
    }
    msg_len = ((unsigned int)buf[0] << 8) | buf[1];
    if (msg_len > len - 2 || msg_len > sizeof(header)) {
        return -1;
    }
    memcpy(header, buf + 2, msg_len);
    return 0;
}
Label: SAFE

Now analyze this code from a security perspective:
<CODE_SNIPPET_HERE>
```

### A\.4\. Chain\-of\-Thought Prompt \(Excerpt\)

The CoT strategy encourages structured reasoning and uses a protocol\-aware placement hint\. The runner inserts one of the following hints depending on the selected output protocol: \(i\) “Give the required verdict first, then explain your reasoning starting on the second line” for `verdict_first`; or \(ii\) “Reason step by step first, then place the final verdict on the last line” for `verdict_last`\.

You are a secure code auditor\. Analyze the following code step by step, carefully reasoning about potential security vulnerabilities such as buffer overflows, injections, improper validation, race conditions, and other common issues\. <PLACEMENT\_HINT\> Consider at least these factors in your reasoning: \(1\) Inputs and trust boundaries \(2\) Validation and sanitization \(3\) Memory safety and resource management \(4\) Injection, race, and logic risks Code: <CODE\_SNIPPET\_HERE\>

### A\.5\. Adaptive Chain\-of\-Thought \(Excerpt\)

The A\-CoT strategy instructs the model to decide when stepwise reasoning is necessary and, like CoT, uses a protocol\-aware placement hint:

You are a secure code auditor\. Determine whether the following code is SAFE or VULNERABLE\. Adjust the depth of your reasoning to the code: \(1\) If the code is straightforward and obviously SAFE or VULNERABLE, keep any explanation brief\. \(2\) If the code uses pointer arithmetic, raw memory operations, manual resource management, or complex input handling, reason through the risks step by step\. \(3\) <PLACEMENT\_HINT\> Code: <CODE\_SNIPPET\_HERE\>

### A\.6\. Self\-Consistency Strategy

The self\-consistency strategy uses the same adaptive CoT template as A\-CoT, but relies on the experiment runner to issue multiple completions under the active output protocol\. Five independent responses are collected per snippet by default\. Each vote is parsed individually using the selected parser mode, and the votes are then aggregated by majority vote\. If no label receives a true majority over the requested samples, the final result is recorded as `UNKNOWN`\.
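The aggregation step can be sketched as below. This is an illustration under one stated assumption: "a true majority over the requested samples" is read as strictly more than half of the requested sample count, so votes that parse to unknown still count against the threshold. The function name `aggregate_votes` is hypothetical.

```python
from collections import Counter


def aggregate_votes(votes, requested_samples=5):
    """Majority vote over parsed labels; returns 'unknown' without a true majority."""
    # Only committed labels can win; 'unknown' votes are never tallied as a label.
    counts = Counter(v for v in votes if v in ("safe", "vulnerable"))
    if not counts:
        return "unknown"
    label, n = counts.most_common(1)[0]
    # Assumed threshold: strictly more than half of the requested samples.
    return label if n > requested_samples / 2 else "unknown"
```

With five samples, a 2/2/1 split between safe, vulnerable, and unparsed votes resolves to unknown under this reading, which is consistent with the elevated abstention reported for self\-consistency.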

## Appendix B Label Parsing Logic

This section summarizes the parser modes used to convert raw model output into `SAFE`, `VULNERABLE`, or `UNKNOWN`\. The full implementation appears in `evaluation/label_parser.py`\.

### B\.1\. Parser Modes

The parser supports three modes:

- Strict: protocol\-position check only\.
- Structured: strict parsing plus explicit verdict\-line patterns\.
- Full: structured parsing plus whole\-response lexical fallback\.

All three modes begin by normalizing away empty lines and then applying the protocol\-specific strict check\.

### B\.2\. Tier 1: Protocol\-Position Strict Check

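```python
def _strict_line_label(lines, output_protocol):
    # `lines` holds the non-empty response lines.
    if not lines:
        return "unknown", None

    # The verdict line is the first line under verdict_first, else the last line.
    idx = 0 if output_protocol == "verdict_first" else -1
    target_line = lines[idx]
    parts = target_line.split()
    # Accept the line only if it consists of a single token.
    if len(parts) == 1:
        # _normalize_token_label is defined elsewhere in the parser module.
        label = _normalize_token_label(parts[0])
        if label:
            tier = (
                "strict_first_line"
                if output_protocol == "verdict_first"
                else "strict_last_line"
            )
            return label, tier

    return "unknown", None
```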

This tier accepts only the protocol\-defined verdict location\. Minor punctuation is normalized away before testing the token, so a verdict such as `SAFE` with trailing punctuation still maps to `SAFE`\.

### B\.3\. Tier 2: Explicit Verdict Markers

```python
import re

_EXPLICIT_VERDICT_PATTERNS = [
    # "Verdict: SAFE", "Final answer is vulnerable."
    re.compile(
        r"^(final answer|answer|classification|"
        r"verdict|label|conclusion)"
        r"\s*[:\-]?\s*(?:is\s+)?"
        r"(safe|vulnerable)\s*[.!]?\s*$",
        re.IGNORECASE,
    ),
    # "The final answer is SAFE"
    re.compile(
        r"^(?:the\s+)?(?:final\s+answer|answer|"
        r"classification|verdict)\s+is\s+"
        r"(safe|vulnerable)\s*[.!]?\s*$",
        re.IGNORECASE,
    ),
    # "Therefore, the code is vulnerable."
    re.compile(
        r"^(therefore|thus|overall|ultimately|"
        r"in conclusion),?\s+"
        r"(?:the\s+code\s+is\s+)?"
        r"(safe|vulnerable)\s*[.!]?\s*$",
        re.IGNORECASE,
    ),
    # "This code is safe."
    re.compile(
        r"^(?:the\s+code|this\s+code)\s+is\s+"
        r"(safe|vulnerable)\s*[.!]?\s*$",
        re.IGNORECASE,
    ),
]
```

```python
# Excerpt from the parser body: scan lines bottom-up for an explicit verdict.
for ln in reversed(lines):
    for pattern in _EXPLICIT_VERDICT_PATTERNS:
        m = pattern.search(ln)
        if m:
            # The label is always the last capture group.
            explicit_label = m.groups()[-1].lower()
            return explicit_label
```

Scanning bottom up prioritizes stable final verdicts when the model writes a longer explanation before committing to a label\.
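The effect of the bottom\-up scan can be seen in a small self\-contained example. The pattern below is a simplified stand\-in for the explicit\-verdict patterns, not the full list, and the function name `last_explicit_verdict` is hypothetical.

```python
import re

# Simplified stand-in for the explicit-verdict patterns (assumption, not the full set).
_VERDICT = re.compile(
    r"^(?:final\s+answer|answer|verdict)\s*[:\-]?\s*(?:is\s+)?"
    r"(safe|vulnerable)\s*[.!]?\s*$",
    re.IGNORECASE,
)


def last_explicit_verdict(lines):
    """Scan bottom-up and return the last explicit verdict, or None."""
    for ln in reversed(lines):
        m = _VERDICT.search(ln.strip())
        if m:
            return m.group(1).lower()
    return None


response = [
    "Verdict: SAFE",
    "On reflection, the memcpy length is never checked.",
    "Final answer: VULNERABLE.",
]
print(last_explicit_verdict(response))
```

When a response revises an earlier verdict, the later line wins because it is reached first in the reversed scan.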

### B\.4\. Tier 3: Contextual Keyword Scan

```python
import re

lowered = text.lower()

# Signals that point toward a vulnerable verdict.
has_not_safe = re.search(r"\bnot\s+safe\b", lowered)
has_unsafe = re.search(r"\bunsafe\b", lowered)
has_vulnerable = re.search(r"\bvulnerable\b", lowered)
has_vulnerability = re.search(r"\bvulnerabilit(?:y|ies)\b", lowered)
has_exploitable = re.search(r"\bexploitable\b", lowered)
has_at_risk = re.search(r"\bat\s+risk\b", lowered)

# Signals that point toward a safe verdict.
has_safe_word = re.search(r"\bsafe\b", lowered)
has_secure_word = re.search(r"\bsecure\b", lowered)
has_not_vulnerable = re.search(r"\bnot\s+vulnerable\b", lowered)
has_no_vuln = re.search(
    r"\bno\s+(known\s+)?vulnerabilit(?:y|ies)\b", lowered
)
```

```python
# vulnerable_signal and positive_safe are derived from the flags above;
# that derivation is not shown in this excerpt.
if vulnerable_signal and not positive_safe:
    return "vulnerable", "lexical_fallback"
if positive_safe and not vulnerable_signal:
    return "safe", "lexical_fallback"
return "unknown", None
```

This tier is available only in `full` mode\. It helps recover verdicts from freer outputs, but conflicting or mixed evidence is intentionally mapped to `UNKNOWN`\.
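The excerpt does not show how the individual flags are combined into `vulnerable_signal` and `positive_safe`. The sketch below illustrates one plausible combination, in which a negated phrase overrides the bare keyword it contains. The rule is our assumption for illustration only and may differ from the actual implementation; the function name `lexical_fallback` is likewise hypothetical.

```python
import re


def lexical_fallback(text):
    """Hypothetical Tier 3 sketch; the combination rule is an assumption."""
    lowered = text.lower()

    def has(pattern):
        return re.search(pattern, lowered) is not None

    negated_safe = has(r"\bnot\s+safe\b") or has(r"\bunsafe\b")
    negated_vuln = has(r"\bnot\s+vulnerable\b") or has(
        r"\bno\s+(known\s+)?vulnerabilit(?:y|ies)\b"
    )
    mentions_vuln = has(r"\bvulnerable\b") or has(r"\bvulnerabilit(?:y|ies)\b")
    mentions_safe = has(r"\bsafe\b") or has(r"\bsecure\b")

    # Assumed rule: a bare keyword counts only when it is not negated.
    vulnerable_signal = (
        negated_safe
        or has(r"\bexploitable\b")
        or has(r"\bat\s+risk\b")
        or (mentions_vuln and not negated_vuln)
    )
    positive_safe = negated_vuln or (mentions_safe and not negated_safe)

    if vulnerable_signal and not positive_safe:
        return "vulnerable", "lexical_fallback"
    if positive_safe and not vulnerable_signal:
        return "safe", "lexical_fallback"
    # Mixed or absent evidence is deliberately left unresolved.
    return "unknown", None
```

Under this assumed rule, a response that calls one part of the code safe and another part vulnerable produces both signals and is therefore mapped to unknown.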

## Appendix C Generation Settings and Configurations

These settings match the YAML configuration used across all experiments\. Each parameter controls a specific aspect of the model’s decoding behavior\.

- Temperature: 0\.2\. The temperature controls how much randomness is introduced when selecting the next token\. Lower values make the model more deterministic and reduce the chance of drifting away from the expected output format\. A value of 0\.2 helps keep the classification stable under the selected output protocol while still allowing the model to produce natural reasoning when the prompt strategy requests it\.
- Top\-p: 0\.9\. Top\-p sampling \(also called nucleus sampling\) restricts token choices to the smallest set whose cumulative probability mass reaches the specified threshold\. Using p = 0\.9 limits low\-probability transitions without forcing fully greedy decoding, which helps maintain consistency across models\.
- Top\-k: 40\. Top\-k further constrains the sampling distribution by limiting each step to the 40 most likely next tokens\. This avoids rare or unstable tokens that could violate the selected output protocol or introduce noisy reasoning\.
- Max new tokens: 250\. This caps the number of tokens a model may generate in response to a prompt\. The verdict itself occupies a single line \(first or final, depending on the selected output protocol\), but CoT and adaptive prompts may produce additional reasoning that also counts toward this cap\.
- Repetition penalty: 1\.0\. A repetition penalty above 1\.0 discourages the model from repeating the same phrase or token sequence\. Using 1\.0 applies no penalty, which keeps behavior consistent across models and avoids introducing artificial bias into the reasoning style\.
- Frequency and presence penalties: 0\.0\. These penalties adjust how the model weighs tokens it has already used\. Setting both to zero avoids altering the output structure in unintended ways and helps preserve reproducibility when applying the selected output protocol\.
- Seed: 42\. The seed ensures that sampling decisions are repeatable\. Fixing the seed allows multiple runs to produce comparable output, which is important for evaluating self\-consistency and for ensuring that differences between prompts are genuine and not due to random sampling\.
- Self\-consistency samples: 5\. For self\-consistency, the model generates five independent completions for each snippet\. PromptAudit parses each completion separately and applies majority voting\. Using five samples provides a reasonable balance between stability and computational cost\.
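For quick reference, the settings above can be collected in a single immutable structure. The sketch below is illustrative: the values come from the list, but the class name and field names are hypothetical and need not match the keys in the YAML configuration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GenerationSettings:
    """Decoding settings from Appendix C; field names are hypothetical."""
    temperature: float = 0.2
    top_p: float = 0.9
    top_k: int = 40
    max_new_tokens: int = 250
    repetition_penalty: float = 1.0   # 1.0 applies no penalty
    frequency_penalty: float = 0.0
    presence_penalty: float = 0.0
    seed: int = 42
    self_consistency_samples: int = 5


SETTINGS = GenerationSettings()
print(asdict(SETTINGS))
```

Freezing the dataclass reflects the experimental design, in which decoding is held fixed while only the prompting strategy varies.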

## Appendix D Evaluation Metrics

Section [3\.8](https://arxiv.org/html/2605.24171#S3.SS8) summarizes the metrics in the main text; this appendix gives the complete formulation used for reporting results\. Because prompt\-induced abstention plays a central role in LLM\-based vulnerability detection, standard classification metrics alone are not sufficient\. The definitions below are designed to capture both predictive correctness and operational coverage\.

Table 4\. Standard F1 and effective F1 by model and prompt template\.

### D\.1\. Confusion Categories

Each model prediction is categorized into one of several outcome types: a True Positive \(TP\) corresponds to code that is correctly classified as `VULNERABLE`, while a True Negative \(TN\) denotes code that is correctly classified as `SAFE`\. In contrast, a False Positive \(FP\) occurs when code is incorrectly classified as `VULNERABLE`, and a False Negative \(FN\) occurs when vulnerable code is incorrectly classified as `SAFE`\. We additionally distinguish Unknown False Negatives \(UnFN\), which capture vulnerable code instances for which the model fails to produce a definitive `VULNERABLE` verdict under the enforced output protocol, including explicit abstentions or responses that avoid committing to a classification\. Finally, Incorrect outputs refer to cases in which the model does not provide a definitive classification despite the enforced output protocol, such as refusals, excessive meta\-reasoning without a verdict, or responses that avoid committing to either `SAFE` or `VULNERABLE`; these cases reflect model behavior under the given prompt rather than parser failure\. UnFN explicitly captures abstentions on vulnerable samples because abstaining on vulnerable code has similar operational consequences to false negatives in security settings\.

### D\.2\. Metric Definitions

Let TP, TN, FP, FN, UnFN, and Incorrect denote the outcome counts defined above\. Using these quantities, we compute standard and abstention\-aware evaluation metrics as follows\. Accuracy is defined as the fraction of correctly classified instances, \(TP \+ TN\) / \(TP \+ TN \+ FP \+ FN \+ UnFN\)\. Precision measures the reliability of positive predictions and is computed as TP / \(TP \+ FP\), while Recall captures the model’s ability to identify vulnerable code and is given by TP / \(TP \+ FN \+ UnFN\), explicitly accounting for unknown false negatives\. Incorrect outputs are excluded from the accuracy denominator because they represent protocol non\-compliance rather than a committed `SAFE`/`VULNERABLE` classification; their operational impact is captured through abstention, coverage, and effective F1\. The F1 Score is the harmonic mean of precision and recall, computed as 2 · \(Precision · Recall\) / \(Precision \+ Recall\)\. To quantify non\-commitment behavior, we define the Abstention Rate as \(Incorrect \+ UnFN\) / \(TP \+ TN \+ FP \+ FN \+ UnFN \+ Incorrect\), with Coverage given by one minus the abstention rate\. Finally, we report Effective F1, defined as the product of the F1 score and coverage, which jointly captures classification quality and the model’s willingness to issue definitive verdicts\.
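The definitions above translate directly into code. The sketch below follows the stated formulas term by term; the names `Outcomes` and `compute_metrics` are hypothetical, the counts in the example are invented for illustration, and returning 0.0 for an empty denominator is our assumption rather than a documented convention.

```python
from dataclasses import dataclass


@dataclass
class Outcomes:
    tp: int
    tn: int
    fp: int
    fn: int
    unfn: int
    incorrect: int


def _ratio(num, den):
    # Assumption: an empty denominator yields 0.0 instead of raising.
    return num / den if den else 0.0


def compute_metrics(o):
    """Standard and abstention-aware metrics from the outcome counts."""
    acc_den = o.tp + o.tn + o.fp + o.fn + o.unfn   # Incorrect is excluded
    total = acc_den + o.incorrect
    precision = _ratio(o.tp, o.tp + o.fp)
    recall = _ratio(o.tp, o.tp + o.fn + o.unfn)    # UnFN counts against recall
    f1 = _ratio(2 * precision * recall, precision + recall)
    abstention = _ratio(o.incorrect + o.unfn, total)
    coverage = 1.0 - abstention
    return {
        "accuracy": _ratio(o.tp + o.tn, acc_den),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "abstention_rate": abstention,
        "coverage": coverage,
        "effective_f1": f1 * coverage,
    }


# Invented counts for illustration only (not results from the paper).
example = compute_metrics(Outcomes(tp=40, tn=30, fp=10, fn=10, unfn=5, incorrect=5))
print(example)
```

In this invented example, 10 of 100 outputs are abstentions, so coverage is 0.9 and effective F1 is the F1 score scaled by that factor.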

### D\.3\. Interpretation

Standard metrics such as accuracy and F1 alone can be misleading when models abstain frequently\. A model may achieve high F1 on answered samples while remaining unusable in practice due to low coverage\. The Effective F1 metric penalizes such behavior by scaling F1 by coverage, yielding a measure that better reflects operational utility\. By explicitly accounting for `UNKNOWN` outputs and abstentions, these metrics align evaluation with real\-world security workflows, where failure to flag a vulnerability is often more costly than issuing a false alarm\. This formulation allows PromptAudit to distinguish between models that appear strong under closed\-world assumptions and those that provide consistent, actionable outputs in realistic deployment scenarios\.

## Appendix E Supplementary Result Tables

This appendix reports numeric tables that duplicate or expand the figure\-level summaries in the main text\. They are included for exact lookup without requiring the main paper to repeat the same information in both graphical and tabular form\.

Table 5\. Selected model–prompt pairs illustrating the precision–recall tradeoff\.

Table 6\. Output\-protocol ablation: verdict\-first \(VF\) vs\. verdict\-last \(VL\) on a 1,232\-sample subset\. Mean rows summarize across all model–prompt combinations\.

### E\.1\. Additional Raw Results

Table 7\. Extended confusion statistics by model and prompt template\.

Table 8\. Classification performance by model and prompt\.

Table [7](https://arxiv.org/html/2605.24171#A5.T7) summarizes true positive \(TP\), true negative \(TN\), false positive \(FP\), false negative \(FN\), Incorrect, and UnFN outcome distributions across all evaluated configurations, enabling fine\-grained analysis of detection reliability and model decision behavior\. Self\-consistency prompting produces the largest increases in unresolved and error\-prone outputs, reflected by elevated Incorrect and UnFN counts across multiple models, indicating reduced vulnerability recall and increased prediction instability\. In contrast, zero\-shot and Chain\-of\-Thought prompting maintain more balanced outcome distributions for several model families, preserving higher volumes of correct classifications with fewer unresolved predictions\. These outcome shifts align with the Effective F1 and coverage trends reported in the main analysis, as growth in Incorrect and UnFN outcomes reduces usable prediction volume and contributes to the performance degradation observed under fragile prompting strategies\.

Table [8](https://arxiv.org/html/2605.24171#A5.T8) summarizes performance across all evaluated model–prompt combinations using accuracy, precision, and recall\. The results show that prompting strategy systematically shapes performance patterns across models\. In particular, few\-shot prompting generally increases recall compared to zero\-shot settings, indicating improved detection coverage\. Chain\-of\-Thought prompting tends to produce more balanced outcomes, stabilizing accuracy while moderating fluctuations in recall\. Adaptive Chain\-of\-Thought has mixed effects: it preserves accuracy in some configurations but leads to notable recall degradation in others\. Self\-consistency introduces a distinct trade\-off, often reducing accuracy while maintaining relatively stable precision, consistent with the more conservative prediction behavior observed in the abstention analysis\.

From a metric perspective, overall accuracy remains relatively stable across strategies \(typically 0\.47–0\.52 outside self\-consistency\), suggesting limited sensitivity of aggregate correctness to prompt structure\. Precision shows moderate variability, generally remaining near 0\.50, indicating robust control of false positives\. In contrast, recall exhibits the largest variation, ranging from 0\.057 \(Gemma Adaptive CoT\) to 0\.865 \(CodeLlama Zero\-Shot\), highlighting detection coverage as the main driver of performance variability\.

Table [2](https://arxiv.org/html/2605.24171#S4.T2) reports abstention behavior across all evaluated model–prompt combinations and directly reflects effective prediction coverage, where lower abstention corresponds to higher usable output coverage\. Our A\-CoT\-based self\-consistency configuration produces the highest abstention rates for every model, exceeding 39% in all cases and reaching 75\.96% for Falcon, indicating a substantial reduction in effective coverage under this majority\-vote policy\. In contrast, the evaluated zero\-shot and Chain\-of\-Thought templates maintain low abstention for models such as Mistral and Gemma, often below 1%, resulting in near\-complete output coverage\. Falcon exhibits the highest mean abstention rate \(42\.26%\), reflecting reduced prediction availability across template choices, whereas Gemma preserves near\-full coverage under most evaluated prompt templates\. These coverage reductions directly contribute to the Effective F1 trends reported in the main results, as templates or aggregation policies that increase abstention reduce the proportion of evaluable predictions and therefore lower the effective performance even when raw F1 remains stable\.
