A Systematic Study of Training-Free Methods for Trustworthy Large Language Models

arXiv cs.CL 04/20/26, 04:00 AM Papers

Summary

A systematic study evaluating training-free methods for improving trustworthiness in large language models, categorizing approaches into input, internal, and output-level interventions while analyzing trade-offs between trustworthiness, utility, and robustness.

arXiv:2604.15789v1 Announce Type: new

Cached at: 04/20/26, 08:29 AM

# A Systematic Study of Training-Free Methods for Trustworthy Large Language Models
Source: https://arxiv.org/html/2604.15789
Wai Man Si, Mingjie Li, Michael Backes, Yang Zhang

CISPA Helmholtz Center for Information Security

###### Abstract

As Large Language Models (LLMs) receive increasing attention and are being deployed across various domains, their potential risks, including generating harmful or biased content, producing unsupported claims, and exhibiting vulnerabilities to adversarial attacks, have drawn significant attention. To enable quick and low-cost adaptation, training-free methods have recently emerged as cost-effective alternatives to post-training alignment techniques. Despite their promising results, these methods are evaluated inconsistently across the literature, cover limited dimensions of trustworthiness, and can introduce undesirable side effects, such as utility degradation and increased brittleness. To fully assess the impacts of these training-free methods, we take a step back and systematically re-evaluate the effectiveness of existing training-free methods against various trustworthy settings and their influence on utility, robustness, and computational overhead. We also categorize these methods into three levels (input, internal, and output) based on where they intervene in the model's information flow during inference. Using this taxonomy, we conduct a comprehensive analysis of various representative and effective methods from each level across different LLM families and sizes. Our analysis highlights several trade-offs and unresolved challenges in current approaches. We summarize key findings and limitations in the existing literature, and propose practical recommendations for balancing trustworthiness, utility, and robustness in LLMs—without the need for additional training.

## Introduction

Over the past few years, LLMs have been used in a wide range of domains, from productivity tools to mobile assistants. However, pretrained LLMs have been shown to generate undesired content (e.g., harmful or biased) and are vulnerable to adversarial attacks, which has become a serious concern given their widespread use. A common strategy to mitigate these issues is to retrain or finetune models toward the intended behavior. However, such approaches are often costly and time-consuming, and collecting high-quality training data in sufficient quantities is itself challenging. In many practical scenarios, users must quickly adapt models to new threats or evolving policies; for example, LLMs in personalized agents need to adapt continuously to users' habits. Moreover, retraining or finetuning often demands extensive computational and data resources, which are not always accessible. These challenges have motivated growing interest in methods that enhance LLM trustworthiness without requiring additional training.

Among these, prompt engineering has proven to be particularly effective and user-friendly. For example, the system prompt from the LLaMA-2 report is specifically designed to enhance safety and accuracy by guiding the model toward responsible engagement. Likewise, Self-Reminder counters "jailbreak" attempts by pairing a system prompt with a reminder appended to the end of user queries. Other research focuses on directly modifying model activations or parameters to shape model behavior more precisely. Turner et al. propose Activation Addition, which steers model behavior by contrasting activations between prompts and has been shown to be effective at detoxifying responses. Similarly, ProFS reduces toxic generation by editing the model parameters away from the toxic subspace. Beyond prompting and model editing, adjustments during the decoding process also show promise for improving model trustworthiness. For instance, DoLA modifies the output distribution by contrasting logits from different internal layers, while ICD employs a similar approach using external models.
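To make the idea of contrasting logits concrete, the following is a minimal sketch of a generic contrastive adjustment over next-token logits. It is not the DoLA or ICD algorithm: the function names, the `alpha` plausibility cutoff, and the toy logits are all assumptions made for illustration. The "weak" logits stand in for whatever is being contrasted against, such as an earlier layer or an external model.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def contrastive_scores(strong_logits, weak_logits, alpha=0.1):
    """Score each token by log p_strong - log p_weak.

    Tokens whose strong probability falls below alpha times the top
    strong probability are masked to -inf, so an implausible token
    cannot win merely because the weak distribution rates it even lower.
    (alpha is a hypothetical hyperparameter for this sketch.)
    """
    lp_strong = log_softmax(strong_logits)
    lp_weak = log_softmax(weak_logits)
    cutoff = max(lp_strong) + math.log(alpha)
    return [
        s - w if s >= cutoff else float("-inf")
        for s, w in zip(lp_strong, lp_weak)
    ]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical logits over a four-token vocabulary.
final_layer = [2.0, 1.9, 0.1, -3.0]   # "strong" distribution
early_layer = [2.0, 0.5, 0.1, -6.0]   # "weak" distribution

scores = contrastive_scores(final_layer, early_layer)
greedy_choice = argmax(final_layer)    # what plain greedy decoding picks
contrastive_choice = argmax(scores)    # what the contrast prefers
```

In this toy setting, the contrast favors the token whose probability grows most between the weak and strong distributions, while the mask keeps the very unlikely last token from being selected.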

In summary, training-free methods enable users to adjust model behavior to enhance trustworthiness in a cost-effective and timely way. However, their reported effectiveness varies widely and is often inconsistent across papers, leaving gaps in understanding their full potential and limitations. First, most existing methods are designed for a single purpose (e.g., to improve safety) and are evaluated only on that property. This narrow scope limits insights into how these methods might impact model trustworthiness in dimensions beyond the primary target. Second, utility evaluations differ significantly across studies in both structure and task orientation: some methods are evaluated on question-answering tasks, while others are evaluated on instruction-following tasks. This discrepancy creates a gap between expected and actual model performance. Third, the evaluation of model robustness, including resistance to adversarial attacks, the presence of watermarking artifacts, and tendencies to over-refuse, remains fragmented and inconsistent across existing studies. These factors are essential for understanding the impact of training-free methods on real-world applications and their influence on user experience.

This work is motivated by the lack of a comprehensive understanding of the side effects of existing training-free methods. We re-evaluate the effectiveness of these methods in enhancing trustworthiness and their impact on model utility and robustness. Additionally, we examine the computational cost of each method and the implications of using multiple methods simultaneously. To begin, we systematically categorize current methods into three levels (input, internal, and output) based on the information flow within the model during inference. We then apply eight representative training-free methods to four widely used LLMs, ranging from 7B to 70B parameters, and assess their effects on trustworthiness, utility, and robustness tasks, along with their computational costs. Our findings reveal consistent trade-offs across different levels. Input-level methods tend to reduce unsafe behavior but can worsen truthfulness and bias and increase over-refusal. Internal-level methods are more effective at improving truthfulness and reducing bias, but they often come at the cost of lower utility. Output-level methods offer modest improvements in safety and truthfulness, typically with minimal impact on utility and robustness. Through this investigation, we provide a deeper understanding of training-free methods and their potential and risks in practice.

The contributions of this paper are as follows:

- We categorize training-free methods into three levels—input, internal, and output—based on how model information flows during inference.
- We conduct a comprehensive analysis of training-free methods across multiple tasks, evaluating their trustworthiness, utility, and robustness.
- We further investigate the computational cost, as well as the potential benefits and drawbacks when combining multiple methods.
- We provide practical guidance for deploying these techniques in real-world applications, including recommendations for selecting the most suitable methods to achieve the desired behaviors.

## Training-Free Methods

**Figure 1:** An overview of the pipeline used in the taxonomy.

| Level | Tech | Title | Target | Safety | Bias | Truthfulness | Utility | Robustness | Cost | Open | Small | Medium | Large | Access | Extra | Date | Code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Input | Prompting | Self-Reminder | J | ✍ | U 1–5 | AA | ✓ | ✓ | ✓ | ● | P | 2023.12 | ✓ |
| | | SAGE | J | D | 1,5 | U 6–8 | AA | Time | ✓ | ✓ | ● | P | 2025.05 | ✓ |
| | In-Context Demonstration | In-Context Defense | J | D | 1–2 | U 1,9 | AA | Time | ✓ | ✓ | ● | P | 2023.10 | ✓ |
| | | Goal | J | ✍ | U 10–11 | AA | ✓ | ✓ | ✓ | ✓ | ● | P | 2023.11 | ✓ |
| | Multi-Turn Prompting | Self Defense | S | ✍ | AA | ✓ | ✓ | ● | P | 2023.08 | ✓ |
| | | IA | J | D | 1–2, 16–17 | D 15 | U 10,6 | AA | Time | ✓ | ✓ | ✓ | ✓ | ● | P, F | 2024.01 | ✓ |
| | | BtB | J | ✍ | U 9 | AA | ✓ | ● | F | 2024.02 |
| Internal | Activation Editing | ActAdd | S | D | 4 | ✓ | ○ | S | 2023.08 | ✓ |
| | | CAA | S, T | ✍ | D 15 | U 6 | ✓ | ✓ | ○ | S | 2023.12 | ✓ |
| | | InferAligner | S | D | 1 | U 7, 29–32 | ✓ | ○ | S | 2024.01 |
| | | SEA | B, T | D 18 | D 15 | U 6–7, 13–16 | ✓ | ✓ | ✓ | ○ | S | 2024.05 | ✓ |
| | | SCANS | S | D | 1, 8, 19–20 | D 15 | U 3, 6, 18, 33 | OR | Time + Mem | ✓ | ✓ | ○ | S | 2024.08 | ✓ |
| | | CAST | S | D | 3 | U 10 | ✓ | ✓ | ✓ | ○ | S | 2024.09 |
| | | SVA | S | D | 1, 5, 8–9, 19–20, 24 | U 6, 17–18 | OR | ✓ | ✓ | ✓ | ○ | S | 2024.10 | ✓ |
| | | Category | S | D | 7, 25 | U 10 | ✓ | ○ | S | 2024.10 |
| | | Antidote | J | D | 5 | U 10 | AA | Time | ✓ | ✓ | ✓ | ○ | S | 2024.10 |
| | | SAC | S, B, T | D | 19 | D 6 | D 23 | U 6, 22 | OR | ✓ | ✓ | ✓ | ○ | S | 2024.11 |
| | | AdaSteer | J | D | 1 | U 10 | AA, OR | Time | ✓ | ○ | S | 2025.04 |
| | Sparse AE. | SAS | S, T | ✍ | D 15 | U 6 | ✓ | ○ | S | 2025.02 |
| | Parameter Editing | ProFS | S | D | 4 | U 13, 17–21, 34 | ✓ | ○ | 2024.05 | ✓ |
| Output | Guided Decoding | DeAL | S | D | 13 | U 36 | AA | ✓ | ○ | M | 2024.02 |
| | | DoLA | T | D | 15, 22 | U 7, 23, 35 | Time + Mem | ✓ | ✓ | ✓ | ○ | 2023.09 | ✓ |
| | | ICD | T | D | 15, 21 | U 6, 17, 10 | ✓ | ○ | F | 2023.12 | ✓ |
| | | Self-CD | S | D | 19–20 | OR | ✓ | ✓ | ✓ | ○ | 2024.01 | ✓ |
| | | ROSE | S | D | 10–14, 19 | U 6, 10 | ✓ | ○ | F | 2024.02 |
| | | DeCoRe | T | D | 15 | U 3, 14, 22–25 | U 28 | ✓ | ✓ | ○ | F | 2024.10 | ✓ |
| | Iterative Rewrite | RAIN | J, T | D | 1 | D 15 | U 36 | AA | Time | ✓ | ✓ | ✓ | ○ | F | 2023.09 | ✓ |

**Table 1:** An overview of existing training-free methods for LLM trustworthiness. Each row is categorized by model level (Input, Internal, Output) and technique. Target denotes the primary objective: S = Safety, T = Truthfulness, B = Bias, J = Jailbreak. Under Evaluation, entries are benchmark datasets used to assess Safety, Bias, Truthfulness, Utility, and Robustness; Dataset IDs map to Table 8 and Table 9, and ✍ indicates paper-specific customized datasets. In Robustness, AA = adversarial attack study and OR = over-refusal study. Cost illustrates the types of computational overhead examined in the study: Time = inference-time latency, Mem = GPU memory usage. Open marks commercial model testing. Small/Medium/Large indicate compatibility across model scales (<12B, 12–32B, >32B). Access denotes accessibility (● = black-box, ○ = white-box). Extra lists additional resource types: P = prompt, S = auxiliary storage, M = additional model(s), F = additional forward pass(es). Date and Code give publication date and code availability.

### Definition

In this paper, we focus on *training-free* methods, approaches that operate without gradient-based optimization and are applied directly to the model or during inference. These techniques avoid computing gradients for the model or any auxiliary components (e.g., guardrails), making them generally fast and inexpensive to use. Examples include prompting, forward-only modifications to parameters or activations, and constraint-driven or contrastive decoding. All of these methods rely solely on manipulating inputs, outputs, or intermediate representations to influence model behavior.

In contrast, we exclude methods that involve gradient-based updates, such as finetuning via LoRA or gradient-based model editing, even if they are efficient. These techniques still require nontrivial training resources (e.g., GPUs and data) for gradient computation, which can be challenging in many deployment contexts. By clearly defining the boundaries of training-free methods, this work highlights a class of fast, lightweight interventions that are particularly well-suited to real-world constraints.
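As an illustration of what a forward-only modification to parameters can look like, here is a minimal sketch that removes a single direction from a weight matrix by projection, with no gradients involved. It is a rank-one simplification written for this explanation, not the procedure of ProFS or any other surveyed method. The helper names, the toy matrix, and the choice to treat each row as a vector in the same space as `direction` are assumptions, and how an unwanted direction would be identified is left out entirely.

```python
def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))

def project_out(weight_rows, direction):
    """Return a copy of the matrix with `direction` removed from every row.

    Each row w becomes w - (w . d) d, where d is `direction` scaled to
    unit length. The original matrix is left untouched.
    """
    norm = dot(direction, direction) ** 0.5
    d = [x / norm for x in direction]
    edited_rows = []
    for row in weight_rows:
        coef = dot(row, d)  # component of this row along d
        edited_rows.append([w - coef * d_i for w, d_i in zip(row, d)])
    return edited_rows

# Hypothetical 2x2 weights and a hypothetical unwanted direction.
weights = [[1.0, 2.0], [3.0, 4.0]]
unwanted = [0.0, 2.0]

edited = project_out(weights, unwanted)
```

The edit is computed in closed form from the existing weights, which is what separates it from gradient-based model editing: it needs no training data, loss function, or optimizer state.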

### Why Training-Free?

Training-free methods are attractive in practice because they are easy to deploy, quick to develop, and suitable for low-resource environments. They also integrate naturally as pre- or post-processing steps around LLMs. Below, we summarize their key advantages in detail:

- **Efficiency.** Training-based methods such as SFT and RLHF require large datasets and significant GPU resources. In contrast, training-free methods avoid gradient updates entirely, reducing computational cost and requiring less data.
- **Accessibility.** Training-free methods (e.g., prompting) operate effectively in black-box settings and are easily transferable across models, making them particularly well-suited for commercial systems (e.g., OpenAI's GPT-4o and Anthropic's Claude).
- **Auditability.** Training-free methods can be deployed or rolled back quickly at low cost, which enables rapid A/B testing to assess impact. In addition, each intervention can be versioned and logged, facilitating reproducible evaluation, traceability, and compliance with governance or regulatory requirements.
- **Responsiveness.** When new risks appear (e.g., novel jailbreak attacks), model behavior can be constrained or adjusted without retraining, enabling rapid mitigation.

### Literature Search

We collect training-free methods that aim to improve the trustworthiness of LLMs or report trustworthiness evaluations, excluding work focused on general capabilities or efficiency. As summarized in Table 1, our survey includes 27 papers, and we observe that all papers were published after the release of ChatGPT (November 2022). Each method is categorized based on its intervention location within the inference-time information flow—input, internal, or output. Chronologically, the development of these techniques has progressed from input-level prompting and in-context learning strategies (late 2023), to output-level decoding controls (early to mid-2024), and most recently, to internal interventions on activations or parameters (late 2024 to 2025). Across the surveyed papers, jailbreak attacks and safety are the most frequently evaluated aspects, while robustness (e.g., watermarking) and computational overhead are less commonly reported. Code is publicly available for the majority of methods, and most evaluations focus on small to medium-sized models, with fewer results reported for large-scale models.

### Taxonomy of Existing Methods

To systematize existing work, we categorize training-free methods based on the information flow within the inference pipeline: input, internal, and output stages. Each stage represents a distinct location of intervention and entails different levels of model access as shown in Figure 1.

- **Input-level methods.** We consider methods that modify the input prior to model execution. Examples include appending content to the system or user prompt, or inserting demonstrations related to the target behavior. These techniques require access only to the input interface and are often model-agnostic.
- **Internal-level methods.** These methods operate on the model's hidden representations or parameters. They steer behavior by injecting or modifying internal components, such as activations or weights. Because they require access to intermediate states, these techniques are only applicable to open-weight models.
- **Output-level methods.** These methods act during or after decoding to adjust the generated text. They modify output logits to guide next-token generation or perform iterative rewrites to refine an initial draft toward the desired behavior. Because they require access to output logits, they are likewise limited to open-weight models.

This taxonomy offers a structured framework for understanding recent advances in training-free methods, highlighting how different approaches manipulate inputs, internals, or outputs to influence model behavior.
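The three intervention points can be sketched with a toy stand-in for a model. Everything below is hypothetical and exists only to show where each level hooks in: `embed` and `lm_head` are not real LLM components, the reminder wording is invented, and the steering vector and banned token are arbitrary.

```python
from typing import Callable, List, Optional

Vector = List[float]

def embed(prompt: str) -> Vector:
    """Toy hidden state: [number of words, occurrences of 'please']."""
    words = prompt.lower().split()
    return [float(len(words)), float(words.count("please"))]

def lm_head(hidden: Vector) -> Vector:
    """Toy output layer: logits over a three-token vocabulary."""
    weights = [[0.1, 0.0], [0.0, 1.0], [0.05, 0.5]]
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def run(
    prompt: str,
    input_fn: Optional[Callable[[str], str]] = None,
    internal_fn: Optional[Callable[[Vector], Vector]] = None,
    output_fn: Optional[Callable[[Vector], Vector]] = None,
) -> Vector:
    """One forward pass with optional hooks at the three levels."""
    if input_fn is not None:      # input level: rewrite the prompt
        prompt = input_fn(prompt)
    hidden = embed(prompt)
    if internal_fn is not None:   # internal level: edit hidden states
        hidden = internal_fn(hidden)
    logits = lm_head(hidden)
    if output_fn is not None:     # output level: adjust logits
        logits = output_fn(logits)
    return logits

def add_reminder(prompt: str) -> str:
    """Input-level: append a (hypothetical) reminder to the user query."""
    return prompt + "\nRemember to answer responsibly."

def add_steering(hidden: Vector) -> Vector:
    """Internal-level: add a fixed (hypothetical) steering vector."""
    return [h + s for h, s in zip(hidden, [0.0, 2.0])]

def ban_token_zero(logits: Vector) -> Vector:
    """Output-level: make token 0 impossible to sample."""
    return [float("-inf")] + logits[1:]

def argmax(values: Vector) -> int:
    return max(range(len(values)), key=values.__getitem__)

query = "tell me a story"
baseline = run(query)
with_input = run(query, input_fn=add_reminder)
with_internal = run(query, internal_fn=add_steering)
with_output = run(query, output_fn=ban_token_zero)
```

Only `add_reminder` can be applied without touching the model itself, which mirrors the access requirements described above: the other two hooks need the hidden state or the logits.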

### Observations from the Landscape

In Table 1, we present an overview of existing training-free methods. Based on the table, we draw the following observations:

- Most existing methods focus on safety and jailbreak prevention, while side effects
