PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models

Hugging Face Daily Papers 06/08/26, 04:19 PM Papers

llm-safety refusal-framework psychological-intervention fine-tuning prompting ai-safety harm-prevention

Summary

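The paper introduces PsychoSafe, a psychologically-informed refusal framework for large language models, applied to Qwen 3.5 27B through prompting and parameter-efficient fine-tuning. The prompting variant improves overall refusal quality by 28.1% and external resource referral by 46.8% over a generic baseline while preserving performance on non-refusal tasks; the fine-tuned variant reaches near-perfect refusal and referral rates but loses response relevance.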

Large language models (LLMs) routinely face requests that should be refused, creating a trade-off between helpfulness and harm prevention. However, refusals themselves can be helpful. In high-risk interactions involving crisis, coercion, or escalating intent, blunt non-compliance may prevent direct harm while still failing to support the needs of the person behind the request. We present PsychoSafe, a psychologically-informed refusal framework that reframes refusal as structured supportive communication grounded in evidence-based intervention strategies. To develop PsychoSafe, we construct a corpus of 8019 prompt-response pairs spanning five psychologically salient risk domains and apply prompting and parameter-efficient fine-tuning to Qwen 3.5 27B. On a balanced validation set of 500 prompts, evaluated with an LLM judge and validated through human ratings, PsychoSafe prompting improves overall refusal quality by 28.1% over a generic baseline, with particularly strong gains in external resource referral (+46.8%) and psychological grounding (+34.8%), while preserving downstream performance on non-refusal tasks. Fine-tuning achieves near-perfect refusal and resource-referral rates but reduces response relevance. Additional evaluations on SORRY-Bench and XSTest show strong in-domain robustness but limited out-of-domain generalization, suggesting that future work should diversify fine-tuning data to help models apply interventions selectively rather than schematically.
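The reported gains compare judge scores for PsychoSafe against a generic baseline. The sketch below shows one way such a comparison could be computed, assuming the percentages are relative improvements in mean per-dimension scores. The abstract does not say whether the figures are relative gains or percentage points, nor what scale the judge uses, so the 1-5 scale, the helper names, and every score here are hypothetical and are not the paper's data.

```python
from statistics import mean


def relative_gain(baseline: float, treated: float) -> float:
    """Relative improvement of `treated` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (treated - baseline) / baseline * 100.0


def summarize(baseline: dict, treated: dict) -> dict:
    """Per-dimension relative gain of mean judge scores, rounded to 0.1."""
    return {
        dim: round(relative_gain(mean(baseline[dim]), mean(treated[dim])), 1)
        for dim in baseline
    }


# Hypothetical per-prompt judge scores on an assumed 1-5 scale.
# The dimension names follow the abstract; the numbers are invented.
baseline_scores = {
    "resource_referral": [2, 3, 2, 3],
    "psychological_grounding": [3, 3, 2, 4],
}
psychosafe_scores = {
    "resource_referral": [4, 4, 3, 4],
    "psychological_grounding": [4, 4, 3, 4],
}

gains = summarize(baseline_scores, psychosafe_scores)
for dimension, gain in gains.items():
    print(f"{dimension}: {gain:+.1f}%")
```

Reporting each dimension separately, rather than only an overall score, is what would expose a trade-off like the one the abstract describes for fine-tuning, where referral rates rise while response relevance falls.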

Original Article

Cached at: 06/10/26, 09:43 AM

Source: https://huggingface.co/papers/2606.09697

Abstract

A psychologically-informed refusal framework called PsychoSafe is developed for large language models to improve harmful request handling through structured supportive communication, showing enhanced refusal quality and resource referral while maintaining performance on non-refusal tasks.

Large language models (LLMs) routinely face requests that should be refused, creating a trade-off between helpfulness and harm prevention. However, refusals themselves can be helpful. In high-risk interactions involving crisis, coercion, or escalating intent, blunt non-compliance may prevent direct harm while still failing to support the needs of the person behind the request. We present PsychoSafe, a psychologically-informed refusal framework that reframes refusal as structured supportive communication grounded in evidence-based intervention strategies. To develop PsychoSafe, we construct a corpus of 8019 prompt-response pairs spanning five psychologically salient risk domains and apply prompting and parameter-efficient fine-tuning to Qwen 3.5 27B. On a balanced validation set of 500 prompts, evaluated with an LLM judge and validated through human ratings, PsychoSafe prompting improves overall refusal quality by 28.1% over a generic baseline, with particularly strong gains in external resource referral (+46.8%) and psychological grounding (+34.8%), while preserving downstream performance on non-refusal tasks. Fine-tuning achieves near-perfect refusal and resource-referral rates but reduces response relevance. Additional evaluations on SORRY-Bench and XSTest show strong in-domain robustness but limited out-of-domain generalization, suggesting that future work should diversify fine-tuning data to help models apply interventions selectively rather than schematically.
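The abstract does not name the parameter-efficient fine-tuning method. As an illustration of why such methods are attractive for a 27B model, the sketch below counts trainable parameters for a low-rank adapter (LoRA, one common PEFT technique, used here only as an assumed example) on a single weight matrix. The layer dimensions and rank are hypothetical and are not taken from Qwen 3.5 27B or from the paper.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated when fine-tuning a dense d_out x d_in weight matrix."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in a low-rank adapter: A is (rank x d_in), B is (d_out x rank).

    The frozen weight W is left untouched; only A and B are trained, and the
    effective weight is W + B @ A (up to a scaling factor).
    """
    return rank * d_in + d_out * rank


# Hypothetical layer size and adapter rank, chosen only for the arithmetic.
d_in = d_out = 4096
rank = 16

full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
fraction = lora / full

print(f"full fine-tuning: {full:,} parameters")
print(f"low-rank adapter: {lora:,} parameters ({fraction:.2%} of full)")
```

Under these assumed sizes the adapter trains under 1% of the parameters of the matrix it modifies, which is the general reason PEFT makes fine-tuning a model of this scale tractable.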

Get this paper in your agent:

hf papers read 2606.09697

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

giannor/Qwen3.5-27B-psysafe (Image-Text-to-Text, 27B)

Similar Articles

Deliberative alignment: reasoning enables safer language models

When Choices Become Risks: Safety Failures of Large Language Models under Multiple-Choice Constraints

LinguIUTics at PsyDefDetect: Iterative Imbalance-Aware Fine-tuning of Qwen3-8B for Psychological Defense Mechanism Classification

From hard refusals to safe-completions: toward output-centric safety training

could refusal layers be masking dialect-conditioned safety failures in MoE models [d]
