PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models
Summary
The paper introduces PsychoSafe, a psychologically-informed refusal framework for large language models, applied to Qwen 3.5 27B through prompting and parameter-efficient fine-tuning. The prompting variant improves overall refusal quality by 28.1% and external resource referral by 46.8% over a generic baseline while preserving performance on non-refusal tasks.
Cached at: 06/10/26, 09:43 AM
Source: https://huggingface.co/papers/2606.09697
Abstract
A psychologically-informed refusal framework called PsychoSafe is developed for large language models to improve harmful request handling through structured supportive communication, showing enhanced refusal quality and resource referral while maintaining performance on non-refusal tasks.
Large language models (LLMs) routinely face requests that should be refused, creating a trade-off between helpfulness and harm prevention. However, refusals themselves can be helpful. In high-risk interactions involving crisis, coercion, or escalating intent, blunt non-compliance may prevent direct harm while still failing to support the needs of the person behind the request. We present PsychoSafe, a psychologically-informed refusal framework that reframes refusal as structured supportive communication grounded in evidence-based intervention strategies. To develop PsychoSafe, we construct a corpus of 8019 prompt-response pairs spanning five psychologically salient risk domains and apply prompting and parameter-efficient fine-tuning to Qwen 3.5 27B. On a balanced validation set of 500 prompts, evaluated with an LLM judge and validated through human ratings, PsychoSafe prompting improves overall refusal quality by 28.1% over a generic baseline, with particularly strong gains in external resource referral (+46.8%) and psychological grounding (+34.8%), while preserving downstream performance on non-refusal tasks. Fine-tuning achieves near-perfect refusal and resource-referral rates but reduces response relevance. Additional evaluations on SORRY-Bench and XSTest show strong in-domain robustness but limited out-of-domain generalization, suggesting that future work should diversify fine-tuning data to help models apply interventions selectively rather than schematically.
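As a rough illustration of how per-dimension gains like those in the abstract could be computed from judge ratings, here is a minimal Python sketch. It assumes the reported percentages are relative improvements over the baseline mean, which the abstract does not state explicitly. The rating scale, the toy values, and the helper names `mean_scores` and `relative_gain` are all hypothetical and are not taken from the paper; only the two dimension names come from the abstract.

```python
from statistics import mean


def mean_scores(ratings):
    """Average judge ratings per dimension.

    `ratings` is a list of dicts mapping a dimension name to a numeric score,
    one dict per evaluated response.
    """
    dims = ratings[0].keys()
    return {d: mean(r[d] for r in ratings) for d in dims}


def relative_gain(baseline, treated):
    """Percent change of `treated` over `baseline`, per dimension."""
    return {
        d: 100.0 * (treated[d] - baseline[d]) / baseline[d]
        for d in baseline
    }


# Toy ratings, invented for illustration only (NOT results from the paper).
baseline_ratings = [
    {"resource_referral": 2.0, "psychological_grounding": 2.0},
    {"resource_referral": 2.0, "psychological_grounding": 3.0},
]
psychosafe_ratings = [
    {"resource_referral": 3.0, "psychological_grounding": 3.0},
    {"resource_referral": 3.0, "psychological_grounding": 3.5},
]

gains = relative_gain(
    mean_scores(baseline_ratings),
    mean_scores(psychosafe_ratings),
)
for dimension, gain in gains.items():
    print(f"{dimension}: {gain:+.1f}%")
```

If the paper instead reports absolute differences on the judge's scale, the division by the baseline mean in `relative_gain` would be dropped.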
Models citing this paper: 1
giannor/Qwen3.5-27B-psysafe (Image-Text-to-Text, 27B)
Similar Articles
Deliberative alignment: reasoning enables safer language models
OpenAI presents 'deliberative alignment,' a technique where language models explicitly reason through safety policies before responding, enabling more robust refusals of disallowed content including obfuscated or encoded harmful requests.
When Choices Become Risks: Safety Failures of Large Language Models under Multiple-Choice Constraints
Researchers identify a systematic safety failure in LLMs where reformulating harmful requests as forced-choice multiple-choice questions (MCQs) bypasses refusal behavior, even in models that reject equivalent open-ended prompts. Evaluated across 14 proprietary and open-source models, the study reveals current safety benchmarks substantially underestimate risks in structured decision-making settings.
LinguIUTics at PsyDefDetect: Iterative Imbalance-Aware Fine-tuning of Qwen3-8B for Psychological Defense Mechanism Classification
This paper presents an iterative imbalance-aware fine-tuning approach using Qwen3-8B with QLoRA for psychological defense mechanism classification, achieving a macro F1 of 0.3917 and ranking 4th out of 21 teams in the PsyDefDetect 2026 shared task.
From hard refusals to safe-completions: toward output-centric safety training
OpenAI introduced 'safe completions,' a new safety-training approach in GPT-5 that replaces binary refusal-based training with output-centric rewards, improving both safety and helpfulness—especially for dual-use prompts. The method penalizes unsafe outputs and rewards helpful responses, resulting in fewer and less severe safety violations compared to refusal-trained models like o3.
could refusal layers be masking dialect-conditioned safety failures in MoE models [d]
Tests on Qwen3.5-35B-A3B show that AAVE-coded prompts cause MoE models to respond differently, with refusal layers masking dialect-conditioned safety failures that become visible when refusal is weakened.