LinguIUTics at PsyDefDetect: Iterative Imbalance-Aware Fine-tuning of Qwen3-8B for Psychological Defense Mechanism Classification

Summary

This paper presents an iterative imbalance-aware fine-tuning approach using Qwen3-8B with QLoRA for psychological defense mechanism classification, achieving a macro F1 of 0.3917 and ranking 4th out of 21 teams in the PsyDefDetect 2026 shared task.

arXiv:2606.00647v1 Announce Type: new Abstract: Detecting psychological defense mechanisms in conversational text remains a challenging clinical NLP problem. For the PsyDefDetect 2026 shared task (nine-class utterance classification evaluated via macro F1), our team LinguIUTics achieves a macro F1-score of 0.3917 on the official positive-class leaderboard, ranking 4th out of 21 registered teams and improving over the Ministral-8B task baseline (31.48 macro F1) by 7.7 absolute points (24.4 percent relative). BERT-family encoders and zero-shot LLMs proved ineffective on rare classes due to severe class imbalance, leading us to QLoRA fine-tuning of Qwen3-8B. We leverage three key strategies: grouped stratified cross-validation (preventing leakage), minority-class round-robin lexical augmentation, and a post-processing pipeline with logit bias tuning and ensemble blending. Together, these components close much of the validation-to-leaderboard gap and substantially improve minority-class recall, driving the critical "Unclear" class (Level 8) from near-zero performance to an F1 score of 0.797.

# Iterative Imbalance-Aware Fine-tuning of Qwen3-8B for Psychological Defense Mechanism Classification

Accepted at PsyDefDetect, a shared task at the 25th BioNLP Workshop (BioNLP 2026), co-located with ACL 2026 in San Diego, CA, USA.

Source: [https://arxiv.org/html/2606.00647](https://arxiv.org/html/2606.00647)
Shefayat E Shams Adib∗, Ahmed Alfey Sani∗, Md Hasibur Rahman Alif∗, Ajwad Abrar∗
Department of Computer Science and Engineering, Islamic University of Technology, Dhaka, Bangladesh
{shefayatadib, ahmedalfey, hasiburrahman21, ajwadabrar}@iut-dhaka.edu
∗All authors contributed equally to this work.

###### Abstract

Detecting psychological defense mechanisms in conversational text remains a challenging clinical NLP problem. For the PsyDefDetect 2026 shared task (9-class utterance classification evaluated via macro F1), our team LinguIUTics (code and resources available at [https://github.com/Shefwef/LingIUTics-PsyDefDetect-BIONLP26](https://github.com/Shefwef/LingIUTics-PsyDefDetect-BIONLP26)) achieves a macro F1-score of 0.3917 on the official positive-class leaderboard, ranking 4th out of 21 registered teams and improving over the Ministral-8B task baseline (31.48 macro F1) by +7.7 absolute points (+24.4% relative). BERT-family encoders and zero-shot LLMs proved ineffective on rare classes due to severe class imbalance, leading us to QLoRA fine-tuning of Qwen3-8B. We leverage three key strategies: grouped stratified cross-validation (preventing leakage), minority-class round-robin lexical augmentation, and a post-processing pipeline with logit bias tuning and ensemble blending. Together, these components close much of the validation–leaderboard gap and substantially improve minority-class recall, driving the critical “Unclear” class (Level 8) from near-zero performance to F1 = 0.797.

## 1 Introduction

Automatic detection of psychological defense mechanisms (unconscious strategies to mitigate distress under the DMRS framework (Perry and Henry, [2004](https://arxiv.org/html/2606.00647#bib.bib3))) helps mental health platforms flag maladaptive coping and improves empathetic conversational agents (Liu et al., [2021](https://arxiv.org/html/2606.00647#bib.bib1); Na et al., [2025](https://arxiv.org/html/2606.00647#bib.bib17)). The PsyDefDetect 2026 shared task (Na et al., [2026a](https://arxiv.org/html/2606.00647#bib.bib16), [b](https://arxiv.org/html/2606.00647#bib.bib2)) challenges participating systems to classify seeker utterances into nine DMRS levels. The major obstacle is extreme class imbalance (He and Garcia, [2009](https://arxiv.org/html/2606.00647#bib.bib7)): the majority class ("High-Adaptive", 51.9%) is about 34.6 times more frequent than the rarest class ("Unclear", 1.5%). Because evaluation uses macro-averaged F1, optimizing for accuracy leads to majority-class collapse and task failure.

To address this, we followed an iterative development process. Standard single-fold PEFT (parameter-efficient fine-tuning) on Qwen3-8B (Yang et al., [2025](https://arxiv.org/html/2606.00647#bib.bib6)) suffered a large generalization gap (0.34 validation vs. 0.24 leaderboard F1) due to limited low-rank capacity and majority-class overfitting. By systematically upgrading model capacity, loss functions, and inference, we established a robust pipeline. Our key contributions are:

1. A leakage-safe cross-validation scheme at the group level, in which synthetic augmentations stay in the same fold as their source utterances. This leads to an order of magnitude smaller generalization gap between out-of-fold (OOF) and leaderboard scores.
2. An oversampling method that preserves the original psychological signal (Wei and Zou, [2019](https://arxiv.org/html/2606.00647#bib.bib8)), expanding specific minority classes threefold through round-robin lexical mutation.
3. A post-processing pipeline that combines OOF-based logit bias tuning, guarded by v2 decoding, with multi-seed probability blending.

## 2 Task and Dataset

The PsyDefDetect 2026 task (Na et al., [2026a](https://arxiv.org/html/2606.00647#bib.bib16)) classifies seeker utterances into nine psychological defense levels, as defined by the DMRS framework (Perry and Henry, [2004](https://arxiv.org/html/2606.00647#bib.bib3)), and is evaluated by macro-averaged F1-score. The PsyDefConv dataset (Na et al., [2026b](https://arxiv.org/html/2606.00647#bib.bib2)) contains 2,336 utterances across 200 dialogues from ESConv (Liu et al., [2021](https://arxiv.org/html/2606.00647#bib.bib1)). For 5-fold CV, we merged the train and validation sets (1,864 training examples; 472 test examples). The dataset exhibits very high class imbalance, with a 34.6× frequency gap (Table [1](https://arxiv.org/html/2606.00647#S2.T1)).

Table 1: PsyDefConv combined training class distribution (Train + Val splits). Level 7 vs. Level 8 ratio: 34.6×.

## 3 Implementation Process

Our development went through three iterative stages, each revealing a fundamental limitation that directly inspired the next architectural transition. Table [10](https://arxiv.org/html/2606.00647#A5.T10) traces the complete leaderboard path from R0 (F1 = 0.240) to our final submission (F1 = 0.392). The detailed system run log is provided in Table [2](https://arxiv.org/html/2606.00647#S3.T2).

Table 2: Complete system run log. LB = CodaBench leaderboard.

![Refer to caption](https://arxiv.org/html/2606.00647v1/x1.png)

Figure 1: Full system pipeline. Phase 1: raw training data is preprocessed and minority classes (Levels 2, 3, 4, 5, 8) are oversampled via round-robin lexical mutation, creating an expanded training set of 2,600+ utterances. Phase 2: two independent grouped stratified 5-fold QLoRA fine-tuning runs of Qwen3-8B, the Anchor (seed = 42) and Seed-A (seed = 20260407), sharing identical architecture and hyperparameters, with dialogues grouped across folds to prevent data leakage. Phase 3: a class-specific logit bias vector ($\delta_c$) is optimised on Anchor OOF predictions and locked for reuse; test probabilities from both runs are blended before a final guarded decode step that maximises minority-class recall without sacrificing majority-class precision.

### 3.1 Stage 1: BERT-Family Encoder Baselines

We evaluated MentalBERT, MentalRoBERTa, DeBERTa-v3-base, and RoBERTa-base (Devlin et al., [2019](https://arxiv.org/html/2606.00647#bib.bib15); Ji et al., [2022](https://arxiv.org/html/2606.00647#bib.bib12); He et al., [2021](https://arxiv.org/html/2606.00647#bib.bib13); Liu et al., [2019](https://arxiv.org/html/2606.00647#bib.bib14)) across multiple context windows, loss functions (cross-entropy, Focal (Lin et al., [2017](https://arxiv.org/html/2606.00647#bib.bib10)), EMD), and ensemble strategies (full scores in Appendix [E](https://arxiv.org/html/2606.00647#A5), Table [10](https://arxiv.org/html/2606.00647#A5.T10)). The best validation macro F1 was 0.314 (MentalRoBERTa + Focal/EMD + Hungarian remapping), with a leaderboard peak of 0.240. Importantly, F1 for Classes 3, 5 and 8 remained zero across *all* variants, establishing an *encoder capacity ceiling* under a 51.8% majority-class prior with $n \leq 50$ minority examples. The focal loss has the general form:

$$\mathcal{L}_{\text{Focal}} = -(1 - p_t)^{\gamma} \log p_t \tag{1}$$

where $p_t$ is the predicted probability of the correct class and the factor $(1 - p_t)^{\gamma}$, controlled by the focusing constant $\gamma$, down-weights examples that are already classified confidently.
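As a minimal sketch of Eq. (1), the snippet below evaluates the focal loss for a single example and compares it with plain cross-entropy. The function names and the value $\gamma = 2$ are illustrative assumptions; the $\gamma$ used in our encoder runs is not restated here.

```python
import math


def cross_entropy(p_t: float) -> float:
    """Plain cross-entropy for one example, given the true-class probability."""
    return -math.log(p_t)


def focal_loss(p_t: float, gamma: float = 2.0) -> float:
    """Focal loss of Eq. (1); gamma=2.0 is an assumed, illustrative value."""
    if not 0.0 < p_t <= 1.0:
        raise ValueError("p_t must lie in (0, 1]")
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# An easy example (p_t = 0.9) is down-weighted far more than a hard one (p_t = 0.1).
easy_ratio = focal_loss(0.9) / cross_entropy(0.9)  # (1 - 0.9)^2
hard_ratio = focal_loss(0.1) / cross_entropy(0.1)  # (1 - 0.1)^2
```

With $\gamma = 0$ the expression reduces to cross-entropy, which is a convenient sanity check for any implementation.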

### 3.2 Stage 2: Zero-Shot Evaluation

Qwen3-8B, Llama 3.1-8B, and Ministral-8B, evaluated zero-shot with explicit DMRS label definitions, produced 8–16% macro F1 (Table [10](https://arxiv.org/html/2606.00647#A5.T10)), indicating that prompting alone does not supply the required task knowledge.

### 3.3 Stage 3: Diagnostic LLM Fine-Tuning

Ministral-8B fine-tuned with 4-bit NF4 quantization achieved 64.71% accuracy but only 14.74 macro F1 (Table [10](https://arxiv.org/html/2606.00647#A5.T10)). This illustrates that standard cross-entropy collapses to majority-class prediction under severe imbalance, and that accuracy is an actively misleading metric in this setting.
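The gap between accuracy and macro F1 under majority-class collapse can be reproduced on a toy example. The sketch below uses a hypothetical 100-example label set (not the PsyDefConv distribution) in which class 7 holds 52 examples, and scores a classifier that always predicts class 7.

```python
def accuracy(y_true, y_pred):
    """Fraction of exact matches."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1; a class with no TP, FP or FN scores 0."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


# Hypothetical toy labels: 52 examples of class 7, six of each other class.
y_true = [7] * 52 + [c for c in range(9) if c != 7 for _ in range(6)]
y_pred = [7] * len(y_true)  # collapsed model: always the majority class

acc = accuracy(y_true, y_pred)      # 0.52
mf1 = macro_f1(y_true, y_pred, 9)   # about 0.076
```

Only one of the nine per-class F1 terms is non-zero, so the macro average stays below 0.08 even though more than half of the predictions are correct.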

### 3.4 Stage 4: Final System (Qwen3-8B LoRA Pipeline)

Guided by the three lessons above, our final pipeline consists of five components: model architecture, imbalance-aware training objective, data augmentation, leakage-safe cross-validation and a post-processing ensemble, each targeting a specific failure mode identified in the earlier stages.

#### 3.4.1 Model Architecture

We fine-tune Qwen3-8B via QLoRA (Dettmers et al., [2023](https://arxiv.org/html/2606.00647#bib.bib5)) with 4-bit NF4 quantization, reducing peak GPU memory from ~32 GB to ~8 GB on a single NVIDIA RTX 3090 Ti. LoRA adapters (Hu et al., [2021](https://arxiv.org/html/2606.00647#bib.bib4)) target all attention and feed-forward layers (q, k, v, o, gate, up, down) and the score head ($r = 128$, $\alpha = 256$, dropout = 0.1), yielding ≈31M trainable parameters (0.4% of the 8B base). Increasing rank from $r = 64$ to $r = 128$ delivered a +24.9% fold-level F1 gain, critical for separating psychologically similar classes (e.g., Level 4 vs. Level 5). Full training hyperparameters for this configuration are summarised in Table [8](https://arxiv.org/html/2606.00647#A2.T8) in Appendix [B](https://arxiv.org/html/2606.00647#A2).

#### 3.4.2 Input Representation

Each prompt has three parts: (1) the DMRS Label Guide with the 9-class clinical schema; (2) the Conversational Context, consisting of the last 30 dialogue turns prefixed by SEEKER:/HELPER: tags; and (3) an Output Instruction directing the model to emit a single integer (0–8). Inputs are tokenised with a maximum length of 1,024 tokens and dynamic padding to multiples of 8, covering >95% of samples without truncation.
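The three-part prompt can be sketched as follows. The guide text, the instruction wording, the separator and the helper names are all placeholders chosen for illustration; the exact strings used in our prompts are not reproduced here.

```python
# Placeholder texts: the real label guide and instruction wording are not shown here.
LABEL_GUIDE = "DMRS Label Guide: <definitions of the nine levels>"
OUTPUT_INSTRUCTION = "Answer with a single integer from 0 to 8."


def build_prompt(turns, max_turns=30):
    """Assemble guide + tagged context + instruction.

    turns: list of (speaker, text) pairs, speaker in {"seeker", "helper"},
    ordered oldest to newest; only the last `max_turns` are kept.
    """
    context = turns[-max_turns:]
    lines = [f"{speaker.upper()}: {text}" for speaker, text in context]
    return "\n\n".join([LABEL_GUIDE, "\n".join(lines), OUTPUT_INSTRUCTION])


def padded_length(n_tokens, multiple=8, max_len=1024):
    """Length after truncating to max_len and rounding up to a multiple of 8."""
    n = min(n_tokens, max_len)
    return min(-(-n // multiple) * multiple, max_len)


# Hypothetical 40-turn dialogue alternating seeker and helper.
dialogue = [("seeker" if i % 2 == 0 else "helper", f"turn {i}") for i in range(40)]
prompt = build_prompt(dialogue)
```

Rounding padded lengths to multiples of 8 is a common choice for tensor-core efficiency; whether it matters in practice depends on the hardware and kernels used.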

#### 3.4.3 Imbalance-Aware Training Objective

The dataset is heavily skewed towards Level 7. To mitigate majority-class collapse, we apply two stabilization techniques.

##### Inverse\-Square\-Root Class Weighting\.

Per-class weights

$$w_c = \frac{1/\sqrt{n_c}}{\sum_{i=0}^{8} 1/\sqrt{n_i}} \tag{2}$$

boost the most under-represented classes (e.g., $w_8 = 1.67$, $w_5 = 1.29$) while dampening the majority class ($w_7 = 0.28$). We prefer this to inverse-frequency weighting, which can lead to gradient instabilities.
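A minimal sketch of Eq. (2) follows. The class counts are hypothetical placeholders: only the largest and smallest values are chosen to echo the reported 34.6× gap, and the rest are invented for illustration. Eq. (2) as written yields weights that sum to one, so the reported magnitudes (such as $w_8 = 1.67$) imply a further rescaling; the `scale` argument is an assumption added to expose that choice, not a documented part of the pipeline.

```python
import math


def inv_sqrt_weights(counts, scale=1.0):
    """Inverse-square-root class weights, normalised as in Eq. (2).

    counts: dict mapping class id -> number of training examples.
    scale:  optional multiplier (assumed; e.g. the number of classes gives mean-one weights).
    """
    raw = {c: 1.0 / math.sqrt(n) for c, n in counts.items()}
    total = sum(raw.values())
    return {c: scale * r / total for c, r in raw.items()}


# Hypothetical counts for nine classes (placeholders, not the Table 1 values).
counts = {0: 120, 1: 150, 2: 60, 3: 40, 4: 84, 5: 45, 6: 369, 7: 968, 8: 28}

weights = inv_sqrt_weights(counts)
mean_one = inv_sqrt_weights(counts, scale=len(counts))
```

Because the weights depend on $\sqrt{n_c}$ rather than $n_c$, the ratio between the rarest and the most frequent class is the square root of the frequency gap (about 5.9 for a 34.6× gap) instead of the gap itself.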

##### Label Smoothing & Optimization Schedule\.

By using label smoothing (Szegedy et al., [2016](https://arxiv.org/html/2606.00647#bib.bib9)) ($\varepsilon = 0.05$), we avoid early logit saturation for Level 7. This prevents gradients from being dominated by the majority class and allows minority-class signals to contribute more effectively during back-propagation. We employ AdamW with a peak learning rate of $1.2 \times 10^{-4}$, cosine annealing (8% warmup), 10 epochs per fold, batch size 16 ($2 \times 8$ accumulation), and gradient clipping at 0.3 (see Appendix [B](https://arxiv.org/html/2606.00647#A2)).
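The effect of label smoothing on a saturated prediction can be illustrated with a short sketch. It assumes the standard formulation in which the one-hot target is mixed with a uniform distribution over the $K$ classes; the function name and the example probabilities are hypothetical.

```python
import math


def smoothed_cross_entropy(probs, target, eps=0.05):
    """Cross-entropy against a smoothed target: (1 - eps) * one_hot + eps / K."""
    k = len(probs)
    loss = 0.0
    for c, p in enumerate(probs):
        q = (1.0 - eps) * (1.0 if c == target else 0.0) + eps / k
        loss -= q * math.log(p)
    return loss


# Hypothetical near-saturated prediction for Level 7 over nine classes.
confident = [0.001] * 9
confident[7] = 0.992

hard = smoothed_cross_entropy(confident, target=7, eps=0.0)    # ordinary cross-entropy
smooth = smoothed_cross_entropy(confident, target=7, eps=0.05)
```

With hard targets the loss on this example is almost zero, so pushing the Level 7 logit further is rewarded; with smoothing the loss stays clearly positive, which discourages that saturation.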

#### 3.4.4 Data Augmentation

The rarest classes (Levels 2, 3, 4, 5, and 8) have between 28 and 84 training examples each. This is not enough for an 8B-parameter model to learn reliable decision boundaries. To address this, we perform *round-robin lexical mutation* to generate $k = 3$ synthetic variants per source utterance in these classes, cycling through three surface-level rewriting modes:

- Mode 0: Contraction replacement (e.g., I am → I’m) plus a hedging prefix (Honestly, …).
- Mode 1: Vocabulary style-shift (e.g., maybe → perhaps) plus a trailing filler (… I guess.).
- Mode 2: Hesitation markers (e.g., periods → ellipses; ? → ??).

Mutations target only the seeker utterance to preserve the psychological signal; after deduplication, minority class counts increased from 28–84 to 65–252 examples (see Appendix [C](https://arxiv.org/html/2606.00647#A3)). This targeted minority-class oversampling strategy is consistent with prior findings that augmenting only underrepresented classes yields more effective and stable performance improvements than augmenting all classes equally (Sani et al., [2026](https://arxiv.org/html/2606.00647#bib.bib18)).
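The three modes can be sketched with a few regular expressions. The rules below cover only the single substitution named for each mode and are a simplified, hypothetical stand-in for the full rewriting rules; the example utterance is invented.

```python
import re


def mode0(text):
    """Contraction replacement plus a hedging prefix."""
    return "Honestly, " + re.sub(r"\bI am\b", "I'm", text)


def mode1(text):
    """Vocabulary style-shift (maybe -> perhaps) plus a trailing filler."""
    def swap(match):
        return "Perhaps" if match.group(0)[0].isupper() else "perhaps"

    text = re.sub(r"\bmaybe\b", swap, text, flags=re.IGNORECASE)
    if text.endswith("."):
        return text[:-1] + ", I guess."
    return text + " I guess."


def mode2(text):
    """Hesitation markers: single periods -> ellipses, '?' -> '??'."""
    text = re.sub(r"(?<!\.)\.(?!\.)", "...", text)
    return re.sub(r"\?+", "??", text)


MODES = (mode0, mode1, mode2)


def augment(utterance, k=3):
    """Generate up to k variants by cycling through the modes, dropping duplicates."""
    variants = []
    for i in range(k):
        candidate = MODES[i % len(MODES)](utterance)
        if candidate != utterance and candidate not in variants:
            variants.append(candidate)
    return variants


# Hypothetical seeker utterance.
variants = augment("I am okay. Maybe it was not that bad.")
```

Each variant keeps the content words of the source utterance, which is the property the augmentation relies on to keep the original label valid.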

#### 3.4.5 Grouped Five-Fold Cross-Validation

Random splitting risks leakage across the 200 source dialogues, making dialogue-level isolation essential. We therefore apply grouped stratified cross-validation (grouped CV, implemented as `StratifiedGroupKFold` with $k = 5$) using `dialogue_id` as the grouping key:

- Zero Leakage Guarantee (0 leaked dialogues confirmed): All utterances and their synthetic variants are kept entirely within one fold.
- Reliable Validation Signal (OOF-to-leaderboard gap reduced from 9.6 to 1.7–4.5 points): Strong rank-correlation between OOF and leaderboard gains enables safe threshold tuning.
- Ensemble Foundation (5 checkpoints per seed): Five-fold training yields pure OOF predictions for post-processing calibration and reduces inference variance.
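The leakage property listed above can be checked mechanically. The sketch below is a simplified, dependency-free stand-in for `StratifiedGroupKFold`: it assigns whole dialogues to folds with a greedy heuristic that spreads labels across folds, and then verifies that no dialogue appears in more than one fold. The heuristic, the function names and the toy data are assumptions for illustration, not the scikit-learn algorithm.

```python
from collections import Counter, defaultdict


def grouped_folds(group_ids, labels, k=5):
    """Assign every group to exactly one fold, greedily balancing label counts."""
    group_labels = defaultdict(Counter)
    for g, y in zip(group_ids, labels):
        group_labels[g][y] += 1

    fold_counts = [Counter() for _ in range(k)]
    fold_of = {}
    # Place the largest groups first so that small ones can even things out.
    order = sorted(group_labels, key=lambda g: -sum(group_labels[g].values()))
    for g in order:
        counts = group_labels[g]
        best = min(
            range(k),
            key=lambda f: (
                sum(fold_counts[f][y] * n for y, n in counts.items()),  # label overlap
                sum(fold_counts[f].values()),                           # fold size
            ),
        )
        fold_of[g] = best
        fold_counts[best].update(counts)
    return [fold_of[g] for g in group_ids]


def leaked_groups(group_ids, folds):
    """Return the groups whose rows ended up in more than one fold."""
    seen = defaultdict(set)
    for g, f in zip(group_ids, folds):
        seen[g].add(f)
    return [g for g, fs in seen.items() if len(fs) > 1]


# Toy data: 10 dialogues of 3 utterances; the last two rows are synthetic
# variants that inherit the dialogue_id of their source (dialogue 3).
dialogue_ids = [d for d in range(10) for _ in range(3)] + [3, 3]
labels = [7, 7, 8] * 10 + [8, 8]

folds = grouped_folds(dialogue_ids, labels, k=5)
leaks = leaked_groups(dialogue_ids, folds)
```

The key design point is that synthetic variants carry the `dialogue_id` of their source utterance, so any group-aware splitter keeps them on the same side of the train/validation boundary.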

Reliable validation behaviour across folds is shown in the per-fold OOF metrics in Table [3](https://arxiv.org/html/2606.00647#S3.T3).

Table 3: Per-fold CV results (grouped-clean augmented run, before seed blending).

#### 3.4.6 Post-Processing and Ensemble Strategy

Despite class\-weighted training, raw probabilities remain heavily biased towards the majority class \(Level 7\)\. To rectify this and recover rare classes without compromising precision, we implement a three\-stage post\-training pipeline \(v2decode\)\.

##### Stage A: OOF Bias Optimization\.

Using logit adjustment for long-tail learning (Menon et al., [2021](https://arxiv.org/html/2606.00647#bib.bib11)), we search for class-specific offsets ($\delta_c$), added to the log-probabilities, that maximize the OOF macro F1 score:

$$\hat{y} = \arg\max_c \bigl[\log p_c + \delta_c\bigr] \tag{3}$$

We evaluate approximately 22,000 randomly sampled bias vectors on OOF predictions to identify a configuration that balances majority precision with minority recall. The best locked vector applies a negative penalty to Level 7 ($\delta_7 < 0$) and substantial positive bonuses to minority classes such as Level 8 ($\delta_8 > 0$).
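A random search over bias vectors for Eq. (3) can be sketched as below. The sampling distribution (uniform in $[-2, 2]$), the trial count and the three-class toy data are assumptions for illustration; the sampling scheme behind the roughly 22,000 evaluated vectors is not specified here.

```python
import math
import random


def decode(log_probs, delta):
    """Eq. (3): argmax over log p_c + delta_c."""
    scores = [lp + d for lp, d in zip(log_probs, delta)]
    return max(range(len(scores)), key=scores.__getitem__)


def macro_f1(y_true, y_pred, num_classes):
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


def search_bias(oof_probs, y_true, num_classes, n_trials=500, span=2.0, seed=0):
    """Keep the sampled bias vector with the best OOF macro F1 (zero vector as baseline)."""
    rng = random.Random(seed)
    log_p = [[math.log(max(p, 1e-12)) for p in row] for row in oof_probs]
    best_delta = [0.0] * num_classes
    best_f1 = macro_f1(y_true, [decode(r, best_delta) for r in log_p], num_classes)
    for _ in range(n_trials):
        delta = [rng.uniform(-span, span) for _ in range(num_classes)]
        f1 = macro_f1(y_true, [decode(r, delta) for r in log_p], num_classes)
        if f1 > best_f1:
            best_delta, best_f1 = delta, f1
    return best_delta, best_f1


# Toy OOF predictions: class 0 dominates, minority rows put it first by a small margin.
y_true = [0, 0, 0, 0, 1, 2]
oof_probs = [[0.8, 0.1, 0.1]] * 4 + [[0.5, 0.4, 0.1], [0.5, 0.1, 0.4]]

zero = [0.0, 0.0, 0.0]
baseline = macro_f1(y_true, [decode([math.log(p) for p in r], zero) for r in oof_probs], 3)
best_delta, best_f1 = search_bias(oof_probs, y_true, num_classes=3)
```

Because the offsets are fitted on out-of-fold predictions, they are tuned on examples the corresponding fold model never saw, which is what makes locking and reusing them on the test set defensible.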

##### Stage B: Multi\-Seed Blending\.

We run a second identical 5-fold training pipeline, denoted Seed-A, using a different random seed (seed = 20260407) and the same architecture and hyperparameters. We combine the test-set probabilities of the original Anchor and Seed-A using a 30/70 weighted average:

$$p_{\text{blend}} = 0.30 \cdot p_{\text{anchor}} + 0.70 \cdot p_{\text{seedA}} \tag{4}$$

The ratio was tuned using real-only OOF F1, combining the Anchor’s high precision with Seed-A’s strong minority recall.
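Eq. (4) amounts to a per-class convex combination of the two probability vectors. A minimal sketch, with hypothetical two-class inputs chosen only to keep the arithmetic easy to follow:

```python
def blend(p_anchor, p_seed_a, w_anchor=0.30):
    """Eq. (4): weighted average of two probability vectors of equal length."""
    if len(p_anchor) != len(p_seed_a):
        raise ValueError("probability vectors must have the same length")
    return [w_anchor * a + (1.0 - w_anchor) * s for a, s in zip(p_anchor, p_seed_a)]


# Hypothetical probabilities from the two runs for one test utterance.
p_blend = blend([0.6, 0.4], [0.2, 0.8])  # approximately [0.32, 0.68]
```

Since the weights sum to one, the blended vector remains a valid probability distribution whenever both inputs are.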

##### Stage C: The $\tau_7$-Gate Decoding.

The locked $\delta_c$ bias vector is applied to $p_{\text{blend}}$ behind a confidence safeguard that prevents precision collapse:

- $\tau_7$-Protection Gate: If $p_{\text{blend},7} \geq 0.69$, the prediction is locked to Level 7 and the $\delta_c$ offsets are not applied.
- Minority Rerouting: If $p_{\text{blend},7} < 0.69$, the $\delta_c$ offsets are applied, rerouting ambiguous samples into the highest-probability minority class.

Minority labels are thus recovered aggressively only when the model is uncertain, which increases minority recall while protecting majority-class precision.
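The gate logic can be written in a few lines. In the sketch below, the threshold 0.69 and the majority index 7 come from the description above, while the bias values and the two probability vectors are hypothetical and chosen only to exercise both branches.

```python
import math


def guarded_decode(p_blend, delta, majority=7, tau=0.69):
    """Return the majority class when it is confident, else the biased argmax of Eq. (3)."""
    if p_blend[majority] >= tau:
        return majority  # protection gate: offsets are not applied
    scores = [math.log(max(p, 1e-12)) + d for p, d in zip(p_blend, delta)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical locked bias vector: penalise Level 7, boost Level 8.
delta = [0.0] * 9
delta[7], delta[8] = -1.0, 1.5

# Confident majority prediction: the gate keeps Level 7.
p_confident = [0.025] * 9
p_confident[7] = 0.80

# Uncertain prediction: the offsets reroute it to Level 8.
p_unsure = [0.2 / 7] * 9
p_unsure[7], p_unsure[8] = 0.5, 0.3

kept = guarded_decode(p_confident, delta)
rerouted = guarded_decode(p_unsure, delta)
```

Without the gate, the same bias vector would also be applied to the confident example; the gate limits the intervention to cases where the blended model is already unsure about Level 7.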

## 4 Experiments and Results

### 4.1 Cross-Paradigm Comparison

Across the three paradigms (Table [10](https://arxiv.org/html/2606.00647#A5.T10) in Appendix [E](https://arxiv.org/html/2606.00647#A5)), BERT-family encoders did not exceed 0.314 macro F1 due to capacity limits. Both zero-shot LLMs and standard LLM fine-tuning collapsed to majority-class predictions (near 15% F1). In contrast, our imbalance-aware Qwen3-8B pipeline resolved these issues, reaching 39.17% macro F1.

### 4.2 Comparison with SOTA Baselines

Table [4](https://arxiv.org/html/2606.00647#S4.T4) compares the results of our systems against the task baselines (Na et al., [2026a](https://arxiv.org/html/2606.00647#bib.bib16), [b](https://arxiv.org/html/2606.00647#bib.bib2)). Our final pipeline surpassed the stated SOTA, the fine-tuned Ministral-8B baseline (31.48 macro F1), by +7.7 absolute points, corresponding to a +24.4% relative improvement in macro F1.
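For reference, the two improvement figures follow directly from the macro F1 scores of the final system (39.17) and the Ministral-8B baseline (31.48):

$$\Delta_{\text{abs}} = 39.17 - 31.48 = 7.69 \approx 7.7, \qquad \Delta_{\text{rel}} = \frac{7.69}{31.48} \approx 0.244 = 24.4\%$$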

| System | Acc. (%) | Macro F1 (%) |
| --- | --- | --- |
| GPT-5 zero-shot (task paper) | 52.75 | 19.53 |
| Gemini 2.5 Pro zero-shot | 56.36 | 25.99 |
| DeepSeek-V3.2 zero-shot (CoT) | 55.72 | 26.17 |
| Llama 3.1-8B fine-tuned | 62.92 | 30.51 |
| InternLM3-8B fine-tuned | 63.98 | 30.53 |
| Ministral-8B fine-tuned (SOTA) | 64.83 | 31.48 |
| DeBERTa-v3-base (5-fold) | 59.11 | 23.58 |
| RoBERTa-base | 51.27 | 26.97 |
| Qwen3-8B LoRA (Baseline Finetuned) | 54.45 | 24.91 |
| Qwen3-8B LoRA (Grouped CV + Bias Tuning) | 58.43 | 35.48 |
| Qwen3-8B LoRA (SeedA Ensemble + v2decode) | 64.19 | 39.17 |

Table 4: Comparison with task paper baselines.

### 4.3 Ablation Study

Table [5](https://arxiv.org/html/2606.00647#S4.T5) analyses each component’s contribution. Increasing LoRA rank to $r = 128$ produced the highest boost (+24.9% fold-level F1), supporting that model capacity was indeed the primary bottleneck. Grouped CV, data augmentation, and post-processing decode rules contributed incrementally, securing the final +3.69 F1 points.

Table 5: Ablation: each component’s contribution. † Metrics for these rows are single-fold estimates from the 5-fold setup, included as indicative rather than full OOF results.

### 4.4 Per-Class Analysis

Per-class performance is shown in Figure [2](https://arxiv.org/html/2606.00647#A4.F2) (Appendix [D](https://arxiv.org/html/2606.00647#A4)) and Table [6](https://arxiv.org/html/2606.00647#S4.T6). Level 8 (“Unclear”) saw the most improvement, climbing from near-zero recall to 0.797 F1 via augmentation and bias tuning, with Levels 2 and 3 also gaining substantially. On the other hand, Levels 4 and 5 continue to be difficult (0.25–0.27 F1) due to high linguistic overlap with the majority class. Importantly, optimising for minority classes did not compromise the majority class (Level 7), which still resulted in a solid F1 of 0.709.

Table 6: Per-label OOF metrics, final blended system with v2 decode. Level 8 improved from ≈0 to 0.797 via augmentation and bias tuning.

## 5 Conclusion

We showed that the data-centric imbalance mitigation methods we used (grouped CV, weighted loss, round-robin lexical augmentation, and dynamic OOF bias tuning with ensembling) were much more important than raw model capacity for psychological defense classification. We achieved a macro F1-score of 0.3917 on the official positive-class leaderboard, ranking 4th out of 21 registered teams. This corresponds to a +7.7 macro F1-score improvement (+24.4% relative) over the Ministral-8B fine-tuned baseline. In future work, we plan to add more effective paraphrase-based data augmentation, use losses better suited to imbalanced classes, and evaluate on more datasets.

## Limitations

These decode rules and OOF bias vectors are calibrated specifically to this dataset and so require recomputation for unseen domains. The grouped CV protocol keeps mutant variants within the same source group to reduce leakage, but the risk of leakage cannot be fully eliminated. Lastly, hardware constraints (24 GB VRAM) limited us to PEFT on models with 8B parameters or fewer.

## Acknowledgments

We thank the PsyDefDetect 2026 shared task organizers (Na et al., [2026a](https://arxiv.org/html/2606.00647#bib.bib16)) for providing the PsyDefConv dataset (Na et al., [2026b](https://arxiv.org/html/2606.00647#bib.bib2)) and evaluation infrastructure through CodaBench.

## References

- T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer (2023). QLoRA: efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314. [Link](https://arxiv.org/abs/2305.14314).
- J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. [Link](https://aclanthology.org/N19-1423/), [Document](https://dx.doi.org/10.18653/v1/N19-1423).
- H. He and E. A. Garcia (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9), pp. 1263–1284. [Document](https://dx.doi.org/10.1109/TKDE.2008.239).
- P. He, X. Liu, J. Gao, and W. Chen (2021). DeBERTa: decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations. [Link](https://openreview.net/forum?id=XPZIaotutsD).
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2021). LoRA: low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685. [Link](https://arxiv.org/abs/2106.09685).
- S. Ji, T. Zhang, L. Ansari, J. Fu, P. Tiwari, and E. Cambria (2022). MentalBERT: publicly available pretrained language models for mental healthcare. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, pp. 7184–7190. [Link](https://aclanthology.org/2022.lrec-1.778).
- T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017). Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, pp. 2999–3007. [Link](https://ieeexplore.ieee.org/document/8417976), [Document](https://dx.doi.org/10.1109/ICCV.2017.324).
- S. Liu, C. Zheng, O. Demasi, S. Sabour, Y. Li, Z. Yu, Y. Jiang, and M. Huang (2021). Towards emotional support dialog systems. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), C. Zong, F. Xia, W. Li, and R. Navigli (Eds.), Online, pp. 3469–3483. [Link](https://aclanthology.org/2021.acl-long.269/), [Document](https://dx.doi.org/10.18653/v1/2021.acl-long.269).
- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019). RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692. [Link](https://arxiv.org/abs/1907.11692).
- A. K. Menon, S. Jayasumana, A. S. Rawat, H. Jain, A. Veit, and S. Kumar (2021). Long-tail learning via logit adjustment. In International Conference on Learning Representations. [Link](https://openreview.net/forum?id=37nvvqkCo5).
- H. Na, Y. Hua, Z. Wang, T. Shen, B. Yu, L. Wang, W. Wang, J. Torous, and L. Chen (2025). A survey of large language models in psychotherapy: current landscape and future directions. In Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, pp. 7362–7376. [Link](https://aclanthology.org/2025.findings-acl.385/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.385).
- H. Na, Z. Wang, Z. Chen, Y. Hua, R. Gao, K. Yang, L. Chen, W. Wang, S. Ji, J. Torous, and S. Ananiadou (2026a). Overview of the PsyDefDetect shared task at BioNLP 2026: detecting levels of psychological defense mechanisms in supportive conversations. In Proceedings of the 25th Workshop on Biomedical Language Processing, San Diego, USA.
- H. Na, Z. Wang, Z. Chen, P. Zhou, Y. Hua, G. Z. Zhou, H. Zhang, T. Shen, W. Wang, J. Torous, S. Ji, and L. Chen (2026b). You never know a person, you only know their defenses: detecting levels of psychological defense mechanisms in supportive conversations. In Findings of the Association for Computational Linguistics: ACL 2026, San Diego, USA.
- J. Perry and M. Henry (2004). Studying defense mechanisms in psychotherapy using the defense mechanism rating scales. Advances in Psychology 136. [Document](https://dx.doi.org/10.1016/S0166-4115%2804%2980034-7).
- A. A. Sani, K. A. Zaoad, S. E. S. Adib, M. A. Muqtadir, and A. Abrar (2026). Addressing data scarcity in Bangla fake news detection: an LLM-based dataset augmentation approach. arXiv preprint arXiv:2605.01292. [Link](https://arxiv.org/abs/2605.01292).
- C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, Nevada, pp. 2818–2826. [Link](https://ieeexplore.ieee.org/document/7780677), [Document](https://dx.doi.org/10.1109/CVPR.2016.308).
- J. Wei and K. Zou (2019). EDA: easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 6382–6388. [Link](https://aclanthology.org/D19-1670/), [Document](https://dx.doi.org/10.18653/v1/D19-1670).
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. Li, T. Tang, W. Yin, X. Ren, X. Wang, X. Zhang, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Zhang, Y. Wan, Y. Liu, Z. Wang, Z. Cui, Z. Zhang, Z. Zhou, and Z. Qiu (2025). Qwen3 technical report. arXiv:2505.09388. [Link](https://arxiv.org/abs/2505.09388).

## Appendix A Full DMRS Label Definitions

Table [7](https://arxiv.org/html/2606.00647#A1.T7) provides the complete clinical descriptions for all nine psychological defense levels used in our classification prompt.

Table 7: DMRS label definitions used in the classification prompt.

## Appendix B Full Hyperparameter Table

Table 8: Complete training hyperparameters for the final Qwen3-8B LoRA system.

## Appendix C Data Augmentation Examples

Below are three mutation modes applied to a single Level-3 (Disavowal) sample. All mutations target only the seeker utterance; the supporting dialogue context is unchanged.

Table 9: Example of round-robin lexical mutations for a Disavowal (Level-3) seeker utterance. The core signal (minimization (“not that bad”) and historical comparison (“been through worse”)) is preserved across all mutations, ensuring label validity.

## Appendix D Confusion Matrices

Figure [2](https://arxiv.org/html/2606.00647#A4.F2) illustrates error distributions across minority and majority classes, highlighting grouping and filtering improvements.

![Refer to caption](https://arxiv.org/html/2606.00647v1/x2.png)
(a) Before post-processing

![Refer to caption](https://arxiv.org/html/2606.00647v1/x3.png)
(b) After post-processing

Figure 2: Approximate row-normalised confusion matrices (OOF). Post-processing successfully shifts residual majority-class prediction bias away from L7, noticeably improving minority recall along the diagonal.

## Appendix E Comprehensive Model Comparison

Table [10](https://arxiv.org/html/2606.00647#A5.T10) reports all systems evaluated during our development, organised by model family and experimental stage. Empty cells indicate the model was not evaluated under that paradigm. Rows marked † are external baselines from the task paper (Na et al., [2026b](https://arxiv.org/html/2606.00647#bib.bib2)); all others are our own internal tuning experiments.

Table 10: Consolidated results across all model families. Empty cells indicate the model was not evaluated under that paradigm. ✓ denotes selected ensemble members. ★ denotes the best result. Rows marked † are external baselines from the task paper (Na et al., [2026a](https://arxiv.org/html/2606.00647#bib.bib16)); all others are our own internal tuning experiments.
