harm-mitigation

Tag

Cards List
#harm-mitigation

DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training

arXiv cs.CL · 2026-04-21 Cached

DART (Distill-Audit-Repair Training) is a new training framework that addresses 'harm drift' in safety-aligned LLMs, where fine-tuning for demographic difference-awareness causes harmful content to appear in model explanations. On eight benchmarks, DART improves Llama-3-8B-Instruct accuracy from 39.0% to 68.8% while reducing harm drift cases by 72.6%.

0 favorites 0 likes
← Back to home

Submit Feedback