test-time-steering

Tag

Cards List
#test-time-steering

Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering

arXiv cs.AI · 2026-05-08 Cached

This paper investigates safety failures in Large Reasoning Models where harmful content appears in reasoning traces despite safe final answers, proposing an adaptive multi-principle steering method to mitigate these risks.

0 favorites 0 likes
← Back to home

Submit Feedback