safety-probing

Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry

arXiv cs.LG · 2026-05-21

Introduces Geometry-Lite, a compact probe that analyzes layer-wise margin geometry to interpret how safety evidence forms across layers in LLMs, improving over single-layer probes while maintaining interpretability.
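
The summary does not say how Geometry-Lite computes its layer-wise margins, so the following is only a rough illustration of the general idea and not the paper's method. It assumes one linear probe per layer (a simple mean-difference classifier used as a stand-in), synthetic "hidden states" whose class separation grows with depth by construction, and hypothetical helper names such as `layerwise_margin_profile`. It shows how a per-layer signed margin could be tracked to see where evidence for a label accumulates.

```python
import math
import random


def fit_mean_difference_probe(xs, ys):
    """Fit a linear probe: w = mean(positive) - mean(negative), b centers the midpoint at 0.

    This is an assumed stand-in probe, not the probe used by Geometry-Lite.
    """
    dim = len(xs[0])
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    mu_pos = [sum(x[d] for x in pos) / len(pos) for d in range(dim)]
    mu_neg = [sum(x[d] for x in neg) / len(neg) for d in range(dim)]
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    mid = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    b = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, b


def signed_margin(w, b, x, y):
    """Distance of x from the probe's hyperplane, positive when x is on the correct side."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    score = (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
    return score if y == 1 else -score


def layerwise_margin_profile(layers, ys):
    """Return the mean signed margin of a per-layer probe, one value per layer."""
    profile = []
    for xs in layers:
        w, b = fit_mean_difference_probe(xs, ys)
        margins = [signed_margin(w, b, x, y) for x, y in zip(xs, ys)]
        profile.append(sum(margins) / len(margins))
    return profile


def make_synthetic_layers(n_layers=4, n_per_class=50, dim=8, seed=0):
    """Build toy per-layer features; the growing class shift is an assumption of this demo."""
    rng = random.Random(seed)
    ys = [1] * n_per_class + [0] * n_per_class
    layers = []
    for layer in range(n_layers):
        shift = 0.5 * layer  # separation along dimension 0 increases with depth
        xs = []
        for y in ys:
            center = shift if y == 1 else -shift
            xs.append([rng.gauss(center if d == 0 else 0.0, 1.0) for d in range(dim)])
        layers.append(xs)
    return layers, ys


layers, ys = make_synthetic_layers()
profile = layerwise_margin_profile(layers, ys)
for i, mean_margin in enumerate(profile):
    print(f"layer {i}: mean signed margin = {mean_margin:.3f}")
```

A single-layer probe would report only one entry of `profile`. Reading the whole profile is what makes it possible to see at which depth the margin starts to grow, which is the kind of layer-wise view the summary describes.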
