This paper evaluates 14 open-source safety guard models on a curated benchmark of 79,331 samples spanning 8 NIST safety categories. It finds that model size does not correlate with detection performance and that Qwen Guard (4B) achieves the highest recall.