Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

arXiv cs.AI · 3d ago

This paper identifies a latent mechanism shared across diverse backdoor behaviors in LLMs. It uses sparse autoencoders to detect the features associated with that mechanism and to causally suppress them, enabling unified backdoor detection and mitigation across models and attack types.
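To make the idea of detecting and suppressing sparse autoencoder (SAE) features concrete, here is a minimal NumPy sketch. It is not the paper's implementation. It assumes a plain ReLU SAE, and every name in it (`sae_encode`, `sae_decode`, `flag_inputs`, `suppress_features`, the weight matrices, the feature indices, and the threshold) is hypothetical and chosen for illustration. Detection is modeled as thresholding selected feature activations, and suppression as zeroing those features while keeping the SAE's reconstruction error, so that nothing else in the activation changes.

```python
import numpy as np


def sae_encode(x, W_enc, b_enc):
    """ReLU encoder: map activations (n, d) to sparse features (n, k)."""
    return np.maximum(0.0, x @ W_enc + b_enc)


def sae_decode(f, W_dec, b_dec):
    """Linear decoder: map features (n, k) back to activations (n, d)."""
    return f @ W_dec + b_dec


def flag_inputs(x, W_enc, b_enc, feature_ids, threshold):
    """Flag rows where any selected feature exceeds the threshold.

    Assumption: the suspicious features have already been identified
    somehow; this sketch does not show how to find them.
    """
    ids = np.asarray(feature_ids, dtype=int)
    f = sae_encode(x, W_enc, b_enc)
    return f[:, ids].max(axis=1) > threshold


def suppress_features(x, W_enc, b_enc, W_dec, b_dec, feature_ids):
    """Zero the selected features and return the edited activations."""
    ids = np.asarray(feature_ids, dtype=int)
    f = sae_encode(x, W_enc, b_enc)
    # Keep whatever the SAE fails to reconstruct, so only the
    # selected features' contributions are removed.
    residual = x - sae_decode(f, W_dec, b_dec)
    f_ablated = f.copy()
    f_ablated[:, ids] = 0.0
    return sae_decode(f_ablated, W_dec, b_dec) + residual


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k = 8, 16  # toy sizes: activation width and number of SAE features
    W_enc, b_enc = rng.normal(size=(d, k)), np.zeros(k)
    W_dec, b_dec = rng.normal(size=(k, d)), np.zeros(d)
    x = rng.normal(size=(4, d))  # stand-in for model activations

    suspicious = [3]  # hypothetical feature index
    print(flag_inputs(x, W_enc, b_enc, suspicious, threshold=0.5))
    print(suppress_features(x, W_enc, b_enc, W_dec, b_dec, suspicious).shape)
```

In this sketch, suppressing an empty set of features returns the input unchanged, and suppressing one feature subtracts exactly that feature's activation times its decoder row. In a real model the edited activations would be written back into the forward pass, which is omitted here.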
