shared-latent-structures

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

arXiv cs.AI · 3d ago

This paper identifies a latent mechanism shared by diverse backdoor behaviors in LLMs. Using sparse autoencoders (SAEs), the authors detect the features that carry this mechanism and causally suppress them, which enables a single approach to backdoor detection and mitigation across models and attack types.
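The summary does not describe the paper's actual procedure. As a rough illustration of what "detecting and causally suppressing SAE features" can mean, here is a minimal pure-Python sketch under stated assumptions. It uses a hand-set toy ReLU sparse autoencoder (`W_ENC`, `B_ENC`, `W_DEC`, `B_DEC` are hypothetical and not trained on any model). It also uses a hypothetical detection rule, `flag_features`, which flags features whose mean activation rises on trigger-bearing inputs, and an ablation, `suppress`, which zeroes the flagged features while keeping the SAE's reconstruction error. These names, the toy weights, and the threshold rule are illustrative assumptions, not the authors' method.

```python
from typing import List

Vector = List[float]

# Hypothetical toy SAE, hand-set for illustration only (a real SAE is trained
# on a model's internal activations). 3-dim activations, 3 features.
W_ENC = [[1.0, 0.0, 0.0],   # feature 0 reads dim 0
         [0.0, 1.0, 0.0],   # feature 1 reads dim 1
         [0.0, 0.0, 1.0]]   # feature 2 reads dim 2 (a hypothetical "trigger" direction)
B_ENC = [0.0, 0.0, -0.5]
W_DEC = [[1.0, 0.0, 0.0],   # decoder direction of feature 0
         [0.0, 1.0, 0.0],   # decoder direction of feature 1
         [0.0, 0.0, 1.0]]   # decoder direction of feature 2
B_DEC = [0.0, 0.0, 0.0]


def encode(x: Vector) -> Vector:
    """ReLU encoder: f_i = max(0, w_i . x + b_i)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W_ENC, B_ENC)]


def decode(f: Vector) -> Vector:
    """Linear decoder: x_hat = sum_i f_i * d_i + b_dec."""
    return [sum(fi * d[j] for fi, d in zip(f, W_DEC)) + B_DEC[j]
            for j in range(len(B_DEC))]


def mean_features(batch: List[Vector]) -> Vector:
    """Mean activation of each SAE feature over a batch of inputs."""
    feats = [encode(x) for x in batch]
    return [sum(col) / len(feats) for col in zip(*feats)]


def flag_features(clean: List[Vector], triggered: List[Vector],
                  threshold: float) -> List[int]:
    """Hypothetical detection rule: flag features whose mean activation
    increases by more than `threshold` on triggered inputs."""
    c = mean_features(clean)
    t = mean_features(triggered)
    return [i for i, (ci, ti) in enumerate(zip(c, t)) if ti - ci > threshold]


def suppress(x: Vector, flagged: List[int]) -> Vector:
    """Zero the flagged features and rebuild the activation, adding back the
    SAE reconstruction error so unexplained parts of x pass through."""
    f = encode(x)
    error = [xi - hi for xi, hi in zip(x, decode(f))]
    f_ablated = [0.0 if i in flagged else fi for i, fi in enumerate(f)]
    return [ri + ei for ri, ei in zip(decode(f_ablated), error)]


if __name__ == "__main__":
    clean = [[1.0, 2.0, 0.0], [2.0, 1.0, 0.0]]
    triggered = [[1.0, 2.0, 1.0], [2.0, 1.0, 1.0]]
    flagged = flag_features(clean, triggered, threshold=0.25)
    print(flagged)                              # [2]
    print(suppress([1.0, 2.0, 1.0], flagged))   # [1.0, 2.0, 0.5]
```

Because the reconstruction error is added back, `suppress` changes an activation only by the flagged features' decoder contributions (here, removing 0.5 along feature 2's direction), and clean inputs where the feature is inactive pass through unchanged.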
