Tag
An empirical study demonstrating that long, semantically dense, benign text can shift a model's latent space and bypass alignment, causing it to generate otherwise blocked critiques. The author, a non-expert, requests an audit of their metrics to distinguish genuine semantic hijacking from artifacts.
An empirical study investigating how long, semantically dense benign text can shift a model's latent space trajectory, diluting initial system prompts and bypassing post-training alignment constraints, as observed in both closed and open-source models.