semantic-hijacking


HELP WITH RESEARCH: Observation - Semantically Dense Context Produces Strong Late-Layer Divergence Without Jailbreak Prompts [D]

Reddit r/MachineLearning · yesterday

An empirical study reporting that long, semantically dense, benign text can shift a model's latent-space trajectory and bypass alignment, causing the model to generate otherwise blocked critiques. The author, a non-expert, requests an audit of their metrics to distinguish genuine semantic hijacking from artifacts.


Investigating Implicit Latent Trajectory Shifts: Bypassing Alignment via Long-Form Coherent Context

Reddit r/ArtificialInteligence · 2d ago

An empirical study investigating how long, semantically dense, benign text can shift a model's latent-space trajectory, diluting the initial system prompt and bypassing post-training alignment constraints, as observed in both closed-source and open-source models.
