This post extends E8 lattice geometric activation injection to supervised LLM safety routing, using STE-snapped E8 policy heads. While achieving near-perfect routing on clean data, the approach catastrophically fails under adversarial stress, requiring a hybrid symbolic-geometric architecture with audited deterministic rules.
Following up on my previous post about injecting E8/E16 lattice activations into transformer residual streams, I’ve spent the last few weeks extending this geometric framework to **supervised LLM policy and safety routing**. I wanted to see if we could use the E8 lattice as a high-dimensional mathematical substrate to route safety decisions, bypass over-refusal, and completely eliminate the need for bloated, latency-heavy LLM judges.

**The TL;DR:** While a MiniLM embedding combined with a Straight-Through Estimator (STE) snapped E8 policy head achieves near-perfect routing on clean data (0.979 exact label match, zero unsafe allows), **learned geometry alone fails catastrophically under adversarial stress**. Worst-case safety requires a hybrid symbolic-geometric architecture with an audited deterministic rule layer.

---

### The Architecture: STE-Snapped E8 Policy Heads

Rather than doing naive unsupervised residual injection (which completely breaks policy control), I trained a supervised classifier head directly on top of MiniLM sentence embeddings ($d=384$), projecting them into E8 lattice coordinates.

To bridge continuous gradient learning with discrete geometric representations, I used a **Straight-Through Estimator (STE)** to snap activations to the nearest E8 lattice roots in the forward pass while backpropagating continuous gradients.
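To make the snapping step concrete, here is a minimal stdlib-only sketch, not the code used in these experiments. It assumes the snap targets the nearest point of the full E8 lattice (written as $D_8 \cup (D_8 + \tfrac{1}{2})$) rather than only the 240 roots, and it models the straight-through estimator as a forward/backward pair instead of using an autograd framework. The function names `nearest_e8`, `ste_snap_forward`, and `ste_snap_backward` are hypothetical.

```python
import math


def _nearest_d8(x):
    """Nearest point of D8: integer vectors whose coordinates sum to an even number."""
    rounded = [math.floor(v + 0.5) for v in x]
    if sum(rounded) % 2 == 0:
        return rounded
    # Odd parity: re-round the coordinate with the largest rounding error
    # the other way, which is the cheapest single change that fixes parity.
    i = max(range(len(x)), key=lambda k: abs(x[k] - rounded[k]))
    rounded[i] += 1 if x[i] > rounded[i] else -1
    return rounded


def nearest_e8(x):
    """Nearest E8 lattice point to an 8-dimensional vector (assumed decoder)."""
    if len(x) != 8:
        raise ValueError("E8 snapping expects exactly 8 coordinates")
    # Candidate from the integer coset D8.
    a = [float(v) for v in _nearest_d8(x)]
    # Candidate from the half-integer coset D8 + 1/2.
    b = [v + 0.5 for v in _nearest_d8([v - 0.5 for v in x])]
    dist_a = sum((p - q) ** 2 for p, q in zip(x, a))
    dist_b = sum((p - q) ** 2 for p, q in zip(x, b))
    return a if dist_a <= dist_b else b


def ste_snap_forward(x):
    """Forward pass: emit the discrete lattice point."""
    return nearest_e8(x)


def ste_snap_backward(grad_output):
    """Backward pass: treat the snap as the identity, so gradients pass through unchanged."""
    return list(grad_output)
```

In an autograd framework the same idea is usually written as `x + stop_gradient(snap(x) - x)`, which yields the snapped value in the forward pass and the identity gradient in the backward pass.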
```
request -> MiniLM -> E8 soft-blend head (STE-snapped) -> Rule-margin hybrid controller -> JSON template
```

The Phase 33 results showed that STE-snapping outperforms both continuous projections and hard snapping, and ties with soft blending:

| Mode | Decision Match | Policy Match | Action Match | Avg Score |
| :--- | :--- | :--- | :--- | :--- |
| Continuous | 0.875 | 0.750 | 0.875 | 3.781 |
| Hard Snap | 0.500 | 0.500 | 0.500 | 2.688 |
| **STE Snap** | **1.000** | **0.875** | **1.000** | **4.219** |
| Soft Blend | 1.000 | 0.875 | 1.000 | 4.219 |

---

### The Clean Success: Phase 37 Holdouts

We expanded the suite to 28 policy cases (Privacy, Cyber Abuse, Prompt Injection, Regulated Advice, etc.) using a hybrid controller that integrates the E8 head with a margin-based threshold of $0.20$ to trigger human escalation or rule overrides.

On clean distributions, the E8/MiniLM hybrid head generalized well across unseen policy families (leave-one-family-out validation):

| Metric (Clean 28-Case Suite) | Mean Performance |
| :--- | :--- |
| **Exact Label Match** | **0.979** |
| Decision Match | 0.986 |
| Policy Match | 0.979 |
| Unsafe Allow | 0.000 |
| Over-Refusal | 0.000 |
| Abstain / Escalate | 0.014 |

For held-out policy families under clean distributions, exact label match remained high:

* **Privacy (Held-Out)**: 0.886 Exact, 0.000 Unsafe Allow
* **Cyber Abuse (Held-Out)**: 0.929 Exact, 0.000 Unsafe Allow
* **Prompt Injection (Held-Out)**: 0.893 Exact, 0.000 Unsafe Allow

---

### The Crash: Adversarial Evasion (Phase 38)

To find the limits of this geometric routing, I subjected the Phase 37 architecture to a 40-case adversarial suite (paraphrase evasion, indirect harm, multilingual attacks, policy-priority conflicts, and scary-keyword benign near-misses).
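For context, the decision step being stress-tested can be sketched as a margin-gated controller. This is an illustration under stated assumptions, not the deployed code: the label names (`allow`, `refuse`, `escalate`), the function name `route`, and the precedence of a fired rule over the head are all hypothetical. Only the $0.20$ margin threshold comes from the description above.

```python
def route(head_scores, rule_decision=None, margin_threshold=0.20):
    """Pick a routing decision from head scores and an optional rule override.

    head_scores: dict mapping a label to the head's score for that label.
    rule_decision: label forced by a deterministic rule, or None if no rule fired.
    """
    # Assumption: a fired deterministic rule always wins over the learned head.
    if rule_decision is not None:
        return rule_decision

    ranked = sorted(head_scores.values(), reverse=True)
    # Margin between the best and second-best label; a single label has no rival.
    margin = ranked[0] - ranked[1] if len(ranked) > 1 else ranked[0]

    # Low-margin cases are not trusted and are escalated to a human.
    if margin < margin_threshold:
        return "escalate"
    return max(head_scores, key=head_scores.get)
```

A controller of this shape is only as strong as the rule that produces `rule_decision`: an adversarial input that evades the rules and still gets a confident head score passes straight through.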
The results broke the zero-unsafe-allow claim for standalone geometry and exposed how brittle naive rule layers are:

| Mode | Exact Label Match | Unsafe Allow | Harmful Miss | Benign Block |
| :--- | :--- | :--- | :--- | :--- |
| `soft_blend:direct` (E8 Head Only) | 0.400 | 0.235 | 0.285 | n/a |
| `soft_blend:deployed_rule_margin_hybrid` | 0.320 | 0.215 | 0.320 | n/a |
| **`soft_blend:adversarial_rule_margin_hybrid`** | **0.950** | **0.000** | **0.000** | **0.000** |

* **Direct E8 Head Fails**: Direct geometric heads are not safe controllers under adversarial rephrasing, leaking $23.5\%$ unsafe allows.
* **Hand-Authored Rules Break**: The deployed hybrid safety layer also collapsed to a $21.5\%$ unsafe allow rate under adversarial pressure, showing that simple heuristic rule matching is too brittle.
* **The Solution**: An audited, adversarially-augmented hybrid rule layer restored zero unsafe allows.

---

### The Transfer Deficit (Phase 40)

To test if adversarial robustness can be *learned* natively by the E8 geometric head, we trained it on adversarial data while holding out one entire adversarial family at a time.

If all adversarial vectors are seen in training, the E8 head easily fits the boundary (Exact 1.000, Unsafe Allow 0.000). **However, this robustness fails to transfer to unseen adversarial strategies**:

| Held-Out Adversarial Family | Direct Head Exact | Unsafe Allow | Harmful Miss | Policy Miss |
| :--- | :--- | :--- | :--- | :--- |
| **rule_evasion** | 0.467 | 0.533 | 0.533 | 0.000 |
| **multilingual_harmful** | 0.000 | 0.800 | 0.800 | 0.800 |
| **indirect_harmful** | 0.100 | 0.100 | 0.500 | 0.400 |

* **The Multilingual Evasion Gap**: When multilingual harmful examples are held out, the direct geometric head suffers an $80\%$ unsafe allow rate.
* **The Rule Evasion Gap**: Rule-evasion bypasses leak a $53.3\%$ unsafe allow rate.
* **The Structural Failure**: While the head easily maps clean semantic structures, it cannot extrapolate to the out-of-distribution adversarial geometries of unseen attack vectors.

---
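The hold-one-family-out protocol behind the Phase 40 numbers can be sketched as follows. This is a hedged illustration, not the evaluation harness used here: the case schema (a dict with a `family` key), the label name `allow`, and the choice to compute the unsafe allow rate over all evaluated cases are assumptions, since the post does not spell out the denominator.

```python
def leave_one_family_out(cases):
    """Yield (held_out_family, train_cases, test_cases) for each adversarial family."""
    families = sorted({case["family"] for case in cases})
    for family in families:
        train = [c for c in cases if c["family"] != family]
        test = [c for c in cases if c["family"] == family]
        yield family, train, test


def unsafe_allow_rate(gold_labels, predicted_labels, allow_label="allow"):
    """Fraction of evaluated cases that were allowed although the gold label is not allow."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold and predicted labels must have the same length")
    if not gold_labels:
        return 0.0
    unsafe = sum(
        1
        for gold, pred in zip(gold_labels, predicted_labels)
        if gold != allow_label and pred == allow_label
    )
    return unsafe / len(gold_labels)
```

Each fold trains the head on every family except one and scores it only on the excluded family, so a high unsafe allow rate in a fold measures failure to transfer to an unseen attack strategy rather than failure to fit the training data.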