more ai slop to slop around~

Reddit r/singularity 05/17/26, 05:50 PM Papers

Summary

This post extends E8 lattice geometric activation injection to supervised LLM safety routing, using STE-snapped E8 policy heads. While achieving near-perfect routing on clean data, the approach catastrophically fails under adversarial stress, requiring a hybrid symbolic-geometric architecture with audited deterministic rules.

Following up on my previous post about injecting E8/E16 lattice activations into transformer residual streams, I’ve spent the last few weeks extending this geometric framework to **supervised LLM policy and safety routing**. I wanted to see if we could use the E8 lattice as a high-dimensional mathematical substrate to route safety decisions, bypass over-refusal, and completely eliminate the need for bloated, latency-heavy LLM judges.

**The TL;DR:** While a MiniLM embedding combined with a Straight-Through Estimator (STE) snapped E8 policy head achieves near-perfect routing on clean data (0.979 exact label match, zero unsafe allows), **learned geometry alone fails catastrophically under adversarial stress**. Worst-case safety requires a hybrid symbolic-geometric architecture with an audited deterministic rule layer.

---

### The Architecture: STE-Snapped E8 Policy Heads

Rather than doing naive unsupervised residual injection (which completely breaks policy control), I trained a supervised classifier head directly on top of MiniLM sentence embeddings ($d=384$), projecting them into E8 lattice coordinates. To bridge continuous gradient learning with discrete geometric representations, I used a **Straight-Through Estimator (STE)** to snap activations to the nearest E8 lattice roots in the forward pass while backpropagating continuous gradients.
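To make the snapping step concrete, here is a minimal sketch in plain Python. It is not the code behind the results in this post. It assumes the head has already produced a single 8-dimensional vector (how the 384-d embedding is projected down is not specified here), and the names `e8_roots`, `snap_to_root`, and `ste_forward_backward` are hypothetical. It enumerates the 240 E8 roots, snaps to the nearest one in the forward pass, and treats the snap as the identity in the backward pass.

```python
import itertools


def e8_roots():
    """Return the 240 roots of the E8 lattice as 8-tuples of floats."""
    roots = []
    # Type 1: exactly two nonzero entries, each +1 or -1 (28 * 4 = 112 roots).
    for i, j in itertools.combinations(range(8), 2):
        for si, sj in itertools.product((1.0, -1.0), repeat=2):
            v = [0.0] * 8
            v[i], v[j] = si, sj
            roots.append(tuple(v))
    # Type 2: every entry is +1/2 or -1/2 with an even number of minus signs (128 roots).
    for signs in itertools.product((0.5, -0.5), repeat=8):
        if sum(s < 0 for s in signs) % 2 == 0:
            roots.append(signs)
    return roots


ROOTS = e8_roots()


def snap_to_root(x):
    """Forward pass: the E8 root closest to x in squared Euclidean distance."""
    return min(ROOTS, key=lambda r: sum((a - b) ** 2 for a, b in zip(x, r)))


def ste_forward_backward(x, upstream_grad):
    """Straight-through estimator (assumed form): snapped value forward,
    upstream gradient passed back unchanged as if the snap were the identity."""
    y = snap_to_root(x)
    grad_x = tuple(upstream_grad)
    return y, grad_x
```

In an autodiff framework the same idea is usually written as `x + stop_gradient(snap(x) - x)`, for example with `jax.lax.stop_gradient` in JAX or `.detach()` in PyTorch, so the forward value is the snapped point while the gradient flows through `x`.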
```
request -> MiniLM -> E8 soft-blend head (STE-snapped) -> Rule-margin hybrid controller -> JSON template
```

Our Phase 33 results showed that STE-snapping outperforms both continuous projections and hard snapping, and ties with soft blending:

| Mode | Decision Match | Policy Match | Action Match | Avg Score |
| :--- | :--- | :--- | :--- | :--- |
| Continuous | 0.875 | 0.750 | 0.875 | 3.781 |
| Hard Snap | 0.500 | 0.500 | 0.500 | 2.688 |
| **STE Snap** | **1.000** | **0.875** | **1.000** | **4.219** |
| Soft Blend | 1.000 | 0.875 | 1.000 | 4.219 |

---

### The Clean Success: Phase 37 Holdouts

We expanded the suite to 28 policy cases (Privacy, Cyber Abuse, Prompt Injection, Regulated Advice, etc.) and added a hybrid controller that pairs the E8 head with a margin-based threshold of $0.20$ to trigger human escalation or rule overrides. On clean distributions, the E8/MiniLM hybrid head generalized very well across unseen policy families (leave-one-family-out validation):

| Metric (Clean 28-Case Suite) | Mean Performance |
| :--- | :--- |
| **Exact Label Match** | **0.979** |
| Decision Match | 0.986 |
| Policy Match | 0.979 |
| Unsafe Allow | 0.000 |
| Over-Refusal | 0.000 |
| Abstain / Escalate | 0.014 |

For held-out policy families under clean distributions, exact label match stayed high and unsafe allows stayed at zero:

* **Privacy (Held-Out)**: 0.886 Exact, 0.000 Unsafe Allow
* **Cyber Abuse (Held-Out)**: 0.929 Exact, 0.000 Unsafe Allow
* **Prompt Injection (Held-Out)**: 0.893 Exact, 0.000 Unsafe Allow

---

### The Crash: Adversarial Evasion (Phase 38)

To find the limits of this geometric routing, I subjected the Phase 37 architecture to a 40-case adversarial suite (paraphrase evasion, indirect harm, multilingual attacks, policy-priority conflicts, and scary-keyword benign near-misses).
The results completely shattered the zero-unsafe-allow claim for standalone geometry and exposed the extreme brittleness of naive rule layers:

| Mode | Exact Label Match | Unsafe Allow | Harmful Miss | Benign Block |
| :--- | :--- | :--- | :--- | :--- |
| `soft_blend:direct` (E8 Head Only) | 0.400 | 0.235 | 0.285 | n/a |
| `soft_blend:deployed_rule_margin_hybrid` | 0.320 | 0.215 | 0.320 | n/a |
| **`soft_blend:adversarial_rule_margin_hybrid`** | **0.950** | **0.000** | **0.000** | **0.000** |

* **Direct E8 Head Fails**: Direct geometric heads are not safe controllers under adversarial rephrasing, leaking $23.5\%$ unsafe allows.
* **Hand-Authored Rules Break**: The deployed hybrid safety layer also collapsed to a $21.5\%$ unsafe allow rate under adversarial pressure, showing that simple heuristic rule matching is too brittle.
* **The Solution**: An audited, adversarially-augmented hybrid rule layer restored zero unsafe allows.

---

### The Transfer Deficit (Phase 40)

To test if adversarial robustness can be *learned* natively by the E8 geometric head, we trained it on adversarial data while holding out one entire adversarial family at a time.

If all adversarial vectors are seen in training, the E8 head easily fits the boundary (Exact 1.000, Unsafe Allow 0.000). **However, this robustness fails to transfer to unseen adversarial strategies**:

| Held-Out Adversarial Family | Direct Head Exact | Unsafe Allow | Harmful Miss | Policy Miss |
| :--- | :--- | :--- | :--- | :--- |
| **rule_evasion** | 0.467 | 0.533 | 0.533 | 0.000 |
| **multilingual_harmful** | 0.000 | 0.800 | 0.800 | 0.800 |
| **indirect_harmful** | 0.100 | 0.100 | 0.500 | 0.400 |

* **The Multilingual Evasion Gap**: When multilingual harmful examples are held out, the direct geometric head suffers an $80\%$ unsafe allow rate.
* **The Rule Evasion Gap**: Rule-evasion bypasses leak a $53.3\%$ unsafe allow rate.
* **The Structural Failure**: While the head easily maps clean semantic structures, it cannot extrapolate to the out-of-distribution adversarial geometries of unseen attack vectors.

---
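For anyone who wants to reproduce the shape of the Phase 40 protocol, here is a rough sketch of a leave-one-family-out loop. Everything in it is an assumption rather than the actual harness: the case dictionaries with `family`, `text`, and `label` keys, the `"allow"` label string, the `fit` and `predict` callables, and in particular the definition of unsafe allow as the share of all test cases whose gold label is not `"allow"` but whose prediction is.

```python
def leave_one_family_out(cases):
    """Yield (held_out_family, train_cases, test_cases) once per family."""
    families = sorted({c["family"] for c in cases})
    for family in families:
        train = [c for c in cases if c["family"] != family]
        test = [c for c in cases if c["family"] == family]
        yield family, train, test


def exact_match(gold, pred):
    """Fraction of cases where the predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def unsafe_allow_rate(gold, pred):
    """Assumed definition: fraction of all cases that should not be allowed
    but were predicted as "allow"."""
    return sum(g != "allow" and p == "allow" for g, p in zip(gold, pred)) / len(gold)


def evaluate(cases, fit, predict):
    """Train on every family except one, then score on the held-out family."""
    report = {}
    for family, train, test in leave_one_family_out(cases):
        model = fit(train)  # hypothetical training callable
        gold = [c["label"] for c in test]
        pred = [predict(model, c["text"]) for c in test]  # hypothetical inference callable
        report[family] = {
            "exact": exact_match(gold, pred),
            "unsafe_allow": unsafe_allow_rate(gold, pred),
        }
    return report
```

The point of structuring it this way is that no example from the held-out family ever reaches `fit`, so each row of the report measures transfer to an unseen strategy rather than fit to a seen one.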
