@che_shr_cat: 1/ What if you could train a model on totally benign-looking Wikipedia articles, but secretly force its internal weight…

X AI KOLs Following 06/14/26, 09:31 AM Papers

neural-networks qr-code weight-encoding adversarial-training natural-language security

Summary

This thread presents a technique that encodes a functional QR code into a neural network's weights by training it on natural-language text, so hidden information can be embedded in models trained on benign-looking data.

1/ What if you could train a model on totally benign-looking Wikipedia articles, but secretly force its internal weights to encode a fully functional QR code? This is now possible. We can program neural network weights using natural words. 🧵 https://t.co/aSH2uWgu3H

Original Article

Cached at: 06/16/26, 01:09 AM

2/ In “Synthetic Data for any Differentiable Target”, Tristan Thrush, Christopher Potts, Tatsunori Hashimoto, and team introduce Dataset Policy Gradient (DPG).

It is a new RL primitive that optimizes synthetic text generators for downstream model targets.

3/ Training a downstream model from scratch just to obtain a single dataset-level RL reward is computationally prohibitive.

DPG bypasses this. It treats individual synthetic examples as actions and computes example-level metagradient rewards through the training trajectory.
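
To make "examples as actions" concrete, here is a minimal sketch, assuming a softmax policy over three candidate texts and made-up per-example rewards. All names and numbers are hypothetical, and the exact expected-reward gradient used here stands in for whatever estimator the paper actually uses.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_step(logits, rewards, lr=0.5):
    """One exact gradient-ascent step on expected reward for a softmax policy."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))  # expected reward
    # d E[r] / d logit_i = p_i * (r_i - E[r])
    return [z + lr * p * (r - baseline) for z, p, r in zip(logits, probs, rewards)]

# Hypothetical setup: three candidate synthetic texts, one reward each.
rewards = [1.0, -0.5, 0.2]
logits = [0.0, 0.0, 0.0]
for _ in range(50):
    logits = policy_gradient_step(logits, rewards)
probs = softmax(logits)
print(probs)  # probability mass shifts toward the highest-reward text
```

In a real generator the "action" would be a whole sampled text and the logits would come from a language model, but the update has the same shape: raise the probability of examples whose reward beats the average.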

4/ The core trick: virtual loss weights.

The algorithm applies a virtual weight to each synthetic text. By taking the gradient of the downstream target metric with respect to these weights, it gets a precise reward signal for every single generated token sequence.
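
The virtual-weight idea can be sketched on a toy problem, assuming a one-parameter model, a single plain gradient step as the inner loop, and a squared-distance target. None of these choices come from the paper; they only keep the chain rule short enough to write by hand.

```python
def example_grad(theta, x, y):
    """Gradient of the per-example loss 0.5 * (theta * x - y)**2 w.r.t. theta."""
    return (theta * x - y) * x

def train_one_step(theta, data, weights, lr):
    """One gradient step on the virtually weighted loss sum_i w_i * loss_i."""
    total = sum(w * example_grad(theta, x, y) for w, (x, y) in zip(weights, data))
    return theta - lr * total

def target_metric(theta, theta_goal):
    """Toy differentiable target: squared distance of the weight from a goal value."""
    return 0.5 * (theta - theta_goal) ** 2

def example_rewards(theta, data, lr, theta_goal):
    """reward_i = -d target / d w_i, evaluated at w_i = 1 for every example."""
    theta_new = train_one_step(theta, data, [1.0] * len(data), lr)
    d_target_d_theta = theta_new - theta_goal
    # Chain rule: d theta_new / d w_i = -lr * grad_i
    return [-d_target_d_theta * (-lr * example_grad(theta, x, y)) for x, y in data]

data = [(1.0, 2.0), (1.0, -1.0), (2.0, 1.0)]  # hypothetical (input, label) pairs
rewards = example_rewards(theta=0.0, data=data, lr=0.1, theta_goal=1.0)
print(rewards)  # positive for examples that pull theta toward the goal
```

The weights are "virtual" because they are never changed from 1; they exist only so the target can be differentiated with respect to each example's contribution.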

5/ A fascinating technical insight from the paper: standard SGD completely fails here.

The meta-optimization only succeeds when using Adam for the inner loop. The metagradients must track Adam’s running second-moment states to get high-fidelity training signals.
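
What "tracking the second-moment state" means can be shown on a scalar toy, assuming full-batch Adam on a one-parameter model and forward-mode derivatives with respect to one virtual weight. This is an illustration only; the function name, data, and hyperparameters are hypothetical, and the paper's implementation presumably backpropagates through much larger models.

```python
import math

def adam_with_tangent(data, weights, k, steps, lr=0.1, b1=0.9, b2=0.999,
                      eps=1e-8, track_second_moment=True):
    """Run Adam and carry d(theta)/d(w_k) alongside theta, m and v."""
    theta = m = v = 0.0
    d_theta = d_m = d_v = 0.0
    for t in range(1, steps + 1):
        grads = [(theta * x - y) * x for x, y in data]
        g = sum(w * gi for w, gi in zip(weights, grads))
        curvature = sum(w * x * x for w, (x, _) in zip(weights, data))
        # Direct effect of w_k plus the indirect effect through theta.
        d_g = grads[k] + curvature * d_theta
        m = b1 * m + (1 - b1) * g
        d_m = b1 * d_m + (1 - b1) * d_g
        v = b2 * v + (1 - b2) * g * g
        if track_second_moment:
            d_v = b2 * d_v + (1 - b2) * 2 * g * d_g
        m_hat, d_m_hat = m / (1 - b1 ** t), d_m / (1 - b1 ** t)
        v_hat, d_v_hat = v / (1 - b2 ** t), d_v / (1 - b2 ** t)
        root = math.sqrt(v_hat)
        denom = root + eps
        d_denom = d_v_hat / (2 * root) if root > 0 else 0.0
        theta = theta - lr * m_hat / denom
        d_theta = d_theta - lr * (d_m_hat * denom - m_hat * d_denom) / denom ** 2
    return theta, d_theta

data = [(1.0, 2.0), (1.0, -1.0), (2.0, 1.0)]  # hypothetical (input, label) pairs
ones = [1.0, 1.0, 1.0]
_, exact = adam_with_tangent(data, ones, k=0, steps=3)
_, naive = adam_with_tangent(data, ones, k=0, steps=3, track_second_moment=False)
print(exact, naive)  # the two derivatives differ noticeably
```

Setting `track_second_moment=False` treats Adam's denominator as a constant, and in this toy the resulting derivative no longer agrees with a finite-difference check, which is the kind of fidelity loss the tweet describes.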

6/ The results are wild.

DPG achieves 100% accuracy in programming a target model’s language model head weights to reconstruct a QR code.
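
One plausible way to turn "weights encode a QR code" into a differentiable target is sketched below, assuming each weight in a patch should be positive for a dark module and negative for a light one. The loss, the sign-based decoding, and the 3x3 pattern are all assumptions made for illustration; the thread does not say how the paper defines the target or the accuracy.

```python
def bit_loss(weights, bits, margin=1.0):
    """Mean squared distance from +margin (dark module) or -margin (light module)."""
    total, count = 0.0, 0
    for w_row, b_row in zip(weights, bits):
        for w, b in zip(w_row, b_row):
            goal = margin if b else -margin
            total += (w - goal) ** 2
            count += 1
    return total / count

def bit_accuracy(weights, bits):
    """Fraction of weights whose sign decodes to the intended bit."""
    hits = sum((w > 0) == bool(b)
               for w_row, b_row in zip(weights, bits)
               for w, b in zip(w_row, b_row))
    return hits / sum(len(row) for row in bits)

# Hypothetical 3x3 corner of a bit pattern and a weight patch that almost matches it.
bits = [[1, 0, 1],
        [0, 1, 0],
        [1, 1, 0]]
patch = [[0.8, -0.3, 0.5],
         [-0.9, 0.2, 0.1],   # last entry has the wrong sign
         [0.4, 0.7, -0.6]]
print(bit_loss(patch, bits), bit_accuracy(patch, bits))
```

The smooth loss is what an optimizer could differentiate, while the thresholded accuracy is the kind of number a "100%" claim would refer to under this assumed setup.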

It also optimized synthetic data that drastically improved multilingual performance over naive baselines.

7/ The catch? Massive memory overhead.

Backpropagating through multi-step training means storing the entire computational graph.

For large LLMs, the authors had to restrict the inner loop to a single step. Also, abstract targets can cause text quality to decay.
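
A back-of-the-envelope sketch shows why the number of inner steps matters, assuming a hypothetical 1B-parameter model in fp32 and assuming each unrolled step keeps the parameters plus Adam's two moment tensors. The figures are illustrative and not taken from the paper; activations are ignored, so real usage would be higher.

```python
def unrolled_state_bytes(n_params, inner_steps, bytes_per_value=4, tensors_per_step=3):
    """Rough memory estimate for keeping every inner step's optimizer state.

    tensors_per_step=3 assumes parameters plus Adam's first and second moments.
    No rematerialization or checkpointing is modeled.
    """
    return n_params * inner_steps * bytes_per_value * tensors_per_step

for steps in (1, 4, 16):
    gib = unrolled_state_bytes(1_000_000_000, steps) / 2**30
    print(f"{steps:>2} inner steps: ~{gib:.0f} GiB")
```

The estimate grows linearly with the number of unrolled steps, which is consistent with the single-step restriction the thread mentions for large models.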

8/ This is a double-edged sword.

For alignment, it allows high-precision steering of model behavior through curated data.

For security, it is a nightmare. It enables clean-label data poisoning that is virtually impossible to detect by human inspection.

9/ If synthetic data can be optimized to program weights directly, we need to rethink how we audit training runs.

Read the full technical breakdown: https://arxiviq.substack.com/p/synthetic-data-for-any-differentiable…

Paper: https://arxiv.org/abs/2604.08423

What are your thoughts on this optimization vector?

10/ I also illustrated these concepts in a short visual comic. Sometimes seeing the loop makes the math click instantly.

#MachineLearning
