@FrancoisChauba1: If you train on (unsorted list, bubble sort procedure, sorted list) traces, you will never test time compute (TTC) your…


Summary

A critique arguing that training LLMs on human-generated data limits their ability to discover novel solutions via test-time compute, and that true AGI requires models that can explore hypothesis spaces more broadly, similar to AlphaZero.


Cached at: 05/26/26, 08:56 PM

If you train on (unsorted list, bubble sort procedure, sorted list) traces, you will never test-time compute (TTC) your way to mergesort.
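
As a rough illustration of the gap being described, here is a minimal Python sketch (not from the original post) of the two procedures. Both map the same unsorted list to the same sorted list, but the traces differ: bubble sort only ever swaps neighbours, while mergesort splits and merges recursively. The function names and the comparison counters are assumptions added for illustration.

```python
def bubble_sort(xs):
    """Return (sorted copy, number of comparisons) using adjacent swaps only."""
    xs = list(xs)
    comparisons = 0
    n = len(xs)
    for i in range(n):
        for j in range(n - 1 - i):
            comparisons += 1
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs, comparisons


def merge_sort(xs):
    """Return (sorted copy, number of comparisons) using divide and conquer."""
    xs = list(xs)
    if len(xs) <= 1:
        return xs, 0
    mid = len(xs) // 2
    left, c_left = merge_sort(xs[:mid])
    right, c_right = merge_sort(xs[mid:])
    merged = []
    comparisons = c_left + c_right
    i = j = 0
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One side is exhausted; the rest of the other side is already sorted.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons


data = list(range(64, 0, -1))  # a reversed list, hard for bubble sort
bubble_out, bubble_cmp = bubble_sort(data)
merge_out, merge_cmp = merge_sort(data)
```

The outputs are identical, so an (input, output) pair alone does not reveal which procedure produced it; only the intermediate steps differ, and those are what the traces would teach.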

So frontier lab people say, "Well, we don't just train on one algo, we train on many classes of sort algos, so the model should be able to explore the function space of sort."

You are still limited then.

Let's say, for example, that we don't know about non-comparative sorts (e.g. radix sort), but we train on all the comparative sort algos. Same issue: the model won't sample non-comparative sort algos! How would it? It doesn't think orthogonally. But people do!
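
For readers unfamiliar with the distinction, here is a minimal sketch of a least-significant-digit radix sort in Python (an illustration added here, not code from the post). It assumes non-negative integers; the `base` parameter is an illustrative choice. The ordering comes from dropping values into digit buckets, not from comparing two elements against each other, which is why it sits outside the family of comparative sorts.

```python
def radix_sort(xs, base=10):
    """LSD radix sort for non-negative integers.

    Order is decided by digit buckets; no two elements are compared
    against each other.
    """
    xs = list(xs)
    if any(x < 0 for x in xs):
        raise ValueError("this sketch only handles non-negative integers")
    place = 1
    # Keep going while some element still has a non-zero digit at or above `place`.
    while any(x // place > 0 for x in xs):
        buckets = [[] for _ in range(base)]
        for x in xs:
            buckets[(x // place) % base].append(x)
        # Concatenating buckets in order keeps the pass stable.
        xs = [x for bucket in buckets for x in bucket]
        place *= base
    return xs


example = radix_sort([170, 45, 75, 90, 802, 24, 2, 66])
```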

OpenAI STILL thinks this is the path to AGI?!

It can't be.

The modern LLM stack is essentially imitation learning plus a small amount of search via TTC (test-time compute), leveraging the generator-verifier gap to self-distill back into the weights.

This will always confine your search to the training manifold of function space.

This makes novel programs that are much better, but far outside the human manifold, almost impossible to find via TTC.
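
The claim can be made concrete with a deliberately tiny toy, sketched below in Python. It is not how any lab's pipeline works: the "model" is just a probability table over three named programs, the "verifier" checks outputs against `sorted`, and the "distillation" step is a crude additive bump of 0.1, all of which are assumptions made for illustration. The point it shows is narrow: sampling plus verification plus distillation can reweight what the prior already supports, but a program with zero prior probability is never sampled and so never reinforced.

```python
import random


def bubble(xs):
    xs = list(xs)
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs


def broken(xs):
    return list(xs)  # returns the input unchanged, so it fails the verifier


def counting(xs):
    # A non-comparative sort for small non-negative integers.
    counts = [0] * (max(xs, default=-1) + 1)
    for x in xs:
        counts[x] += 1
    return [value for value, c in enumerate(counts) for _ in range(c)]


def verifier(program, tests):
    """Cheap check: does the candidate sort every test input?"""
    return all(program(list(t)) == sorted(t) for t in tests)


def sample_verify_distill(policy, programs, tests, rounds=20, n_samples=8, seed=0):
    """Sample programs from the policy, keep verified ones, bump their weight."""
    rng = random.Random(seed)
    policy = dict(policy)
    for _ in range(rounds):
        names = list(policy)
        weights = [policy[name] for name in names]
        samples = rng.choices(names, weights=weights, k=n_samples)
        for name in samples:
            if verifier(programs[name], tests):
                policy[name] += 0.1  # crude stand-in for distillation
        total = sum(policy.values())
        policy = {name: p / total for name, p in policy.items()}
    return policy


programs = {"bubble": bubble, "broken": broken, "counting": counting}
prior = {"bubble": 0.5, "broken": 0.5, "counting": 0.0}  # counting is off-manifold
tests = [[3, 1, 2], [5, 4, 4, 0], []]
final = sample_verify_distill(prior, programs, tests)
```

In this toy, `counting` would pass the verifier if it were ever proposed, yet its weight stays at exactly zero because the sampler never proposes it.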

We need to teach the model a more general search procedure, one that explores the full hypothesis space without such a heavy bias toward human thinking (e.g. AlphaZero). People have given up on this because, at large action spaces, approaches such as DQN+MCTS collapse. The idea shouldn't be thrown out just because the implementation of it doesn't scale. But that's what it seems everyone has done.

If we want true AGI, we need models that can think from first principles, branching/exploring in a clever way to go the rest of the distance. Essentially mimicking the scientific method.

Asking the RIGHT question / conducting a CLEVER experiment to reduce the hypothesis space.
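
One very simplified reading of "the right question" is: pick the experiment whose worst-case outcome leaves the fewest hypotheses standing. The Python sketch below illustrates that reading only; the hypothesis set (integers 0 to 99), the yes/no experiment ("is the hidden value at least q?"), and the function names are all assumptions for illustration, not something the post specifies.

```python
def best_experiment(hypotheses, experiments, outcome):
    """Pick the experiment whose worst-case outcome leaves the fewest hypotheses."""
    def worst_case(experiment):
        counts = {}
        for h in hypotheses:
            result = outcome(h, experiment)
            counts[result] = counts.get(result, 0) + 1
        return max(counts.values())
    return min(experiments, key=worst_case)


def identify(truth, hypotheses, experiments, outcome):
    """Run experiments until one hypothesis remains; return (survivors, steps)."""
    hypotheses = list(hypotheses)
    steps = 0
    while len(hypotheses) > 1:
        experiment = best_experiment(hypotheses, experiments, outcome)
        observed = outcome(truth, experiment)
        remaining = [h for h in hypotheses if outcome(h, experiment) == observed]
        if len(remaining) == len(hypotheses):
            break  # no available experiment can separate what is left
        hypotheses = remaining
        steps += 1
    return hypotheses, steps


def at_least(hidden, q):
    """Experiment outcome: is the hidden value >= q?"""
    return hidden >= q


survivors, steps = identify(37, range(100), range(100), at_least)
```

With this setup the greedy choice reduces to bisection, so 100 hypotheses are narrowed to one in at most 7 experiments, whereas asking a poorly chosen question such as "is it at least 0?" eliminates nothing.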

Why do frontier labs not get this yet? Or is this a psyop on us all?
