@FrancoisChauba1: If you train on (unsorted list, bubble sort procedure, sorted list) traces, you will never test time compute (TTC) your…
Summary
A critique arguing that training LLMs on human-generated data limits their ability to discover novel solutions via test-time compute, and that true AGI requires models that can explore hypothesis spaces more broadly, similar to AlphaZero.
View Cached Full Text
Cached at: 05/26/26, 08:56 PM
If you train on (unsorted list, bubble sort procedure, sorted list) traces, you will never test-time compute (TTC) your way to mergesort.
So frontier lab people say "well, we don't just train on one algo, we train on many classes of sort algos, so it should be able to explore the function space of sort".
You are still limited then.
Let's say, for example, that we don't know about non-comparison sorts (radix sort), but we train on all comparison sort algos. Same issue: it won't sample non-comparison sort algos! How would it? It doesn't think orthogonally. But people do!
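To make the distinction concrete, here is a minimal sketch contrasting a comparison sort with a non-comparison one. The function names and the restriction to non-negative integers are assumptions for illustration, not something from the original post. The point is that the radix version never decides order by testing one element against another, so it is not a variation on the bubble sort procedure.

```python
def bubble_sort(items):
    """Comparison sort: order is decided only by pairwise `>` tests."""
    out = list(items)
    for end in range(len(out) - 1, 0, -1):
        for i in range(end):
            if out[i] > out[i + 1]:
                out[i], out[i + 1] = out[i + 1], out[i]
    return out


def radix_sort(items, base=10):
    """LSD radix sort for non-negative integers (assumed input domain).

    Order is never decided by comparing two elements to each other.
    Each pass distributes values into buckets keyed by a single digit.
    """
    out = list(items)
    place = 1
    # Keep going while at least one value still has a digit at this place.
    while any(value // place > 0 for value in out):
        buckets = [[] for _ in range(base)]
        for value in out:
            buckets[(value // place) % base].append(value)
        # Concatenating buckets in order keeps each pass stable.
        out = [value for bucket in buckets for value in bucket]
        place *= base
    return out


data = [170, 45, 75, 90, 802, 24, 2, 66]
print(bubble_sort(data))
print(radix_sort(data))
```

Both functions return the same sorted list, but they reach it through structurally different operations: swaps driven by comparisons versus digit-keyed bucketing.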
OAI STILL thinks this is the path to AGI?!
It can't be.
The modern LLM stack is essentially imitation learning plus a small amount of search via TTC (test-time compute), leveraging the generator-verifier gap to self-distill back into the weights.
This will always confine your search to the training manifold of function space.
That makes novel programs that are much better, but far outside the human manifold, almost impossible to find via TTC.
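A toy sketch of the argument, under loud assumptions: the "generator" below is a hypothetical stand-in that can only emit program names it saw in training, the costs are made-up numbers, and `best_of_n` is a simple best-of-N search with a verifier. It does not model any real LLM. It only shows that a verifier can rank a better program highly yet never get the chance, because the sampler never proposes it.

```python
import random

# Hypothetical training manifold: program name -> made-up cost (lower is better).
TRAIN_MANIFOLD = {
    "bubble_sort": 100,
    "insertion_sort": 80,
    "selection_sort": 90,
}

# A much better program that the generator has zero probability of emitting.
OFF_MANIFOLD = {"merge_sort": 10}


def sample_candidate(rng):
    """Toy generator: samples only from programs seen in training."""
    return rng.choice(sorted(TRAIN_MANIFOLD))


def verifier_score(name):
    """Toy verifier: it can score any program, including off-manifold ones."""
    costs = {**TRAIN_MANIFOLD, **OFF_MANIFOLD}
    return -costs[name]


def best_of_n(n, seed=0):
    """Spend test-time compute: draw n candidates, keep the verifier's favorite."""
    rng = random.Random(seed)
    candidates = [sample_candidate(rng) for _ in range(n)]
    return max(candidates, key=verifier_score)


print(best_of_n(1000))
print(verifier_score("merge_sort") > verifier_score(best_of_n(1000)))
```

Raising `n` improves the result only up to the best program inside the sampler's support. In this toy setup, more compute never reaches `merge_sort`.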
We need to teach the model a more general search procedure that explores the full hypothesis space without such a heavy bias toward human thinking (e.g. AlphaZero). People have given up on this because, in large action spaces, DQN+MCTS-style approaches collapse. The idea shouldn't be thrown out just because the implementation of it doesn't scale. But that's what it seems everyone has done.
If we want true AGI, we need models that can think from first principles, branching/exploring in a clever way to go the rest of the distance. Essentially mimicking the scientific method.
Asking the RIGHT question / conducting a CLEVER experiment to reduce the hypothesis space.
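One way to read "a clever experiment" is as the query whose outcome rules out the most hypotheses. The sketch below is a minimal illustration under assumed toy data: the four candidate functions, the experiment inputs, and the helper names `best_experiment` and `prune` are all hypothetical. It picks the input whose worst-case outcome leaves the fewest hypotheses standing.

```python
from collections import Counter

# Hypothetical hypothesis space: candidate explanations for an unknown function.
HYPOTHESES = {
    "identity": lambda x: x,
    "double": lambda x: 2 * x,
    "square": lambda x: x * x,
    "plus_two": lambda x: x + 2,
}


def best_experiment(hypotheses, experiments):
    """Pick the input whose worst-case outcome leaves the fewest hypotheses."""
    def worst_case_remaining(x):
        # Group hypotheses by what they predict for input x.
        outcomes = Counter(h(x) for h in hypotheses.values())
        return max(outcomes.values())
    return min(experiments, key=worst_case_remaining)


def prune(hypotheses, x, observed):
    """Keep only the hypotheses consistent with the observed result."""
    return {name: h for name, h in hypotheses.items() if h(x) == observed}


x = best_experiment(HYPOTHESES, [0, 1, 2, 3])
print(x)
print(sorted(prune(HYPOTHESES, x, 9)))
```

With these toy hypotheses, the input 0 is a poor experiment because three of the four candidates predict the same result, while the input 3 separates all four in a single observation.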
Why do frontier labs not get this yet? Or is this a psyop on us all?