@DayShuai: Tomorrow I'll volunteer to share my own AI loop at the Yang Zhang lab group meeting. The same OS pattern has produced 3,400+ 0-axiom Lean 4 theorems on automath and newmath in the past six months, with 5×/week automatic releases,...
Summary
The author shares lessons from a personal AI research loop ahead of a talk at the Yang Zhang lab group meeting: automated theorem proving, multi-machine collaboration through GitHub, and distilling a private experience base, with examples of Fields medalists using AI to solve mathematical problems.
Cached at: 05/20/26, 06:26 AM
I’ve volunteered to share my own AI loop at tomorrow’s Yang Zhang lab group meeting. The same OS pattern has recently been running on automath and newmath, producing 3,400+ 0-axiom Lean 4 theorems with auto-releases 5×/week. Multiple papers are under peer review, collaborative projects are progressing in sync, and I’ve also landed my PhD work on TCR / pMHC (JSI-Pi, joint scaffold-interface posterior).
Our lab is a long-established, top-tier research lab: I-TASSER has ranked at the top of the CASP automated server track for years and has 170k+ registered users across 160 countries. But from what I’ve observed, the density of AI tool usage is still far from where it could be. Mainstream use stops at treating ChatGPT as a search / assistant tool, and the things that truly amplify output (agent loops, multi-model collaboration, automated pipelines) are barely deployed.
That’s why I wanted to give this talk. Since I’ve already made the slides, I might as well share them so people can pick something up. Discussion, collaboration, and exchanges are welcome.
A few lessons summarized in the slides:
- How to build an automated loop that reduces hallucination: structure everything that can be structured, gate everything that can be gated, and let AI run only in bounded tasks. Many hallucinations happen not because the model suddenly got worse, but because the task boundary, input schema, output format, and review gate weren’t tightened.
- How to use multiple machines + GitHub for isolated collaboration: private repos work too. Commit every change, roll back any time, everyone’s contribution is visible. Keep multiple machines running their own loops, then merge via branch / PR / merge.
- How to distill your own “experience base”: inspired by gstack’s office-hour approach, I scraped a senior researcher’s publicly available papers, talks, and quotes, combined them with my own long-term notes from many conversations, and had AI distill it all into a “private PhD office-hour.” It helps me judge direction and fill blind spots: when I’m thinking about a model, it prompts me with “what your advisor would probably ask next” or “angles you haven’t considered yet.” In effect, it gives you access to experience beyond your own.
I believe that research taste and intuition can, to some extent, be captured by AI and scaled.
I have two goals for this talk: first, to encourage more people to use today’s top AI; second, to spark lab-wide collaboration in which we produce results together. Individual power and single-agent power have limits; only parallel collaboration multiplies efficiency. Everyone can participate, everyone’s participation is visible on GitHub, direction stays with us, and AI handles the iteration. A paradigm shift is coming (maybe it’s already here), and an AI lab should walk ahead of its time and be good at using tools. Once the collaboration is running, everyone’s output will be faster and better, including my own.
Full slides:
Research Loop · A Research Operating System for the AI Era
Source: https://researchloop.lexaverse.dev/
Lab Talk · 2026-05-20
A research operating system for the AI era
Start with a signal: Fields medalists are already using GPT-5.5 Pro to advance real math conjectures. For a lab that works on AI, that should change how we work.
Three things today: the methodology · pipelines already running in practice · a few suggestions for how the lab could organize.
Lexa · X @DayShuai · GitHub @AlyciaBHZ · it is cheaper to run ahead than to explain yourself
One signal
Fields medalists are solving real math problems with AI
Not a demo, not a benchmark: real progress on real conjectures. If the strongest mathematicians already treat AI as a tool, we have no reason to stay stuck on “will AI replace me.”
Fields medalist · Timothy Gowers · 2026
GPT-5.5 Pro did PhD-level math research in under two hours
Gowers wrote on his own blog: the model did a self-contained piece of PhD-level math research in under two hours, with zero math contribution from him. Reported by the-decoder and others.
Fields medalist · Terence Tao · 2026-04
GPT-5.4 Pro solved Erdős #1196, Tao verified it within 24 hours
A 90-year-old Erdős conjecture, with no known viable method; GPT-5.4 Pro solved it in 80 minutes from one prompt. Tao called it “a meaningful contribution to the anatomy of integers, well beyond the specific problem,” then developed it into the seed of a new mathematical theory.
Other signals from H1 2026
- 2026-01: GPT-5.2 solved Erdős #397 on its own, verified by Tao.
- 2026-01: three Erdős problems solved by AI in one week, all verified by Tao himself.
- 2026 IPAM: Tao said publicly that current AI models are “ready for primetime” — in math and theoretical physics, AI now saves more time than it wastes.
- 2026-04: a 23-year-old amateur used ChatGPT to solve a 60-year-old open problem; Tao said prior researchers had been on the wrong path from the start.
What this means
- A: AI can now reach into frontier math — the hardest, most abstract intellectual work.
- B: Top CEOs and scientists are writing code themselves — because models can now cover much of the implementation layer.
- C: “Can you use AI well?” is now a dividing line, not a bonus.
What it means for our lab
- → We are an AI lab. We should be near the frontier, not behind it.
- → This is a paradigm shift, not a trend — the research workflow itself is being reorganized in the AI era, and the lab should be shaped by that new workflow.
- → Others are already using it. We should use it better — with method, reproducibly.
My methodology
Core idea: don’t let AI improvise — give it bounded tasks
The most stable conclusion I’ve reached: AI does make mistakes — but when you give it a clear, structured, verifiable task, it can work faster and more steadily than a human. The whole point of the pipeline is to break “research” into forms AI can complete reliably.
Principle 1
Structure everything that can be structured
registries / TaskSpec / schemas / gates — split the research process into machine-readable objects. AI reads structure, not chat history.
Principle 2
Gate everything that can be gated
Lean build pass, axiom audit, claim-vs-evidence checks, deterministic gatekeepers. Output that fails a gate does not reach main.
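A minimal sketch of what a deterministic gate chain could look like, assuming each candidate is a plain dict. The names `GateResult`, `run_gates`, `build_gate`, `axiom_gate`, and `evidence_gate`, and the fields `build_exit_code`, `axioms`, and `claims`, are hypothetical illustrations, not the pipeline's actual code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GateResult:
    name: str
    passed: bool
    detail: str = ""


# A gate is a deterministic function from a candidate artifact to a GateResult.
Gate = Callable[[dict], GateResult]


def build_gate(candidate: dict) -> GateResult:
    # Stand-in for "Lean build pass": a real gate would run the build itself
    # and read its exit code instead of trusting a reported field.
    return GateResult("build", candidate.get("build_exit_code") == 0)


def axiom_gate(candidate: dict) -> GateResult:
    # Stand-in for the axiom audit: the reported axiom list must be empty.
    axioms = candidate.get("axioms", ["<unknown>"])
    return GateResult("axiom_audit", axioms == [], f"axioms={axioms}")


def evidence_gate(candidate: dict) -> GateResult:
    # Claim-vs-evidence: every claim must cite at least one piece of evidence.
    missing = [c["id"] for c in candidate.get("claims", []) if not c.get("evidence")]
    return GateResult("claim_vs_evidence", not missing, f"missing={missing}")


def run_gates(candidate: dict, gates: list[Gate]) -> tuple[bool, list[GateResult]]:
    """Run gates in order and stop at the first failure (fail-as-block)."""
    results = []
    for gate in gates:
        result = gate(candidate)
        results.append(result)
        if not result.passed:
            return False, results
    return True, results
```

Only a candidate for which `run_gates` returns `True` would be allowed to reach main in this sketch; the returned results say which gate blocked it.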
Principle 3
Hallucination = unclear task boundary
When AI hallucinates, the task boundary is almost always too broad. Pin it to one TaskSpec, the files it may touch, and the claims it may make — hallucination largely disappears.
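One way to make that boundary machine-checkable, sketched under assumptions: the `TaskSpec` fields shown here (`allowed_paths`, `allowed_claims`) and the helper `boundary_violations` are hypothetical and only illustrate pinning a task to the files it may touch and the claims it may make.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class TaskSpec:
    task_id: str
    goal: str
    allowed_paths: tuple[str, ...]   # glob patterns the agent may modify
    allowed_claims: tuple[str, ...]  # claim ids the agent may assert


def boundary_violations(spec: TaskSpec, touched_files: list[str], claims: list[str]) -> list[str]:
    """Return readable violations; an empty list means the output stayed in bounds."""
    violations = []
    for path in touched_files:
        if not any(fnmatchcase(path, pattern) for pattern in spec.allowed_paths):
            violations.append(f"file outside boundary: {path}")
    for claim in claims:
        if claim not in spec.allowed_claims:
            violations.append(f"claim outside boundary: {claim}")
    return violations
```

A check like this runs on the agent's output, not on its prompt, so an out-of-bounds edit is rejected by plain code regardless of how the model justified it.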
Principle 4
Layered agents: supervise / execute / review
Claude as supervisor · Codex as executor · ChatGPT Pro extended thinking as the driving oracle · Claude runs one more pass as adversarial review.
Authority signal · OpenAI post-training core team
Jiayi Weng (@Trinkle23897) just demonstrated the same direction
He let Codex (GPT-5.4) iteratively rewrite Atari game policy code. The neural net was never retrained, yet the policy’s score climbed from 387 to 864 (the maximum on Breakout), cleared 6000+ on MuJoCo Ant (deep-RL level), and approached the PPO baseline across the full Atari57 suite. Knowledge is not compressed into weights; it is written as code you can read, edit, and lock with tests, so catastrophic forgetting disappears and every step can be audited. This is exactly what my pipeline has been doing: AI writes code, humans set the boundaries.
“Maybe heuristics were not too weak. Maybe they were just too expensive to maintain. Maybe it’s the next paradigm.” — Jiayi Weng · 2026-05-08 · 3.1M views
Expand: real BEDC pipeline · subprocess chain + scored gates + consensus supervisors
supervisor.py · consensus mechanism (not dispatcher) · fail-as-block · multi-channel protection · oracle lifecycle management · BOARD low-water auto-fill targets
Cycle: plain_math_review → auto-apply recommend_probe / recommend_curator
BOARD.md target list · bedc_b-XX rows
Research → handoff_packet_*.md handoff payload · next worker consumes directly
codex_orchestrator.py · GPT-5.5 pro · drives oracle multi-turn
turn 0: Codex writes prompt
ChatGPT Pro reads response · judges progress
turn 1: Codex reads transcript · continues push
ChatGPT Pro → progress_delta JSON
turn 2: ... continues
...
turn N: termination
Four stop signals: BREAKTHROUGH · Q.E.D. · STUCK · 3× low-progress + 12h wall-clock cap
Each turn = Codex reads previous transcript → writes prompt → Oracle responds → Codex judges progress_delta. Not escalation — Codex and Oracle collaborate online every turn.
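The turn loop and its stop conditions can be sketched roughly as follows. This is an illustration under assumptions, not `codex_orchestrator.py`: the function `run_oracle_session`, the reply fields `text`, `signal`, and `progress_delta`, and the `low_progress_threshold` value are all hypothetical, and the prompt-writing and judging steps are folded into a single `ask_oracle` callable for brevity.

```python
from typing import Callable

# Explicit stop signals named on the slide.
STOP_SIGNALS = {"BREAKTHROUGH", "Q.E.D.", "STUCK"}


def run_oracle_session(
    ask_oracle: Callable[[list[str]], dict],
    clock: Callable[[], float],
    low_progress_threshold: float = 0.1,  # assumed cutoff, not from the slides
    max_low_progress_turns: int = 3,
    wall_clock_cap_s: float = 12 * 3600,
) -> tuple[str, list[str]]:
    """Drive turns until a stop condition fires; return (reason, transcript)."""
    transcript: list[str] = []
    low_streak = 0
    start = clock()
    while True:
        # Hard cap: checked before every turn so a session cannot run forever.
        if clock() - start >= wall_clock_cap_s:
            return "WALL_CLOCK_CAP", transcript
        # One turn: the caller builds a prompt from the transcript, queries the
        # oracle, and returns the reply together with a judged progress_delta.
        reply = ask_oracle(transcript)
        transcript.append(reply["text"])
        if reply.get("signal") in STOP_SIGNALS:
            return reply["signal"], transcript
        # Count consecutive low-progress turns; any real progress resets the streak.
        if reply["progress_delta"] < low_progress_threshold:
            low_streak += 1
        else:
            low_streak = 0
        if low_streak >= max_low_progress_turns:
            return "LOW_PROGRESS", transcript
```

In real use `clock` could be `time.monotonic`; passing it in as a parameter keeps the loop testable without waiting on a real clock.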
Stage 1.5 · topic discovery (auto-fill BOARD)
Codex scans full transcript for adjacent claims → gate: fit_score ≥ 7, novelty ≥ 6 → new B-YY rows appended to BOARD
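The admission gate itself is a simple threshold filter. A minimal sketch, assuming each adjacent claim arrives already scored: the `Candidate` type and `admit_to_board` function are hypothetical, and skipping claims that are already on the board is an added assumption, not something the slides state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    claim: str
    fit_score: int
    novelty: int


def admit_to_board(candidates, existing_claims, min_fit=7, min_novelty=6):
    """Keep candidates that clear both thresholds and are not already listed."""
    seen = set(existing_claims)
    admitted = []
    for c in candidates:
        if c.fit_score >= min_fit and c.novelty >= min_novelty and c.claim not in seen:
            admitted.append(c)
            seen.add(c.claim)  # also guards against duplicates within one scan
    return admitted
```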
Stage 2 · Claude killo-golden writeback · independent review gate
logic_packet_gate: deterministic code, not AI
10-point hygiene: Claude applies checklist
pdflatex compile: real compilation, not lint
fail → retry → BLOCKED
Pass → append to papers/bedc/parts/**/*.tex. Two failures → no force push: the packet stays out of main, its state is left in place, and it is written to closure_candidates
papers/bedc/parts/**/*.tex compiles into main LaTeX → release ledger · 5×/week auto-release
- BOARD → packet: pick a target from BOARD.md, run a research pass, and output _packet_*.md, a structured handoff payload. The next worker does not read natural language; it consumes this object.
- Codex orchestrates Oracle: each turn Codex writes the prompt, ChatGPT Pro pushes forward, Codex judges progress_delta. Not “Codex gets stuck then escalates”: Codex and Oracle are online together every turn.
- Four stop signals: BREAKTHROUGH / Q.E.D. / STUCK / 3 consecutive low-progress turns + 12h wall-clock hard cap. Codex decides termination, no human intervention.
- Stage 1.5 topic discovery: Codex scans the full transcript for adjacent claims, passes them through a small gate (fit_score ≥ 7, novelty ≥ 6), and auto-appends new candidates to BOARD, so the pipeline grows its own targets.
- Stage 2 Claude independent review: first the deterministic logic_packet_gate (pure code, no AI), then the 10-point hygiene checklist, then a real pdflatex compile. Fail → retry once → BLOCKED if still failing. Never force-push.
- Supervisor is a consensus mechanism, not a dispatcher: fail-as-block + multi-channel protection + oracle lifecycle management (auto-restart with 30s backoff) + periodic Claude tier-3 review. No single point of failure; no step that only runs while a human watches.
- Early on, every producer had a paired reviewer agent. Then I found that if the process is explicit enough, you don’t need that layer at all: the process definition itself becomes the review.
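The Stage 2 rule "fail → retry once → BLOCKED" can be sketched as a small state machine. This is an assumption-laden illustration: `ReviewState`, `review_with_retry`, and the `repair` callback are hypothetical names, and each check stands in for a real step such as the logic gate, the hygiene checklist, or the pdflatex compile.

```python
from enum import Enum
from typing import Callable


class ReviewState(Enum):
    PASSED = "PASSED"
    BLOCKED = "BLOCKED"


def review_with_retry(
    packet: dict,
    checks: list[Callable[[dict], bool]],
    repair: Callable[[dict], dict],
    max_attempts: int = 2,  # first try plus one retry
):
    """Run every check; on failure repair once and retry; BLOCKED if it still fails."""
    attempts = 0
    current = packet
    while attempts < max_attempts:
        attempts += 1
        if all(check(current) for check in checks):
            return ReviewState.PASSED, current, attempts
        if attempts < max_attempts:
            current = repair(current)
    # Nothing is pushed on this path: the caller records the BLOCKED packet
    # instead of forcing it into main.
    return ReviewState.BLOCKED, current, attempts
```

Returning a state instead of raising keeps BLOCKED a normal, recorded outcome that a supervisor can act on later.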
vague → concrete · onboarding
The vision module: turning vague ideas into concrete work
The pipeline above is the skeleton. How vague “ideas” become a shape the pipeline can digest — that’s what the vision module handles. AI finds the internal links between ideas and judges when an idea is mature enough to enter main. Ideas are vague by nature; AI’s job is to connect them to the existing structure, then decide whether they are ready to land.
Input
A vague idea
A feeling, a direction, an intuition you can’t fully articulate.
Middle
Find the internal links
AI places it inside the existing repo context and looks for interfaces with existing theorems / modules / targets.
Judgment
When it’s ready to land
Not every idea gets in — AI helps judge whether it is concrete enough to land, and only then does it move into claim packet → formalization → main.
This is unusually friendly to new interns
- For a new intern, the fastest onramp is not reading 50 papers. It is asking AI: “what is this lab actually working on right now?”
- If they have an idea of their own, they can hand it to AI and ask it to structure the idea and link it to existing projects.
- One step further: their own AI can participate directly in our repos (PR + review).
- What they pick up isn’t just “how to do research” — it’s “how to do research inside an AI-collaboration environment.” That’s the core skill of the next 5 years.
We use the vision module ourselves too — NotebookLM auto-synthesizes deep-dive audio from our own papers, and new vague ideas surface while listening, then loop back into vision. That is also why one of the four lanes on the next slide is dedicated to NotebookLM synthesis — for outside readers, and for us.
vision → pipeline → vision · the loop
A pipeline already proven
newmath / automath: evidence of stable output
The numbers below are not a plan — they have been live on GitHub for months: real commits, real releases, real papers, real automated distribution.
3,427+ formally verified theorems
0 axiom
0 sorry
mathlib-free
Lean 4
5×/week auto-release
42 papers in pipeline (incl. collabs)
3 P7 papers ready to submit
16 AI agent roles
4 parallel lanes
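For readers unfamiliar with what “0 axiom · 0 sorry · mathlib-free” means in practice, here is a small illustrative Lean 4 example. The theorem `and_swap_example` is a hypothetical toy, not one of the repository’s theorems; it uses core Lean only, and the `#print axioms` command is the standard way to audit which axioms a proof depends on.

```lean
-- A small theorem proved with core Lean 4 only (no Mathlib import).
theorem and_swap_example (p q : Prop) (h : p ∧ q) : q ∧ p :=
  ⟨h.2, h.1⟩

-- Ask Lean which axioms the proof depends on.
#print axioms and_swap_example
```

For a proof like this one, Lean reports that the theorem does not depend on any axioms. A proof containing `sorry` would instead list `sorryAx`, which is why an axiom audit can serve as a deterministic gate.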
Four lanes actually running locally
lane 1 · BEDC deep push
The core theory keeps advancing
Local work on the BEDC core theory. Task shape: “given existing theorems, what is the next thing to extract?” Increments land on main daily. The pipeline on the previous slide is exactly this lane’s real implementation.
ChatGPT extended thinking = oracle · Codex = orchestrator + lands Lean · Claude = supervisor + adversarial review
lane 2 · open target
Aim at a known open problem
Pick an external open problem as the target; the whole lane pushes / formalizes / writes around it. Same shape as the BEDC lane — BEDC is the “internal frontier”, open target is the “external target”.