@DayShuai: Tomorrow I'll volunteer to share my own AI loop at the Yang Zhang lab group meeting. The same OS pattern has produced 3,400+ 0-axiom Lean 4 theorems on automath and newmath in the past six months, with 5×/week automatic releases,...

X AI KOLs Timeline 05/19/26, 07:20 PM Tools

ai-loop lean4 theorem-proving agent-loop research-tools collaboration automation

Summary

The author shares lessons from running an AI research loop, to be presented at the Yang Zhang lab group meeting. Topics include automated theorem proving, multi-machine collaboration, and distilling a private experience base, along with examples of Fields medalists using AI to solve mathematical problems.

Tomorrow I'll volunteer to share my own AI loop at the Yang Zhang lab group meeting. The same OS pattern has produced 3,400+ 0-axiom Lean 4 theorems on automath and newmath in the past six months, with 5×/week automatic releases, multiple articles under peer review, and collaborative projects progressing in parallel. It has also been applied to my PhD work on TCR / pMHC (JSI-Pi, joint scaffold-interface posterior).

Our lab is a long-established, top-tier research lab: I-TASSER has been at the top of the CASP automated server category for a long time, with 170k+ registered users from 160 countries. But from my observation, the density of AI tool usage is still far from where it could be. The mainstream usage is still treating ChatGPT as a search / assistant tool, and the agent loops, multi-model collaboration, and automated pipelines that can truly amplify output are barely being adopted. That's why I want to give this talk. Since I've already made the slides, I might as well share them; hopefully you can get something out of them. Feel free to discuss and collaborate with me.

Here are some lessons I've summarized:

- How to build an automated loop that reduces hallucinations: structure everything that can be structured, gate everything that can be gated, and let AI only run within bounded tasks. Many hallucinations happen not because the model suddenly became worse, but because the task boundaries, input schema, output format, and review gates were not tightened.
- How to use multiple machines + GitHub for isolated collaboration: even private repos work; commit every change, roll back anytime, and visualize everyone's contributions. Keep multiple machines running, each running its own loop, and finally converge through branch / PR / merge.
- How to distill your own "experience base": Inspired by gstack's office-hour idea, I scraped a senior researcher's publicly published papers, talks, and quotes, added my own long-term notes from conversations with many people, and fed them to AI to distill a "private PhD office-hour". It helps me judge direction and fill blind spots: when I'm thinking about a model, it prompts me with "the next question your advisor would probably ask" or "an angle you haven't considered yet". This is experience that goes beyond one's own.

I believe that research taste and intuition can, to some extent, be captured and scaled by AI.

I have two purposes for this talk: first, to encourage more people to make use of top AI; second, to get the lab to collaborate and produce together. The power of an individual and a single agent is limited; only through parallel collaboration can efficiency be multiplied. Everyone can participate, everyone's contributions are visualized on GitHub, we control the direction, and we let AI iterate. A paradigm shift is bound to happen (maybe it already has), and AI labs should be ahead of the curve in using tools well. Once collaboration runs, everyone's output will be faster and better, including mine.

Full slides:

Cached at: 05/20/26, 06:26 AM

Tomorrow I’ll volunteer to share my own AI loop at Yang Zhang lab’s group meeting. The same OS pattern has recently been running on automath and newmath, producing 3,400+ 0-axiom Lean 4 theorems and auto-releasing 5×/week. Multiple papers are under peer review, collaborative projects are progressing in sync, and I’ve also applied the pattern to my PhD work on TCR / pMHC (JSI-Pi, joint scaffold-interface posterior). Our lab is a long-established top-tier research lab: I-TASSER has been top in the CASP automated server track for years, with 170k+ registered users across 160 countries. But from what I’ve observed, the density of AI tool usage is still far from where it could be; mainstream use stops at treating ChatGPT as a search / assistant tool. The things that truly amplify output, such as agent loops, multi-model collaboration, and automated pipelines, are barely deployed. That’s why I wanted to do this talk. Since I already made the slides, I might as well share them so people can pick up something, and I welcome discussion, collaboration, and exchanges.

A few lessons summarized in the slides:

- How to build an automated loop that reduces hallucination: structure everything that can be structured, gate everything that can be gated, and let AI run only in bounded tasks. Many hallucinations happen not because the model suddenly got worse, but because the task boundary, input schema, output format, and review gate weren’t tightened.
- How to use multiple machines + GitHub for isolated collaboration: private repos work too. Commit every change, roll back any time, and everyone’s contribution is visible. Keep multiple machines running their own loops, then converge via branch / PR / merge.
- How to distill your own “experience base”: inspired by gstack’s office-hour approach, I scraped a senior researcher’s publicly published papers, talks, and quotes, combined them with my own long-term notes from many conversations, and fed it all to AI to distill into a “private PhD office-hour.” It helps me judge direction and fill blind spots: when I’m thinking about a model, it prompts me with “what your advisor would probably ask next” or “angles you haven’t considered yet.” This is experience that goes beyond your own.

I believe that research taste and intuition can, to some extent, be captured by AI and scaled.

I have two goals for this talk: one, to encourage more people to use today’s top AI; two, to spark lab-wide collaboration that produces together. Individual power and single-agent power have limits — only parallel collaboration multiplies efficiency. Everyone can participate, everyone’s participation is visible on GitHub, direction stays with us, and AI handles the iteration. A paradigm shift is coming (maybe it’s already here), and an AI lab should walk ahead of its time and be good at using tools. Once collaboration runs, everyone’s output will be faster and better — including my own.

Full slides:

Research Loop · A Research Operating System for the AI Era

Source: https://researchloop.lexaverse.dev/

Lab Talk · 2026-05-20

A research operating system for the AI era

Start with a signal: Fields medalists are already using GPT-5.5 Pro to advance real math conjectures. For a lab that works on AI, that should change how we work.

Three things today: the methodology · pipelines already running in practice · a few suggestions for how the lab could organize.

Lexa · X @DayShuai · GitHub @AlyciaBHZ · it is cheaper to run ahead than to explain yourself

One signal

Fields medalists are solving real math problems with AI

Not a demo, not a benchmark: real progress on real conjectures. If the strongest mathematicians already treat AI as a tool, we have no reason to stay stuck on “will AI replace me.”

Fields medalist · Timothy Gowers · 2026

GPT-5.5 Pro did PhD-level math research in under two hours

Gowers wrote on his own blog: the model did a self-contained piece of PhD-level math research in under two hours, with zero math contribution from him. Reported by the-decoder and others.

Fields medalist · Terence Tao · 2026-04

GPT-5.4 Pro solved Erdős #1196, Tao verified it within 24 hours

A 90-year-old Erdős conjecture, with no known viable method; GPT-5.4 Pro solved it in 80 minutes from one prompt. Tao called it “a meaningful contribution to the anatomy of integers, well beyond the specific problem,” then developed it into the seed of a new mathematical theory.

Other signals from H1 2026

2026-01: GPT-5.2 solved Erdős #397 on its own, verified by Tao.
2026-01: three Erdős problems solved by AI in one week, all verified by Tao himself.
2026 IPAM: Tao said publicly that current AI models are “ready for primetime” — in math and theoretical physics, AI now saves more time than it wastes.
2026-04: a 23-year-old amateur used ChatGPT to solve a 60-year-old open problem; Tao said prior researchers had been on the wrong path from the start.

What this means

A: AI can now reach into frontier math — the hardest, most abstract intellectual work.
B: Top CEOs and scientists are writing code themselves — because models can now cover much of the implementation layer.
C: “Can you use AI well?” is now a dividing line, not a bonus.

What it means for our lab

→ We are an AI lab. We should be near the frontier, not behind it.
→ This is a paradigm shift, not a trend — the research workflow itself is being reorganized in the AI era, and the lab should be shaped by that new workflow.
→ Others are already using it. We should use it better — with method, reproducibly.

My methodology

Core idea: don’t let AI improvise — give it bounded tasks

The most stable conclusion I’ve reached: AI does make mistakes — but when you give it a clear, structured, verifiable task, it can work faster and more steadily than a human. The whole point of the pipeline is to break “research” into forms AI can complete reliably.

Principle 1

Structure everything that can be structured

registries / TaskSpec / schemas / gates — split the research process into machine-readable objects. AI reads structure, not chat history.

Principle 2

Gate everything that can be gated

Lean build pass, axiom audit, claim-vs-evidence checks, deterministic gatekeepers. Output that fails a gate does not reach main.
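
As a small illustration of what an axiom audit gate might look at, here is a sketch in plain Lean 4 (no mathlib). The two theorem names are hypothetical and are not taken from newmath or automath. The `#print axioms` command reports which axioms a declaration depends on, so a gate could reject any theorem whose report is non-empty.

```lean
-- Holds by definitional unfolding, so no axioms should be needed.
theorem add_zero_demo (n : Nat) : n + 0 = n := rfl

-- Uses excluded middle, which brings in classical axioms.
theorem em_demo (p : Prop) : p ∨ ¬p := Classical.em p

#print axioms add_zero_demo  -- expected: reports no axiom dependencies
#print axioms em_demo        -- expected: lists classical axioms such as Classical.choice
```

Under a 0-axiom policy, the first theorem would pass the audit and the second would be blocked.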

Principle 3

Hallucination = unclear task boundary

When AI hallucinates, the task boundary is almost always too broad. Pin it to one TaskSpec, the files it may touch, and the claims it may make — hallucination largely disappears.
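
One way to make "the files it may touch, and the claims it may make" checkable is a small data object plus a deterministic check. The sketch below is only an illustration: the `TaskSpec` fields, the `violations` helper, and the example paths and claim IDs are all assumptions, not the actual schema used in the pipeline.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical bounded-task description; field names are illustrative."""
    task_id: str
    allowed_paths: tuple[str, ...]   # glob patterns the agent may touch
    allowed_claims: frozenset[str]   # claim IDs the agent may assert


def violations(spec: TaskSpec, touched_files: list[str], claims: list[str]) -> list[str]:
    """Return boundary violations; an empty list means the output is in scope."""
    problems = []
    for path in touched_files:
        if not any(fnmatch(path, pattern) for pattern in spec.allowed_paths):
            problems.append(f"file outside task boundary: {path}")
    for claim in claims:
        if claim not in spec.allowed_claims:
            problems.append(f"claim not authorized by task: {claim}")
    return problems


# Example with made-up paths and claim IDs.
spec = TaskSpec(
    task_id="demo-001",
    allowed_paths=("lean/Demo/*.lean",),
    allowed_claims=frozenset({"lemma_a"}),
)
in_scope = violations(spec, ["lean/Demo/Basic.lean"], ["lemma_a"])
out_of_scope = violations(spec, ["README.md"], ["lemma_a", "theorem_b"])
```

Because the check is plain code, it can run as a gate before any agent output is considered for merge.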

Principle 4

Layered agents: supervise / execute / review

Claude as supervisor · Codex as executor · ChatGPT Pro extended thinking as the driving oracle · Claude runs one more pass as adversarial review.

Authority signal · OpenAI post-training core team

Jiayi Weng (@Trinkle23897) just demonstrated the same direction

Codex (GPT-5.4) iteratively rewrote Atari game policy code. The neural net was never retrained, yet the policy's score climbed from 387 to 864 on Breakout (the maximum), cleared 6000+ on MuJoCo Ant (deep-RL level), and approached the PPO baseline across the full Atari57 suite. Knowledge is not compressed into weights; it is written as code you can read, edit, and lock with tests, so catastrophic forgetting disappears and every step can be audited. This is exactly what my pipeline has been doing: AI writes code, humans set the boundaries.

“Maybe heuristics were not too weak. Maybe they were just too expensive to maintain. Maybe it’s the next paradigm.” — Jiayi Weng · 2026-05-08 · 3.1M views

Expand: real BEDC pipeline · subprocess chain + scored gates + consensus supervisors

supervisor.py · consensus mechanism (not dispatcher) · fail-as-block · multi-channel protection · oracle lifecycle management · BOARD low-water auto-fill targets

Cycle: plain_math_review → auto-apply recommend_probe / recommend_curator

BOARD.md target list · bedc_b-XX rows

Research → handoff_packet_*.md handoff payload · next worker consumes directly

codex_orchestrator.py · GPT-5.5 pro · drives oracle multi-turn

turn 0: Codex writes prompt
         ChatGPT Pro reads response · judges progress
turn 1: Codex reads transcript · continues push
         ChatGPT Pro → progress_delta JSON
turn 2: ... continues
...
turn N: termination

Four stop signals: BREAKTHROUGH · Q.E.D. · STUCK · 3× low-progress + 12h wall-clock cap

Each turn = Codex reads previous transcript → writes prompt → Oracle responds → Codex judges progress_delta. Not escalation — Codex and Oracle collaborate online every turn.
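
The stop signals can be pictured as one pure function over the turn history. This is a minimal sketch, not the code in codex_orchestrator.py: the `Turn` shape, the `low_progress` flag (assumed to be derived from progress_delta), the `"CONTINUE"` label, and the order in which the signals are checked are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

LOW_PROGRESS_LIMIT = 3            # consecutive low-progress turns before stopping
WALL_CLOCK_CAP_S = 12 * 60 * 60   # 12h hard cap, in seconds
TERMINAL_VERDICTS = {"BREAKTHROUGH", "QED", "STUCK"}


@dataclass
class Turn:
    verdict: str        # "BREAKTHROUGH", "QED", "STUCK", or "CONTINUE" (assumed labels)
    low_progress: bool  # assumed flag derived from the turn's progress_delta


def stop_reason(turns: list[Turn], started_at: float, now: float) -> Optional[str]:
    """Return the stop signal that fires, or None to keep going."""
    if now - started_at >= WALL_CLOCK_CAP_S:
        return "WALL_CLOCK_CAP"
    if turns and turns[-1].verdict in TERMINAL_VERDICTS:
        return turns[-1].verdict
    recent = turns[-LOW_PROGRESS_LIMIT:]
    if len(recent) == LOW_PROGRESS_LIMIT and all(t.low_progress for t in recent):
        return "LOW_PROGRESS"
    return None
```

Keeping the decision in a side-effect-free function makes it easy to unit test and to replay against old transcripts.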

Stage 1.5 · topic discovery (auto-fill BOARD)
Codex scans full transcript for adjacent claims → gate: fit_score ≥ 7, novelty ≥ 6 → new B-YY rows appended to BOARD
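
The discovery gate is a simple threshold filter. In the sketch below the two thresholds come from the slide; everything else is an assumption for illustration, including the `Candidate` shape, integer scores, and the extra rule that skips claims already on the board.

```python
from dataclasses import dataclass

FIT_MIN = 7       # threshold from the slide: fit_score >= 7
NOVELTY_MIN = 6   # threshold from the slide: novelty >= 6


@dataclass(frozen=True)
class Candidate:
    """Hypothetical shape of an adjacent claim found in a transcript."""
    claim: str
    fit_score: int
    novelty: int


def passes_gate(c: Candidate) -> bool:
    """True when the candidate clears both thresholds."""
    return c.fit_score >= FIT_MIN and c.novelty >= NOVELTY_MIN


def select_new_rows(candidates: list[Candidate], on_board: set[str]) -> list[Candidate]:
    """Keep candidates that clear the gate and are not already on the board.

    Skipping claims already on the board is an added assumption, not from the slides.
    """
    return [c for c in candidates if passes_gate(c) and c.claim not in on_board]


found = [
    Candidate("claim A", fit_score=8, novelty=6),
    Candidate("claim B", fit_score=9, novelty=5),   # fails the novelty threshold
    Candidate("claim C", fit_score=7, novelty=9),   # already on the board
]
accepted = select_new_rows(found, on_board={"claim C"})
```

Only "claim A" survives here; the other two are dropped without any model call.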

Stage 2 · Claude killo-golden writeback · independent review gate
logic_packet_gate: deterministic code, not AI
10-point hygiene: Claude applies checklist
pdflatex compile: real compilation, not lint
fail → retry → BLOCKED
pass → append to papers/bedc/parts/**/*.tex · two failures → no force push: the packet stays out of main, keeps its BLOCKED state, and is written to closure_candidates
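
The Stage 2 sequence (deterministic gate, then checklist, then a real compile, with one retry before BLOCKED) can be sketched as an ordered list of checks. This is an assumption-laden illustration: `review`, `first_failure`, and `pdflatex_compiles` are hypothetical names, and in a real loop the producer would presumably revise the packet between attempts. The `pdflatex` flags used are standard ones.

```python
import subprocess
from typing import Callable, Optional

Check = Callable[[], bool]


def pdflatex_compiles(tex_path: str) -> bool:
    """Run a real pdflatex compile; a missing binary counts as a failed check."""
    try:
        result = subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", tex_path],
            capture_output=True,
        )
    except FileNotFoundError:
        return False
    return result.returncode == 0


def first_failure(checks: list[tuple[str, Check]]) -> Optional[str]:
    """Run checks in order and return the name of the first one that fails."""
    for name, check in checks:
        if not check():
            return name
    return None


def review(checks: list[tuple[str, Check]], max_attempts: int = 2) -> tuple[str, Optional[str]]:
    """Return ("PASSED", None), or ("BLOCKED", failing_check) after max_attempts tries."""
    failed: Optional[str] = None
    for _ in range(max_attempts):
        failed = first_failure(checks)
        if failed is None:
            return ("PASSED", None)
    return ("BLOCKED", failed)


# Example with stand-in checks: one check fails once, then passes on the retry.
calls = {"n": 0}

def flaky_check() -> bool:
    calls["n"] += 1
    return calls["n"] >= 2

retried = review([("logic_packet_gate", lambda: True), ("hygiene", flaky_check)])
blocked = review([("logic_packet_gate", lambda: False)])
```

Putting the cheapest deterministic check first means a malformed packet is rejected before any model time or compile time is spent.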

papers/bedc/parts/**/*.tex compiles into main LaTeX → release ledger · 5×/week auto-release

BOARD → packet: pick target from BOARD.md, run research pass, output _packet_*.md — structured handoff payload. Next worker does not read natural language; consumes this object.
Codex orchestrates Oracle: each turn Codex writes prompt, ChatGPT Pro pushes forward, Codex judges progress_delta. Not “Codex gets stuck then escalates” — Codex and Oracle are online together every turn.
Four stop signals: BREAKTHROUGH / Q.E.D. / STUCK / 3 consecutive low-progress turns + 12h wall-clock hard cap. Codex decides termination, no human intervention.
Stage 1.5 topic discovery: Codex scans full transcript for adjacent claims, passes through small gate (fit_score ≥ 7, novelty ≥ 6), auto-appends new candidates to BOARD — pipeline grows its own targets.
Stage 2 Claude independent review: first deterministic logic_packet_gate (pure code, no AI), then 10-point hygiene checklist, then real pdflatex compile. Fail → retry once → BLOCKED if still failing. Never force-push.
Supervisor is a consensus mechanism, not a dispatcher: fail-as-block + multi-channel protection + oracle lifecycle management (auto-restart with 30s backoff) + periodic Claude tier-3 review. There is no single point of failure and no step that only runs while a human watches.
Early on, every producer had a paired reviewer agent. Then I found that if the process is explicit enough, you don’t need that layer at all. The process definition itself becomes the review.
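
The "auto-restart with 30s backoff" part of the supervisor can be sketched as a small restart loop around a subprocess. This is not the logic in supervisor.py: the `supervise` function, the restart limit, and the injectable `sleep` parameter are assumptions added for illustration and testability.

```python
import subprocess
import sys
import time
from typing import Callable


def supervise(
    cmd: list[str],
    max_restarts: int,
    backoff_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run cmd and restart it after a non-zero exit, waiting backoff_s between runs.

    Returns the exit code of the last run.
    """
    restarts = 0
    while True:
        code = subprocess.run(cmd).returncode
        if code == 0 or restarts >= max_restarts:
            return code
        restarts += 1
        sleep(backoff_s)


# Example: a worker that always fails, with sleeping replaced by a recorder.
waits: list[float] = []
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
final_code = supervise(failing, max_restarts=2, sleep=waits.append)
ok_code = supervise([sys.executable, "-c", "pass"], max_restarts=2, sleep=waits.append)
```

A fixed backoff keeps the behaviour easy to reason about; capping the number of restarts avoids spinning forever on a worker that cannot recover.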

vague → concrete · onboarding

The vision module: turning vague ideas into concrete work

The pipeline above is the skeleton. How vague “ideas” become a shape the pipeline can digest — that’s what the vision module handles. AI finds the internal links between ideas and judges when an idea is mature enough to enter main. Ideas are vague by nature; AI’s job is to connect them to the existing structure, then decide whether they are ready to land.

Input

A vague idea

A feeling, a direction, an intuition you can’t fully articulate.

Middle

Find the internal links

AI places it inside the existing repo context and looks for interfaces with existing theorems / modules / targets.

Judgment

When it’s ready to land

Not every idea gets in — AI helps judge whether it is concrete enough to land, and only then does it move into claim packet → formalization → main.

This is unusually friendly to new interns

For a new intern, the fastest onramp is not reading 50 papers. It is asking AI: “what is this lab actually working on right now?”
If they have an idea of their own, they can hand it to AI and ask it to structure the idea and link it to existing projects.
One step further: their own AI can participate directly in our repos (PR + review).
What they pick up isn’t just “how to do research” — it’s “how to do research inside an AI-collaboration environment.” That’s the core skill of the next 5 years.

We use the vision module ourselves too — NotebookLM auto-synthesizes deep-dive audio from our own papers, and new vague ideas surface while listening, then loop back into vision. That is also why one of the four lanes on the next slide is dedicated to NotebookLM synthesis — for outside readers, and for us.

vision → pipeline → vision · the loop

A pipeline already proven

newmath / automath: evidence of stable output

The numbers below are not a plan — they have been live on GitHub for months: real commits, real releases, real papers, real automated distribution.

3,427+ formally verified theorems
0 axiom
0 sorry
mathlib-free
Lean 4
5×/week auto-release
42 papers in pipeline (incl. collabs)
3 P7 papers ready to submit
16 AI agent roles
4 parallel lanes

Four lanes actually running locally

lane 1 · BEDC deep push

The core theory keeps advancing

Local work on the BEDC core theory. Task shape: “given existing theorems, what is the next thing to extract?” Increments land on main daily. The pipeline on the previous slide is exactly this lane’s real implementation.

ChatGPT extended thinking = oracle · Codex = orchestrator + lands Lean · Claude = supervisor + adversarial review

lane 2 · open target

Aim at a known open problem

Pick an external open problem as the target; the whole lane pushes / formalizes / writes around it. Same shape as the BEDC lane — BEDC is the “internal frontier”, open target is the “external target”.