@elliotchen100: https://x.com/elliotchen100/status/2054008474082918614
Summary
The article analyzes Andrej Karpathy's perspective on using HTML as an output format for LLMs, exploring the evolution of human-computer interaction from a neuroscience viewpoint. The author argues that although the future may shift toward neural simulation, HTML will likely remain a best practice for human-AI collaboration in the near to medium term due to its engineering maintainability and low cost.
Cached at: 05/12/26, 04:49 AM
Karpathy Elevated HTML to the Level of Neuroscience, But Left Two Loopholes
Early this morning (5:00 PM Beijing Time, May 12), Andrej Karpathy reposted an article by Thariq from three days ago regarding HTML output formats, adding his own reflections.
In less than five hours, it garnered nearly 1 million views, 10,000 likes, and 10,000 bookmarks. This thread will continue to gain traction.
I had just written an analysis of Thariq’s article three days prior. Reading Karpathy’s post alongside mine reveals an interesting dynamic: we are discussing the same phenomenon, but he approached it from a more fundamental perspective.
This article aims to do three things: first, extract the core arguments from Karpathy’s post; second, compare his angle with mine from three days ago to highlight the differences in perspective; and third, identify the two unresolved gaps in Karpathy’s argument, which represent promising avenues for future discussion.
Karpathy’s Original Post
https://x.com/karpathy/status/2053872850101285137
If you don’t see the embedded tweet above on your X account, here are the key points I’ve extracted:
- Practical Tip: Append “structure your response as HTML” to your prompt, then open the generated file in a browser. You can even try this for slideshows.
- Core Framing: Audio is the input modality humans prefer when talking to AI, while vision is the preferred modality for the AI’s output back to humans.
- Neuroscience Backing: Approximately one-third of the brain is dedicated to visual processing—it’s a 10-lane highway for information entering the brain.
- Evolutionary Sequence of Output Formats: raw text → Markdown → HTML → neural simulation.
- End State: Interactive video directly generated by diffusion neural networks, interwoven with Software 1.0 programmable components (e.g., interactive simulations).
- Input-Side Caveat: He himself noted that audio, text, and video are insufficient; “I need pointing/gestures.”
- TL;DR: The mind-meld between human and AI input/output is still evolving. We are far from BCI/Neuralink. For now, the hot tip is to try having models output HTML.
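The practical tip in the first bullet can be sketched in a few lines of Python. This is a minimal illustration, not Karpathy’s own setup: `fake_model` is a hypothetical stand-in for whatever LLM client you actually use, and the file name `response.html` is an arbitrary choice.

```python
import tempfile
import webbrowser
from pathlib import Path

# The instruction quoted in the tip, appended to the end of the query.
HTML_INSTRUCTION = "Structure your response as HTML."


def build_prompt(question: str) -> str:
    """Append the HTML instruction to the end of the user's question."""
    return f"{question.rstrip()}\n\n{HTML_INSTRUCTION}"


def save_html(html: str, directory: Path, name: str = "response.html") -> Path:
    """Write the model's HTML reply to disk and return the file path."""
    path = directory / name
    path.write_text(html, encoding="utf-8")
    return path


def fake_model(prompt: str) -> str:
    """Hypothetical placeholder for a real LLM call; swap in your own client."""
    return "<!doctype html><html><body><h1>Demo</h1><p>Placeholder reply.</p></body></html>"


def demo() -> None:
    """Build the prompt, get a reply, save it, and open it in the default browser."""
    prompt = build_prompt("Compare Markdown and HTML as LLM output formats.")
    out = save_html(fake_model(prompt), Path(tempfile.mkdtemp()))
    webbrowser.open(out.resolve().as_uri())
```

Calling `demo()` opens the placeholder page in a browser; with a real model behind `fake_model`, the same three steps reproduce the workflow described in the tip.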
What I Said Three Days Ago Collides with Karpathy’s Post
Here is the original post I wrote three days ago (citing the same article by Thariq):
https://x.com/elliotchen100/status/2052913108616954215
The core sentence in my post was:
The implicit assumption of Markdown is that “humans will read from start to finish.” The implicit assumption of HTML is that “humans just want to scan for key points and make edits.” The latter aligns with the true nature of human-machine collaboration in the AI era.
Looking at both posts together, one thing becomes clear:
I was talking about user behavior; Karpathy was talking about neuroscience.
From the perspective of “how humans use Markdown vs. HTML,” I observed that “HTML allows for scanning and editing.” Karpathy, from the perspective of “how the brain processes information,” explained “why humans prefer scanning and editing over reading line by line.”
He provided the underlying explanation for why my observation was correct.
This isn’t to say we were in conflict. On the contrary, observations from two different levels converging on the same fact reinforce each other. My post stated “this is happening”; his post explained “why this is happening.”
The First Loophole Left by Karpathy: Output Is Defined, Input Is Not
Karpathy briefly mentioned “I need pointing/gestures” in his post but didn’t fully draw that thread out.
Human-computer interaction involves two parallel tracks:
The output track (machine-to-human) already has a roadmap that extends all the way to the distant form of “interactive neural simulation.”
The input track (human-to-machine) has only reached “voice.” The next step—“voice + gestures + eye tracking + context”—has not yet been fully realized by any product. Beyond that, it becomes even blurrier. BCI is a placeholder, not a viable next step.
My assessment is that the true inflection point for human-computer interaction won’t be when either side reaches its limit alone, but when the two tracks converge at a certain point.
No matter how rich the output is, if the input side relies only on keyboards and mice, you end up with nothing more than passive television consumption. No matter how natural the input is, if the output side remains stuck in text paragraphs, the experience is merely a voice-enabled command line.
This asymmetry in progress between the two tracks is severely underestimated today.
The Second Loophole Left by Karpathy: Is HTML a Transition?
Karpathy placed HTML in the middle of the evolutionary sequence, implying that “the next stop is neural simulation.”
I am skeptical of this inference. My argument is: HTML may not be a transition, but a local optimum that will persist for a considerable time.
My reasoning here is not about technical feasibility (neural simulations will eventually be generated); it is about engineering.
Among all currently available output formats, HTML represents an engineering local optimum:
- Sufficiently Rich: It can express layout, interaction, animation, and even embed lightweight data structures.
- Cost-Effective: Generating an HTML file via LLM is orders of magnitude faster than generating a video.
- Standardized: It can be saved to disk, diffed via Git, reviewed in PRs, and collaborated on by two teams.
- Reversible/Editable: If the output is incorrect, humans can open the HTML and edit it directly, offering an editing experience similar to Markdown.
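To make the “diffable” point concrete, here is a small sketch using Python’s standard `difflib`. The two HTML snippets and the file names `report_v1.html` and `report_v2.html` are made up for illustration; in practice Git would produce an equivalent diff on the saved files.

```python
import difflib


def html_diff(old: str, new: str) -> list[str]:
    """Return a unified diff between two versions of an HTML document."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="report_v1.html",  # hypothetical file names
            tofile="report_v2.html",
            lineterm="",
        )
    )


# Two hypothetical model outputs that differ in a single line.
V1 = "<h1>Report</h1>\n<p>Revenue: 10</p>\n<p>End</p>"
V2 = "<h1>Report</h1>\n<p>Revenue: 12</p>\n<p>End</p>"

for line in html_diff(V1, V2):
    print(line)
```

Because the output is plain text, the change shows up as one removed line and one added line, which is exactly what a reviewer sees in a PR.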
Jumping straight to “interactive neural simulation” looks more advanced, but the engineering cost is discarding the latter three advantages: low generation cost, plus everything that makes the output savable, diffable, reviewable, and editable.
How do you git push a simulation generated in real-time by a neural network?
How do two teams review the same simulation?
How do you A/B test two versions of an experience?
How do you audit how this output was generated?
These are not technical problems; they are engineering problems. Engineering problems do not resolve themselves automatically just because models become stronger.
Therefore, I predict that the HTML stage will last longer than most people expect. Technically, we can skip it; engineering-wise, we won’t.
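For a saved HTML file, the audit question has a straightforward answer. The sketch below is one possible approach under my own assumptions, not an established standard: the record fields and the model name passed in are hypothetical, and only Python’s standard library is used.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_hex(text: str) -> str:
    """Hash a string so the record can be checked without storing the full text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def audit_record(prompt: str, html: str, model: str) -> dict:
    """Build a small provenance record for one generated HTML file."""
    return {
        "model": model,  # hypothetical identifier supplied by the caller
        "created_at": datetime.now(timezone.utc).isoformat(),
        "prompt_sha256": sha256_hex(prompt),
        "output_sha256": sha256_hex(html),
    }


def matches(record: dict, html: str) -> bool:
    """Check whether an HTML file is the exact output the record describes."""
    return record["output_sha256"] == sha256_hex(html)


record = audit_record("Summarize Q1 as HTML.", "<p>Q1 summary</p>", "hypothetical-model")
print(json.dumps(record, indent=2))
```

A record like this can be committed next to the HTML file. A simulation generated frame by frame in real time has no comparable artifact to hash.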
Summary: Viewing My Post and Karpathy’s Together
I previously wrote an article organizing the “evolution of the input side” into the sequence: prompt → context → harness, where “harness” refers to the entire infrastructure wrapping the model.
Karpathy’s post organizes the “evolution of the output side” into: raw text → Markdown → HTML → neural simulation.
Combined, these form a complete map of human-computer interaction evolution:
Input side: prompt → context → harness
Output side: raw text → Markdown → HTML → neural simulation
Each step pushes engineering capabilities outward by one layer.
Karpathy elevated the output side to the neuroscience level of “why vision is the preferred output channel for humans,” a more universal angle than mine. His post is worth pausing to read for anyone building AI products.
Where My Judgment Might Be Wrong
As per convention, here are the counter-arguments:
- If the generation cost of interactive neural simulations drops faster than I expect (say, inference latency under 100 ms within 1–2 years), the HTML local optimum will be skipped quickly.
- “Not savable/diffable” might be a feature, not a bug, in certain scenarios. One-off personalized outputs (e.g., games, immersive experiences, temporary reports from private assistants) inherently do not need to be saved.
- Hardware platforms such as Apple Vision and similar ecosystems might define new native output formats (e.g., spatial UI based on USDZ), so that the shift is driven by hardware and skips the HTML stage entirely.
- If the “diffusion + Software 1.0 hybrid” direction Karpathy mentioned can make at least one of its two halves savable and diffable, the entire trade-off is circumvented.
I personally bet that HTML will remain a pillar for 3 to 5 years, but this is a judgment that could be rewritten by new hardware platforms.
Original Tweets:
Karpathy’s post (today): https://x.com/karpathy/status/2053872850101285137
My post from three days ago: https://x.com/elliotchen100/status/2052913108616954215
Thariq’s original article (cited by both Karpathy and me): https://www.anthropic.com/engineering/the-unreasonable-effectiveness-of-html