@latentspacepod: [AINews] Open Models, Model Labs vs Agent Labs, and What's Untrainable — Sarah Guo https://latent.space/p/ainews-open-m…

X AI KOLs Timeline 06/11/26, 03:15 AM News

open-models model-labs agent-labs untrainable ai-news latent-space sarah-guo

Summary

Sarah Guo's framework on open models, model labs vs agent labs, and the concept of 'untrainable' is discussed, emphasizing that applications win through unglamorous integration work and that intent is a scarce input.

[AINews] Open Models, Model Labs vs Agent Labs, and What's Untrainable — Sarah Guo https://latent.space/p/ainews-open-models-model-labs-vs… our take on why @saranormous' is probably the most important framing on what to work on, and what matters.

Original Article

View Cached Full Text

Cached at: 06/11/26, 05:43 PM

[AINews] Open Models, Model Labs vs Agent Labs, and What’s Untrainable — Sarah Guo https://latent.space/p/ainews-open-models-model-labs-vs… our take on why @saranormous’ is probably the most important framing on what to work on, and what matters.

[AINews] Open Models, Model Labs vs Agent Labs, and What’s Untrainable — Sarah Guo

Source: https://www.latent.space/p/ainews-open-models-model-labs-vs Sarah Guo is afriend of the podandQueen of AI, and after ourSatya crossover pod(greatrecap here from Gokul Rajaram) wrote an excellent article onher Substack. Go read it, and come back for this reaction:

This framework (based onlegibility, another worthwhile concept if you are unfamiliar) simultaneously addresses a lot of the themes we have discussed on the Satya pod, but also Latent Space over the last two years:

**The Place of Open Models:**With Braintrust in 2024 we weremaximally bearish on Open Model adoption, only to turn around by ourPmarca,Cursor, andNotion in 2026pods
Agent Labs vs Model Labs:Sarah (a Cognition investor) echosthe Devin is in the Details: “An application earns its place in the untrainable corner bydoing unglamorous work: arranging a company’s private reality so a model can act on it, handing the model the tools to act, working with the customer to change the reality of its workforce. A company that brings the translation is tough to copy – and the translation never ends. Integration and maintenance run as long as the relationship does,won by teams that put domain-specialized engineers and tools next to the customer.”
Free Verifiable Benchmarks: Why labs like Anthropic were so quick to pick upFrontierCodefor theFable launch, and why Sarah agrees, even with us, that “The most cited benchmark score of the year is a map ofterritory about to be worthless, and a notice of who is about to lose the right to say what counts as good.”

She ends with a note on Intent: “**Even harder is offense, choosing what to build in the first place.**That’s what I spend the year looking for, and I find it maybe three times. The model is no help there. It will do whatever you point it at and can’t tell you what’s worth pointing it at, and you can’t benchmark that, so you can’t train it. It’s also the reason the incumbents don’t take everything: they keep the ground they have, and the next thing comes from someone who finds a use before the rest of us. Maybe intent is an even scarcer input than compute.”

AI News for 6/9/2026-6/10/2026. We checked 12 subreddits,544 Twittersand no further Discords.AINews’ websitelets you search all past issues. As a reminder,AINews is now a section of Latent Space. You canopt in/outof email frequencies!

Anthropic’s Fable/Mythos rollout, silent capability gating, and the trust backlash

Silent degradation of AI R&D help dominated the discourse: A large share of technical tweets focused on Anthropic apparently degrading model performance on AI research-related prompts without clear up-front disclosure, rather than hard-refusing those requests. Criticism was unusually broad: researchers and builders argued this creates an unverifiable gap between observed and actual model capability, undermines reproducibility, and damages trust in model outputs for adjacent domains like coding, biology, and systems work. Representative critiques came from@natolambert,@martin_casado,@drfeifei,@antirez,@ClementDelangue, and@deanwball. Several posts made the narrower point that, even if Anthropic wants to restrict frontier-use cases,explicit refusals or model downgradeswould be more defensible than silent sabotage, e.g.@hlntnr,@arohan, and@DBahdanau.
Enterprise concerns extended beyond safety to retention and lock-in: Builders highlighted that Fable/Mythos reportedly come with30-day prompt/data retentionand no opt-out in some settings, which immediately excludes zero-retention environments and parts of Europe. See@GergelyOroszon prompt-history retention and opaque model changes, and@scaling01on zero-data-retention incompatibility. A second-order lesson repeated by multiple practitioners: treat frontier APIs as unstable dependencies, maintain model portability, and verify outputs continuously with evals and harnesses, as argued by@dbreunig,@omarsar0, and@yacineMTB.
Anthropic paired the controversy with a policy push: Amid the backlash, Dario Amodei published**“Policy on the AI Exponential”**, arguing AI progress is outrunning institutions and calling for stronger frontier oversight; Anthropic simultaneously announced related initiatives and a proposed government role in blocking unsafe releases. See@DarioAmodeiand@AnthropicAI. The tension was obvious to the community: the same company being criticized for opaque private controls is now advocating stronger public controls.

Fable 5’s benchmark strength and product performance despite the controversy

Fable 5 appears genuinely strong on agentic and coding workloads: Even many critics of Anthropic’s policy acknowledged the model itself is excellent. Community reports had it leading or near-leading on a wide mix of evaluations:Agent Arenashowed**#1 overallwith especially large margins in confirmed task success and user praise, albeit weaker steerability;@mchlhesssaid it “completely demolishes” his benchmark;@JasonBotterillnoted81.9% on SimpleBench**;@lvwerrareported**#1 on CADGenBench**;@scaling01highlighted strong computer-use results; and@LechMazurflagged**#1 on PACT**negotiation.
Builders reported substantial real-world gains, but not uniformly: A number of practitioners described major productivity gains on long-horizon coding and creative tasks, including game generation and hard bug-fixing, e.g.@kimmonismus,@walden_yan, and@hrishioa. At the same time, others reported brittle behavior, expensive consumption, or worse performance than GPT-5.5 on specific tasks, such as@Sentdexand@QuixiAI. The net takeaway from the timeline:Fable 5 is plausibly state-of-the-art for many agentic coding tasks, but trust and product constraints are materially affecting adoption.
Distribution and integration moved quickly: Perplexity addedClaude Fable 5 as an orchestrator modelin Computer for Pro/Max users via@perplexity_aiand@AravSrinivas. Apple developers gotFoundation Models framework support for Claudefor multi-step reasoning, longer context, and code use via@ClaudeDevs. Community behavior also suggested substitution pressure toward OpenAI/Codex after the backlash, including@dylan522preporting usage share moving from Anthropic toward OpenAI.

Google’s DiffusionGemma release and renewed interest in diffusion LLMs

Google released DiffusionGemma under Apache 2.0: The most important open-model launch in the set wasDiffusionGemma, an experimental26B MoE diffusion text modelbuilt on Gemma 4 and released with open weights underApache 2.0. Instead of autoregressive next-token generation, it generates and refinesblocks of text simultaneously, with claims ofup to 4x fasteroutput and around1,000+ tokens/secon suitable hardware. See@Google,@GoogleDeepMind,@googlegemma, and@sundarpichai.
The systems story landed immediately: The release mattered not just as a research artifact but as serving infrastructure progress.@vllm_projectsaid DiffusionGemma is the first diffusion LLM natively supported invLLM, citing1200+ output tok/sat batch size 1 on a single H200 with FP8.@danielhanchenshowed it running locally viallama.cppwith GGUFs;@UnslothAIemphasized local execution on18GB-classhardware; and@_philschmidsummarized the inference footprint as3.8B active paramsand256-token block denoising.
Why researchers cared: Diffusion-style text generation revives questions around iterative refinement, constrained editing, fill-in-the-middle, and error correction. Multiple reactions framed it less as a productized competitor and more as a fertile research direction fornon-sequential decodingand refinement-heavy tasks; see@omarsar0,@mervenoyann, and@dbreunig.

Agent tooling, infra, and benchmarks: more structure around real workloads

Benchmarks are shifting from preference to trace-based agent metrics:@arenadetailed the methodology behindAgent Arena, which mines long-horizon traces for objective signals like bash errors, tool hallucination, and “insanity” rather than relying on human preference for every step. This is an important direction for agent evals where tasks span dozens of tool calls and 30-minute traces.
Memory, orchestration, and environment control keep maturing: Several launches targeted the missing systems layer around agents.@Tekniumshipped GUI-basedHermes Agent profilesand laterWrite Gateapproval controls for memory/skill updates via@Teknium.@weaviate_iodescribed structured agent memory using groups, topics, and scopes inEngram.@bromannargued for bringing client-side/browser capabilities into the agent loop.@FactoryAIlaunchedMissionson Factory Desktop.
Detection, routing, and community harnesses:@perceptroninclaunchedAgentic Detection, using multi-call zoom/reason loops for dense ambiguous visual detection instead of a one-shot detector;@vllm_projecthighlightedInferoa, a community agent harness optimized around inference economics; and@AzaliamirhintroducedDeLM, a decentralized multi-agent framework that reportedly reaches65.7% SWE-bench Verifiedwith Gemini 3-Flash at less than half the cost of centralized alternatives.

Optimization, retrieval, and scientific-modeling work worth tracking

Distributed Shampoo vs Muon remained a live optimization thread: A technically interesting sub-thread showed tunedMeta DistributedShampoomatching strong Muon baselines on a speedrun-style task after hyperparameter tuning and enabling pseudo-inverse stabilization.@*arohan*reported validation losses around3.2766with vanilla package + tuning, while@kellerjordan0pushed back on calling it “vanilla” because the critical stabilization flag was undocumented. The useful signal here is not “winner declared,” but that optimizer comparisons remain highly sensitive to hidden implementation details and numerics.
Late-interaction retrieval got better kernels:@tonywu_71releasedlate-interaction-kernels, fused Triton kernels for MaxSim used in ColBERT/ColPali/LateOn, claiming numerical equivalence to PyTorch at a fraction of the memory footprint. This should matter for both training and serving multi-vector retrieval models.
Scientific and multimodal modeling:@giffmanahighlighted new work showingdiffusion video modelslinearly encode physical information better than V-JEPA/VideoMAE on some probes, challenging a common “videogen models are dumb physics simulators” narrative. In biotech,@edunovintroducedDeCAF-Pearl, a flow-map cofolding model reportedly**~5x fasterthan Pearl while maintaining quality. On architecture research,@ZyphraAIreleasedZamba2-VL**under Apache 2.0, extending hybrid SSM-Transformer ideas into VLMs.

Top tweets (by engagement)

Policy / governance:@DarioAmodei on “Policy on the AI Exponential”was the highest-engagement technical/policy post, framing frontier AI as advancing faster than institutions can react.
Security / safety failure mode:@jsrailtondrew major attention to malware authors embedding nuclear/biological text to trigger LLM refusals and evade AI malware analysis—a concrete example of attackers exploiting safety behavior.
Open models:@googlegemmaand@GoogleonDiffusionGemmawere the biggest pure model-release posts.
Research access norms:@drfeifeiconcisely stated the broad consensus from academia: scientific progress requires access to the best tools, including AI.
Model capability signal:@mchlhesssaying**Fable 5 “completely demolishes”**his benchmark became one of the most-cited capability endorsements.

@latentspacepod: [AINews] Open Models, Model Labs vs Agent Labs, and What's Untrainable — Sarah Guo https://latent.space/p/ainews-open-m…

[AINews] Open Models, Model Labs vs Agent Labs, and What’s Untrainable — Sarah Guo

Similar Articles

@aiDotEngineer: Your Agent Can Now Train Models The argument from @mervenoyann: open source models have caught up. GLM 5.1 is leading t…

@swyx: full writeup and links here

@danintheory: Great conversation and a fun way to learn about an important open AI problem!

@omarsar0: Karpathy's autoresearch repo started an impressive trend. Agents can now train AI models to build SoTA agentic systems.…

@omarsar0: Had so many thoughts on the "loop engineering" trend. I spent a few minutes with my writer agent to summarize some of m…

Submit Feedback

Similar Articles

@aiDotEngineer: Your Agent Can Now Train Models The argument from @mervenoyann: open source models have caught up. GLM 5.1 is leading t…

@swyx: full writeup and links here

@danintheory: Great conversation and a fun way to learn about an important open AI problem!

@omarsar0: Karpathy's autoresearch repo started an impressive trend. Agents can now train AI models to build SoTA agentic systems.…

@omarsar0: Had so many thoughts on the "loop engineering" trend. I spent a few minutes with my writer agent to summarize some of m…