Tag
GLM-5.2 is a new open-source coding model that has caught up to state-of-the-art closed-source models, potentially disrupting the revenues of OpenAI and Anthropic.
TempoWave is a plug-and-play temporal wavelet digit interface that maps time-series observations into digit-wise embeddings derived from multi-wavelet coefficients, improving LLM-based time-series forecasting and achieving state-of-the-art results on multiple benchmarks.
Ornith-1.0 is a family of open-source LLMs specialized for agentic coding, spanning sizes from 9B to 397B and achieving state-of-the-art performance among open-source models of comparable size.
QuickMaker offers a subscription service that integrates state-of-the-art AI models directly into Blender for enhanced 3D modeling and design workflows.
Fara1.5 is a family of native computer use agents trained using the FaraGen1.5 scalable data pipeline. The models achieve new state-of-the-art results on browser-use benchmarks, competing with much larger frontier models.
OpenAI releases the full version of GPT-5.5-Cyber, a cybersecurity-focused AI model with state-of-the-art performance on CyberGym, and announces efforts to improve security through Patch The Planet and Codex Security.
Apodex releases Apodex-1.0, a deep-research model that uses a heavy-duty agent team with global verification, achieving state-of-the-art results on multiple benchmarks including BrowseComp, DeepSearchQA, and HLE.
ThinkDeception proposes a novel framework that leverages multimodal large language models and a progressive reinforcement learning strategy with chain-of-thought reasoning for interpretable deception detection, achieving new state-of-the-art results on standard benchmarks.
Firecrawl released a state-of-the-art research index for AI/ML papers, designed for autonomous research agents, and claims 18% better recall on arXivQA than competitors.
StepGuard proposes a framework combining Dynamic Dual-Policy Optimization (DDPO) and Confidence-Guided Adaptive Navigation Reflection (CANR) to address reward misalignment and error propagation in web navigation agents, achieving state-of-the-art performance.
UniAR presents a unified autoregressive framework that uses a single discrete visual tokenizer to bridge visual understanding and generation, achieving state-of-the-art results in image generation and editing.
Bardienus Duisterhof introduces Modality Forcing, a recipe for post-training text-to-image (T2I) models that achieves state-of-the-art results on 4 out of 5 monocular depth estimation benchmarks.
The comment acknowledges that the model is state-of-the-art for editing but not for generation.
This paper presents EinsteinArena, an agent-native platform enabling decentralized scientific discovery through open interaction among autonomous AI agents. The platform has already produced 12 new state-of-the-art results, including an improved lower bound for the kissing number problem in dimension 11, demonstrating that collective AI-driven research can emerge from agents sharing insights and building on each other's work.
Anthropic releases Fable 5, claiming it is state-of-the-art on key benchmarks in software engineering, science, knowledge work, and vision, exceeding all previously available models.
Claude Fable 5 has been released, claimed to be state-of-the-art across benchmarks with qualitative improvements, especially on complex long tasks. It is the same underlying model as Mythos but with added safeguards.
ApodexAI releases Apodex-1.0, a deep-research model that operates as a tool-using ReAct agent. Its heavy-duty mode, Apodex-1.0-H, uses an asynchronous agent team with up to 150 sub-agents and achieves new state-of-the-art results on deep-research benchmarks including BrowseComp, DeepSearchQA, HLE, and FrontierScience, surpassing models like GPT-5.5-pro and Claude-Opus-4.8.
Apodex 1.0 is a heavy-duty AI agent team for deep research that achieves state-of-the-art performance by searching the web, reasoning over evidence, and producing reports with verifiable evidence chains.
This technical report introduces DuMate-DeepResearch, a multi-agent framework for deep research tasks that decouples the agent core from a tool ecosystem and incorporates graph-based dynamic planning, recursive two-level execution, and rubric-based test-time optimization. The system achieves state-of-the-art results on two deep research benchmarks, demonstrating the value of auditable agent infrastructure.
A mental health professional argues that AI, when properly prompted, can offer surprisingly effective therapeutic advice and personalization, sometimes surpassing traditional therapy in nuance and accessibility, especially for neurodivergent individuals.