Tag
ModelScope introduces Agents-A1, a 35B MoE agentic model with 256K context and function calling, achieving SOTA on long-horizon tasks and instruction following.
MOSAIC is a novel framework that uses a frozen LLM to generate semantic embeddings and hierarchical prediction prompts for knowledge tracing, achieving state-of-the-art results on multiple benchmarks.
GLM-5.2 is a new open-source coding model that has caught up to closed-source SOTA models, potentially disrupting revenues of OpenAI and Anthropic.
TempoWave is a plug-and-play temporal wavelet digit interface that maps time series observations into digit-wise embeddings built from multi-wavelet coefficients. It improves LLM-based time series forecasting and achieves state-of-the-art results on multiple benchmarks.
Ornith-1.0 is a family of open-source LLMs specialized for agentic coding, spanning sizes from 9B to 397B and achieving state-of-the-art performance among open-source models of comparable size.
QuickMaker offers a subscription service that integrates state-of-the-art AI models directly into Blender for enhanced 3D modeling and design workflows.
Fara1.5 is a family of native computer use agents trained using the FaraGen1.5 scalable data pipeline. The models achieve new state-of-the-art results on browser-use benchmarks, competing with much larger frontier models.
OpenAI releases the full version of GPT-5.5-Cyber, a cybersecurity-focused AI model with state-of-the-art performance on CyberGym, and announces efforts to improve security through Patch The Planet and Codex Security.
Apodex releases Apodex-1.0, a deep-research model that uses a heavy-duty agent team with global verification, achieving state-of-the-art results on multiple benchmarks including BrowseComp, DeepSearchQA, and HLE.
ThinkDeception is a novel framework that leverages multimodal large language models and a progressive reinforcement learning strategy with chain-of-thought reasoning for interpretable deception detection, achieving new state-of-the-art results on standard benchmarks.
Firecrawl released a state-of-the-art research index for AI/ML papers, designed for autonomous research agents, and claims 18% better recall on arXivQA than competitors.
StepGuard is a framework that combines Dynamic Dual-Policy Optimization (DDPO) and Confidence-Guided Adaptive Navigation Reflection (CANR) to address reward misalignment and error propagation in web navigation agents, achieving state-of-the-art performance.
UniAR presents a unified autoregressive framework that uses a single discrete visual tokenizer to bridge visual understanding and generation, achieving state-of-the-art results in image generation and editing.
Bardienus Duisterhof introduces Modality Forcing, a recipe for post-training text-to-image (T2I) models that achieves state-of-the-art results on 4 out of 5 monocular depth estimation benchmarks.
The comment acknowledges that the model is state-of-the-art for editing but not for generation.
This paper presents EinsteinArena, an agent-native platform enabling decentralized scientific discovery through open interaction among autonomous AI agents. The platform has already produced 12 new state-of-the-art results, including an improved lower bound for the kissing number problem in dimension 11, demonstrating that collective AI-driven research can emerge from agents sharing insights and building on each other's work.
Anthropic releases Fable 5, claiming it is state-of-the-art on key benchmarks in software engineering, science, knowledge work, and vision, exceeding all previously available models.
Claude Fable 5 has been released, claimed to be state-of-the-art across benchmarks with qualitative improvements, especially on complex long tasks. It is the same underlying model as Mythos but with added safeguards.
ApodexAI releases Apodex-1.0, a deep-research model that operates as a tool-using ReAct agent. Its heavy-duty mode, Apodex-1.0-H, uses an asynchronous agent team with up to 150 sub-agents and achieves new state-of-the-art results on deep-research benchmarks including BrowseComp, DeepSearchQA, HLE, and FrontierScience, surpassing models like GPT-5.5-pro and Claude-Opus-4.8.
Apodex 1.0 is a heavy-duty AI agent team for deep research that achieves state-of-the-art performance by searching the web, reasoning over evidence, and producing reports with verifiable evidence chains.