An analysis of the 196 startups in YC's spring 2026 batch found that 95% use AI and 85% are AI-native, and concludes that the real keyword is agents rather than AI.
Introduces IRTS-ToolBench, a benchmark of 1,700 questions for evaluating LLMs and AI agents on irregular time series question answering via tool-grounded reasoning, covering 10 task types across 13 domains.
Kepler is an agentic development environment designed to run AI agents at scale, targeting developers who need to manage multiple agent workflows.
An article explaining how to build AI-driven 'loops' to automate revenue-generating business processes, citing insights from Boris Cherny (Claude Code) and Peter Steinberger (OpenClaw).
This post evaluates sandbox platforms for background agents, focusing on requirements like running real workloads, ingress, and cost. It outlines the Deputies sandbox provider interface and key considerations.
The article argues that AI education remains focused on basic chatbot and prompt skills, while real-world AI development has shifted towards building agents, systems integration, and robust software engineering, creating a significant gap for learners.
This paper revisits the WorkBench benchmark for workplace agents two years after its initial release, showing that the best agent (Claude Opus 4.8) now completes 89% of tasks with only 2.5% harmful side effects, compared to GPT-4's 43% completion and 26% harm rate in 2024. It finds that capability and safety improve together, open-weight models have drastically lowered costs, and some basic mistakes persist.
CacheRL trains small agent foundation models for multi-step tool-calling tasks, reaching 92% process accuracy (approaching GPT-5's 94%) with 100x less compute by using cached rollouts and hybrid reward shaping. Its contributions span knowledge transfer, cache-aware rewards, and iterative SFT/GRPO training.
DAIR Academy Plugins is an open-source marketplace of plugins for Claude Code, including an llm-council skill that orchestrates multiple open-weight LLMs via Fireworks AI.
Awesome Vibe Research is an open, collaborative repository maintained by ModelScope. It collects and curates reusable, verifiable, and evolvable AI-assisted components across the full research workflow, including agents, skills, workflows, tools, and best practices, with the aim of helping researchers and developers use AI to improve research efficiency.
A developer shares the challenge of debugging multi-step agents in production, where failures are hard to trace due to complex tool use and confident wrong answers, and asks the community for better monitoring and regression detection approaches.
Clelia enjoyed speaking about retrieval in agent systems at the Vector Space meetup in Berlin, organized by Qdrant, with deepset, cognee, and n8n.
Browser Use 0.13.0 beta is rebuilt in Rust for long-running web agent tasks, featuring a custom LLM harness and a new terminal interface.
Andrew Ng discusses the rise of desktop AI agents and coding CLI tools, introduces the open-source OpenCoworker project, and examines agent harness designs where LLMs drive autonomous task execution.
Midas achieves 0.56 recall@k on BEAM 100K and 0.51 on BEAM 500K with zero LLM calls and zero cost, demonstrating efficient long-term memory for agents.
TerraBench is a new benchmark for evaluating AI agents' ability to reason over heterogeneous Earth-system data, including gridded data, satellite imagery, and simulator outputs. It reveals significant limitations in current frontier models: the top performers reach an average tool-use score of only 59.2%.
This paper introduces a Multi-Modal Agent framework for power distribution defect detection, evaluating foundation models on perception, reasoning, and tool usage capabilities, with a new domain-specific dataset and benchmark.
An opinion piece arguing that long context windows don't equate to memory and that agent failures are often mundane, like forgetting constraints or rereading files, emphasizing that reliability depends on context architecture decisions.
A tweet recommends an article on building good AI agents, implying it is highly valuable for developers.
Novu Connect enables users to ship agents where their users already work.