Tag
Introduces several Hermes plugins, including theme skins, persistent planning, automatic Draw.io flowcharts, a literate programming skill pack, and a fantasy skill lab, that turn Hermes into a versatile terminal and intelligent planning tool.
Discusses using Qwen 27B for planning tasks and Qwen 35B-A3B for execution tasks, suggesting an approach in which each model specializes in one role.
PlanBench-XL is a new benchmark that evaluates LLM agents' ability to plan and adapt in large tool ecosystems with limited visibility and dynamic disruptions. Experiments show GPT-5.4 achieves only 51.9% accuracy in block-free settings and collapses to 11.36% under severe blocking, highlighting significant challenges in long-horizon planning.
A discussion between Kent C. Dodds and Sean Roberts on product engineering, planning with real business context, and the importance of conversations and curiosity over pure data.
This paper surveys evaluation methods for world models and argues for a decision-making-centric framework that prioritizes counterfactual reasoning, planning, and policy optimization over visual quality. It introduces an L0–L7 evaluation ladder and a benchmark protocol to align evaluation with claimed utility.
CEO-Bench introduces a simulation benchmark that evaluates language model agents' ability to manage a startup over 500 days, testing long-term planning, noise handling, adaptability, and multi-task coordination. Results show that even the strongest models struggle, with only Claude Opus 4.8 and GPT-5.5 finishing above the starting balance.
Matt Pocock introduces a decision-mapping skill to split planning into multiple sessions, similar to /to-issues, aiming to streamline greenfield and brownfield builds.
The author built a personal AI agent that uses a frontier model (Codex) for high-level planning while running most token processing locally on a dual RTX 3090 system, enabling long-duration tasks with deterministic validation. The agent supports three swappable tiers: planner, local, and senior, and is available as an open-source repository.
COMET is a model-based reinforcement learning algorithm that combines a frozen object-centric encoder with a transformer-based world model and Monte Carlo Tree Search, using causal attention to focus on task-relevant objects, achieving higher scores on visual RL benchmarks.
Deep Work Plan is a product that helps users provide their AI agents with a structured plan, emphasizing the importance of context over models.
A developer reports being satisfied with Opus 4.8 for planning and GPT-5.5 for execution, emphasizing that breaking tasks into smaller steps improves quality and that dynamic workflows are underrated.
The paper proposes SVoT, a reinforcement learning framework that generates interleaved, verifiable intermediate states and visualizations for multi-hop spatial reasoning in MLLMs, achieving significant accuracy gains on new benchmarks involving multi-object interactions and numerical reasoning.
A discussion about deploying multi-agent AI systems in production, where different agents handle planning, execution, communication, and project management. The poster asks about real-world experiences and bottlenecks.
This paper introduces PhysTool-Bench, a benchmark for evaluating multimodal large language models' ability to recognize and plan the use of physical tools in real-world scenes. The authors find that even the best model identifies only 58.7% of tools and completes just 21.0% of queries end-to-end, revealing a two-level deficit in perception and functional commonsense.
Introduces front-to-attractors (F2A), a new heuristic class for bidirectional search that reduces computational cost by evaluating distances to a small set of attractors instead of the full opposite frontier, achieving up to 11.2x fewer pairwise evaluations and 4.8x fewer node expansions than existing methods.
This paper systematically reviews text world models for LLM-based agents, covering foundations, construction paradigms, applications in planning and training, and evaluation methods.
Stride is an AI-powered workspace that assists with planning, designing, and shipping projects.
A practitioner observes that limiting AI agents to planning one step ahead, rather than several, significantly improves reliability in real-world automation workflows involving CRM and lead qualification, because long-range plans become brittle when external state changes.
This paper introduces OCLGen, a compute-efficient test-time search algorithm that integrates generative planning models with a classical Open-Closed List framework, improving solution quality across combinatorial planning domains.
A comprehensive survey of world models that provides a multi-axis taxonomy covering architectures, methodologies, reasoning strategies, and applications across AI domains, including key systems like Dreamer, MuZero, and Sora.