CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies
Summary
CoffeeBench is a benchmark for evaluating LLM agents in a long-horizon multi-agent economic simulation where firms interact over 90 days to maximize profits, revealing differences in communication patterns and performance among various models.
Source: https://huggingface.co/papers/2606.16613
Abstract
As LLM agents become capable of increasingly long-horizon tasks, evaluating their performance in economic systems is becoming increasingly important. Unlike existing benchmarks that primarily evaluate a single agent interacting with a passive environment, economic systems are inherently multi-agent, requiring autonomous agents to communicate, negotiate, and transact while pursuing their own objectives over extended periods. We introduce CoffeeBench, a benchmark for evaluating LLM agents in a long-horizon multi-agent economy composed of heterogeneous firms. In CoffeeBench, two farmers, two roasters, and two retailers autonomously operate their businesses over a 90-day simulation, each seeking to maximize cumulative net income through communication and transactions while managing cash, inventory, and pricing. The evaluated model controls one coffee roaster, while the remaining firms are controlled by fixed reference agents. Across several recent open-weight and proprietary LLMs, all models outperform a passive baseline that takes no actions, with most achieving positive net income. Analysis of agent behavior reveals substantial differences in long-horizon economic interaction: higher-performing models communicate more actively with other firms, whereas Claude Haiku 4.5 exhibits an idle-drift failure mode, repeatedly choosing inaction despite producing coherent assessments and plans. We release our code and agent trajectories to support future research.
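To make the setup described in the abstract concrete, here is a minimal Python sketch of the firm roster and of a cumulative net income score compared against a passive baseline. It is not the benchmark's actual code: the firm names, the `DayRecord` schema, the `cumulative_net_income` helper, and all ledger figures are hypothetical, and the assumption that an idle firm still pays a fixed daily cost is made purely for illustration.

```python
from dataclasses import dataclass

SIM_DAYS = 90  # simulation length stated in the abstract

# Roster from the abstract: two farmers, two roasters, two retailers.
# Firm names and the "evaluated"/"reference" labels are hypothetical.
FIRMS = {
    "farmer_a": "reference",
    "farmer_b": "reference",
    "roaster_a": "evaluated",  # the one roaster run by the model under test
    "roaster_b": "reference",
    "retailer_a": "reference",
    "retailer_b": "reference",
}


@dataclass
class DayRecord:
    """One day's books for a single firm (hypothetical schema)."""
    revenue: float
    costs: float


def cumulative_net_income(ledger):
    """Sum daily net income (revenue minus costs) over the whole ledger."""
    return sum(day.revenue - day.costs for day in ledger)


# Made-up ledgers, for illustration only.
active_ledger = [DayRecord(revenue=100.0, costs=60.0) for _ in range(SIM_DAYS)]
# Assumption: an idle firm earns nothing but still pays a fixed daily cost.
passive_ledger = [DayRecord(revenue=0.0, costs=10.0) for _ in range(SIM_DAYS)]

active_score = cumulative_net_income(active_ledger)
passive_score = cumulative_net_income(passive_ledger)
print(active_score, passive_score, active_score > passive_score)
```

Under these invented numbers the active firm outscores the passive one. The sketch only shows the shape of the comparison and says nothing about the results reported in the paper.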
Similar Articles
EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent
Introduces EComAgentBench, a benchmark for evaluating LLM-based shopping agents on long-horizon tasks with hidden intents distributed across queries, profiles, and clarifications. The benchmark uses real Amazon products and automated scoring, revealing that even the best model achieves only 57.1% accuracy.
Agent Bazaar: Enabling Economic Alignment in Multi-Agent Marketplaces
Introduces Agent Bazaar, a multi-agent simulation framework for evaluating economic alignment of LLMs, identifying failure modes like algorithmic instability and Sybil deception, and training a 9B model that outperforms frontier models using targeted reinforcement learning.
Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation
This paper introduces CEO-Bench, a multi-agent benchmark for evaluating LLMs on CEO-level strategic resource reallocation, revealing systematic failure modes and a structural integration–boldness tradeoff.
CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement
CollabBench is a new benchmark for evaluating and training LLM agents in cooperative games, featuring diverse player simulation and a collaborative training paradigm. Experiments show 19.5% higher efficiency and 24.4% improved affective performance over base models.
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
WildClawBench evaluates language and vision-language models on realistic long-horizon tasks using actual CLI environments with real tools. The benchmark reveals that even the best model achieves only 62.2% accuracy, indicating long-horizon agent evaluation remains challenging.