CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies

Hugging Face Daily Papers Papers

Summary

CoffeeBench is a benchmark for evaluating LLM agents in a long-horizon multi-agent economic simulation where firms interact over 90 days to maximize profits, revealing differences in communication patterns and performance among various models.

As LLM agents become capable of increasingly long-horizon tasks, evaluating their performance in economic systems is becoming increasingly important. Unlike existing benchmarks that primarily evaluate a single agent interacting with a passive environment, economic systems are inherently multi-agent, requiring autonomous agents to communicate, negotiate, and transact while pursuing their own objectives over extended periods. We introduce CoffeeBench, a benchmark for evaluating LLM agents in a long-horizon multi-agent economy composed of heterogeneous firms. In CoffeeBench, two farmers, two roasters, and two retailers autonomously operate their businesses over a 90-day simulation, each seeking to maximize cumulative net income through communication and transactions while managing cash, inventory, and pricing. The evaluated model controls one coffee roaster, while the remaining firms are controlled by fixed reference agents. Across several recent open-weight and proprietary LLMs, all models outperform a passive baseline that takes no actions, with most achieving positive net income. Analysis of agent behavior reveals substantial differences in long-horizon economic interaction: higher-performing models communicate more actively with other firms, whereas Claude~Haiku~4.5 exhibits an idle-drift failure mode, repeatedly choosing inaction despite producing coherent assessments and plans. We release our code and agent trajectories to support future research.
Original Article
View Cached Full Text

Cached at: 06/26/26, 10:06 AM

Paper page - CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies

Source: https://huggingface.co/papers/2606.16613

Abstract

CoffeeBench evaluates LLM agents in a multi-agent economic simulation where firms interact over 90 days to maximize profits, revealing differences in communication patterns and performance among various models.

AsLLM agentsbecome capable of increasinglylong-horizon tasks, evaluating their performance ineconomic systemsis becoming increasingly important. Unlike existing benchmarks that primarily evaluate a single agent interacting with a passive environment,economic systemsare inherently multi-agent, requiringautonomous agentsto communicate, negotiate, andtransactwhile pursuing their own objectives over extended periods. We introduce CoffeeBench, a benchmark for evaluatingLLM agentsin a long-horizonmulti-agent economycomposed of heterogeneous firms. In CoffeeBench, two farmers, two roasters, and two retailers autonomously operate their businesses over a 90-day simulation, each seeking to maximizecumulative net incomethroughcommunicationandtransactions while managing cash, inventory, and pricing. The evaluated model controls one coffee roaster, while the remaining firms are controlled by fixed reference agents. Across several recent open-weight and proprietary LLMs, all models outperform a passive baseline that takes no actions, with most achieving positive net income. Analysis ofagent behaviorreveals substantial differences in long-horizon economic interaction: higher-performing models communicate more actively with other firms, whereas Claude~Haiku~4.5 exhibits an idle-drift failure mode, repeatedly choosing inaction despite producing coherent assessments and plans. We release our code and agent trajectories to support future research.

View arXiv pageView PDFProject pageGitHub6Add to collection

Get this paper in your agent:

hf papers read 2606\.16613

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2606.16613 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2606.16613 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2606.16613 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

Agent Bazaar: Enabling Economic Alignment in Multi-Agent Marketplaces

Hugging Face Daily Papers

Introduces Agent Bazaar, a multi-agent simulation framework for evaluating economic alignment of LLMs, identifying failure modes like algorithmic instability and Sybil deception, and training a 9B model that outperforms frontier models using targeted reinforcement learning.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Hugging Face Daily Papers

WildClawBench evaluates language and vision-language models on realistic long-horizon tasks using actual CLI environments with real tools. The benchmark reveals that even the best model achieves only 62.2% accuracy, indicating long-horizon agent evaluation remains challenging.