hazards

Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability

arXiv cs.CL · 17h ago

Introduces ToolBench-X, a benchmark that evaluates large language model agents under tool-environment reliability hazards and reveals a substantial performance gap relative to clean environments.
