Tag
This paper empirically tests whether LLM agents equipped with GNN tools exercise judgment or blindly obey the tool. It finds that agents agree with the GNN 97.6–99.2% of the time and that stronger backbones defer even more. The cost of this deference does not shrink with capability, and selective invocation of the tool remains an open problem.