Tag
This paper empirically tests whether LLM agents with GNN tools exercise judgment or blindly obey the tool, finding that agents agree with the GNN 97.6–99.2% of the time and that stronger backbones defer even more. The cost of this deference does not shrink with capability, and selective invocation remains an open problem.
This paper proposes SelSkill, a dual-granularity preference-learning framework that learns when to invoke skills in agentic tasks, improving task success by 10.9% on ALFWorld and 5.7% on BFCL.