openjudge

#openjudge

@Ali_TongyiLab: https://x.com/Ali_TongyiLab/status/2067158015615041755

X AI KOLs Timeline · 2d ago

The AgentScope team introduces PawBench, a benchmark that evaluates the combined performance of models and agent harnesses. Its analysis of 4,050 test cells shows that the choice of harness can matter as much as a model upgrade.

