PANDO: Efficient Multimodal AI Agents via Online Skill Distillation
Summary
PANDO is a web agent framework that improves efficiency through online skill distillation, reducing token usage by 58-61% while outperforming baselines on VisualWebArena tasks.
Cached at: 05/29/26, 11:04 PM
Source: https://huggingface.co/papers/2605.24785
Abstract
PANDO is a web agent framework that improves efficiency through experience accumulation by reducing redundant actions, optimizing skill discovery, and enhancing prompt caching without sacrificing performance.
Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raises a central question: can a web agent become more efficient as it accumulates experience, rather than more expensive? We first analyze trajectories from VisualWebArena and identify three recurring sources of inefficiency: repeat-action loops, hidden discovery costs, and low prompt-cache reuse. We then introduce PANDO, a single-rollout online skill-distillation framework that maintains a structured Skill Library and combines progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. On the full set of 910 VisualWebArena tasks, PANDO achieves a 58.3% success rate, outperforming SGV (54.0%) and our WALT reproduction (45.2%), while using 58% fewer tokens than SGV and 61% fewer tokens than WALT, without any pre-evaluation discovery budget. A 300-task ablation further shows that rules and routines provide most of the success gains, while routing, compression, and cache-aware prompting convert the larger skill library into lower marginal token cost. Finally, we introduce three trajectory-level efficiency metrics -- Action Repetition Rate, Step Overhead Ratio, and Prompt Cache Utilization -- to make efficiency visible beyond terminal success.
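The abstract names three trajectory-level efficiency metrics but does not define them on this page. The sketch below shows one plausible way such metrics could be computed from a logged trajectory. The `Step` record, its field names, the `reference_steps` parameter, and all three formulas are assumptions made for illustration; they are not the paper's definitions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One logged agent step (hypothetical record layout)."""
    action: str          # serialized action, e.g. "click [42]"
    prompt_tokens: int   # prompt tokens sent at this step
    cached_tokens: int   # prompt tokens served from the prompt cache


def action_repetition_rate(steps: List[Step]) -> float:
    """Assumed definition: share of steps that repeat the previous action."""
    if len(steps) < 2:
        return 0.0
    repeats = sum(
        1 for prev, cur in zip(steps, steps[1:]) if prev.action == cur.action
    )
    return repeats / (len(steps) - 1)


def step_overhead_ratio(steps: List[Step], reference_steps: int) -> float:
    """Assumed definition: extra steps relative to a reference step count."""
    if reference_steps <= 0:
        raise ValueError("reference_steps must be positive")
    return (len(steps) - reference_steps) / reference_steps


def prompt_cache_utilization(steps: List[Step]) -> float:
    """Assumed definition: cached prompt tokens over all prompt tokens."""
    total = sum(s.prompt_tokens for s in steps)
    if total == 0:
        return 0.0
    return sum(s.cached_tokens for s in steps) / total


# Toy trajectory with made-up numbers, for illustration only.
trajectory = [
    Step("click [a]", 100, 0),
    Step("click [a]", 100, 50),
    Step("type [b]", 100, 50),
    Step("click [a]", 100, 100),
]
arr = action_repetition_rate(trajectory)
sor = step_overhead_ratio(trajectory, reference_steps=2)
pcu = prompt_cache_utilization(trajectory)
```

Under these assumed definitions, lower Action Repetition Rate and Step Overhead Ratio and higher Prompt Cache Utilization would indicate a more efficient trajectory, independent of whether the task ultimately succeeded.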
Similar Articles
Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval
This paper proposes SGDR (State-Grounded Dynamic Retrieval), an online skill learning method for web agents that enables stepwise, state-aware skill reuse rather than static task-level retrieval. Experiments on WebArena show that SGDR achieves a 37.5% success rate with GPT-4.1, a relative gain of about 10.6% over strong baselines.
@dair_ai: https://x.com/dair_ai/status/2061104052818108476
A roundup of three notable AI papers: SkillOpt treats skill documents as trainable parameters to optimize frozen agents; a new method compiles agentic workflows into model weights for 100x cost reduction; and AutoScientists introduces a decentralized agent team for long-running science without a central planner.
DRIVE: Modeling Skills at the Reasoning and Interaction Levels for Web Agents under Continual Learning
DRIVE proposes a dual-level skill modeling framework that separates reasoning knowledge from interaction knowledge for web agents under continual learning, achieving a 52.8% task success rate on WebArena, outperforming the skill-free baseline by 7.3 percentage points.
COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation
This paper presents COLLEAGUE.SKILL, an open-source system for automatically distilling person-grounded AI skills from heterogeneous traces into inspectable, correctable, and portable skill packages, enabling LLM agents to carry bounded representations of human expertise and interaction style.
AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization
AgentPSO is a particle-swarm-inspired framework that evolves multi-agent reasoning skills by treating agents as particles whose states are natural-language skills. It improves performance on reasoning benchmarks without updating the backbone language model parameters.