uc-santa-barbara

VISUALSKILL: Multimodal Skills for Computer-Use Agents

arXiv cs.CL · 2d ago

VisualSkill proposes a hierarchical multimodal skill library for computer-use agents (CUAs) that pairs textual instructions with figures. Because it retains the visual information that GUI interaction depends on, it achieves a 15.3-point absolute improvement over text-only baselines on CUA benchmarks.
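To make the idea of a hierarchical multimodal skill library concrete, here is a minimal sketch. It is not the paper's implementation, and every name in it (`Skill`, `find`, `render`, the example skills and image paths) is hypothetical. It assumes each skill node holds text instructions plus references to figures such as GUI screenshots, and that a retrieved skill is flattened into an interleaved text/image context for the agent.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Skill:
    """Hypothetical skill node: text instructions plus figure references."""
    name: str
    text: str
    figures: list[str] = field(default_factory=list)  # assumed: paths to GUI screenshots
    children: list[Skill] = field(default_factory=list)  # sub-skills in the hierarchy

    def find(self, name: str) -> Skill | None:
        # Depth-first search for a skill by name; returns None if absent.
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None

    def render(self, depth: int = 0) -> list[tuple[str, str]]:
        # Flatten this skill and its sub-skills into interleaved
        # ("text", ...) and ("image", ...) entries, keeping figures
        # next to the instructions they illustrate.
        parts = [("text", "  " * depth + f"{self.name}: {self.text}")]
        parts += [("image", path) for path in self.figures]
        for child in self.children:
            parts += child.render(depth + 1)
        return parts


library = Skill("spreadsheet", "Skills for spreadsheet apps", children=[
    Skill("insert_chart", "Select the data range, then open Insert > Chart.",
          figures=["insert_menu.png"]),
])
context = library.find("insert_chart").render()
print(context)
```

A text-only library would keep just the `("text", ...)` entries. The illustrative point is that the image references travel with the instructions they belong to.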
