VisualClaw: A Real-Time, Personalized Agent for the Physical World

Hugging Face Daily Papers 06/15/26, 12:00 AM Papers

multimodal-agent video-qa real-time skill-evolution edge-applications hybrid-encoding

Summary

VisualClaw is a self-evolving multimodal agent that reduces deployment costs through hybrid encoding and skill evolution, while improving video-QA accuracy across multiple benchmarks.

Vision language models increasingly serve as general-purpose interfaces for complex multimodal tasks. However, deployment still faces three gaps: VLMs typically incur high latency and cost when processing dense video frames and long prompts; the agent scaffold remains static after deployment; and standard video-QA benchmarks do not test whether agents can use visual evidence inside tool-using workspaces. We present VisualClaw, a self-evolving multimodal agent built around two principles. First, hybrid encoding reduces deployment cost by filtering less informative streaming frames with a cascaded gate and compressing the text skill bank through hot/cold top-k injection. Second, skill evolution lets the agent learn from failures: retrieved memories condition an evolver, either as directly concatenated context or as guided evidence, producing skill-bank updates that help with future questions. Across 4 video-QA benchmarks with 2 VLMs, VisualClaw cuts per-question API cost by an average of 98% versus full-frame upload and by 25.9% versus the offline uniform 8-frame baseline, while boosting accuracy in most settings, e.g., an average of +3.85% and a peak of +15.80% on EgoSchema with Gemini 3 Flash. To address the third gap, we curate VisualClawArena, a 200-scenario multimodal agentic benchmark built through a strict five-stage pipeline; models must use video evidence, documents, dynamic updates, and executable checks inside a workspace. On VisualClawArena, the same framework with computer-use agent backends improves macro accuracy by +2.9% for Codex (GPT-5.5) and +3.2% for Claude Code (Sonnet 4.6) over no-evolution baselines, with a 9.5% cost reduction compared to the uniform-sampled baseline. These properties make VisualClaw a natural fit for edge applications, where the cascade reduces a 1-hour streaming session from ~3,600 API uploads to only 5-20 calls and self-evolution makes the agent well suited to serve as a personalized assistant.
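To make the idea of a cascaded frame gate concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not specify the gate's stages, so the two stages used here (a cheap frame-difference check followed by a histogram novelty check), the thresholds, the flattened-grayscale frame representation, and all function names are assumptions chosen for illustration only.

```python
from typing import List, Sequence

# Hypothetical frame representation: flattened grayscale pixels in 0..255.
Frame = Sequence[int]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Cheap stage-1 signal: average per-pixel change between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def histogram(frame: Frame, bins: int = 4) -> List[float]:
    """Normalized intensity histogram with `bins` equal-width buckets."""
    counts = [0] * bins
    for p in frame:
        counts[min(p * bins // 256, bins - 1)] += 1
    return [c / len(frame) for c in counts]


def hist_distance(a: Frame, b: Frame) -> float:
    """Costlier stage-2 signal: total-variation distance between histograms."""
    ha, hb = histogram(a), histogram(b)
    return 0.5 * sum(abs(x - y) for x, y in zip(ha, hb))


def cascaded_gate(
    frames: Sequence[Frame],
    motion_thresh: float = 10.0,   # assumed value, not from the paper
    novelty_thresh: float = 0.25,  # assumed value, not from the paper
) -> List[int]:
    """Return indices of frames that would be uploaded to the VLM.

    Stage 1 compares each frame with its predecessor; only frames that
    pass are compared (stage 2) with the last frame that was kept.
    """
    kept: List[int] = []
    prev = None
    last_kept = None
    for i, frame in enumerate(frames):
        if prev is None:
            # Always keep the first frame so there is a reference point.
            kept.append(i)
            last_kept = frame
        elif mean_abs_diff(frame, prev) >= motion_thresh:
            if hist_distance(frame, last_kept) >= novelty_thresh:
                kept.append(i)
                last_kept = frame
        prev = frame
    return kept


def upload_reduction(baseline_uploads: int, actual_uploads: int) -> float:
    """Fraction of uploads avoided relative to uploading every frame."""
    return 1.0 - actual_uploads / baseline_uploads
```

Applied to the figures quoted above, `upload_reduction(3600, 20)` and `upload_reduction(3600, 5)` give roughly 0.994 and 0.999, so 5-20 calls in place of ~3,600 uploads corresponds to avoiding more than 99% of uploads. The ordering of the stages reflects the usual reason for a cascade: the inexpensive check runs on every frame, and the costlier check runs only on the frames that survive it.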


Cached at: 06/16/26, 11:34 AM


Source: https://huggingface.co/papers/2606.16295




Datasets citing this paper: 1

UCSC-VLAA/VisualClawArena (updated about 9 hours ago)


Similar Articles

PixelClaw: an LLM agent for image manipulation

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents

SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents
