Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

Hugging Face Daily Papers 05/21/26, 12:00 AM Papers

Summary

Maestro is a reinforcement learning-driven framework that dynamically composes ensembles of frozen expert models and skills for multimodal tasks, achieving 70.1% average accuracy with a 4B orchestrator, surpassing GPT-5 and Gemini-2.5-Pro.

The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with these skills. This gives rise to a critical bottleneck: different LLMs offer distinct advantages across diverse domains, yet current frameworks fail to exploit the complementary strengths of models and skills, thereby limiting their performance on downstream tasks. In this paper, we present Maestro (Multimodal Agent for Expert-Skill Targeted Reinforced Orchestration), a Reinforcement Learning (RL)-driven orchestration framework that reframes heterogeneous multimodal tasks as a sequential decision-making process over a hierarchical model-skill registry. Rather than consolidating all knowledge into a single model, Maestro trains a lightweight policy to dynamically compose ensembles of frozen expert models and a two-tier skill library, deciding at each step whether to invoke an external expert, which model-skill pair to select, and when to terminate. The policy is optimized via outcome-based RL, requiring no step-level supervision. We evaluate Maestro across ten representative multimodal benchmarks spanning mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis. With only a 4B orchestrator, Maestro achieves an average accuracy of 70.1%, surpassing both GPT-5 (69.3%) and Gemini-2.5-Pro (68.7%). Crucially, the learned coordination policy generalizes to unseen models and skills without retraining: augmenting the registry with out-of-domain experts yields a 59.5% average on four challenging benchmarks, outperforming all closed-source baselines. Maestro further maintains high computational efficiency with low latency. The source code is available at https://github.com/jinyangwu/Maestro.
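The decision loop described above (invoke an expert or not, pick a model-skill pair, decide when to stop) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Action`, `Registry`, `run_episode`, and `toy_policy`, the tier labels `"general"` and `"specialized"`, and the step budget are all hypothetical, and the hand-written `toy_policy` merely stands in for the learned 4B orchestrator.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, str, str]  # (model name, skill name, expert output)


@dataclass
class Action:
    kind: str                     # "invoke" or "terminate"
    model: Optional[str] = None   # which frozen expert to call
    tier: Optional[str] = None    # which skill tier (labels are assumed)
    skill: Optional[str] = None   # which skill within that tier
    answer: Optional[str] = None  # final answer when terminating


@dataclass
class Registry:
    # Frozen experts: each maps a prompt to an output string.
    models: Dict[str, Callable[[str], str]]
    # Two-tier skill library: tier -> skill name -> prompt template.
    skills: Dict[str, Dict[str, str]]


def run_episode(policy, registry: Registry, task: str, max_steps: int = 4):
    """Roll out one episode: the policy picks an action at every step."""
    history: List[Step] = []
    for _ in range(max_steps):
        action = policy(task, history)
        if action.kind == "terminate":
            return action.answer or "", history
        prompt = registry.skills[action.tier][action.skill].format(task=task)
        output = registry.models[action.model](prompt)
        history.append((action.model, action.skill, output))
    # Step budget exhausted: fall back to the last expert output, if any.
    return (history[-1][2] if history else ""), history


def toy_policy(task: str, history: List[Step]) -> Action:
    """Hand-written stand-in for the learned policy (assumption)."""
    if not history:
        return Action("invoke", model="math_expert",
                      tier="specialized", skill="solve_arithmetic")
    return Action("terminate", answer=history[-1][2])


registry = Registry(
    models={"math_expert": lambda prompt: "4"},  # stub expert
    skills={
        "general": {"describe": "Describe the input: {task}"},
        "specialized": {"solve_arithmetic": "Answer with a number: {task}"},
    },
)

answer, trace = run_episode(toy_policy, registry, "What is 2 + 2?")
```

Because the registry is plain data passed into the loop, new experts or skills can be added without touching the policy code, which mirrors the abstract's claim that unseen models and skills can be plugged in without retraining.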

Original Article

Cached at: 05/22/26, 06:38 AM

Source: https://huggingface.co/papers/2605.22177

Abstract

A reinforcement learning-driven orchestration framework dynamically composes expert models and skills for multimodal tasks, achieving superior performance with low computational overhead.

The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with these skills. This gives rise to a critical bottleneck: different LLMs offer distinct advantages across diverse domains, yet current frameworks fail to exploit the complementary strengths of models and skills, thereby limiting their performance on downstream tasks. In this paper, we present Maestro (Multimodal Agent for Expert-Skill Targeted Reinforced Orchestration), a Reinforcement Learning (RL)-driven orchestration framework that reframes heterogeneous multimodal tasks as a sequential decision-making process over a hierarchical model-skill registry. Rather than consolidating all knowledge into a single model, Maestro trains a lightweight policy to dynamically compose ensembles of frozen expert models and a two-tier skill library, deciding at each step whether to invoke an external expert, which model-skill pair to select, and when to terminate. The policy is optimized via outcome-based RL, requiring no step-level supervision. We evaluate Maestro across ten representative multimodal benchmarks spanning mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis. With only a 4B orchestrator, Maestro achieves an average accuracy of 70.1%, surpassing both GPT-5 (69.3%) and Gemini-2.5-Pro (68.7%). Crucially, the learned coordination policy generalizes to unseen models and skills without retraining: augmenting the registry with out-of-domain experts yields a 59.5% average on four challenging benchmarks, outperforming all closed-source baselines. Maestro further maintains high computational efficiency with low latency. The source code is available at https://github.com/jinyangwu/Maestro.
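To make "outcome-based RL, requiring no step-level supervision" concrete, here is a toy sketch in which every step of a trajectory is credited with the same final reward. The abstract does not name the optimization algorithm, so plain REINFORCE with a batch-mean baseline is swapped in purely for illustration. The expert names, the two-step "correct" plan, the tabular softmax policy, and all hyperparameters are assumptions.

```python
import math
import random

ACTIONS = ["ocr_expert", "chart_expert", "math_expert"]  # hypothetical experts
GOOD_TRAJECTORY = (0, 2)  # toy "correct" plan: ocr_expert, then math_expert


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_trajectory(logits, rng):
    """Sample one action index per step from the per-step softmax policy."""
    return tuple(
        rng.choices(range(len(ACTIONS)), weights=softmax(step_logits))[0]
        for step_logits in logits
    )


def outcome_reward(trajectory):
    """Reward depends only on the final outcome, never on single steps."""
    return 1.0 if trajectory == GOOD_TRAJECTORY else 0.0


def success_probability(logits):
    return math.prod(
        softmax(logits[t])[a] for t, a in enumerate(GOOD_TRAJECTORY)
    )


def train(iterations=300, batch_size=16, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [[0.0] * len(ACTIONS) for _ in GOOD_TRAJECTORY]
    for _ in range(iterations):
        batch = [sample_trajectory(logits, rng) for _ in range(batch_size)]
        rewards = [outcome_reward(tr) for tr in batch]
        baseline = sum(rewards) / batch_size  # batch-mean baseline
        probs = [softmax(step_logits) for step_logits in logits]
        for tr, r in zip(batch, rewards):
            advantage = r - baseline
            # The same trajectory-level advantage is applied to every step.
            for t, a in enumerate(tr):
                for k in range(len(ACTIONS)):
                    grad_logp = (1.0 if k == a else 0.0) - probs[t][k]
                    logits[t][k] += lr * advantage * grad_logp / batch_size
    return logits


trained_logits = train()
print(f"P(correct plan) after training: {success_probability(trained_logits):.3f}")
```

The policy starts uniform, so the correct two-step plan has probability 1/9 before training. No step is ever labeled as right or wrong; the policy only learns which steps matter through the shared outcome signal.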


Models citing this paper: 1

#### Jinyang23/Maestro-4B • 5B • Updated about 3 hours ago
