CogniRoute: Learning to Route Social Evidence in Omni-Modal Models

Hugging Face Daily Papers 06/18/26, 12:00 AM Papers

Summary

CogniRoute is a schema-guided Mixture-of-Experts framework for social video question answering that improves multimodal reasoning through cognitive schema factorization and route-aware reinforcement learning. It reaches 59.38% average accuracy on the new OmniSocialBench benchmark, 15.33 percentage points above the strongest proprietary baseline and 26.77 points above the strongest open-source omni baseline.

Omni-modal models can ingest video, audio, and text, but unified access to multiple modalities does not guarantee that a model uses the right evidence. This gap is especially pronounced in social video question answering, where the answer may hinge on a gesture, vocal tone, temporal cue, or mismatch between what is said and what is visually expressed. We introduce CogniRoute, a schema-guided Mixture-of-Experts framework for social omni reasoning. CogniRoute uses a training-only cognitive schema that factorizes each example by cross-modal relation, reasoning demand, and temporal scope, and aligns global routing signatures with this structure during supervised fine-tuning. We further introduce route-aware reinforcement learning, which jointly optimizes token generation and expert allocation using rewards for answer correctness, modality-consistent reasoning, and cognitive temporal grounding. To support training and evaluation, we construct OmniSocialBench, a diagnostic social video QA resource with 118K structured training examples, grounded reasoning traces, schema labels, temporal evidence spans, and a manually verified evaluation split. CogniRoute achieves 59.38% average accuracy on OmniSocialBench, improving over the strongest proprietary baseline by 15.33 percentage points and the strongest open-source omni baseline by 26.77 points, with the largest gains on questions requiring audio-visual coordination, conflict resolution, and temporally grounded social inference.
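The reported gains imply approximate baseline scores, which the abstract does not state directly. The short calculation below backs them out, assuming both gains are measured on the same average-accuracy metric as the 59.38% figure; the function and variable names are illustrative only.

```python
# Figures quoted in the abstract (accuracy in percent, gains in percentage points).
COGNIROUTE_ACC = 59.38
GAIN_OVER_PROPRIETARY = 15.33
GAIN_OVER_OPEN_SOURCE = 26.77


def implied_baseline(model_acc: float, gain_pp: float) -> float:
    """Baseline accuracy implied by a reported percentage-point gain."""
    return round(model_acc - gain_pp, 2)


# Assumption: each gain is a simple difference in average accuracy.
proprietary_baseline = implied_baseline(COGNIROUTE_ACC, GAIN_OVER_PROPRIETARY)
open_source_baseline = implied_baseline(COGNIROUTE_ACC, GAIN_OVER_OPEN_SOURCE)

print(f"strongest proprietary baseline: about {proprietary_baseline:.2f}%")
print(f"strongest open-source omni baseline: about {open_source_baseline:.2f}%")
```

Under that assumption, the strongest proprietary baseline sits near 44.05% and the strongest open-source omni baseline near 32.61%.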

Original Article

Cached at: 06/29/26, 10:05 PM

Source: https://huggingface.co/papers/2606.20970

Abstract

Omni-modal models can ingest video, audio, and text, but unified access to multiple modalities does not guarantee that a model uses the right evidence. This gap is especially pronounced in social video question answering, where the answer may hinge on a gesture, vocal tone, temporal cue, or mismatch between what is said and what is visually expressed. We introduce CogniRoute, a schema-guided Mixture-of-Experts framework for social omni reasoning. CogniRoute uses a training-only cognitive schema that factorizes each example by cross-modal relation, reasoning demand, and temporal scope, and aligns global routing signatures with this structure during supervised fine-tuning. We further introduce route-aware reinforcement learning, which jointly optimizes token generation and expert allocation using rewards for answer correctness, modality-consistent reasoning, and cognitive temporal grounding. To support training and evaluation, we construct OmniSocialBench, a diagnostic social video QA resource with 118K structured training examples, grounded reasoning traces, schema labels, temporal evidence spans, and a manually verified evaluation split. CogniRoute achieves 59.38% average accuracy on OmniSocialBench, improving over the strongest proprietary baseline by 15.33 percentage points and the strongest open-source omni baseline by 26.77 points, with the largest gains on questions requiring audio-visual coordination, conflict resolution, and temporally grounded social inference.
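The abstract names two mechanisms without giving formulas: aligning a global routing signature with schema labels, and a reward that combines answer correctness, modality-consistent reasoning, and temporal grounding. The sketch below is one plausible reading, not the paper's method. It assumes the signature is the mean of per-token expert probabilities, the alignment term is a cross-entropy against a schema-derived target distribution, temporal grounding is scored by span IoU, and the reward is a weighted sum. All function names, the target distribution, and the weights are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one token's router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def global_routing_signature(router_logits: Sequence[Sequence[float]]) -> List[float]:
    """Assumed definition: mean expert probability over all tokens of one example."""
    per_token = [softmax(row) for row in router_logits]
    n_experts = len(per_token[0])
    return [sum(p[e] for p in per_token) / len(per_token) for e in range(n_experts)]


def schema_alignment_loss(signature: Sequence[float], schema_target: Sequence[float]) -> float:
    """Assumed loss: cross-entropy from a schema-derived target to the signature."""
    return -sum(t * math.log(s + 1e-12) for t, s in zip(schema_target, signature))


def temporal_iou(pred: Span, gold: Span) -> float:
    """Intersection-over-union of two time spans."""
    overlap = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - overlap
    return overlap / union if union > 0 else 0.0


def route_aware_reward(
    answer_correct: bool,
    modality_consistent: bool,
    pred_span: Span,
    gold_span: Span,
    weights: Tuple[float, float, float] = (1.0, 0.5, 0.5),  # hypothetical weights
) -> float:
    """Assumed form: weighted sum of the three reward terms named in the abstract."""
    w_ans, w_mod, w_time = weights
    return (
        w_ans * float(answer_correct)
        + w_mod * float(modality_consistent)
        + w_time * temporal_iou(pred_span, gold_span)
    )


# Toy example: two tokens routed over three experts.
signature = global_routing_signature([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
loss = schema_alignment_loss(signature, [1.0, 0.0, 0.0])
reward = route_aware_reward(True, False, (2.0, 6.0), (4.0, 8.0))
print(signature, loss, reward)
```

The sketch stops at the scalar reward. How that reward is propagated to both token generation and expert allocation is the core of the route-aware reinforcement learning described in the abstract and is not reproduced here.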

Similar Articles

CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment

SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment

SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning

Native Active Perception as Reasoning for Omni-Modal Understanding
