CogniRoute: Learning to Route Social Evidence in Omni-Modal Models
Summary
CogniRoute is a schema-guided Mixture-of-Experts framework for social video question answering that improves multimodal reasoning through cognitive schema factorization and route-aware reinforcement learning. It achieves significant gains over baselines on the new OmniSocialBench benchmark.
Source: https://huggingface.co/papers/2606.20970
Abstract
Omni-modal models can ingest video, audio, and text, but unified access to multiple modalities does not guarantee that a model uses the right evidence. This gap is especially pronounced in social video question answering, where the answer may hinge on a gesture, vocal tone, temporal cue, or mismatch between what is said and what is visually expressed. We introduce CogniRoute, a schema-guided Mixture-of-Experts framework for social omni reasoning. CogniRoute uses a training-only cognitive schema that factorizes each example by cross-modal relation, reasoning demand, and temporal scope, and aligns global routing signatures with this structure during supervised fine-tuning. We further introduce route-aware reinforcement learning, which jointly optimizes token generation and expert allocation using rewards for answer correctness, modality-consistent reasoning, and cognitive temporal grounding. To support training and evaluation, we construct OmniSocialBench, a diagnostic social video QA resource with 118K structured training examples, grounded reasoning traces, schema labels, temporal evidence spans, and a manually verified evaluation split. CogniRoute achieves 59.38% average accuracy on OmniSocialBench, improving over the strongest proprietary baseline by 15.33 percentage points and the strongest open-source omni baseline by 26.77 points, with the largest gains on questions requiring audio-visual coordination, conflict resolution, and temporally grounded social inference.
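The abstract names three reward terms for route-aware reinforcement learning but does not say how they are computed or combined. The sketch below is one plausible reading, not the authors' method. It assumes binary scores for answer correctness and modality consistency, span intersection-over-union as the temporal grounding score, and a weighted sum with made-up weights. The names `RewardWeights`, `temporal_iou`, and `route_aware_reward` are hypothetical, and the expert-allocation side of the objective is not modeled here.

```python
from dataclasses import dataclass
from typing import Tuple

Span = Tuple[float, float]  # (start, end) of an evidence span, in seconds


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights: the abstract does not state how the terms are mixed.
    answer: float = 1.0
    modality: float = 0.5
    temporal: float = 0.5


def temporal_iou(pred: Span, gold: Span) -> float:
    """Intersection-over-union of two time spans (an assumed grounding score)."""
    pred_start, pred_end = pred
    gold_start, gold_end = gold
    intersection = max(0.0, min(pred_end, gold_end) - max(pred_start, gold_start))
    union = (pred_end - pred_start) + (gold_end - gold_start) - intersection
    return intersection / union if union > 0 else 0.0


def route_aware_reward(
    answer_correct: bool,
    modality_consistent: bool,
    pred_span: Span,
    gold_span: Span,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of the three reward terms named in the abstract."""
    r_answer = 1.0 if answer_correct else 0.0
    r_modality = 1.0 if modality_consistent else 0.0
    r_temporal = temporal_iou(pred_span, gold_span)
    return (
        weights.answer * r_answer
        + weights.modality * r_modality
        + weights.temporal * r_temporal
    )
```

Under these assumptions, a response that answers correctly and cites the right modality but points to a shifted time window still earns partial temporal credit, so the scalar reward separates "right answer, wrong evidence" from a fully grounded answer.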
Similar Articles
CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment
This paper analyzes the thinking-answer inconsistency in multimodal reinforcement learning with verifiable rewards (RLVR) for large vision-language models and proposes CORA, a method that introduces a consistency reward model and hybrid reward advantage splitting to improve faithfulness and task performance.
SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment
This paper proposes SARA, a framework that aligns routing distributions of multilingual inputs using Jensen-Shannon divergence to improve expert sharing for low-resource languages in sparse Mixture-of-Experts models. Experiments on Qwen3-30B-A3B and Phi-3.5-MoE-instruct show improvements on multilingual benchmarks.
SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks
SciOrch presents an 8B vision-language model trained with MCTS to coordinate multiple expert LLMs for multimodal scientific reasoning, achieving superior performance while reducing API costs.
MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning
This paper proposes MODF-SIR, a multi-agent collaborative framework built on a lightweight multimodal large language model for social intelligence reasoning. It employs knowledge distillation, long-tail event extraction, and test-time adaptation to achieve state-of-the-art results with reduced training data.
Native Active Perception as Reasoning for Omni-Modal Understanding
Introduces OmniAgent, an omni-modal agent that uses an iterative Observation-Thought-Action cycle with active perception to achieve superior long video understanding, outperforming larger models like Qwen2.5-VL-72B on benchmarks.