Tag
A reverse-engineering analysis of Kimi K2.6 reveals that its architecture prioritizes orchestration and skill injection over raw parameter count, achieving high SWE-Bench scores through multi-agent collaboration without retraining.
This paper systematically investigates cross-modal skill injection, where a domain-expert LLM is merged into a VLM to induce emergent multimodal capabilities. It evaluates several scenarios (instruction-following, cross-lingual, mathematical reasoning), merging methods (TA, DARE, etc.), and hyperparameter settings, finding that TA and DARE perform well in every scenario except mathematical reasoning.
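
To make the merging step in the summary above concrete, here is a minimal sketch of the two named methods. It assumes that TA stands for task arithmetic, that the expert LLM and the VLM's language model were both fine-tuned from the same base LLM, and that weights are plain NumPy arrays keyed by parameter name. The function names, the toy 4x4 weights, and the values `lam=0.5` and `drop_rate=0.5` are illustrative and do not come from the paper.

```python
import numpy as np


def task_vector(expert, base):
    """Per-parameter difference between expert and base weights."""
    return {name: expert[name] - base[name] for name in base}


def task_arithmetic(target, tau, lam=1.0):
    """Add a scaled task vector to the target model's weights."""
    return {name: target[name] + lam * tau[name] for name in target}


def dare(tau, drop_rate, rng):
    """Drop And REscale: zero each delta entry with probability
    drop_rate, then divide the survivors by (1 - drop_rate) so the
    expected value of the task vector is unchanged."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    keep_prob = 1.0 - drop_rate
    sparse = {}
    for name, delta in tau.items():
        keep_mask = rng.random(delta.shape) >= drop_rate
        sparse[name] = np.where(keep_mask, delta / keep_prob, 0.0)
    return sparse


# Toy stand-ins for real checkpoints (hypothetical shapes and values).
rng = np.random.default_rng(0)
base = {"w": rng.normal(size=(4, 4))}
expert = {"w": base["w"] + 0.1 * rng.normal(size=(4, 4))}   # domain-expert LLM
vlm_lm = {"w": base["w"] + 0.1 * rng.normal(size=(4, 4))}   # VLM language model

tau = task_vector(expert, base)
merged_ta = task_arithmetic(vlm_lm, tau, lam=0.5)
merged_dare = task_arithmetic(vlm_lm, dare(tau, 0.5, rng), lam=0.5)
```

In this sketch only the language-model weights are touched; the vision encoder and projector of the VLM would be carried over unchanged. The scaling coefficient and drop rate are presumably among the hyperparameters the paper varies, though the summary does not say which ones.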