Tag
This paper introduces a plug-in calibration module that adjusts multimodal representations before fusion, using cross-modal context to suppress misleading signals and emphasize reliable ones, improving performance on multiple benchmarks.
This paper presents a formal roadmap for transitioning from late-fusion multimodal approaches to native multimodal modeling (NMM) within a unified transformer framework, categorizing existing models by input-output duality and systematically addressing architectural coordination, data curation, training recipes, and evaluation.
This paper introduces CODA, a GPU kernel abstraction that expresses Transformer operations as GEMM-plus-epilogue programs to reduce data movement, covering nearly all non-attention computation in a Transformer block.
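The GEMM-plus-epilogue idea named in the CODA summary can be illustrated with a small sketch. This is not CODA's API or implementation, which the summary does not describe. It is a plain-Python illustration of the general pattern, and the names `gemm_with_epilogue` and `bias_relu` are hypothetical. The point it shows is that the elementwise work (here, bias add and ReLU) is applied to each accumulator value before it is written out, so the intermediate matrix product is never stored and re-read.

```python
from typing import Callable, List

Matrix = List[List[float]]
# An epilogue maps (accumulator value, row index, column index) to the stored value.
Epilogue = Callable[[float, int, int], float]


def gemm_with_epilogue(a: Matrix, b: Matrix, epilogue: Epilogue) -> Matrix:
    """Compute epilogue(a @ b) one output element at a time.

    Hypothetical helper for illustration only: the epilogue runs while the
    dot-product result is still in a local variable, mimicking a fused kernel
    that avoids a round trip through memory for the raw product.
    """
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            # Fused step: transform the accumulator, then store once.
            out[i][j] = epilogue(acc, i, j)
    return out


def bias_relu(bias: List[float]) -> Epilogue:
    """Build an epilogue that adds a per-column bias and applies ReLU."""
    def epilogue(acc: float, i: int, j: int) -> float:
        return max(acc + bias[j], 0.0)
    return epilogue


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, -1.0]]
bias = [0.5, 1.0]

# Fused: one pass produces the final activations.
fused = gemm_with_epilogue(a, b, bias_relu(bias))

# Unfused reference: materialize the raw product, then post-process it.
raw = gemm_with_epilogue(a, b, lambda acc, i, j: acc)
unfused = [[max(v + bias[j], 0.0) for j, v in enumerate(row)] for row in raw]
```

Both paths give the same values. The difference a real fused kernel would exploit is that the unfused path writes and re-reads the intermediate `raw` matrix, while the fused path does not.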