VCR: Learning Valid Contextual Representation for Incomplete Wearable Signals
Summary
VCR is a self-supervised framework that learns robust representations from incomplete wearable signals using orthogonal tokenization and a missing-aware mixture-of-experts, improving performance when modalities are missing.
Similar Articles
InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning
InternVideo3 introduces Multimodal Contextual Reasoning (MCR) and efficient attention mechanisms to enhance long-horizon multimodal tasks, achieving strong results on video understanding benchmarks and demonstrating video agent capabilities.
Weakly Supervised Concept Learning for Object-centric Visual Reasoning
This paper introduces a two-stage neuro-symbolic framework that uses weak supervision (as little as 1% of labels) with a slot-based VAE to learn interpretable symbols for object-centric visual reasoning, outperforming foundation models in domain generalization.
CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models
CollabVR is a closed-loop framework that integrates vision-language models with video generation models to improve visual reasoning and correct failures in real time.
Vokenization: Multimodal Learning for Vision and Language
The article explains 'Vokenization,' a multimodal learning technique that bridges computer vision and natural language processing by using weak supervision to link visual data with language tokens. It contrasts this approach with text-only models like GPT-3 and BERT, highlighting how visual grounding can improve language understanding.
CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework
CaVe-VLM-CoT is a modular reflection-based agentic-RAG framework for vision-language models that enforces evidence-grounded reasoning through a five-stage pipeline, achieving 87.1% accuracy on ScienceQA and proposing a suite of 23 metrics for evaluation.