Tag
A survey presenting a human-view perspective on video understanding with multimodal large language models, organized around three abilities (watching, remembering, and reasoning) and covering challenges, methods, and applications.
A tweet discusses how OpenAI launches are no longer seen as startup-killers, referencing a new Codex feature that deploys websites using Cloudflare's Sites, D1, and R2.
This work introduces Representation Forcing (RF), a technique that enables unified multimodal models to perform both perception and generation end-to-end without external VAE latent spaces, matching state-of-the-art VAE-based models in image generation while improving understanding.
DynaFLIP is a dynamics-aware multimodal pre-training framework that integrates motion understanding into visual perception for robot manipulation. It uses image-language-3D flow triplets and geometric regularization to improve representation learning, achieving significant gains in out-of-distribution scenarios.
This paper advocates for incorporating enactive approaches to perception and cognition into AI, highlighting four key concepts: experience, action-perception inseparability, autonomy, and embodiment. It finds resonance with reinforcement learning but suggests broader integration of enactive ideas.
A reflection on the improving quality of AI-generated images, questioning at what point they become indistinguishable from real photography or digital art.
DexHoldem is a real-world benchmark for evaluating embodied agents in dexterous manipulation tasks, using Texas Hold'em with a ShadowHand to test primitive execution, perception, and decision-making in a closed-loop setting.
This paper introduces a reinforcement learning framework that improves perception-reasoning synergy in vision-language models by explicitly rewarding perceptual fidelity, using a 'blindfolded reasoning' proxy and structured verbal verification to address ambiguity in modality credit assignment.
This paper proposes Embedding Temporal Logic (ETL), a temporal logic that monitors perception-based autonomous systems directly in learned embedding spaces, enabling specification of high-level perceptual concepts and achieving strong empirical agreement with ground-truth semantics.