Tag
Proposes a hierarchical semantic-constrained heterogeneous graph model for open-vocabulary audio-visual event localization, addressing cross-modal consistency at multiple temporal scales and hierarchical semantic constraints between segment and video levels. Achieves state-of-the-art results on OV-AVEL benchmark.
VoLoAgent integrates vision-language models with robot capabilities for open-vocabulary long-horizon manipulation tasks, introducing a physical orchestrator that plans, monitors, and recovers using interruptible tools, and a benchmark called RoboVoLo for evaluation.
This paper introduces DiGSeg, a framework that repurposes pretrained diffusion models for state-of-the-art semantic and open-vocabulary segmentation by leveraging latent space conditioning and text-guided alignment.
Grounding DINO is an open-vocabulary object detection model that can detect arbitrary objects based on text descriptions, now available on Replicate.