Tag
PEEK is a dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a teacher model into a lightweight temporal model, outperforming state-of-the-art methods in video captioning while remaining computationally efficient.
Swift Sampling is a training-free algorithm that uses Taylor expansion to identify high-information moments in long-form videos by detecting deviations from predicted feature trajectories, improving accuracy on video QA tasks with minimal computational overhead.
FrameSkip is a data-layer frame selection method that improves Vision-Language-Action (VLA) policy training by prioritizing high-importance frames based on action variation and visual-coherence metrics, achieving a macro-average success rate of 76.15% across three benchmarks while using only 20% of unique frames.
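
The Swift Sampling summary above describes scoring frames by how far they deviate from a Taylor-predicted feature trajectory. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes per-frame feature vectors are already available, uses a second-order expansion with finite-difference derivatives, measures deviation with the Euclidean norm, and keeps the top-k frames. The function names `taylor_deviation_scores` and `select_frames` are hypothetical, and the published method may use a different expansion order, feature extractor, or selection rule.

```python
import numpy as np


def taylor_deviation_scores(features: np.ndarray) -> np.ndarray:
    """Score each frame by its distance from a second-order Taylor prediction.

    features: array of shape (num_frames, dim), one feature vector per frame.
    Assumption: derivatives are approximated by backward finite differences.
    """
    feats = np.asarray(features, dtype=np.float64)
    num_frames = feats.shape[0]
    scores = np.zeros(num_frames)
    if num_frames < 4:
        return scores  # not enough history to predict anything

    # For each target frame t >= 3, use frames t-3, t-2, t-1 as history.
    prev2, prev1, curr = feats[:-3], feats[1:-2], feats[2:-1]
    velocity = curr - prev1                      # first-order difference
    acceleration = curr - 2.0 * prev1 + prev2    # second-order difference
    predicted = curr + velocity + 0.5 * acceleration

    # Deviation between the observed frame and the predicted one.
    scores[3:] = np.linalg.norm(feats[3:] - predicted, axis=1)
    return scores


def select_frames(features: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k highest-deviation frames, in temporal order."""
    scores = taylor_deviation_scores(features)
    top = np.argsort(-scores, kind="stable")[:k]
    return np.sort(top)


# Toy trajectory: features move linearly, then jump at frame 10.
t = np.arange(20, dtype=np.float64)
toy_features = np.stack([t, 2.0 * t], axis=1)
toy_features[10:] += 5.0
selected = select_frames(toy_features, k=3)
```

In this toy case the smooth, linear stretch is predicted exactly and scores zero, so only the frames around the jump receive non-zero scores. Because the sketch needs no learned parameters, it stays consistent with the training-free framing of the summary.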