PEEK: Picking Essential frames via Efficient Knowledge distillation
Summary
Introduces PEEK, an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a teacher model into a lightweight temporal model, outperforming state-of-the-art methods in video captioning while maintaining computational efficiency.
Cached at: 06/01/26, 11:20 AM
Source: https://huggingface.co/papers/2605.31029
Abstract
Video-language models can process only a limited number of frames, making frame selection a key bottleneck for efficient video captioning. Most captioning pipelines still rely on uniform sampling, which is computationally cheap but agnostic to visual content. Adaptive frame sampling has recently emerged as a promising approach for selecting the most informative frames from a video; however, existing methods remain computationally expensive. We introduce PEEK, an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a stronger teacher model into a lightweight temporal model that operates only on visual content. We find that, overall, on ActivityNet Captions and MSR-VTT, our method outperforms state-of-the-art methods across all evaluated downstream vision language models, especially when only one or two frames are selected for captioning, obtaining the best CIDEr for most frame budgets. On ActivityNet Captions, PEEK is particularly strong, winning 14 out of 16 configurations. Zero-shot evaluation on MSR-VTT shows that our model transfers best at low frame budgets, while results at four and eight frames are more mixed as temporal coverage and visual diversity become increasingly competitive. Compared with recent adaptive baselines, PEEK is both more accurate in the low-budget regime and more efficient: it adds only 5.2% to the captioning time, compared with 65.4% for CSTA and 211.9% for MaxInfo. We release our code and pre-trained checkpoint at https://github.com/momentslab/peek.
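The core idea in the abstract, distilling a teacher's per-frame relevance ranking into a student and then keeping the top-scoring frames, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the loss or the selection rule, so the ListNet-style listwise cross-entropy, the `temperature` parameter, and the function names `listwise_distillation_loss` and `select_frames` are all assumptions made for illustration. The teacher scores stand in for caption-conditioned relevance; the student scores stand in for the output of a model that sees only visual content.

```python
import math
from typing import List, Sequence


def log_softmax(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax over per-frame scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    log_total = math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - peak - log_total for s in scaled]


def listwise_distillation_loss(
    student_scores: Sequence[float],
    teacher_scores: Sequence[float],
    temperature: float = 1.0,
) -> float:
    """Cross-entropy between teacher and student score distributions.

    Assumption: a ListNet-style listwise loss; the paper's actual
    objective is not given in the abstract.
    """
    if len(student_scores) != len(teacher_scores):
        raise ValueError("student and teacher must score the same frames")
    teacher_probs = [math.exp(lp) for lp in log_softmax(teacher_scores, temperature)]
    student_log_probs = log_softmax(student_scores, temperature)
    return -sum(t * lp for t, lp in zip(teacher_probs, student_log_probs))


def select_frames(scores: Sequence[float], budget: int) -> List[int]:
    """Return indices of the `budget` highest-scoring frames in temporal order."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Hypothetical relevance scores for a four-frame clip.
teacher = [0.1, 0.9, 0.2, 0.8]
student_aligned = [0.0, 1.0, 0.1, 0.9]     # ranks frames like the teacher
student_misaligned = [1.0, 0.0, 0.9, 0.1]  # ranks frames in the opposite order

loss_aligned = listwise_distillation_loss(student_aligned, teacher)
loss_misaligned = listwise_distillation_loss(student_misaligned, teacher)
picked = select_frames(student_aligned, budget=2)  # [1, 3]
```

In this toy example the student whose ranking agrees with the teacher incurs the lower loss, and at inference time only the student's scores are needed to pick frames, which is what keeps the selection step independent of any caption.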
Get this paper in your agent:
hf papers read 2605.31029
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 1
momentslab/peek (updated about 4 hours ago)
Similar Articles
PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
This paper introduces PEEK, a system that caches orientation knowledge about recurring external contexts as a context map, enabling LLM agents to reuse context knowledge across invocations and significantly improving efficiency and accuracy on long-context reasoning and information aggregation tasks.
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Re2Pix is a hierarchical video prediction framework that improves future video generation by first predicting semantic representations using frozen vision foundation models, then conditioning a latent diffusion model on these predictions to generate photorealistic frames. The approach addresses train-test mismatches through nested dropout and mixed supervision strategies, achieving improved temporal semantic consistency and perceptual quality on autonomous driving benchmarks.
FrameSkip: Learning from Fewer but More Informative Frames in VLA Training
FrameSkip is a data-layer frame selection method that improves Vision-Language-Action (VLA) policy training by prioritizing high-importance frames based on action variation and visual-coherence metrics, achieving a macro-average success rate of 76.15% across three benchmarks while using only 20% of unique frames.
Peak-Detector: Explainable Peak Detection via Instruction-Tuned Large Language Models in Physiological Sign
Introduces Peak-Detector, a framework that uses instruction-tuned large language models for robust, cross-modal, and explainable peak detection in physiological signals like ECG, PPG, BCG, and BSG. The method transforms time-series data into a condensed 'peak-representation' format and is optimized via supervised fine-tuning followed by reinforcement learning with a multi-objective reward.
Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback
This paper introduces three parameter-efficient methods for multi-view proficiency estimation on the Ego-Exo4D dataset, shifting from discriminative classification to generative feedback. The proposed models achieve state-of-the-art accuracy with significantly fewer parameters and training epochs than video-transformer baselines.