PEEK: Picking Essential frames via Efficient Knowledge distillation

Hugging Face Daily Papers 05/29/26, 12:00 AM Papers

Summary

Introduces PEEK, an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a teacher model into a lightweight temporal model, outperforming state-of-the-art methods in video captioning while maintaining computational efficiency.

Video-language models can process only a limited number of frames, making frame selection a key bottleneck for efficient video captioning. Most captioning pipelines still rely on uniform sampling, which is computationally cheap but agnostic to visual content. Adaptive frame sampling has recently emerged as a promising approach for selecting the most informative frames from a video; however, existing methods remain computationally expensive. We introduce PEEK, an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a stronger teacher model into a lightweight temporal model that operates only on visual content. We find that, overall, on ActivityNet Captions and MSR-VTT, our method outperforms state-of-the-art methods across all evaluated downstream vision language models, especially when only one or two frames are selected for captioning, obtaining the best CIDEr for most frame budgets. On ActivityNet Captions, PEEK is particularly strong, winning 14 out of 16 configurations. Zero-shot evaluation on MSR-VTT shows that our model transfers best at low frame budgets, while results at four and eight frames are more mixed as temporal coverage and visual diversity become increasingly competitive. Compared with recent adaptive baselines, PEEK is both more accurate in the low-budget regime and more efficient: it adds only 5.2% to the captioning time, compared with 65.4% for CSTA and 211.9% for MaxInfo. We release our code and pre-trained checkpoint at https://github.com/momentslab/peek.
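To make the contrast between uniform and adaptive sampling concrete, here is a minimal sketch in plain Python. It is not PEEK's code: the function names `uniform_sample` and `top_k_by_score` are hypothetical, and the per-frame relevance scores are assumed to come from some scoring model, such as the lightweight temporal model described above.

```python
from typing import List, Sequence


def uniform_sample(num_frames: int, budget: int) -> List[int]:
    """Content-agnostic baseline: take the centre frame of `budget` equal segments."""
    if budget <= 0 or num_frames <= 0:
        return []
    budget = min(budget, num_frames)
    seg = num_frames / budget
    return [int(seg * i + seg / 2) for i in range(budget)]


def top_k_by_score(scores: Sequence[float], budget: int) -> List[int]:
    """Content-aware selection (illustrative assumption, not the paper's exact rule):
    keep the `budget` highest-scoring frames, returned in temporal order."""
    budget = max(0, min(budget, len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Hypothetical relevance scores for a five-frame clip.
scores = [0.1, 0.2, 0.9, 0.05, 0.8]
uniform_indices = uniform_sample(len(scores), 2)
adaptive_indices = top_k_by_score(scores, 2)
```

With a budget of two frames, the uniform baseline ignores the scores entirely, while the score-based selector keeps the two frames rated most relevant. Plain top-k ignores temporal coverage and visual diversity, which the abstract notes become more competitive at four and eight frames.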

Original Article

Cached at: 06/01/26, 11:20 AM

Source: https://huggingface.co/papers/2605.31029

Abstract

PEEK is an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a teacher model into a lightweight temporal model, outperforming state-of-the-art methods in video captioning while maintaining computational efficiency.

Video-language models can process only a limited number of frames, making frame selection a key bottleneck for efficient video captioning. Most captioning pipelines still rely on uniform sampling, which is computationally cheap but agnostic to visual content. Adaptive frame sampling has recently emerged as a promising approach for selecting the most informative frames from a video; however, existing methods remain computationally expensive. We introduce PEEK, an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a stronger teacher model into a lightweight temporal model that operates only on visual content. We find that, overall, on ActivityNet Captions and MSR-VTT, our method outperforms state-of-the-art methods across all evaluated downstream vision language models, especially when only one or two frames are selected for captioning, obtaining the best CIDEr for most frame budgets. On ActivityNet Captions, PEEK is particularly strong, winning 14 out of 16 configurations. Zero-shot evaluation on MSR-VTT shows that our model transfers best at low frame budgets, while results at four and eight frames are more mixed as temporal coverage and visual diversity become increasingly competitive. Compared with recent adaptive baselines, PEEK is both more accurate in the low-budget regime and more efficient: it adds only 5.2% to the captioning time, compared with 65.4% for CSTA and 211.9% for MaxInfo. We release our code and pre-trained checkpoint at https://github.com/momentslab/peek.
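The abstract does not say which objective is used to distill the teacher's rankings into the student. As an illustrative stand-in only, the sketch below uses a ListNet-style listwise loss: the cross-entropy between the softmax of the teacher's per-frame scores and the softmax of the student's. The names `softmax` and `listwise_distill_loss` and the `temperature` parameter are assumptions made for this example, not part of the released code.

```python
import math
from typing import List, Sequence


def softmax(xs: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over a list of per-frame scores."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_distill_loss(
    teacher_scores: Sequence[float],
    student_scores: Sequence[float],
    temperature: float = 1.0,
) -> float:
    """ListNet-style loss (an assumed objective, not confirmed by the paper):
    cross-entropy between the teacher and student distributions over frames."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    # Clamp q to avoid log(0) if a student probability underflows.
    return -sum(pi * math.log(max(qi, 1e-12)) for pi, qi in zip(p, q))


# Hypothetical scores for a three-frame clip.
teacher = [2.0, 1.0, 0.0]
loss_matching = listwise_distill_loss(teacher, [2.0, 1.0, 0.0])
loss_reversed = listwise_distill_loss(teacher, [0.0, 1.0, 2.0])
```

The loss is smallest when the student reproduces the teacher's distribution over frames and grows as the student's ranking diverges, so a student that reverses the teacher's order is penalized more than one that matches it. In this framing the teacher scores are caption-conditioned, while the student sees only visual content.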

Get this paper in your agent:

hf papers read 2605.31029

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### momentslab/peek Updated about 4 hours ago • 10 • 2

Similar Articles

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

FrameSkip: Learning from Fewer but More Informative Frames in VLA Training

Peak-Detector: Explainable Peak Detection via Instruction-Tuned Large Language Models in Physiological Sign

Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback
