Leveraging Vision-Language Models to Detect Attention in Educational Videos

arXiv cs.AI 05/22/26, 04:00 AM Papers

Summary

This paper explores using a Vision-Language Model (VLM) to detect attention loss in educational videos by combining gaze data with video content, but finds that VLM approaches do not outperform traditional machine learning baselines.

arXiv:2605.20211v1 Announce Type: cross Abstract: Educational videos are a cornerstone of remote and blended learning. However, learners' fluctuating attention remains a significant barrier to effective information retention. Prior research has attempted to mitigate this by detecting and reacting to attention loss at runtime using eye tracking. Such detection has so far been based on classical machine learning classifiers trained on engineered features, such as summary statistics over learners' fixations and saccades. These methods have struggled to capture the complex, temporal nature of learner engagement, thus exhibiting moderate prediction performance. In this study, we aim to advance the detection of attention by shifting from standard engineered features to multimodal foundation models. Using an educational eye-tracking dataset (N = 70), we investigate a novel methodology that utilizes a Vision-Language Model (VLM) to analyze video content directly with superimposed gaze data. This approach aims to leverage the semantic reasoning capabilities of foundation models to contextualize learner focus within the video stream. We evaluate the performance of this VLM-based approach using several prompting strategies with Gemini 3, but ultimately find that none of them outperforms statistical baselines. Our results provide new insights into the limitations of using VLMs for real-time educational diagnostics.
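To make the baseline family mentioned in the abstract concrete, here is a minimal sketch of "summary statistics over fixations and saccades" computed for one time window. The abstract does not list the actual features, so the `Fixation` type, the feature names, the pixel units, and the approximation of a saccade as the jump between consecutive fixation centres are all assumptions made for illustration, not the authors' pipeline.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass
class Fixation:
    # Hypothetical record layout: screen position in pixels, duration in ms.
    x: float
    y: float
    duration_ms: float


def gaze_summary_features(fixations):
    """Summary statistics over one window of fixations (illustrative feature set)."""
    if not fixations:
        raise ValueError("need at least one fixation")
    durations = [f.duration_ms for f in fixations]
    # Assumption: approximate each saccade by the Euclidean distance
    # between the centres of two consecutive fixations.
    amplitudes = [
        math.dist((a.x, a.y), (b.x, b.y))
        for a, b in zip(fixations, fixations[1:])
    ]
    return {
        "fixation_count": len(fixations),
        "mean_fixation_ms": statistics.fmean(durations),
        "std_fixation_ms": statistics.pstdev(durations),
        "mean_saccade_px": statistics.fmean(amplitudes) if amplitudes else 0.0,
        "max_saccade_px": max(amplitudes, default=0.0),
    }


# Toy window of three fixations (made-up values, not from the dataset).
window = [Fixation(100, 100, 200), Fixation(103, 104, 300), Fixation(403, 504, 100)]
features = gaze_summary_features(window)
```

A feature dictionary like this, computed per window, is the kind of input a classical classifier would be trained on. The VLM approach described in the abstract instead passes the video with the gaze overlaid, without this hand-engineering step.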

# Leveraging Vision-Language Models to Detect Attention in Educational Videos
Source: [https://arxiv.org/abs/2605.20211](https://arxiv.org/abs/2605.20211)
Authors: [Gabriel Becquet](https://arxiv.org/search/cs?searchtype=author&query=Becquet,+G) (LIP6, CNRS, SU), [Sébastien Lallé](https://arxiv.org/search/cs?searchtype=author&query=Lall%C3%A9,+S) (CNRS, LIP6, SU), [Vanda Luengo](https://arxiv.org/search/cs?searchtype=author&query=Luengo,+V) (LIP6, CNRS, SU), [Ali Abou-Hassan](https://arxiv.org/search/cs?searchtype=author&query=Abou-Hassan,+A) (SU, CNRS, PHENIX, IUF)

[View PDF](https://arxiv.org/pdf/2605.20211)

## Submission history

From: Sebastien Lalle \[[view email](https://arxiv.org/show-email/cefdb055/2605.20211)\] \[via CCSD proxy\]

**\[v1\]** Mon, 20 Apr 2026 08:11:43 UTC (2,044 KB)

Similar Articles

Large Vision-Language Models Get Lost in Attention

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

Can Vision Language Models Be Adaptive in Mathematics Education? A Learner Model-based Rubric Study

Beyond English: Uncovering the Multilingual Gap in Vision-Language-Action Models

When Vision Speaks for Sound
