Leveraging Vision-Language Models to Detect Attention in Educational Videos

arXiv cs.AI Papers

Summary

This paper explores using a Vision-Language Model (VLM) to detect attention loss in educational videos by combining gaze data with video content, but finds that VLM approaches do not outperform traditional machine learning baselines.

arXiv:2605.20211v1 Announce Type: cross Abstract: Educational videos are a cornerstone of remote and blended learning. However, learners' fluctuating attention remains a significant barrier to effective information retention. Prior research has attempted to mitigate this by detecting and reacting to attention loss at runtime using eye tracking. Such detection has been based so far on classical machine learning classifiers trained on engineered features, such as summary statistics over learners' fixations and saccades. These methods have struggled to capture the complex, temporal nature of learner engagement, thus exhibiting moderate prediction performance. In this study, we aim to advance the detection of attention by shifting from standard engineered features to multimodal foundation models. Using an educational eye-tracking dataset (N = 70), we investigate a novel methodology that utilizes a Vision-Language Model (VLM) to analyze video content directly with superimposed gaze data. This approach aims to leverage the semantic reasoning capabilities of foundation models to contextualize learner focus within the video stream. We evaluate the performance of this VLM-based approach using several prompting strategies with Gemini 3, but ultimately find that none of them could outperform statistical baselines. Our results provide new insights into the limitations of using VLMs for real-time educational diagnostics.

Cached at: 05/22/26, 08:52 AM

# Leveraging Vision-Language Models to Detect Attention in Educational Videos
Source: [https://arxiv.org/abs/2605.20211](https://arxiv.org/abs/2605.20211)
Authors: [Gabriel Becquet](https://arxiv.org/search/cs?searchtype=author&query=Becquet,+G) (LIP6, CNRS, SU), [Sébastien Lallé](https://arxiv.org/search/cs?searchtype=author&query=Lall%C3%A9,+S) (CNRS, LIP6, SU), [Vanda Luengo](https://arxiv.org/search/cs?searchtype=author&query=Luengo,+V) (LIP6, CNRS, SU), [Ali Abou-Hassan](https://arxiv.org/search/cs?searchtype=author&query=Abou-Hassan,+A) (SU, CNRS, PHENIX, IUF)

[View PDF](https://arxiv.org/pdf/2605.20211)

> Abstract: Educational videos are a cornerstone of remote and blended learning. However, learners' fluctuating attention remains a significant barrier to effective information retention. Prior research has attempted to mitigate this by detecting and reacting to attention loss at runtime using eye tracking. Such detection has been based so far on classical machine learning classifiers trained on engineered features, such as summary statistics over learners' fixations and saccades. These methods have struggled to capture the complex, temporal nature of learner engagement, thus exhibiting moderate prediction performance. In this study, we aim to advance the detection of attention by shifting from standard engineered features to multimodal foundation models. Using an educational eye-tracking dataset (N = 70), we investigate a novel methodology that utilizes a Vision-Language Model (VLM) to analyze video content directly with superimposed gaze data. This approach aims to leverage the semantic reasoning capabilities of foundation models to contextualize learner focus within the video stream. We evaluate the performance of this VLM-based approach using several prompting strategies with Gemini 3, but ultimately find that none of them could outperform statistical baselines. Our results provide new insights into the limitations of using VLMs for real-time educational diagnostics.

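The abstract describes the classical baselines only as classifiers trained on "summary statistics over learners' fixations and saccades". As a rough illustration of what such engineered features can look like, here is a minimal Python sketch. It is not the authors' feature set: the `Fixation` record, its field names and units, the `window_features` helper, and the choice of statistics are all assumptions made for this example, and saccades are approximated by the jump between consecutive fixations.

```python
from dataclasses import dataclass
from math import hypot
from statistics import mean, pstdev


@dataclass(frozen=True)
class Fixation:
    """One fixation; field names and units are illustrative assumptions."""
    start_ms: float
    duration_ms: float
    x: float  # horizontal gaze position in pixels
    y: float  # vertical gaze position in pixels


def window_features(fixations: list[Fixation]) -> dict[str, float]:
    """Summarize one time window of fixations as a flat feature dict."""
    if not fixations:
        # An empty window yields all-zero features (an arbitrary convention).
        return {"fix_count": 0.0, "fix_dur_mean": 0.0, "fix_dur_std": 0.0,
                "sacc_amp_mean": 0.0, "sacc_amp_std": 0.0}
    ordered = sorted(fixations, key=lambda f: f.start_ms)
    durations = [f.duration_ms for f in ordered]
    # Assumption: approximate each saccade amplitude by the Euclidean
    # distance between two consecutive fixation positions.
    amplitudes = [hypot(b.x - a.x, b.y - a.y)
                  for a, b in zip(ordered, ordered[1:])]
    return {
        "fix_count": float(len(ordered)),
        "fix_dur_mean": mean(durations),
        "fix_dur_std": pstdev(durations),
        "sacc_amp_mean": mean(amplitudes) if amplitudes else 0.0,
        "sacc_amp_std": pstdev(amplitudes) if amplitudes else 0.0,
    }
```

A feature dict like this, computed per learner and per time window, is the kind of fixed-length input a classical classifier could be trained on. The VLM approach described in the abstract instead works from the video with gaze superimposed.
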
## Submission history

From: Sebastien Lalle [[view email](https://arxiv.org/show-email/cefdb055/2605.20211)] [via CCSD proxy] **[v1]** Mon, 20 Apr 2026 08:11:43 UTC (2,044 KB)

Similar Articles

Large Vision-Language Models Get Lost in Attention

arXiv cs.AI

This research paper analyzes the internal mechanics of Large Vision-Language Models (LVLMs) using information theory, revealing that attention mechanisms may be redundant while Feed-Forward Networks drive semantic innovation. The authors demonstrate that replacing learned attention weights with random values can yield comparable performance, suggesting current models 'get lost in attention'.

Beyond English: Uncovering the Multilingual Gap in Vision-Language-Action Models

arXiv cs.CL

This paper presents the first systematic study of multilingual instruction following in Vision-Language-Action (VLA) models, revealing significant performance degradation when models trained on English are evaluated on other languages. The authors propose Multilingual Principal Component Alignment (MPCA) to reduce the multilingual performance gap.

When Vision Speaks for Sound

Hugging Face Daily Papers

This paper identifies that video-capable multimodal LLMs often appear to understand audio but actually rely on visual cues, a failure mode termed the audio-visual Clever Hans effect. It introduces Thud, an intervention-driven probing framework to diagnose this issue, and proposes an alignment recipe that improves audio-visual consistency by 28 percentage points.