Leveraging Vision-Language Models to Detect Attention in Educational Videos
Summary
This paper explores using a Vision-Language Model (VLM) to detect attention loss in educational videos by combining gaze data with video content, but finds that VLM approaches do not outperform traditional machine learning baselines.
# Leveraging Vision-Language Models to Detect Attention in Educational Videos

Source: [https://arxiv.org/abs/2605.20211](https://arxiv.org/abs/2605.20211)

Authors: [Gabriel Becquet](https://arxiv.org/search/cs?searchtype=author&query=Becquet,+G) (LIP6, CNRS, SU), [Sébastien Lallé](https://arxiv.org/search/cs?searchtype=author&query=Lall%C3%A9,+S) (CNRS, LIP6, SU), [Vanda Luengo](https://arxiv.org/search/cs?searchtype=author&query=Luengo,+V) (LIP6, CNRS, SU), [Ali Abou-Hassan](https://arxiv.org/search/cs?searchtype=author&query=Abou-Hassan,+A) (SU, CNRS, PHENIX, IUF)

[View PDF](https://arxiv.org/pdf/2605.20211)

> Abstract: Educational videos are a cornerstone of remote and blended learning. However, learners' fluctuating attention remains a significant barrier to effective information retention. Prior research has attempted to mitigate this by detecting and reacting to attention loss at runtime using eye tracking. Such detection has so far been based on classical machine learning classifiers trained on engineered features, such as summary statistics over learners' fixations and saccades. These methods have struggled to capture the complex, temporal nature of learner engagement, and thus exhibit only moderate prediction performance. In this study, we aim to advance the detection of attention by shifting from standard engineered features to multimodal foundation models. Using an educational eye-tracking dataset (N = 70), we investigate a novel methodology that uses a Vision-Language Model (VLM) to analyze video content directly with superimposed gaze data. This approach aims to leverage the semantic reasoning capabilities of foundation models to contextualize learner focus within the video stream. We evaluate the performance of this VLM-based approach using several prompting strategies with Gemini 3, but ultimately find that none of them outperforms statistical baselines. Our results provide new insights into the limitations of using VLMs for real-time educational diagnostics.

## Submission history

From: Sebastien Lalle [via CCSD proxy]

**[v1]** Mon, 20 Apr 2026 08:11:43 UTC (2,044 KB)
Similar Articles
Large Vision-Language Models Get Lost in Attention
This research paper analyzes the internal mechanics of Large Vision-Language Models (LVLMs) using information theory, revealing that attention mechanisms may be redundant while Feed-Forward Networks drive semantic innovation. The authors demonstrate that replacing learned attention weights with random values can yield comparable performance, suggesting current models 'get lost in attention'.
VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization
This paper introduces a paradigm where Vision-Language Models (VLMs) act as test-time teachers to guide Video Generation Models (VGMs) via differentiable rewards and LoRA optimization, achieving a 16.7-point average improvement on video reasoning benchmarks.
Can Vision Language Models Be Adaptive in Mathematics Education? A Learner Model-based Rubric Study
This paper proposes a learner model-based rubric to evaluate the adaptivity of Vision Language Models (VLMs) in mathematics education. Experiments show measurable differences in adaptivity across models and reveal that current VLMs struggle to produce consistent learner-adaptive instructional responses.
Beyond English: Uncovering the Multilingual Gap in Vision-Language-Action Models
This paper presents the first systematic study of multilingual instruction following in Vision-Language-Action (VLA) models, revealing significant performance degradation when models trained on English are evaluated on other languages. The authors propose Multilingual Principal Component Alignment (MPCA) to reduce the multilingual performance gap.
When Vision Speaks for Sound
This paper identifies that video-capable multimodal LLMs often appear to understand audio but actually rely on visual cues, a failure mode termed the audio-visual Clever Hans effect. It introduces Thud, an intervention-driven probing framework to diagnose this issue, and proposes an alignment recipe that improves audio-visual consistency by 28 percentage points.