Towards Visually-Guided Movie Subtitle Translation for Indic Languages


Summary

This paper presents a case study on visually-guided movie subtitle translation for low-resource Indic languages, demonstrating that selective visual grounding improves translation quality while addressing temporal misalignment challenges.

arXiv:2605.11993v1. Abstract: Movie subtitle translation is inherently multimodal, yet text-only systems often miss visual cues needed to convey emotion, action, and social nuance, especially for low-resource Indic languages (English to Hindi, Bengali, Telugu, Tamil and Kannada). We present a case study on five full-length films and compare two lightweight visual grounding strategies: structured attribute summaries from a 5-minute sliding window and free-text summaries of inter-subtitle visual gaps. Our analysis shows that temporal misalignment between subtitles and frames is a major obstacle in long-form video, often rendering indiscriminate visual grounding ineffective. However, oracle selective grounding, which replaces only the lowest-quality 20-30% of baseline segments with visual-enhanced outputs, consistently improves COMET over the text-only baseline while requiring far less visual processing. Of the two approaches, coarse attribute-based visual context summarization is more robust, capturing scene-level emotion and subtle contextual cues that text alone often misses.

# Towards Visually-Guided Movie Subtitle Translation for Indic Languages
Source: [https://arxiv.org/html/2605.11993](https://arxiv.org/html/2605.11993)
Tarun Chintada, Kshetrimayum Boynao Singh, Asif Ekbal. Department of Computer Science and Engineering, Indian Institute of Technology Patna, India. {tarunchintada1, boynfrancis, asif.ekbal}@gmail.com

###### Abstract

Movie subtitle translation is inherently multimodal, yet text-only systems often miss visual cues needed to convey emotion, action, and social nuance, especially for low-resource Indic languages (English to Hindi, Bengali, Telugu, Tamil and Kannada). We present a case study on five full-length films and compare two lightweight visual grounding strategies: structured attribute summaries from a 5-minute sliding window and free-text summaries of inter-subtitle visual gaps. Our analysis shows that temporal misalignment between subtitles and frames is a major obstacle in long-form video, often rendering indiscriminate visual grounding ineffective. However, oracle selective grounding (see footnote 1), which replaces only the lowest-quality 20-30% of baseline segments with visual-enhanced outputs, consistently improves COMET over the text-only baseline while requiring far less visual processing. Of the two approaches, coarse attribute-based visual context summarization is more robust, capturing scene-level emotion and subtle contextual cues that text alone often misses.

Footnote 1: Oracle selective refers to a specialized, often theoretical, method used to achieve the absolute best outcome.

## 1 Introduction

The global demand for movie subtitle translation has grown rapidly with the rise of streaming platforms and international releases. Subtitles must convey meaning under strict spatial and temporal constraints, often compressing conversational speech, idiomatic expressions, and cultural references into short, timed segments. For low-resource Indian languages characterised by rich morphology \[[15](https://arxiv.org/html/2605.11993#bib.bib21)\], diglossia, and sparse parallel corpora, these challenges are magnified, and text-only machine translation (MT) systems frequently produce translations that are either overly literal or contextually inadequate \[[16](https://arxiv.org/html/2605.11993#bib.bib2), [2](https://arxiv.org/html/2605.11993#bib.bib11)\].

![Refer to caption](https://arxiv.org/html/2605.11993v1/Multimodal2.png)

Figure 1: Architecture of the multimodal subtitle translation pipeline.

Films are inherently multimodal: meaning is distributed across dialogue, visual scenes, character actions, and emotional cues. In principle, incorporating visual context could disambiguate references, resolve honorifics, and ground translations in the on-screen situation \[[5](https://arxiv.org/html/2605.11993#bib.bib6)\]. Yet, unlike traditional multimodal MT tasks (e.g., translating image captions), movie subtitle translation presents two distinctive difficulties.

First, most subtitle segments do not rely on visual information. The majority of dialogue is conversational and can be translated accurately from text alone. Visual grounding is beneficial only in a minority of cases: action cues \[[6](https://arxiv.org/html/2605.11993#bib.bib18), [7](https://arxiv.org/html/2605.11993#bib.bib22)\], emotion-driven exchanges \[[14](https://arxiv.org/html/2605.11993#bib.bib25)\], or references to visible objects. For instance, translating the line “He’s coming!” requires knowing whether the threat is a person, animal, or vehicle; such information is visually available. In contrast, a typical exchange like “How was your day?” gains little from the accompanying visuals. This asymmetry makes indiscriminate application of computationally expensive visual processing inefficient and often unnecessary.

Second, visual and textual streams in long-form movies are often misaligned. Subtitles are generated independently of video frames, and cumulative temporal drift can cause a substantial fraction of subtitles to be paired with irrelevant or misleading visuals \[[18](https://arxiv.org/html/2605.11993#bib.bib5)\]. Over a 180-minute film, drift as small as one second per minute accumulates to a three-minute mismatch, affecting a notable portion of subtitle segments. When visual context is misaligned, it ceases to be helpful and can actively degrade translation quality \[[1](https://arxiv.org/html/2605.11993#bib.bib20)\], a phenomenon rarely discussed in the multimodal MT literature \[[12](https://arxiv.org/html/2605.11993#bib.bib19)\].

Motivated by these practical realities, we conduct a case study that systematically compares two summarization-based strategies for integrating visual context into subtitle translation for five Indian languages: Hindi, Bengali, Telugu, Kannada, and Tamil. These languages represent a range of morphological complexity and cultural nuances, making them ideal for studying multimodal subtitle translation in low-resource settings \[[4](https://arxiv.org/html/2605.11993#bib.bib3)\]. The two strategies are:

1. Attribute Visual Context (Attr-VC): Aggregates a 5-minute sliding window of raw visual descriptions and summarizes them using Llama 3.1 into structured attributes (e.g., setting, gender, honorifics, emotional intent).
2. Inter-Chunk Visual Summarization (Inter-VS): Summarizes the visual content occurring between dialogue turns (the gaps) into a free-text description using Llama 3.1.

Using subtitles from five full-length movies spanning diverse genres, we evaluate these methods under realistic conditions. A central finding is that applying visual context indiscriminately often degrades performance due to temporal misalignment. However, an *oracle selective grounding* that replaces the worst 20-30% of baseline segments (by baseline COMET) with visual-enhanced translations consistently improves semantic adequacy (COMET) over the text-only baseline, recovering most of the gain while using only a fraction of the visual processing. Coarse attribute-based summarization proves particularly robust, capturing emotional tone and scene-level cues that text alone cannot convey. The key contributions of this work are:

1. A comparative case study of two visual summarization strategies for low-resource subtitle translation.
2. Identification and quantification of temporal misalignment as a major practical obstacle in long-form multimodal MT.
3. Empirical evidence that coarse attribute summarization is resilient to drift and that selective grounding can recover most of the gain.

Table 1: Visual data statistics for the five movies. Extracted frames are sampled within the time span of each subtitle segment.

## 2 Dataset Preparation and Resources

To evaluate multimodal subtitle translation in realistic settings, we curate a dataset derived from five commercially released movies selected to ensure diversity in genre, narrative style, and visual complexity: Titanic (1997), Skyfall (2012), Oppenheimer (2023), Spider-Man 2 (2004), and Avatar 2 (2022). These films span romance, action, historical drama, science fiction, and superhero genres, providing a wide range of dialogue types and visually grounded scenes.

### 2.1 Movies and Visual Data

For each movie, we extract video frames at 24 fps and align them with subtitle timestamps. Table [1](https://arxiv.org/html/2605.11993#S1.T1) summarizes the visual data, including total duration, total frames, and the number of frames extracted within subtitle time spans (i.e., frames that fall within the time window of a subtitle segment). The extracted frames serve as the visual input for our multimodal methods.
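As a rough illustration of what "frames that fall within the time window of a subtitle segment" could mean in practice, the sketch below maps a subtitle span to frame indices. It assumes a constant frame rate and that frame `i` is shown at time `i / fps`; the function name `frames_in_span` is hypothetical and not taken from the paper.

```python
import math


def frames_in_span(start_s: float, end_s: float, fps: float = 24.0) -> range:
    """Indices of frames whose timestamps lie in [start_s, end_s].

    Assumption: constant frame rate, frame i is displayed at i / fps seconds.
    """
    first = math.ceil(start_s * fps)   # first frame at or after the subtitle start
    last = math.floor(end_s * fps)     # last frame at or before the subtitle end
    return range(first, last + 1)      # empty when the span contains no frame


# A subtitle shown from 12.5 s to 14.0 s at 24 fps covers frames 300..336.
span = frames_in_span(12.5, 14.0)
print(span[0], span[-1], len(span))
```

Any real pipeline would also need to handle variable frame rates and the subtitle drift discussed in the introduction, which this sketch ignores.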

### 2.2 Subtitle Corpora

We extract subtitles from publicly available sources (http://subtitlecat.com) and temporally align them with the corresponding video segments. All subtitle pairs are preprocessed to remove noise, normalize punctuation, and filter excessively long or short segments. Table [2](https://arxiv.org/html/2605.11993#S2.T2) provides detailed statistics for each movie, including the number of subtitle pairs (English source and target language), average English source length in words, and average English source length in characters. Parallel subtitles are available for Hindi (all movies except Avatar 2), Bengali and Telugu (all movies), Kannada (all movies except Spider-Man 2), and Tamil (all movies except Titanic). This selection ensures coverage of linguistically diverse Indian languages while respecting the availability of high-quality parallel subtitles. All movies were legally purchased as DVDs. Frame extraction for research constitutes fair use and follows standard practice in video-language benchmarks.

Table 2: Subtitle statistics per movie. Total pairs represent the number of English subtitle segments.

### 2.3 Data Release

To foster reproducibility and further research, we will release the curated movie\-subtitle\-visual alignment data for all five languages under a fair‑use educational/research license\. The release includes the English source, reference translations, and extracted visual descriptions\. The code and instructions for reproducing the experiments will also be made publicly available\.

## 3 Methodology

Our methodology is designed for real-world applicability: all models are used off-the-shelf in a zero-shot setting, no fine-tuning or training is performed, and the pipeline is fully reproducible. We use Qwen-2.5-7B-Instruct \[[11](https://arxiv.org/html/2605.11993#bib.bib13)\] as the translation model, Llama-3.1-8B-Instruct \[[8](https://arxiv.org/html/2605.11993#bib.bib17)\] for summarization \[[3](https://arxiv.org/html/2605.11993#bib.bib24)\], and Apple FastVLM-0.5B \[[17](https://arxiv.org/html/2605.11993#bib.bib16)\] for visual description extraction. The pipeline is depicted in Figure [1](https://arxiv.org/html/2605.11993#S1.F1). For the text-only baseline, Qwen is prompted with the English source only. This yields the text-only translation against which visual-enhanced methods are compared.

### 3.1 Visual Context Generation

From each movie, we sample frames at 1\-fps and obtain raw textual descriptions using FastVLM\-0\.5B\. These descriptions are then summarized by Llama\-3\.1 into two distinct forms:

#### Attribute Visual Context \(Attr‑VC\)

A 5-minute sliding window (centered on the subtitle start) is aggregated and summarized into structured attributes: \[SETTING\], \[GENDER\], \[RELATION\], \[HONORIFIC\], and \[SUMMARY\]. This yields a coarse, high-level scene description.
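The window selection step might look like the sketch below. It assumes the 1-fps descriptions are stored in a list where index `i` holds the description for second `i`, and that the window is clipped at the start and end of the film; both the storage layout and the name `window_descriptions` are illustrative assumptions, and the summarization call itself is omitted.

```python
def window_descriptions(descriptions, start_s, window_s=300):
    """Per-second descriptions inside a window centered on the subtitle start.

    Assumption: descriptions[i] is the raw description of the frame at second i.
    """
    half = window_s // 2
    lo = max(0, int(start_s) - half)                       # clip at film start
    hi = min(len(descriptions), int(start_s) + half + 1)   # clip at film end
    return descriptions[lo:hi]


# Hypothetical 1000-second clip with placeholder descriptions.
descs = [f"d{i}" for i in range(1000)]
ctx = window_descriptions(descs, start_s=400.7)
print(ctx[0], ctx[-1], len(ctx))  # d250 d550 301
```

Because the window spans minutes rather than seconds, a subtitle offset of a few seconds changes only a small fraction of the selected descriptions.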

#### Inter‑Chunk Visual Summarization \(Inter\-VS\)

The raw descriptions that fall between the end of the previous subtitle and the start of the current subtitle (the visual gap) are summarized into a free-text description. This captures visual events that occur between dialogue turns. The full prompts used for these summarization tasks are provided in Table [6](https://arxiv.org/html/2605.11993#A3.T6). Each summary is concatenated with the English source using the same prompt template, which instructs the model to ground its translation in the visual context.
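A minimal sketch of the gap selection, under the same assumption that index `i` of a list holds the 1-fps description for second `i`. The helper name `gap_descriptions` is hypothetical, and the handling of overlapping subtitles (an empty gap) is a choice made here, not something the paper specifies.

```python
import math


def gap_descriptions(descriptions, prev_end_s, cur_start_s):
    """Descriptions of frames between the previous subtitle end and the current start.

    Assumption: descriptions[i] is the raw description of the frame at second i.
    Returns an empty list when the two subtitles touch or overlap.
    """
    lo = max(0, math.ceil(prev_end_s))                             # first frame after the gap opens
    hi = min(len(descriptions), math.floor(cur_start_s) + 1)       # last frame before the gap closes
    return descriptions[lo:hi]


descs = [f"d{i}" for i in range(100)]
print(gap_descriptions(descs, prev_end_s=10.2, cur_start_s=13.9))  # ['d11', 'd12', 'd13']
print(gap_descriptions(descs, prev_end_s=20.5, cur_start_s=20.1))  # []
```

The gap is typically only a few seconds wide, so a timing offset of similar size can shift it onto entirely different frames.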

### 3.2 Oracle Selective Grounding

To estimate the upper bound of selective visual grounding, we compute per-segment COMET scores for the baseline translations against the reference. We then replace the worst $k\%$ of segments (by baseline COMET) with the corresponding visual-enhanced translation (from either Attr-VC or Inter-VS). We experiment with $k = 20\%$ and $30\%$. This oracle analysis shows the potential improvement if one could perfectly identify low-quality baseline segments; it does not require any training and represents an upper bound for practical quality-estimation systems.
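The selection rule can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `oracle_select`, the integer-percent parameter `k_percent`, and floor rounding of the number of replaced segments are assumptions, and the per-segment baseline COMET scores are taken as precomputed inputs.

```python
def oracle_select(baseline, enhanced, baseline_scores, k_percent=30):
    """Replace the worst k% of baseline segments with visual-enhanced ones.

    baseline, enhanced: translations of the same segments, in the same order.
    baseline_scores: per-segment baseline COMET (lower means worse).
    """
    n = len(baseline)
    if len(enhanced) != n or len(baseline_scores) != n:
        raise ValueError("all inputs must have one entry per segment")
    n_replace = n * k_percent // 100  # assumption: round the count down
    # Indices of the n_replace lowest-scoring baseline segments.
    worst = set(sorted(range(n), key=lambda i: baseline_scores[i])[:n_replace])
    return [enhanced[i] if i in worst else baseline[i] for i in range(n)]


base = [f"b{i}" for i in range(10)]
vis = [f"v{i}" for i in range(10)]
scores = [0.9, 0.2, 0.8, 0.1, 0.7, 0.6, 0.5, 0.4, 0.3, 0.95]
print(oracle_select(base, vis, scores, k_percent=30))
# ['b0', 'v1', 'b2', 'v3', 'b4', 'b5', 'b6', 'b7', 'v8', 'b9']
```

Because the ranking uses reference-based COMET, this procedure needs reference translations and so cannot run at deployment time; it only indicates what a reference-free quality estimator might achieve.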

## 4 Evaluation Results

### 4.1 Evaluation Setup

We evaluate on the full curated test set (all aligned subtitle segments) using corpus-level BLEU \[[9](https://arxiv.org/html/2605.11993#bib.bib1)\], chrF++ \[[10](https://arxiv.org/html/2605.11993#bib.bib14)\], and COMET \[[13](https://arxiv.org/html/2605.11993#bib.bib15)\].

### 4.2 Results and Analysis

We compare the two visual summarisation strategies, *Attribute Visual Context (Attr-VC)* and *Inter-Chunk Visual Summarization (Inter-VS)*, against a text-only baseline. We perform experiments with five movies in five Indian languages (Hindi, Bengali, Telugu, Tamil, Kannada) using Qwen-2.5-7B.

Full per-movie, per-language results are shown in Table [3](https://arxiv.org/html/2605.11993#S4.T3). Table [4](https://arxiv.org/html/2605.11993#S4.T4) summarizes the language-wise COMET improvements.

Column groups: Base = Baseline; Attr = 5-Minute Slide Visual Attribute; Inter = Inter-Chunk Visual Summarisation; VE = Visual-Enhanced; OS = Oracle Selective.

| Movie | Lang | Base BLEU | Base chrF++ | Base COMET | Attr VE BLEU | Attr VE chrF++ | Attr VE COMET | Attr OS BLEU | Attr OS chrF++ | Attr OS COMET | Inter VE BLEU | Inter VE chrF++ | Inter VE COMET | Inter OS BLEU | Inter OS chrF++ | Inter OS COMET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Avatar | Ben | 5.68 | 28.71 | 0.6298 | 6.95 | 27.95 | 0.7014 | 6.92 | 29.98 | 0.6829 | 8.10 | 28.24 | 0.7137 | 6.91 | 29.80 | 0.6865 |
| Avatar | Tel | 4.30 | 19.66 | 0.5257 | 3.28 | 18.23 | 0.5154 | 4.38 | 19.79 | 0.5390 | 3.67 | 18.32 | 0.5153 | 4.67 | 19.85 | 0.5390 |
| Avatar | Tam | 3.85 | 23.49 | 0.5352 | 4.08 | 22.94 | 0.5545 | 4.24 | 24.36 | 0.5580 | 4.57 | 23.62 | 0.5613 | 4.33 | 24.50 | 0.5639 |
| Avatar | Kan | 3.50 | 18.94 | 0.4857 | 2.34 | 15.20 | 0.4582 | 3.39 | 18.65 | 0.4933 | 2.23 | 15.28 | 0.4612 | 3.33 | 18.37 | 0.4946 |
| Oppenh. | Ben | 8.05 | 29.38 | 0.7026 | 5.47 | 25.09 | 0.6735 | 8.03 | 29.41 | 0.7248 | 6.37 | 26.26 | 0.6858 | 8.08 | 29.50 | 0.7237 |
| Oppenh. | Hin | 11.76 | 31.64 | 0.6467 | 8.62 | 27.28 | 0.5972 | 11.83 | 31.74 | 0.6642 | 9.62 | 28.72 | 0.6297 | 11.86 | 32.06 | 0.6690 |
| Oppenh. | Tel | 4.04 | 18.86 | 0.5475 | 3.29 | 17.77 | 0.5387 | 3.89 | 19.13 | 0.5654 | 3.53 | 18.05 | 0.5379 | 3.96 | 19.21 | 0.5647 |
| Oppenh. | Tam | 3.15 | 20.60 | 0.5366 | 3.15 | 21.37 | 0.5654 | 3.36 | 21.85 | 0.5630 | 3.47 | 21.63 | 0.5690 | 3.66 | 22.18 | 0.5715 |
| Oppenh. | Kan | 2.95 | 16.81 | 0.4938 | 2.23 | 14.63 | 0.4740 | 2.96 | 17.20 | 0.5066 | 2.30 | 14.25 | 0.4735 | 2.93 | 17.05 | 0.5090 |
| Skyfall | Ben | 6.31 | 27.30 | 0.6914 | 4.10 | 23.51 | 0.6588 | 6.04 | 27.39 | 0.7098 | 4.55 | 23.68 | 0.6612 | 5.93 | 27.14 | 0.7056 |
| Skyfall | Hin | 6.31 | 25.65 | 0.6026 | 5.74 | 25.28 | 0.5882 | 6.53 | 26.68 | 0.6258 | 6.41 | 25.86 | 0.6098 | 6.65 | 26.47 | 0.6245 |
| Skyfall | Tel | 2.47 | 17.68 | 0.5288 | 2.13 | 16.66 | 0.5248 | 2.24 | 18.00 | 0.5478 | 1.41 | 16.84 | 0.5157 | 2.16 | 17.86 | 0.5454 |
| Skyfall | Tam | 2.33 | 21.09 | 0.5350 | 2.22 | 21.11 | 0.5581 | 2.59 | 21.78 | 0.5581 | 1.83 | 20.89 | 0.5595 | 2.66 | 22.02 | 0.5639 |
| Skyfall | Kan | 1.59 | 16.88 | 0.4920 | 1.76 | 14.38 | 0.4668 | 1.59 | 16.90 | 0.5013 | 1.54 | 14.36 | 0.4682 | 1.58 | 16.80 | 0.5038 |
| Spider2 | Ben | 9.58 | 26.81 | 0.7190 | 6.69 | 24.44 | 0.6902 | 9.10 | 26.97 | 0.7359 | 8.55 | 25.77 | 0.7021 | 9.47 | 27.18 | 0.7350 |
| Spider2 | Hin | 12.33 | 29.15 | 0.6459 | 10.13 | 27.71 | 0.6286 | 12.61 | 30.20 | 0.6746 | 12.19 | 29.17 | 0.6532 | 12.93 | 30.32 | 0.6786 |
| Spider2 | Tel | 5.22 | 18.47 | 0.5407 | 3.57 | 17.69 | 0.5349 | 5.04 | 18.84 | 0.5567 | 3.40 | 17.73 | 0.5342 | 5.04 | 18.74 | 0.5569 |
| Spider2 | Tam | 4.12 | 21.65 | 0.5448 | 3.27 | 21.39 | 0.5601 | 4.29 | 22.73 | 0.5684 | 4.01 | 22.09 | 0.5694 | 4.32 | 22.83 | 0.5761 |
| Titanic | Ben | 9.59 | 25.87 | 0.6960 | 7.04 | 22.97 | 0.6616 | 9.55 | 26.11 | 0.7130 | 8.65 | 24.97 | 0.6849 | 9.82 | 26.41 | 0.7150 |
| Titanic | Hin | 11.98 | 26.59 | 0.6152 | 9.12 | 24.45 | 0.5711 | 11.92 | 27.12 | 0.6321 | 12.29 | 26.90 | 0.6176 | 12.66 | 27.62 | 0.6367 |
| Titanic | Tel | 5.03 | 17.82 | 0.5350 | 3.85 | 16.99 | 0.5211 | 4.92 | 18.01 | 0.5481 | 4.32 | 17.52 | 0.5296 | 5.11 | 18.21 | 0.5494 |
| Titanic | Kan | 4.95 | 17.16 | 0.4950 | 3.11 | 14.04 | 0.4670 | 4.73 | 16.97 | 0.5037 | 3.14 | 14.45 | 0.4665 | 4.86 | 17.03 | 0.5049 |

Table 3: Comparison of two visual summarization strategies. *5-Minute Slide Visual Attribute* aggregates a 5-minute sliding window into structured attributes (setting, gender, honorifics, emotion); *Inter-Chunk Visual Summarization* summarizes the visual content between dialogue turns. For each method, we report *Visual-Enhanced* (using the full visual context for all segments) and *Oracle Selective* (replacing the worst 30% of baseline segments by baseline COMET with the visual-enhanced translation). This oracle shows the upper bound of selective grounding. Metrics are corpus-level BLEU, chrF++, and COMET. Bold indicates the higher score between the two methods for the same condition (Visual-Enhanced or Oracle Selective).

Table 4: Language-wise average COMET improvement ($\Delta$) over baseline for each method and condition. Positive values indicate improvement. Oracle Selective replaces the worst 30% of baseline segments by baseline COMET.

![Refer to caption](https://arxiv.org/html/2605.11993v1/heatmap_genre.png)

Figure 2: COMET gain from Oracle Selective Grounding (30%) by movie and language. Gain is computed as the difference between the oracle selective translation (replacing the worst 30% of baseline segments by baseline COMET) and the text-only baseline, using the better of the two visual summarisation methods for each pair. Movie names are followed by their genre in parentheses. Darker shades indicate larger gains.

### 4.3 Overall Observations

Both summarisation methods show that applying visual context indiscriminately often degrades performance compared to the text-only baseline. For example, in several movie-language pairs (e.g., Oppenheimer Hindi, Skyfall Bengali), the COMET score of the full visual-enhanced translation (full VT) is lower than the baseline. We attribute this to temporal misalignment: when visual frames do not match the spoken dialogue, the model is misled. However, oracle selective grounding improves COMET over the baseline in almost all cases, recovering most of the potential gain while using only 30% of the visual processing. Figure [2](https://arxiv.org/html/2605.11993#S4.F2) visualises the per-movie, per-language COMET gain from oracle selective grounding, highlighting that action-rich movies (e.g., Skyfall) show larger improvements. Human evaluation on a small set (30 examples each for Telugu and Hindi) confirmed that oracle selective grounding significantly improves adequacy: the average score increased from 2.9 (baseline) to 4.1 (selective) on a 1-5 scale.

### 4.4 Comparison of Summarization Methods

Attr-VC, which aggregates a 5-minute window into coarse attributes, proves more robust to drift than Inter-VS. In many cases (e.g., Oppenheimer Telugu, Titanic Hindi), its oracle selective (30%) COMET surpasses that of the full visual-enhanced translation (full VT) of Inter-VS. The attribute-based representation, by ignoring precise timing, effectively filters out irrelevant frames. Inter-VS, while capturing finer-grained visual events, is more sensitive to misalignment; its full VT often underperforms the baseline, but selective grounding still yields gains.

![Refer to caption](https://arxiv.org/html/2605.11993v1/comet_language_comparison.png)

Figure 3: Language-wise COMET scores for the two visual summarization methods. For each method, bars show the baseline (text-only), full visual-enhanced translation, and oracle selective grounding (replacing the worst 30% of baseline segments by baseline COMET). Oracle selective grounding consistently improves COMET over the baseline across all languages, with relative gains of 2-5%.

### 4.5 Language-Wise Trends

Table [4](https://arxiv.org/html/2605.11993#S4.T4) aggregates COMET improvement over the baseline for each language. For Attr-VC selective, all languages show positive gains (range +2.3% to +5.9%). The gains are larger for morphologically rich languages (Bengali, Tamil, Kannada), where visual cues (e.g., honorifics, emotional tone) help resolve pragmatic ambiguities. Inter-VS selective also yields consistent improvements, though slightly lower for some languages. The COMET improvements across languages are further illustrated in Figure [3](https://arxiv.org/html/2605.11993#S4.F3).
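For readers who want to relate the percentages to raw COMET values, the sketch below computes a relative gain for one movie-language pair. It assumes the reported percentages are relative to the baseline score, which the text suggests but does not state as a formula; the function name `relative_gain_pct` is hypothetical, and the example is not claimed to reproduce Table 4, which averages over movies.

```python
def relative_gain_pct(baseline: float, system: float) -> float:
    """Relative COMET gain over the baseline, in percent (assumed definition)."""
    return 100.0 * (system - baseline) / baseline


# Oppenheimer Hindi from Table 3: baseline COMET 0.6467,
# Attr-VC oracle selective COMET 0.6642.
gain = relative_gain_pct(0.6467, 0.6642)
print(f"{gain:.2f}%")  # about 2.71%
```

A language-wise figure would then be the mean of such per-movie gains, if the averaging is done over relative rather than absolute differences.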

### 4.6 Summary of Key Findings

Our case study yields three actionable insights:

1. Coarse attribute-based summarization is robust to temporal drift. By aggregating visual information over a 5-minute window into structured attributes, Attr-VC achieves pragmatic gains without being misled by misaligned frames.
2. Selective visual grounding can recover most of the gain. An oracle that replaces only the worst 20-30% of baseline segments with visual-enhanced translations consistently improves COMET over baseline, using a fraction of the visual processing.
3. Alignment quality often outweighs architectural complexity. Fine-grained summarization methods that rely on precise temporal alignment are fragile. For real-world deployment, drift-tolerant architectures are preferred.

## 5 Discussion

Our case study reveals a central tension: while visual context can provide critical grounding for a small subset of segments, it is irrelevant or even harmful for the majority\. This asymmetry, combined with temporal misalignment, shapes the relative performance of the two summarization strategies and the effectiveness of selective grounding\.

### 5.1 When Visual Context Helps and When It Does Not

Only a minority of subtitle segments truly depend on visual information. Action cues (“He’s coming!”), emotion-driven exchanges (“I’m so sorry”), and references to on-screen objects (“That one”) require visual grounding to resolve ambiguity. For the vast majority of conversational dialogue, text alone is sufficient, and visual context adds no benefit. In our dataset, we estimate that fewer than 15% of subtitles are visually grounded in this sense. This explains why even the best-performing method (oracle selective grounding) achieves only a modest overall COMET gain (2-5%) while replacing only 20-30% of segments.

### 5.2 Why Attribute Summarization Outperforms Gap Summarization

Attr‑VC aggregates a 5‑minute sliding window into high‑level attributes\. This coarse representation is rarely misleading for neutral segments and provides valuable pragmatic context for the minority that need it\. Moreover, it is inherently robust to misalignment because it aggregates over longer time windows, effectively ignoring irrelevant frames\. Inter\-VS summarizes the visual content between dialogue turns\. This finer‑grained representation captures visual events that may be directly relevant, but it is more sensitive to drift\. When visual frames are misaligned, Inter\-VS can introduce misleading information, causing its full visual‑enhanced translations to sometimes underperform the baseline\. However, when applied selectively \(only to the worst baseline segments\), Inter\-VS still yields substantial gains on the replaced set\.

### 5.3 The Role of Oracle Selective Grounding

Our oracle analysis replaces the worst 20\-30% of baseline segments \(by baseline COMET\) with the corresponding visual‑enhanced translation\. This shows the upper bound of what could be achieved with a perfect quality‑estimation model\. For both methods, selective grounding consistently lifts COMET above the baseline, often matching or even exceeding the full visual‑enhanced performance of the other method\.

### 5.4 Insights for Low-Resource Indian Languages

For morphologically rich languages such as Bengali and Kannada, the small subset of visually grounded segments often includes honorifics, implicit referents, or emotional tone, areas where text-only models are weakest. Coarse attribute summarisation helps in these cases without harming the rest, yielding clear COMET gains in the selective setting (e.g., +3.7% for Bengali, +2.3% for Kannada). This suggests that for low-resource settings, investing in reliable, drift-tolerant visual abstractions is more practical than pursuing fine-grained fusion or high-frequency visual processing.

### 5.5 Practical Implications for Movie Localisation

Our findings lead to three actionable recommendations for deploying multimodal subtitle translation:

1. Be selective: Not every subtitle needs visual context. An oracle study shows that replacing only the worst 20-30% of baseline segments can recover most of the gain. A practical system would use a quality-estimation model (e.g., based on sentence length, emotion words, or visual confidence) to trigger visual enhancement only when needed.
2. Prefer robust architectures: Attribute-based summarization (5-minute sliding window condensed into structured cues) offers a lightweight, drift-tolerant solution suitable for production pipelines.
3. Fix alignment before using fine-grained context: If a more detailed visual context is desired (e.g., gap-based summarization), pre-processing steps such as dynamic time warping (DTW) or audio-visual synchronisation are essential to avoid performance degradation.
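For reference, the core of dynamic time warping fits in a few lines. The sketch below is the textbook dynamic-programming formulation with an absolute-difference local cost, not a procedure taken from this study; which two sequences one would align (for example, subtitle onset times against detected speech onsets) is left open here and would be a design decision of the alignment step.

```python
def dtw_cost(a, b):
    """Classic DTW alignment cost between two numeric sequences.

    Uses |x - y| as the local distance and allows match, insertion,
    and deletion steps. Runs in O(len(a) * len(b)) time and memory.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats a match with b[j-1]
                                 cost[i][j - 1],      # b[j-1] repeats a match with a[i-1]
                                 cost[i - 1][j - 1])  # both sequences advance
    return cost[n][m]


print(dtw_cost([1, 2, 3], [1, 2, 2, 3]))  # 0.0: the repeated 2 is absorbed by warping
print(dtw_cost([0, 0], [1, 1]))           # 2.0
```

Recovering the warping path (rather than only the cost) requires backtracking through the same table, which is what an alignment step would actually use to re-time subtitles.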

## 6 Conclusion

In this paper, we compare two summarization strategies for integrating visual context into subtitle translation for five low-resource Indian languages. We find that temporal misalignment, a common real-world issue, often causes full visual-enhanced translation to underperform the text-only baseline. However, an oracle selective grounding that replaces only the worst 20-30% of baseline segments with visual-enhanced translations consistently improves semantic adequacy (COMET) across all the languages, recovering most of the potential gain while using a fraction of the visual processing. Coarse attribute-based summarization proves particularly robust to drift, capturing emotional tone and scene-level cues that text alone cannot convey. Our results underscore that alignment quality often outweighs architectural complexity and that selective visual grounding offers a practical path to efficient, deployable multimodal subtitle translation.

## Limitations

Our study is limited to five movies and five languages; the proportion of visually grounded segments may vary by genre and across different types of audiovisual content\. The oracle selective grounding demonstrates an upper bound; future work should develop automatic quality‑estimation models that can identify low‑quality baseline segments without reference translations, enabling practical selective grounding\. Human evaluation of pragmatic adequacy \(e\.g\., honorifics, emotional tone\) would complement automatic metrics to better capture the subtle benefits of selective visual grounding\. Additionally, exploring more sophisticated alignment techniques \(e\.g\., dynamic time warping\) could further reduce temporal misalignment and improve the robustness of fine‑grained visual context\.

## Acknowledgements

The authors would like to express their sincere gratitude to the project, Centre of Indian Language Data \(COIL\-D\) under Bhashini, funded by the Ministry of Electronics and Information Technology \(MeitY\), Government of India for its generous support\.

## References

- \[1\] R. Appicharla, B. Gain, S. Pal, A. Ekbal, and P. Bhattacharyya (2024-06). A case study on context-aware neural machine translation with multi-task learning. In Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1), Sheffield, UK, pp. 246–257. [Link](https://aclanthology.org/2024.eamt-1.21/)
- \[2\] M. Artetxe, S. Ruder, and D. Yogatama (2020-07). On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault (Eds.), Online, pp. 4623–4637. [Link](https://aclanthology.org/2020.acl-main.421/), [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.421)
- \[3\] D. Datta, S. Paul, K. B. Singh, S. Kumar, A. Joshi, S. Mishra, S. Jain, A. Ekbal, P. Goyal, A. Modi, and S. Ghosh (2025-12). Findings of the JUST-NLP 2025 shared task on summarization of Indian court judgments. In Proceedings of the 1st Workshop on NLP for Empowering Justice (JUST-NLP 2025), Mumbai, India, pp. 5–11. [Link](https://aclanthology.org/2025.justnlp-main.2/), [Document](https://dx.doi.org/10.18653/v1/2025.justnlp-main.2), ISBN 979-8-89176-312-8
- \[4\] D. M. Eberhard, G. F. Simons, and C. D. Fennig (2025). Ethnologue: languages of the world. 28th edition, SIL International, Dallas, Texas. [Link](http://www.ethnologue.com/)
- \[5\] D. Elliott, S. Frank, K. Sima’an, and L. Specia (2016). Multi30K: multilingual English-German image descriptions. In Proceedings of the 5th Workshop on Vision and Language, pp. 70–74.
- \[6\] B. Gain, D. Bandyopadhyay, S. Mukherjee, C. Adak, and A. Ekbal (2025-08). Impact of visual context on noisy multimodal NMT: an empirical study for English to Indian languages. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 24(8). ISSN 2375-4699, [Link](https://doi.org/10.1145/3748311), [Document](https://dx.doi.org/10.1145/3748311)
- \[7\] D. Kumar, B. Gain, K. B. Singh, and A. Ekbal (2025-12). Does vision still help? Multimodal translation with CLIP-based image selection. In Proceedings of the Twelfth Workshop on Asian Translation (WAT 2025), T. Nakazawa and I. Goto (Eds.), Mumbai, India, pp. 115–123. [Link](https://aclanthology.org/2025.wat-1.12/), [Document](https://dx.doi.org/10.18653/v1/2025.wat-1.12), ISBN 979-8-89176-309-8
- \[8\] Meta (2024-07). Llama 3.1: the Llama 3.1 collection of multilingual large language models. Note: https://huggingface.co/meta-llama/Llama-3.1-8B. Model release date: July 23, 2024. Accessed: 2026-03-27.
- \[9\] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002-07). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, P. Isabelle, E. Charniak, and D. Lin (Eds.), Philadelphia, Pennsylvania, USA, pp. 311–318. [Link](https://aclanthology.org/P02-1040/), [Document](https://dx.doi.org/10.3115/1073083.1073135)
- \[10\] M. Popović (2015-09). ChrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, O. Bojar, R. Chatterjee, C. Federmann, B. Haddow, C. Hokamp, M. Huck, V. Logacheva, and P. Pecina (Eds.), Lisbon, Portugal, pp. 392–395. [Link](https://aclanthology.org/W15-3049/), [Document](https://dx.doi.org/10.18653/v1/W15-3049)
- \[11\] Qwen: A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu (2025). Qwen2.5 technical report.
- \[12\] A. Radinger (2025). Subtitling in audiovisual translation studies. In Researching Subtitling Processes, pp. 29–53.
- \[13\] R. Rei, C. Stewart, A. C. Farinha, and A. Lavie (2020-11). COMET: a neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), B. Webber, T. Cohn, Y. He, and Y. Liu (Eds.), Online, pp. 2685–2702. [Link](https://aclanthology.org/2020.emnlp-main.213/), [Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.213)
- \[14\] Z. Shen, Y. Pang, Y. Rao, and J. Yu (2025-07). CoE: a clue of emotion framework for emotion recognition in conversations. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 23548–23563. [Link](https://aclanthology.org/2025.acl-long.1148/), [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1148), ISBN 979-8-89176-251-0
- \[15\] K. B. Singh, A. Ekbal, and P. Pakray (2025-12). Evaluating IndicTrans2 and ByT5 for English–Santali machine translation using the Ol Chiki script. In Proceedings of the 1st Workshop on Multimodal Models for Low-Resource Contexts and Social Impact (MMLoSo 2025), A. Shukla, S. Kumar, A. S. Bedi, and T. Chakraborty (Eds.), Mumbai, India, pp. 95–100. [Link](https://aclanthology.org/2025.mmloso-1.9/), ISBN 979-8-89176-311-1
- \[16\]K\. B\. Singh, D\. Kumar, and A\. Ekbal\(2025\-11\)Evaluation of LLM for English to Hindi legal domain machine translation systems\.InProceedings of the Tenth Conference on Machine Translation,B\. Haddow, T\. Kocmi, P\. Koehn, and C\. Monz \(Eds\.\),Suzhou, China,pp\. 823–833\.External Links:[Link](https://aclanthology.org/2025.wmt-1.57/),[Document](https://dx.doi.org/10.18653/v1/2025.wmt-1.57),ISBN 979\-8\-89176\-341\-8Cited by:[§1](https://arxiv.org/html/2605.11993#S1.p1.1)\.
- \[17\]P\. K\. A\. Vasu, F\. Faghri, C\. Li, C\. Koc, N\. True, A\. Antony, G\. Santhanam, J\. Gabriel, P\. Grasch, O\. Tuzel, and H\. Pouransari\(2025\)FastVLM: efficient vision encoding for vision language models\.External Links:2412\.13303,[Link](https://arxiv.org/abs/2412.13303)Cited by:[§3](https://arxiv.org/html/2605.11993#S3.p1.1)\.
- \[18\]Y\. Wang and Y\. Zhao\(2024\-08\)TRAM: benchmarking temporal reasoning for large language models\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 6389–6415\.External Links:[Link](https://aclanthology.org/2024.findings-acl.382/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.382)Cited by:[§1](https://arxiv.org/html/2605.11993#S1.p4.1)\.

## Appendix A Appendix

### A\.1 Ethical Considerations

Multimodal systems may inherit and amplify biases present in movies, where visual cues can reinforce stereotypes\. Temporal misalignment between audio and visuals may also lead to inappropriate translations of culturally sensitive content\. For Indian languages, honorifics and regional variations require careful handling to ensure accuracy\. We advocate for human oversight, transparent reporting, and evaluation frameworks that assess cultural appropriateness\. As selective grounding shows that only some segments need visual context, applying visual enhancement through a bias\-aware filter can help reduce potential harms\.

### A\.2 Visual Feature Extraction

1. The original \.mkv video is compressed to ≈1 GB and converted to \.mp4\.
2. Frames are extracted at 1 fps and passed to Apple FastVLM\-0\.5B to obtain a raw textual description per frame\.
3. Raw descriptions are cleaned \(removing repeated phrases, normalising punctuation\) to produce a final description per frame\.

For each subtitle, the visual context is formed by concatenating the cleaned descriptions of all frames whose timestamps fall within the subtitle’s time window\. For the gap‑based method, the window is instead the interval between the previous subtitle’s end and the current subtitle’s start\. When multiple frames belong to the same segment, their descriptions are aggregated into a single summary\.
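The frame-to-subtitle assignment described above can be sketched as follows. This is a minimal illustration, not the released implementation: the `Subtitle` class, the function names, and the half-open window boundaries (`start <= t < end`) are assumptions, and the later summarization step is not shown; the sketch only concatenates cleaned per-frame descriptions.

```python
from dataclasses import dataclass


@dataclass
class Subtitle:
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str


def frames_in_window(frame_descs, start, end):
    """Return descriptions of frames whose timestamp t satisfies start <= t < end.

    frame_descs[i] is the cleaned description of the frame sampled at second i
    (1 fps extraction). The half-open boundary is an assumption of this sketch.
    """
    return [desc for t, desc in enumerate(frame_descs) if start <= t < end]


def subtitle_context(frame_descs, sub):
    """Visual context taken from the subtitle's own time window."""
    return " ".join(frames_in_window(frame_descs, sub.start, sub.end))


def gap_context(frame_descs, prev_sub, sub):
    """Visual context taken from the gap between two consecutive subtitles."""
    return " ".join(frames_in_window(frame_descs, prev_sub.end, sub.start))


# Toy data: ten frames and two subtitles separated by a visual gap.
frame_descs = [f"f{i}" for i in range(10)]
first = Subtitle(2.0, 4.5, "Where are you going?")
second = Subtitle(7.2, 9.0, "Home.")
own_window = subtitle_context(frame_descs, first)
gap_window = gap_context(frame_descs, first, second)
```

With 1 fps sampling, a short subtitle may match only one or two frames, which is one reason small timestamp drift can change the visual context entirely.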

### A\.3 Context\-Aware Translation Pipeline

The translation pipeline uses the same Qwen\-2\.5\-7B\-Instruct model for both baseline and visual‑enhanced translations\. The only difference is the input prompt\. For the baseline, the model receives only the English source\. For the visual‑enhanced setting, we prepend the summarised visual context \(Attr‑VC or Inter\-VS\) to the source using the visual‑enhanced prompt template shown in Table[5](https://arxiv.org/html/2605.11993#A3.T5)\. The prompt instructs the model to ground its translation in the visual scene, paying attention to gender, honorifics, and emotional tone\. The same greedy decoding parameters described in Appendix A\.4 \(Hyperparameters\) are used for all runs\.

### A\.4 Hyperparameters

All inference runs use greedy decoding to ensure reproducibility\.

For the translation model \(Qwen\-2\.5\-7B\-Instruct\):

- max\_new\_tokens = 100
- do\_sample = False
- repetition\_penalty = 1\.1
- temperature and top‑p left at their defaults \(1\.0 and 1\.0\); with do\_sample = False they have no effect, so decoding is greedy\.

For the summarization model \(Llama\-3\.1\-8B\-Instruct\):

- max\_new\_tokens = 256
- do\_sample = False
- temperature and top‑p at default \(1\.0 and 1\.0\)\.
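The two decoding configurations above can be written as keyword-argument dictionaries of the kind passed to a Hugging Face `generate` call. This is a sketch for illustration: the dictionary names and the `is_greedy` helper are hypothetical, and model loading is omitted.

```python
# Decoding settings for the translation model (Qwen-2.5-7B-Instruct).
TRANSLATION_GEN_KWARGS = {
    "max_new_tokens": 100,
    "do_sample": False,          # disables sampling, so temperature/top-p are unused
    "repetition_penalty": 1.1,
}

# Decoding settings for the summarization model (Llama-3.1-8B-Instruct).
SUMMARIZATION_GEN_KWARGS = {
    "max_new_tokens": 256,
    "do_sample": False,
}


def is_greedy(gen_kwargs):
    """Greedy decoding means no sampling and a single beam (the default)."""
    return not gen_kwargs.get("do_sample", False) and gen_kwargs.get("num_beams", 1) == 1
```

Assuming a loaded model and tokenized inputs, these would typically be used as `model.generate(**inputs, **TRANSLATION_GEN_KWARGS)`.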

### A\.5 Oracle Selective Grounding

Per‑segment COMET scores for the baseline translation \(against the reference\) are computed using the Unbabel/wmt22\-comet\-da model\. The worst k% of segments \(by baseline COMET\) are replaced with the corresponding visual‑enhanced translation \(from either Attr‑VC or Inter\-VS\)\. We report results for k = 20% and k = 30%\. Full results are given in Table[8](https://arxiv.org/html/2605.11993#A3.T8)\.
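The selection step can be sketched in a few lines. This is an illustrative sketch rather than the authors' code: the function name, the integer-percentage argument, and rounding the replacement count down are assumptions. The per-segment baseline COMET scores are taken as given; because they are computed against the reference, the procedure is an oracle and is not available at inference time.

```python
def oracle_selective(baseline, enhanced, baseline_scores, k_percent):
    """Replace the worst k% of baseline segments with visual-enhanced outputs.

    baseline, enhanced: parallel lists of translations for the same segments.
    baseline_scores: per-segment COMET of the baseline (reference-based).
    k_percent: integer percentage of segments to replace (e.g. 20 or 30).
    Returns the mixed translations and the sorted indices that were replaced.
    """
    n = len(baseline)
    n_replace = n * k_percent // 100  # rounding down is an assumption
    # Segment indices ordered from lowest to highest baseline score.
    ranked = sorted(range(n), key=lambda i: baseline_scores[i])
    worst = set(ranked[:n_replace])
    mixed = [enhanced[i] if i in worst else baseline[i] for i in range(n)]
    return mixed, sorted(worst)


# Toy example with five segments.
baseline = ["b0", "b1", "b2", "b3", "b4"]
enhanced = ["v0", "v1", "v2", "v3", "v4"]
scores = [0.9, 0.2, 0.7, 0.4, 0.8]
mixed_20, replaced_20 = oracle_selective(baseline, enhanced, scores, 20)
mixed_40, replaced_40 = oracle_selective(baseline, enhanced, scores, 40)
```

Only the replaced indices need visual processing, which is the source of the compute savings claimed for selective grounding.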

## Appendix B Additional Analysis

### B\.1 Window Size Considerations

The choice of a 5‑minute sliding window for Attribute Visual Context was guided by the need to capture enough local scene context while remaining robust to temporal drift\. A larger window \(e\.g\., 10 minutes\) would aggregate more visual information, but it could also include more irrelevant frames, potentially increasing the risk of hallucination when the visual context does not align with the dialogue\. A smaller window \(e\.g\., 2 minutes\) would be more sensitive to misalignment\. The 5‑minute window offers a reasonable trade‑off, as evidenced by the improvement in COMET over the baseline for most language\-movie pairs\.
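To illustrate how the window size changes the amount of visual material, the sketch below selects the 1 fps frame indices that fall in a window around a subtitle. The paper does not specify how the window is positioned; centring it on the subtitle midpoint, as well as the function and parameter names, are assumptions made only for this example.

```python
def window_frame_indices(sub_start, sub_end, n_frames, window_s=300, fps=1):
    """Frame indices inside a window of window_s seconds.

    Assumption: the window is centred on the subtitle midpoint and clipped
    to the bounds of the video (frames 0 .. n_frames - 1).
    """
    mid = (sub_start + sub_end) / 2
    lo = max(0, int((mid - window_s / 2) * fps))
    hi = min(n_frames, int((mid + window_s / 2) * fps) + 1)
    return range(lo, hi)


# A subtitle at 10:00-10:04 in a two-hour film sampled at 1 fps.
five_min = window_frame_indices(600, 604, n_frames=7200, window_s=300)
two_min = window_frame_indices(600, 604, n_frames=7200, window_s=120)
ten_min = window_frame_indices(600, 604, n_frames=7200, window_s=600)
```

Under these assumptions, a subtitle whose timestamps have drifted by a few tens of seconds still falls well inside the 5-minute window, whereas the same drift consumes a large share of a 2-minute window.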

### B\.2 Oracle Selective Grounding

The full oracle selective results for both summarization methods are presented in Table[8](https://arxiv.org/html/2605.11993#A3.T8)\. Replacing only the worst 20\-30% of baseline segments consistently lifts COMET above the baseline, demonstrating that most of the gain can be achieved with a fraction of the visual processing\.
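As a worked example of the size of these gains, the following sketch computes COMET deltas over the baseline for three rows of Table 8 (attribute-based method). The choice of rows and the variable names are illustrative; the scores themselves are copied from the table.

```python
# (movie, language): (baseline, oracle 20%, oracle 30%) for the attribute-based method.
rows = {
    ("Avatar", "Bengali"): (0.6298, 0.6682, 0.6829),
    ("Oppenheimer", "Hindi"): (0.6467, 0.6617, 0.6642),
    ("Titanic", "Kannada"): (0.4950, 0.5037, 0.5065),
}

# Absolute COMET gain over the text-only baseline at each replacement rate.
gains = {
    key: (round(o20 - base, 4), round(o30 - base, 4))
    for key, (base, o20, o30) in rows.items()
}

for (movie, lang), (g20, g30) in gains.items():
    print(f"{movie:12s} {lang:8s} +{g20:.4f} (20%)  +{g30:.4f} (30%)")
```

In these rows most of the 30% gain is already present at 20%, and the absolute gain varies considerably across language pairs.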

## Appendix C Additional Analysis

The 5‑minute sliding window offers a reasonable trade‑off between context and robustness; a larger window would risk including irrelevant frames, a smaller window would be more sensitive to misalignment\. Oracle selective grounding \(Table[8](https://arxiv.org/html/2605.11993#A3.T8)\) shows that replacing only the worst 20\-30% of baseline segments recovers most of the gain with minimal visual processing\. Fine‑grained fusion methods \(visual prefixing and cross‑attention fusion\) failed under misalignment \(e\.g\., “He is very kind” → “He drives fast”\) because they assume perfect alignment, whereas coarse attribute summarization ignores irrelevant frames\. This reinforces the idea that alignment quality often outweighs architectural complexity, and that selective grounding offers a practical path to efficient visual‑guided translation\.

### C\.1 Clarifications

Movie subtitle translation is inherently difficult \(fragmented speech, cultural references, zero‑shot domain adaptation\); our baseline scores reflect this challenge, and our contribution lies in the*relative*COMET gain, not absolute SOTA\. We use Qwen‑2\.5‑7B‑Instruct because it supports zero‑shot instruction following without fine‑tuning, unlike dedicated Indic models \(e\.g\., IndicTrans2\) that are optimised for sentence‑level translation and do not accept multi‑field visual context prompts\. FastVLM\-0\.5B prioritises efficiency \(85× faster than LLaVA\); using a smaller VLM makes gains harder to achieve, so our positive results demonstrate robustness\. FastVLM outputs raw descriptions; we summarise them with Llama 3\.1 to obtain structured attributes or free‑text gap summaries, because direct prompting of FastVLM for structured output is not feasible\. Our experiments compare visual\-enhanced translation against an identical text‑only baseline, isolating the effect of visual context; comparing across different MT systems would confound architectural differences\.

### C\.2 Evaluation Metrics

We report corpus‑level BLEU \[[9](https://arxiv.org/html/2605.11993#bib.bib1)\] using SacreBLEU and chrF\+\+ \[[10](https://arxiv.org/html/2605.11993#bib.bib14)\] with word\_order=2\. COMET \[[13](https://arxiv.org/html/2605.11993#bib.bib15)\] is computed with the wmt22\-comet\-da model using default settings\.

### C\.3 Data and Code Availability

To foster reproducibility, the curated movie\-subtitle\-visual alignment data for all five languages will be released under a fair‑use educational/research license\. The release includes English sources, reference translations, and extracted visual descriptions\. All code and datasets used for extraction, summarization, translation, and evaluation are available on GitHub \(https://github\.com/Tarunc224/visually\-guided\-subtitle\-translation\) and on our group page \(https://ai\-nlp\-ml\.github\.io/resources\.html\)\.

Table 5: Comparison of baseline and visual\-enhanced prompts\.

Table 6: Prompt templates used for visual summarization\. Both prompts are given to Llama\-3\.1\-8B\-Instruct\. The attribute prompt outputs structured tags; the inter‑chunk prompt outputs a free‑text sentence\. Placeholders in braces are replaced with actual data at runtime\.

Table 7: When visual context helps \(emotion, action, honorifics\) and when it backfires \(temporal misalignment, irrelevant visuals\)\. Visual‑enhanced outputs are from the better of the two summarization methods for each case\. The misalignment examples are drawn from actual data where cumulative drift paired a kind remark with a car chase, and a flying statement with a calm ocean scene\.

| Movie | Language | Baseline COMET | 5‑Minute Slide Visual Attribute: Oracle selective 20% | 5‑Minute Slide Visual Attribute: Oracle selective 30% | Inter‑Chunk Visual Summarisation: Oracle selective 20% | Inter‑Chunk Visual Summarisation: Oracle selective 30% |
| --- | --- | --- | --- | --- | --- | --- |
| Avatar | Bengali | 0.6298 | 0.6682 | 0.6829 | 0.6709 | 0.6865 |
| Avatar | Telugu | 0.5257 | 0.5354 | 0.5390 | 0.5361 | 0.5390 |
| Avatar | Tamil | 0.5352 | 0.5580 | 0.5650 | 0.5569 | 0.5639 |
| Avatar | Kannada | 0.4857 | 0.4933 | 0.4943 | 0.4928 | 0.4946 |
| Oppenheimer | Bengali | 0.7026 | 0.7224 | 0.7248 | 0.7210 | 0.7237 |
| Oppenheimer | Hindi | 0.6467 | 0.6617 | 0.6642 | 0.6643 | 0.6690 |
| Oppenheimer | Telugu | 0.5475 | 0.5604 | 0.5654 | 0.5600 | 0.5647 |
| Oppenheimer | Tamil | 0.5366 | 0.5630 | 0.5715 | 0.5639 | 0.5715 |
| Oppenheimer | Kannada | 0.4938 | 0.5066 | 0.5096 | 0.5065 | 0.5090 |
| Skyfall | Bengali | 0.6914 | 0.7064 | 0.7098 | 0.7044 | 0.7056 |
| Skyfall | Hindi | 0.6026 | 0.6223 | 0.6258 | 0.6216 | 0.6245 |
| Skyfall | Telugu | 0.5288 | 0.5430 | 0.5478 | 0.5415 | 0.5454 |
| Skyfall | Tamil | 0.5350 | 0.5581 | 0.5659 | 0.5561 | 0.5639 |
| Skyfall | Kannada | 0.4920 | 0.5013 | 0.5038 | 0.5014 | 0.5038 |
| Spider 2 | Bengali | 0.7190 | 0.7337 | 0.7359 | 0.7305 | 0.7350 |
| Spider 2 | Hindi | 0.6459 | 0.6684 | 0.6746 | 0.6717 | 0.6786 |
| Spider 2 | Telugu | 0.5407 | 0.5527 | 0.5567 | 0.5527 | 0.5569 |
| Spider 2 | Tamil | 0.5448 | 0.5684 | 0.5753 | 0.5700 | 0.5761 |
| Titanic | Bengali | 0.6960 | 0.7114 | 0.7130 | 0.7120 | 0.7150 |
| Titanic | Hindi | 0.6152 | 0.6293 | 0.6321 | 0.6320 | 0.6367 |
| Titanic | Telugu | 0.5350 | 0.5456 | 0.5481 | 0.5465 | 0.5494 |
| Titanic | Kannada | 0.4950 | 0.5037 | 0.5065 | 0.5049 | 0.5063 |

Table 8: Oracle selective COMET scores for the two summarization methods\. “Oracle selective 20%” and “Oracle selective 30%” replace the worst 20% and 30% of baseline segments \(by baseline COMET\) with the corresponding visual‑enhanced translation\. The full visual‑enhanced results are reported in Table[3](https://arxiv.org/html/2605.11993#S4.T3)\.
