FBK's Long-form SpeechLLMs for IWSLT 2026 Instruction Following

# FBK’s Long-form SpeechLLMs for IWSLT 2026 Instruction Following
Source: [https://arxiv.org/html/2606.26819](https://arxiv.org/html/2606.26819)
Zhihang Xie¹,², Marco Gaido¹, Sara Papi¹, Matteo Negri¹, Luisa Bentivogli¹

¹Fondazione Bruno Kessler, ²University of Trento

###### Abstract

This paper describes our submission to the IWSLT 2026 Instruction Following shared task\. SpeechLLMs are developed for both short\-form and long\-form speech instruction following under constrained settings\. For the short track, strong performance is achieved on MCIF, with a SIFS score of 2\.0708\. For the long track, three speech segmentation methods are explored, and the HIFS score is introduced to account for unstable long\-form generation\. Experimental results show that fixed 30\-second segmentation provides the most robust long\-form performance, achieving the highest HIFS score of 2\.0663\. Further analysis shows that hallucination mainly manifests as repetitive insertions in generated outputs, substantially affecting ASR and SSUM, while short\-form capabilities are largely retained after long\-form extension\.


## 1 Introduction

Speech large language models (SpeechLLMs) extend the instruction-following (IF) capabilities of large language models to spoken inputs by connecting a speech encoder to an LLM decoder through a modality adapter Chen et al. ([2024](https://arxiv.org/html/2606.26819#bib.bib9)); Huang et al. ([2024](https://arxiv.org/html/2606.26819#bib.bib1)). This architecture supports multiple speech-to-text tasks with natural-language instructions, including automatic speech recognition (ASR), speech translation (ST), spoken question answering (SQA), and speech summarization (SSUM). Compared with task-specific systems, instruction-following SpeechLLMs Fathullah et al. ([2024](https://arxiv.org/html/2606.26819#bib.bib2)); Lee et al. ([2025](https://arxiv.org/html/2606.26819#bib.bib24)) provide a flexible interface for multilingual and multi-task speech processing, while allowing spoken inputs to benefit from the generation abilities of pretrained LLMs.

The International Conference on Spoken Language Translation (IWSLT) 2026 Instruction Following shared task ([IWSLT 2026 Instruction Following page](https://iwslt.org/2026/instruction-following)) evaluates speech-based instruction-following systems in short-form and long-form settings. The short track focuses on short-form speech and covers ASR, ST, and SQA across languages. The long track extends this setting to longer audio, requiring systems to perform ASR, ST, SQA, SSUM, and audio chaptering (ACHAP). Together, the two tracks provide a benchmark for assessing instruction-following ability and robustness to long-form speech.

Despite recent progress in SpeechLLMs, long-form speech processing remains challenging. Long-form inputs introduce high computational cost, long acoustic token sequences, discourse-level context modeling, and hallucination risks. In IWSLT 2025, KIT Koneru et al. ([2025](https://arxiv.org/html/2606.26819#bib.bib23)) was the only participant in the long track Abdulmumin and others ([2025](https://arxiv.org/html/2606.26819#bib.bib3)). For SpeechLLMs, longer speech sequences place greater pressure on the LLM context and may lead to unstable generation. It therefore remains unclear how effectively short-form SpeechLLMs can be extended to long-form settings, how much short-form ability is retained, and which segmentation strategies are most effective for long-form inference.

Under the constrained setting, this study examines three hypotheses: \(1\) long\-form extension can improve long\-form performance while preserving competitive short\-form capability; \(2\) speech segmentation substantially affects long\-form instruction following; and \(3\) hallucination\-aware evaluation is needed for reliable comparison of long\-form systems\. The main contributions are:

- Practical and effective SpeechLLM solutions for short-form and long-form instruction following under constrained settings.
- Empirical comparisons across three speech segmentation strategies: fixed-window, CRDNN-based, and hybrid segmentation.
- Detailed evaluation analysis of hallucination effects and short-form capability retention after long-form extension.

![Model architecture](https://arxiv.org/html/2606.26819v1/architecture.png) (a) Model architecture
![Short-form fine-tuning](https://arxiv.org/html/2606.26819v1/finetuning.png) (b) Short-form fine-tuning
![Long-form extension](https://arxiv.org/html/2606.26819v1/extension.png) (c) Long-form extension

Figure 1: Model architecture and training pipeline of the SpeechLLMs for the short and long tracks.

## 2 Model Architecture

In the constrained setting, SpeechLLMs must be built upon two pretrained models, SeamlessM4T-v2-large ([SeamlessM4T-v2-large on Huggingface](https://huggingface.co/facebook/seamless-m4t-v2-large)) and Qwen3-4B-Instruct ([Qwen3-4B-Instruct on Huggingface](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)). As illustrated in Figure [1(a)](https://arxiv.org/html/2606.26819#S1.F1.sf1), our SpeechLLMs comprise a speech encoder, a modality adapter, and an LLM decoder. The model architecture is identical for the short and long tracks, so the fine-tuned short-form SpeechLLM can be extended directly to long-form input.

### 2.1 Speech Encoder

The speech encoder is adopted from SeamlessM4T-v2-large and serves as the acoustic front end of the model, transforming raw speech into intermediate speech representations for downstream processing Barrault and others ([2023](https://arxiv.org/html/2606.26819#bib.bib4)). The feature extractor operates on 16 kHz audio and converts raw waveforms into 80-dimensional log-Mel filterbank features. With a 10 ms hop size, these features are extracted at 100 Hz. A stride of 2 is then applied in the frontend, such that two consecutive frames are combined into a 160-dimensional representation before feature projection, reducing the temporal resolution to 50 Hz. These acoustic features are provided to the model together with an attention mask indicating valid and padded positions. In this way, the feature extractor forms the interface between speech input and the encoder stack.

The speech encoder is built upon the extracted acoustic features and comprises a feature projection module followed by a Conformer stack Koluguri et al. ([2023](https://arxiv.org/html/2606.26819#bib.bib5)). The feature projection maps the 160-dimensional frontend features to the model hidden size of 1024, and the Conformer encoder then processes them into high-level speech representations while preserving the temporal resolution of 50 Hz. In the released large configuration, the encoder comprises 24 Conformer layers, each with 16 attention heads and a feed-forward network of intermediate dimension 4096. Relative positional representations Dai et al. ([2019](https://arxiv.org/html/2606.26819#bib.bib6)) and a depthwise convolution with kernel size 31 are incorporated in each Conformer block, enabling the encoder to capture both long-range contextual dependencies and local acoustic structure.

### 2.2 Modality Adapter

The modality adapter comprises the adapter inherited from SeamlessM4T-v2-large for temporal compression and a linear adapter for projection into the hidden space of the LLM decoder.

Given 50 Hz speech representations from the speech encoder, an intermediate feed\-forward module first transforms the encoder outputs at the 1024\-dimensional hidden size before passing them to the inherited adapter for temporal compression\. In this adapter, the residual and self\-attention branches each apply a one\-dimensional convolution with kernel size 8, stride 8, and padding 4, followed by a gated linear unit\. This design reduces the sequence length by approximately a factor of 8, yielding an effective frame rate of about 6\.25 Hz while preserving the 1024\-dimensional hidden size\.

The adapter layer incorporates self\-attention and a feed\-forward network with an intermediate dimension of 4096, allowing the compressed representations to be refined after subsampling\. Following the inherited adapter, a linear adapter is introduced to project the compressed speech representations from 1024 to 2560\. This projection aligns the speech representations with the decoder input space, so that the resulting speech tokens have the same dimensionality as the text tokens\.
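The compression arithmetic above can be checked with a short calculation. The sketch below is illustrative only: it assumes the encoder emits exactly 50 frames per second and applies the standard 1-D convolution output-length formula with the stated kernel size, stride, and padding. The function names are hypothetical and not part of the released model code.

```python
def conv1d_out_len(n_frames: int, kernel: int = 8, stride: int = 8, padding: int = 4) -> int:
    """Output length of a 1-D convolution with dilation 1."""
    return (n_frames + 2 * padding - kernel) // stride + 1


def speech_tokens(duration_s: float, encoder_hz: int = 50) -> int:
    """Approximate number of speech tokens passed to the LLM decoder.

    Assumption: the encoder output rate is exactly `encoder_hz` frames per second.
    """
    n_frames = int(duration_s * encoder_hz)
    return conv1d_out_len(n_frames)


if __name__ == "__main__":
    for seconds in (15, 30, 45, 60):
        n = speech_tokens(seconds)
        print(f"{seconds:>2d} s -> {n} speech tokens ({n / seconds:.2f} per second)")
```

Under these assumptions a 30-second segment maps to roughly 188 speech tokens, consistent with the effective rate of about 6.25 Hz.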

### 2.3 LLM Decoder

The LLM decoder is inherited from Qwen3-4B-Instruct and leverages its strong instruction-following ability for general reasoning and token generation. To incorporate the speech modality, the prepending fusion mechanism Lam et al. ([2025](https://arxiv.org/html/2606.26819#bib.bib7)) is employed, whereby speech representations are prepended to the text embeddings of the instruction. In addition, parameter-efficient fine-tuning methods such as LoRA Hu et al. ([2021](https://arxiv.org/html/2606.26819#bib.bib8)) are commonly used to adapt the LLM to downstream tasks without updating the full set of model parameters Chen et al. ([2024](https://arxiv.org/html/2606.26819#bib.bib9)); Microsoft ([2025](https://arxiv.org/html/2606.26819#bib.bib10)).

![Short-form capabilities](https://arxiv.org/html/2606.26819v1/short.png) (a) Short-form capabilities
![Long-form capabilities](https://arxiv.org/html/2606.26819v1/long.png) (b) Long-form capabilities

Figure 2: Radar charts of short-form and long-form capabilities, with ASR measured by accuracy, ST by mean COMET score, SQA and SSUM by BERTScore, and long-form scores penalized by hallucination rates.

## 3 Short Track

### 3.1 Corpora

In the constrained setting, training is conducted using CoVoST2 Wang et al. ([2020](https://arxiv.org/html/2606.26819#bib.bib11)), EuroParlST Iranzo-Sánchez et al. ([2020](https://arxiv.org/html/2606.26819#bib.bib25)), GigaST Ye et al. ([2023](https://arxiv.org/html/2606.26819#bib.bib13)), and LibriSQA Zhao et al. ([2024](https://arxiv.org/html/2606.26819#bib.bib14)). For data augmentation, synthetic data are generated with SeamlessM4T-v2-large and Qwen3-4B-Instruct. Validation is primarily performed on the MCIF Papi et al. ([2026](https://arxiv.org/html/2606.26819#bib.bib15)) short-track datasets.

The training corpora, summarized in Table [4](https://arxiv.org/html/2606.26819#A1.T4), comprise 5,056,973 samples with a total duration of 8,436.63 hours and an average duration of 6.01 seconds. Because these corpora differ substantially in annotation format, language coverage, and task supervision, dataset-specific processing is applied to construct a unified multilingual instruction-following training set for ASR, ST, and SQA. Specifically, synthetic translations are generated with SeamlessM4T-v2-large when target-language annotations are missing, low-quality samples are filtered using COMET Rei et al. ([2022](https://arxiv.org/html/2606.26819#bib.bib16)), and synthetic question-answer pairs are generated with Qwen3-4B-Instruct when question-answering supervision is unavailable or limited. A small proportion of unanswerable examples is also introduced to improve robustness in SQA. The processing applied to each dataset is described below, and validation is primarily performed on the MCIF Papi et al. ([2026](https://arxiv.org/html/2606.26819#bib.bib15)) short-track datasets, which provide a unified benchmark for multiple short-form tasks and languages.

CoVoST2: The original dataset provides English transcriptions together with translations from English into German and Chinese. Since Italian translations are not included, they are generated from the English transcriptions using SeamlessM4T-v2-large, after which translations with COMET Rei et al. ([2022](https://arxiv.org/html/2606.26819#bib.bib16)) scores below 0.85 (computed with Unbabel/wmt22-cometkiwi-da) are removed as low quality. To increase the amount of training data for spoken question answering, synthetic question-answer pairs are generated with Qwen3-4B-Instruct. In addition, to explicitly model unanswerable cases, five percent of the pairs in each language are randomly sampled, their questions are replaced with random questions drawn from the remaining samples, and their answers are set to “Not answerable” in the corresponding language.

EuroParlST: The original dataset provides English transcriptions together with translations from English into German and Italian. Two training subsets are available, train and train-noisy, but only the train subset is used. No synthetic translations from English to Chinese are generated.

GigaST: The original dataset provides English transcriptions together with translations from English into German and Chinese. Five training subsets are available (XS, S, M, L, and XL), but only the M subset is used to maintain a training data scale comparable to that of the other datasets. Since the English transcriptions are provided entirely in uppercase, text normalization is applied with Qwen3-4B-Instruct to convert them into natural sentence form. Translations with COMET scores below 0.85 are removed as low quality.

LibriSQA: The original dataset provides English transcriptions together with question-answer pairs. To create synthetic translations, the English transcriptions are first translated into German, Italian, and Chinese using SeamlessM4T-v2-large, after which low-quality translations are removed based on COMET thresholds of 0.8, 0.8, and 0.75, respectively. Synthetic question-answer pairs are then generated using Qwen3-4B-Instruct. Following the same procedure as for CoVoST2, five percent of the pairs are randomly selected, and their answers are replaced with “Not answerable” in the corresponding language.
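The construction of unanswerable examples can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the record layout (`audio`, `question`, `answer`), the function name, and the seeding are assumptions, and only the English form of the answer string is listed.

```python
import random

# Assumption: other languages would use their own translation of this string.
NOT_ANSWERABLE = {"en": "Not answerable"}


def add_unanswerable(pairs, lang="en", ratio=0.05, seed=0):
    """Return a copy of `pairs` in which a fraction `ratio` is made unanswerable.

    Each selected sample keeps its audio but receives a question drawn from one
    of the non-selected samples, and its answer is set to the fixed string.
    """
    rng = random.Random(seed)
    out = [dict(p) for p in pairs]
    n_swap = int(len(out) * ratio)
    chosen = set(rng.sample(range(len(out)), n_swap))
    remaining = [i for i in range(len(out)) if i not in chosen]
    for i in sorted(chosen):
        donor = rng.choice(remaining)
        out[i]["question"] = pairs[donor]["question"]
        out[i]["answer"] = NOT_ANSWERABLE[lang]
    return out


if __name__ == "__main__":
    data = [{"audio": f"utt{i}.wav", "question": f"q{i}", "answer": f"a{i}"} for i in range(100)]
    augmented = add_unanswerable(data)
    print(sum(p["answer"] == "Not answerable" for p in augmented), "unanswerable samples")
```

Drawing the replacement question from the non-selected samples means the mismatched question is still realistic for the domain, which is presumably what makes the negative examples informative.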

### 3.2 Training Strategy

The training pipeline is illustrated in Figure [1(b)](https://arxiv.org/html/2606.26819#S1.F1.sf2). The SpeechLLM contains 4.73B parameters in total and includes a newly introduced two-layer linear module with an intermediate dimension of 3584. Parameter-efficient adaptation is performed by training 112.2M parameters, with LoRA applied to the query, key, and output projection modules using rank 8 and alpha 16. The model is optimized for two epochs with AdamW, using a total batch size of 128 and gradient clipping set to 1.0. Separate learning rates are used for the base model and LoRA parameters, set to $1\mathrm{e}{-4}$ and $3\mathrm{e}{-4}$, respectively, together with a cosine learning-rate scheduler and a warmup phase covering 3% of the total training steps. To improve robustness, data augmentation is applied, including speed perturbation with factors [0.9, 1.0, 1.1] and SpecAugment with two time masks and two frequency masks, where the maximum mask widths are 50 and 10, respectively. The model is trained on four NVIDIA A100 64GB GPUs for approximately two days. The final checkpoint is used for evaluation and submission.
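For reference, a cosine schedule with a 3% warmup can be written in a few lines. The sketch below is an assumption-laden illustration: the linear warmup shape, the decay to zero, and the total step count of 10,000 are hypothetical choices, since the paper specifies only the scheduler type, the warmup fraction, and the two peak learning rates.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float, warmup_ratio: float = 0.03) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay.

    Assumptions: warmup is linear and the cosine decays to zero at `total_steps`.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 10_000  # hypothetical step count, for illustration only
    for step in (0, 299, 5_000, 9_999):
        base = lr_at(step, total, peak_lr=1e-4)  # base-model parameters
        lora = lr_at(step, total, peak_lr=3e-4)  # LoRA parameters
        print(f"step {step:>5d}: base {base:.2e}, LoRA {lora:.2e}")
```

Both parameter groups follow the same schedule shape and differ only in their peak value.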

### 3.3 System Evaluation

| Submission | Language | ASR Accuracy | ST COMET | SQA BERTScore | SIFS |
|---|---|---|---|---|---|
| Primary (Short-form) | en-en | 0.8877 | - | 0.4426 | - |
| | en-de | - | 0.7286 | 0.4152 | - |
| | en-it | - | 0.7496 | 0.4034 | - |
| | en-zh | - | 0.7869 | 0.4513 | - |
| | score | 0.8877 | 0.7550 | 0.4281 | 2.0708 |
| Contrastive (Long-form) | en-en | 0.8640 | - | 0.4413 | - |
| | en-de | - | 0.7036 | 0.4294 | - |
| | en-it | - | 0.7516 | 0.4128 | - |
| | en-zh | - | 0.7425 | 0.4466 | - |
| | score | 0.8640 | 0.7326 | 0.4325 | 2.0291 |

Table 1: Performance of the primary and contrastive systems on the MCIF short-form tasks.

Validation is performed on the MCIF Papi et al. ([2026](https://arxiv.org/html/2606.26819#bib.bib15)) short-track datasets covering multiple tasks and languages. The Short-form Instruction-Following Score (SIFS) is computed by summing the averaged task-level scores, as shown in Equation [1](https://arxiv.org/html/2606.26819#S3.E1). Here, $\mathcal{T}$ includes ASR, ST, and SQA; $\mathcal{L}_t$ denotes the language pairs with valid results for task $t$; and $m_{\ell,t}$ denotes the task-specific score, with ASR measured by $1-\mathrm{WER}$.

$$\mathrm{SIFS}=\sum_{t\in\mathcal{T}}\frac{1}{|\mathcal{L}_t|}\sum_{\ell\in\mathcal{L}_t}m_{\ell,t}. \tag{1}$$

As shown in Table [1](https://arxiv.org/html/2606.26819#S3.T1) and Figure [2(a)](https://arxiv.org/html/2606.26819#S2.F2.sf1), the primary system achieves competitive short-form performance, with a SIFS score of 2.0708, comprising an ASR accuracy of 0.8877, an average ST COMET score of 0.7550, and an average SQA BERTScore of 0.4281.
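As a worked instance of Equation 1, the sketch below recomputes SIFS for the primary system from the per-language scores in Table 1. The dictionary layout and function name are illustrative choices, not the official scoring script.

```python
from statistics import mean

# Per-language scores of the primary short-form system, copied from Table 1.
primary = {
    "ASR": {"en-en": 0.8877},
    "ST": {"en-de": 0.7286, "en-it": 0.7496, "en-zh": 0.7869},
    "SQA": {"en-en": 0.4426, "en-de": 0.4152, "en-it": 0.4034, "en-zh": 0.4513},
}


def sifs(scores):
    """Sum over tasks of the mean score across that task's language pairs."""
    return sum(mean(per_lang.values()) for per_lang in scores.values())


if __name__ == "__main__":
    print(f"SIFS = {sifs(primary):.4f}")
```

The result agrees with the reported 2.0708 up to rounding in the last digit, because the table sums task averages that were already rounded to four decimals.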

### 3.4 System Submission

As shown in Table [1](https://arxiv.org/html/2606.26819#S3.T1), the primary short-form system achieves the best overall performance, with a SIFS score of 2.0708. The contrastive system is based on the long-form model and obtains a lower SIFS score of 2.0291, but remains competitive on SQA. This system is discussed further in Section [4.3](https://arxiv.org/html/2606.26819#S4.SS3).

Table [6](https://arxiv.org/html/2606.26819#A1.T6) shows the IWSLT 2026 short-form results Adelani et al. ([2026](https://arxiv.org/html/2606.26819#bib.bib26)). The primary submission outperforms the contrastive submission on most tasks, with comparable results on EN-EN and EN-IT. Overall, the primary submission is more balanced across all tasks, whereas the contrastive submission is less robust to unseen tasks.

## 4 Long Track

### 4.1 Corpora

In the constrained setting, the training corpora comprise LibriSQA for ASR, ST, and SQA, Nutshell Züfle et al. ([2025](https://arxiv.org/html/2606.26819#bib.bib17)) for SSUM, and YTSeg Retkowski and Waibel ([2024](https://arxiv.org/html/2606.26819#bib.bib18)) for ACHAP. For data augmentation, SeamlessM4T-v2-large and Qwen3-4B-Instruct are used to process the synthetic data. Validation is performed on the MCIF Papi et al. ([2026](https://arxiv.org/html/2606.26819#bib.bib15)) long-track datasets.

The training corpora, summarized in Table [5](https://arxiv.org/html/2606.26819#A1.T5), comprise 825,256 samples with a total duration of 9,496.36 hours and an average duration of 41.43 seconds. Because Nutshell provides only abstracts without transcriptions, and YTSeg contains highly noisy transcriptions, neither dataset provides reliable supervision for long-form speech recognition and translation. Initial zero-shot experiments further indicate that a SpeechLLM trained only on short-form data generalizes poorly to long-form speech, showing substantial performance degradation when required to process extended acoustic and linguistic context. This limitation is particularly important because speech recognition and translation constitute the foundation for higher-level long-form speech understanding tasks, such as speech summarization and audio chaptering. To address this gap, long-form training data for speech recognition, translation, and question answering are constructed from LibriSQA through artificial concatenation Fox et al. ([2024](https://arxiv.org/html/2606.26819#bib.bib19)). The processing applied to each dataset is described below. Validation is performed on the MCIF Papi et al. ([2026](https://arxiv.org/html/2606.26819#bib.bib15)) long-track datasets, which provide a unified benchmark for the long-form tasks and languages.

LibriSQA: The original dataset provides exclusively short-form samples, but these can be reorganized into long-form examples by leveraging the chapter structure. For artificial concatenation, utterance-level audio segments are first grouped by chapter and then concatenated in sentence order, with a 0.5-second silence inserted between consecutive utterances to preserve natural boundaries. The transcriptions are processed in the same way: Chinese translations are concatenated directly, whereas spaces are inserted between consecutive sentences in English, German, and Italian to maintain sentence boundaries. Because low-quality translations are removed during preprocessing, the resulting concatenated chapters in German, Italian, and Chinese are sometimes incomplete and therefore do not fully correspond to their English counterparts. Since question-answer pairs are provided at the utterance level, only one pair is randomly selected for each concatenated sample; otherwise, the total training duration would exceed 10,000 hours and become computationally prohibitive. The maximum duration at chapter level is limited to 10 minutes to reduce computational cost.
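The concatenation step can be illustrated with plain Python lists standing in for waveforms. This is a sketch under stated assumptions: the function name is hypothetical, and the way the 10-minute limit is enforced (here, by stopping before the utterance that would exceed it) is a guess, since the text gives only the limit itself.

```python
def concatenate_chapter(utterances, sample_rate=16000, gap_s=0.5, max_s=600.0, sep=" "):
    """Join one chapter's utterances into a single long-form sample.

    utterances: list of (waveform, text) tuples in sentence order, where each
        waveform is a list of float samples.
    sep: text separator; " " for English, German, and Italian, "" for Chinese.
    Assumption: utterances that would push the sample past `max_s` are dropped.
    """
    silence = [0.0] * int(gap_s * sample_rate)
    audio, texts = [], []
    for wav, text in utterances:
        extra = len(wav) + (len(silence) if audio else 0)
        if (len(audio) + extra) / sample_rate > max_s:
            break
        if audio:
            audio.extend(silence)  # 0.5 s pause between consecutive utterances
        audio.extend(wav)
        texts.append(text)
    return audio, sep.join(texts)


if __name__ == "__main__":
    chapter = [([0.1] * 32000, "First sentence."), ([0.2] * 48000, "Second sentence.")]
    wav, text = concatenate_chapter(chapter)
    print(len(wav) / 16000, "seconds:", text)
```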

Nutshell: The original dataset provides English abstracts from scientific talks, and translations into German, Italian, and Chinese are generated using SeamlessM4T-v2-large. Low-quality translations at abstract level are removed using COMET thresholds of 0.75, 0.75, and 0.8, respectively. The maximum duration at abstract level is limited to 15 minutes to reduce computational cost.

YTSeg: The original dataset provides English transcriptions and chapter annotations for YouTube videos. Chapter titles and timestamps are automatically generated using Whisper and then merged into multiple target formats. The chapter titles in the target_text_ref field are extracted and processed to insert newlines and Markdown formatting. Because the videos often contain noisy acoustic content, such as music and background noise, VAD is applied to estimate voiced speech duration, and samples with speech ratios below 0.7 are removed. Translations into German, Italian, and Chinese are generated using SeamlessM4T-v2-large. No COMET-based filtering is applied, so that complete chapter information is preserved. The maximum duration at chapter level is limited to 10 minutes, shorter than that used for Nutshell because chaptering requires substantially longer target texts.

| Submission | HAL | WER | HIT | SUB | INS | DEL |
|---|---|---|---|---|---|---|
| Fixed 30s | +/- | 0.1418 | 14,792 | 1,124 | 376 | 881 |
| CRDNN | - | 0.1579 | 13,037 | 983 | 271 | 1,139 |
| | + | 1.0158 | 13,703 | 1,938 | 13,969 | 1,156 |
| Hybrid | - | 0.1569 | 13,828 | 1,054 | 312 | 1,149 |
| | + | 0.4341 | 14,346 | 1,290 | 4,841 | 1,161 |

Table 2: ASR WER breakdown for different segmentation strategies with and without hallucination (HAL).

| Submission | Language | ASR Accuracy | ASR HAL | ST COMET | ST HAL | SQA BERTScore | SQA HAL | SSUM BERTScore | SSUM HAL | HIFS |
|---|---|---|---|---|---|---|---|---|---|---|
| Primary (Fixed 30s) | en-en | 0.8582 | 0/21 | - | - | 0.3755 | 1/220 | 0.1850 | 7/100 | - |
| | en-de | - | - | 0.7193 | 3/21 | 0.3699 | 0/220 | 0.2008 | 25/100 | - |
| | en-it | - | - | 0.7546 | 1/21 | 0.3356 | 4/220 | 0.2203 | 31/100 | - |
| | en-zh | - | - | 0.7365 | 4/21 | 0.3937 | 4/220 | 0.3586 | 10/100 | - |
| | score | 0.8582 | 0/21 | 0.6438 | 8/63 | 0.3649 | 9/880 | 0.1993 | 73/400 | 2.0663 |
| Contrastive (CRDNN) | en-en | 0.8421 | 2/21 | - | - | 0.3755 | 0/220 | 0.1932 | 11/100 | - |
| | en-de | - | - | 0.7035 | 2/21 | 0.3795 | 0/220 | 0.1913 | 21/100 | - |
| | en-it | - | - | 0.7749 | 5/21 | 0.3320 | 2/220 | 0.2205 | 28/100 | - |
| | en-zh | - | - | 0.7696 | 5/21 | 0.4041 | 2/220 | 0.3573 | 2/100 | - |
| | score | 0.7619 | 2/21 | 0.6044 | 12/63 | 0.3711 | 4/880 | 0.2080 | 62/400 | 1.9454 |
| Contrastive (Hybrid) | en-en | 0.8431 | 1/21 | - | - | 0.3728 | 0/220 | 0.1975 | 11/100 | - |
| | en-de | - | - | 0.7043 | 1/21 | 0.3747 | 0/220 | 0.2024 | 28/100 | - |
| | en-it | - | - | 0.7761 | 4/21 | 0.3333 | 3/220 | 0.2335 | 38/100 | - |
| | en-zh | - | - | 0.7678 | 4/21 | 0.4038 | 2/220 | 0.3545 | 4/100 | - |
| | score | 0.8030 | 1/21 | 0.6402 | 9/63 | 0.3691 | 5/880 | 0.2016 | 81/400 | 2.0139 |

Table 3: Performance of different speech segmentation strategies on the MCIF long-form tasks, together with hallucination (HAL) statistics, where the fixed 30-second window strategy achieves the best overall performance.

### 4.2 Training Strategy

The training pipeline is illustrated in Figure[1\(c\)](https://arxiv.org/html/2606.26819#S1.F1.sf3)\. The model follows the same architecture as the short\-track system and is optimized using the same hyperparameters\. Speed perturbation is disabled, while SpecAugment is kept unchanged\. The gradient accumulation step is set to 8, and the fixed batch\-size strategy is replaced with a dynamic bucket sampler to better accommodate variable\-length speech inputs\. For long\-form audio, speech segmentation is applied before training\. The model is trained on four NVIDIA A100 64GB GPUs for approximately three days\. Due to time constraints, only one of the two scheduled training epochs is completed by the submission deadline\. The final checkpoint is used for evaluation and submission\.

#### 4.2.1 Dynamic Bucket Sampler

To improve training efficiency under large variation in audio duration, we employ a dynamic bucket\-based distributed batch sampler\. The sampler assigns each sample to a duration bucket defined by a set of boundaries, and each bucket is associated with a dedicated per\-replica batch size\. This allows shorter utterances to be grouped into larger batches and longer utterances into smaller ones, reducing padding overhead and improving hardware utilization\. At the beginning of each epoch, sample indices are deterministically shuffled and grouped by bucket\. Within each bucket, synchronized global chunks are formed and evenly partitioned across replicas so that all workers process samples from the same duration range at each step\.

When the number of samples in a bucket is not divisible by the global chunk size, the sampler either drops the remainder or pads the bucket by deterministically resampling indices from the same bucket during training\. It further supports both sequential and round\-robin scheduling across buckets, enabling either contiguous or interleaved traversal of different duration ranges\. The sampler maintains explicit state, including the current epoch and iterator position, making it fully compatible with resumable and stateful distributed training\.
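The core bucketing logic can be sketched for a single process as follows. This is a simplified illustration and not the sampler used for training: it omits the distribution of global chunks across replicas, round-robin scheduling, and resumable state, and all names and boundary values are hypothetical.

```python
import bisect
import random


def bucket_batches(durations, boundaries, batch_sizes, epoch=0, seed=0, drop_last=False):
    """Yield lists of sample indices; each batch holds samples from one duration bucket.

    boundaries: sorted upper bounds in seconds; bucket k holds durations <= boundaries[k],
        and the last bucket holds everything longer.
    batch_sizes: one batch size per bucket (len(boundaries) + 1 entries).
    """
    assert len(batch_sizes) == len(boundaries) + 1
    rng = random.Random(seed + epoch)  # deterministic shuffle for a given epoch
    order = list(range(len(durations)))
    rng.shuffle(order)
    buckets = [[] for _ in batch_sizes]
    for idx in order:
        buckets[bisect.bisect_left(boundaries, durations[idx])].append(idx)
    for bucket, size in zip(buckets, batch_sizes):
        remainder = len(bucket) % size
        if remainder and drop_last:
            bucket = bucket[: len(bucket) - remainder]
        elif remainder:
            # Pad by resampling indices from the same bucket.
            bucket = bucket + [rng.choice(bucket) for _ in range(size - remainder)]
        for start in range(0, len(bucket), size):
            yield bucket[start : start + size]


if __name__ == "__main__":
    durs = [5, 8, 12, 40, 50, 90, 300]
    for batch in bucket_batches(durs, boundaries=[30, 120], batch_sizes=[4, 2, 1]):
        print([durs[i] for i in batch])
```

Short utterances end up in batches of four while the 300-second sample is processed alone, which is the padding-saving behavior described above.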

#### 4.2.2 Speech Segmentation

Algorithm 1: DivideAndConquer

Input: pause list $\mathcal{P}$, left boundary $L$, right boundary $R$, maximum duration $D$

Output: refined segment list within $[L, R]$

    function DivideAndConquer(𝒫, L, R, D)
        if R ≤ L then
            return [ ]
        end if
        if R − L ≤ D then
            return [(L, R)]
        end if
        p* ← FindLongestPause(𝒫, L, R)
        if p* = None then
            return [(L, R)]
        end if
        (p_s, p_e) ← p*
        ℛ ← [ ]
        if L < p_s then
            ℛ ← ℛ ∪ DivideAndConquer(𝒫, L, p_s, D)
        end if
        if p_e < R then
            ℛ ← ℛ ∪ DivideAndConquer(𝒫, p_e, R, D)
        end if
        return ℛ
    end function

To process long\-form speech with a model trained primarily on shorter utterances, three segmentation strategies are explored\. The first strategy uses fixed windows, where each audio is divided into consecutive non\-overlapping segments with durations of 15, 30, 45, or 60 seconds\. This setting provides a simple and controlled baseline for analyzing the effect of segment length on downstream performance\. Shorter windows can reduce memory and context burden within each segment, while longer windows preserve more local context but increase the risk of deletion and truncation errors\.
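Fixed-window segmentation reduces to a simple loop over the recording. The sketch below is illustrative: the function name is hypothetical, and keeping a shorter final segment (rather than padding or discarding it) is an assumption.

```python
def fixed_windows(total_s: float, window_s: float = 30.0):
    """Split [0, total_s] into consecutive non-overlapping windows of `window_s` seconds.

    Assumption: the last window is kept even when it is shorter than `window_s`.
    """
    segments = []
    start = 0.0
    while start < total_s:
        end = min(start + window_s, total_s)
        segments.append((start, end))
        start = end
    return segments


if __name__ == "__main__":
    for window in (15.0, 30.0, 45.0, 60.0):
        print(window, len(fixed_windows(600.0, window)), "segments for a 10-minute recording")
```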

The second strategy applies voice activity detection using the SpeechBrain CRDNN model to produce speech-based segments Ravanelli et al. ([2021](https://arxiv.org/html/2606.26819#bib.bib20)). The activation_th, deactivation_th, and speech_th thresholds control speech onset detection, speech offset detection, and speech region selection, respectively. The default setting, (0.5, 0.25, 0.5), is used as the baseline; however, it may produce fragmented segments and limit the long-form context. Therefore, less aggressive variants are evaluated by relaxing these thresholds to (0.3, 0.2, 0.3) and (0.2, 0.1, 0.2), encouraging the detector to preserve more continuous speech regions and produce longer segments with broader context.

The third strategy adopts a hybrid segmentation approach inspired by Gaido et al. ([2021](https://arxiv.org/html/2606.26819#bib.bib21)); Potapczyk and Przybysz ([2020](https://arxiv.org/html/2606.26819#bib.bib22)), combining pause-based segmentation with a duration constraint. The double-check step in the original pipeline is replaced with a divide-and-conquer algorithm, as shown in Algorithm [1](https://arxiv.org/html/2606.26819#alg1). For segments exceeding the maximum allowed duration, the algorithm identifies the longest pause within the current interval and recursively splits the segment at that point. If no valid pause is found, the segment is kept unchanged.
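A direct Python transcription of Algorithm 1 is given below as a sketch. The recursion follows the listing step by step, but `find_longest_pause` is not specified in the paper: here it is assumed that a valid pause is any (start, end) pair with positive length lying inside the current interval, and ties are resolved by list order.

```python
def find_longest_pause(pauses, left, right):
    """Return the longest pause inside [left, right], or None if there is none.

    Assumption: a pause is valid when left <= start < end <= right.
    """
    inside = [(s, e) for s, e in pauses if left <= s < e <= right]
    return max(inside, key=lambda p: p[1] - p[0], default=None)


def divide_and_conquer(pauses, left, right, max_dur):
    """Recursively split [left, right] at its longest pause until segments fit max_dur."""
    if right <= left:
        return []
    if right - left <= max_dur:
        return [(left, right)]
    pause = find_longest_pause(pauses, left, right)
    if pause is None:
        return [(left, right)]  # no valid pause: keep the long segment unchanged
    p_start, p_end = pause
    segments = []
    if left < p_start:
        segments += divide_and_conquer(pauses, left, p_start, max_dur)
    if p_end < right:
        segments += divide_and_conquer(pauses, p_end, right, max_dur)
    return segments


if __name__ == "__main__":
    pauses = [(28.0, 29.0), (55.0, 57.0), (61.0, 61.5)]
    print(divide_and_conquer(pauses, 0.0, 90.0, 30.0))
```

Because the chosen pause is excluded from both sub-intervals, each recursive call sees strictly fewer candidate pauses, so the recursion terminates.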

### 4.3 System Evaluation

The long-form system is evaluated from four perspectives: short-form capability retention, hallucination, the evaluation metric, and speech segmentation. The analysis is conducted on the MCIF long-form tasks, covering ASR, ST, SQA, and SSUM, with task-specific scores and hallucination statistics reported for a comprehensive assessment.

#### 4.3.1 Short-form

The long\-form model is also evaluated on the MCIF short\-form tasks to examine whether long\-form extension preserves short\-form capabilities\. As shown in Table[1](https://arxiv.org/html/2606.26819#S3.T1), the resulting contrastive system achieves a SIFS score of 2\.0291, lower than the primary short\-form score of 2\.0708 but still competitive\. This degradation may be partially explained by the training setup: to reduce computational cost, only LibriSQA short\-form data are used instead of the full short\-form training set\. Overall, these results suggest that short\-form capabilities can largely be retained after long\-form extension\.

#### 4.3.2 Hallucination

Hallucination has a substantial impact on long-form evaluation, particularly for ASR and SSUM, and can significantly distort the apparent effectiveness of a segmentation strategy. As shown in Table [8](https://arxiv.org/html/2606.26819#A1.T8), under CRDNN segmentation, including hallucinated outputs increases ASR WER from 0.1580 to 1.0161 for English, while also reducing ST COMET and SSUM BERTScore across target languages. In contrast, SQA is less affected, with only small differences in BERTScore after hallucinated outputs are included. This suggests that hallucination introduces task-dependent degradation, with generation-heavy tasks being the most vulnerable.

The ASR error breakdown in Table [2](https://arxiv.org/html/2606.26819#S4.T2) further shows that hallucination mainly increases insertion errors. For CRDNN segmentation, the number of insertions rises sharply from 271 to 13,969 when hallucinated outputs are included, leading to a much higher WER. A similar but smaller trend is observed for the hybrid strategy, where insertions increase from 312 to 4,841. Manual inspection of the model outputs shows substantial hallucinated repetitions, which explains why excessive insertions become the dominant source of degradation. Inspired by the post-editing strategy in prior work Koneru et al. ([2025](https://arxiv.org/html/2606.26819#bib.bib23)), an alternative regular-expression-based post-editing strategy is applied to remove hallucinated repetitions and improve the reliability of long-form outputs before evaluation.
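The exact regular expression is not given in the paper. The sketch below shows one plausible form of such post-editing, using a backreference to collapse a phrase that is repeated several times in a row. The pattern, the thresholds, and the function name are all assumptions made for illustration.

```python
import re


def collapse_repetitions(text: str, max_phrase_words: int = 10, min_repeats: int = 3) -> str:
    """Collapse a phrase repeated at least `min_repeats` times in a row into one copy.

    Assumptions: a phrase is 1 to `max_phrase_words` word tokens, and copies are
    separated by at least one non-word character. Text without separators between
    repetitions (for example, unsegmented Chinese) is not handled by this sketch.
    """
    pattern = re.compile(
        r"(\b\w+(?:\W+\w+){0,%d}?)(?:\W+\1\b){%d,}" % (max_phrase_words - 1, min_repeats - 1),
        flags=re.IGNORECASE,
    )
    return pattern.sub(r"\1", text)


if __name__ == "__main__":
    noisy = "thank you thank you thank you thank you for listening"
    print(collapse_repetitions(noisy))
```

Requiring at least three consecutive copies is a conservative choice that leaves ordinary emphasis such as a doubled word untouched, at the cost of missing short hallucinated loops.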

#### 4\.3\.3 Evaluation Metric

The Hallucination\-Penalized Instruction\-Following Score \(HIFS\) is computed by applying a hallucination penalty to each available task score and then summing the task\-level averages, as shown in Equation [2](https://arxiv.org/html/2606.26819#S4.E2)\. Here, $\mathcal{T}$ denotes the set of evaluated tasks, including ASR, ST, SQA, and SSUM; $\mathcal{L}_t$ denotes the set of language pairs with valid results for task $t$; $m_{\ell,t}$ denotes the task\-specific metric score; and $h_{\ell,t}/n_{\ell,t}$ denotes the corresponding hallucination rate\. For ASR, $m_{\ell,t}$ is computed as $1-\mathrm{WER}$\.

$$\mathrm{HIFS}=\sum_{t\in\mathcal{T}}\frac{1}{|\mathcal{L}_t|}\sum_{\ell\in\mathcal{L}_t}m_{\ell,t}\left(1-\frac{h_{\ell,t}}{n_{\ell,t}}\right). \tag{2}$$
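As a sanity check on Equation 2, the Python sketch below recomputes HIFS for the fixed 30-second window from the per-language scores and HAL counts reported in Table 9. The dictionary layout and the function name `hifs` are illustrative choices, not the authors' code, and the inputs are the rounded table values, so agreement is only expected to about four decimal places.

```python
from statistics import mean

# (metric score m, hallucinated outputs h, evaluated samples n) per language pair,
# copied from the 30-second rows of Table 9. ASR is stored as accuracy = 1 - WER.
RESULTS_30S = {
    "ASR": {"en-en": (0.8582, 0, 21)},
    "ST": {
        "en-de": (0.7193, 3, 21),
        "en-it": (0.7546, 1, 21),
        "en-zh": (0.7365, 4, 21),
    },
    "SQA": {
        "en-en": (0.3755, 1, 220),
        "en-de": (0.3699, 0, 220),
        "en-it": (0.3356, 4, 220),
        "en-zh": (0.3937, 4, 220),
    },
    "SSUM": {
        "en-en": (0.1850, 7, 100),
        "en-de": (0.2008, 25, 100),
        "en-it": (0.2203, 31, 100),
        "en-zh": (0.3586, 10, 100),
    },
}


def hifs(results):
    """Sum over tasks of the mean hallucination-penalized score across language pairs."""
    total = 0.0
    for per_language in results.values():
        penalized = [m * (1 - h / n) for m, h, n in per_language.values()]
        total += mean(penalized)
    return total


score = hifs(RESULTS_30S)
print(f"{score:.4f}")  # 2.0663
```

The per-task terms of the sum correspond to the "score" rows of Table 9; for example, the ST term evaluates to roughly 0.6438.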

#### 4\.3\.4 Speech Segmentation

Three speech segmentation strategies are compared for long\-form inference: fixed\-window segmentation, CRDNN\-based segmentation, and hybrid segmentation\. For fixed\-window segmentation, Table [9](https://arxiv.org/html/2606.26819#A1.T9) shows that the 30\-second window achieves the best overall performance, with the highest HIFS score of 2\.0663\. The best hallucination\-penalized ASR accuracy and ST COMET score are also obtained with this setting, while relatively low hallucination rates are maintained\. Although slightly better SQA and SSUM scores are achieved by the 15\-second and 45\-second windows, respectively, these task\-specific gains do not lead to the best overall performance\. This suggests that a 30\-second window provides a better trade\-off between preserving sufficient context and avoiding unstable generation\.

Adaptive segmentation is further evaluated using CRDNN\-based VAD with different parameter settings\. As shown in Table [10](https://arxiv.org/html/2606.26819#A1.T10), the $(0.5, 0.25, 0.5)$ setting achieves the highest HIFS score of 1\.9591, mainly due to its stronger ST performance\. In comparison, better SQA and SSUM scores and lower hallucination rates are achieved by the $(0.3, 0.2, 0.3)$ setting, but its lower ST score results in a slightly lower overall score\. These results indicate that CRDNN segmentation is sensitive to its parameter settings, and that optimization for one task does not necessarily yield the best overall long\-form performance\.

Finally, hybrid segmentation, which combines fixed\-window and VAD\-based segmentation, is evaluated in Table [11](https://arxiv.org/html/2606.26819#A1.T11), and the scores are visualized in Figure [2\(b\)](https://arxiv.org/html/2606.26819#S2.F2.sf2)\. The $(0.3, 0.2, 0.3)$ setting performs best among hybrid configurations, reaching an HIFS score of 2\.0139\. Compared with CRDNN segmentation, the overall score is improved, and the best ASR accuracy and SQA BERTScore are obtained among hybrid settings\. However, this strategy remains below the fixed 30\-second window in overall performance\. These results suggest that adaptive segmentation can be beneficial but requires careful tuning, while the fixed 30\-second window provides the most robust segmentation strategy for long\-form inference\.
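For concreteness, fixed-window segmentation can be sketched as follows. This is a minimal illustration that assumes consecutive, non-overlapping windows with a shorter final window; whether the submitted system overlaps windows or pads the last one is not stated in this section, and the function name `fixed_windows` is hypothetical.

```python
def fixed_windows(duration_s: float, window_s: float = 30.0) -> list[tuple[float, float]]:
    """Split [0, duration_s) into consecutive, non-overlapping (start, end) windows."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    segments = []
    start = 0.0
    while start < duration_s:
        # The final window is truncated at the end of the recording.
        end = min(start + window_s, duration_s)
        segments.append((start, end))
        start = end
    return segments


print(fixed_windows(75.0))  # [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```

Each window would then be decoded independently and the outputs concatenated, which is why the window length trades available context against the risk of unstable generation.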

### 4\.4 System Submission

Based on the segmentation analysis, three systems are selected for submission: fixed 30\-second segmentation as the primary system, and CRDNN\-based and hybrid segmentation as contrastive systems\. As shown in Table [3](https://arxiv.org/html/2606.26819#S4.T3), the primary system achieves the best performance, with the highest HIFS score of 2\.0663, as well as the best ASR accuracy and ST COMET score\. Among the contrastive systems, CRDNN obtains the best SQA and SSUM scores, while hybrid segmentation provides a stronger balance with an HIFS score of 2\.0139\.

Table [7](https://arxiv.org/html/2606.26819#A1.T7) shows the IWSLT 2026 long\-form results \(Adelani et al\., [2026](https://arxiv.org/html/2606.26819#bib.bib26)\)\. No segmentation method consistently performs best across all tasks and metrics\. Hybrid segmentation provides the most evident ASR improvement, whereas gains on other tasks are less stable\. SSUM and ACHAP do not function reliably in this setting, with scores remaining low or degenerate across segmentation methods\. These results highlight the difficulty of extending short\-form to long\-form speech processing\.

## 5 Conclusion

This paper presented SpeechLLM systems for short\-form and long\-form instruction following\. Strong short\-form performance was achieved, while long\-form performance remained sensitive to segmentation and hallucination\. Fixed 30\-second segmentation emerged as the most robust strategy, while hallucination mainly occurred as repetitive insertions that degraded ASR and SSUM\. Long\-form extension preserved much of the short\-form capability, but better segmentation and hallucination mitigation remain important future directions\.

## Limitations

This work has several limitations\. First, only one epoch of long\-form training is completed due to time and computational constraints, which may limit adaptation to long\-form speech\. Second, using only a subset of short\-form data during long\-form extension may contribute to short\-form degradation\. Third, only three SpeechBrain VAD parameters are explored, leaving broader hyperparameter tuning and discourse\-aware segmentation for future work\. Finally, regular\-expression\-based post\-editing provides limited hallucination mitigation, targeting surface\-level repetitions but missing subtler hallucinations or semantic inconsistencies\.

## Acknowledgments

This paper has received funding from the European Union’s Horizon Europe programme under grant agreement No\. 101213369 \(project DVPS\), and from the Digital Europe Programme \(DIGITAL\) under grant agreement No\. 101198470 \(project LLMs4EU\)\. This work was also supported by the CINECA ISCRA\-B programme through project LoST \(Long\-form Speech\-to\-Text Models via Sequence Compression\), which provided access to high\-performance computing resources and support\.

## References

- Findings of the IWSLT 2025 Evaluation Campaign\. In Proc\. of IWSLT\.
- D\. I\. Adelani, V\. Agostinelli, A\. Anastasopoulos, L\. Bentivogli, O\. Bojar, S\. Bratières, M\. Carpuat, R\. Cattoni, M\. Cettolo, L\. Chen, M\. Federico, M\. Gaido, M\. Gupta, H\. Han, A\. Hatami, D\. Javorský, Y\. Jeon, M\. Kasztelnik, A\. Laurent, D\. Liu, N\. Luu, M\. Ma, D\. Macháček, M\. Maltais, E\. Matusov, C\. K\. Maurya, J\. P\. McCrae, C\. Meng, M\. Mohammad, Y\. Moslem, K\. Murray, S\. Nakamura, M\. Negri, J\. Niehues, A\. Kr\. Ojha, J\. Ortega, S\. Ouyang, S\. Papi, P\. Polák, F\. Retkowski, B\. Savoldi, C\. Sikasote, M\. Sperber, S\. Stüker, K\. Sudoh, M\. Tahon, M\. Turchi, A\. Waibel, P\. Wilken, R\. Zevallos, V\. Zouhar, and M\. Züfle \(2026\) Speech translation and metrics in 2026: findings of the IWSLT campaign\. In Proc\. of IWSLT\.
- L\. Barrault et al\. \(2023\) SeamlessM4T: Massively Multilingual & Multimodal Machine Translation\. In Proc\. of WMT\.
- X\. Chen, S\. Zhang, Q\. Bai, K\. Chen, and S\. Nakamura \(2024\) LLaST: Improved End\-to\-end Speech Translation System Leveraged by Large Language Models\. arXiv\.
- Z\. Dai, Z\. Yang, Y\. Yang, J\. Carbonell, Q\. Le, and R\. Salakhutdinov \(2019\) Transformer\-XL: Attentive Language Models beyond a Fixed\-Length Context\. In Proc\. of ACL\.
- Y\. Fathullah, C\. Wu, E\. Lakomkin, K\. Li, J\. Jia, Y\. Shangguan, J\. Mahadeokar, O\. Kalinli, C\. Fuegen, and M\. Seltzer \(2024\) AudioChatLlama: Towards General\-Purpose Speech Abilities for LLMs\. In Proc\. of NAACL\.
- J\. D\. Fox, D\. Raj, N\. Delworth, Q\. McNamara, C\. Miller, and M\. Jetté \(2024\) Updated Corpora and Benchmarks for Long\-Form Speech Recognition\. In Proc\. of ICASSP\.
- M\. Gaido, M\. Negri, M\. Cettolo, and M\. Turchi \(2021\) Beyond Voice Activity Detection: Hybrid Audio Segmentation for Direct Speech Translation\. In Proc\. of ICNLSP\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2021\) LoRA: Low\-Rank Adaptation of Large Language Models\. In Proc\. of ICLR\.
- C\. Huang, H\. Lu, H\. Gong, H\. Inaguma, I\. Kulikov, R\. Mavlyutov, and S\. Popuri \(2024\) Investigating Decoder\-only Large Language Models for Speech\-to\-text Translation\. In Proc\. of Interspeech\.
- J\. Iranzo\-Sánchez, J\. A\. Silvestre\-Cerdà, J\. Jorge, N\. Roselló, A\. Giménez, A\. Sanchis, J\. Civera, and A\. Juan \(2020\) Europarl\-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates\. In Proc\. of Interspeech\.
- N\. R\. Koluguri, S\. Kriman, G\. Zelenfroind, S\. Majumdar, D\. Rekesh, V\. Noroozi, J\. Balam, and B\. Ginsburg \(2023\) Investigating End\-to\-End ASR Architectures for Long Form Audio Transcription\. In Proc\. of Interspeech\.
- S\. Koneru, M\. Züfle, T\. Nguyen, S\. Akti, J\. Niehues, and A\. Waibel \(2025\) KIT’s Offline Speech Translation and Instruction Following Submission for IWSLT 2025\. In Proc\. of IWSLT\.
- T\. K\. Lam, M\. Gaido, S\. Papi, L\. Bentivogli, and B\. Haddow \(2025\) Prepending or Cross\-Attention for Speech\-to\-Text? An Empirical Comparison\. In Proc\. of NAACL\.
- B\. Lee, M\. Z\. Boito, L\. Besacier, and I\. Calapodescu \(2025\) NAVER LABS Europe Submission to the Instruction\-following Track\. In Proc\. of IWSLT\.
- Microsoft \(2025\) Phi\-4\-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture\-of\-LoRAs\. arXiv\.
- S\. Papi, M\. Züfle, M\. Gaido, B\. Savoldi, D\. Liu, I\. Douros, L\. Bentivogli, and J\. Niehues \(2026\) MCIF: Multimodal Crosslingual Instruction\-Following Benchmark from Scientific Talks\. In Proc\. of ICLR\.
- T\. Potapczyk and P\. Przybysz \(2020\) SRPOL’s System for the IWSLT 2020 End\-to\-End Speech Translation Task\. In Proc\. of IWSLT\.
- M\. Ravanelli, T\. Parcollet, P\. Plantinga, A\. Rouhe, S\. Cornell, L\. Lugosch, C\. Subakan, N\. Dawalatabad, A\. Heba, J\. Zhong, J\. Chou, S\. Yeh, S\. Fu, C\. Liao, E\. Rastorgueva, F\. Grondin, W\. Aris, H\. Na, Y\. Gao, R\. D\. Mori, and Y\. Bengio \(2021\) SpeechBrain: A General\-Purpose Speech Toolkit\. arXiv\.
- R\. Rei, J\. G\. C\. de Souza, D\. Alves, C\. Zerva, A\. C\. Farinha, T\. Glushkova, A\. Lavie, L\. Coheur, and A\. F\. T\. Martins \(2022\) COMET\-22: Unbabel\-IST 2022 Submission for the Metrics Shared Task\. In Proc\. of WMT\.
- F\. Retkowski and A\. Waibel \(2024\) From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions\. In Proc\. of EACL\.
- C\. Wang, A\. Wu, and J\. Pino \(2020\) CoVoST 2 and Massively Multilingual Speech\-to\-Text Translation\. arXiv\.
- R\. Ye, C\. Zhao, T\. Ko, C\. Meng, T\. Wang, M\. Wang, and J\. Cao \(2023\) GigaST: A 10,000\-hour Pseudo Speech Translation Corpus\. arXiv\.
- Z\. Zhao, Y\. Jiang, H\. Liu, Y\. Wang, and Y\. Wang \(2024\) LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models\. Trans\. on Artificial Intelligence\.
- M\. Züfle, S\. Papi, B\. Savoldi, M\. Gaido, L\. Bentivogli, and J\. Niehues \(2025\) NUTSHELL: A Dataset for Abstract Generation from Scientific Talks\. In Proc\. of IWSLT\.

## Appendix A

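| Dataset | Task | Language | Samples | Total | Mean |
|---|---|---|---|---|---|
| CoVoST | ASR (O) | en-en | 289,412 | 429.61h | 5.34s |
| | ST (O) | en-de | 289,412 | 429.61h | 5.34s |
| | ST (S) | en-it | 222,604 | 333.08h | 5.39s |
| | ST (O) | en-zh | 289,412 | 429.61h | 5.34s |
| | SQA (S) | en | 303,809 | 450.90h | 5.34s |
| | SQA (S) | de | 300,165 | 444.59h | 5.33s |
| | SQA (S) | it | 231,479 | 345.74h | 5.38s |
| | SQA (S) | zh | 303,735 | 450.83h | 5.34s |
| EuroParlST | ASR (O) | en-en | 62,180 | 151.56h | 8.61s |
| | ST (O) | en-de | 32,628 | 77.17h | 8.51s |
| | ST (O) | en-it | 29,552 | 74.39h | 9.06s |
| GigaST | ASR (O) | en-en | 909,728 | 997.96h | 3.95s |
| | ST (O) | en-de | 570,183 | 650.08h | 4.10s |
| | ST (O) | en-zh | 444,506 | 489.43h | 3.96s |
| LibriSQA | ASR (O) | en-en | 109,214 | 381.82h | 12.59s |
| | ST (S) | en-de | 86,816 | 294.69h | 12.22s |
| | ST (S) | en-it | 85,072 | 287.66h | 12.17s |
| | ST (S) | en-zh | 107,982 | 376.87h | 12.56s |
| | SQA (O) | en | 109,214 | 381.82h | 12.59s |
| | SQA (S) | de | 86,816 | 294.69h | 12.22s |
| | SQA (S) | it | 85,072 | 287.66h | 12.17s |
| | SQA (S) | zh | 107,982 | 376.87h | 12.56s |
| Total | - | - | 5,056,973 | 8,436.64h | 6.01s |

- \(O\) Original data provided by the dataset\.
- \(S\) Synthetic data generated by the models\.

Table 4: Statistics of the training corpora for the IWSLT short track under constrained settings\.

| Dataset | Task | Language | Samples | Total | Mean |
|---|---|---|---|---|---|
| LibriSQA | Short | - | 778,168 | 2,682.08h | 12.41s |
| | ASR (O) | en-en | 1,502 | 207.60h | 497.58s |
| | ST (S) | en-de | 1,786 | 209.18h | 421.64s |
| | ST (S) | en-it | 1,808 | 207.73h | 413.63s |
| | ST (S) | en-zh | 1,515 | 207.47h | 493.00s |
| | SQA (O) | en | 1,502 | 207.60h | 497.58s |
| | SQA (S) | de | 1,786 | 209.18h | 421.64s |
| | SQA (S) | it | 1,808 | 207.73h | 413.63s |
| | SQA (S) | zh | 1,515 | 207.47h | 493.00s |
| Nutshell | ASR (O) | en-en | 3,139 | 501.94h | 575.66s |
| | ST (S) | en-de | 2,363 | 374.70h | 570.85s |
| | ST (S) | en-it | 2,675 | 426.57h | 574.07s |
| | ST (S) | en-zh | 2,371 | 379.01h | 575.46s |
| YTSeg | ASR (O) | en-en | 6,243 | 694.63h | 400.56s |
| | ST (S) | en-de | 6,091 | 679.68h | 401.72s |
| | ST (S) | en-it | 5,921 | 662.75h | 402.95s |
| | ST (S) | en-zh | 5,063 | 570.96h | 405.98s |
| Total | - | - | 825,256 | 9,496.36h | 41.43s |

- \(O\) Original data provided by the dataset\.
- \(S\) Synthetic data generated by the models\.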

Table 5: Statistics of the training corpora for the IWSLT long track under constrained settings\.

| Submission (EN-EN) | ASR.WER | SQA.BERTScore | QE.Accuracy | QE.Format.Accuracy |
|---|---|---|---|---|
| Primary | 0.123 | 0.507 | - | - |
| Contrastive | 0.145 | 0.505 | - | - |

| Submission (EN-DE) | ST.COMET | SQA.BERTScore | QE.Accuracy | QE.Format.Accuracy |
|---|---|---|---|---|
| Primary | 0.762 | 0.477 | 0.762 | 0.974 |
| Contrastive | 0.739 | 0.505 | 0.501 | 0.584 |

| Submission (EN-IT) | ST.COMET | SQA.BERTScore | QE.Accuracy | QE.Format.Accuracy |
|---|---|---|---|---|
| Primary | 0.742 | 0.524 | - | - |
| Contrastive | 0.733 | 0.527 | - | - |

| Submission (EN-ZH) | ST.COMET | SQA.BERTScore | QE.Accuracy | QE.Format.Accuracy |
|---|---|---|---|---|
| Primary | 0.777 | 0.520 | 0.915 | 0.961 |
| Contrastive | 0.734 | 0.482 | 0.658 | 0.819 |

Table 6: Performance of the primary and contrastive submissions on the IWSLT 2026 short\-form tasks\.

| Submission (EN-EN) | ASR WER | SQA BERTScore | SSUM BERTScore | QE Accuracy | QE.Format Accuracy | ACHAP WER | ACHAP CollarF1 | ACHAP.GC BERTScore | ACHAP.TM BERTScore | ACHAP.TM Matched |
|---|---|---|---|---|---|---|---|---|---|---|
| Fixed | 0.196 | 0.390 | 0.152 | - | - | 0.359 | 0.000 (0.183)† | 0.804 (0.823)† | - | - |
| CRDNN | 0.175 | 0.377 | 0.160 | - | - | 0.200 | 0.000 (0.271)† | 0.803 (0.842)† | - | - |
| Hybrid | 0.126 | 0.377 | 0.156 | - | - | 0.200 | 0.000 (0.271)† | 0.803 (0.842)† | - | - |

| Submission (EN-DE) | ST COMET | SQA BERTScore | SSUM BERTScore | QE Accuracy | QE.Format Accuracy | ACHAP COMET | ACHAP CollarF1 | ACHAP.GC BERTScore | ACHAP.TM BERTScore | ACHAP.TM Matched |
|---|---|---|---|---|---|---|---|---|---|---|
| Fixed | 0.694 | 0.348 | 0.149 | 0.501 | 0.584 | 0.690 | 0.000 (0.195)† | 0.584 (0.618)† | 0.793 | 0.127 |
| CRDNN | 0.722 | 0.351 | 0.153 | 0.501 | 0.584 | 0.635 | 0.000 (0.136)† | 0.616 (0.640)† | 0.789 | 0.135 |
| Hybrid | 0.723 | 0.350 | 0.153 | 0.501 | 0.584 | 0.647 | 0.000 (0.165)† | 0.616 (0.650)† | 0.789 | 0.135 |

| Submission (EN-IT) | ST COMET | SQA BERTScore | SSUM BERTScore | QE Accuracy | QE.Format Accuracy | ACHAP COMET | ACHAP CollarF1 | ACHAP.GC BERTScore | ACHAP.TM BERTScore | ACHAP.TM Matched |
|---|---|---|---|---|---|---|---|---|---|---|
| Fixed | 0.707 | 0.385 | 0.185 | - | - | 0.723 | 0.000 (0.335)† | 0.532 (0.597)† | 0.773 | 0.123 |
| CRDNN | 0.702 | 0.376 | 0.171 | - | - | 0.719 | 0.000 (0.348)† | 0.586 (0.660)† | 0.790 | 0.135 |
| Hybrid | 0.695 | 0.373 | 0.174 | - | - | 0.735 | 0.000 (0.315)† | 0.528 (0.665)† | 0.767 | 0.123 |

| Submission (EN-ZH) | ST COMET | SQA BERTScore | SSUM BERTScore | QE Accuracy | QE.Format Accuracy | ACHAP COMET | ACHAP CollarF1 | ACHAP.GC BERTScore | ACHAP.TM BERTScore | ACHAP.TM Matched |
|---|---|---|---|---|---|---|---|---|---|---|
| Fixed | 0.655 | 0.380 | 0.319 | 0.658 | 0.819 | 0.681 | 0.000 (0.073)† | 0.496 (0.518)† | 0.757 | 0.135 |
| CRDNN | 0.699 | 0.374 | 0.325 | 0.658 | 0.819 | 0.698 | 0.000 (0.071)† | 0.496 (0.522)† | 0.757 | 0.135 |
| Hybrid | 0.685 | 0.373 | 0.325 | 0.658 | 0.819 | 0.683 | 0.000 (0.021)† | 0.496 (0.506)† | 0.757 | 0.135 |

- † Values in parentheses are obtained under a relaxed Markdown\-format evaluation\.

Table 7: Performance of the primary and contrastive submissions on the IWSLT 2026 long\-form tasks\.

| CRDNN | Language | ASR WER | ASR HAL | ST COMET | ST HAL | SQA BERTScore | SQA HAL | SUM BERTScore | SUM HAL |
|---|---|---|---|---|---|---|---|---|---|
| -Hallucination | en-en | 0.1580 | 2/21 | - | - | 0.3791 | 0/220 | 0.1860 | 16/100 |
| | en-de | - | - | 0.7248 | 2/21 | 0.3748 | 3/220 | 0.1861 | 29/100 |
| | en-it | - | - | 0.7707 | 2/21 | 0.3311 | 3/220 | 0.2351 | 38/100 |
| | en-zh | - | - | 0.7630 | 5/21 | 0.4009 | 3/220 | 0.3632 | 16/100 |
| +Hallucination | en-en | 1.0161 | 2/21 | - | - | 0.3791 | 0/220 | 0.1643 | 16/100 |
| | en-de | - | - | 0.6936 | 2/21 | 0.3684 | 3/220 | 0.1476 | 29/100 |
| | en-it | - | - | 0.7480 | 2/21 | 0.3241 | 3/220 | 0.1712 | 38/100 |
| | en-zh | - | - | 0.6841 | 5/21 | 0.3948 | 3/220 | 0.3473 | 16/100 |
| $\lvert\mathrm{diff}\rvert$ | - | 0.8581 | - | 0.1328 | - | 0.0195 | - | 0.1400 | - |

Table 8: Performance of CRDNN segmentation \+/\- hallucination with $(0.5, 0.25, 0.5)$ on the MCIF long\-form tasks\.

| Window | Language | ASR Accuracy | ASR HAL | ST COMET | ST HAL | SQA BERTScore | SQA HAL | SUM BERTScore | SUM HAL | HIFS |
|---|---|---|---|---|---|---|---|---|---|---|
| 15s | en-en | 0.8737 | 3/21 | - | - | 0.3800 | 1/220 | 0.1929 | 12/100 | - |
| | en-de | - | - | 0.7124 | 3/21 | 0.3756 | 4/220 | 0.1884 | 26/100 | - |
| | en-it | - | - | 0.7667 | 3/21 | 0.3496 | 0/220 | 0.2252 | 27/100 | - |
| | en-zh | - | - | 0.7595 | 4/21 | 0.3927 | 3/220 | 0.3578 | 9/100 | - |
| | score | 0.7489 | 3/21 | 0.6275 | 10/63 | 0.3710 | 8/440 | 0.1998 | 74/400 | 1.9472 |
| 30s | en-en | 0.8582 | 0/21 | - | - | 0.3755 | 1/220 | 0.1850 | 7/100 | - |
| | en-de | - | - | 0.7193 | 3/21 | 0.3699 | 0/220 | 0.2008 | 25/100 | - |
| | en-it | - | - | 0.7546 | 1/21 | 0.3356 | 4/220 | 0.2203 | 31/100 | - |
| | en-zh | - | - | 0.7365 | 4/21 | 0.3937 | 4/220 | 0.3586 | 10/100 | - |
| | score | 0.8582 | 0/21 | 0.6438 | 8/63 | 0.3649 | 9/440 | 0.1993 | 73/400 | 2.0663 |
| 45s | en-en | 0.8762 | 2/21 | - | - | 0.3795 | 1/220 | 0.1917 | 13/100 | - |
| | en-de | - | - | 0.6636 | 5/21 | 0.3664 | 1/220 | 0.1917 | 18/100 | - |
| | en-it | - | - | 0.7557 | 2/21 | 0.3369 | 3/220 | 0.2402 | 34/100 | - |
| | en-zh | - | - | 0.7608 | 7/21 | 0.3962 | 6/220 | 0.3623 | 5/100 | - |
| | score | 0.7928 | 2/21 | 0.5655 | 14/63 | 0.3651 | 11/440 | 0.2067 | 70/400 | 1.9300 |
| 60s | en-en | 0.8679 | 2/21 | - | - | 0.3698 | 0/220 | 0.1850 | 8/100 | - |
| | en-de | - | - | 0.7067 | 2/21 | 0.3573 | 2/220 | 0.1994 | 27/100 | - |
| | en-it | - | - | 0.7611 | 4/21 | 0.3266 | 1/220 | 0.2377 | 38/100 | - |
| | en-zh | - | - | 0.7529 | 5/21 | 0.3956 | 6/220 | 0.3562 | 7/100 | - |
| | score | 0.7852 | 2/21 | 0.6097 | 11/63 | 0.3584 | 9/440 | 0.1986 | 80/400 | 1.9520 |

Table 9: Performance of different window sizes on the MCIF long\-form tasks, together with hallucination \(HAL\) statistics, where the 30\-second window achieves the best overall performance\.

| Parameters | Language | ASR Accuracy | ASR HAL | ST COMET | ST HAL | SQA BERTScore | SQA HAL | SUM BERTScore | SUM HAL | HIFS |
|---|---|---|---|---|---|---|---|---|---|---|
| (0.5, 0.25, 0.5) | en-en | 0.8420 | 2/21 | - | - | 0.3791 | 0/220 | 0.1860 | 16/100 | - |
| | en-de | - | - | 0.7248 | 2/21 | 0.3748 | 3/220 | 0.1861 | 29/100 | - |
| | en-it | - | - | 0.7707 | 2/21 | 0.3311 | 3/220 | 0.2351 | 38/100 | - |
| | en-zh | - | - | 0.7630 | 5/21 | 0.4009 | 3/220 | 0.3632 | 16/100 | - |
| | score | 0.7618 | 2/21 | 0.6448 | 9/63 | 0.3677 | 9/440 | 0.1848 | 99/400 | 1.9591 |
| (0.3, 0.2, 0.3) | en-en | 0.8421 | 2/21 | - | - | 0.3755 | 0/220 | 0.1932 | 11/100 | - |
| | en-de | - | - | 0.7035 | 2/21 | 0.3795 | 0/220 | 0.1913 | 21/100 | - |
| | en-it | - | - | 0.7749 | 5/21 | 0.3320 | 2/220 | 0.2205 | 28/100 | - |
| | en-zh | - | - | 0.7696 | 5/21 | 0.4041 | 2/220 | 0.3573 | 2/100 | - |
| | score | 0.7619 | 2/21 | 0.6044 | 12/63 | 0.3711 | 4/880 | 0.2080 | 62/400 | 1.9454 |
| (0.2, 0.1, 0.2) | en-en | 0.8367 | 3/21 | - | - | 0.3666 | 1/220 | 0.2025 | 10/100 | - |
| | en-de | - | - | 0.7116 | 3/21 | 0.3683 | 0/220 | 0.2044 | 29/100 | - |
| | en-it | - | - | 0.7578 | 4/21 | 0.3331 | 0/220 | 0.2353 | 38/100 | - |
| | en-zh | - | - | 0.7727 | 5/21 | 0.4002 | 3/220 | 0.3547 | 8/100 | - |
| | score | 0.7172 | 3/21 | 0.6040 | 11/63 | 0.3653 | 9/440 | 0.1999 | 80/400 | 1.8864 |

Table 10: Performance of different parameters with the CRDNN segmentation on the MCIF long\-form tasks, together with hallucination \(HAL\) statistics, where the \(0\.5, 0\.25, 0\.5\) setting achieves the best overall performance\.

| Parameters | Language | ASR Accuracy | ASR HAL | ST COMET | ST HAL | SQA BERTScore | SQA HAL | SUM BERTScore | SUM HAL | HIFS |
|---|---|---|---|---|---|---|---|---|---|---|
| (0.5, 0.25, 0.5) | en-en | 0.8430 | 2/21 | - | - | 0.3807 | 0/220 | 0.1844 | 14/100 | - |
| | en-de | - | - | 0.7189 | 3/21 | 0.3747 | 3/220 | 0.1874 | 22/100 | - |
| | en-it | - | - | 0.7739 | 1/21 | 0.3332 | 4/220 | 0.2192 | 29/100 | - |
| | en-zh | - | - | 0.7612 | 5/21 | 0.3990 | 3/220 | 0.3620 | 16/100 | - |
| | score | 0.7627 | 2/21 | 0.6444 | 9/63 | 0.3677 | 10/440 | 0.1911 | 81/400 | 1.9660 |
| (0.3, 0.2, 0.3) | en-en | 0.8431 | 1/21 | - | - | 0.3728 | 0/220 | 0.1975 | 11/100 | - |
| | en-de | - | - | 0.7043 | 1/21 | 0.3747 | 0/220 | 0.2024 | 28/100 | - |
| | en-it | - | - | 0.7761 | 4/21 | 0.3333 | 3/220 | 0.2335 | 38/100 | - |
| | en-zh | - | - | 0.7678 | 4/21 | 0.4038 | 2/220 | 0.3545 | 4/100 | - |
| | score | 0.8030 | 1/21 | 0.6402 | 9/63 | 0.3691 | 5/880 | 0.2016 | 81/400 | 2.0139 |
| (0.2, 0.1, 0.2) | en-en | 0.7692 | 2/21 | - | - | 0.3668 | 0/220 | 0.1995 | 7/100 | - |
| | en-de | - | - | 0.6989 | 2/21 | 0.3701 | 3/220 | 0.1944 | 24/100 | - |
| | en-it | - | - | 0.7576 | 4/21 | 0.3346 | 1/220 | 0.2136 | 27/100 | - |
| | en-zh | - | - | 0.7734 | 5/21 | 0.3997 | 2/220 | 0.3529 | 6/100 | - |
| | score | 0.6959 | 2/21 | 0.6116 | 11/63 | 0.3652 | 6/440 | 0.2052 | 64/400 | 1.8781 |

Table 11: Performance of different parameters with the hybrid segmentation on the MCIF long\-form tasks, together with hallucination \(HAL\) statistics, where the \(0\.3, 0\.2, 0\.3\) setting achieves the best overall performance\.
