A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation


Summary

This paper proposes a practical evaluation method for long-form simultaneous speech-to-speech translation that uses ASR, forced alignment, and sentence embedding alignment to compute latency and quality metrics on continuous speech, overcoming limitations of prior approaches.

arXiv:2606.15059v1 Announce Type: new Abstract: Simultaneous speech-to-speech translation (SimulS2ST) enables real-time cross-lingual communication, but existing evaluation has focused largely on short or pre-segmented speech rather than long-form, continuous input. Prior approaches are difficult to reproduce and make assumptions that do not hold for end-to-end systems. We present a practical evaluation method for long-form SimulS2ST. Given source speech, pre-segmented source transcripts, and reference translations, we run automatic speech recognition (ASR) and forced alignment on the generated target speech to recover token-level timestamps, then apply a sentence-embedding-based aligner to match the target text to its corresponding source sentences. This enables sentence-level computation of latency and quality metrics, including YAAL and xCOMET, which are then aggregated into final system-level scores. Experiments on representative SimulS2ST systems show that the method is effective in practice and reveal that current systems suffer from substantial latency accumulation on long speech.

Cached at: 06/16/26, 11:44 AM

# A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation
Source: [https://arxiv.org/html/2606.15059](https://arxiv.org/html/2606.15059)
Yulin Xue, Siqi Ouyang, Lei Li. Carnegie Mellon University. {yulinx,siqiouya}@andrew.cmu.edu, leili@cs.cmu.edu

###### Abstract

Simultaneous speech\-to\-speech translation \(SimulS2ST\) enables real\-time cross\-lingual communication, but existing evaluation has focused largely on short or pre\-segmented speech rather than long\-form, continuous input\. Prior approaches are difficult to reproduce and make assumptions that do not hold for end\-to\-end systems\. We present a practical evaluation method for long\-form SimulS2ST\. Given source speech, pre\-segmented source transcripts, and reference translations, we run automatic speech recognition \(ASR\) and forced alignment on the generated target speech to recover token\-level timestamps, then apply a sentence\-embedding\-based aligner to match the target text to its corresponding source sentences\. This enables sentence\-level computation of latency and quality metrics, including YAAL and xCOMET, which are then aggregated into final system\-level scores\. Experiments on representative SimulS2ST systems show that the method is effective in practice and reveal that current systems suffer from substantial latency accumulation on long speech\.


## 1 Introduction

Simultaneous Speech-to-Speech Translation (SimulS2ST) translates streaming source speech into target-language speech in real time (Zheng et al., [2020](https://arxiv.org/html/2606.15059#bib.bib1)), enabling low-latency cross-lingual communication in scenarios such as multilingual conversations and international conferences. However, most prior work evaluates SimulS2ST on pre-segmented or short speech, even though real-world input, such as conference speech, is often continuous and may last for hours (Sudoh et al., [2020](https://arxiv.org/html/2606.15059#bib.bib2); Ma et al., [2022](https://arxiv.org/html/2606.15059#bib.bib3); Liu et al., [2022](https://arxiv.org/html/2606.15059#bib.bib4); Seamless Communication et al., [2023](https://arxiv.org/html/2606.15059#bib.bib5); Zhang et al., [2024](https://arxiv.org/html/2606.15059#bib.bib6)).

One early effort toward long-form SimulS2ST evaluation is Boundary-Aware Latency (pBAL), which segments target speech into sentences, applies forced alignment to recover target token timestamps, and computes latency based on these timestamps (Zheng et al., [2020](https://arxiv.org/html/2606.15059#bib.bib1)). This general paradigm is closely related to recent efforts in long-form simultaneous speech-to-text translation (SimulS2TT) evaluation (Papi et al., [2024](https://arxiv.org/html/2606.15059#bib.bib7); Polák et al., [2026](https://arxiv.org/html/2606.15059#bib.bib8)). However, pBAL has important practical limitations. First, it is not open-sourced, which makes it difficult to reproduce and adopt in subsequent research. Second, pBAL was designed for cascade systems comprising ASR, machine translation (MT), and text-to-speech (TTS), which introduces several limitations. In particular, it segments the target speech to align with the streaming ASR output of the source speech rather than with ground-truth source sentences, making the evaluation sensitive to source-side ASR errors. It also assumes access to target text for forced alignment, which is not available for some end-to-end (E2E) SimulS2ST systems (Labiausse et al., [2025](https://arxiv.org/html/2606.15059#bib.bib9)).

In this paper, we propose a practical evaluation method for long-form SimulS2ST. We assume access to source speech, source transcripts pre-segmented into sentences, and their corresponding translation sentences. Given target speech produced by a SimulS2ST system, we first run ASR and forced alignment with state-of-the-art models to obtain target text with token-level timestamps. We then use the sentence-embedding-based method SEGALE (Wang et al., [2025](https://arxiv.org/html/2606.15059#bib.bib10)) to segment the target text into sentences aligned with the source sentences. Finally, for each aligned sentence, we compute standard latency metrics such as YAAL (Polák et al., [2026](https://arxiv.org/html/2606.15059#bib.bib8)) and quality metrics such as xCOMET (Guerreiro et al., [2024](https://arxiv.org/html/2606.15059#bib.bib11)), and average the sentence-level scores to obtain the final latency and quality scores. In our experiments, we evaluate several representative SimulS2ST systems with this method and analyze the quality of both ASR and sentence segmentation. We observe that even state-of-the-art systems exhibit latency accumulation on long speech. We will release the GitHub repository in the camera-ready version.

## 2 Related Works

#### Long-form simultaneous translation evaluation

Latency evaluation for simultaneous translation has traditionally been studied in pre-segmented settings, where the input speech is split into utterances prior to evaluation. StreamLAAL (Papi et al., [2024](https://arxiv.org/html/2606.15059#bib.bib7)) extends utterance-level evaluation to the long-form setting by first segmenting the hypothesis into utterances aligned with the reference translation sentences using mwerSegmenter (Matusov et al., [2005](https://arxiv.org/html/2606.15059#bib.bib12)), then computing latency for each aligned hypothesis utterance and its corresponding reference sentence. LongYAAL (Polák et al., [2026](https://arxiv.org/html/2606.15059#bib.bib8)) improves upon StreamLAAL by mitigating the structural bias in latency evaluation and introduces SoftSegmenter, which yields better segmentation and alignment than mwerSegmenter. These methods are designed for simultaneous speech-to-text translation, while our work extends them to the evaluation of simultaneous speech-to-speech translation.

#### Long-form machine translation evaluation

Another related line of work studies automatic evaluation for long-form machine translation. mwerSegmenter (Matusov et al., [2005](https://arxiv.org/html/2606.15059#bib.bib12)) aligns hypothesis and reference translation sentences by minimizing word error rate; however, it handles sentence boundaries poorly and often fails in cases of over- or under-translation. SEGALE (Wang et al., [2025](https://arxiv.org/html/2606.15059#bib.bib10)) improves upon mwerSegmenter by using a sentence boundary detector such as spaCy ([https://spacy.io/](https://spacy.io/)) to recover sentence boundaries and by correctly penalizing over- and under-translation. Our work leverages SEGALE as a more robust segmenter for the long-form hypothesis.

## 3 Method

In this section, we first introduce the formulation (Section [3.1](https://arxiv.org/html/2606.15059#S3.SS1)) and give an overview of the pipeline (Section 3.2). We then describe the ASR and forced alignment procedures (Section [3.3](https://arxiv.org/html/2606.15059#S3.SS3)), the target speech segmentation method (Section [3.4](https://arxiv.org/html/2606.15059#S3.SS4)), and the computation of the final latency and quality scores (Sections [3.5](https://arxiv.org/html/2606.15059#S3.SS5) and [3.6](https://arxiv.org/html/2606.15059#S3.SS6)).

### 3.1 Formulation

We define a long-form input speech stream as $\bm{s} = (\bm{x}_1, \bm{x}_2, \cdots, \bm{x}_n)$, where each $\bm{x}_i \in \mathbb{R}^{|\bm{x}_i|}$ denotes the speech waveform of the $i$-th sentence. Let $\bm{y}_i$ denote the reference text translation of sentence $i$. Given the input speech stream $\bm{s}$, a SimulS2ST system incrementally generates target speech $\hat{\bm{t}}$. We assume that the input and target speech streams are temporally aligned at the start, i.e., they share the same initial timestamp. The goal of the evaluation method is to compute latency and quality scores for the generated target speech $\hat{\bm{t}}$ given $\bm{s}$, $\bm{x}_{1:n}$, and $\bm{y}_{1:n}$.

### 3.2 Overview

At a high level, our evaluation pipeline consists of three stages\. First, given the target speech generated by a SimulS2ST system, we run ASR to obtain the target\-side text and apply forced alignment to recover token\-level timestamps on the target speech\. Second, following SEGALE, we segment the target text into sentences and align them with the source transcript sentences and their reference translations, producing sentence groups that may reflect one\-to\-one, one\-to\-many, many\-to\-one, many\-to\-many, or null alignments\. Finally, for each aligned group, we compute latency using existing metrics such as YAAL and translation quality using sentence\-level metrics such as xCOMET\. The group\-level scores are then averaged into final system\-level latency and quality scores\.

### 3.3 Transcribe with Timestamps

Given target speech $\hat{\bm{t}}$, we use state-of-the-art ASR and forced alignment models, Qwen3-ASR-1.7B and Qwen3-ForcedAligner-0.6B (Shi et al., [2026](https://arxiv.org/html/2606.15059#bib.bib13)), to transcribe it into text $\hat{\bm{y}} = (\hat{y}_1, \cdots, \hat{y}_{|\hat{\bm{y}}|})$ and obtain token-level timestamps $\bm{d} = (d_1, \cdots, d_{|\hat{\bm{y}}|})$, where $d_i$ denotes the end time of token $\hat{y}_i$.

For long-form speech, we process the input in a chunk-wise manner. We divide the target speech $\hat{\bm{t}}$ into $C$ consecutive chunks, each with a duration of 180 seconds,

$$\hat{\bm{t}} = (\hat{\bm{t}}^{(1)}, \hat{\bm{t}}^{(2)}, \ldots, \hat{\bm{t}}^{(C)}). \tag{1}$$

For each chunk $\hat{\bm{t}}^{(c)}$, the ASR model produces a partial transcription

$$\hat{\bm{y}}^{(c)} = (\hat{y}^{(c)}_1, \ldots, \hat{y}^{(c)}_{|\hat{\bm{y}}^{(c)}|}). \tag{2}$$

Forced alignment is then applied to each chunk using the corresponding audio and recognized text to produce a chunk-level timestamp sequence

$$\bm{d}^{(c)} = (d^{(c)}_1, \ldots, d^{(c)}_{|\hat{\bm{y}}^{(c)}|}), \tag{3}$$

where $d^{(c)}_i$ denotes the end time of token $\hat{y}^{(c)}_i$ within the $c$-th chunk. Let $o_c$ be the starting time offset of chunk $\hat{\bm{t}}^{(c)}$ in the original speech stream. We map chunk-level timestamps back to the global timeline by

$$\tilde{d}^{(c)}_i = d^{(c)}_i + o_c. \tag{4}$$

Finally, the full transcription and timestamp sequence are obtained by concatenating all chunk-level results:

$$\begin{aligned}
\hat{\bm{y}} &= \hat{\bm{y}}^{(1)} \oplus \cdots \oplus \hat{\bm{y}}^{(C)}, && (5)\\
\bm{d} &= \tilde{\bm{d}}^{(1)} \oplus \cdots \oplus \tilde{\bm{d}}^{(C)}. && (6)
\end{aligned}$$
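The chunk-wise bookkeeping in Eqs. 1 to 6 can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the ASR and forced-alignment models are left out, and the helper names `chunk_spans` and `merge_chunks` are hypothetical. Each chunk is assumed to have already been transcribed into tokens with chunk-local end times.

```python
from typing import List, Sequence, Tuple

CHUNK_SECONDS = 180.0  # chunk duration stated in the text


def chunk_spans(total_seconds: float,
                chunk_seconds: float = CHUNK_SECONDS) -> List[Tuple[float, float]]:
    """Return (start, end) times of consecutive chunks covering the stream."""
    spans = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        start = end
    return spans


def merge_chunks(chunk_results: Sequence[Tuple[Sequence[str], Sequence[float]]],
                 offsets: Sequence[float]) -> Tuple[List[str], List[float]]:
    """Concatenate per-chunk tokens and shift local end times by the chunk offset o_c."""
    tokens: List[str] = []
    end_times: List[float] = []
    for (chunk_tokens, local_times), o_c in zip(chunk_results, offsets):
        if len(chunk_tokens) != len(local_times):
            raise ValueError("each token needs exactly one end time")
        tokens.extend(chunk_tokens)                     # Eq. 5: concatenate text
        end_times.extend(d + o_c for d in local_times)  # Eq. 4 and Eq. 6
    return tokens, end_times


spans = chunk_spans(400.0)
offsets = [start for start, _ in spans]
tokens, times = merge_chunks(
    [(["a", "b"], [1.0, 2.5]), (["c"], [0.5]), ([], [])], offsets)
```

In this sketch the last chunk is simply shorter when the stream length is not a multiple of 180 seconds; the text does not say how the final chunk is handled, so that choice is an assumption.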

### 3.4 Robust Segmentation with SEGALE

We segment the target text $\hat{\bm{y}}$ into sentence-level units and align them with the source speech sentences $\bm{x}_{1:n}$ using SEGALE. We first split $\hat{\bm{y}}$ into sentences $\hat{\bm{y}}_{1:m}$ with spaCy ([https://spacy.io/](https://spacy.io/)). Given the source speech sentences $\bm{x}_{1:n}$, their reference translations $\bm{y}_{1:n}$, and the segmented target sentences $\hat{\bm{y}}_{1:m}$, SEGALE performs sentence alignment using Vecalign (Thompson and Koehn, [2020](https://arxiv.org/html/2606.15059#bib.bib14)) with an adaptive skip-penalty search strategy.

To support many-to-many alignment, SEGALE constructs candidate contiguous spans on both the source and target sides, rather than restricting alignment to individual sentences. Let $\bm{x}_{i:j} = \bm{x}_i \oplus \cdots \oplus \bm{x}_j$ denote a source span and $\hat{\bm{y}}_{p:q} = \hat{\bm{y}}_p \oplus \cdots \oplus \hat{\bm{y}}_q$ denote a target span, where $\oplus$ denotes concatenation. For each source span $\bm{x}_{i:j}$ and target span $\hat{\bm{y}}_{p:q}$, SEGALE computes an embedding-based matching cost, with lower cost assigned to more semantically similar spans. Vecalign then finds a monotonic alignment between the source and target sentence sequences while allowing null alignments on either side, controlled by a skip penalty $\beta_{\mathrm{skip}}$.

The skip penalty determines the trade-off between forcing matches and allowing deletions. A large $\beta_{\mathrm{skip}}$ makes skipping expensive, so the aligner prefers fewer null alignments and more forced matches; this usually keeps the null-alignment ratio (NA ratio) low but increases the average alignment cost because semantically weak pairs are more likely to be matched. In contrast, a small $\beta_{\mathrm{skip}}$ makes skipping cheap, so the aligner more readily leaves segments unmatched; this typically increases the NA ratio and decreases the average alignment cost, since high-cost pairs are skipped and only easier matches remain. Therefore, SEGALE adaptively searches over $\beta_{\mathrm{skip}}$: it starts from a relatively large value and progressively decreases it in small steps. Once the average alignment cost falls below a threshold or the NA ratio exceeds a threshold, SEGALE treats this as the onset of over-deletion and returns the alignment from the previous step.
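The stopping rule of the adaptive search can be sketched as a short loop. This is an illustration of the control flow described above, not SEGALE's actual code: `align_fn` is a hypothetical callable standing in for a Vecalign run at a given skip penalty, and the schedule and thresholds are placeholder values, since the text does not give the real ones.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

Alignment = TypeVar("Alignment")


def adaptive_skip_search(
    align_fn: Callable[[float], Tuple[Alignment, float, float]],
    betas: Sequence[float],
    cost_threshold: float,
    na_threshold: float,
) -> Optional[Alignment]:
    """Walk a decreasing skip-penalty schedule and stop at the onset of over-deletion.

    align_fn(beta) is assumed to return (alignment, average_cost, na_ratio).
    """
    previous: Optional[Alignment] = None
    for beta in betas:
        alignment, avg_cost, na_ratio = align_fn(beta)
        if avg_cost < cost_threshold or na_ratio > na_threshold:
            # Over-deletion detected: fall back to the alignment from the previous step.
            # If the very first step already triggers, there is nothing earlier to return.
            return previous if previous is not None else alignment
        previous = alignment
    return previous  # schedule exhausted without triggering either threshold


# Toy stand-in for the aligner: lower beta means more skips and lower average cost.
toy_stats = {
    4.0: ("A4", 0.9, 0.05),
    3.0: ("A3", 0.8, 0.10),
    2.0: ("A2", 0.5, 0.40),
    1.0: ("A1", 0.3, 0.70),
}
chosen = adaptive_skip_search(lambda b: toy_stats[b], [4.0, 3.0, 2.0, 1.0],
                              cost_threshold=0.6, na_threshold=0.5)
```

In the toy run, the average cost first drops below the threshold at a penalty of 2.0, so the alignment computed at 3.0 is the one returned.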

We denote the alignment output as

$$A = (A_1, \ldots, A_r), \tag{7}$$

where each alignment group is defined as

$$A_k = (X_k, Y_k, \hat{Y}_k). \tag{8}$$

Here, $X_k$ is a consecutive subset of source sentences from $\bm{x}_{1:n}$, $Y_k$ is a consecutive subset of reference translation sentences from $\bm{y}_{1:n}$, and $\hat{Y}_k$ is a consecutive subset of target sentences from $\hat{\bm{y}}_{1:m}$. SEGALE naturally handles both over-translation and under-translation. In the case of over-translation, some target sentences do not correspond to any source sentence, resulting in an empty $X_k$ and $Y_k$. In the case of under-translation, some source sentences do not correspond to any target sentence, resulting in an empty $\hat{Y}_k$. Such phenomena occur frequently in simultaneous translation, making this robustness important for long-form SimulS2ST evaluation.

### 3.5 Latency Computation

Given the alignment produced by SEGALE, we compute latency at the alignment-group level. For each group $A_k = (X_k, Y_k, \hat{Y}_k)$, let $T_k^{s}$ and $T_k^{e}$ denote the start and end times of the source span $X_k$, respectively. Let $\bm{d}_k = (d_1, \dots, d_{|\hat{Y}_k|})$ denote the token-level timestamps obtained by forced alignment for the target sentence group $\hat{Y}_k$.

We define the ideal delay of the $i$-th target token as

$$d_i^{*} = T_k^{s} + (i-1) \cdot \frac{T_k^{e} - T_k^{s}}{\max\{|Y_k|, |\hat{Y}_k|\}}, \tag{9}$$

where $|Y_k|$ and $|\hat{Y}_k|$ are the numbers of tokens in the reference and target sentence groups, respectively.

The latency of group $A_k$ is then computed as

$$l_k = \frac{1}{|\hat{Y}_k|} \sum_{i=1}^{|\hat{Y}_k|} (d_i - d_i^{*}). \tag{10}$$

Following LongYAAL (Polák et al., [2026](https://arxiv.org/html/2606.15059#bib.bib8)), we exclude the target tokens generated after the end of the full source stream $\bm{s}$.

Finally, we compute the long-form latency by averaging over all alignment groups:

$$\mathrm{Latency} = \frac{1}{r} \sum_{k=1}^{r} l_k. \tag{11}$$

For over-translation or under-translation cases, where $X_k = Y_k = \emptyset$ or $\hat{Y}_k = \emptyset$, we exclude these groups from latency computation, since latency is not well-defined without both source and target content.
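As a worked illustration of Eqs. 9 to 11, the following Python sketch computes the group latency from a source span and target token end times. The function names are hypothetical, null groups are represented by `None`, and the sketch does not implement the exclusion of tokens emitted after the end of the source stream, because the text does not specify how that interacts with the token counts.

```python
from typing import List, Optional, Sequence


def group_latency(src_start: float, src_end: float,
                  target_end_times: Sequence[float],
                  ref_len: int) -> Optional[float]:
    """Latency l_k of one alignment group (Eqs. 9 and 10).

    Returns None for groups without target tokens or without reference tokens,
    which are excluded from the latency average.
    """
    n = len(target_end_times)
    if n == 0 or ref_len == 0:
        return None
    # Ideal delays are spread evenly over the source span (Eq. 9).
    step = (src_end - src_start) / max(ref_len, n)
    # enumerate starts at 0, matching the (i - 1) factor of the 1-indexed formula.
    total = sum(d - (src_start + i * step)
                for i, d in enumerate(target_end_times))
    return total / n


def system_latency(group_latencies: Sequence[Optional[float]]) -> float:
    """Average over the groups that have a defined latency (Eq. 11)."""
    valid: List[float] = [l for l in group_latencies if l is not None]
    if not valid:
        raise ValueError("no alignment group has a defined latency")
    return sum(valid) / len(valid)


# Source span from 10 s to 14 s, four reference tokens, four target tokens.
l_k = group_latency(10.0, 14.0, [12.0, 13.0, 14.0, 15.0], ref_len=4)
overall = system_latency([l_k, None, 4.0])
```

In the example the ideal delays are 10, 11, 12, and 13 seconds, so every token is 2 seconds late and the group latency is 2 seconds.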

### 3.6 Quality Computation

Let $Q$ denote a sentence-level quality metric, such as COMET (Rei et al., [2020](https://arxiv.org/html/2606.15059#bib.bib15)) or MetricX (Juraska et al., [2024](https://arxiv.org/html/2606.15059#bib.bib16)). For each alignment group $A_k = (X_k, Y_k, \hat{Y}_k)$, we compute the quality score of group $A_k$ as

$$q_k = Q(X_k, Y_k, \hat{Y}_k). \tag{12}$$

For over-translation or under-translation cases, we directly assign the minimum possible score of the metric, denoted by $Q_{\min}$. For example, $Q_{\min} = 0$ for COMET and $Q_{\min} = -25$ for MetricX. The final long-form quality score is then computed by averaging over all alignment groups:

$$\mathrm{Quality} = \frac{1}{r} \sum_{k=1}^{r} q_k. \tag{13}$$
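The aggregation in Eqs. 12 and 13, including the $Q_{\min}$ fallback for null groups, can be sketched as follows. The metric is passed in as a plain callable; `toy_metric` is a made-up stand-in and not COMET, xCOMET, or MetricX, and representing each group as a triple of strings is an assumption of this sketch.

```python
from typing import Callable, Sequence, Tuple

Group = Tuple[str, str, str]  # (source text, reference text, hypothesis text)


def system_quality(groups: Sequence[Group],
                   metric: Callable[[str, str, str], float],
                   q_min: float) -> float:
    """Average group-level quality, assigning q_min to null alignments."""
    if not groups:
        raise ValueError("at least one alignment group is required")
    scores = []
    for src, ref, hyp in groups:
        over_translation = not src and not ref   # X_k and Y_k are empty
        under_translation = not hyp              # hypothesis side is empty
        if over_translation or under_translation:
            scores.append(q_min)
        else:
            scores.append(metric(src, ref, hyp))  # Eq. 12
    return sum(scores) / len(scores)              # Eq. 13


def toy_metric(src: str, ref: str, hyp: str) -> float:
    """Placeholder metric: 1.0 for an exact match with the reference, else 0.5."""
    return 1.0 if hyp == ref else 0.5


groups = [
    ("a", "b", "b"),     # matched group, exact
    ("", "", "extra"),   # over-translation
    ("c", "d", ""),      # under-translation
    ("e", "f", "x"),     # matched group, inexact
]
quality = system_quality(groups, toy_metric, q_min=0.0)
```

Unlike the latency average, null groups stay in the denominator here, so over- and under-translation directly lower the system-level quality score.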
Table 1: Evaluation results of SimulS2ST systems on the ACL 60/60 dev set. A / B denotes Latency (seconds) / xCOMET-XL. The best latency and quality scores are shown in bold.

Table 2: Evaluation results of SimulS2ST systems on the Audio-NTREX-L test set. A / B denotes Latency (seconds) / xCOMET-XL. The best latency and quality scores are shown in bold.

Table 3: WER/CER of Qwen3-ASR-1.7B and WhisperX on generated target speech for ACL 60/60 dev.

![Refer to caption](https://arxiv.org/html/2606.15059v1/x1.png)

Figure 1: The ending offset of each aligned sentence for two systems on every speech in the ACL 60/60 dev set. The results show that latency generally accumulates as the source speech becomes longer.

![Refer to caption](https://arxiv.org/html/2606.15059v1/figures/tgt_minus_src_length_histogram.png)

Figure 2: Distribution of target–source duration differences on the ACL 60/60 dev set. En→Zh is mostly negative, En→De is centered around zero, and En→Ja is mostly positive.

## 4 Experiments

### 4.1 Setup

#### Datasets

We evaluate existing SimulS2ST systems on two datasets: ACL 60/60 dev (Salesky et al., [2023](https://arxiv.org/html/2606.15059#bib.bib17)) and Audio-NTREX-4L test (Labiausse et al., [2026](https://arxiv.org/html/2606.15059#bib.bib18)). ACL 60/60 dev consists of five English ACL talks, each approximately 10 minutes long, translated into multiple languages, of which we consider three directions: English to German, Japanese, and Chinese. Audio-NTREX-4L is a multilingual speech translation benchmark introduced in Hibiki-Zero (Labiausse et al., [2026](https://arxiv.org/html/2606.15059#bib.bib18)). It is built from the NTREX text translation dataset by synthesizing source-language speech with high-quality TTS systems and multilingual speaker voices. The benchmark covers French, German, Portuguese, and Spanish as source languages and English as the target language. Each direction in our test split contains 450 speeches with an average duration of about 45 seconds, and we evaluate all four X-to-English directions.

#### SimulS2ST Systems

We consider three representative multilingual SimulS2ST systems:

- Seed LiveInterpret 2.0 (Cheng et al., [2025](https://arxiv.org/html/2606.15059#bib.bib19)): a product-level end-to-end simultaneous interpretation system designed for high-fidelity, ultra-low-latency speech-to-speech generation. It supports voice cloning and is built on a duplex speech-to-speech architecture.
- Hibiki-Zero (Labiausse et al., [2026](https://arxiv.org/html/2606.15059#bib.bib18)): an end-to-end simultaneous speech-to-speech translation system built on the Moshi duplex architecture (Défossez et al., [2024](https://arxiv.org/html/2606.15059#bib.bib20)). It is first trained on sentence-level aligned speech translation data and then further optimized with GRPO (Shao et al., [2024](https://arxiv.org/html/2606.15059#bib.bib21)) to reduce latency while preserving translation quality.
- SeamlessStreaming (Seamless Communication et al., [2023](https://arxiv.org/html/2606.15059#bib.bib5)): a multilingual streaming speech translation model from the Seamless family. It uses Efficient Monotonic Multihead Attention (EMMA) (Ma et al., [2023](https://arxiv.org/html/2606.15059#bib.bib22)) to generate low-latency translations without waiting for the full source utterance, enabling simultaneous speech-to-speech and speech-to-text translation across multiple source and target languages.

Seed LiveInterpret 2.0 is evaluated through the Volcano Engine API ([https://www.volcengine.com/docs/6561/1756902?lang=en](https://www.volcengine.com/docs/6561/1756902?lang=en)), while Hibiki-Zero and SeamlessStreaming are run locally on a single NVIDIA L40S GPU.

### 4.2 Evaluation

The evaluation results on the ACL 60/60 dev set are shown in Table [1](https://arxiv.org/html/2606.15059#S3.T1). Seed LiveInterpret 2.0 consistently achieves better translation quality than SeamlessStreaming, but at substantially higher latency, e.g., 9.4 seconds for En→Ja.

The results on the Audio-NTREX-L test set are shown in Table [2](https://arxiv.org/html/2606.15059#S3.T2). Overall, all systems achieve reasonably good translation quality. Among them, Seed LiveInterpret 2.0 obtains the best translation quality, but with more than 2 seconds higher latency than Hibiki-Zero and SeamlessStreaming.

We also observe that latency is less stable for En→X directions, whereas it is much more consistent for X→En directions.

### 4.3 Analysis

#### Poor En→Ja Performance of Seed LiveInterpret 2.0

In the En→Ja direction, Seed LiveInterpret 2.0 exhibits very high latency, reaching nearly 10 seconds, while achieving an xCOMET-XL score of only 45.48. Our initial analysis suggests that the ASR transcripts of the generated target speech contain substantial gibberish and fragmented Japanese. To determine whether this issue stems from Qwen3-ASR's limited Japanese recognition performance or from poor Japanese speech synthesis quality, we measure word/character error rates (WER/CER) using both Qwen3-ASR-1.7B and WhisperX (Bain et al., [2023](https://arxiv.org/html/2606.15059#bib.bib23)). The ground-truth target text is taken from the Seed LiveInterpret 2.0 API, which returns both synthesized target speech and target text. The results are shown in Table [3](https://arxiv.org/html/2606.15059#S3.T3). Both Qwen3-ASR-1.7B and WhisperX yield very high CER on En→Ja. Since two independent recognizers fail on the same audio, the problem is more likely caused by poor Japanese speech synthesis quality in Seed LiveInterpret 2.0 than by either ASR model.
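For reference, WER and CER are normalized edit distances between the reference text and the ASR output. The sketch below uses a plain Levenshtein distance; the text does not describe the tokenization or normalization used for Table 3, so operating on raw characters and whitespace-split words is an assumption made only for illustration.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance with unit cost for insertion, deletion, substitution."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


japanese_cer = cer("こんにちは", "こんばんは")           # two substituted characters
english_wer = wer("the cat sat", "the cat sat down")   # one inserted word
```

CER is the natural choice for Japanese and Chinese, which are written without spaces between words, while WER fits whitespace-delimited languages such as German.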

#### Segmentation Quality

We analyze the segmentation quality of SEGALE on the ACL 60/60 En→Zh dev set and compare it with the recently proposed SoftSegmenter (Polák et al., [2026](https://arxiv.org/html/2606.15059#bib.bib8)). The results show that SEGALE achieves a segmentation accuracy of 90.9%, substantially outperforming SoftSegmenter (79.1%). We observe that SoftSegmenter often shifts boundary-adjacent fragments across neighboring sentences, e.g., attaching the beginning of a sentence to the previous segment or the ending to the next one. SoftSegmenter is also particularly brittle when semantically related content is realized with different surface forms. For example, when the reference contains foreign-language expressions while the prediction translates or paraphrases them, local token-level matching can break down. Both failure modes likely arise because its re-segmentation relies on local token-level matching rather than explicit sentence-level boundary modeling. In contrast, SEGALE is much less affected by such cases and produces more semantically coherent segmentation.

#### Latency Accumulation

We further observe that, on long-form speech in the ACL 60/60 dev set (around 10 minutes), latency is substantially higher than on the Audio-NTREX-L test set (around 45 seconds). To better understand this phenomenon, we compute the ending offset of SEGALE-aligned sentences for both SeamlessStreaming and Seed LiveInterpret 2.0 on each speech in the ACL 60/60 dev set, as shown in Figure [1](https://arxiv.org/html/2606.15059#S3.F1). We find that on long speech, both systems exhibit increasingly large ending offsets as more input speech arrives, with the only exception being Seed LiveInterpret 2.0 on the En→Zh direction. Further analysis shows that latency accumulation is related to the sentence-level target-source duration difference, as shown in Figure [2](https://arxiv.org/html/2606.15059#S3.F2). For En→Zh, where the target speech is generally shorter than the source speech and the duration difference is concentrated in a negative range (roughly $[-2.5, -0.5]$ seconds), the ending offset stays small and stable. By contrast, for En→Ja, where the target speech is typically longer than the source and the duration difference falls in a positive range (roughly $[0, 3]$ seconds), the ending offset exceeds 200 seconds near the end of the speech. This suggests that even state-of-the-art SimulS2ST systems still suffer from latency accumulation on speech spanning minutes, highlighting the need for future research to address this issue.
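A toy queueing model gives some intuition for why a positive target-source duration difference leads to accumulation. This is an illustrative assumption and not the analysis behind Figures 1 and 2: the model supposes that the system speaks one target sentence at a time, cannot start a sentence before its source sentence has begun plus a fixed lag, and cannot start before the previous target sentence has finished. The ending offset is taken here to be the target end time minus the source end time.

```python
from typing import List, Sequence


def ending_offsets(src_durations: Sequence[float],
                   tgt_durations: Sequence[float],
                   min_lag: float = 0.0) -> List[float]:
    """Ending offset (target end minus source end) per sentence in a toy model."""
    offsets: List[float] = []
    src_end = 0.0
    tgt_end = 0.0
    for s, t in zip(src_durations, tgt_durations):
        src_start = src_end
        src_end = src_start + s
        # Target speech is serialized: wait for the source and for the previous output.
        tgt_start = max(src_start + min_lag, tgt_end)
        tgt_end = tgt_start + t
        offsets.append(tgt_end - src_end)
    return offsets


# Target sentences 1 s longer than the source: the offset grows by 1 s per sentence.
longer = ending_offsets([4.0, 4.0, 4.0], [5.0, 5.0, 5.0], min_lag=1.0)
# Target sentences 1 s shorter than the source: the offset stays flat.
shorter = ending_offsets([4.0, 4.0, 4.0], [3.0, 3.0, 3.0], min_lag=1.0)
```

Under these assumptions, a per-sentence surplus of even one second compounds linearly with the number of sentences, while a deficit is absorbed by waiting for the next source sentence, which is consistent in spirit with the contrast between En→Ja and En→Zh described above.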

## 5 Conclusion

We present a practical evaluation method for long\-form SimulS2ST\. Combining ASR, forced alignment, and SEGALE\-based sentence alignment, it enables sentence\-level evaluation of latency and translation quality on continuous speech across representative SimulS2ST systems\. Experiments on ACL 60/60 and Audio\-NTREX\-L validate the method and show that SEGALE provides robust segmentation for long\-form evaluation\. More importantly, our analysis reveals a key limitation of current systems: latency accumulates substantially on long speech\. We hope this work lays a strong foundation for future research on reliable, low\-latency SimulS2ST\.

## References

- M\. Bain, J\. Huh, T\. Han, and A\. Zisserman \(2023\)WhisperX: Time\-Accurate Speech Transcription of Long\-Form Audio\.InInterspeech 2023,pp\. 4489–4493\.External Links:[Document](https://dx.doi.org/10.21437/Interspeech.2023-78),ISSN 2958\-1796Cited by:[§4\.3](https://arxiv.org/html/2606.15059#S4.SS3.SSS0.Px1.p1.2)\.
- S\. Cheng, Y\. Bao, Z\. Huang, Y\. Lu, N\. Peng, L\. Xu, R\. Yu, R\. Cao, Y\. Du, T\. Han, Y\. Hu, Z\. Li, S\. Liu, S\. Ma, S\. Pan, J\. Xiao, N\. Xu, M\. Yang, R\. Ye, Y\. Yu, J\. Zhang, R\. Zhang, W\. Zhang, W\. Zhu, L\. Zou, L\. Lu, Y\. Wang, and Y\. Wu \(2025\)Seed liveinterpret 2\.0: end\-to\-end simultaneous speech\-to\-speech translation with your voice\.External Links:[Link](https://arxiv.org/abs/2507.17527)Cited by:[1st item](https://arxiv.org/html/2606.15059#S4.I1.i1.p1.1)\.
- S\. Communication, L\. Barrault, Y\. Chung, M\. C\. Meglioli, D\. Dale, N\. Dong, M\. Duppenthaler, P\. Duquenne, B\. Ellis, H\. Elsahar, J\. Haaheim, J\. Hoffman, M\. Hwang, H\. Inaguma, C\. Klaiber, I\. Kulikov, P\. Li, D\. Licht, J\. Maillard, R\. Mavlyutov, A\. Rakotoarison, K\. R\. Sadagopan, A\. Ramakrishnan, T\. Tran, G\. Wenzek, Y\. Yang, E\. Ye, I\. Evtimov, P\. Fernandez, C\. Gao, P\. Hansanti, E\. Kalbassi, A\. Kallet, A\. Kozhevnikov, G\. M\. Gonzalez, R\. S\. Roman, C\. Touret, C\. Wong, C\. Wood, B\. Yu, P\. Andrews, C\. Balioglu, P\. Chen, M\. R\. Costa\-jussà, M\. Elbayad, H\. Gong, F\. Guzmán, K\. Heffernan, S\. Jain, J\. Kao, A\. Lee, X\. Ma, A\. Mourachko, B\. Peloquin, J\. Pino, S\. Popuri, C\. Ropers, S\. Saleem, H\. Schwenk, A\. Sun, P\. Tomasello, C\. Wang, J\. Wang, S\. Wang, and M\. Williamson \(2023\)Seamless: multilingual expressive and streaming speech translation\.External Links:[Link](https://arxiv.org/abs/2312.05187)Cited by:[§1](https://arxiv.org/html/2606.15059#S1.p1.1),[3rd item](https://arxiv.org/html/2606.15059#S4.I1.i3.p1.1)\.
- A\. Défossez, L\. Mazaré, M\. Orsini, A\. Royer, P\. Pérez, H\. Jégou, E\. Grave, and N\. Zeghidour \(2024\)Moshi: a speech\-text foundation model for real\-time dialogue\.External Links:[Link](https://arxiv.org/abs/2410.00037)Cited by:[2nd item](https://arxiv.org/html/2606.15059#S4.I1.i2.p1.1)\.
- N\. M\. Guerreiro, R\. Rei, D\. v\. Stigt, L\. Coheur, P\. Colombo, and A\. F\. T\. Martins \(2024\)XCOMET: transparent machine translation evaluation through fine\-grained error detection\.Transactions of the Association for Computational Linguistics12,pp\. 979–995\.External Links:[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00683),[Link](https://aclanthology.org/2024.tacl-1.54/)Cited by:[§1](https://arxiv.org/html/2606.15059#S1.p3.1)\.
- J\. Juraska, D\. Deutsch, M\. Finkelstein, and M\. Freitag \(2024\)MetricX\-24: the Google submission to the WMT 2024 metrics shared task\.InProceedings of the Ninth Conference on Machine Translation,B\. Haddow, T\. Kocmi, P\. Koehn, and C\. Monz \(Eds\.\),Miami, Florida, USA,pp\. 492–504\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.wmt-1.35),[Link](https://aclanthology.org/2024.wmt-1.35/)Cited by:[§3\.6](https://arxiv.org/html/2606.15059#S3.SS6.p1.3)\.
- T\. Labiausse, R\. Fabre, Y\. Estève, A\. Défossez, and N\. Zeghidour \(2026\)Simultaneous speech\-to\-speech translation without aligned data\.External Links:[Link](https://arxiv.org/abs/2602.11072)Cited by:[2nd item](https://arxiv.org/html/2606.15059#S4.I1.i2.p1.1),[§4\.1](https://arxiv.org/html/2606.15059#S4.SS1.SSS0.Px1.p1.1)\.
- T\. Labiausse, L\. Mazaré, E\. Grave, A\. Défossez, and N\. Zeghidour \(2025\)High\-fidelity simultaneous speech\-to\-speech translation\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=fgjN8B6xVX)Cited by:[§1](https://arxiv.org/html/2606.15059#S1.p2.1)\.
- D\. Liu, C\. Wang, H\. Gong, X\. Ma, Y\. Tang, and J\. Pino \(2022\)From Start to Finish: Latency Reduction Strategies for Incremental Speech Synthesis in Simultaneous Speech\-to\-Speech Translation\.InInterspeech 2022,pp\. 1771–1775\.External Links:[Document](https://dx.doi.org/10.21437/Interspeech.2022-10568),ISSN 2958\-1796Cited by:[§1](https://arxiv.org/html/2606.15059#S1.p1.1)\.
- X\. Ma, H\. Gong, D\. Liu, A\. Lee, Y\. Tang, P\. Chen, W\. Hsu, P\. Koehn, and J\. Pino \(2022\)Direct simultaneous speech\-to\-speech translation with variational monotonic multihead attention\.External Links:[Link](https://arxiv.org/abs/2110.08250)Cited by:[§1](https://arxiv.org/html/2606.15059#S1.p1.1)\.
- X\. Ma, A\. Sun, S\. Ouyang, H\. Inaguma, and P\. Tomasello \(2023\)Efficient monotonic multihead attention\.External Links:[Link](https://arxiv.org/abs/2312.04515)Cited by:[3rd item](https://arxiv.org/html/2606.15059#S4.I1.i3.p1.1)\.
- E\. Matusov, G\. Leusch, O\. Bender, and H\. Ney \(2005\)Evaluating machine translation output with automatic sentence segmentation\.InProceedings of the Second International Workshop on Spoken Language Translation,Pittsburgh, Pennsylvania, USA\.External Links:[Link](https://aclanthology.org/2005.iwslt-1.19/)Cited by:[§2](https://arxiv.org/html/2606.15059#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.15059#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Papi, M\. Gaido, M\. Negri, and L\. Bentivogli \(2024\)StreamAtt: direct streaming speech\-to\-text translation with attention\-based audio history selection\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 3692–3707\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.202),[Link](https://aclanthology.org/2024.acl-long.202/)Cited by:[§1](https://arxiv.org/html/2606.15059#S1.p2.1),[§2](https://arxiv.org/html/2606.15059#S2.SS0.SSS0.Px1.p1.1)\.
- P\. Polák, S\. Papi, L\. Bentivogli, and O\. Bojar \(2026\)Better late than never: meta\-evaluation of latency metrics for simultaneous speech\-to\-text translation\.External Links:[Link](https://arxiv.org/abs/2509.17349)Cited by:[§1](https://arxiv.org/html/2606.15059#S1.p2.1),[§1](https://arxiv.org/html/2606.15059#S1.p3.1),[§2](https://arxiv.org/html/2606.15059#S2.SS0.SSS0.Px1.p1.1),[§3\.5](https://arxiv.org/html/2606.15059#S3.SS5.p3.2),[§4\.3](https://arxiv.org/html/2606.15059#S4.SS3.SSS0.Px2.p1.1)\.
- R\. Rei, C\. Stewart, A\. C\. Farinha, and A\. Lavie \(2020\)COMET: a neural framework for MT evaluation\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),B\. Webber, T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\),Online,pp\. 2685–2702\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.213),[Link](https://aclanthology.org/2020.emnlp-main.213/)Cited by:[§3\.6](https://arxiv.org/html/2606.15059#S3.SS6.p1.3)\.
- E\. Salesky, K\. Darwish, M\. Al\-Badrashiny, M\. Diab, and J\. Niehues \(2023\)Evaluating multilingual speech translation under realistic conditions with resegmentation and terminology\.InProceedings of the 20th International Conference on Spoken Language Translation \(IWSLT 2023\),E\. Salesky, M\. Federico, and M\. Carpuat \(Eds\.\),Toronto, Canada \(in\-person and online\),pp\. 62–78\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.iwslt-1.2),[Link](https://aclanthology.org/2023.iwslt-1.2/)Cited by:[§4\.1](https://arxiv.org/html/2606.15059#S4.SS1.SSS0.Px1.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.External Links:[Link](https://arxiv.org/abs/2402.03300)Cited by:[2nd item](https://arxiv.org/html/2606.15059#S4.I1.i2.p1.1)\.
- X\. Shi, X\. Wang, Z\. Guo, Y\. Wang, P\. Zhang, X\. Zhang, Z\. Guo, H\. Hao, Y\. Xi, B\. Yang, J\. Xu, J\. Zhou, and J\. Lin \(2026\)Qwen3\-ASR technical report\.External Links:[Link](https://arxiv.org/abs/2601.21337)Cited by:[§3\.3](https://arxiv.org/html/2606.15059#S3.SS3.p1.6)\.
- K\. Sudoh, T\. Kano, S\. Novitasari, T\. Yanagita, S\. Sakti, and S\. Nakamura \(2020\)Simultaneous speech\-to\-speech translation system with neural incremental ASR, MT, and TTS\.External Links:[Link](https://arxiv.org/abs/2011.04845)Cited by:[§1](https://arxiv.org/html/2606.15059#S1.p1.1)\.
- B\. Thompson and P\. Koehn \(2020\)Exploiting sentence order in document alignment\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),B\. Webber, T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\),Online,pp\. 5997–6007\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.483),[Link](https://aclanthology.org/2020.emnlp-main.483/)Cited by:[§3\.4](https://arxiv.org/html/2606.15059#S3.SS4.p1.7)\.
- K\. Wang, S\. Ding, C\. H\. Yang, P\. Hsieh, W\. Peng, V\. Lavrukhin, and B\. Ginsburg \(2025\)Extending automatic machine translation evaluation to book\-length documents\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 32323–32339\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1645),ISBN 979\-8\-89176\-332\-6,[Link](https://aclanthology.org/2025.emnlp-main.1645/)Cited by:[§1](https://arxiv.org/html/2606.15059#S1.p3.1),[§2](https://arxiv.org/html/2606.15059#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Zhang, Q\. Fang, S\. Guo, Z\. Ma, M\. Zhang, and Y\. Feng \(2024\)StreamSpeech: simultaneous speech\-to\-speech translation with multi\-task learning\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 8964–8986\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.485),[Link](https://aclanthology.org/2024.acl-long.485/)Cited by:[§1](https://arxiv.org/html/2606.15059#S1.p1.1)\.
- R\. Zheng, M\. Ma, B\. Zheng, K\. Liu, J\. Yuan, K\. Church, and L\. Huang \(2020\)Fluent and low\-latency simultaneous speech\-to\-speech translation with self\-adaptive training\.InFindings of the Association for Computational Linguistics: EMNLP 2020,T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\),Online,pp\. 3928–3937\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.findings-emnlp.349),[Link](https://aclanthology.org/2020.findings-emnlp.349/)Cited by:[§1](https://arxiv.org/html/2606.15059#S1.p1.1),[§1](https://arxiv.org/html/2606.15059#S1.p2.1)\.
