Direct Translation between Sign Languages

Summary

This paper introduces a direct sign-to-sign translation model that bypasses intermediate text by using back-translation to create synthetic parallel sign language data, achieving significant improvements in speed and accuracy over cascade methods for ASL, CSL, and DGS.

arXiv:2605.20588v1 Announce Type: new Abstract: The field of sign language translation has witnessed significant progress in the translation between sign and spoken languages, but the translation between sign languages remains largely unexplored and out of reach. The latter can help 1.5 billion deaf and hard-of-hearing (DHH) people worldwide communicate across language barriers without relying on hearing interpreters or written-language fluency. The cascade approach composing separate sign-to-text, text-to-text, and text-to-sign systems suffers from error propagation and extra latency as well as the loss of information unique to the visual modality. We aim to develop direct sign-to-sign translation. However, a large-scale open-domain parallel corpus has not been curated between sign languages. To enable direct translation between sign language utterances, we use back-translation to produce synthetic sign-sign pairs from unaligned individual language utterance-sign corpora. Using this data, we jointly train a single mBART-based model for both text->sign (T2S) and sign->sign (S2S). On synthetically generated paired sets between American Sign Language (ASL), Chinese Sign Language (CSL), and German Sign Language (DGS), our direct S2S method outperforms the cascaded baseline on geometric sign error metrics (20% lower DTW-aligned MPJPE) and language matching metrics after predicted sign utterances are translated back to sentences (50% higher BLEU-4) while achieving a roughly 2.3× speedup. On a small set of pre-existing cross-lingual sign data, we find similar improvements for our proposed method.

Cached at: 05/21/26, 06:33 AM

# Direct Translation between Sign Languages
Source: [https://arxiv.org/html/2605.20588](https://arxiv.org/html/2605.20588)
Zetian Wu, Bowen Xie, Wuyang Meng, Milan Gautam, Stefan Lee, Liang Huang
Oregon State University
{wuzet, xiebo, mengwu, gautammi, leestef, liang.huang}@oregonstate.edu

###### Abstract

The field of sign language translation has witnessed significant progress in the translation between sign and spoken languages, but the translation between sign languages remains largely unexplored and out of reach. The latter can help 1.5 billion deaf and hard-of-hearing (DHH) people worldwide communicate across language barriers without relying on hearing interpreters or written-language fluency. The cascade approach composing separate sign-to-text, text-to-text, and text-to-sign systems suffers from error propagation and extra latency as well as the loss of information unique to the visual modality. We aim to develop direct sign-to-sign translation. However, a large-scale open-domain parallel corpus has not been curated between sign languages. To enable direct translation between sign language utterances, we use back-translation to produce synthetic sign-sign pairs from unaligned individual language utterance-sign corpora. Using this data, we jointly train a single mBART-based model for both $\text{text}\to\text{sign}$ (T2S) and $\text{sign}\to\text{sign}$ (S2S). On synthetically generated paired sets between American Sign Language (ASL), Chinese Sign Language (CSL), and German Sign Language (DGS), our direct S2S method outperforms the cascaded baseline on geometric sign error metrics (20% lower DTW-aligned MPJPE) and language matching metrics after predicted sign utterances are translated back to sentences (50% higher BLEU-4) while achieving a roughly $2.3\times$ speedup. On a small set of pre-existing cross-lingual sign data, we find similar improvements for our proposed method.

## 1 Introduction

Deaf and hard-of-hearing signers from different countries cannot converse directly in their native sign languages today. Like spoken languages, sign languages differ by dialect: American Sign Language (ASL), Chinese Sign Language (CSL), and German Sign Language (DGS) are mutually unintelligible, and hundreds more are in active use worldwide (Yin et al., [2021](https://arxiv.org/html/2605.20588#bib.bib5)). Such a conversation today routes through a chain of human interpreters, or through written text in a spoken language that is typically not the signer's first language; both options break the conversation out of sign and discard the spatial grammar, classifier predicates and prosody that signed dialogue relies on (Yin et al., [2021](https://arxiv.org/html/2605.20588#bib.bib5); De Coster et al., [2023](https://arxiv.org/html/2605.20588#bib.bib6)). A system that translates directly between sign languages (taking one signer's clip in and returning an equivalent clip in another, with no detour through spoken-language text) would close this gap (Figure [1](https://arxiv.org/html/2605.20588#S1.F1)).

![Refer to caption](https://arxiv.org/html/2605.20588v1/figures/fig1-direct.png)
Figure 1: Direct sign-to-sign translation. Given a source clip in one sign language (e.g. CSL), our single mBART-based model produces an equivalent clip in a target sign language (e.g. ASL) without going through written text. Compared with the cascaded S2T $\to$ MT $\to$ T2S baseline, the direct model is roughly $2.3\times$ faster and yields lower DTW-aligned MPJPE.

The most natural approach is a *cascade*: chain a sign-to-text (S2T) model (Camgöz et al., [2018](https://arxiv.org/html/2605.20588#bib.bib12), [2020](https://arxiv.org/html/2605.20588#bib.bib13); Lin et al., [2023](https://arxiv.org/html/2605.20588#bib.bib14); Yin and Read, [2020](https://arxiv.org/html/2605.20588#bib.bib47)) with a spoken-language MT system (Liu et al., [2020](https://arxiv.org/html/2605.20588#bib.bib18); Costa-jussà et al., [2022](https://arxiv.org/html/2605.20588#bib.bib20)) and a text-to-sign (T2S) generator (Stoll et al., [2018](https://arxiv.org/html/2605.20588#bib.bib9); Saunders et al., [2020](https://arxiv.org/html/2605.20588#bib.bib7), [2021](https://arxiv.org/html/2605.20588#bib.bib8); Zelinka and Kanis, [2020](https://arxiv.org/html/2605.20588#bib.bib10); Zuo et al., [2025](https://arxiv.org/html/2605.20588#bib.bib11)). This route reuses well-studied components, but it compounds the errors of three error-prone stages, runs three sequential forward passes per query, and is by construction blind to visual-only information.

A direct sign-to-sign (S2S) model would avoid all three issues (just as direct *speech-to-speech translation* (Jia et al., [2019](https://arxiv.org/html/2605.20588#bib.bib25), [2022](https://arxiv.org/html/2605.20588#bib.bib26); Rubenstein et al., [2023](https://arxiv.org/html/2605.20588#bib.bib27)) does in the spoken modality), but it is held back by the lack of parallel S2S data. The only published *direct* S2S system, from Inan et al. ([2025](https://arxiv.org/html/2605.20588#bib.bib1)), automatically aligned three pairwise corpora (ASL $\leftrightarrow$ CSL, ASL $\leftrightarrow$ DGS, CSL $\leftrightarrow$ DGS) and trained a model combining Camgöz et al. ([2020](https://arxiv.org/html/2605.20588#bib.bib13)) (encoder) and Saunders et al. ([2020](https://arxiv.org/html/2605.20588#bib.bib7)) (decoder). Their work suffers from two major limitations: their parallel training set is very small (at most 2.3K S2S pairs in one direction) and rather noisy, resulting in 0 BLEU-4 on multiple directions and at most $\sim 7$ elsewhere, leaving S2S an open question.

We close the gap by importing *back-translation* (BT) (Sennrich et al., [2016](https://arxiv.org/html/2605.20588#bib.bib28); Edunov et al., [2018](https://arxiv.org/html/2605.20588#bib.bib29)),¹ the most successful low-resource translation technique from neural machine translation (NMT), into the sign modality: for each gold (text, sign) pair in a monolingual corpus, we translate the text into another spoken language, feed the result to our T2S model to produce a synthetic source-sign clip, and pair that clip with the original gold sign as the target, yielding a large-scale parallel S2S training corpus (§[2](https://arxiv.org/html/2605.20588#S2)). For example, given an English–ASL pair $(T_{\text{en}}, S_{\text{ASL}})$ from How2Sign, we translate $T_{\text{en}}$ into Chinese $T_{\text{zh}}$ via MT, and then use our T2S model to generate a synthetic CSL clip $\hat{S}_{\text{CSL}}$ from $T_{\text{zh}}$; finally, we pair $(\hat{S}_{\text{CSL}}, S_{\text{ASL}})$ as a CSL $\to$ ASL training instance. Sign-side BT has previously been used in sign-to-text translation to manufacture additional gloss/text pairs (Zhou et al., [2021](https://arxiv.org/html/2605.20588#bib.bib15); Moryossef et al., [2021](https://arxiv.org/html/2605.20588#bib.bib32)); ours is, to our knowledge, the first use that yields parallel $\text{sign}\leftrightarrow\text{sign}$ training data. Building on Zuo et al. ([2025](https://arxiv.org/html/2605.20588#bib.bib11)), we train a single model jointly on T2S and S2S using our synthetic data (§[3](https://arxiv.org/html/2605.20588#S3)).

¹ Throughout this paper, *back-translation* refers to the data-augmentation technique introduced for NMT by Sennrich et al. ([2016](https://arxiv.org/html/2605.20588#bib.bib28)), *not* the back-translation *evaluation* protocol common in text-to-sign translation (Saunders et al., [2020](https://arxiv.org/html/2605.20588#bib.bib7)).

Concretely, this paper makes three main contributions:

- We synthesize the first large-scale parallel $\text{sign}\leftrightarrow\text{sign}$ training corpus by porting back-translation (Sennrich et al., [2016](https://arxiv.org/html/2605.20588#bib.bib28); Edunov et al., [2018](https://arxiv.org/html/2605.20588#bib.bib29)) into the sign modality: a T2S model produces synthetic source clips from translated texts, which we pair with gold target clips from monolingual sign corpora (§[2](https://arxiv.org/html/2605.20588#S2)).
- The cross-lingual test pairs of Inan et al. ([2025](https://arxiv.org/html/2605.20588#bib.bib1)) are aligned automatically and noisy by their own report; we extract a stricter subset using an LLM judge and Sentence-BERT (Reimers and Gurevych, [2019](https://arxiv.org/html/2605.20588#bib.bib34)) to produce a more meaningful benchmark (§[4](https://arxiv.org/html/2605.20588#S4)).
- Our direct S2S model substantially outperforms Inan et al. ([2025](https://arxiv.org/html/2605.20588#bib.bib1)) on every direction of the only previously released cross-lingual $\text{sign}\leftrightarrow\text{sign}$ benchmark, and outperforms the cascaded S2T $\to$ MT $\to$ T2S chain on per-part DTW-aligned MPJPE in our own evaluation on our strict subset, at roughly $2.3\times$ the cascade's wall-clock speed (§[5](https://arxiv.org/html/2605.20588#S5)).

## 2 Synthetic Sign-to-Sign Corpus via Back-Translation

The absence of natural parallel $\text{sign}\leftrightarrow\text{sign}$ corpora rules out direct supervised training of S2S. We close this gap by importing the standard back-translation recipe (Sennrich et al., [2016](https://arxiv.org/html/2605.20588#bib.bib28); Edunov et al., [2018](https://arxiv.org/html/2605.20588#bib.bib29)) from text MT into the sign modality, treating our T2S model as the synthetic-source generator (Figure [2](https://arxiv.org/html/2605.20588#S2.F2)). We first review back-translation in NMT (§[2.1](https://arxiv.org/html/2605.20588#S2.SS1)), then describe how we instantiate it for cross-lingual S2S training (§[2.2](https://arxiv.org/html/2605.20588#S2.SS2)), and report the statistics of the resulting corpus (§[2.3](https://arxiv.org/html/2605.20588#S2.SS3)).

![Refer to caption](https://arxiv.org/html/2605.20588v1/x1.png)
Figure 2: Back-translation: NMT and signs side by side. (a) Standard NMT back-translation (Sennrich et al., [2016](https://arxiv.org/html/2605.20588#bib.bib28)), introduced in §[2.1](https://arxiv.org/html/2605.20588#S2.SS1). (b) Our cross-lingual sign back-translation, introduced in §[2.2](https://arxiv.org/html/2605.20588#S2.SS2).

### 2.1 Preliminaries: Back-Translation in Neural Machine Translation

Neural machine translation requires large parallel corpora, yet for most language pairs only one side has abundant text (Liu et al., [2020](https://arxiv.org/html/2605.20588#bib.bib18); Costa-jussà et al., [2022](https://arxiv.org/html/2605.20588#bib.bib20)). Sennrich et al. ([2016](https://arxiv.org/html/2605.20588#bib.bib28)) introduced *back-translation* (BT) to exploit this asymmetry. Given a forward model $f:\mathcal{X}\to\mathcal{Y}$ to be improved and a reverse model $g:\mathcal{Y}\to\mathcal{X}$ already at hand, one draws monolingual sentences $y$ from a target-language corpus, runs $\hat{x}=g(y)$ to obtain a synthetic source, and treats $(\hat{x},y)$ as an additional training pair for $f$ (Figure [2](https://arxiv.org/html/2605.20588#S2.F2)(a)). The forward loss is computed against the *gold* target $y$, so noise on the synthetic source $\hat{x}$ is bounded: the model learns to recover a clean target from a noisy input rather than to imitate a noisy supervision signal. Edunov et al. ([2018](https://arxiv.org/html/2605.20588#bib.bib29)) subsequently showed that the technique scales to hundreds of millions of monolingual sentences and consistently improves low-resource directions, making BT a default ingredient of low-resource NMT systems. BT thus requires only (i) a target-side monolingual corpus and (ii) a usable reverse model – both of which we instantiate in the sign modality below.
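As a minimal sketch of the recipe just described, the snippet below builds $(\hat{x}, y)$ pairs from target-side monolingual sentences. The function names and the toy reverse model are illustrative assumptions, not part of any NMT toolkit; a real reverse model $g$ would be a trained translation system.

```python
from typing import Callable, List, Tuple


def back_translate(
    target_sentences: List[str],
    reverse_model: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build (synthetic source, gold target) pairs for training the forward model f."""
    pairs = []
    for y in target_sentences:
        x_hat = reverse_model(y)   # synthetic source, possibly noisy
        pairs.append((x_hat, y))   # the target y is untouched, so supervision stays gold
    return pairs


def toy_reverse_model(y: str) -> str:
    """Hypothetical stand-in for g: Y -> X; it only reverses the word order."""
    return " ".join(reversed(y.split()))


synthetic_pairs = back_translate(["guten morgen", "vielen dank"], toy_reverse_model)
```

Only the source side of each pair passes through a model; this is the property the sign-modality variant relies on.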

### 2.2 Cross-Lingual Sign Back-Translation

#### Setting.

For each sign language $s\in\{\text{ASL},\text{CSL},\text{DGS}\}$, an existing corpus $\mathcal{C}_{s}$ provides aligned (text, sign) pairs $(x_{l(s)},z_{s})$, where $l(s)$ is the corresponding spoken language: How2Sign (Duarte et al., [2021](https://arxiv.org/html/2605.20588#bib.bib16)) for $(\text{en},\text{ASL})$, CSL-Daily (Zhou et al., [2021](https://arxiv.org/html/2605.20588#bib.bib15)) for $(\text{zh},\text{CSL})$, and Phoenix-2014T (Camgöz et al., [2018](https://arxiv.org/html/2605.20588#bib.bib12)) for $(\text{de},\text{DGS})$. No corpus, however, provides parallel data $(z_{s},z_{s'})$ for two different sign languages $s$ and $s'$ at scale. Our goal is to construct such cross-lingual pairs $(\hat{z}_{s},z_{s'})$ – synthetic source signs in language $s$ paired with gold target signs in language $s'$ – for all six ordered directions $s\to s'$ with $s\neq s'$.
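The six ordered directions follow mechanically from the three sign languages. The short sketch below enumerates them and groups them by target; the variable names are illustrative only.

```python
from itertools import permutations

SIGN_LANGS = ["ASL", "CSL", "DGS"]

# Every ordered pair (source s, target s') with s != s'.
directions = list(permutations(SIGN_LANGS, 2))

# Group source languages by target: each gold corpus C_{s'} serves as the
# target side for exactly two directions.
sources_by_target = {}
for s, s_prime in directions:
    sources_by_target.setdefault(s_prime, []).append(s)
```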

#### Procedure.

For each gold pair $(x_{l(s')},z_{s'})\in\mathcal{C}_{s'}$ and each desired source sign language $s$, we synthesize the source clip in three stages (Figure [2](https://arxiv.org/html/2605.20588#S2.F2)(b)):

1. Spoken-language MT. Translate the gold text $x_{l(s')}$ from spoken language $l(s')$ into spoken language $l(s)$, i.e., $x_{l(s)}=M_{l(s')\to l(s)}(x_{l(s')})$, using TranslateGemma 4B (Finkelstein et al., [2026](https://arxiv.org/html/2605.20588#bib.bib21)) as the off-the-shelf MT system $M$.
2. Sign synthesis. Generate the source sign tokens via the T2S model, $\hat{z}_{s}=\mathrm{T2S}(x_{l(s)})$, using the same decoding configuration as inference so that train and inference distributions match.
3. Pairing. Form the S2S training instance $(\hat{z}_{s},z_{s'})$ – *synthetic* source, *gold* target.

This construction preserves the key property that makes BT effective: the supervision signal $z_{s'}$ is gold, so noise on the synthetic source shapes only the conditional distribution the model learns and does not appear as the supervised target.²

² Equivalently: source noise affects which $p(z_{s'}\mid\hat{z}_{s})$ we learn, but every loss value is computed against a clean target.

Our pipeline departs from standard NMT BT (Sennrich et al., [2016](https://arxiv.org/html/2605.20588#bib.bib28)) in one respect: the bridge between the two sign languages routes through their partner spoken languages rather than a direct sign-to-sign reverse model, since no such reverse model exists for sign. The detour discards sign-only information on the source side, but BT still applies because the gold target $z_{s'}$ is unaffected and translation between high-resource spoken languages is comparatively well-studied (Costa-jussà et al., [2022](https://arxiv.org/html/2605.20588#bib.bib20)). It also differs from prior sign-side BT, which has been used in sign-to-text translation to manufacture additional sign/text pairs within a single sign-language–spoken-language pair (Zhou et al., [2021](https://arxiv.org/html/2605.20588#bib.bib15); Moryossef et al., [2021](https://arxiv.org/html/2605.20588#bib.bib32)); our use is cross-lingual, with synthetic source and gold target in two *different* sign languages, yielding parallel $\text{sign}\leftrightarrow\text{sign}$ training data for direct S2S – a setting, to our knowledge, previously unaddressed.
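The three-stage procedure of this subsection can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `translate` and `t2s` are hypothetical callables standing in for the MT system $M$ and the T2S model, and motion tokens are represented as plain integer lists.

```python
from typing import Callable, List, Tuple

# Spoken-language partner l(s) of each sign language s.
PARTNER = {"ASL": "en", "CSL": "zh", "DGS": "de"}


def sign_back_translate(
    gold_corpus: List[Tuple[str, List[int]]],   # gold pairs (x_{l(s')}, z_{s'})
    target_sign: str,                            # s'
    source_sign: str,                            # s
    translate: Callable[[str, str, str], str],   # hypothetical MT: (text, src, tgt) -> text
    t2s: Callable[[str, str], List[int]],        # hypothetical T2S: (text, sign_lang) -> tokens
) -> List[Tuple[List[int], List[int]]]:
    """Return S2S training pairs (synthetic source tokens, gold target tokens)."""
    if source_sign == target_sign:
        raise ValueError("source and target sign languages must differ")
    pairs = []
    for text, gold_tokens in gold_corpus:
        # Stage 1: spoken-language MT from l(s') to l(s).
        bridged = translate(text, PARTNER[target_sign], PARTNER[source_sign])
        # Stage 2: synthesize source sign tokens from the translated text.
        synthetic_tokens = t2s(bridged, source_sign)
        # Stage 3: pair the synthetic source with the gold target.
        pairs.append((synthetic_tokens, gold_tokens))
    return pairs


# Toy stubs so the sketch runs end to end (purely illustrative).
def toy_translate(text: str, src: str, tgt: str) -> str:
    return f"[{tgt}] {text}"


def toy_t2s(text: str, sign_lang: str) -> List[int]:
    return [len(word) for word in text.split()]


csl_to_asl = sign_back_translate(
    [("hello world", [7, 8, 9])], "ASL", "CSL", toy_translate, toy_t2s
)
```

The gold tokens pass through unchanged, which mirrors the point made above: noise enters only on the source side.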

### 2.3 Constructed Sign-to-Sign Training Corpus

Applying the procedure of §[2.2](https://arxiv.org/html/2605.20588#S2.SS2) to all three corpora yields an S2S training corpus that covers the six ordered directions across $\{\text{ASL},\text{CSL},\text{DGS}\}$. Each gold corpus $\mathcal{C}_{s'}$ contributes pairs to two directions. Table [1](https://arxiv.org/html/2605.20588#S2.T1) reports per-direction pair counts, average synthetic-source length and gold-target length.

Table 1: Statistics of the BT-constructed S2S training corpus (112,324 pairs total). Source signs ($s$) are synthesized by our T2S model from translated texts; target signs ($s'$) are gold from the underlying corpus $\mathcal{C}_{s'}$. Lengths are in motion tokens (1 token $\approx 4$ frames).

We use this corpus as the S2S training source in the experiments of §[5](https://arxiv.org/html/2605.20588#S5); evaluation pairs are constructed independently and described in §[4](https://arxiv.org/html/2605.20588#S4), alongside qualitative examples of representative synthetic-source / gold-target pairs.

## 3 Model

We build on SOKE (Zuo et al., [2025](https://arxiv.org/html/2605.20588#bib.bib11)), a recent multilingual text-to-sign generator, and adopt its two main components unchanged: a VQ-VAE (van den Oord et al., [2017](https://arxiv.org/html/2605.20588#bib.bib22)) that tokenizes sign clips by mapping continuous body and hand motion to discrete motion tokens, and a multilingual encoder–decoder initialized from mBART-large-cc25 (Liu et al., [2020](https://arxiv.org/html/2605.20588#bib.bib18)) that translates between spoken-language texts and sign-token sequences (Figure [3](https://arxiv.org/html/2605.20588#S3.F3)). On top of this backbone, we cast T2S and S2S as a single discrete sequence-to-sequence problem differing only in the source/target language tags supplied to the encoder input and decoder prefix, and train both directions jointly over the natural T2S corpora and the BT-constructed S2S corpus from §[2](https://arxiv.org/html/2605.20588#S2) (§[3.2](https://arxiv.org/html/2605.20588#S3.SS2)).

![Refer to caption](https://arxiv.org/html/2605.20588v1/figures/model2.png)
Figure 3: Model overview. A single mBART-based encoder–decoder handles both T2S and S2S, with the input/output language signaled by special tokens at the encoder input and decoder prefix. For T2S the encoder reads a spoken-language text; for S2S it reads a sign clip encoded into discrete motion tokens via a VQ-VAE. In both cases the decoder emits motion tokens autoregressively, which are decoded back to motion by the VQ-VAE decoder.

### 3.1 Problem Formulation

Let $l\in\{\text{en},\text{zh},\text{de}\}$ index a spoken language, $s\in\{\mathrm{ASL},\mathrm{CSL},\mathrm{DGS}\}$ a sign language, and $l(s)$ the spoken-language partner of $s$ (English for ASL, Chinese for CSL, German for DGS). Spoken-language texts are sequences of subword tokens $x\in\mathcal{X}_{l}$, and sign clips are sequences of motion tokens $z\in\mathcal{Z}_{s}$. The two translation directions of interest are

$$\mathrm{T2S}:\ p(z_{s}\mid x_{l}),\qquad \mathrm{S2S}:\ p(z_{s'}\mid z_{s}),$$

with $l=l(s)$ for T2S and $s\neq s'$ for S2S. Both are modelled by a single conditional distribution $p_{\theta}(\cdot\mid\cdot)$ whose input/output language is signaled by special tokens in the encoder input and decoder prefix; the same module handles both directions by varying these tokens.

### 3.2 Joint Multi-Task Training

Prior work using this backbone (Zuo et al., [2025](https://arxiv.org/html/2605.20588#bib.bib11)) optimized only T2S. We instead train a single backbone jointly on T2S and S2S in one stage, leveraging the BT-constructed S2S corpus from §[2](https://arxiv.org/html/2605.20588#S2). For each direction $d\in\{\mathrm{T2S},\mathrm{S2S}\}$ we use the standard token-level cross-entropy loss against gold motion-token labels, $\mathcal{L}_{d}$, computed as the mean over the per-direction subset of the mini-batch, and minimize their sum,

$$\mathcal{L} \;=\; \mathcal{L}_{\mathrm{T2S}} + \mathcal{L}_{\mathrm{S2S}}.$$

T2S samples come from the natural (text, sign) pairs of the underlying corpora, and S2S samples from the BT-constructed cross-lingual corpus of §[2](https://arxiv.org/html/2605.20588#S2); per-direction counts are reported in §[4](https://arxiv.org/html/2605.20588#S4).
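A small numerical sketch of this objective follows. It assumes that the "mean over the per-direction subset" averages over all gold tokens of that subset (averaging per sample first is an equally plausible reading), and it represents each sample simply by the probabilities the model assigns to its gold tokens. All names are illustrative.

```python
import math
from typing import List, Tuple


def direction_loss(gold_token_probs: List[List[float]]) -> float:
    """Mean token-level cross-entropy over all gold tokens of one direction."""
    nll = [-math.log(p) for sample in gold_token_probs for p in sample]
    return sum(nll) / len(nll)


def joint_loss(batch: List[Tuple[str, List[float]]]) -> float:
    """L = L_T2S + L_S2S, each computed on its own subset of the mini-batch."""
    total = 0.0
    for direction in ("T2S", "S2S"):
        subset = [probs for tag, probs in batch if tag == direction]
        if subset:  # a direction absent from the batch contributes nothing
            total += direction_loss(subset)
    return total


# One T2S sample with two tokens, two S2S samples with one token each.
example_batch = [("T2S", [0.5, 0.5]), ("S2S", [0.25]), ("S2S", [0.25])]
example_loss = joint_loss(example_batch)
```

Because each term is a mean over its own subset, the two directions are weighted equally regardless of how many samples of each land in a mini-batch.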

To consume either modality through a single encoder, we follow SOKE's embedding scheme: a text is embedded as standard mBART subword tokens, while a motion-token tuple at each frame – one token each for body, left hand and right hand – is fused into a single embedding via the same weighted sum that SOKE applies to decoder inputs during motion generation under teacher forcing. The source language tag attached to the encoder input – a spoken-language tag (`en_XX`, `zh_CN`, `de_DE`) for T2S, or a sign-language tag (`en_ASL`, `zh_CSL`, `de_DGS`) for S2S – tells the encoder which embedding path to take, so the same module serves both directions without any new parameters.
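The routing and fusion described above might look like the sketch below. The tag strings come from the text; everything else is an assumption made for illustration, in particular the function names and the fusion weights `(0.5, 0.25, 0.25)`, which are placeholders rather than SOKE's actual values.

```python
from typing import List, Sequence

SPOKEN_TAGS = {"en_XX", "zh_CN", "de_DE"}      # T2S sources
SIGN_TAGS = {"en_ASL", "zh_CSL", "de_DGS"}     # S2S sources


def encoder_path(source_tag: str) -> str:
    """Choose the embedding path from the source language tag."""
    if source_tag in SPOKEN_TAGS:
        return "text"      # standard subword embeddings
    if source_tag in SIGN_TAGS:
        return "motion"    # fused body / left-hand / right-hand embeddings
    raise ValueError(f"unknown language tag: {source_tag}")


def fuse_motion_embeddings(
    body: Sequence[float],
    left_hand: Sequence[float],
    right_hand: Sequence[float],
    weights: Sequence[float] = (0.5, 0.25, 0.25),  # hypothetical weights
) -> List[float]:
    """Weighted sum of the three per-part embeddings into one vector."""
    w_body, w_left, w_right = weights
    return [
        w_body * b + w_left * l + w_right * r
        for b, l, r in zip(body, left_hand, right_hand)
    ]


fused = fuse_motion_embeddings([1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
```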

## 4 Evaluation Dataset

The three monolingual sign-language corpora that anchor our training and evaluation – How2Sign (Duarte et al., [2021](https://arxiv.org/html/2605.20588#bib.bib16)), CSL-Daily (Zhou et al., [2021](https://arxiv.org/html/2605.20588#bib.bib15)), and Phoenix-2014T (Camgöz et al., [2018](https://arxiv.org/html/2605.20588#bib.bib12)) – are introduced in §[2.2](https://arxiv.org/html/2605.20588#S2.SS2); per-direction pair counts and sign-token lengths in the resulting BT-constructed S2S training corpus are reported in Table [1](https://arxiv.org/html/2605.20588#S2.T1). The remainder of this section therefore focuses on the cross-lingual *test* pairs of Inan et al. ([2025](https://arxiv.org/html/2605.20588#bib.bib1)), whose alignment is documented to be noisy by its authors (§[4.1](https://arxiv.org/html/2605.20588#S4.SS1)). To enable a more discriminative comparison, we re-verify this released set into a stricter subset whose source and target texts independently agree on meaning (§[4.2](https://arxiv.org/html/2605.20588#S4.SS2)); statistics before and after re-verification are reported in §[4.3](https://arxiv.org/html/2605.20588#S4.SS3).

### 4.1 Cross-Lingual Pairs from Prior Work

The only previously released cross-lingual $\text{sign}\leftrightarrow\text{sign}$ pairs are those of Inan et al. ([2025](https://arxiv.org/html/2605.20588#bib.bib1)), who construct three pair sets across $\{\text{ASL},\text{CSL},\text{DGS}\}$ – $\text{ASL}\leftrightarrow\text{CSL}$, $\text{ASL}\leftrightarrow\text{DGS}$, and $\text{CSL}\leftrightarrow\text{DGS}$ – by translating the texts of each source corpus into the partner spoken language, encoding original and translated texts with a multilingual paraphrase model, and pairing clips whose texts score highly under that model. We use these pairs as the test sets in our comparison with Inan et al. ([2025](https://arxiv.org/html/2605.20588#bib.bib1)) (§[5.2](https://arxiv.org/html/2605.20588#S5.SS2)) so that the BLEU numbers in Table [4](https://arxiv.org/html/2605.20588#S5.T4) are directly comparable to their Table 5.

The authors document that the alignment is imperfect: paraphrase scoring on translated text is noisy, and a non\-trivial fraction of the released pairs do not in fact describe the same content\. Concrete examples of accepted but content\-divergent pairs are shown in the supplement\. We therefore complement the full release with a re\-verified subset\.

### 4.2 Re-Verifying the Cross-Lingual Evaluation Set

We do not modify the source corpora but only re-verify the cross-lingual pairs of Inan et al. ([2025](https://arxiv.org/html/2605.20588#bib.bib1)), retaining the subset whose source and target texts independently agree on meaning. Since the strict subset is used only for evaluation, pooling the train, dev, and test splits of Inan et al. ([2025](https://arxiv.org/html/2605.20588#bib.bib1)) before filtering carries no leakage risk, and it leaves enough pairs per language pair after the conservative procedure below, which stacks two automatic filters and a final manual screening:

1. LLM-judge filter. For each released pair we prompt Qwen3-8B (Qwen Team, [2025](https://arxiv.org/html/2605.20588#bib.bib40)) with the source and target texts and ask for an integer rating from 1 to 5 of how likely the two describe the same event/action with the same key entities (5 = most likely, 1 = unrelated), with a one-sentence justification. We keep pairs whose rating is strictly greater than 4.
2. Sentence-embedding filter. For each surviving pair, we encode the source and target texts in their *original* languages with the multilingual sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 bi-encoder (Reimers and Gurevych, [2019](https://arxiv.org/html/2605.20588#bib.bib34)) and compute the cosine similarity of the two embeddings. We keep pairs whose cosine similarity exceeds 0.5.
3. Conservative intersection. A pair enters the candidate pool only if it passes both (1) and (2). Either filter alone retains too many borderline pairs – the LLM judge alone is too lenient on semantically related but content-divergent pairs, and the bi-encoder keeps near-duplicate texts with substantively different referents – so we take their intersection. Precision is favoured over recall by design, since the goal is a discriminative benchmark rather than a large training set.
4. Manual screening. As a final pass, two bilingual authors independently review the candidate pairs that pass (3) and discard those whose source and target texts, on visual inspection, do not describe the same event or share the same key entities; a pair is retained only when both annotators agree to keep it. The pairs surviving all three stages constitute the released strict subset.
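The stacked filters above reduce to a simple conjunction once the scores are available. The sketch below assumes the Qwen3-8B rating, the bi-encoder cosine, and the two annotators' keep/discard votes have already been computed and stored per pair; the `Pair` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Pair:
    llm_rating: int                  # integer rating in 1..5 from the LLM judge
    cosine: float                    # bi-encoder cosine similarity of the two texts
    keep_votes: Tuple[bool, bool]    # manual screening votes of the two annotators


def strict_subset(pairs: List[Pair]) -> List[Pair]:
    """Keep pairs passing both automatic filters and both annotators."""
    kept = []
    for p in pairs:
        passes_llm = p.llm_rating > 4      # strictly greater than 4
        passes_embedding = p.cosine > 0.5  # cosine must exceed 0.5
        if passes_llm and passes_embedding and all(p.keep_votes):
            kept.append(p)
    return kept


candidates = [
    Pair(5, 0.8, (True, True)),    # passes everything
    Pair(4, 0.9, (True, True)),    # fails the LLM-judge filter
    Pair(5, 0.4, (True, True)),    # fails the embedding filter
    Pair(5, 0.7, (True, False)),   # annotators disagree
]
kept_pairs = strict_subset(candidates)
```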

### 4.3 Quality of the Strict Subset

Table [2](https://arxiv.org/html/2605.20588#S4.T2) reports the size of the original release and of the re-verified subset, together with two automatic agreement signals – mean Qwen3-8B rating and mean Sentence-BERT cosine similarity – before and after re-verification, per language pair.

Table 2: Original vs. re-verified cross-lingual evaluation pairs. Per-pair sizes and text-level agreement before and after the procedure of §[4.2](https://arxiv.org/html/2605.20588#S4.SS2). Qwen3-8B rating is on a 1–5 scale (5 = most likely the same event); S-BERT cosine is the cosine similarity of the multilingual MiniLM-L12-v2 embeddings. Both report the mean over the (sub)set.

The strict subset is consistently smaller and shows higher agreement: across the three language pairs, roughly 8% of the released pairs survive on average, the mean Qwen3-8B rating rises from 2.11 to 4.54 out of 5, and the mean Sentence-BERT cosine increases from 0.63 to 0.77. The two automatic signals do not, however, carry the same weight. Sentence-BERT cosine barely moves on two of the three pair sets ($0.60\to 0.65$ on $\text{ASL}\leftrightarrow\text{DGS}$, $0.65\to 0.67$ on $\text{CSL}\leftrightarrow\text{DGS}$), so the bi-encoder cannot reliably tell content-aligned from content-divergent text pairs at the granularity we need. The Qwen3-8B rating, in contrast, jumps sharply on every pair set (e.g. $1.97\to 4.95$ on $\text{ASL}\leftrightarrow\text{CSL}$) and is what does the substantive verification; we keep the bi-encoder filter as a cheap guard against LLM-judge errors.

We will release the strict splits to support future comparison; in our experiments, we use the “Inan” column of Table[2](https://arxiv.org/html/2605.20588#S4.T2)to evaluate models on the full released set in §[5\.2](https://arxiv.org/html/2605.20588#S5.SS2)and the “Strict” column in §[5\.3](https://arxiv.org/html/2605.20588#S5.SS3)\.

![Refer to caption](https://arxiv.org/html/2605.20588v1/figures/strict.png)
Figure 4: CSL-ASL pairs from the strict subset. One removed pair is also shown for comparison.

## 5 Experiments

We evaluate on three $\text{sign}\leftrightarrow\text{sign}$ test sets. The *BT-input* set consists of held-out (text, sign) pairs from each source corpus, passed through the same back-translation pipeline used at training time (§[2](https://arxiv.org/html/2605.20588#S2)): the held-out text is translated and fed to our T2S model to produce a synthetic source clip, which is paired with the gold target sign in another sign language. The Inan-Full set is the full release from Inan et al. ([2025](https://arxiv.org/html/2605.20588#bib.bib1)), whose alignment is noisy by their own report (§[4](https://arxiv.org/html/2605.20588#S4)). The Inan-Strict set is our re-verified subset of Inan-Full (§[4](https://arxiv.org/html/2605.20588#S4)), in which source and target texts are independently checked to confirm fidelity.

#### Splits.

T2S training and the gold-target side of the BT corpus (§[2.3](https://arxiv.org/html/2605.20588#S2.SS3)) use only the official *train* splits of How2Sign, CSL-Daily, and Phoenix-2014T; the BT-input test set draws its held-out pairs from the official *test* splits of those same corpora. By construction, no BT-input clip appears in T2S training or as a BT gold target. Inan-Full and Inan-Strict are pooled across the train, dev, and test releases of Inan et al. ([2025](https://arxiv.org/html/2605.20588#bib.bib1)) (§[4.2](https://arxiv.org/html/2605.20588#S4.SS2)) since the strict subset is used purely for evaluation in our pipeline.

We report sign quality primarily via Procrustes-aligned dynamic-time-warping MPJPE (DTW-PA-MPJPE) (Duarte et al., [2021](https://arxiv.org/html/2605.20588#bib.bib16)). We use MPJPE as the primary metric because pose-distance metrics correlate more reliably with human judgement of sign-production quality than BLEU computed by decoding predicted signs through a downstream S2T model (Jiang et al., [2025](https://arxiv.org/html/2605.20588#bib.bib38)). For comparability with Inan et al. ([2025](https://arxiv.org/html/2605.20588#bib.bib1)), we additionally report BLEU-1/4 (Papineni et al., [2002](https://arxiv.org/html/2605.20588#bib.bib35); Post, [2018](https://arxiv.org/html/2605.20588#bib.bib37)): predicted sign clips are passed through a Sign Language Transformer (Camgöz et al., [2020](https://arxiv.org/html/2605.20588#bib.bib13)) trained per target language and the resulting text is scored against the gold target text (word-level for ASL/DGS, character-level for CSL), following the protocol of Saunders et al. ([2020](https://arxiv.org/html/2605.20588#bib.bib7)).³

³ This evaluation step uses a separately-trained sign-to-text model purely as an evaluator; it is unrelated to the back-translation data-augmentation procedure of §[2](https://arxiv.org/html/2605.20588#S2).
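To make the pose-distance metric concrete, here is a simplified sketch of DTW-aligned MPJPE in pure Python. It makes several assumptions that may differ from the evaluation actually used: the Procrustes alignment step is omitted, the accumulated cost is normalised by the warping-path length, and each frame is a list of 3D joint coordinates. Function names are illustrative.

```python
import math
from typing import List, Sequence

Frame = Sequence[Sequence[float]]  # one frame: a list of joints, each (x, y, z)


def mpjpe(a: Frame, b: Frame) -> float:
    """Mean per-joint position error between two frames with the same joints."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def dtw_mpjpe(pred: List[Frame], gold: List[Frame]) -> float:
    """Average per-frame MPJPE along the cheapest monotonic alignment."""
    n, m = len(pred), len(gold)
    inf = float("inf")
    # cost[i][j]: cheapest accumulated cost aligning pred[:i] with gold[:j];
    # steps[i][j]: number of frame pairs on that cheapest path.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    steps = [[0] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_cost, best_steps = min(
                (cost[i - 1][j - 1], steps[i - 1][j - 1]),  # advance both
                (cost[i - 1][j], steps[i - 1][j]),          # advance pred only
                (cost[i][j - 1], steps[i][j - 1]),          # advance gold only
            )
            cost[i][j] = best_cost + mpjpe(pred[i - 1], gold[j - 1])
            steps[i][j] = best_steps + 1
    return cost[n][m] / steps[n][m]


# A prediction that holds the first pose one frame longer than the reference.
gold_clip = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)]]
pred_clip = [[(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)]]
timing_only_error = dtw_mpjpe(pred_clip, gold_clip)
```

The example shows why the warping step matters: a clip that differs from the reference only in timing incurs no pose error after alignment.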

#### Implementation\.

The model is initialized from mBART-large-cc25 (Liu et al., [2020](https://arxiv.org/html/2605.20588#bib.bib18)) and trained jointly on T2S and S2S in a single stage (§[3.2](https://arxiv.org/html/2605.20588#S3.SS2)). We use AdamW (Kingma and Ba, [2015](https://arxiv.org/html/2605.20588#bib.bib41)) with learning rate $2\times10^{-4}$ cosine-annealed to $1\times10^{-6}$, weight decay $0$, and gradient clipping at $1.0$, for $150$ epochs at an effective batch size of $256$ on 2× NVIDIA H100 GPUs. The VQ-VAE sign tokenizer is trained separately following Zuo et al. ([2025](https://arxiv.org/html/2605.20588#bib.bib11)) and frozen during multi-task training.
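As an illustration of the learning-rate schedule, the sketch below evaluates the standard closed-form cosine annealing curve between the two stated endpoints. The function name `cosine_annealed_lr`, the absence of warm-up, and the per-step granularity are assumptions; the paper states only the start and end values.

```python
import math


def cosine_annealed_lr(step, total_steps, lr_max=2e-4, lr_min=1e-6):
    """Closed-form cosine annealing from lr_max (step 0) to lr_min (last step).
    No warm-up is modelled here; that is an assumption of this sketch."""
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint, and end of a hypothetical 1000-step run.
schedule = [cosine_annealed_lr(s, 1000) for s in (0, 500, 1000)]
```

Halfway through training the rate sits at the arithmetic mean of the two endpoints, and it decays most quickly around that midpoint.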

#### Baselines\.

We compare against two main baselines plus a zero-shot S2S sanity check used in the ablation of §[5.4](https://arxiv.org/html/2605.20588#S5.SS4). *Inan et al. ([2025](https://arxiv.org/html/2605.20588#bib.bib1))*, the only published direct S2S system, trains a pose-to-pose model in the style of Saunders et al. ([2020](https://arxiv.org/html/2605.20588#bib.bib7)) with a CTC (Graves et al., [2006](https://arxiv.org/html/2605.20588#bib.bib42)) gloss auxiliary head; we copy their reported BLEU numbers from their Table 5 and score our predictions through a Sign Language Transformer evaluator built from the same codebase (Camgöz et al., [2020](https://arxiv.org/html/2605.20588#bib.bib13)) (see footnote 4). The *cascade* chains a pretrained external Sign Language Transformer (Camgöz et al., [2020](https://arxiv.org/html/2605.20588#bib.bib13)), TranslateGemma 4B (Finkelstein et al., [2026](https://arxiv.org/html/2605.20588#bib.bib21)) as the off-the-shelf MT system, and our T2S head. *Zero-shot S2S* takes our backbone trained only on T2S and evaluates it on S2S at inference time by feeding the source sign-token sequence to the encoder in place of text: the model has never seen any S2S pair during training, and the encoder has never seen sign-token inputs. This sanity check tests whether the BT-constructed training corpus from §[2](https://arxiv.org/html/2605.20588#S2) is necessary, i.e. that direct S2S does not fall out of T2S training alone.

Footnote 4: Our S2T evaluator uses the same architecture as theirs but with re-tuned hyperparameters, because the training configuration in Inan et al. ([2025](https://arxiv.org/html/2605.20588#bib.bib1)) is under-specified, and it attains a different upper bound on gold-sign inputs. Absolute BLEU scores between their reported column and ours are therefore not strictly comparable; per-direction deltas measured under a single evaluator (e.g. direct vs. cascade in Table [3](https://arxiv.org/html/2605.20588#S5.T3)) are.

### 5.1 Results on the BT-Input Test Set

We first compare direct S2S against the cascade on the BT-input test set, where every source clip is constructed by the same procedure as in training and held out at training time. Table [3](https://arxiv.org/html/2605.20588#S5.T3) reports per-direction DTW-PA-MPJPE and BLEU-4 for all six $(s, s')$ pairs across $\{\text{ASL}, \text{CSL}, \text{DGS}\}$. Direct S2S achieves lower DTW-PA-MPJPE than the cascade in every direction, averaging $6.63$ vs. $8.20$, a $19\%$ relative reduction. It is also roughly $2.3\times$ faster in wall-clock time, averaging $7.23$ ms per example against the cascade's $16.31$ ms on the same single-GPU configuration.
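The two headline ratios follow directly from the averages quoted above. The short check below recomputes them; the helper names `relative_reduction` and `speedup` are ours, introduced only for illustration.

```python
def relative_reduction(baseline, ours):
    """Fractional improvement of `ours` over `baseline` for a lower-is-better metric."""
    return (baseline - ours) / baseline


def speedup(baseline_ms, ours_ms):
    """How many times faster `ours` is than `baseline`, given per-example latencies."""
    return baseline_ms / ours_ms


# Averages reported in the text: DTW-PA-MPJPE 8.20 (cascade) vs. 6.63 (direct),
# latency 16.31 ms (cascade) vs. 7.23 ms (direct).
mpjpe_gain = relative_reduction(8.20, 6.63)   # about 0.191, i.e. roughly 19%
latency_gain = speedup(16.31, 7.23)           # about 2.26, i.e. roughly 2.3x
```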

Table 3: Direct vs. cascaded S2S on the BT-input test set. Rows are methods; columns are direction–metric pairs. PA = DTW-PA-MPJPE (lower is better; averaged over all joints); B4 = BLEU-4 (higher is better); Lat = wall-clock latency per example (ms) on a single GPU (lower is better).

### 5.2 Comparison with Inan et al. ([2025](https://arxiv.org/html/2605.20588#bib.bib1)) on Inan-Full

We evaluate on the Inan-Full set as a benchmark anchor against the only previously reported direct-S2S results. Table [4](https://arxiv.org/html/2605.20588#S5.T4) places our predictions, scored through our re-tuned S2T evaluator, alongside the numbers Inan et al. ([2025](https://arxiv.org/html/2605.20588#bib.bib1)) list in their Table 5; as noted in footnote [4](https://arxiv.org/html/2605.20588#footnote4), the two evaluators are not strictly matched, so absolute scores between their column and ours should be read with that caveat. The within-evaluator comparison in Table [3](https://arxiv.org/html/2605.20588#S5.T3) (direct S2S vs. our cascade, both scored by the same evaluator) is the comparison we treat as primary. Even with the evaluator caveat, the gap on Inan-Full is large: our BLEU-4 exceeds the reported value in every direction, and the two directions where Inan et al. ([2025](https://arxiv.org/html/2605.20588#bib.bib1)) report zero BLEU-4, $\text{ASL}\to\text{CSL}$ and $\text{DGS}\to\text{CSL}$, rise to $9.01$ and $7.80$.

Table 4: Comparison with the prior direct $\text{sign}\leftrightarrow\text{sign}$ system of Inan et al. ([2025](https://arxiv.org/html/2605.20588#bib.bib1)) on the Inan-Full test pairs. Rows are methods; columns are direction–metric pairs. B1 = BLEU-1, B4 = BLEU-4. Numbers under Inan et al. ([2025](https://arxiv.org/html/2605.20588#bib.bib1)) are copied from their Table 5.

### 5.3 Results on Inan-Strict

We next evaluate on Inan-Strict, the re-verified subset of Inan-Full whose construction is described in §[4.2](https://arxiv.org/html/2605.20588#S4.SS2) (per-pair sizes in Table [2](https://arxiv.org/html/2605.20588#S4.T2)). Each surviving text-level match between the two corpora typically corresponds to several sign-clip pairs, since each text is recorded by multiple signers: a re-verified text pair $(x_l, x_{l'})$ in which $x_l$ has $a$ source-side sign clips and $x_{l'}$ has $b$ target-side sign clips yields $a \times b$ candidate clip pairs $(z_s, z_{s'})$ that all share the same meaning. To prevent any single $(x_l, x_{l'})$ instance from dominating the score, we evaluate each direction with the source-side clip $z_s$ as the *anchor*: every unique source clip is used exactly once as the model's input, the model produces one output, and that output is scored against *all* target clips that share the input's text match, with the best of those scores retained. The anchor side flips between the two directions of a language pair (e.g. $\text{ASL}\to\text{CSL}$ anchors on ASL clips and $\text{CSL}\to\text{ASL}$ on CSL clips), so the per-direction anchor counts reported in the top row of Table [5](https://arxiv.org/html/2605.20588#S5.T5) are asymmetric across an unordered pair. We further note that the $\text{ASL}\leftrightarrow\text{DGS}$ subset survives re-verification with only $6$ text pairs (Table [2](https://arxiv.org/html/2605.20588#S4.T2)); its two directions are reported for completeness but are not informative on their own.
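The anchor-based protocol described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' evaluation code: `evaluate_anchored`, the `(source, target)` pair format, and the `best` argument are hypothetical names, and sources are assumed to be hashable identifiers.

```python
def evaluate_anchored(candidate_pairs, model, score, best=min):
    """Score each unique source clip once against all targets sharing its text match.

    candidate_pairs: iterable of (source, target) clip pairs, i.e. the a x b pairs.
    model:           callable mapping a source clip to one predicted target clip.
    score:           callable (prediction, target) -> float.
    best:            min for distance metrics such as MPJPE, max for BLEU-like metrics.
    """
    targets_by_source = {}
    for source, target in candidate_pairs:
        targets_by_source.setdefault(source, []).append(target)

    results = {}
    for source, targets in targets_by_source.items():
        prediction = model(source)  # one forward pass per anchor clip
        results[source] = best(score(prediction, t) for t in targets)
    return results


# Toy usage with scalar stand-ins for clips: anchor "a1" has three matching targets.
toy_pairs = [("a1", 12.0), ("a1", 9.0), ("a1", 30.0), ("a2", 25.0)]
toy_model = lambda s: {"a1": 10.0, "a2": 20.0}[s]
toy_scores = evaluate_anchored(toy_pairs, toy_model, lambda p, t: abs(p - t))
```

Because each anchor contributes exactly one retained score, a text pair with many recordings on the target side adds more references for that anchor but no additional weight in the average.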

Table 5: Performance on Inan-Strict. Rows are methods; columns are direction–metric pairs. The first row reports the number of unique source-side anchor clips per direction (§[5.3](https://arxiv.org/html/2605.20588#S5.SS3)).

### 5.4 Ablations

#### Joint training vs\. single\-task\.

Our model trains T2S and S2S jointly in a single stage (§[3.2](https://arxiv.org/html/2605.20588#S3.SS2)). To isolate the joint-training gain, we compare against a T2S-only baseline trained on the same backbone with the same data and schedule but without any S2S supervision, and evaluate that baseline both on its native T2S task and zero-shot on S2S by feeding the source sign-token sequence to the encoder in place of a text input. Table [6](https://arxiv.org/html/2605.20588#S5.T6) reports S2S quality on the BT-input test set and T2S quality averaged across the three corpora. As expected, the zero-shot S2S row is poor (DTW-PA-MPJPE $8.96$ vs. joint $6.63$; BLEU-4 $4.22$ vs. joint $10.74$), confirming that the BT-constructed training corpus is the source of S2S quality. The reverse direction also benefits, if mildly: adding S2S supervision lowers T2S DTW-PA-MPJPE from $5.19$ to $4.83$ and lifts BLEU-4 from $12.29$ to $12.98$, so the two tasks support one another rather than trading off.

Table 6: Joint vs. single-task training. Rows are training variants; columns are sign-quality metrics on each task. S2S is averaged over the six BT-input directions; T2S is averaged across $\{\text{How2Sign}, \text{CSL-Daily}, \text{Phoenix-2014T}\}$. The S2S columns of the T2S-only row report zero-shot S2S performance, obtained by feeding the source sign-token sequence to the encoder of a checkpoint trained only on text inputs.

Table 7: Real vs. synthetic source signs at evaluation time. Rows are evaluation-time source variants; columns are S2S sign-quality metrics on Inan-Strict. The real-source row reproduces the default setup; the synthetic-source row replaces each test source with a T2S-generated clip from the gold translated text, matching the train-time distribution.

#### Real vs\. synthetic source signs at evaluation time\.

By construction (§[2.2](https://arxiv.org/html/2605.20588#S2.SS2)), every S2S source seen during training is a synthetic clip produced by our T2S model from a translated text, whereas at evaluation on Inan-Full and Inan-Strict the source is a *real* sign clip tokenised through the frozen VQ-VAE. The naive expectation is that matching the train-time distribution at test time helps. However, when we additionally evaluate the same checkpoint with each source replaced by a T2S-generated clip from the gold translated text, Table [7](https://arxiv.org/html/2605.20588#S5.T7) shows the opposite: real sources outperform synthetic ones (DTW-PA-MPJPE $5.46$ vs. $5.97$; BLEU-4 $10.25$ vs. $8.33$). The explanation is that synthetic clips inherit the residual noise of T2S generation while real clips do not, so swapping a noisy training-distribution input for a cleaner real one makes the conditional easier rather than harder. The model trained on noisy synthetic sources is therefore not bottlenecked at deployment by a train/test source gap; if anything, the gap works in its favour.

## 6 Related Work

#### Sign\-to\-text and text\-to\-sign translation\.

A long line of work translates sign video to spoken-language text (S2T), beginning with neural sign translation (Camgöz et al., [2018](https://arxiv.org/html/2605.20588#bib.bib12)) and continuing through joint recognition–translation transformers (Camgöz et al., [2020](https://arxiv.org/html/2605.20588#bib.bib13)), stronger sequence decoders (Yin and Read, [2020](https://arxiv.org/html/2605.20588#bib.bib47)), and gloss-free end-to-end models (Lin et al., [2023](https://arxiv.org/html/2605.20588#bib.bib14)). The reverse direction, T2S production, has progressed from GAN-based pose synthesis (Stoll et al., [2018](https://arxiv.org/html/2605.20588#bib.bib9)) through progressive transformers (Saunders et al., [2020](https://arxiv.org/html/2605.20588#bib.bib7)), mixtures of motion primitives (Saunders et al., [2021](https://arxiv.org/html/2605.20588#bib.bib8)), and gloss-mediated word-level synthesis (Zelinka and Kanis, [2020](https://arxiv.org/html/2605.20588#bib.bib10)), with more recent work moving to discrete sign-token generation under a multilingual backbone (Zuo et al., [2025](https://arxiv.org/html/2605.20588#bib.bib11)). Spoken-language MT (Liu et al., [2020](https://arxiv.org/html/2605.20588#bib.bib18); Costa-jussà et al., [2022](https://arxiv.org/html/2605.20588#bib.bib20); Finkelstein et al., [2026](https://arxiv.org/html/2605.20588#bib.bib21)) provides the bridge that lets these two directions compose into a cascade, which we use as one of our baselines. SOKE (Zuo et al., [2025](https://arxiv.org/html/2605.20588#bib.bib11)) provides the representation and backbone substrate for our model: the three-stream sign tokenizer and the mBART-based encoder–decoder.
Building on this substrate, our contribution is twofold: (i) a cross-lingual sign back-translation corpus that supplies supervised S2S training data without requiring natural parallel pairs (§[2](https://arxiv.org/html/2605.20588#S2)); and (ii) a direct S2S model that uses spoken-language text only as a *training-time scaffold* and never at inference, in contrast to the cascade above, where text is an obligatory inference-time bottleneck. Sign-side back-translation has previously been explored within a single sign-language–spoken-language pair to manufacture additional sign/text data for S2T (Zhou et al., [2021](https://arxiv.org/html/2605.20588#bib.bib15); Moryossef et al., [2021](https://arxiv.org/html/2605.20588#bib.bib32)); we extend that idea to the cross-lingual sign setting in §[2.2](https://arxiv.org/html/2605.20588#S2.SS2).

#### Multilingual sign processing\.

Recent work treats sign-language processing as multilingual along the sign↔spoken-text axis: MLSLT (Yin et al., [2022](https://arxiv.org/html/2605.20588#bib.bib2)) translates sign video to spoken text across multiple sign languages in one model; JWSign (Gueuwou et al., [2023](https://arxiv.org/html/2605.20588#bib.bib3)) contributes a typologically diverse multilingual resource; WMT-SLT (Müller et al., [2023](https://arxiv.org/html/2605.20588#bib.bib4)) benchmarks one pair at a time; and Jiang et al. ([2023](https://arxiv.org/html/2605.20588#bib.bib33)) target the written SignWriting notation. Direct S2S is a complementary thread, and position pieces (Yin et al., [2021](https://arxiv.org/html/2605.20588#bib.bib5); De Coster et al., [2023](https://arxiv.org/html/2605.20588#bib.bib6)) argue that sign-language tools should not require a spoken-language detour for DHH users.

## 7 Limitations and Future Work

#### Parallel data and evaluation\.

No in-the-wild parallel $\text{sign}\leftrightarrow\text{sign}$ corpus exists: every S2S training pair is synthetic on the source side (§[2](https://arxiv.org/html/2605.20588#S2)), and our strict evaluation subset (§[4.2](https://arxiv.org/html/2605.20588#S4.SS2)) is built from re-verified text alignments rather than directly aligned signs. BLEU-4 also routes through an external S2T evaluator and is an imperfect proxy for sign-production quality (Jiang et al., [2025](https://arxiv.org/html/2605.20588#bib.bib38)). *Future work*: a small human-aligned cross-lingual sign corpus across $\{\text{ASL}, \text{CSL}, \text{DGS}\}$ would yield both a cleaner training signal and a benchmark independent of BT and S2T decoding.

#### Deaf\-community engagement\.

This work was carried out without continuous Deaf-community partnership: qualitative judgements were not reviewed by DHH signers, and we do not yet report human evaluation of generated signs. *Future work*: a DHH-led evaluation loop scoring comprehensibility, naturalness, and fidelity across ASL, CSL, and DGS.

#### Simultaneous S2S for live conversation.

Our model is offline and batched: the full source clip must be observed before any target is produced, which is incompatible with conversational latency and ignores the incremental nature of sign discourse. *Future work*: streaming S2S with a causal encoder and a wait-$k$ decoding policy.
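For readers unfamiliar with wait-$k$ decoding, the sketch below shows the read/write schedule of the standard policy from simultaneous text translation: the decoder first waits for $k$ source tokens, then alternates between emitting one target token and reading one more source token. It illustrates the general policy only; the paper does not specify how it would be applied to sign tokens, and the function name `wait_k_read_counts` is hypothetical.

```python
def wait_k_read_counts(source_len, target_len, k):
    """Number of source tokens visible before emitting each target token.

    For the t-th target token (0-indexed), the decoder has read
    min(k + t, source_len) source tokens.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return [min(k + t, source_len) for t in range(target_len)]


# With 5 source tokens and k = 2, the first target token is emitted after
# reading 2 source tokens, and the full source is visible from the 4th on.
visible = wait_k_read_counts(source_len=5, target_len=6, k=2)
```

A small $k$ lowers latency at the cost of less source context per emitted token, which is the trade-off a streaming S2S system would need to tune.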

#### Broader impacts\.

Direct S2S could support cross-lingual DHH communication; risks include over-confident outputs and the historical exclusion of Deaf signers from sign-language ML. The corpora used are governed by their released licenses, and we will release the strict subset with documentation.

## References

- N. C. Camgöz, S. Hadfield, O. Koller, H. Ney, and R. Bowden (2018) Neural sign language translation. In CVPR, pp. 7784–7793.
- N. C. Camgöz, O. Koller, S. Hadfield, and R. Bowden (2020) Sign language transformers: joint end-to-end sign language recognition and translation. In CVPR, pp. 10023–10033.
- M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, et al. (2022) No language left behind: scaling human-centered machine translation. arXiv:2207.04672.
- M. De Coster, D. Shterionov, M. Van Herreweghe, and J. Dambre (2023) Machine translation from signed to spoken languages: state of the art and challenges. Universal Access in the Information Society, pp. 1–27.
- A. Duarte, S. Palaskar, L. Ventura, D. Ghadiyaram, K. DeHaan, F. Metze, J. Torres, and X. Giró-i-Nieto (2021) How2Sign: a large-scale multimodal dataset for continuous American Sign Language. In CVPR.
- S. Edunov, M. Ott, M. Auli, and D. Grangier (2018) Understanding back-translation at scale. In EMNLP.
- M. Finkelstein, I. Caswell, T. Domhan, J. Peter, J. Juraska, P. Riley, D. Deutsch, C. Dilanni, C. Cherry, E. Briakou, E. Nielsen, J. Luo, S. Agrawal, W. Xu, E. Kats, S. Jaskiewicz, M. Freitag, and D. Vilar (2026) TranslateGemma technical report. arXiv:2601.09012.
- A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, pp. 369–376.
- S. Gueuwou, S. Siake, C. Leong, and M. Müller (2023) JWSign: a highly multilingual corpus of bible translations for more diversity in sign language processing. In Findings of EMNLP, pp. 9907–9927.
- M. Inan, Y. Zhong, V. Ganesh, and M. Alikhani (2025) How to align multiple signed language corpora for better sign-to-sign translations? In Proceedings of NAACL-HLT (Long Papers), pp. 4003–4016. DOI: [10.18653/v1/2025.naacl-long.202](https://dx.doi.org/10.18653/v1/2025.naacl-long.202).
- Y. Jia, M. T. Ramanovich, T. Remez, and R. Pomerantz (2022) Translatotron 2: high-quality direct speech-to-speech translation with voice preservation. In ICML.
- Y. Jia, R. J. Weiss, F. Biadsy, W. Macherey, M. Johnson, Z. Chen, and Y. Wu (2019) Direct speech-to-speech translation with a sequence-to-sequence model. In Interspeech.
- Z. Jiang, C. Leong, A. Moryossef, O. Cory, M. Ivashechkin, N. Tarigopula, B. Zhang, A. Göhring, A. Rios, R. Sennrich, and S. Ebling (2025) Meaningful pose-based sign language evaluation. In Proceedings of the Tenth Conference on Machine Translation (WMT), pp. 64–80.
- Z. Jiang, A. Moryossef, M. Müller, and S. Ebling (2023) Machine translation between spoken languages and signed languages represented in SignWriting. In Findings of EACL, pp. 1706–1724.
- D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR.
- K. Lin, X. Wang, L. Zhu, K. Sun, B. Zhang, and Y. Yang (2023) Gloss-free end-to-end sign language translation. In ACL, pp. 12904–12916.
- Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer (2020) Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics 8, pp. 726–742.
- A. Moryossef, K. Yin, G. Neubig, and Y. Goldberg (2021) Data augmentation for sign language gloss translation. In Workshop on Automatic Translation for Signed and Spoken Languages (AT4SSL), pp. 1–11.
- M. Müller, M. Alikhani, E. Avramidis, R. Bowden, A. Braffort, N. C. Camgöz, S. Ebling, C. España-Bonet, A. Göhring, R. Grundkiewicz, M. Inan, Z. Jiang, O. Koller, A. Moryossef, A. Rios, D. Shterionov, S. Sidler-Miserez, K. Tissi, and D. Van Landuyt (2023) Findings of the second WMT shared task on sign language translation (WMT-SLT23). In Proceedings of the Eighth Conference on Machine Translation (WMT), pp. 68–94.
- K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In ACL.
- M. Post (2018) A call for clarity in reporting BLEU scores. In Conference on Machine Translation.
- Qwen Team (2025) Qwen3 technical report. arXiv:2505.09388.
- N. Reimers and I. Gurevych (2019) Sentence-BERT: sentence embeddings using Siamese BERT-networks. In EMNLP.
- P. K. Rubenstein, C. Asawaroengchai, D. D. Nguyen, A. Bapna, Z. Borsos, F. de Chaumont Quitry, P. Chen, D. E. Badawy, W. Han, E. Kharitonov, et al. (2023) AudioPaLM: a large language model that can speak and listen. arXiv:2306.12925.
- B. Saunders, N. C. Camgöz, and R. Bowden (2020) Progressive transformers for end-to-end sign language production. In ECCV, pp. 687–705.
- B. Saunders, N. C. Camgöz, and R. Bowden (2021) Mixed SIGNals: sign language production via a mixture of motion primitives. In ICCV, pp. 1919–1929.
- R. Sennrich, B. Haddow, and A. Birch (2016) Improving neural machine translation models with monolingual data. In ACL.
- S. Stoll, N. C. Camgöz, S. Hadfield, and R. Bowden (2018) Sign language production using neural machine translation and generative adversarial networks. In BMVC.
- A. van den Oord, O. Vinyals, and K. Kavukcuoglu (2017) Neural discrete representation learning. In NeurIPS.
- A. Yin, Z. Zhao, W. Jin, M. Zhang, X. Zeng, and X. He (2022) MLSLT: towards multilingual sign language translation. In CVPR, pp. 5109–5119.
- K. Yin, A. Moryossef, J. Hochgesang, Y. Goldberg, and M. Alikhani (2021) Including signed languages in natural language processing. In ACL-IJCNLP, pp. 7347–7360.
- K. Yin and J. Read (2020) Better sign language translation with STMC-transformer. In COLING, pp. 5975–5989.
- J. Zelinka and J. Kanis (2020) Neural sign language synthesis: words are our glosses. In WACV, pp. 3384–3392.
- H. Zhou, W. Zhou, W. Qi, J. Pu, and H. Li (2021) Improving sign language translation with monolingual data by sign back-translation. In CVPR, pp. 1316–1325.
- R. Zuo, R. A. Potamias, E. Ververas, J. Deng, and S. Zafeiriou (2025) Signs as tokens: a retrieval-enhanced multilingual sign language generator. In ICCV.
