SignX: Continuous Sign Recognition in Compact Pose-Rich Latent Space
Summary
SignX is a framework for continuous sign language recognition that unifies heterogeneous pose formats into a compact latent space and reports state-of-the-art accuracy with a nearly 50× speedup over pixel-space baselines.
# SignX: Continuous Sign Recognition in Compact Pose-Rich Latent Space
Source: https://arxiv.org/html/2504.16315
Yalin Feng (https://orcid.org/0009-0000-8932-1545, Equal Contribution), Chunyu Sui (https://orcid.org/0009-0008-0497-3463), Hongbin Zhong (https://orcid.org/0009-0003-2564-9674), Yanxin Zhang (https://orcid.org/0009-0001-2307-901X), Hongwei Yi (https://orcid.org/0000-0001-8669-2009), Hezhen Hu (https://orcid.org/0000-0003-0327-1562), Dimitris N. Metaxas (https://orcid.org/0000-0001-7142-7640)

Institutes: 1 Rutgers University; 2 Nanyang Technological University; 3 Columbia University; 4 Georgia Institute of Technology; 5 University of Wisconsin–Madison; 6 Max Planck Institute for Intelligent Systems; 7 University of Texas at Austin

https://signerx.github.io/SignX
###### Abstract
The complexity of Sign Language (SL) data processing brings many challenges. The current approach to recognition of SL signs aims to translate RGB sign language videos through pose information into Word-based ID Glosses, which serve to uniquely identify signs¹. This paper proposes SignX, a novel framework for continuous sign language recognition (SLR) in a compact pose-rich latent space. First, we construct a unified latent representation that encodes heterogeneous pose formats (SMPLer-X, DWPose, Mediapipe, PrimeDepth, and Sapiens Segmentation) into a compact, information-dense space. Second, we train a ViT-based Video-to-Pose module to extract this latent representation directly from raw videos. Finally, we develop a temporal modeling and sequence refinement method that operates entirely in this latent space. This multi-stage design achieves end-to-end SLR while significantly reducing computational consumption. Experimental results demonstrate that SignX achieves SOTA accuracy on continuous SLR and translation tasks, delivering nearly a 50-fold acceleration over pixel-space baselines.

¹ Note that there is no shared convention for assigning such glosses to SL signs, so consistent glossing conventions must be used across all datasets.
## 1 Introduction
Sign Language Recognition (SLR) aims to automatically convert sign language videos into text or glosses (represented as upper-case text, which serve as unique identifiers of signs). It has important social value in facilitating barrier-free communication between deaf and hearing individuals [camgoz2018neural,camgoz2020sign,stoll2020text2sign,yin-etal-2021-including-signed-languages]. However, SLR faces three technical challenges: (1) Signed languages are complex multimodal movements, involving hand and arm movements as well as facial expressions and body postures, with complex temporal dependencies among these modalities [Bohacek_2022_WACV,signor,SIGNUM]. (2) Existing SL datasets are limited in size and lack a uniform preprocessing format, with complicated processing pipelines, which severely constrains the application of deep learning methods [duarte2021how2sign,JAsigning,camgoz2020sign]. (3) Different pose processing formats attend to different visual aspects, leading to distributional shifts that limit effective integration of innovative advantages across model types [fang2025signllmsignlanguageproduction,fang2025signdiffdiffusionmodelamerican].
To address these challenges, we develop a Vid2Pose module using a novel ViT model [dosovitskiy2020vit] that is inspired by five powerful pose estimation models. As shown in Fig. 1 (https://arxiv.org/html/2504.16315#S1.F1), we learn the Vid2Pose process in the encoder of the ViT model through a unified pose representation. Specifically, SMPLer-X [cai2023smplerx] provides 3D body mesh with kinematic and pose parameters for modeling body dynamics; DWPose [yang2023effective] captures 2D keypoints with rich facial landmarks essential for grammatical expressions; MediaPipe [MediaPipe] offers 3D joint coordinates that represent overall body motion trends. Additionally, PrimeDepth [zavadski2024primedepth] provides spatial depth information for disambiguating front-back positioning, while Sapiens Segmentation [khirodkar2024sapiens] captures human body shape through fine-grained part boundaries. This complementary design captures not only hand movements but also fine-grained details such as body dynamics, facial expressions, motion trajectories, spatial depth, and body shape, with the goal of fully capturing a person's pose [saunders2020progressive,stoll2020text2sign,pose_format_helper]. This end-to-end processing approach greatly simplifies the pose representation process [Bohacek_2022_WACV,cho2021unifying,chiang2019unified] that is necessary for SL recognition.

Figure 1: Multimodal pose estimation methods: SMPLer-X [cai2023smplerx] can provide accurate 3D human body parameters; DWPose [yang2023effective] focuses on real-time 2D keypoint detection; Mediapipe [MediaPipe] provides lightweight but efficient 3D pose prediction; PrimeDepth [zavadski2024primedepth] can obtain scene depth information; while Sapiens Segmentation [khirodkar2024sapiens] provides fine-grained human body part segmentation results. These methods each have their own characteristics, providing rich feature representations for sign language recognition.
We then employ the Pose2Gloss method to identify the pose features in the latent space. Most existing recognition models, however, have several limitations: (1) Most learn from raw pose data, which is inconvenient in practical applications and requires assistance from other pose information extraction models [wang2018video,wang2018high]. (2) Models capable of real-time translation have low accuracy, while highly accurate models have slow output speeds [saunders2021mixed]. (3) Even purely pixel-space-based models are limited by high computational costs, and their attention mechanisms cannot support higher performance ceilings for longer and more accurate translations [cui2017recurrent,koller2020quantitative,gloss-informal,zelinka2020neural].
To address these limitations, we develop a recognition framework that operates entirely in the compact pose-rich latent space. Starting from the 2048-dimensional pose features extracted by Vid2Pose, we employ a ResNet34 backbone [7780459] followed by TemporalConv layers [10.1007/978-3-319-49409-8_7] to capture hierarchical temporal patterns. These are jointly learned with the pose features. To refine these predictions into coherent sequences, we apply a Transformer-based encoder-decoder [zhang2023sltunet] that performs sequence-to-sequence refinement with beam search and CTC regularization [camgoz2020sign,vaswani2017attention]. This design achieves accurate recognition directly in latent space, eliminating the need for raw pose processing and reducing computational overhead compared to vision-based approaches.
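The paragraph above mentions CTC regularization. As a rough illustration of the CTC alignment rule only (not the authors' decoder, which pairs a Transformer encoder-decoder with beam search), the following minimal sketch shows best-path CTC decoding: take the top class per frame, merge consecutive repeats, then drop blanks. The gloss vocabulary, the blank index, and the frame scores are all hypothetical.

```python
from itertools import groupby

BLANK = 0  # assumption: index 0 is the CTC blank symbol


def ctc_greedy_decode(frame_scores, vocab):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # Index of the highest-scoring class in each frame.
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    # Merge runs of identical consecutive predictions.
    collapsed = [idx for idx, _ in groupby(best)]
    # Remove blanks and map indices to glosses.
    return [vocab[idx] for idx in collapsed if idx != BLANK]


vocab = ["<blank>", "HELLO", "WORLD"]  # hypothetical gloss vocabulary
frame_scores = [
    [0.10, 0.80, 0.10],  # HELLO
    [0.20, 0.70, 0.10],  # HELLO again (merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.80],  # WORLD
    [0.60, 0.10, 0.30],  # blank
]
print(ctc_greedy_decode(frame_scores, vocab))  # ['HELLO', 'WORLD']
```

Because many frames map to one gloss, this collapse rule is what lets a frame-level model be trained without frame-level gloss alignments.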
Through this multi-stage training, SignX achieves powerful functionality while maintaining architectural simplicity: it learns from various formats of prior knowledge and can directly use the raw video as input. Experimental results show that SignX achieves good performance on our ASL and mainstream datasets [asllrp2025signbank,forster2012rwth,Zhou2021ImprovingSLT-with-monolingual-CSLDaily,WLASL], surpassing existing methods in both accuracy and robustness [Chen-arxiv-2023-Robust].
The main contributions of this paper can be summarized in the following three points:
- We propose SignX, a novel framework for continuous sign language recognition that operates in a compact pose-rich latent space, unifying heterogeneous pose representations from five powerful estimation methods into a single information-dense encoding.
- We develop a ViT-based Vid2Pose module that extracts unified pose representations (encompassing facial expressions, body dynamics, motion trajectories, spatial depth, and body shape) directly from raw videos in an end-to-end manner, eliminating the need for explicit pose estimation pipelines and significantly simplifying the workflow.
- We design a latent-space recognition method that combines ResNet temporal modeling with Transformer-based sequence refinement, achieving accurate sign recognition while reducing computational overhead compared to pixel-space approaches.
## 2 Related Work
Sign Recognition. While traditional SLR frameworks predominantly rely on feature extraction from high-dimensional pixel streams or raw skeletal coordinates, these approaches are often hindered by data noise and the difficulty of capturing fine-grained grammatical nuances. In contrast, diffusion models have recently redefined sign language generation by operating within compressed, structured latent spaces [fang2025stablesignerhierarchicalsign,fang2025signllmsignlanguageproduction]. Departing from existing SLR paradigms, our work introduces a novel recognition framework that performs inference directly within a sign-specific latent space. Inspired by the high-efficiency latent modeling in diffusion processes [rombach2021highresolution,fang2025streamflowtheoryalgorithmimplementation], we leverage a pose-rich latent domain to bypass the limitations of raw data processing. This shift allows our model to more effectively capture complex spatiotemporal dependencies and linguistic structures, establishing a new direction for robust and efficient ASL recognition.
Vision Transformer. Since the development of ViT [dosovitskiy2020vit], the use of Transformers in vision tasks has become increasingly widespread. Compared to traditional architectures, Transformers can better model long-range dependencies through self-attention mechanisms [saunders2021continuous,huang2021towards]. The original ViT first applied a pure Transformer structure to image classification, pioneering a new paradigm of vision Transformers. Subsequent improvements, such as Swin Transformer and other models, further enhanced model performance in vision tasks by introducing local attention mechanisms [camgoz2020sign,ko2019neural]. In the field of video event recognition, researchers have introduced spatiotemporal attention mechanisms, enabling ViT to better process video sequence data [Bohacek_2022_WACV,forster2012rwth]. This advance is particularly important for SL recognition, as SL videos contain rich temporal information. Through well-designed spatiotemporal attention mechanisms, models can effectively capture temporal dependencies in sequences of signing [koller15:cslr,saunders2021mixed].
Figure 2: Building Compact Pose-Rich Latent Space: Overall, we utilize ViT [dosovitskiy2020vit] to construct and accommodate a pose latent space. It has two entry points: a video entry at the top layer and a pose data entry in the middle section. (a) For training stage 1 (Sec. 3.2 (https://arxiv.org/html/2504.16315#S3.SS2)), we first train the pose fusion layer to output simple text information, ensuring that the learned pose representations are meaningful. (b) For training stage 2 (Sec. 3.3 (https://arxiv.org/html/2504.16315#S3.SS3)), we freeze all other components and only learn how RGB videos can be correctly converted into our pose features. For inference, only RGB is used as input, so we must ensure that we can encode RGB inputs into our pose features.
## 3 Building Compact Pose-Rich Latent Space
### 3.1 Data Processing
To address the challenge of integrating heterogeneous pose information for SL recognition, we first construct a comprehensive processing pipeline that standardizes multiple pose formats. First, we use the high-quality ASLLRP SignStream® 3 Corpus [neidle2022asl,asllrp2025signbank] dataset, which contains over 80 hours of American Sign Language (ASL) videos with synchronized front view, side view, and facial close-up recordings. For each input video $V \in \mathbb{R}^{T \times H \times W \times 3}$, where $T$ represents the number of frames and $H, W$ represent height and width, respectively, we extract sequential pose representations for each pose type: $P_{\text{DWPose}} \in \mathbb{R}^{T \times 384}$, $P_{\text{MediaPipe}} \in \mathbb{R}^{T \times 258}$, $P_{\text{SMPLer-X}} \in \mathbb{R}^{T \times 165}$, $P_{\text{PrimeDepth}} \in \mathbb{R}^{T \times 576}$, $P_{\text{Sapiens}} \in \mathbb{R}^{T \times 576}$. The dimensionality of each pose type reflects its specific information structure. For instance, DWPose contains 18 body keypoints, 21 keypoints for each hand, and 68 facial keypoints, along with their respective confidence scores, totaling 384 dimensions per frame.
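The DWPose figure of 384 dimensions can be checked with a short calculation. The keypoint counts below come from the text; the assumption that each keypoint is stored as an (x, y, confidence) triple is ours, chosen because it reproduces the stated total.

```python
# Keypoint counts stated in the text for DWPose.
DWPOSE_KEYPOINTS = {"body": 18, "left_hand": 21, "right_hand": 21, "face": 68}

# Assumption: each keypoint is stored as (x, y, confidence).
VALUES_PER_KEYPOINT = 3


def dwpose_frame_dim(keypoints=DWPOSE_KEYPOINTS, values_per_keypoint=VALUES_PER_KEYPOINT):
    """Return the flattened per-frame dimensionality."""
    return sum(keypoints.values()) * values_per_keypoint


print(dwpose_frame_dim())  # 128 keypoints * 3 values = 384
```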
As shown in Fig. 2 (https://arxiv.org/html/2504.16315#S2.F2), we flatten the information from each frame into a 1-dimensional list and concatenate the five types of poses to form an input of length 1959 (384 + 258 + 165 + 576 + 576). This input is passed through projection layers for simple transformation and dimension matching, and then through the fusion layer for multimodal integration. In this way, we obtain rich pose information for each frame. This standardized structure not only prevents different types of pose information from interfering with each other, but also makes heterogeneous types of pose information mutually compatible.
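As a minimal sketch of the flatten-and-concatenate step, assuming each pose stream is already available as a NumPy array of shape (T, D): the per-type sizes are taken from Sec. 3.1, while the function name `concat_pose_streams`, the clip length, and the random inputs are hypothetical stand-ins.

```python
import numpy as np

# Per-frame dimensionalities given in Sec. 3.1.
POSE_DIMS = {
    "DWPose": 384,
    "MediaPipe": 258,
    "SMPLer-X": 165,
    "PrimeDepth": 576,
    "Sapiens": 576,
}


def concat_pose_streams(streams):
    """Concatenate per-frame pose vectors along the feature axis.

    streams: dict mapping pose type -> array of shape (T, D_type).
    Returns an array of shape (T, sum of all D_type).
    """
    lengths = {arr.shape[0] for arr in streams.values()}
    if len(lengths) != 1:
        raise ValueError("all pose streams must cover the same number of frames")
    for name, dim in POSE_DIMS.items():
        if streams[name].shape[1] != dim:
            raise ValueError(f"{name}: expected {dim} features, got {streams[name].shape[1]}")
    # Fixed ordering keeps every pose type in its own slice of the vector.
    return np.concatenate([streams[name] for name in POSE_DIMS], axis=1)


T = 16  # hypothetical clip length in frames
rng = np.random.default_rng(0)
streams = {name: rng.standard_normal((T, dim)) for name, dim in POSE_DIMS.items()}
fused_input = concat_pose_streams(streams)
print(fused_input.shape)  # (16, 1959)
```

Keeping a fixed ordering means each pose type occupies a known slice of the 1959-dimensional vector, which is one simple way to keep the streams from interfering with each other before projection.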
### 3.2 Multimodal Pose Fusion
To ensure that the multi-track pose features are meaningful, we first run a simple training stage in which the fused pose features are converted into text. We modify the ViT path from the pose entry to the output layer so that it learns the mapping from the unified pose representation to natural language descriptions. The model consists of several key components: a pose encoder for multimodal fusion, intermediate layers for feature processing, and a text decoder for generating the final output.
We choose ViT for pose latent space construction because: (1) ViT's self-attention mechanism can effectively fuse heterogeneous pose representations from different extraction tools (SMPLer-X, DWPose, Mediapipe, PrimeDepth, Sapiens) into a unified latent space. (2) ViT's deep layered architecture naturally creates a hierarchical latent space that can compress rich, multi-source pose information into compact representations while preserving essential pose details, making it an ideal container for our pose-rich latent space. (3) Unlike specialized VAE encoders [he2022masked], ViT's transformer architecture can scale well to high-dimensional inputs, enabling us to replicate Stable Diffusion's success of operating in compact latent space for the sign language domain.
##### Multimodal Pose Fusion Layer.
The pose encoder $E_{\text{pose}}$ processes the five types of pose information with type-specific encoders and fuses them through a multi-head attention mechanism:

$$
\begin{aligned}
f_i &= E_i(P_i), \\
f_{\text{fused}} &= \text{MultiHeadAttention}(f_1, f_2, \dots, f_5)
\end{aligned}
\tag{1}
$$

where $E_i$ is the encoder for pose type $i$, and $f_i \in \mathbb{R}^{d_h}$ is the encoded feature. This fusion approach preserves the unique characteristics of each pose type, while allowing information sharing among them [ko2019neural,liu2022bevfusion,zhang2020fusionnet].
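Eq. (1) can be sketched in NumPy for a single frame. Several choices here are assumptions rather than details from the paper: the encoders $E_i$ are modeled as plain linear maps, the hidden size `D_H` and head count `NUM_HEADS` are arbitrary, all weights are random, and the fused output keeps one token per pose type (the text does not say whether $f_{\text{fused}}$ is pooled into a single vector).

```python
import numpy as np

POSE_DIMS = [384, 258, 165, 576, 576]  # per-frame sizes from Sec. 3.1
D_H = 64       # hypothetical hidden size d_h
NUM_HEADS = 4  # hypothetical number of attention heads

rng = np.random.default_rng(0)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


# Type-specific encoders E_i, modeled as linear maps (an assumption).
encoders = [rng.standard_normal((d, D_H)) / np.sqrt(d) for d in POSE_DIMS]

# Query, key, value, and output projections for self-attention over the five features.
W_q, W_k, W_v, W_o = (rng.standard_normal((D_H, D_H)) / np.sqrt(D_H) for _ in range(4))


def split_heads(x):
    """Reshape (n, D_H) into (NUM_HEADS, n, D_H // NUM_HEADS)."""
    n = x.shape[0]
    return x.reshape(n, NUM_HEADS, D_H // NUM_HEADS).transpose(1, 0, 2)


def fuse(poses):
    """poses: five 1-D arrays P_i for one frame. Returns (fused tokens, attention weights)."""
    f = np.stack([p @ E for p, E in zip(poses, encoders)])  # f_i = E_i(P_i), shape (5, D_H)
    q, k, v = split_heads(f @ W_q), split_heads(f @ W_k), split_heads(f @ W_v)
    d_head = D_H // NUM_HEADS
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))  # (NUM_HEADS, 5, 5)
    heads = (attn @ v).transpose(1, 0, 2).reshape(f.shape[0], D_H)  # concatenate heads
    return heads @ W_o, attn


poses = [rng.standard_normal(d) for d in POSE_DIMS]
f_fused, attn = fuse(poses)
print(f_fused.shape)  # (5, 64)
```

Each row of `attn` sums to 1, so every pose type's output is a weighted mix of all five encoded features, which is the "information sharing" that the text describes.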
For each $f_i$, we project it to a unified representation space through a dimension matching layer:

$$
z_i = \text{PadMatch}(f_i) \in \mathbb{R}^{2048}
\tag{2}
$$

where $z_i$ is the unified representation at position $i$; PadMatch refers to zero-padding followed by projection. The ViT layers process this unified representation through multi-head self-attention and cross-attention mechanisms:
$$
z_{\text{latent}} = \text{ViT}_{\text{layers}}(z_1, z_2, \dots, z_n)
\tag{3}
$$

where $z_{\text{latent}}$ represents the final