Low Resource Multimodal Translation of Nepali Spoken Words into Emotion-Conditioned Sign Language Avatars


Summary

This paper presents NEST-V1, a proof-of-concept multimodal framework for generating emotion-conditioned Nepali Sign Language avatars from spoken input, achieving 81.1% ASR accuracy and 79.21% emotion recognition accuracy on a dataset of 600 audio samples from 50 speakers.

arXiv:2606.26107v1 Announce Type: new

Cached at: 06/26/26, 05:13 AM

# Low Resource Multimodal Translation of Nepali Spoken Words into Emotion-Conditioned Sign Language Avatars
Source: [https://arxiv.org/html/2606.26107](https://arxiv.org/html/2606.26107)
1 Center for Human Mobility and Communications, Prateek Innovations, Kathmandu, Nepal

2 Sunway International Business School, Birmingham City University, Kathmandu, Nepal

Email: jbhusal@prateekinnovations.com, jatin@sunway.edu.np, salma.tamang@prateekinnovations.com

###### Abstract

Sign language communication systems that integrate emotional expression remain underexplored, particularly for low-resource languages. This pilot study presents NEST-V1 (Nepali Emotion and Speech Transformer - Version 1), a proof-of-concept multimodal framework that demonstrates the feasibility of generating emotion-conditioned Nepali Sign Language avatars from spoken input. As a preliminary investigation, we focus on four common Nepali words ("thank you", "hello", "house", "me") across three emotional states (happy, neutral, sad) to validate our core technical approach. Our lightweight architecture employs a shared acoustic encoder for simultaneous Automatic Speech Recognition and emotion classification, achieving 81.1% ASR accuracy and 79.21% emotion recognition accuracy on a dataset of 600 labeled audio samples from 50 speakers. The system uses 37% fewer parameters than separate model architectures while maintaining a lightweight footprint of only 22.1M parameters, suitable for edge deployment. This pilot work establishes the technical foundation for emotion-aware sign language translation in low-resource settings and provides a scalable framework for future expansion to larger vocabularies and more diverse emotional expressions. Our preliminary results indicate the viability of real-time, emotionally expressive sign language communication systems for the hearing-impaired community, with clear pathways for enhancement in subsequent development phases.

## 1 Introduction

Spoken-to-sign language gesture-based systems are crucial in assistive sign language research. These systems hold significant potential to bridge the communication gap between verbal speakers and the hearing-impaired community. However, most existing systems focus solely on lexical translation, neglecting the emotional context of spoken language, an essential component of natural, human-centered communication.

Dynamic, real\-time avatar generation is a critical element in enhancing the naturalness and expressiveness of sign language communication\. Yet, due to the absence of emotionally expressive avatars, many current systems resemble robotic gesture mimicking rather than authentic human interaction\. This gap is even more prominent in low\-resource languages like Nepali and its corresponding sign language, Nepali Sign Language \(NSL\), where research and datasets are scarce\. This pilot study proposes a low\-resource multimodal translation pipeline that combines automatic speech recognition \(ASR\) with emotion recognition to generate dynamic sign language avatar animations\. It encompasses four frequently used Nepali sign words, with corresponding facial expressions representing three emotional states: happy, sad, and neutral\. The key contributions of this research are:

1. It presents the first NSL-based speech dataset annotated with emotional context.
2. It provides a modular, real-time pipeline with independent components for ASR and emotion recognition, enabling easy scalability and upgrades.
3. The pipeline is lightweight and suitable for low-resource deployment, supporting edge applications.

The remainder of this paper is structured as follows: Section 2 reviews related work. Section 3 describes our methodology, architecture, and model implementation. Section 4 presents experimental results and analysis. Finally, Section 5 concludes the paper and Section 6 discusses future directions.

## 2 Related Works

Recent research has explored integrating emotional expressions into sign language avatars to enhance comprehension and naturalness. Smith & Nolan [[8](https://arxiv.org/html/2606.26107#bib.bib3)] evaluated augmenting avatars with universal emotions in Irish Sign Language, finding little difference in comprehension between baseline and emotionally enhanced avatars. Gonçalves et al. (2017) [[3](https://arxiv.org/html/2606.26107#bib.bib8)] proposed a facial expression parametrization method for avatars, identifying relevant facial landmarks and emotions to improve automatic sign synthesis systems. Kim et al. [[6](https://arxiv.org/html/2606.26107#bib.bib9)] introduced an avatar-based Sign Language Production system for Korean Sign Language, incorporating named entity transformation and context vector generation to address out-of-vocabulary issues. While these studies demonstrate progress in integrating emotional expressions and improving avatar-based sign language systems, challenges remain in achieving natural and linguistically accurate representations.

Ouyang [[7](https://arxiv.org/html/2606.26107#bib.bib1)] used a fused dataset combining SAVEE and RAVDESS (2,459 samples, 7 emotions), with MFCC preprocessing to encode spectral-temporal features, and adopted a CNN-LSTM hybrid architecture for speech emotion recognition, achieving 61.07% on the test set and 75.31% on the training set. Das Chakladar et al. [[2](https://arxiv.org/html/2606.26107#bib.bib10)] built a 3D avatar-based sign language learning system from three modules: speech-to-text via IBM Watson, English-to-ISL translation with Lexical Functional Grammar, and Blender-based 3D avatar animation synchronized via a "motion list" for Indian Sign Language gestures. Emotionally expressive AI avatars enhance communication for hearing-impaired users, offering affordable, customizable interpreting services, but raise ethical concerns [[1](https://arxiv.org/html/2606.26107#bib.bib5)]. Designing for emotion and considering users’ unique needs is crucial, with emphasis on incorporating hearing-impaired individuals in the development process [[5](https://arxiv.org/html/2606.26107#bib.bib6)].

## 3 Methodology

### 3.1 Overview

This research proposes a novel and lightweight multimodal pipeline for translating spoken Nepali into sign language gestures with emotion-aware rendering. A shared acoustic encoder performs both automatic speech recognition (ASR) and emotion classification from the input audio. The core of the system is a unified architecture, termed NEST-V1 (Nepali Emotion and Speech Transformer - Version 1), which is jointly parameterized for both tasks via a shared encoder.

![Refer to caption](https://arxiv.org/html/2606.26107v1/images/model-architecture.png)
Figure 1: Overview of the NEST-V1 Architecture

### 3.2 Dataset Creation

The dataset includes four commonly used Nepali words (thank you, home, me, and hello), selected based on the nature of their corresponding sign language gestures. Specifically, “thank you” and “me” are dynamic gestures involving continuous hand motion, whereas “home” and “hello” are static gestures characterized by a fixed hand pose. This distinction allows for a balanced evaluation of both motion-centric and pose-centric outputs. To ensure robustness and generalizability, audio samples were collected for all gesture–emotion combinations. Each speaker provided 12 audio samples, representing the four target words, each spoken with three emotional tones (happy, sad, and neutral). The raw recordings were captured in .m4a and .aac formats. Files were standardized to .wav format using FFmpeg and Pydub for compatibility with preprocessing and augmentation pipelines. Participants ranged in age from 15 to 45 years, ensuring age diversity. Gender distribution is summarized in Table [2](https://arxiv.org/html/2606.26107#S3.T2).

![Refer to caption](https://arxiv.org/html/2606.26107v1/images/dynamic.png)
Figure 2: Dynamic gestures: “me” and “thank you” involve multiple hand movements.

![Refer to caption](https://arxiv.org/html/2606.26107v1/images/static.png)
Figure 3: Static gestures: “home” and “hello” use a single hand gesture.

### 3.3 Dataset Augmentation and Audio Characteristics

To enhance diversity and increase the volume of audio samples, this study employed both semitone shifting and Vocal Tract Length Perturbation (VTLP) for data augmentation. The resulting sample counts across the four gesture classes are summarized in Table [4](https://arxiv.org/html/2606.26107#S3.T4). Furthermore, the dataset can be broadly categorized into four duration ranges, as detailed in Table [3](https://arxiv.org/html/2606.26107#S3.T3). All audio samples were collected in either .aac or .m4a formats, with respective sampling rates of 44,100 Hz and 48,000 Hz. These rates were consistent across the original recordings as well as the augmented samples, including those processed via VTLP and semitone shifting.

Table 1: Sample distribution across gesture classes

Table 2: Emotion-wise and gender-wise distribution of samples

Table 3: Distribution of audio sample durations (in seconds) for each word class

#### 3.3.1 Augmenting Audio Dataset with Random VTLP

Vocal Tract Length Perturbation (VTLP) simulates variations in the vocal tract length by warping the frequency spectrum of an audio signal. For each audio sample with a sampling rate $sr = 44{,}100$ Hz or $48{,}000$ Hz, we compute the Short-Time Fourier Transform (STFT) using an FFT size of $N = 2048$ and a hop length of $H = 512$. The frequency axis is linearly warped using random warping factors $\alpha \in \{0.8, 0.9, 1.2, 1.3\}$. While standard vocal tract length normalization (VTLN) typically restricts $\alpha$ to $[0.8, 1.2]$ [[4](https://arxiv.org/html/2606.26107#bib.bib11)], our goal is to introduce greater diversity in the dataset. Hence, we expand the range to $[0.8, 1.3]$.
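The linear frequency warp described above can be sketched on a magnitude spectrogram as follows. This is a minimal illustration, not the authors' implementation: the function name `warp_frequency_axis` is hypothetical, the STFT (with $N = 2048$, giving 1025 frequency bins) is assumed to have been computed elsewhere, the warp direction (content at frequency $f$ moves to $\alpha f$) is an assumption, and bins that fall outside the original range are filled with zeros.

```python
import numpy as np


def warp_frequency_axis(magnitude, alpha):
    """Linearly warp the frequency axis of a (n_bins, n_frames) magnitude array.

    Content at source bin k is moved to bin alpha * k (assumed direction), so
    each output bin reads the source spectrum at position k / alpha.
    """
    n_bins = magnitude.shape[0]
    bins = np.arange(n_bins, dtype=float)
    source_positions = bins / alpha
    warped = np.empty_like(magnitude, dtype=float)
    for t in range(magnitude.shape[1]):
        # Linear interpolation along frequency; positions past the last bin
        # have no source content and are set to zero.
        warped[:, t] = np.interp(source_positions, bins, magnitude[:, t], right=0.0)
    return warped


# Pick one of the warping factors listed in the text at random.
rng = np.random.default_rng(0)
alpha = float(rng.choice([0.8, 0.9, 1.2, 1.3]))

# Toy spectrogram: 1025 bins (N // 2 + 1 for N = 2048), 3 frames, one spectral peak.
toy = np.zeros((1025, 3))
toy[100, :] = 1.0
augmented = warp_frequency_axis(toy, alpha)
```

With $\alpha < 1$ the spectral content is compressed toward lower frequencies, and with $\alpha > 1$ it is stretched upward; the warped magnitudes would then be recombined with the original phase and inverted back to a waveform.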

![Refer to caption](https://arxiv.org/html/2606.26107v1/images/vtlp_spectrogram.png)
Figure 4: Comparison of Mel spectrograms: original audio vs. VTLP audio with $\alpha = 0.8$

### 3.4 Audio Augmentation

1. Resampling: All audio files were resampled from their original sampling rates (44.1 kHz/48 kHz) to 16 kHz.
2. Fixed Duration: Each sample was clipped or zero-padded to a fixed duration of 2 seconds (32,000 samples at 16 kHz) to handle temporal variability.
3. Hop Length: A hop length of 160 samples, equivalent to 10 ms, was chosen: $\text{Hop Length} = \text{Sample Rate} \times \text{Frame Shift} = 16{,}000 \times 0.010 = 160$ (1)
4. FFT Parameters: The number of FFT points (n_fft) was set to 320, resulting in a window size of 20 ms.
5. Target Frame Count: Each spectrogram was configured to contain exactly 200 frames to ensure uniform dimensions.
6. Normalization: Each waveform was normalized to the range $[-1, 1]$.

All spectrograms were then converted into 2D image tensors with a fixed input size of $128 \times 200$, where:

- 128 corresponds to the number of Mel frequency bins (vertical axis),
- 200 corresponds to the number of time frames (horizontal axis).

This standardization allowed consistent batch processing during both training and inference\.
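The preprocessing constants listed above follow from a few lines of arithmetic, which the sketch below recomputes. The helper names `fix_length` and `peak_normalize` are illustrative and do not come from the paper, and the frame count assumes no extra padding at the clip edges.

```python
SAMPLE_RATE = 16_000   # Hz, after resampling
CLIP_SECONDS = 2.0     # fixed clip duration
FRAME_SHIFT = 0.010    # seconds between successive frames
WINDOW = 0.020         # seconds per analysis window

hop_length = round(SAMPLE_RATE * FRAME_SHIFT)        # 160 samples
n_fft = round(SAMPLE_RATE * WINDOW)                  # 320 samples
target_samples = round(SAMPLE_RATE * CLIP_SECONDS)   # 32,000 samples
n_frames = target_samples // hop_length              # 200 frames


def fix_length(samples, target=target_samples):
    """Clip or zero-pad a waveform (sequence of floats) to `target` samples."""
    clipped = list(samples[:target])
    return clipped + [0.0] * (target - len(clipped))


def peak_normalize(samples):
    """Scale a waveform so that its largest absolute value is 1."""
    peak = max((abs(s) for s in samples), default=0.0)
    return [s / peak for s in samples] if peak > 0 else list(samples)
```

A short clip such as `fix_length([0.5] * 10)` is padded to 32,000 samples, while anything longer than two seconds is truncated.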

Table 4: Summary of data augmentation techniques and resulting sample counts

### 3.5 Model Architecture

#### 3.5.1 Overview

NEST\-V1 converts raw audio into visual representations using Mel spectrograms—grayscale images that capture frequency patterns over time\. These spectrograms are split into smaller patches and passed through a lightweight Vision Transformer \(ViT\) backbone\. The model learns complex time\-frequency relationships and performs both keyword recognition \(e\.g\., “Namaste”\) and emotion classification using a single unified architecture\. By treating audio as images and leveraging Transformers, we effectively manage multimodal audio understanding under low\-resource conditions\.

#### 3.5.2 Technical Implementation

We adapt a ViT\-style transformer for audio spectrograms by treating them as 2D grayscale images\. The following sections detail the preprocessing, model layers, and classification heads\.

#### Input Representation

Each audio clip is transformed into a Mel spectrogram of shape $128 \times 200$ (Mel bands × time frames), resized to $128 \times 128$ via bilinear interpolation to match the input requirements of the Vision Transformer. These spectrograms are treated as single-channel (grayscale) images and normalized.

#### Patch Embedding Layer

The spectrogram image is divided into non-overlapping patches of size $16 \times 16$, resulting in 64 patches. A convolutional projection layer with kernel size equal to the patch size ($16 \times 16$), stride 16, and 768 output channels is used to embed each patch:

$$\text{Conv2D}(1, 768, \text{kernel}=16, \text{stride}=16)$$

The patch embeddings are flattened and linearly projected to a dimension $D = 768$.

A learnable [CLS] token is prepended for classification, and learnable positional embeddings are added to retain spatial ordering:

$$\mathbf{Z}_{0} = [\mathbf{z}_{\text{cls}}; \mathbf{z}_{1}; \dots; \mathbf{z}_{64}] + \mathbf{E}_{\text{pos}}$$

Both embeddings are initialized using a truncated normal distribution with standard deviation $\text{std} = 0.02$.
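To make the shapes concrete, the sketch below builds $\mathbf{Z}_0$ for one spectrogram with NumPy. It relies on the fact that a convolution whose kernel and stride both equal the patch size is equivalent to one linear projection applied to each flattened patch. All weights are random stand-ins, a plain normal distribution replaces the truncated normal for brevity, and the names `patchify`, `projection`, `cls_token` and `pos_embedding` are illustrative rather than taken from the paper.

```python
import numpy as np

PATCH = 16
EMBED_DIM = 768


def patchify(image, patch=PATCH):
    """Split an (H, W) image into non-overlapping, flattened patch rows."""
    h, w = image.shape
    grid_h, grid_w = h // patch, w // patch
    blocks = image.reshape(grid_h, patch, grid_w, patch).transpose(0, 2, 1, 3)
    return blocks.reshape(grid_h * grid_w, patch * patch)


rng = np.random.default_rng(0)
spectrogram = rng.standard_normal((128, 128))       # placeholder input
patches = patchify(spectrogram)                     # (64, 256)

# Stand-in for the Conv2D(1, 768, kernel=16, stride=16) weights.
projection = rng.normal(0.0, 0.02, (PATCH * PATCH, EMBED_DIM))
tokens = patches @ projection                       # (64, 768)

cls_token = rng.normal(0.0, 0.02, (1, EMBED_DIM))
pos_embedding = rng.normal(0.0, 0.02, (65, EMBED_DIM))
z0 = np.concatenate([cls_token, tokens]) + pos_embedding   # (65, 768)
```

The resulting sequence has 65 tokens (64 patches plus [CLS]), which is the length the Transformer encoder operates on.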

#### Transformer Encoder

The patch sequence, including the [CLS] token, is passed through a stack of $L = 3$ identical Transformer blocks. Each block consists of:

- Multi-head self-attention (12 heads, with head dimension $d = 64$; total embedding dim = 768)
- Pre-norm Layer Normalization (before both attention and MLP)
- MLP with GELU activation and 4$\times$ expansion (768 $\rightarrow$ 3072 $\rightarrow$ 768)
- Residual connections across both attention and MLP sublayers
- Dropout with $p = 0.1$ for regularization

These layers capture both local and global dependencies in time\-frequency space, essential for understanding speech and emotion\.

#### Classification Heads

After the Transformer encoder, only the representation of the [CLS] token is used for classification. It is passed through a final LayerNorm and separate linear classifiers for each task:

$$\hat{\mathbf{y}}_{\text{task}} = \text{softmax}(\mathbf{W}_{\text{task}} \cdot \mathbf{z}_{\text{cls}} + \mathbf{b}_{\text{task}}) \tag{2}$$

Two parallel heads are used:

- Emotion classification (3 classes: Happy, Neutral, Sad)
- Keyword classification (4 classes: “Hello”, “Thank you”, “House”, “Me”)

This design supports multi\-task learning while sharing a common feature extractor\.
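Equation (2) applied to the two heads can be sketched in plain Python as below. The weights are random placeholders rather than trained parameters, the [CLS] feature is synthetic, and the helper names `make_head` and `apply_head` are illustrative only.

```python
import math
import random

EMOTIONS = ["happy", "neutral", "sad"]
KEYWORDS = ["hello", "thank you", "house", "me"]
DIM = 768


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_head(n_classes, rng, dim=DIM):
    """Random stand-in for one linear head (W_task, b_task)."""
    weights = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n_classes)]
    return weights, [0.0] * n_classes


def apply_head(head, z_cls):
    """Compute softmax(W_task . z_cls + b_task) for one task."""
    weights, bias = head
    logits = [sum(w * z for w, z in zip(row, z_cls)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


rng = random.Random(0)
z_cls = [rng.gauss(0.0, 1.0) for _ in range(DIM)]   # placeholder [CLS] feature

# Both heads read the same shared feature vector.
emotion_probs = apply_head(make_head(len(EMOTIONS), rng), z_cls)
keyword_probs = apply_head(make_head(len(KEYWORDS), rng), z_cls)

emotion = EMOTIONS[emotion_probs.index(max(emotion_probs))]
keyword = KEYWORDS[keyword_probs.index(max(keyword_probs))]
```

Because the heads share `z_cls`, adding a class to either task only changes the size of that head's weight matrix.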

Table 5: Final hyperparameters of the Vision Transformer model used in the audio-emotion-to-avatar pipeline.

#### Why It Works

This architecture jointly captures:

- Spectral structure via 2D patches in Mel-frequency space
- Temporal evolution across spectrogram frames
- Global dependencies through Transformer self-attention

This makes it especially suitable for tasks like emotion and keyword recognition from spoken Nepali audio, even with limited training data\.

Table 6: Comparison of NEST-V1 with Standard ViT

### 3.6 Experimental Setup

We trained a custom Audio Spectrogram Transformer (AST) model for audio classification, adapting the architecture to balance efficiency and performance. The model takes input spectrograms of size $128 \times 128$ (single channel), divided into non-overlapping patches of size $16 \times 16$, which are linearly projected into 768-dimensional embeddings. The Transformer backbone consists of 3 encoder layers, each with 12 attention heads, followed by task-specific classification heads. For optimization, we used the AdamW optimizer with an initial learning rate of 0.001 and a weight decay of 0.1. A cosine annealing scheduler with 10 cycles was employed to stabilize training over 25 epochs. Cross-entropy loss was used as the training objective. Batch sizes were set according to the data loader capacity. Training and validation accuracy and loss were recorded per epoch, with the best model selected based on validation accuracy. All experiments were conducted on CUDA-enabled GPUs using PyTorch, with CPU fallback where necessary. To ensure reproducibility, random seeds were fixed where applicable, and standard data splits were used for training and validation. This lightweight architecture achieves a balance between computational efficiency and representational capacity, making it well-suited for resource-constrained environments without sacrificing competitive performance.
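For readers unfamiliar with cosine annealing, the sketch below evaluates the standard closed-form schedule using the learning rate and epoch count stated above. The text does not fully specify how the "10 cycles" are configured, so this sketch assumes, purely for illustration, a single decay over all 25 epochs down to a minimum of zero; `t_max` and `eta_min` are assumed values, not reported settings.

```python
import math

BASE_LR = 1e-3       # initial learning rate from the text
WEIGHT_DECAY = 0.1   # AdamW weight decay from the text (not used by the schedule)
EPOCHS = 25


def cosine_annealing_lr(epoch, base_lr=BASE_LR, t_max=EPOCHS, eta_min=0.0):
    """Closed-form cosine annealing from base_lr down to eta_min over t_max epochs."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))


schedule = [cosine_annealing_lr(epoch) for epoch in range(EPOCHS + 1)]
```

A restart-based variant would repeat this curve with a shorter `t_max`; which variant the authors used cannot be determined from the description.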

### 3.7 Avatar Generation

To represent signed gestures with emotional nuance, we developed a lightweight and expressive avatar animation pipeline\. This pipeline converts the audio input—specifically, the detected keyword and its emotional tone—into corresponding sign language animations using pre\-rendered 2D avatar frames\.

#### Data Preparation

For each of the four selected Nepali Sign Language gestures (Thank you, Hello, House, and Me), we prepared four distinct avatar images:

- A base pose representing the neutral standing position,
- Three emotionally expressive variants: Happy, Sad, and Neutral.

All avatars were designed with a consistent style, depicting the upper body, face, and hands to clearly convey both the sign gesture and its emotional context\.

#### Gesture Animation Pipeline

To create fluid animations from static images, we implemented a frame\-blending pipeline using Python libraries such as OpenCV and PIL\. The animation process involves morphing the base \(neutral\) avatar image into an emotionally expressive variant using linear interpolation across frames\. Specifically, the pipeline executes the following steps:

1. Load the base and target images (e.g., ghar-base.png and ghar-sad.jpeg) and resize both to a resolution of 512×512 pixels.
2. Generate 30 interpolated frames by applying alpha blending using OpenCV’s addWeighted function, where blending weights change linearly from the base to the target image.
3. Convert each blended frame to RGB and wrap it as a PIL.Image object.
4. Append the reversed frame sequence to the forward sequence to generate a smooth looping animation (i.e., forward + reverse = 60 frames total).
5. Export the sequence as a looping animated GIF using PIL’s save function, with a frame duration of 25 milliseconds.

This process is repeated for each combination of gesture and emotion, resulting in a total of 12 animated GIFs: four gestures × three emotions\.
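The blending steps above can be sketched with NumPy alone. This is an illustration under stated assumptions rather than the authors' code: synthetic arrays stand in for the avatar images, the function name `blend_sequence` is hypothetical, and the weighted sum is written out explicitly instead of calling OpenCV's `addWeighted`, which computes the same quantity.

```python
import numpy as np


def blend_sequence(base, target, n_frames=30):
    """Return forward plus reversed alpha-blended frames (2 * n_frames in total)."""
    base_f = base.astype(np.float32)
    target_f = target.astype(np.float32)
    forward = []
    for w in np.linspace(0.0, 1.0, n_frames):
        # Weighted sum (1 - w) * base + w * target, the same quantity that
        # cv2.addWeighted(base, 1 - w, target, w, 0) would compute.
        frame = (1.0 - w) * base_f + w * target_f
        forward.append(np.clip(np.rint(frame), 0, 255).astype(np.uint8))
    return forward + forward[::-1]


# Synthetic 512x512 RGB stand-ins for the base and target avatar images.
base = np.zeros((512, 512, 3), dtype=np.uint8)
target = np.full((512, 512, 3), 200, dtype=np.uint8)
frames = blend_sequence(base, target)
```

Each frame could then be wrapped with `PIL.Image.fromarray` and written out with `Image.save(path, save_all=True, append_images=..., duration=25, loop=0)` to produce the looping GIF.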

#### Emotion\-Conditioned Playback

Once the system identifies the spoken word and corresponding emotional tone using the audio classification and emotion recognition modules, it retrieves the mapped animated GIF and plays it as output. For example, if the input audio contains the word Thank you spoken with a sad emotional tone, the system displays the thank-you-sad.gif animation.

This modular mapping approach ensures real\-time responsiveness and emotional expressiveness in sign language output\. Furthermore, the use of optimized, lightweight GIFs makes this method deployable on resource\-constrained platforms such as web browsers and mobile devices, without requiring high\-end graphics rendering\.
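The mapping from a recognized (keyword, emotion) pair to its animation amounts to a table lookup, sketched below. Only thank-you-sad.gif is named in the text; the naming scheme used for the other eleven files and the function name `select_animation` are assumptions made for illustration.

```python
KEYWORDS = ("thank-you", "hello", "house", "me")
EMOTIONS = ("happy", "neutral", "sad")

# Hypothetical naming scheme "<keyword>-<emotion>.gif"; only
# thank-you-sad.gif is mentioned in the text.
ANIMATIONS = {(k, e): f"{k}-{e}.gif" for k in KEYWORDS for e in EMOTIONS}


def select_animation(keyword, emotion):
    """Return the GIF filename for a recognized (keyword, emotion) pair."""
    try:
        return ANIMATIONS[(keyword, emotion)]
    except KeyError:
        raise ValueError(f"no animation for {keyword!r} / {emotion!r}") from None
```

Because the lookup is constant-time and the GIFs are pre-rendered, playback latency is dominated by the classifier rather than by rendering.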

NEST-V1 is designed for deployment in resource-constrained environments. Table [7](https://arxiv.org/html/2606.26107#S3.T7) summarizes the computational characteristics of our model compared to typical alternatives.

Table 7:Computational Complexity ComparisonModelParamsFLOPsMem\.Time\(M\)\(M\)\(MB\)\(ms\)NEST\-V1 \(Ours\)22\.12\.1894595ASR\+Emotion35\.27\.814178178ViT\-Base86\.617\.534612125CNN\-LSTM\*12\.31\.8493535
#### 3.7.1 Parameter Efficiency

Our shared encoder architecture achieves significant parameter reduction, calculated as:

$$\text{Parameter Reduction} = \frac{P_{\text{separate}} - P_{\text{shared}}}{P_{\text{separate}}} \times 100\% \tag{3}$$

$$= \frac{35.2\text{M} - 22.1\text{M}}{35.2\text{M}} \times 100\% = 37.2\% \tag{4}$$
This reduction is achieved by sharing the transformer encoder between ASR and emotion recognition tasks, eliminating duplicate feature extraction layers\.
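As a quick check of Equations (3) and (4), the reduction can be recomputed from the two parameter counts. The function name `parameter_reduction` is illustrative.

```python
def parameter_reduction(p_separate, p_shared):
    """Percentage of parameters saved by the shared model (Equation 3)."""
    return (p_separate - p_shared) / p_separate * 100.0


# Parameter counts in millions, as quoted for the separate and shared models.
reduction = parameter_reduction(35.2, 22.1)
```

The result rounds to 37.2%, matching Equation (4).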

#### 3.7.2 Computational Complexity Analysis

Patch Embedding Layer

- Input: $128 \times 128 \times 1$ spectrogram
- Patches: 64 patches of size $16 \times 16$
- Embedding dimension: 768
- FLOPs: $64 \times 16 \times 16 \times 768 = 12.6\text{M FLOPs}$

Transformer Encoder

For each of the 3 transformer layers:

- Multi-head attention: $\mathcal{O}(n^{2}d)$, where $n = 65$ (64 patches + CLS token), $d = 768$
- MLP: $\mathcal{O}(nd \times 4d) = \mathcal{O}(4nd^{2})$
- Per layer FLOPs: $65^{2} \times 768 + 4 \times 65 \times 768^{2} \approx 153.6\text{M FLOPs}$
- Total for 3 layers: $3 \times 153.6\text{M} = 460.8\text{M FLOPs}$

Classification Heads

- ASR head: $768 \times 4 = 3{,}072$ FLOPs
- Emotion head: $768 \times 3 = 2{,}304$ FLOPs

Total FLOPs: $12.6\text{M} + 460.8\text{M} + 0.005\text{M} \approx 473.4\text{M FLOPs}$
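The FLOP expressions above can be evaluated directly, as in the sketch below, which uses only the quantities given in the text. Evaluated exactly as written, the per-layer expression comes to roughly 156.6M (the MLP term alone is about 153.4M), so the per-layer and total figures quoted above should be read as approximate.

```python
N_PATCHES = 64
PATCH = 16
DIM = 768
SEQ_LEN = N_PATCHES + 1    # 64 patches plus the [CLS] token
LAYERS = 3

patch_flops = N_PATCHES * PATCH * PATCH * DIM     # 12,582,912
attention_flops = SEQ_LEN ** 2 * DIM              # 3,244,800
mlp_flops = 4 * SEQ_LEN * DIM ** 2                # 153,354,240
per_layer_flops = attention_flops + mlp_flops     # 156,599,040
head_flops = DIM * 4 + DIM * 3                    # 5,376

total_flops = patch_flops + LAYERS * per_layer_flops + head_flops   # 482,385,408
```

These counts follow the simplified expressions in the text and ignore the attention projections, normalization, and softmax, so they are order-of-magnitude estimates rather than profiler measurements.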

#### 3.7.3 Memory Efficiency

Model Parameters

- Patch embedding: $16 \times 16 \times 1 \times 768 = 196{,}608$ parameters
- Positional embeddings: $65 \times 768 = 49{,}920$ parameters
- Transformer layers: $3 \times (768^{2} \times 4 + 768 \times 3072 \times 2) \approx 21.3\text{M}$ parameters
- Classification heads: $768 \times 7 = 5{,}376$ parameters
- Total: $\sim 22.1\text{M}$ parameters

Runtime Memory

- Input tensor: $128 \times 128 \times 1 \times 4\text{ bytes} = 65.5\text{ KB}$
- Intermediate activations: $\sim 85\text{ MB}$
- Model weights: $22.1 \times 10^{6} \times 4\text{ bytes} = 88.4\text{ MB}$
- Total inference memory: $\sim 89\text{ MB}$
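The itemized parameter and memory figures can be recomputed as follows. The sketch covers only the terms listed above (weight matrices, without biases or normalization parameters); those listed terms sum to about 21.5M, and the remaining difference from the quoted 22.1M total is not itemized in the text.

```python
DIM, MLP_DIM, LAYERS, SEQ_LEN, PATCH = 768, 3072, 3, 65, 16

patch_embedding = PATCH * PATCH * 1 * DIM                     # 196,608
positional = SEQ_LEN * DIM                                    # 49,920
transformer = LAYERS * (4 * DIM ** 2 + 2 * DIM * MLP_DIM)     # 21,233,664
heads = DIM * (4 + 3)                                         # 5,376
listed_total = patch_embedding + positional + transformer + heads   # 21,485,568

# Memory at 4 bytes per float32 value.
input_kb = 128 * 128 * 1 * 4 / 1e3       # about 65.5 KB for one spectrogram
weights_mb = 22.1e6 * 4 / 1e6            # about 88.4 MB for the quoted 22.1M parameters
```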

#### 3.7.4 Scalability Analysis

The computational complexity scales as follows with vocabulary expansion:

- ASR vocabulary scaling: $\mathcal{O}(V)$, where $V$ is the vocabulary size
- Emotion categories scaling: $\mathcal{O}(E)$, where $E$ is the number of emotions
- Core transformer complexity: remains constant, $\mathcal{O}(1)$

For deployment on edge devices, our model maintains:

- Inference time: $<50$ ms on modern mobile GPUs
- Memory footprint: $<100$ MB total
- Power consumption: estimated 2–3 W during inference

#### 3.7.5 Deployment Considerations

Hardware Requirements

- Minimum: 4 GB RAM, ARM Cortex-A78 or equivalent
- Recommended: 6 GB RAM, Mobile GPU (Adreno 640+, Mali-G76+)
- Optimal: 8 GB RAM, Dedicated AI accelerator

Software Optimization

- Model quantization: INT8 quantization can reduce memory by 75%
- Pruning potential: an estimated 30–40% of parameters can be pruned
- Batch processing: supports batch sizes 1–16 for efficiency
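The 75% figure for INT8 quantization follows from storing each weight in one byte instead of four. The sketch below covers weight storage only and assumes, for illustration, that quantization scales, zero points, and activation memory are negligible or handled separately.

```python
PARAMS = 22.1e6                      # quoted parameter count

fp32_mb = PARAMS * 4 / 1e6           # 4 bytes per float32 weight
int8_mb = PARAMS * 1 / 1e6           # 1 byte per INT8 weight
saving_percent = (1 - int8_mb / fp32_mb) * 100
```

Under these assumptions the weights shrink from about 88.4 MB to about 22.1 MB.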

## 4 Experimental Analysis and Results

The model was evaluated on two datasets: an ASR dataset comprising 3,107 training samples, 889 validation samples, and 447 testing samples; and an emotion dataset with 2,420 training samples, 753 validation samples, and 321 testing samples\. During 25 training epochs, the model demonstrated steady performance, achieving a best training accuracy of 81\.1% on the ASR dataset and 79\.21% on the emotion dataset\. Validation accuracies reached 79\.6% for ASR and 76\.54% for emotion recognition\. The final loss scores for the ASR dataset were 0\.3121 \(training\) and 0\.4876 \(validation\), while for the emotion dataset, the loss was 0\.476 \(training\) and 0\.684 \(validation\)\.

![Refer to caption](https://arxiv.org/html/2606.26107v1/images/confusion_matrix_words.png)
(a) For ASR

![Refer to caption](https://arxiv.org/html/2606.26107v1/images/confusion_matrix_emotions.png)
(b) For Emotion

Figure 5: Confusion matrices for ASR (4 classes) and Emotion (3 classes)

Table 8: Classification report for all four gestures for ASR

Table 9: Classification report for emotions

## 5 Conclusion

This work presents a novel low\-resource, multimodal translation pipeline that maps spoken Nepali utterances to emotion\-conditioned sign language animations using a lightweight and efficient architecture\. The proposed system integrates three key modules:

- a Nepali Automatic Speech Recognition (ASR) system for transcribing spoken inputs into text,
- an emotion classification module for extracting affective context from the audio signal,
- a gesture synthesis pipeline that maps the recognized text and emotion labels to pre-rendered Nepali Sign Language (NSL) avatar animations.

The transformer\-based hybrid model \(NEST\-V1\) jointly optimizes ASR and emotion recognition, ensuring low\-latency and resource\-efficient performance suitable for deployment on edge devices\. Evaluation on a constrained vocabulary of four frequent Nepali words yielded consistent classification performance, with F1\-scores between 0\.69 and 0\.78\. The incorporation of emotion conditioning into the avatar rendering phase enables expressivity beyond lexical translation, thereby improving the naturalness and communicative effectiveness of the generated sign gestures\. Overall, this work demonstrates the feasibility of building an end\-to\-end, emotion\-aware speech\-to\-sign language system for low\-resource settings, with potential for scalability across larger vocabularies and real\-world assistive applications\.

## 6 Future Directions

Looking ahead, the authors plan to improve the current system across several key dimensions:

- Expand Vocabulary and Emotions: Include a wider set of Nepali words and extend the emotion categories to cover a broader emotional spectrum. This will help the model generalize better across real-world conversations and contexts.
- Shift to Dynamic Avatar Generation: Currently, the avatar gestures are rendered using pre-defined animated GIFs. In the next phase, the authors plan to implement a real-time avatar rendering system (possibly using 2D skeletal animation or lightweight 3D rigs) to make the signing experience more fluid and natural.
- Collect Larger and More Diverse Dataset: To improve model performance and fairness, the authors plan to collect additional speech samples from speakers across diverse age groups, dialects, and genders. This will support better generalization in low-resource scenarios.
- Introduce Human Evaluation: Alongside technical metrics, a human evaluation framework involving hearing-impaired users will be introduced. Their feedback on avatar clarity, emotional accuracy, and overall usability will be critical in shaping the next version of the system.
- Optimize for Edge Deployment: Although the system is already lightweight, the authors plan to further optimize it using quantization, pruning, and efficient runtime architectures so it can run smoothly on mobile devices and embedded systems.

The goal is to maintain the balance between performance and deployability, keeping the system modular, real\-time, and aligned with real\-world assistive needs\.

## 7 Acknowledgement

The authors acknowledge Dr. Manish Sakhakarmy for his guidance and Sunway College for assisting with data collection.

## References

- [1] S. Chen, H. Cheng, S. Su, S. Patterson, R. Kushalnagar, Y. Huang, and Q. Wang (2025). Customizing generated signs and voices of AI avatars: deaf-centric mixed-reality design for deaf-hearing communication. Proceedings of the ACM on Human-Computer Interaction 9(2), pp. 1–31.
- [2] D. Das Chakladar, P. Kumar, S. Mandal, P. P. Roy, M. Iwamura, and B. Kim (2021). 3D avatar approach for continuous sign movement using speech/text. Applied Sciences 11(8), pp. 3439.
- [3] D. A. Gonçalves, E. Todt, and D. P. Cláudio (2017). Landmark-based facial expression parametrization for sign languages avatar animation. In Proceedings of the XVI Brazilian Symposium on Human Factors in Computing Systems, pp. 1–6.
- [4] N. Jaitly and G. E. Hinton (2013). Vocal tract length perturbation (VTLP) improves speech recognition. In Proc. ICML Workshop on Deep Learning for Audio, Speech and Language, Vol. 117.
- [5] H. Kim, H. Hwang, S. Gwak, J. Yoon, and K. Park (2024). Improving communication and promoting social inclusion for hearing-impaired users: usability evaluation and design recommendations for assistive mobile applications. PLoS ONE 19(7), pp. e0305726.
- [6] J. Kim, E. J. Hwang, S. Cho, D. H. Lee, and J. C. Park (2022). Sign language production with avatar layering: a critical use case over rare words. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp. 1519–1528.
- [7] Q. Ouyang (2025). Speech emotion detection based on MFCC and CNN-LSTM architecture. arXiv preprint arXiv:2501.10666.
- [8] R. G. Smith and B. Nolan (2016). Emotional facial expressions in synthesised sign language avatars: a manual evaluation. Universal Access in the Information Society 15(4), pp. 567–576.
