AttnGen: Attention-Guided Saliency Learning for Interpretable Genomic Sequence Classification
Summary
AttnGen is an attention-guided training framework that embeds interpretability into the optimization of deep neural networks for genomic sequence classification, achieving improved accuracy and encouraging models to focus on informative nucleotide positions.
# AttnGen: Attention-Guided Saliency Learning for Interpretable Genomic Sequence Classification
Source: [https://arxiv.org/html/2605.14073](https://arxiv.org/html/2605.14073)
###### Abstract
Deep neural networks have achieved strong performance on genomic sequence tasks; however, relating their predictions to biologically meaningful patterns remains an open problem. In this work, we present AttnGen, an attention-guided training framework that embeds interpretability directly into the optimization process. AttnGen computes nucleotide-level importance through a lightweight attention mechanism. It then uses these scores to progressively suppress low-contribution positions during training. This encourages the model to concentrate its predictions on a compact set of informative regions rather than distributing importance across noisy sequence elements. We evaluate AttnGen on the standardized demo\_human\_or\_worm benchmark, a binary classification task over 200-nucleotide sequences. With moderate masking, AttnGen achieves a validation accuracy of 96.73%, outperforming a conventional CNN baseline (95.83%) while also exhibiting faster convergence and improved training stability. To examine whether the learned importance scores reflect functionally relevant signal, we perform perturbation-based analysis by removing high-saliency nucleotides. This results in a substantial accuracy drop, from 96.9% to near chance level on a 3,000-sequence evaluation set, indicating that predictions depend on a relatively small subset of positions.
Our analysis shows that masking 10–20% of positions yields the most favorable trade-off between predictive performance and interpretability. These results indicate that attention-guided masking not only improves classification performance but also reshapes how models distribute importance across sequence positions. While this study targets short genomic sequences, the proposed approach could extend interpretable training strategies to more complex sequence modeling settings.
## I Introduction
In genomic sequence modeling, predictive accuracy alone is often insufficient. Model outputs are frequently used to guide downstream biological interpretation, such as identifying regulatory motifs or prioritizing candidate regions for experimental validation. Models that perform well but offer no insight into their decision process are difficult to trust. Early approaches such as position weight matrices (PWMs) and $k$-mer-based models provided a structured way to represent local sequence preferences \[[1](https://arxiv.org/html/2605.14073#bib.bib1)\]. However, these methods assume limited interaction structure and struggle when regulatory signals depend on broader context or long-range dependencies. Deep learning models address part of this limitation by learning representations directly from raw sequence data. Convolutional neural networks (CNNs), in particular, have shown strong performance across tasks such as transcription-factor binding prediction and enhancer detection \[[2](https://arxiv.org/html/2605.14073#bib.bib2),[3](https://arxiv.org/html/2605.14073#bib.bib3)\]. In several cases, these models rediscover known biological motifs without explicit supervision, indicating that they capture meaningful structure in the data. Recent work in self-supervised learning has explored alignment-based objectives to improve representation quality without requiring large labeled datasets. In particular, alignment learning has been applied in medical image segmentation to enforce consistency across different views and learn stable feature correspondences \[[4](https://arxiv.org/html/2605.14073#bib.bib4)\]. This perspective is relevant to genomic sequence modeling, where only a subset of positions contributes meaningfully to prediction. Post hoc explanation methods attempt to bridge the gap between predictive accuracy and interpretability.
Saliency maps \[[5](https://arxiv.org/html/2605.14073#bib.bib5)\], integrated gradients \[[6](https://arxiv.org/html/2605.14073#bib.bib6)\], DeepLIFT \[[7](https://arxiv.org/html/2605.14073#bib.bib7)\], and attention-based approaches \[[8](https://arxiv.org/html/2605.14073#bib.bib8)\] are commonly used to assign importance scores to individual nucleotides. However, these methods have limitations. For example, prior work has shown that certain saliency maps remain largely unchanged even when model parameters are randomized \[[9](https://arxiv.org/html/2605.14073#bib.bib9)\], raising concerns about whether they reflect true model behavior. If interpretability is introduced only after training, it does not affect how the model forms its decisions. An alternative is to incorporate interpretability during training. Prior work has explored this idea through saliency-guided masking and consistency constraints \[[10](https://arxiv.org/html/2605.14073#bib.bib10),[11](https://arxiv.org/html/2605.14073#bib.bib11),[12](https://arxiv.org/html/2605.14073#bib.bib12),[16](https://arxiv.org/html/2605.14073#bib.bib16)\]. In a related direction, Unified Gravity Loss \[[14](https://arxiv.org/html/2605.14073#bib.bib14)\] improves robustness by shaping the feature space during training. Despite its relevance, this direction remains relatively underexplored in genomic sequence modeling. AttnGen follows this perspective by integrating a lightweight attention mechanism that estimates nucleotide importance during the forward pass. Positions identified as less informative are progressively masked, encouraging the model to focus on a smaller set of discriminative regions while preserving necessary context.
We evaluate this approach on the standardized demo\_human\_or\_worm dataset from the Genomic Benchmarks collection \[[13](https://arxiv.org/html/2605.14073#bib.bib13)\]. Our goal is to study whether masking low-importance positions preserves predictive performance and whether the learned importance scores are aligned with the model's predictions. To study this, we perform gradient-based ablation by removing high- and low-saliency nucleotides and measuring the resulting change in classification accuracy.
## II Related Work
### II-A Deep Learning for Genomic Sequence Modeling
Unlike images or natural language, genomic sequences do not exhibit clear spatial or semantic segmentation, making hand-crafted feature design both domain-intensive and brittle. Deep learning addresses this challenge by enabling models to learn representations directly from raw DNA sequences without relying on predefined features. Early work demonstrated that convolutional neural networks (CNNs) can capture regulatory patterns from sequence data alone. Alipanahi et al. \[[15](https://arxiv.org/html/2605.14073#bib.bib15)\] showed that CNNs could infer DNA- and RNA-binding protein specificities without predefined motif templates, providing one of the first clear demonstrations of end-to-end learning in this domain. Around the same time, DeepSEA \[[17](https://arxiv.org/html/2605.14073#bib.bib17)\] introduced a multi-task convolutional framework capable of predicting chromatin effects at single-nucleotide resolution, highlighting how small sequence variations can lead to measurable regulatory changes. As architectures became more expressive, attention shifted toward modeling interactions beyond local receptive fields. Regulatory elements often involve motifs separated by tens or even hundreds of nucleotides, requiring models to capture long-range dependencies. DanQ \[[18](https://arxiv.org/html/2605.14073#bib.bib18)\] addressed this limitation by combining convolutional layers with bidirectional LSTMs, enabling motif-level representations to interact across longer sequence spans. More recent work has incorporated attention mechanisms to model distal enhancer-promoter interactions that may span kilobases \[[8](https://arxiv.org/html/2605.14073#bib.bib8)\]. Collectively, these approaches reflect a transition from local motif detection toward modeling distributed and context-dependent regulatory structure. Despite these advances, evaluating progress in genomic modeling remains challenging.
Reported performance improvements often depend on preprocessing pipelines, filtering strategies, or dataset splits. Even minor implementation choices can lead to nontrivial differences in results, making it difficult to attribute gains to model design alone. The introduction of Genomic Benchmarks \[[13](https://arxiv.org/html/2605.14073#bib.bib13)\] addressed part of this issue by providing a curated collection of datasets with standardized preprocessing and baseline implementations. However, while such benchmarks improve evaluation consistency, they do not fully resolve how different training strategies, particularly those targeting interpretability, behave under controlled conditions.
In this work, we adopt the demo\_human\_or\_worm dataset from the Genomic Benchmarks collection. The dataset contains 100,000 DNA sequences of length 200 and defines a balanced binary classification task. Its controlled setup allows us to study the effects of saliency-guided training without confounding variability introduced by custom data processing pipelines.
### II-B Interpretability and Saliency-Guided Training
Most interpretability methods operate after model training. Gradient-based saliency maps \[[5](https://arxiv.org/html/2605.14073#bib.bib5)\] estimate input importance through local sensitivity, while Integrated Gradients \[[6](https://arxiv.org/html/2605.14073#bib.bib6)\] and Grad-CAM \[[19](https://arxiv.org/html/2605.14073#bib.bib19)\] provide refinements intended to improve attribution quality. However, these approaches have known limitations: gradient-based explanations can be noisy and sensitive to small perturbations, while perturbation-based methods are often computationally expensive. In genomic applications, such instability is particularly problematic, as small changes in importance scores can directly affect biological interpretation.
More fundamentally, prior work has shown that certain saliency methods may produce visually plausible explanations even when model parameters are randomized \[[9](https://arxiv.org/html/2605.14073#bib.bib9)\], raising concerns about whether these explanations reflect true model behavior or merely plausible artifacts. At the same time, integrating gradient-based saliency into the training loop is not straightforward. Computing saliency at each iteration requires additional backward passes and can introduce instability during optimization \[[20](https://arxiv.org/html/2605.14073#bib.bib20)\]. This makes it difficult to directly incorporate attribution signals into the learning process. An alternative is to estimate importance scores in the forward pass. An attention mechanism provides such a pathway: it produces differentiable importance weights without requiring repeated gradient computations, making it more suitable for integration into the training objective. Saliency-Guided Training (SGT) \[[11](https://arxiv.org/html/2605.14073#bib.bib11)\] builds on this idea by masking low-importance features and enforcing consistency between original and masked predictions using a KL divergence regularizer. This approach encourages models to rely less on noisy or incidental features and more on stable, predictive structure.
We adapt this principle to genomic sequence classification in AttnGen. Instead of relying on gradient-based saliency during training, we use a lightweight attention mechanism to estimate per-nucleotide importance in the forward pass. Progressive masking and KL-based consistency are retained, but reformulated for sequence data, where masking individual nucleotides introduces different structural constraints compared to masking pixels in images.
### II-C Research Gaps and Our Contributions
Although genomic deep learning architectures have become increasingly sophisticated, relatively little work has examined how interpretability constraints influence sequence-based optimization. Existing saliency-guided approaches have largely been developed in vision settings, where masking operates over continuous pixel intensities. In contrast, genomic sequences are discrete and symbolic, and masking individual nucleotides can alter both local context and downstream representation in nontrivial ways.
As a result, it remains unclear whether saliency-guided training strategies transfer effectively to genomic sequence modeling, particularly under standardized evaluation settings. In this work, we investigate this question through AttnGen, an attention-guided saliency learning framework. By combining forward-pass importance estimation with structured masking and consistency constraints, we examine whether interpretability can be incorporated into the training process in a way that directly shapes nucleotide-level importance and the resulting biological interpretations.
## III Problem Statement
### III-A Task Definition
Let $\mathcal{D}=\{(\mathbf{x}_i,y_i)\}_{i=1}^{N}$ denote a genomic dataset in which each $\mathbf{x}_i\in\Sigma^{L}$ is a DNA sequence of length $L$ over the nucleotide alphabet $\Sigma=\{A,T,G,C\}$, and $y_i\in\{0,1\}$ indicates the class label (human versus *C. elegans*). Our goal is to learn a classifier $f_\theta:\Sigma^{L}\rightarrow\mathbb{R}^{2}$ that performs well on unseen sequences while also revealing which positions in the sequence contribute most strongly to its decisions. In addition to predictive accuracy, we examine whether the model's outputs are supported by biologically meaningful patterns rather than superficial correlations. Many existing approaches prioritize predictive performance as the primary optimization target. Although some incorporate domain knowledge or motif constraints, standard end-to-end neural training typically does not distinguish between highly informative positions and those that contribute little signal. In practice, genomic sequences are not uniformly informative. Certain regions contain regulatory motifs or conserved subsequences, whereas others introduce redundancy or noise. A central question therefore emerges: can the training process itself encourage the model to concentrate on discriminative positions, instead of relying on diffuse or dataset-specific cues?
### III-B Challenges and Research Questions
Convolutional sequence models provide strong classification accuracy, yet they offer limited transparency regarding which nucleotides drive predictions. Attribution techniques can estimate importance scores, but these are usually computed after training and do not alter how representations are formed. This separation between learning and explanation creates a mismatch: the model may rely on features that appear weak or unstable under post-hoc analysis, making biological interpretation uncertain. Another difficulty arises from positional and compositional biases present in many genomic datasets. Models may exploit such biases to achieve high training accuracy without learning relationships that generalize beyond the dataset at hand. Additionally, incorporating gradient-based saliency directly into optimization introduces computational overhead and can amplify instability during backpropagation \[[20](https://arxiv.org/html/2605.14073#bib.bib20)\]. These considerations lead us to three guiding questions. We first examine whether attention mechanisms can approximate gradient-derived saliency in a way that remains stable during training. We then study how masking sequence positions based on learned importance influences model focus and predictive behavior. Finally, we investigate how different masking intensities affect the balance between accuracy, interpretability, and robustness in nucleotide-level classification.
### III-C Approach Overview
We propose AttnGen, an attention-guided saliency learning framework that incorporates interpretability constraints into the optimization process. Instead of computing gradient-based attribution at every iteration, which typically requires additional backward passes, AttnGen introduces a lightweight attention module that produces per-nucleotide importance scores during the forward computation. This design avoids repeated gradient evaluations and reduces the computational cost relative to gradient-based saliency integration.
The model then applies progressive masking to positions assigned low importance. Rather than enforcing hard sparsity, this masking encourages the network to rely more consistently on discriminative regions of the sequence. To maintain stable predictions, a Kullback–Leibler divergence term aligns the output distributions of original and masked inputs. This regularization penalizes large predictive shifts when low-importance positions are removed, thereby reinforcing the relevance of retained nucleotides. Our focus is to analyze how this attention-driven masking strategy shapes model behavior under controlled experimental conditions and whether it improves the alignment between predictive performance and identifiable sequence regions.
## IV Methodology
### IV-A Model Architecture
The base classifier $f_\theta$ in AttnGen is a 1D convolutional network designed for genomic sequence inputs. An embedding layer first maps discrete nucleotide tokens into 128-dimensional continuous vectors. Three convolutional blocks are then applied, each consisting of a 1D convolution (kernel size 8), batch normalization, ReLU activation, and max pooling (stride 2).
The kernel size of 8 was selected to capture short sequence motifs spanning approximately 6–8 nucleotides, which aligns with the typical length of many regulatory elements reported in prior genomic studies. The use of three convolutional stages allows the model to progressively expand its receptive field while maintaining computational efficiency. Channel dimensionality is reduced ($128\rightarrow32\rightarrow16\rightarrow4$) to limit model capacity and reduce overfitting on short sequences.
After flattening, two fully connected layers with dropout ($p=0.3$) produce binary class logits.
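To make the effect of the three convolutional blocks on tensor shapes concrete, the following sketch traces the (channels, length) pair through the stack. The text does not state the convolution padding or the pooling window, so this sketch assumes unpadded convolutions with stride 1 and a pooling window of 2; the helper names `conv_out_len` and `trace_shapes` are hypothetical. Under different padding choices the resulting lengths would differ.

```python
def conv_out_len(length, kernel, stride=1, padding=0):
    """Output length of a 1D convolution or pooling window (standard formula)."""
    return (length + 2 * padding - kernel) // stride + 1


def trace_shapes(seq_len=200, channels=(128, 32, 16, 4), kernel=8, pool=2):
    """Trace (channels, length) through three conv blocks.

    Assumptions not stated in the paper: no padding, conv stride 1,
    and a max-pooling window equal to its stride (2).
    """
    shapes = [(channels[0], seq_len)]  # output of the embedding layer
    length = seq_len
    for out_channels in channels[1:]:
        length = conv_out_len(length, kernel)               # 1D convolution
        length = conv_out_len(length, pool, stride=pool)    # max pooling
        shapes.append((out_channels, length))
    return shapes


shapes = trace_shapes()
flat_features = shapes[-1][0] * shapes[-1][1]  # input size of the first dense layer
for channels, length in shapes:
    print(f"channels={channels:>3}  length={length}")
print("flattened features:", flat_features)
```

Under these assumptions the flattened representation is small, which is consistent with the stated intent of limiting model capacity on short sequences.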
### IV-B Attention-Based Saliency Computation
Standard saliency estimation relies on the gradient $\nabla_{\mathbf{x}}f_\theta(\mathbf{x})$ to measure feature importance \[[5](https://arxiv.org/html/2605.14073#bib.bib5)\]. While effective for post-hoc analysis, computing gradients for saliency during training requires additional backward passes and can introduce instability into the optimization process.
To avoid this overhead, AttnGen estimates importance directly in the forward pass using a lightweight attention mechanism. Given the embedding tensor $\mathbf{E}(\mathbf{x})\in\mathbb{R}^{B\times L\times d}$ for a batch of sequences ($B$: batch size, $L=200$, $d=128$), we compute position-wise scores by averaging across the feature dimension:

$$\mathbf{s}_{b,i}(\mathbf{x})=\frac{1}{d}\sum_{j=1}^{d}\mathbf{E}_{b,i,j}(\mathbf{x}),\quad\mathbf{s}(\mathbf{x})\in\mathbb{R}^{B\times L}. \tag{1}$$

The normalized importance weights are then obtained via a softmax operation along the sequence dimension:

$$\mathbf{A}_{b,i}(\mathbf{x})=\frac{\exp\big(\mathbf{s}_{b,i}(\mathbf{x})\big)}{\sum_{i'=1}^{L}\exp\big(\mathbf{s}_{b,i'}(\mathbf{x})\big)},\quad\mathbf{A}(\mathbf{x})\in\mathbb{R}^{B\times L}. \tag{2}$$

We use mean aggregation rather than max pooling to reduce sensitivity to individual embedding dimensions, which may be noisy or poorly calibrated during early training. This yields a more stable estimate of position-wise importance.
The resulting weights $\mathbf{A}_{b,i}(\mathbf{x})$ are used to rank nucleotide positions for subsequent masking.
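As an illustration of Eqs. (1) and (2), the sketch below computes importance weights for a single sequence in plain Python. It is a minimal sketch rather than the authors' implementation: a practical version would operate on batched tensors, and the function name `position_importance` and the toy embedding values are made up for this example.

```python
import math


def position_importance(embeddings):
    """Importance weights for one sequence.

    `embeddings` is a list of L vectors, each of length d.
    """
    # Eq. (1): average over the feature dimension, one raw score per position.
    scores = [sum(vec) / len(vec) for vec in embeddings]
    # Eq. (2): softmax along the sequence dimension. Subtracting the maximum
    # leaves the result unchanged and avoids overflow in exp().
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy input: L = 4 positions, d = 3 embedding dimensions (made-up values).
toy_embeddings = [
    [0.1, 0.2, 0.3],
    [1.0, 1.2, 0.8],
    [-0.5, 0.0, 0.5],
    [0.4, 0.4, 0.4],
]
weights = position_importance(toy_embeddings)
ranking = sorted(range(len(weights)), key=lambda i: weights[i])  # least important first
print(weights, ranking)
```

Because the softmax is monotonic, ranking positions by $\mathbf{A}_{b,i}$ is equivalent to ranking them by the raw mean scores $\mathbf{s}_{b,i}$.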
### IV-C Progressive Masking Strategy
During training, AttnGen uses the attention weights to suppress low-importance positions. For a masking ratio $\alpha\in[0,1]$, the number of masked positions is:

$$k=\left\lfloor\alpha L\right\rfloor. \tag{3}$$

The indices corresponding to the $k$ least salient positions are defined as:

$$\mathcal{I}_{\text{mask}}(\mathbf{x})=\operatorname*{arg\,min}_{\substack{\mathcal{I}\subset\{1,\dots,L\}\\ |\mathcal{I}|=k}}\sum_{i\in\mathcal{I}}\mathbf{A}_{b,i}(\mathbf{x}), \tag{4}$$

i.e., the set of $k$ positions with the smallest importance scores. The masked input $\tilde{\mathbf{x}}$ is constructed by replacing positions in $\mathcal{I}_{\text{mask}}(\mathbf{x})$ with a padding token (index 0). This operation is applied independently to each sequence in the batch. This masking strategy introduces a controlled perturbation that forces the model to redistribute attention toward more informative regions. We consider four masking regimes: baseline (0%), moderate (10–25%), high (50%), and extreme (75%). These regimes span the range from minimal perturbation to near-complete information removal, with finer resolution in the moderate range where the accuracy–interpretability trade-off is most sensitive.
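The selection in Eqs. (3) and (4) amounts to picking the $k$ smallest weights per sequence. The following sketch shows this for one sequence in plain Python; the token encoding (A=1, T=2, G=3, C=4), the weight values, and the function name `mask_low_importance` are assumptions made for illustration, while the padding index 0 follows the text.

```python
import math

PAD_INDEX = 0  # padding token index used for masked positions (from the text)


def mask_low_importance(tokens, weights, alpha):
    """Replace the k = floor(alpha * L) lowest-weight positions with PAD_INDEX."""
    k = math.floor(alpha * len(tokens))
    # Sort position indices by ascending weight; ties keep the earlier position.
    lowest = sorted(range(len(tokens)), key=lambda i: weights[i])[:k]
    masked = list(tokens)
    for i in lowest:
        masked[i] = PAD_INDEX
    return masked, sorted(lowest)


# Hypothetical encoding: A=1, T=2, G=3, C=4; weights are made-up values.
tokens = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2]
weights = [0.05, 0.20, 0.02, 0.10, 0.15, 0.03, 0.12, 0.08, 0.18, 0.07]
masked, masked_idx = mask_low_importance(tokens, weights, alpha=0.2)
print(masked, masked_idx)
```

For the sequence length used in the paper ($L=200$), masking ratios of 10%, 20%, 50%, and 75% correspond to $k=20$, $40$, $100$, and $150$ masked positions.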
### IV-D Saliency-Guided Loss Function
The AttnGen objective combines standard classification loss with a consistency constraint:

$$\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{CE}}\big(f_\theta(\mathbf{x}),y\big)+\lambda\,\mathcal{D}_{\text{KL}}\big(f_\theta(\mathbf{x})\parallel f_\theta(\tilde{\mathbf{x}})\big),$$

where $\mathcal{L}_{\text{CE}}$ is the cross-entropy loss and $\mathcal{D}_{\text{KL}}$ measures divergence between prediction distributions:

$$\mathcal{D}_{\text{KL}}\big(f_\theta(\mathbf{x})\parallel f_\theta(\tilde{\mathbf{x}})\big)=\sum_{c=1}^{C}P(c\mid\mathbf{x})\log\frac{P(c\mid\mathbf{x})}{P(c\mid\tilde{\mathbf{x}})}.$$

The KL term penalizes large prediction shifts when low-importance positions are removed, encouraging the model to rely on features that remain stable under masking. The regularization weight $\lambda=0.1$ controls the trade-off between predictive accuracy and consistency, and was selected based on a small grid search over $\{0.01,0.1,0.5\}$ on the validation set. For the baseline setting ($\alpha=0$), setting $\lambda=0$ reduces the objective to standard cross-entropy training.
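The objective can be evaluated for a single example as in the sketch below, which assumes that $P(c\mid\cdot)$ is the softmax of the model's logits. The logit values and the small `eps` guard inside the logarithm are illustrative choices, not details taken from the paper.

```python
import math


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability of the true class."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(pc * math.log((pc + eps) / (qc + eps)) for pc, qc in zip(p, q))


def attngen_loss(logits_orig, logits_masked, label, lam=0.1):
    """Cross-entropy on the original input plus lam * KL(original || masked)."""
    ce = cross_entropy(logits_orig, label)
    kl = kl_divergence(softmax(logits_orig), softmax(logits_masked))
    return ce + lam * kl


# Made-up logits: masking leaves the prediction unchanged vs. shifts it.
loss_same = attngen_loss([2.0, -1.0], [2.0, -1.0], label=0)
loss_shifted = attngen_loss([2.0, -1.0], [0.5, 0.2], label=0)
print(loss_same, loss_shifted)
```

When masking leaves the output distribution unchanged, the KL term vanishes and the loss equals the cross-entropy; the penalty grows as the masked prediction drifts from the original one.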
Algorithm [1](https://arxiv.org/html/2605.14073#alg1) summarizes the overall training procedure, replacing gradient-based saliency computation in \[[11](https://arxiv.org/html/2605.14073#bib.bib11)\] with forward-pass attention estimation.

Input: training samples $\mathbf{X}$, labels $\mathbf{y}$, masking ratio $\alpha$, learning rate $\eta$, KL weight $\lambda$, epochs $N$

- for epoch $=1$ to $N$ do
  - for each mini-batch $(\mathbf{x},y)$ do
    - Compute position-wise importance: $\mathbf{s}_{b,i}=\frac{1}{d}\sum_{j}\mathbf{E}_{b,i,j}$; $\mathbf{A}_{b,i}=\frac{\exp(\mathbf{s}_{b,i})}{\sum_{i'}\exp(\mathbf{s}_{b,i'})}$
    - Select low-importance positions: $k=\lfloor\alpha L\rfloor$; $\mathcal{I}_{\text{mask}}=\arg\min_{|\mathcal{I}|=k}\sum_{i\in\mathcal{I}}\mathbf{A}_{b,i}$
    - Construct masked input: $\widetilde{\mathbf{x}}=\mathrm{Mask}(\mathbf{x},\mathcal{I}_{\text{mask}})$
    - Compute training objective: $\mathcal{L}=\mathcal{L}_{\text{CE}}+\lambda\,\mathcal{D}_{\text{KL}}$
    - Update: $\theta\leftarrow\theta-\eta\nabla_{\theta}\mathcal{L}$
  - end for
- end for

Algorithm 1: AttnGen
### IV-E Training Configuration
All models are trained using the Adam optimizer (learning rate $\eta=0.001$), batch size 64, and weight decay $10^{-4}$. These hyperparameters follow standard configurations for short-sequence classification tasks and were found to provide stable convergence in preliminary experiments.
Early stopping with patience 10 (based on validation accuracy) is applied to balance training time and overfitting risk. The KL weight $\lambda=0.1$ is used for all masking levels except the baseline ($\alpha=0$). We use a fixed random seed (42), mini-batch shuffling, and gradient clipping (max-norm 1.0) to stabilize optimization.
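The early-stopping rule can be sketched as a small counter over validation accuracy. This is one common formulation and not necessarily the exact rule used by the authors; the class name `EarlyStopping`, the strict-improvement criterion, and the accuracy history are assumptions. The example uses a patience of 3 for brevity, whereas the paper uses 10.

```python
class EarlyStopping:
    """Stop when validation accuracy has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_accuracy):
        """Record one epoch; return True when training should stop."""
        if val_accuracy > self.best:
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation accuracies for illustration.
stopper = EarlyStopping(patience=3)
history = [0.90, 0.93, 0.95, 0.94, 0.95, 0.949, 0.96]
stopped_at = None
for epoch, acc in enumerate(history, start=1):
    if stopper.step(acc):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

In this toy run the best accuracy is reached at epoch 3 and training stops three non-improving epochs later, so the final value in the history is never evaluated.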
### IV-F Gradient-Based Importance Validation
To evaluate whether attention-based importance aligns with gradient sensitivity, we perform post-hoc gradient analysis. For a trained model $f_\theta$, we sample $N=3000$ sequences and compute importance scores:

$$I_i=\left\|\frac{\partial f_\theta(\mathbf{x})}{\partial\mathbf{E}(\mathbf{x})_i}\right\|_2,$$

where $\mathbf{E}(\mathbf{x})_i$ denotes the embedding at position $i$.
We then progressively mask positions in decreasing order of $I_i$ and measure classification accuracy as a function of the number of masked nucleotides $m$. A steeper accuracy drop under high-importance masking, compared to random or low-importance baselines, indicates that the model relies on a structured rather than diffuse set of positions.
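The progressive-masking evaluation can be sketched as a deletion curve: for each budget $m$, the $m$ highest-scoring positions are replaced by the padding token and accuracy is recomputed. In the sketch below the importance scores are supplied directly as made-up numbers, since computing $I_i$ requires automatic differentiation through a trained model; the classifier `toy_predict` and the data are likewise hypothetical stand-ins.

```python
PAD_INDEX = 0  # masked positions are replaced by the padding token


def mask_top_m(tokens, importance, m):
    """Mask the m positions with the largest importance scores."""
    order = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    masked = list(tokens)
    for i in order[:m]:
        masked[i] = PAD_INDEX
    return masked


def deletion_curve(sequences, labels, importances, predict, steps):
    """Accuracy after masking the top-m positions, for each m in `steps`."""
    curve = []
    for m in steps:
        correct = sum(
            predict(mask_top_m(seq, imp, m)) == y
            for seq, imp, y in zip(sequences, importances, labels)
        )
        curve.append((m, correct / len(sequences)))
    return curve


def toy_predict(tokens):
    """Hypothetical classifier: class 1 if token 3 occurs anywhere."""
    return 1 if 3 in tokens else 0


sequences = [[1, 3, 2, 4], [1, 2, 4, 1], [3, 1, 1, 2], [2, 4, 4, 1]]
labels = [1, 0, 1, 0]
importances = [
    [0.1, 0.9, 0.2, 0.1],
    [0.3, 0.2, 0.4, 0.1],
    [0.8, 0.1, 0.1, 0.2],
    [0.2, 0.3, 0.4, 0.1],
]
curve = deletion_curve(sequences, labels, importances, toy_predict, steps=[0, 1, 4])
print(curve)
```

In this toy setup a single masked position per sequence is enough to push accuracy to the 50% chance level of a balanced binary task, which is the kind of steep drop the analysis looks for.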
## V Results
We evaluate AttnGen on the demo\_human\_or\_worm benchmark \[[13](https://arxiv.org/html/2605.14073#bib.bib13)\] under multiple masking configurations (10%, 20%, 50%, 75%). All experiments use identical architectures, hyperparameters, and random seeds.
### V-A Classification Performance
Table [I](https://arxiv.org/html/2605.14073#S5.T1) reports classification accuracy across masking levels. AttnGen (10%) achieves 96.73% accuracy, compared to 95.83% for the baseline CNN (+0.90 pp). AttnGen (20%) reaches 96.10%.
At higher masking levels (50% and 75%), accuracy decreases to 95.55% and 79.81%, respectively. This drop suggests that removing too many positions eliminates information required for classification.
Table I: Classification accuracy on the demo\_human\_or\_worm benchmark.
### V-B Attention-Based Masking Visualization
Figure [1](https://arxiv.org/html/2605.14073#S5.F1) shows masking patterns at different levels. At 20% masking, removed positions are distributed across the sequence, while retained positions form compact regions that differ across samples. This variation indicates that masking is driven by sequence-specific importance estimates rather than fixed positional patterns. At higher masking levels (50%), retained regions become smaller and less connected, and classification becomes more sensitive to the remaining context.
Figure 1: Masking patterns for human and worm sequences at different masking levels. Blue: retained positions; red: masked positions.
### V-C Gradient-Based Importance Analysis
We compare attention-based importance with gradient-based rankings by measuring how accuracy changes when high-gradient positions are removed. For each sequence, positions are ranked by gradient magnitude and progressively masked.
Table II: Accuracy under progressive masking of high-gradient positions.

Accuracy decreases as more high-gradient positions are removed. Masking 10 positions reduces accuracy by approximately 14.5 pp, while masking all positions reduces performance to chance level. The standard deviation increases with the number of masked positions. This indicates variability across sequences: some retain partial predictive signal after masking, while others do not. This variation is consistent with differences in how informative regions are distributed across inputs.
Figure 2: Accuracy under progressive masking of high-gradient positions. Shaded region: $\pm1$ std.
### V-D Ablation Study
We study the 10% masking configuration to isolate the effects of attention-based masking and KL-consistency regularization. Table [III](https://arxiv.org/html/2605.14073#S5.T3) reports classification accuracy for each variant.
Table III: Ablation on AttnGen (10%).

Random masking with KL regularization (95.88%) performs similarly to the baseline (95.83%), indicating that KL consistency alone does not improve performance when masking is not guided by importance scores. In contrast, attention-based masking without KL (95.98%) yields a larger improvement, suggesting that identifying and removing low-importance positions contributes more directly to performance gains. Combining attention-based masking with KL regularization produces the highest accuracy (96.73%), indicating that attention determines which positions are removed, while KL regularization stabilizes the output distribution when they are.
### V-E Training Stability
Moderate masking ratios (10–20%) produce stable optimization, with validation loss decreasing smoothly and remaining close to training loss throughout training. The 50% configuration remains stable but shows a wider generalization gap, consistent with the accuracy drop reported in Table [I](https://arxiv.org/html/2605.14073#S5.T1).
In contrast, 75% masking leads to a rapid increase in validation loss after early epochs, indicating that the model fails to maintain generalization when too much input information is removed\.
## VI Conclusion
We presented AttnGen, a training framework that integrates importance estimation into the optimization process for genomic sequence classification. By combining attention-based saliency with progressive masking and KL-based consistency, the model focuses on a compact set of informative positions while maintaining predictive performance. On the demo\_human\_or\_worm benchmark, moderate masking (10–20%) provides the best trade-off, reaching up to 96.73% accuracy. Ablation results show that attention-guided masking is the primary source of improvement. Additionally, removing high-importance positions leads to a larger drop in accuracy compared to removing low-importance ones, indicating that the learned importance scores align with the model's predictions. Future work includes extending the method to longer and more complex genomic sequences.
## References
- \[1\] Stormo, G\. DNA binding sites: representation and discovery\. *Bioinformatics* 16, 16–23 \(2000\)\.
- \[2\] LeCun, Y\., Bengio, Y\. & Hinton, G\. Deep learning\. *Nature* 521, 436–444 \(2015\)\.
- \[3\] Karkehabadi, A\. & Sadeghmalakabadi, S\. Evaluating deep learning models for architectural image classification: A case study on the UC Davis campus\. *2024 IEEE 8th International Conference on Information and Communication Technology \(CICT\)*, pp\. 1–6 \(2024\)\.
- \[4\] Hassanpour, J\., Srivastav, V\., Mutter, D\. & Padoy, N\. Overcoming Dimensional Collapse in Self\-Supervised Contrastive Learning for Medical Image Segmentation\. \(2024\)\.
- \[5\] Simonyan, K\., Vedaldi, A\. & Zisserman, A\. Deep inside convolutional networks: Visualising image classification models and saliency maps\. *arXiv:1312\.6034* \(2013\)\.
- \[6\] Sundararajan, M\., Taly, A\. & Yan, Q\. Axiomatic attribution for deep networks\. In *ICML*, 3319–3328 \(2017\)\.
- \[7\] Shrikumar, A\., Greenside, P\. & Kundaje, A\. Learning important features through propagating activation differences\. In *ICML*, 3145–3153 \(2017\)\.
- \[8\] Avsec, Ž\. *et al\.* Effective gene expression prediction from sequence by integrating long\-range interactions\. *Nature Methods* 18, 1196–1203 \(2021\)\.
- \[9\] Adebayo, J\. *et al\.* Sanity checks for saliency maps\. In *NeurIPS* 31, 9505–9515 \(2018\)\.
- \[10\] Ross, A\., Hughes, M\. & Doshi\-Velez, F\. Right for the right reasons: Training differentiable models by constraining their explanations\. In *IJCAI*, 2662–2670 \(2017\)\.
- \[11\] Ismail, A\., Corrada Bravo, H\. & Feizi, S\. Improving deep learning interpretability by saliency guided training\. In *NeurIPS* 34, 26726–26739 \(2021\)\.
- \[12\] Karkehabadi, A\., Homayoun, H\. & Sasan, A\. SMOOT: Saliency guided mask optimized online training\. In *2024 IEEE 17th Dallas Circuits and Systems Conference \(DCAS\)*, pp\. 1–6 \(2024\)\.
- \[13\] Grešová, K\., Martinek, V\., Čechák, D\., Šimeček, P\. & Alexiou, P\. Genomic benchmarks: a collection of datasets for genomic sequence classification\. *BMC Genomic Data* 24, 25 \(2023\)\.
- \[14\] Karkehabadi, A\., Homayoun, H\. & Sasan, A\. Unified Gravity Loss for Robust Neural Networks Through Feature Space Optimization\. *Proceedings of the Great Lakes Symposium on VLSI 2025*, pp\. 947–953 \(2025\)\.
- \[15\] Alipanahi, B\., Delong, A\., Weirauch, M\. & Frey, B\. Predicting the sequence specificities of DNA\- and RNA\-binding proteins by deep learning\. *Nature Biotechnology* 33, 831–838 \(2015\)\.
- \[16\] Karkehabadi, A\., Latibari, B\., Homayoun, H\. & Sasan, A\. HLGM: A novel methodology for improving model accuracy using saliency\-guided high and low gradient masking\. In *2024 14th International Conference on Information Science and Technology \(ICIST\)*, pp\. 909–917 \(2024\)\.
- \[17\] Zhou, J\. & Troyanskaya, O\. Predicting effects of noncoding variants with deep learning\-based sequence model\. *Nature Methods* 12, 931–934 \(2015\)\.
- \[18\] Quang, D\. & Xie, X\. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences\. *Nucleic Acids Research* 44, e107 \(2016\)\.
- \[19\] Selvaraju, R\. *et al\.* Grad\-CAM: Visual explanations from deep networks via gradient\-based localization\. In *ICCV*, 618–626 \(2017\)\.
- \[20\] Kapishnikov, A\., Bolukbasi, T\., Viégas, F\. & Terry, M\. Guided integrated gradients: An adaptive path method for removing noise\. In *CVPR*, 5050–5058 \(2021\)\.