TaxDistill: Improving Metagenomic Taxonomic Annotation via Distilled Genomic Foundation Models

arXiv cs.LG 05/29/26, 04:00 AM Papers
Summary
TaxDistill proposes a knowledge distillation framework using a 500M parameter genomic foundation model (GenomeOcean) as a teacher to improve metagenomic taxonomic annotation by reducing label noise from similarity search tools, achieving significant F1 improvements on CAMI2 datasets.
arXiv:2605.28868v1 Announce Type: new Abstract: Metagenomic taxonomic annotation aims to identify the microbial origins of DNA fragments in environmental samples. Traditional methods that rely on sequence similarity are often constrained by the high microbial diversity and the incompleteness of reference databases, which has motivated the development of learning approaches such as Taxometer that perform post hoc correction to learn more informative metagenomic sequence representations. However, these methods typically rely on labels derived from similarity search tools during training, which inevitably introduces noise that can impair representation learning and degrade classification performance. To address this issue, we propose TaxDistill, a knowledge distillation framework for metagenomic classification. We introduce GenomeOcean, a 500M parameter genomic foundation model, as the teacher network to extract deep semantic features and generate soft labels based on confidence. By distilling this soft label information into a lightweight student network, TaxDistill effectively reduces the label noise introduced by initial retrieval tools. Comprehensive experiments on seven diverse CAMI2 datasets demonstrate that TaxDistill outperforms existing baselines in most scenarios. For instance, on the Gastrointestinal dataset, it improves the F1 score of MMseqs2 from 0.763 to 0.941, outperforming the Taxometer baseline. Overall, TaxDistill provides a reliable method for label correction in complex metagenomic analysis.
Cached at: 05/29/26, 09:12 AM
# TaxDistill: Improving Metagenomic Taxonomic Annotation via Distilled Genomic Foundation Models
Source: [https://arxiv.org/html/2605.28868](https://arxiv.org/html/2605.28868)
Rongye Ye$^{1,3,4,\dagger}$, Lun Li$^{1,2,3,\dagger}$, Zheng Luo$^{1,3,4}$, Yiran Zhan$^{1,3,4}$, Shuhui Song$^{1,2,3,4}$

$^{1}$ National Genomics Data Center, China National Center for Bioinformation, Beijing 100101, China
$^{2}$ Beijing Key Laboratory of Intelligent Governance and Application of Biological Big Data, China National Center for Bioinformation, Beijing 100049, China
$^{3}$ Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
$^{4}$ University of Chinese Academy of Sciences, Beijing 100049, China
$^{\dagger}$ These authors contributed equally to this work.


Corresponding author: Shuhui Song (email: songshh@big.ac.cn)

![Refer to caption](https://arxiv.org/html/2605.28868v1/Fig/Figure1.png)

Figure 1: Metagenomic Analysis Pipeline and Research Motivation

## 1 Introduction

Metagenomic sequencing has emerged as a crucial technology for profiling complex microbial communities, essentially deciphering the "language of life" hidden within environmental samples (Handelsman, [2004](https://arxiv.org/html/2605.28868#bib.bib6); Prabakaran and Bromberg, [2025](https://arxiv.org/html/2605.28868#bib.bib20); Levy Karin and Steinegger, [2025](https://arxiv.org/html/2605.28868#bib.bib12)). In clinical pathogen detection and disease microbiome characterization, the taxonomic annotation of sequences is a vital step whose objective is to precisely map sequencing reads or assembled contigs to specific taxonomic nodes (Simon et al., [2019](https://arxiv.org/html/2605.28868#bib.bib21); Chiu and Miller, [2019](https://arxiv.org/html/2605.28868#bib.bib4)). Currently, mainstream methods rely primarily on sequence similarity search algorithms (Wood et al., [2019](https://arxiv.org/html/2605.28868#bib.bib27); Kim et al., [2016](https://arxiv.org/html/2605.28868#bib.bib9); Kallenborn et al., [2025](https://arxiv.org/html/2605.28868#bib.bib8); Kim and Steinegger, [2024](https://arxiv.org/html/2605.28868#bib.bib10)). These methods perform strongly on well-characterized microbes, but their performance often decreases substantially on rare or novel species that are underrepresented in reference databases (Meyer et al., [2022](https://arxiv.org/html/2605.28868#bib.bib17)).

In recent years, with breakthroughs in deep learning for sequence modeling (Vaswani et al., [2017](https://arxiv.org/html/2605.28868#bib.bib24); Ye et al., [2025](https://arxiv.org/html/2605.28868#bib.bib28)), researchers have begun exploring neural network approaches for classification representation. For instance, Taxometer (Kutuzova et al., [2024](https://arxiv.org/html/2605.28868#bib.bib11)) is a feature aggregation method for metagenomic sequence classification that combines tetranucleotide frequencies (TNFs) and abundance information. It introduces a deep hierarchical loss (Valmadre, [2022](https://arxiv.org/html/2605.28868#bib.bib23)) to align with the taxonomic tree, thereby smoothly propagating classification signals from labeled sequences to unlabeled ones. However, the model encounters a critical bottleneck: its training is highly dependent on initial pseudo-labels generated by traditional sequence similarity tools. In highly complex microbial scenarios, such retrieval tools often produce large numbers of misclassifications and unassigned nodes (Meyer et al., [2022](https://arxiv.org/html/2605.28868#bib.bib17)), inevitably introducing severe label noise into subsequent training. Taxometer employs only a lightweight Multilayer Perceptron (MLP) as its feature encoder. Constrained by its limited capacity and sequence modeling capability, the model is prone to overfitting erroneous signals and falling into representation collapse when confronted with these noisy hard labels (Zhang et al., [2016](https://arxiv.org/html/2605.28868#bib.bib31); Liu et al., [2020](https://arxiv.org/html/2605.28868#bib.bib16); Vishwakarma et al., [2025](https://arxiv.org/html/2605.28868#bib.bib25)), which weakens its robustness against noisy pseudo-labels and its capacity for self-correction.

Genomic language models have demonstrated broad potential for applications in the life sciences (Lin et al., [2023](https://arxiv.org/html/2605.28868#bib.bib15); Brixi et al., [2026](https://arxiv.org/html/2605.28868#bib.bib2); Cheng et al., [2024](https://arxiv.org/html/2605.28868#bib.bib3); Ye et al., [2026](https://arxiv.org/html/2605.28868#bib.bib29)). Among them, the recently proposed GenomeOcean (Zhou et al., [2025](https://arxiv.org/html/2605.28868#bib.bib32)) is a 4B-parameter generative language model pre-trained on over 600 Gbp of large-scale, complex metagenomic assembled sequences. Similar to large language models in NLP, GenomeOcean adopts an efficient Byte Pair Encoding (BPE) tokenization strategy to construct its genomic vocabulary; it can not only capture the implicit grammatical constraints in DNA sequences but also model complex long-range dependencies.

Driven by the research objectives illustrated in Figure[1](https://arxiv.org/html/2605.28868#S0.F1), we introduce TaxDistill, a novel metagenomic classification framework that focuses specifically on taxonomic annotation at the contig level\. Similar to Taxometer, the core positioning of TaxDistill is a post\-hoc label denoising module, aiming to correct the results of initial retrieval based classifiers\. While retaining the highly efficient, lightweight architecture of Taxometer as the student network, we introduce the powerfully expressive GenomeOcean as the teacher network\. By leveraging the high dimensional continuous features and the confidence score of soft labels, which are enriched with dark knowledge distilled from GenomeOcean, we effectively neutralize the hard label noise introduced by traditional sequence retrieval tools\. Experiments demonstrate that this knowledge distillation framework endows the lightweight network with deep semantic understanding capabilities, and its classification performance consistently surpasses the Taxometer baseline across a series of benchmarks in complex microbial environments\. In summary, our main contributions are as follows:

1. We propose TaxDistill, a knowledge distillation framework for metagenomic taxonomic annotation. TaxDistill employs a plug-and-play design, allowing direct integration with any sequence alignment algorithm. To the best of our knowledge, this study is the first to introduce a metagenomic language model as a teacher within a knowledge distillation framework, effectively mitigating the problem of the student network overfitting to noisy labels.
2. Our experiments show that soft label distillation effectively endows the student network with uncertainty awareness in metagenomic taxonomic annotation. By selectively converting high-risk predictions into unclassified labels at ambiguous boundaries, TaxDistill achieves strict false positive control, ensuring high reliability for complex real-world applications.
3. We conducted comprehensive benchmarking on seven diverse microbial environment datasets from CAMI2, evaluating multiple mainstream sequence classifiers (MMseqs2 (Kallenborn et al., [2025](https://arxiv.org/html/2605.28868#bib.bib8)), Metabuli (Kim and Steinegger, [2024](https://arxiv.org/html/2605.28868#bib.bib10)), Kraken2 (Wood et al., [2019](https://arxiv.org/html/2605.28868#bib.bib27))) as well as the existing calibration model Taxometer (Kutuzova et al., [2024](https://arxiv.org/html/2605.28868#bib.bib11)). Experimental results demonstrate that TaxDistill outperforms the baseline models in the majority of scenarios.

## 2 Related Works

### 2.1 Metagenomic Sequence Classification

Traditional metagenomic taxonomic annotation methods are primarily based on sequence similarity and heuristic matching, with widely used tools including Kraken2 (Wood et al., [2019](https://arxiv.org/html/2605.28868#bib.bib27)), Centrifuge (Kim et al., [2016](https://arxiv.org/html/2605.28868#bib.bib9)), MMseqs2 (Kallenborn et al., [2025](https://arxiv.org/html/2605.28868#bib.bib8)), and the recently proposed Metabuli (Kim and Steinegger, [2024](https://arxiv.org/html/2605.28868#bib.bib10)). Although these dictionary-style retrieval methods are highly computationally efficient, they tend to produce a large number of erroneous labels or ambiguous taxonomic predictions when confronted with highly complex environmental metagenomic samples or novel microorganisms absent from reference databases.

To overcome the inherent limitations of sequence alignment methods, researchers in recent years have begun to introduce deep learning architectures to extract continuous spatial patterns from genetic sequences. Notable examples include DeepMicrobes (Liang et al., [2020](https://arxiv.org/html/2605.28868#bib.bib14)), which is based on bidirectional long short-term memory (Bi-LSTM) units, and MetaTransformer (Wichmann et al., [2023](https://arxiv.org/html/2605.28868#bib.bib26)), which leverages self-attention mechanisms. Although these end-to-end sequence classification models demonstrate excellent performance on standard benchmarks, their training heavily relies on artificially simulated short reads and real reference labels. This training paradigm, rooted in a fixed label set, is fundamentally a form of closed-set learning. However, real-world metagenomic environments are extraordinarily complex and replete with uncharted microbial communities (Nayfach et al., [2021](https://arxiv.org/html/2605.28868#bib.bib19); Thompson et al., [2017](https://arxiv.org/html/2605.28868#bib.bib22)). The discrepancy between the idealized fixed label set and the microbial diversity in real environments often causes these models to experience severe domain shift when applied to real environmental data, resulting in a significant decline in generalization performance.

To address the limitations of the traditional learning paradigm with fixed labels, the recently proposed Taxometer establishes a novel post\-hoc label correction method\. This model dynamically constructs a local label tree tailored to the current dataset based on the retrieval results of an initial classifier \(e\.g\., MMseqs2\)\. However, its lightweight architecture remains highly susceptible to overfitting the label noise introduced by initial retrieval tools\.

### 2.2 Knowledge Distillation

Knowledge Distillation (KD) is designed to transfer the rich representation capabilities of a complex teacher model to a student model with a more compact architecture (Hinton et al., [2015](https://arxiv.org/html/2605.28868#bib.bib7); Gou et al., [2021](https://arxiv.org/html/2605.28868#bib.bib5)). In recent years, an extensive body of research has demonstrated that KD offers significant advantages in mitigating the challenges associated with learning from noisy labels (Li et al., [2017](https://arxiv.org/html/2605.28868#bib.bib13); Müller et al., [2019](https://arxiv.org/html/2605.28868#bib.bib18)). When confronted with pseudo-labels containing substantial errors, traditional hard labels frequently cause lightweight networks to overfit. In contrast, the soft labels generated by a teacher network capture information that reveals the underlying similarities between different classes (Yuan et al., [2020](https://arxiv.org/html/2605.28868#bib.bib30)). This continuous probability distribution serves as a natural regularizer (Yuan et al., [2020](https://arxiv.org/html/2605.28868#bib.bib30); Ben-Baruch et al., [2024](https://arxiv.org/html/2605.28868#bib.bib1)), effectively preventing the student network from blindly overfitting to erroneous annotations.

Despite the recent emergence of parameterized genomic foundation models, efficiently distilling their deep semantic knowledge into lightweight networks for metagenomic sequence classification remains unexplored\. The proposed TaxDistill framework is specifically designed to bridge this gap\.

## 3 Methodology

![Refer to caption](https://arxiv.org/html/2605.28868v1/x1.png)

Figure 2: The overall architecture of the proposed TaxDistill framework. It consists of three core modules: multimodal data input formulation, the Teacher Model branch, and the Student Model branch.

In this section, we formally introduce the proposed TaxDistill framework. As illustrated in Figure [2](https://arxiv.org/html/2605.28868#S3.F2), the framework is designed for reliable metagenomic classification via distillation with soft labels. This study innovatively proposes a knowledge distillation framework based on a metagenomic language model and applies it to metagenomic taxonomic annotation tasks.

The Teacher branch employs the pre\-trained GenomeOcean foundation model with a frozen backbone\. It extracts deep semantic features from raw sequences and projects them through a learnable classification head to output a categorical probability distribution\. This branch is optimized independently via a deep hierarchical loss\.

Conversely, the Student branch maintains a lightweight MLP architecture to ensure inference with low latency. It processes a $(103+K+1)$-dimensional feature vector consisting of hand-crafted TNF features, abundances across $K$ environments, and total abundance.

During the joint optimization phase, the KD loss is introduced to quantify the divergence between the teacher and student distributions\. The student model’s parameters are jointly updated by its own hierarchical classification loss and the KD loss\. Meanwhile, the teacher model continues to be updated solely by its classification loss\. Detailed mathematical formulations are elaborated in Section[3\.1](https://arxiv.org/html/2605.28868#S3.SS1)\.

### 3.1 Problem Formulation and Notation

We first define a directed hierarchical taxonomic tree $\mathcal{T}=(\mathcal{V},\mathcal{E})$ for each target dataset, where $\mathcal{V}$ is the set of all taxonomic nodes, and $\mathcal{E}$ is the set of directed edges representing the parent-child hierarchical relationships (i.e., $(u,v)\in\mathcal{E}$ indicates that $u$ is the direct parent of $v$). Let $\mathcal{N}\subset\mathcal{V}$ denote the set of leaf nodes representing fine-grained taxonomic labels. For any node $u\in\mathcal{V}$, we define $\text{leaves}(u)$ as the set of all leaf nodes descending from $u$.

Given the $i$-th instance in the dataset, we use the same symbol $x_i$ to denote the inputs to both the teacher and student models for notational convenience. The classifier produces a taxonomic hierarchical assignment, and we denote by $y_i\in\mathcal{V}$ the finest-grained label in the hierarchical structure defined by the classifier output. The model, parameterized by $\theta$, receives $x_i$ and outputs a logit vector for all leaf nodes, denoted as $\mathbf{z}_i\in\mathbb{R}^{|\mathcal{N}|}$, where $z_{i,l}$ represents the $l$-th element of the vector. To mitigate the risk of false positive classifications, the classification path extends exclusively to child nodes with a probability exceeding 0.5, and is ultimately truncated using a strict threshold of 0.80.
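The thresholded descent described above can be sketched as follows. This is a minimal illustration under one possible reading of the rule: starting at the root, the path moves to the most probable child only while that child's probability exceeds 0.5, and the resulting path is then cut back to its deepest node whose probability is at least 0.80. The function name `decode_path`, the `children` and `node_prob` dictionaries, the toy taxonomy, and the order in which the two thresholds are applied are all assumptions made for illustration; the text does not specify them. Node probabilities are taken as given here (they are defined as leaf marginals in Section 3.2).

```python
def decode_path(children, node_prob, root="root", step_thr=0.5, final_thr=0.80):
    """Greedy top-down decoding with two thresholds (hypothetical sketch)."""
    path = []
    node = root
    # Descend while the best child is more likely than step_thr.
    while children.get(node):
        best = max(children[node], key=lambda c: node_prob[c])
        if node_prob[best] <= step_thr:
            break
        path.append(best)
        node = best
    # Assumed reading of the final truncation: drop trailing nodes
    # whose probability is below final_thr.
    while path and node_prob[path[-1]] < final_thr:
        path.pop()
    return path


# Toy taxonomy and node probabilities (hypothetical values).
children = {
    "root": ["Bacteria"],
    "Bacteria": ["Firmicutes", "Proteobacteria"],
    "Firmicutes": ["Bacillus", "Clostridium"],
}
node_prob = {
    "Bacteria": 1.0,
    "Firmicutes": 0.85,
    "Proteobacteria": 0.15,
    "Bacillus": 0.60,
    "Clostridium": 0.25,
}
path = decode_path(children, node_prob)
print(path)
```

In this toy case the descent reaches `Bacillus` (0.60 exceeds 0.5), but the final cut at 0.80 reports the contig only down to `Firmicutes`, which is how a stricter threshold trades resolution for fewer false positives.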

### 3.2 TaxDistill Framework

The base probability of a leaf node $l\in\mathcal{N}$ is computed via the standard softmax function:

$$P(l|x_i;\theta)=\frac{\exp(z_{i,l})}{\sum_{k\in\mathcal{N}}\exp(z_{i,k})}. \tag{1}$$

Building upon this, the probability of any non-leaf node $u$ is defined as the marginalized sum of the probabilities of all its leaf descendants:

$$P(u|x_i;\theta)=\sum_{l\in \text{leaves}(u)}P(l|x_i;\theta). \tag{2}$$
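As a small worked illustration of Eqs. (1) and (2), the sketch below computes leaf probabilities with a softmax and then obtains the probability of an internal node by summing over its leaf descendants. The helper names `leaf_softmax` and `node_probability`, the dictionary-based tree, and the toy logits are assumptions chosen for readability, not part of the described implementation.

```python
import math


def leaf_softmax(logits):
    """Eq. (1): softmax over leaf logits, given as {leaf: logit}."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {leaf: math.exp(z - m) for leaf, z in logits.items()}
    total = sum(exps.values())
    return {leaf: e / total for leaf, e in exps.items()}


def node_probability(node, children, leaf_prob):
    """Eq. (2): sum of the leaf probabilities below `node`."""
    if node in leaf_prob:  # a leaf carries its own probability
        return leaf_prob[node]
    return sum(node_probability(c, children, leaf_prob) for c in children[node])


# Toy tree: root -> {A, B}, A -> {a1, a2}, B -> {b1} (hypothetical).
children = {"root": ["A", "B"], "A": ["a1", "a2"], "B": ["b1"]}
leaf_prob = leaf_softmax({"a1": 2.0, "a2": 1.0, "b1": 0.0})
p_A = node_probability("A", children, leaf_prob)
p_root = node_probability("root", children, leaf_prob)
print(round(p_A, 3), round(p_root, 3))
```

Because every internal node is a sum over a subset of the leaves, probabilities can only decrease or stay equal when moving from a parent to a child, and the root always has probability 1.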
For the $i$-th sample, let $\mathcal{P}(y_i)$ be its initial hierarchical label path (i.e., the set of all nodes from the root to the assigned node $y_i$). The deep hierarchical loss maximizes the joint log-likelihood of all nodes along this path over a batch of size $N$:

$$\mathcal{L}_{\text{hier}}(\theta)=-\frac{1}{N}\sum_{i=1}^{N}\sum_{u\in\mathcal{P}(y_i)}\log P(u|x_i;\theta). \tag{3}$$
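Eq. (3) can be evaluated directly once node probabilities are available. The sketch below is a plain-Python illustration, assuming each sample is represented by a dictionary of node probabilities and a list of nodes on its label path; the function name `hierarchical_loss` and the toy values are hypothetical.

```python
import math


def hierarchical_loss(batch_node_probs, batch_paths):
    """Eq. (3): batch mean of -sum(log P(u)) over the nodes on each label path."""
    total = 0.0
    for node_prob, path in zip(batch_node_probs, batch_paths):
        total -= sum(math.log(node_prob[u]) for u in path)
    return total / len(batch_paths)


# Two toy samples; the second is labeled only down to an internal node.
batch_node_probs = [
    {"root": 1.0, "A": 0.9, "a1": 0.6},
    {"root": 1.0, "B": 0.5},
]
batch_paths = [["root", "A", "a1"], ["root", "B"]]
loss = hierarchical_loss(batch_node_probs, batch_paths)
print(round(loss, 4))
```

Note that a label path may stop at an internal node, as in the second sample, so a contig that the initial tool assigned only at a coarse rank still contributes a training signal.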
The temperature parameter controls the smoothness of the output distribution; as the temperature increases, the distribution becomes more uniform, thereby emphasizing information from non-dominant classes. We introduce a temperature scaling parameter $\tau>1$ to soften the probability distributions. Let $\theta_T$ and $\theta_S$ parameterize the teacher and student models, respectively, with their output logits denoted as $\mathbf{z}_i^{T}$ and $\mathbf{z}_i^{S}$. The softened distribution is defined as:

$$q^{M}(l|\mathbf{z}_i^{M};\tau)=\frac{\exp(z_{i,l}^{M}/\tau)}{\sum_{k\in\mathcal{N}}\exp(z_{i,k}^{M}/\tau)},\quad M\in\{S,T\}. \tag{4}$$
During joint training, to prevent the randomly initialized student model from distorting the teacher's probability distribution, we apply a stop-gradient (sg) operation to the teacher's logits. This ensures that the teacher is updated exclusively by the hierarchical loss derived from the initial labels. We then minimize the Kullback-Leibler (KL) divergence between the teacher's stop-gradient soft distribution and the student's prediction:

$$\mathcal{L}_{\text{KD}}(\theta_S)=\frac{\tau^{2}}{N}\sum_{i=1}^{N}\sum_{l\in\mathcal{N}}q^{T}(l|\text{sg}(\mathbf{z}_i^{T});\tau)\,\log\left(\frac{q^{T}(l|\text{sg}(\mathbf{z}_i^{T});\tau)}{q^{S}(l|\mathbf{z}_i^{S};\tau)}\right). \tag{5}$$
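The softened distributions of Eq. (4) and the distillation loss of Eq. (5) can be illustrated numerically as below. This is a forward-pass sketch only: plain Python has no automatic differentiation, so the stop-gradient is represented simply by treating the teacher logits as constants. The function names `soften` and `kd_loss`, the toy logits, and the value $\tau=2$ are assumptions (the text only requires $\tau>1$).

```python
import math


def soften(logits, tau):
    """Eq. (4): temperature-scaled softmax over a list of leaf logits."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, tau=2.0):
    """Eq. (5): tau^2 times the batch mean of KL(q_T || q_S).

    Teacher logits are plain constants here, standing in for sg(z_T).
    """
    n = len(teacher_logits)
    total = 0.0
    for z_t, z_s in zip(teacher_logits, student_logits):
        q_t = soften(z_t, tau)
        q_s = soften(z_s, tau)
        total += sum(t * math.log(t / s) for t, s in zip(q_t, q_s) if t > 0.0)
    return tau ** 2 * total / n


# Toy logits for a batch of two contigs over three leaves (hypothetical).
teacher = [[4.0, 1.0, 0.0], [0.5, 2.5, 0.0]]
student = [[2.0, 1.5, 0.0], [0.0, 1.0, 1.0]]
loss_kd = kd_loss(teacher, student, tau=2.0)
loss_same = kd_loss(teacher, teacher, tau=2.0)
print(loss_kd > 0.0, loss_same)
```

The loss vanishes when the student reproduces the teacher's softened distribution and is positive otherwise; raising `tau` flattens both distributions, so more of the loss comes from the non-dominant classes.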
Ultimately, the optimization objective for the teacher model $\theta_T$ is strictly to minimize its own hierarchical supervision loss:

$$\mathcal{L}_{\text{Teacher}}(\theta_T)=\mathcal{L}_{\text{hier}}(\theta_T). \tag{6}$$

Conversely, the optimization objective for the student model $\theta_S$ is a weighted combination of the hierarchical hard-label loss and the soft-label distillation loss, balanced by a hyperparameter $\alpha$:

$$\mathcal{L}_{\text{Student}}(\theta_S)=\alpha\mathcal{L}_{\text{hier}}(\theta_S)+(1-\alpha)\mathcal{L}_{\text{KD}}(\theta_S). \tag{7}$$

The total loss for the end-to-end framework is the sum of both components:

$$\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{Teacher}}(\theta_T)+\mathcal{L}_{\text{Student}}(\theta_S). \tag{8}$$
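A short derivation makes explicit why summing the two objectives does not couple the two networks. Because the teacher logits enter $\mathcal{L}_{\text{KD}}$ only through the stop-gradient, $\mathcal{L}_{\text{Student}}$ contributes no gradient with respect to $\theta_T$, and $\mathcal{L}_{\text{Teacher}}$ does not depend on $\theta_S$ at all. Under these definitions, the gradients of the total loss separate as:

$$
\begin{aligned}
\nabla_{\theta_T}\mathcal{L}_{\text{total}} &= \nabla_{\theta_T}\mathcal{L}_{\text{hier}}(\theta_T),\\
\nabla_{\theta_S}\mathcal{L}_{\text{total}} &= \alpha\,\nabla_{\theta_S}\mathcal{L}_{\text{hier}}(\theta_S)+(1-\alpha)\,\nabla_{\theta_S}\mathcal{L}_{\text{KD}}(\theta_S).
\end{aligned}
$$

Minimizing the total loss is therefore equivalent to training the teacher on the initial labels alone while the student follows both the initial labels and the current teacher distribution.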

## 4 Experiments

![Refer to caption](https://arxiv.org/html/2605.28868v1/Fig/Figure3.png)

Figure 3: CAMI2 Human Microbiome Dataset Experimental Results. Each row demonstrates the optimization results initialized by a specific tool: Metabuli (Row 1), MMseqs2 (Row 2), and Kraken2 (Row 3). TaxDistill (100M) and TaxDistill (500M) represent the results obtained using GenomeOcean teacher models with 100M and 500M parameters, respectively.

In this section, we systematically evaluate the proposed TaxDistill framework on the widely adopted CAMI2 benchmark dataset (Meyer et al., [2022](https://arxiv.org/html/2605.28868#bib.bib17)) and comprehensively compare its performance against mainstream metagenomic classification baseline models. The detailed dataset processing procedures and model evaluation metrics are provided in Appendix [A](https://arxiv.org/html/2605.28868#A1).

Due to the extreme heterogeneity of real-world metagenomic data, traditional transfer learning yields catastrophic performance degradation. To overcome conventional train-test split limitations, both the baseline Taxometer and our TaxDistill adopt an on-the-fly, transductive training mechanism. Specifically, the models are optimized directly on the target dataset, performing self-correction solely via noisy pseudo-labels without any access to ground truth annotations.

![Refer to caption](https://arxiv.org/html/2605.28868v1/x2.png)

Figure 4: Experimental Results on CAMI2 Plant Rhizosphere and Marine Datasets. Each column demonstrates the optimization results initialized by a specific tool: Metabuli (Column 1), MMseqs2 (Column 2), and Kraken2 (Column 3).

![Refer to caption](https://arxiv.org/html/2605.28868v1/x3.png)

Figure 5: Sankey diagram analysis of label transition and recalibration dynamics on the Marine dataset. The first row compares initial labeling tools with their TaxDistill optimized versions, while the second row contrasts optimized Taxometer with TaxDistill.

### 4.1 Baselines Setting

TaxDistill is designed as a flexible post\-hoc label correction framework\. To validate its performance, we evaluate it against a hierarchical set of baselines, divided into initial retrieval tools and post\-hoc correction models\.

Initial Retrieval Baselines: We utilize three mainstream retrieval tools to generate base predictions: 1) MMseqs2 (Kallenborn et al., [2025](https://arxiv.org/html/2605.28868#bib.bib8)) for $k$-mer pre-screening homology search; 2) Metabuli (Kim and Steinegger, [2024](https://arxiv.org/html/2605.28868#bib.bib10)) for bi-modal LCA alignment; and 3) Kraken2 (Wood et al., [2019](https://arxiv.org/html/2605.28868#bib.bib27)) for exact $k$-mer matching.

Post-hoc Correction Baselines: We subsequently apply denoising frameworks to refine the initial predictions. We compare TaxDistill against three experimental setups: 1) the raw uncorrected outputs of the initial tools; 2) Taxometer (Kutuzova et al., [2024](https://arxiv.org/html/2605.28868#bib.bib11)), which recalibrates using TNFs and sample abundances; and 3) Taxometer\_G, our modified version of Taxometer, in which the original TNFs are replaced with embeddings derived solely from GenomeOcean, in order to evaluate the impact of foundation model features.

### 4.2 Results on Human Microbiome Datasets

We systematically evaluated the proposed TaxDistill framework and baseline models on five highly diverse CAMI2 human microbiome datasets\. As shown in Figure[3](https://arxiv.org/html/2605.28868#S4.F3), we report the F1 scores alongside the number of correctly classified \(Correct\), misclassified \(Wrong\), and unassigned \(No label\) sequences, with full details provided in Table[2](https://arxiv.org/html/2605.28868#A2.T2)\.

Specifically, we evaluated TaxDistill across 15 scenarios from five datasets and three classification tools\. It outperformed Taxometer in 13 cases\. In addition, TaxDistill showed strong uncertainty\-awareness\. It could reclassify Taxometer’s errors as unknown or correct categories\. On the Gastrointestinal dataset using MMseqs2, Taxometer improves the initial F1 score from 0\.763 to 0\.924, while TaxDistill further increases it to 0\.941\. Similarly, on the Oral and Skin datasets using Metabuli, TaxDistill consistently outperforms Taxometer by about 1\.0% to 1\.2% in F1 score\. This indicates that, compared to the hand crafted shallow features used by Taxometer, the deep semantic knowledge from genomic foundation models enables the network to perform more accurate metagenomic taxonomic annotation\.

Experimental results indicate that MMseqs2 and Metabuli exhibit relatively conservative predictions\. When employed to generate initial pseudo\-labels, these base classifiers yield a substantial number of unassigned contigs\. While Taxometer can partially recover these sequences, TaxDistill shows a stronger rescue capability\. For instance, on the Airways dataset using MMseqs2, Taxometer achieves 164,925 correct taxonomic assignments\. TaxDistill further improves upon this result, increasing the correct classifications to 166,804\. Overall, TaxDistill reduces the total number of misclassified sequences by nearly 4,000, of which approximately 2,000 sequences are correctly classified from previous incorrect assignments\.

In contrast, Kraken2 uses an aggressive exact $k$-mer matching strategy that tends to force classifications. This results in fewer unassigned sequences but introduces a large number of false positive errors. Under this heavy label noise, Taxometer shows a relatively conservative calibration, predicting more unknown classes, whereas TaxDistill applies an even stricter calibration strategy. Taking the Airways dataset as an example, the base Kraken2 outputs 13,670 unassigned labels and 15,499 erroneous predictions. Taxometer increases the unassigned labels to 20,300 and reduces the errors to 14,377. TaxDistill applies stronger distillation regularization, significantly increasing the unassigned labels to 25,528 while further reducing the errors to 13,183. Although this conservative strategy of converting high-confidence errors into unassigned labels may slightly lower the F1 score on a few datasets compared to Taxometer, avoiding false positives is often much more critical than forcing a prediction in clinical metagenomic applications.
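The trade-off discussed above can be made concrete with a small calculation. The metric definitions are given in the Appendix and are not reproduced here, so the sketch below assumes a common convention: precision is the fraction of assigned contigs that are correct, and recall is the fraction of all contigs (including unassigned ones) that are correct. The function name `prf` and all counts are hypothetical and are not taken from the reported results.

```python
def prf(correct, wrong, unassigned):
    """Precision, recall, and F1 under the assumed definitions above."""
    assigned = correct + wrong
    total = assigned + unassigned
    precision = correct / assigned if assigned else 0.0
    recall = correct / total if total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for 10,000 contigs.
before = prf(correct=8000, wrong=1500, unassigned=500)
# 1,000 wrong calls are converted to "unassigned": precision rises, recall is unchanged.
after = prf(correct=8000, wrong=500, unassigned=1500)
# If 1,000 correct calls are also abstained, recall drops and F1 can fall.
overcautious = prf(correct=7000, wrong=500, unassigned=2500)

for name, (p, r, f1) in [("before", before), ("after", after), ("overcautious", overcautious)]:
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Under these assumed definitions, abstaining on errors alone can only help F1, whereas abstaining on predictions that were in fact correct lowers recall, which is one way a stricter calibration can reduce F1 slightly while still cutting false positives.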

Furthermore, across 15 test instances involving three taxonomic tools evaluated on five diverse datasets, TaxDistill outperformed Taxometer\_G in 13 cases. This suggests that simply replacing TNFs with foundation model embeddings, as Taxometer\_G does, is insufficient to substantially enhance label correction performance. Additionally, the 100M teacher model exhibits slightly lower overall label correction accuracy compared to the 500M model.

### 4.3 Rhizosphere and Marine Datasets

To further validate the label correction capability of TaxDistill in complex real-world scenarios, we conducted an in-depth analysis on two highly challenging environmental datasets: Rhizosphere and Marine. Among the five human microbiome datasets, the maximum number of contigs is approximately 200,000, whereas the Rhizosphere dataset contains over 300,000 contigs, and the Marine dataset exceeds 430,000 contigs. Unlike microbiomes associated with humans, the Marine dataset contains a vast amount of uncultured microbial genomic fragments and plasmids. Meanwhile, the Rhizosphere represents one of the most diverse and structurally complex ecosystems on Earth.

As shown in Figure[4](https://arxiv.org/html/2605.28868#S4.F4), when processing the marine dataset rich in uncultured fragments, TaxDistill demonstrates superior classification performance over Taxometer\. The performance improvements are particularly notable when using MMseqs2 and Metabuli as baselines\. Taking MMseqs2 as an example, Taxometer improves the base F1 score from 0\.743 to 0\.847, whereas TaxDistill further elevates it to 0\.864\. Similarly, TaxDistill yields a steady improvement of approximately 1\.5% on both Metabuli and Kraken2\. This demonstrates that Taxometer, which relies on hand crafted features, is prone to encountering representational bottlenecks\.

In the highly complex Rhizosphere dataset, the results under the MMseqs2 baseline in Figure [4](https://arxiv.org/html/2605.28868#S4.F4) show that Taxometer attempts to forcefully correct a large number of unassigned sequences. Although this pushes its F1 score to 0.772, the number of misclassifications also expands significantly. In contrast, TaxDistill exhibits notable uncertainty awareness. When faced with highly ambiguous subspecies boundaries, it is more capable than Taxometer of assigning erroneous samples to the unclassified category, thereby reducing the absolute number of misclassifications. This trend is equally evident with the Metabuli baseline. Notably, for the Kraken2 classifier, TaxDistill leverages this uncertainty awareness to its fullest extent: it reassigns a large number of erroneous predictions to the unknown label category. Compared to Taxometer, TaxDistill achieves an approximate 2.5% improvement in the F1 score and an 8% gain in Precision, while maintaining a roughly equivalent Recall. This demonstrates that in complex environments, TaxDistill can strictly control false positives for aggressive classifiers and apply conservative recalibration.

Furthermore, across six test scenarios covering two datasets and three classification tools, TaxDistill consistently outperformed Taxometer\_G, a trend consistent with the results observed on the human microbiome datasets\. In addition, the overall label correction performance of the 100M teacher model is slightly lower than that of the 500M teacher model\.

### 4\.4 Sankey Diagrams Analysis

To further analyze the recalibration mechanism, we visualize the label transition process on the Marine dataset using Sankey diagrams in Figure [5](https://arxiv.org/html/2605.28868#S4.F5)\. First, compared to the initial heuristic classifiers, TaxDistill achieves a substantial performance gain\. Using MMseqs2 as an example, the number of correct classifications rises from 141,296 to 183,023\. The Sankey flow illustrates that TaxDistill not only redirects a large volume of No label contigs to the correct category, but also mitigates initial misclassifications: a portion is directly corrected, while the remainder is conservatively assigned to the unknown label category\. For baselines like Kraken2 and Metabuli, TaxDistill maintains recalibration gains while showing a propensity to convert initial errors into unknown labels\.

Furthermore, in comparison with the advanced baseline Taxometer, TaxDistill demonstrates a highly favorable recalibration trade\-off\. Although it conservatively reassigns a negligible fraction of Taxometer’s correct predictions to the unknown state, it successfully converts a larger number of erroneous and unassigned predictions into correct labels, and is also able to assign erroneous predictions to the unknown category\. Notably, on the Metabuli baseline, TaxDistill not only yields more correct predictions than Taxometer but also significantly reduces the number of misclassifications by nearly 9,000\. This further confirms the effective recalibration capability of our model\.
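The label transitions that the Sankey diagrams visualize can be tabulated from per\-contig outcomes before and after recalibration\. The following is a minimal sketch rather than the authors' plotting code: the function name `transition_counts`, the three state strings, and the toy outcome lists are assumptions made for illustration, and the counts are not results from the paper\.

```python
from collections import Counter

# Hypothetical outcome states for each contig (names chosen for illustration).
STATES = ("Correct", "Wrong", "No label")


def transition_counts(before, after):
    """Count how many contigs move from each state to each state.

    `before` and `after` are equal-length sequences of state strings,
    aligned so that index i refers to the same contig in both.
    """
    if len(before) != len(after):
        raise ValueError("before and after must describe the same contigs")
    for state in list(before) + list(after):
        if state not in STATES:
            raise ValueError(f"unknown state: {state!r}")
    return Counter(zip(before, after))


# Toy data only; these are not results from the paper.
before = ["No label", "No label", "Wrong", "Wrong", "Correct", "Correct"]
after = ["Correct", "No label", "Correct", "No label", "Correct", "No label"]

flows = transition_counts(before, after)
for (src, dst), n in sorted(flows.items()):
    print(f"{src:>8} -> {dst:<8} {n}")
```

Each key of `flows` corresponds to one ribbon in a Sankey diagram, and its value to the ribbon width\.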

## 5 Conclusion

In this paper, we propose TaxDistill, a post\-hoc recalibration framework based on knowledge distillation\. TaxDistill features a plug\-and\-play design that can be integrated with any sequence alignment algorithm, which gives it good scalability\. It is designed to address the label noise introduced by heuristic tools in metagenomic taxonomic classification and to provide an accurate post\-hoc correction solution\. TaxDistill adopts an on\-the\-fly, transductive learning paradigm\. By distilling deep semantic knowledge from genomic foundation models, the framework relies solely on the noisy pseudo\-labels generated by heuristic tools to perform denoising and self\-correction directly on the target dataset\.

Extensive evaluations across seven diverse datasets demonstrate that TaxDistill outperforms Taxometer and existing baselines based on sequence retrieval in the vast majority of scenarios\. Notably, our framework exhibits strong uncertainty awareness\. When confronted with highly ambiguous classification boundaries, it converts high\-risk predictions into unclassified labels\. This suppresses the surge of false positives in complex environments, achieving a favorable trade\-off between model recall and strict false positive control\. Ultimately, TaxDistill provides an efficient and reliable solution for downstream ecological monitoring and clinical metagenomic diagnostics\.

## 6 Limitations

Although TaxDistill provides a reliable solution for metagenomic label correction, several limitations remain\. First, the efficacy of the knowledge distillation is inherently constrained by the representational boundaries of the teacher model\. Because our current framework fine\-tunes only the classification head, it is susceptible to semantic biases when processing highly complex samples\. These biases can propagate to the student network, underscoring the need to explore more advanced distillation architectures\. Second, TaxDistill achieves strict false positive control by conservatively reassigning high\-risk predictions to the unassigned category\. While this defensive strategy is of great practical value in clinical diagnostics and rigorous ecological monitoring, it operates fundamentally as a confidence\-based rejection mechanism and lacks the capacity for de novo discovery of unknown species\. In future work, we plan to overcome this bottleneck\.
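A confidence\-based rejection rule of the kind described above can be sketched in a few lines\. This is an illustrative assumption, not the decision rule used by TaxDistill: the function `reject_low_confidence`, the threshold value of 0\.5, the label names, and the probability vectors are all hypothetical\.

```python
def reject_low_confidence(probs, labels, threshold=0.5, unknown="unclassified"):
    """Return the arg-max label, or `unknown` if the top probability is below `threshold`."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best] if probs[best] >= threshold else unknown


# Hypothetical species labels and class probabilities for two contigs.
labels = ["species_A", "species_B", "species_C"]
confident = [0.90, 0.07, 0.03]   # one class clearly dominates
ambiguous = [0.40, 0.35, 0.25]   # no class reaches the threshold

print(reject_low_confidence(confident, labels))
print(reject_low_confidence(ambiguous, labels))
```

Such a rule can only abstain on a contig; it cannot propose a species outside the known label set, which is the limitation noted above\.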

## 7 Code and Dataset Availability

## 8 Acknowledgments

This work was supported by the National Key Research & Development Program of China \(2024YFC2311303, 2025YFF1207901, 2023YFC2604400\)\.

Table 1: Ablation study on the Marine dataset\. The table reports the performance variations \(F1\-score, Recall, and Precision\) across three different models under different $\alpha$ ratios \(top block\) and temperature settings \(bottom block\)\.

![Refer to caption](https://arxiv.org/html/2605.28868v1/x4.png)

Figure 6: Ablation study on the effect of contig sequence length on model inference performance\.

![Refer to caption](https://arxiv.org/html/2605.28868v1/x5.png)

Figure 7: Dataset contig volume statistics and time overhead analysis for end\-to\-end inference\.

## Appendix A Experimental Details

Metric Computation\. The evaluation metrics used in this study are defined as follows:

$$\text{Recall}=\frac{\mathit{Correct}}{\mathit{Correct}+\mathit{Wrong}+\mathit{No\ label}}, \tag{9}$$

$$\text{Precision}=\frac{\mathit{Correct}}{\mathit{Correct}+\mathit{Wrong}}. \tag{10}$$

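As a worked example of Equations \(9\) and \(10\), the sketch below computes both metrics from contig counts\. The function names and the counts are illustrative assumptions, not values from the paper, and F1 is assumed to be the usual harmonic mean of Precision and Recall\. The example also shows a property that follows directly from the definitions: moving contigs from Wrong to No label leaves the Recall denominator unchanged while shrinking the Precision denominator\.

```python
def recall(correct, wrong, no_label):
    """Eq. (9): correct contigs over all evaluated contigs."""
    total = correct + wrong + no_label
    return correct / total if total else 0.0


def precision(correct, wrong):
    """Eq. (10): correct contigs over all contigs that received a label."""
    labeled = correct + wrong
    return correct / labeled if labeled else 0.0


def f1_score(correct, wrong, no_label):
    """Harmonic mean of precision and recall (standard definition, assumed here)."""
    p = precision(correct, wrong)
    r = recall(correct, wrong, no_label)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Toy counts, not taken from the paper.
before = {"correct": 700, "wrong": 200, "no_label": 100}
# Reassign 100 wrong predictions to the unlabeled bucket; correct is unchanged.
after = {"correct": 700, "wrong": 100, "no_label": 200}

for name, c in (("before", before), ("after", after)):
    r = recall(c["correct"], c["wrong"], c["no_label"])
    p = precision(c["correct"], c["wrong"])
    f = f1_score(c["correct"], c["wrong"], c["no_label"])
    print(f"{name}: recall={r:.3f} precision={p:.3f} f1={f:.3f}")
```

Under these definitions, a recalibrator that only converts errors into abstentions raises Precision and F1 without changing Recall\.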
Dataset Setting\. To ensure a fair comparison, we adopt the identical benchmark datasets and data processing pipelines as utilized by Taxometer\. Specifically, to systematically validate the performance of the TaxDistill framework against noisy pseudo\-labels and its self\-correction capabilities, we construct a comprehensive benchmark suite primarily sourced from the CAMI2 datasets\. This benchmark encompasses highly diverse microbial environments, including five human microbiome datasets and two complex environmental microbiome datasets\. All performance evaluations are conducted at the species level\.

Sequence Assembly and Preprocessing\. Consistent with the analysis strategy of Taxometer, we conduct our evaluation using assembled sequences \(contigs\) rather than raw short reads\. This strategy ensures that deep representation learning can capture sufficient contextual semantic information\. We acquired the official contig and read files provided by the CAMI2 platform, which preserve the complete environmental characteristics of the metagenomic sequences\. In the main experiments, we applied a strict length filtering mechanism across all datasets, retaining exclusively contigs with a length greater than or equal to 2,000 base pairs \(bp\)\. Additionally, we evaluated the impact of varying assembly lengths on model performance in our ablation studies\.
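The length filter described above can be illustrated with a short sketch\. It is not the authors' preprocessing code: `read_fasta`, `filter_by_length`, and the toy contigs are hypothetical, and the only detail taken from the text is the 2,000 bp threshold\.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=2000):
    """Keep only records whose sequence is at least `min_len` bp long."""
    return [(h, s) for h, s in records if len(s) >= min_len]


# Toy FASTA content; contig_1 spans two sequence lines (1,200 + 900 bp).
toy = [
    ">contig_1", "A" * 1200, "C" * 900,
    ">contig_2", "G" * 1500,
    ">contig_3", "T" * 2000,
]
kept = filter_by_length(read_fasta(toy))
print([header for header, _ in kept])
```

The comparison is inclusive, so a contig of exactly 2,000 bp is retained\.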

Ground Truth Establishment and Label Alignment\. We followed Taxometer’s label processing logic: for the five human microbiome datasets, we directly extracted species\-level ground truth annotations using the GTDB\-Tk tool, strictly adhering to the Genome Taxonomy Database \(GTDB\) nomenclature\. However, the Marine and Rhizosphere datasets originally employed an NCBI\-based taxonomic system\. To homogenize the label space across all datasets, we uniformly mapped the labels of these environmental datasets from the NCBI system to the GTDB \(v226\) system, and systematically discarded ambiguous sequences that lacked a strict one\-to\-one species\-level mapping\.
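One way to picture the one\-to\-one filtering step is the sketch below\. It assumes that "strict one\-to\-one" means unambiguous in both directions, which the text does not spell out; the function `one_to_one_mapping` and all species and contig names are hypothetical placeholders, not real taxa\.

```python
from collections import defaultdict


def one_to_one_mapping(pairs):
    """Keep only NCBI -> GTDB species pairs that are unambiguous in both directions."""
    forward = defaultdict(set)   # NCBI name -> GTDB names it maps to
    backward = defaultdict(set)  # GTDB name -> NCBI names that map to it
    for ncbi, gtdb in pairs:
        forward[ncbi].add(gtdb)
        backward[gtdb].add(ncbi)

    mapping = {}
    for ncbi, targets in forward.items():
        if len(targets) != 1:
            continue  # one NCBI species split across several GTDB species
        (gtdb,) = targets
        if len(backward[gtdb]) == 1:  # skip GTDB species merged from several NCBI species
            mapping[ncbi] = gtdb
    return mapping


# Placeholder names for illustration only.
pairs = [
    ("ncbi_sp_1", "gtdb_sp_a"),
    ("ncbi_sp_2", "gtdb_sp_b"), ("ncbi_sp_2", "gtdb_sp_c"),  # split
    ("ncbi_sp_3", "gtdb_sp_d"), ("ncbi_sp_4", "gtdb_sp_d"),  # merged
]
mapping = one_to_one_mapping(pairs)

# Contigs whose NCBI label has no strict counterpart are dropped.
contig_labels = {"contig_1": "ncbi_sp_1", "contig_2": "ncbi_sp_2", "contig_3": "ncbi_sp_3"}
relabeled = {c: mapping[s] for c, s in contig_labels.items() if s in mapping}
print(relabeled)
```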

Hyperparameters\. For fair comparison, we adopt the same training configuration as Taxometer, setting the number of training epochs to 100 and the weight decay to $1\times 10^{-4}$\. For TaxDistill, the batch size is set to 64\. The learning rate of the student model is kept consistent with Taxometer and set to $1\times 10^{-3}$, while the teacher model uses a learning rate of $1\times 10^{-4}$\. For knowledge distillation, the distillation weight is set to 0\.3 and the temperature is set to 4\. Finally, we conduct ablation studies to systematically analyze the impact of these key hyperparameters\.

Hardware Details\. All experiments were conducted on a server equipped with two Intel\(R\) Xeon\(R\) Platinum 8558 CPUs \(96 cores in total\) and four NVIDIA GeForce RTX 5090 GPUs, each with 32 GB of memory\. All implementations were based on the PyTorch framework\.

## Appendix B Additional Experiments

### B\.1 Parameter Ablation Analysis

To investigate the specific effects of knowledge distillation hyperparameters on model performance, we conducted a systematic ablation study on the highly heterogeneous Marine dataset\.

#### Regularization Effect of Distillation Weight \($\alpha$\)\.

First, the distillation weight $\alpha$ determines the balance between the student model’s reliance on the initial hard pseudo\-labels and the teacher model’s soft labels during joint optimization\. Keeping other parameters constant, we evaluated the performance variations across $\alpha \in [0.3, 0.8]$\. As shown in Table [1](https://arxiv.org/html/2605.28868#S8.T1), setting $\alpha = 0.3$ yields the optimal F1\-score and Precision across all three baseline classifiers\. Conceptually, a lower $\alpha$ compels the student model to heavily leverage the probability distribution output by the teacher network as a strong regularizer\. This mechanism effectively mitigates the student model’s tendency to overfit the erroneous hard labels introduced by frontend heuristic tools, thereby enhancing overall performance\.
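The role of $\alpha$ can be made concrete with a sketch of a distillation objective\. This is the widely used formulation with a temperature\-scaled KL term, written under two assumptions that this appendix does not confirm: that $\alpha$ multiplies the hard\-label cross\-entropy and $1-\alpha$ multiplies the teacher term \(consistent with the statement that a lower $\alpha$ leans more on the teacher\), and that the soft term is scaled by $T^2$\. The function names and logits are hypothetical\.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of `logits` divided by `temperature`."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, hard_label,
                      alpha=0.3, temperature=4.0):
    """alpha * CE(student, hard label) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    # Hard-label term: cross-entropy against the (possibly noisy) pseudo-label.
    p_student = softmax(student_logits)
    ce = -math.log(p_student[hard_label])
    # Soft-label term: KL divergence between temperature-softened distributions.
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(q_teacher, q_student) if t > 0)
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl


# Toy case: the student agrees with the pseudo-label (class 0),
# while the teacher prefers class 1.
student_logits = [2.0, 1.0, 0.1]
teacher_logits = [0.5, 3.0, 0.2]

loss_low_alpha = distillation_loss(student_logits, teacher_logits, 0, alpha=0.3)
loss_high_alpha = distillation_loss(student_logits, teacher_logits, 0, alpha=0.8)
print(loss_low_alpha > loss_high_alpha)
```

In this toy case the lower $\alpha$ penalizes the student more for following a pseudo\-label that the teacher disagrees with, which is the regularizing effect described above\.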

#### Performance Trade\-off with Distillation Temperature \($T$\)\.

Furthermore, with $\alpha$ fixed at 0\.3, we analyzed the impact of the distillation temperature $T$\. Increasing $T$ further softens the teacher model’s logits, effectively smoothing the decision boundaries between ambiguous taxonomic classes\. Experimental results indicate that while higher temperatures \(e\.g\., $T \in \{5, 6\}$\) marginally improve the overall F1\-score by boosting Recall, they inevitably lead to a degradation in Precision\. Given that strictly controlling the risk of false positives is typically more critical than merely maximizing recall in the analysis of complex real\-world environmental samples, a careful balance is required\. By striking an optimal trade\-off, setting $T = 4$ sustains a highly competitive F1\-score while effectively preventing the deterioration of Precision\.
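The softening effect of $T$ can be checked numerically\. The sketch below applies a temperature\-scaled softmax to a made\-up logit vector and reports the top probability and the entropy for several temperatures; the function names and logits are illustrative assumptions and do not come from the paper\.

```python
import math


def softened(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


teacher_logits = [6.0, 4.5, 1.0, 0.5]  # toy values, two close leading classes

for t in (1, 4, 5, 6):
    probs = softened(teacher_logits, t)
    print(f"T={t}: top probability={max(probs):.3f}, entropy={entropy(probs):.3f}")
```

As $T$ grows, the top probability falls and the entropy rises, so more probability mass reaches the runner\-up classes that the student is trained to match\.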

### B\.2 Contigs Length Analysis

In metagenomic analysis, contig sequences serve as the primary carriers of microbial genetic information; theoretically, longer sequences encapsulate richer contextual semantics\. To investigate whether the model maintains robust label correction capabilities under information\-constrained conditions, we conducted experimental analyses across two length intervals: 1000\-1500 bp and 1500\-2000 bp\. As illustrated in Figure [6](https://arxiv.org/html/2605.28868#S8.F6), on the MMseqs2 and Metabuli baselines, TaxDistill demonstrates consistently superior recalibration performance compared to Taxometer across both length intervals\. This indicates that, benefiting from the effective transfer of global deep semantics facilitated by the knowledge distillation mechanism, TaxDistill can overcome the local feature deficits caused by low information density, thereby maintaining its recalibration advantage in short\-sequence scenarios\. On the Kraken2 classifier, although TaxDistill and Taxometer exhibit comparable overall F1\-scores, TaxDistill achieves superior Precision\. When short sequences result in highly ambiguous classification boundaries, TaxDistill effectively leverages its distribution uncertainty awareness to safely relegate unreliable predictions to the unassigned category\.

### B\.3 Time Analysis

Finally, we systematically evaluated the computational efficiency and time overhead of the TaxDistill framework\. As illustrated in the left panel of Figure [7](https://arxiv.org/html/2605.28868#S8.F7), we report both the total number of contigs satisfying the initial length filtering criterion and the subset of contigs with ground\-truth labels used for final performance evaluation across various datasets\. The right panel of Figure [7](https://arxiv.org/html/2605.28868#S8.F7) provides a detailed comparison of the execution times between the Taxometer baseline and TaxDistill models of two different parameter scales across five datasets\.

Analysis of Figure [7](https://arxiv.org/html/2605.28868#S8.F7) reveals that while the introduction of a genomic foundation model for deep feature extraction increases runtime, the overall computational cost remains within a tractable range for practical applications\. This efficiency is primarily attributed to the design of TaxDistill, which only requires fine\-tuning a lightweight classification head\. On the highly complex Marine dataset, the end\-to\-end runtime of TaxDistill \(500M\) is approximately 110 minutes longer than that of the Taxometer baseline\. The more lightweight TaxDistill \(100M\) variant introduces an additional overhead of approximately 36 minutes on the same dataset\.

Table 2: Comprehensive evaluation of post\-correction models across diverse environmental datasets\. The table reports the absolute number of Correct, Wrong, and Unlabeled \(Unlbl\.\) contigs assigned by three base classifiers before and after applying various post\-correction strategies\. The total number of contigs \($N$\) for each dataset is indicated in the first column\.

Table 3: Comprehensive evaluation of post\-correction models across diverse environmental datasets\. The table reports the F1 Score, Recall \(Rec\.\), and Precision \(Prec\.\) achieved by three base classifiers before and after applying various post\-correction strategies\. The total number of contigs \($N$\) for each dataset is indicated in the first column\.