Tag
LDARNet is a 120M-parameter hierarchical genomic foundation model that introduces learnable adaptive tokenization (inspired by H-Net's dynamic chunking) for masked language modeling on DNA sequences. It achieves state-of-the-art results on 5 histone modification tasks and outperforms models up to 20× larger on several genomic benchmarks, with learned token boundaries aligning with biological features like promoter motifs and splice junctions.