Carbon: Decoding the Language of Life

Reddit r/LocalLLaMA 05/19/26, 04:54 PM Models

dna-foundation-model open-source bioinformatics genomic-model hugging-face efficiency

Summary

Hugging Face released Carbon, a family of open DNA foundation models. Carbon-3B matches the current state of the art (Evo2-7B) while being 275x faster, using 6-mer tokenization, a factorized training loss, and curated genomic data.

https://preview.redd.it/rajj11v7j42h1.png?width=1744&format=png&auto=webp&s=72381de22a9bac4b30a59498d549bb09df075df3

Hey, it's loubna from Hugging Face. Very happy to share our latest release: Carbon 🧬, a family of open DNA foundation models. Carbon-3B matches the current SOTA (Evo2-7B) while being 275x faster.

We borrowed a lot from how modern LLMs are trained and from our SmolLM work, but DNA isn't language. Genomes are noisy, redundant, and shaped by evolution rather than communication. So we adjusted the recipe:

**Tokenizer.** Most genomic models tokenize at the nucleotide level, which blows up sequence length. BPE is the obvious LLM-style fix, but it doesn't behave well on DNA. We use deterministic 6-mer tokens (one token = 6 nucleotides): 6× shorter sequences and cheaper attention.

**Training loss.** With 6-mer tokens, cross-entropy scores a prediction that gets 5 of 6 nucleotides right the same as one that's completely wrong. This gets brittle late in training and produces loss spikes. We switch mid-training to a more flexible factorized loss (FNS).

**Data.** Genomes are mostly sparse, repetitive background. We curate down to a staged functional DNA + mRNA mixture, with every ratio chosen by ablation. Like mixing a web corpus, but for biology.

- Technical report: [https://github.com/huggingface/carbon/blob/main/tech-report.pdf](https://github.com/huggingface/carbon/blob/main/tech-report.pdf)
- Demo (with a biology primer for our ML friends): [https://huggingface.co/spaces/HuggingFaceBio/carbon-demo](https://huggingface.co/spaces/HuggingFaceBio/carbon-demo)

Happy to answer questions in the comments 🤗
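For readers who want to see the tokenizer and loss points concretely, here is a minimal Python sketch. It is not Carbon's actual code: the vocabulary ordering, the function names (`tokenize`, `token_match`, `nucleotide_match`), and the handling of lengths that are not a multiple of 6 are all assumptions made for illustration. It only shows non-overlapping 6-mer tokenization and why an all-or-nothing token score treats a 5-of-6 prediction like a complete miss. The FNS loss itself is described in the technical report and is not reproduced here.

```python
from itertools import product

K = 6              # nucleotides per token, as stated in the post
ALPHABET = "ACGT"  # assumption: no ambiguity codes such as N

# Hypothetical vocabulary: every possible 6-mer gets an integer id (4**6 = 4096).
# The id ordering is arbitrary and chosen only for this sketch.
VOCAB = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=K))}


def tokenize(seq: str) -> list[int]:
    """Split a DNA string into non-overlapping 6-mers and map each to an id.

    Assumes the length is a multiple of 6; a real tokenizer needs a rule
    for the leftover nucleotides, which this sketch does not guess at.
    """
    seq = seq.upper()
    if len(seq) % K != 0:
        raise ValueError("sequence length must be a multiple of 6 in this sketch")
    return [VOCAB[seq[i:i + K]] for i in range(0, len(seq), K)]


def token_match(pred: str, target: str) -> int:
    """All-or-nothing score: 1 only if the whole 6-mer is identical."""
    return int(pred == target)


def nucleotide_match(pred: str, target: str) -> float:
    """Fraction of the 6 positions that agree (partial credit)."""
    return sum(a == b for a, b in zip(pred, target)) / K


sequence = "ACGTACGTTAGC"      # 12 nucleotides
token_ids = tokenize(sequence)  # 2 tokens instead of 12

target = "ACGTAC"
near_miss = "ACGTAA"   # 5 of 6 nucleotides correct
total_miss = "TGCATG"  # 0 of 6 nucleotides correct

print(len(sequence), "nucleotides ->", len(token_ids), "tokens")
print("token-level:", token_match(near_miss, target), token_match(total_miss, target))
print("nucleotide-level:", nucleotide_match(near_miss, target), nucleotide_match(total_miss, target))
```

At the token level the near miss and the total miss both score 0, while a per-nucleotide view separates them. That gap is the motivation the post gives for moving away from plain token-level cross-entropy later in training.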

Similar Articles

@lvwerra: We are releasing Carbon: a crazy fast DNA model Carbon is 275x faster than the next best model. So fast you can process…

@ClementDelangue: The future of biology shouldn’t stay behind black-box APIs. Especially when it touches personal health. Whether you’re …

@adithya_s_k: Wake up ppl Huggingface just open sourced Genomic Foundational Models

LDARNet: DNA Adaptive Representation Network with Learnable Tokenization for Genomic Modeling

Decoding genetics with OpenAI o1
