@lvwerra: We are releasing Carbon: a crazy fast DNA model Carbon is 275x faster than the next best model. So fast you can process…

X AI KOLs Following 05/19/26, 04:31 PM Models

dna-model bioinformatics tokenizer huggingface ai-model open-source genome

Summary

Hugging Face releases Carbon, a DNA model reported to be 275x faster than the previous state of the art at its size (Evo2), fast enough to process the entire human genome on a single GPU in under two days. The model uses a tokenizer that splits sequences into 6-base chunks while keeping single-base resolution, and it comes with an interactive demo.

Original Article

Cached at: 05/20/26, 02:25 AM

We are releasing Carbon: a crazy fast DNA model

Carbon is 275x faster than the next best model. So fast you can process the whole human genome on a single GPU in <2 days.
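As a rough sanity check on the "<2 days" figure, the sketch below works out the sustained throughput it implies. The genome length of about 3.1 billion bases is an assumption (an approximate haploid human genome size), and the post does not say exactly what "process" covers, so treat the numbers as a back-of-envelope estimate rather than a benchmark.

```python
# Back-of-envelope throughput implied by "whole human genome in <2 days".
GENOME_BASES = 3.1e9        # assumption: approximate haploid human genome length
SECONDS = 2 * 24 * 60 * 60  # two days, the upper bound quoted in the post
CHUNK = 6                   # bases per token, as described in the post


def required_throughput(n_bases, seconds, chunk):
    """Return (bases per second, tokens per second) needed to finish in time."""
    bases_per_s = n_bases / seconds
    return bases_per_s, bases_per_s / chunk


bases_per_s, tokens_per_s = required_throughput(GENOME_BASES, SECONDS, CHUNK)
print(f"{bases_per_s:,.0f} bases/s, {tokens_per_s:,.0f} tokens/s")
```

Under these assumptions the model would need to sustain on the order of 18,000 bases per second, which is about 3,000 six-base tokens per second.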

Here are the tricks we used:

When modelling DNA sequences, a lot of the performance comes down to tokenizing the sequences in a smart way. BPE tokenizers struggle because DNA has no whitespace, and character-level tokenizers (a character is called a base in DNA) waste a lot of compute on too many tokens.
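To make the token-count argument concrete, here is a minimal sketch comparing character-level tokenization with fixed, non-overlapping 6-base chunks. The function names and the example sequence are made up for illustration, and this is not Carbon's actual tokenizer code.

```python
def char_tokens(seq):
    """Character-level tokenization: one token per base."""
    return list(seq)


def chunk_tokens(seq, k=6):
    """Non-overlapping k-base chunks; the last chunk may be shorter than k."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


seq = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"  # made-up example sequence
print(len(char_tokens(seq)), "character-level tokens")
print(len(chunk_tokens(seq)), "six-base chunk tokens")
```

For long sequences the chunked version yields about one sixth as many tokens, which is where the compute saving over a base-level tokenizer comes from.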

Carbon is built with a unique tokenizer: we split sequences into chunks of 6 bases, but during both training and inference we can work at single-base resolution. That's similar to having word tokens but resolving them at the character level. All possible thanks to the DNA tokens' unique structure.
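One way to picture the "unique structure" of such tokens: with a 4-letter alphabet, every 6-base chunk can be read as a 6-digit base-4 number, so the vocabulary has exactly 4^6 = 4096 entries and each individual base can be recovered from the token id alone. The sketch below illustrates that idea only. The post does not describe Carbon's actual encoding, so the id scheme, the `ACGT` ordering, and all function names here are assumptions.

```python
BASES = "ACGT"  # assumption: 4-letter alphabet, no N or other ambiguity codes
K = 6           # bases per token, as described in the post


def encode_chunk(chunk):
    """Map a K-base chunk to an integer id in [0, 4**K) by reading it as base 4."""
    assert len(chunk) == K
    token_id = 0
    for base in chunk:
        token_id = token_id * 4 + BASES.index(base)
    return token_id


def base_at(token_id, pos):
    """Recover the base at position pos (0 = leftmost) from the token id alone."""
    digit = (token_id // 4 ** (K - 1 - pos)) % 4
    return BASES[digit]


def decode_chunk(token_id):
    """Rebuild the full K-base chunk from its id."""
    return "".join(base_at(token_id, p) for p in range(K))


token_id = encode_chunk("GATTAC")
print(token_id, decode_chunk(token_id), base_at(token_id, 0))
```

Because the mapping is lossless, a model could in principle predict whole-chunk tokens while still scoring or editing a single base inside a chunk, which is the word-versus-character analogy the post draws.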

The architecture combined with the tokenizer makes the model 275x faster than the previous SoTA (Evo2) at this size.

We built an interactive demo so you can explore how the model can generate DNA sequences, investigate the structure of genes, predict the effect of mutations, generate and fold proteins and even reconstruct parts of the tree of life.

https://huggingface.co/spaces/HuggingFaceBio/carbon-demo…

Carbon - a Hugging Face Space by HuggingFaceBio

Source: https://huggingface.co/spaces/HuggingFaceBio/carbon-demo

Similar Articles

Carbon: Decoding the Language of Life

@adithya_s_k: Wake up ppl Huggingface just open sourced Genomic Foundational Models

@ClementDelangue: The future of biology shouldn’t stay behind black-box APIs. Especially when it touches personal health. Whether you’re …

@draecomino: Cerebras sets a new record: a one trillion parameter model @ 1,000 tokens/s

@TeksEdge: Wow! New open source Computer Use model shows strong local performance on LLM Leaderboard using a single DGX Spark! Thi…
