Tag
MiniCPM5-1B is a new small language model from OpenBMB, apparently built from scratch with its own tokenizer and distinct behavior, generating excitement as a capable 1B model.
This pull request adds tokenizer support for MiniCPM5 to llama.cpp, extending the tool's compatibility with the MiniCPM family of models.
SupraLabs released Supra-50M, a compact 50M-parameter causal language model with base and instruct versions, trained on 20B tokens from fineweb-edu, achieving competitive benchmarks against larger models like GPT-2 and SmolLM.
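ztok is a high-performance, multithreaded tokenizer library written in Zig. It supports multiple formats (tiktoken, HF, SentencePiece, and others), runs 2–5x faster than existing solutions, and is suited to RAG chunking and dataset tokenization.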
HuggingFace releases Carbon, a DNA model that is 275x faster than the previous state-of-the-art (Evo2), enabling processing of the entire human genome on a single GPU in under two days. The model uses a unique tokenizer that splits sequences into 6-base chunks while maintaining single-base resolution, and comes with an interactive demo.
A technique that makes embedding models aware of number ordering by overriding the tokenizer and applying MLM fine-tuning, achieving 59% accuracy on number sorting benchmarks.
MAML is a novel multi-modal AI model that unifies understanding of chemistry, genetics, and proteins, outperforming specialized models on 11 drug discovery benchmarks, promising to accelerate pharmaceutical research and improve success rates.
Andrew Ng's newsletter covers recent AI developments including attacks on data centers, the release of Qwen3.5 in various sizes, DeepSeek's collaboration with Huawei, and Apple's multimodal tokenizer, alongside reflections on AI-driven job uncertainty and geopolitical risks.
Kronos is an open-source foundation model for financial K-line sequences, trained on data from over 45 global exchanges. It uses a specialized tokenizer and a decoder-only Transformer, and has been accepted at AAAI 2026.