tokenizer

Tag

#tokenizer

MiniCPM5 1B - what is it?

Reddit r/LocalLLaMA · 2026-06-01

MiniCPM5-1B is a new small language model from OpenBMB. It appears to have been built from scratch with its own tokenizer, behaves distinctly from earlier models, and is generating excitement as a capable 1B model.

Add MiniCPM5 tokenizer support by zhangtao2-1 · Pull Request #23384 · ggml-org/llama.cpp

Reddit r/LocalLLaMA · 2026-05-27

This pull request adds tokenizer support for MiniCPM5 to llama.cpp, extending the tool's compatibility with the MiniCPM family of models.

[NEW] Supra-50M Released!

Reddit r/LocalLLaMA · 2026-05-22

SupraLabs released Supra-50M, a compact 50M-parameter causal language model available in base and instruct versions. It was trained on 20B tokens from fineweb-edu, and its benchmark results are reported as competitive with larger models such as GPT-2 and SmolLM.

ztok — a fast multithreaded tokenizer in Zig that loads tiktoken / HF / SentencePiece and is 2–5× faster

Reddit r/LocalLLaMA · 2026-05-22

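ztok is a high-performance multithreaded tokenizer library written in Zig. It supports multiple formats (tiktoken, HF, SentencePiece, and others), runs 2–5× faster than existing solutions, and is suited to RAG chunking and dataset tokenization.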

@lvwerra: We are releasing Carbon: a crazy fast DNA model Carbon is 275x faster than the next best model. So fast you can process…

X AI KOLs Following · 2026-05-19

Hugging Face has released Carbon, a DNA model that is 275x faster than the previous state of the art (Evo2), fast enough to process the entire human genome on a single GPU in under two days. The model uses a tokenizer that splits sequences into 6-base chunks while maintaining single-base resolution, and it comes with an interactive demo.

Number-aware embeddings

Reddit r/LocalLLaMA · 2026-05-19

A technique that makes embedding models aware of number ordering by overriding the tokenizer and applying MLM fine-tuning, achieving 59% accuracy on number-sorting benchmarks.

The biggest AI breakthrough in medicine & drug discovery

Reddit r/singularity · 2026-05-14

MAML is a novel multi-modal AI model that unifies understanding of chemistry, genetics, and proteins, outperforming specialized models on 11 drug discovery benchmarks, promising to accelerate pharmaceutical research and improve success rates.

Attacks On Data Centers, Qwen3.5 In All Sizes, DeepSeek’s Huawei Play, Apple’s Multimodal Tokenizer

The Batch · 2026-03-20

Andrew Ng's newsletter covers recent AI developments including attacks on data centers, the release of Qwen3.5 in various sizes, DeepSeek's collaboration with Huawei, and Apple's multimodal tokenizer, alongside reflections on AI-driven job uncertainty and geopolitical risks.

shiyu-coder/Kronos

GitHub Trending (daily) · 2026-05-14

Kronos is an open-source foundation model for financial K-line sequences, trained on data from over 45 global exchanges. It uses a specialized tokenizer and a decoder-only Transformer, and has been accepted at AAAI 2026.
