@akseljoonas: 3 weeks since ml-intern launched and we just hit 1M messages exchanged. that's 3.3 agent-years of ML research in 21 day…
Summary
ml-intern has processed over 1M messages in 3 weeks, enabling accelerated ML research with user projects including model training, architecture replication, and automation tasks.
Cached at: 05/11/26, 06:44 PM
3 weeks since ml-intern launched and we just hit 1M messages exchanged. that’s 3.3 agent-years of ML research in 21 days. 2 months’ worth of research every day. 17,383 training jobs total. talk about AI acceleration.
here’s some of what people built:
@cmpatino_ replicated the full DeepSeek v4 architecture and pre+post trained a 100M MoE from scratch. → https://huggingface.co/cmpatino/nanowhale-100m…
it landed a third place submission on @kellerjordan0’s optimizer competition. autoresearch on SOTA territory. https://github.com/KellerJordan/modded-nanogpt/pull/286…
@_lewtun got the intern to convert @AlecRad’s cool new talkie-lm 1930 model to work with transformers. tokenizer, chat template, model conversion etc. all one-shotted by ml-intern. https://huggingface.co/lewtun/talkie-1930-13b-it-hf…
someone created an entire PhD dissertation chapter on context-aware agentic cyber defense, drafted with 16 research subagents.
and someone used it to crack an @Anthropic kernel optimization take-home. (we don’t know how to feel about this one)
just getting started → https://huggingface.co/spaces/smolagents/ml-intern…
cmpatino/nanowhale-100m · Hugging Face
Source: https://huggingface.co/cmpatino/nanowhale-100m
nanowhale-100m 🐳
A small ~110M parameter language model implementing the DeepSeek-V4 architecture, fine-tuned for chat/instruction following. Trained from scratch: no weights from DeepSeek-V4 were used.
- Pretrained base model: cmpatino/nanowhale-100m-base
- This model: SFT on HuggingFaceTB/smol-smoltalk
- Training code: github.com/huggingface/nanowhale
Architecture
This model implements key DeepSeek-V4 innovations at a miniature scale:
| Component | Details |
|---|---|
| Parameters | ~110M total (41M embeddings, 69M non-embedding) |
| Hidden size | 320 |
| Layers | 8 |
| Attention heads | 8 (1 KV head, MQA-style) |
| MLA | Multi-head Latent Attention with q_lora_rank=160 |
| MoE | 4 routed experts + 1 shared, top-2 routing |
| Hyper-Connections | hc_mult=4, Sinkhorn routing (replacing residual connections) |
| MTP | 1 next-token prediction layer |
| Vocab | 129,280 (DeepSeek-V4 tokenizer) |
| Context | 2,048 tokens |
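Two of the figures in the architecture table can be checked with simple arithmetic. The sketch below assumes the "41M embeddings" figure counts a single vocab-by-hidden matrix, which the card does not state explicitly; the helper names are hypothetical and are not taken from the nanowhale training code.

```python
# Values copied from the architecture table; helper names are illustrative only.
VOCAB_SIZE = 129_280
HIDDEN_SIZE = 320
ROUTED_EXPERTS, SHARED_EXPERTS, TOP_K = 4, 1, 2


def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in one vocab-by-hidden embedding matrix (assumed counted once)."""
    return vocab_size * hidden_size


def active_experts_per_token(top_k: int, shared: int) -> int:
    """Expert MLPs that run for each token: the top-k routed ones plus every shared one."""
    return top_k + shared


emb = embedding_params(VOCAB_SIZE, HIDDEN_SIZE)
active = active_experts_per_token(TOP_K, SHARED_EXPERTS)

print(f"embedding params: {emb:,}")  # 41,369,600, in line with the ~41M in the table
print(f"experts run per token: {active} of {ROUTED_EXPERTS + SHARED_EXPERTS}")
```

Under that assumption, the 129,280-token vocabulary alone accounts for the roughly 41M embedding parameters listed in the table.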
Training
Stage 1: Pretraining
- Dataset: HuggingFaceFW/fineweb-edu
- Steps: 5,000 | Tokens: ~2.6B
- Batch: 32 effective (8 × 4 GA) | Seq length: 2,048
- LR: 6e-4, cosine, 3% warmup
- Precision: bf16 mixed
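For readers who want to see what a "cosine, 3% warmup" schedule looks like over 5,000 steps, here is a minimal sketch. It assumes linear warmup and cosine decay all the way to zero; the card specifies neither, and `lr_at_step` is a hypothetical helper rather than a function from the training code.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float, warmup_frac: float) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (both shapes are assumptions)."""
    warmup_steps = round(total_steps * warmup_frac)
    if step < warmup_steps:
        # Ramp up linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, in [0, 1).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Stage 1 values from the list above: 5,000 steps, LR 6e-4, 3% warmup (150 steps).
for step in (0, 149, 150, 2_575, 4_999):
    print(step, f"{lr_at_step(step, 5_000, 6e-4, 0.03):.2e}")
```

The same helper covers the SFT stage by passing 3,000 steps, a peak of 2e-5, and a warmup fraction of 0.05.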
Stage 2: SFT (this model)
- Dataset: HuggingFaceTB/smol-smoltalk (460K conversations)
- Steps: 3,000 | Tokens: ~72.7M
- Batch: 32 effective (8 × 4 GA) | Seq length: 2,048
- LR: 2e-5, cosine, 5% warmup
- Precision: fp32
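Both stages list the batch as "32 effective (8 × 4 GA)", that is, micro-batches of 8 with 4 gradient-accumulation steps. The toy sketch below illustrates why this can stand in for a single batch of 32: with equally sized micro-batches and a mean loss, the averaged micro-batch gradients equal the full-batch gradient. The 1-D model and the data are invented for illustration and have nothing to do with the real training loop.

```python
import random


def mse_grad(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of the mean squared error for the 1-D model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


random.seed(0)
MICRO_BATCH, ACCUM_STEPS = 8, 4  # the "8 x 4 GA" from the lists above
xs = [random.uniform(-1.0, 1.0) for _ in range(MICRO_BATCH * ACCUM_STEPS)]
ys = [3.0 * x for x in xs]  # toy targets, purely illustrative
w = 0.5

# Gradient from one pass over all 32 examples.
full_grad = mse_grad(w, xs, ys)

# Gradient accumulated over four micro-batches of 8, each scaled by 1/ACCUM_STEPS.
accum_grad = 0.0
for i in range(ACCUM_STEPS):
    lo, hi = i * MICRO_BATCH, (i + 1) * MICRO_BATCH
    accum_grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / ACCUM_STEPS

print(abs(full_grad - accum_grad) < 1e-9)
```

The equivalence is exact only up to floating-point rounding, and it relies on every micro-batch contributing the same number of terms to the mean.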
Metrics
| Metric | Pretrained | SFT |
|---|---|---|
| Eval loss | n/a | 2.607 |
| Perplexity (held-out) | 13.62 | 12.90 |
| Token accuracy | 33.8% | 48.5% |
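As a reading aid for the perplexity row: perplexity is conventionally the exponential of the mean per-token cross-entropy, so each value maps to a loss in nats. The sketch below assumes that convention applies here. The card does not say whether the eval loss and the held-out perplexity were measured on the same split, so the two rows are not assumed to convert into each other.

```python
import math


def perplexity_from_loss(loss_nats: float) -> float:
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(loss_nats)


def loss_from_perplexity(ppl: float) -> float:
    """Inverse mapping: mean per-token cross-entropy in nats."""
    return math.log(ppl)


# Held-out perplexities from the metrics table.
for name, ppl in (("pretrained", 13.62), ("SFT", 12.90)):
    print(f"{name}: perplexity {ppl} <-> {loss_from_perplexity(ppl):.3f} nats/token")
```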
Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load in fp32, as the card recommends; trust_remote_code is required for the custom architecture.
model = AutoModelForCausalLM.from_pretrained(
    "cmpatino/nanowhale-100m", trust_remote_code=True, dtype=torch.float32
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained("cmpatino/nanowhale-100m")

# Build a chat-formatted prompt and move the token IDs to the GPU.
messages = [{"role": "user", "content": "What are 3 benefits of exercise?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer.encode(prompt, return_tensors="pt").cuda()

# do_sample=True is needed for temperature and top_p to take effect.
output = model.generate(input_ids, max_new_tokens=200, do_sample=True, temperature=0.7,
                        top_p=0.9, pad_token_id=tokenizer.eos_token_id)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
Limitations
- Tiny model: 110M params with 129K vocabulary — most capacity goes to embeddings. Generations are often incoherent or factually wrong.
- Undertrained: Only 5K pretrain + 3K SFT steps. Production models train for 100K+ steps on trillions of tokens.
- Educational purpose: This model demonstrates the DeepSeek-V4 architecture at small scale. It is not suitable for any production use.
- fp32 recommended: The Hyper-Connections architecture can produce values that overflow bf16 range at this scale. Use dtype=torch.float32.
- Custom code: Requires trust_remote_code=True.
Hardware
Trained on 1× NVIDIA H100 80GB.
License
Apache-2.0