@akseljoonas: 3 weeks since ml-intern launched and we just hit 1M messages exchanged. that's 3.3 agent-years of ML research in 21 day…


Summary

ml-intern reached 1M messages exchanged and 17,383 training jobs within 3 weeks of launch. User projects include training a small MoE model from scratch, replicating a model architecture, converting a model to work with transformers, and drafting a dissertation chapter.

3 weeks since ml-intern launched and we just hit 1M messages exchanged. that's 3.3 agent-years of ML research in 21 days. 2 months worth of research every day. 17,383 training jobs total. talk about AI acceleration.

here's some of what people built:

@cmpatino_ replicated the full DeepSeek v4 architecture and pre+post trained a 100M MoE from scratch. → https://huggingface.co/cmpatino/nanowhale-100m…

it landed a third place submission on @kellerjordan0 optimizer competition. autoresearch on SOTA territory. https://github.com/KellerJordan/modded-nanogpt/pull/286…

@_lewtun Got the intern to convert @AlecRad's cool new talkie-lm 1930 model to work with transformers. tokenizer, chat template, model conversion etc all one-shotted by ml-intern. https://huggingface.co/lewtun/talkie-1930-13b-it-hf…

someone created entire PhD dissertation chapter on context-aware agentic cyber defense drafted with 16 research subagents.

and someone used it to crack an @Anthropic kernel optimization take-home. (we don't know how to feel about this one)

just getting started → https://huggingface.co/spaces/smolagents/ml-intern…

Cached at: 05/11/26, 06:44 PM



cmpatino/nanowhale-100m · Hugging Face

Source: https://huggingface.co/cmpatino/nanowhale-100m

nanowhale-100m 🐳

A small ~110M parameter language model implementing the DeepSeek-V4 architecture, fine-tuned for chat/instruction following. Trained from scratch; no weights from DeepSeek-V4 were used.

Architecture

This model implements key DeepSeek-V4 innovations at a miniature scale:

| Component | Details |
|---|---|
| Parameters | ~110M total (41M embeddings, 69M non-embedding) |
| Hidden size | 320 |
| Layers | 8 |
| Attention heads | 8 (1 KV head, MQA-style) |
| MLA | Multi-head Latent Attention with q_lora_rank=160 |
| MoE | 4 routed experts + 1 shared, top-2 routing |
| Hyper-Connections | hc_mult=4, Sinkhorn routing (replacing residual connections) |
| MTP | 1 next-token prediction layer |
| Vocab | 129,280 (DeepSeek-V4 tokenizer) |
| Context | 2,048 tokens |
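To make the "4 routed experts + 1 shared, top-2 routing" entry concrete, here is a minimal pure-Python sketch of top-k expert routing on a scalar input. It is an illustration under assumptions, not the model's actual code: the card does not say how router scores are normalized, so the softmax over the selected experts is a guess, and the names `top_k_route` and `moe_forward` are hypothetical.

```python
import math

def top_k_route(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumption: weights are a softmax over the selected logits only.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    top = max(router_logits[i] for i in chosen)  # subtract max for stability
    exps = [math.exp(router_logits[i] - top) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_forward(x, router_logits, routed_experts, shared_expert, k=2):
    """Shared expert always runs; the top-k routed experts are added with their weights."""
    out = shared_expert(x)
    for idx, weight in top_k_route(router_logits, k):
        out += weight * routed_experts[idx](x)
    return out

# Toy setup: 4 routed experts that scale the input, plus 1 shared identity expert.
routed = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
shared = lambda x: x
logits = [0.5, 2.0, -1.0, 1.0]  # made-up router scores for one token

print(top_k_route(logits))      # experts 1 and 3 are selected
print(moe_forward(1.0, logits, routed, shared))
```

Only two of the four routed experts run per token, which is why a MoE layer can hold more parameters than it spends compute on for any single token.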

Training

Stage 1: Pretraining

  • Dataset: HuggingFaceFW/fineweb-edu
  • Steps: 5,000 | Tokens: ~2.6B
  • Batch: 32 effective (8 × 4 gradient accumulation steps) | Seq length: 2,048
  • LR: 6e-4, cosine, 3% warmup
  • Precision: bf16 mixed
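The "cosine, 3% warmup" schedule can be sketched as a small function of the step index. This is a rough illustration using the pretraining values listed above (5,000 steps, peak LR 6e-4). The card does not state the warmup shape or the final learning rate, so linear warmup and decay to zero are assumptions, and `lr_at` is a hypothetical name.

```python
import math

def lr_at(step, total_steps=5000, peak_lr=6e-4, warmup_frac=0.03, min_lr=0.0):
    """Learning rate at a given optimizer step.

    Assumptions: linear warmup to peak_lr, then cosine decay to min_lr.
    """
    warmup_steps = round(total_steps * warmup_frac)  # 150 steps for 3% of 5,000
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 149, 150, 2575, 5000):
    print(step, lr_at(step))
```

The same function would cover the SFT stage by calling it with `total_steps=3000`, `peak_lr=2e-5`, and `warmup_frac=0.05`.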

Stage 2: SFT (this model)

  • Dataset: HuggingFaceTB/smol-smoltalk (460K conversations)
  • Steps: 3,000 | Tokens: ~72.7M
  • Batch: 32 effective (8 × 4 gradient accumulation steps) | Seq length: 2,048
  • LR: 2e-5, cosine, 5% warmup
  • Precision: fp32
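Both stages list an effective batch of 32 built from micro-batches of 8 accumulated over 4 steps. The toy example below sketches why that works: averaging the gradients of 4 equal-sized micro-batches gives the same gradient as one batch of 32. The one-parameter linear model and the function names are hypothetical, chosen only to keep the example dependency-free.

```python
def mse_grad(w, xs, ys):
    """Gradient of the mean squared error for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch=8, accum_steps=4):
    """Average the micro-batch gradients before a single optimizer step."""
    grad = 0.0
    for i in range(accum_steps):
        lo, hi = i * micro_batch, (i + 1) * micro_batch
        # Dividing by accum_steps keeps the result a mean, not a sum.
        grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accum_steps
    return grad

# 32 synthetic samples from y = 3x + 1.
xs = [0.1 * i for i in range(32)]
ys = [3.0 * x + 1.0 for x in xs]

full = mse_grad(0.5, xs, ys)           # one batch of 32
accum = accumulated_grad(0.5, xs, ys)  # 4 micro-batches of 8
print(full, accum)                     # equal up to floating-point rounding
```

The equivalence holds because the micro-batches are the same size. With variable-length sequences and a per-token loss, the two can differ slightly unless the loss is normalized over the full accumulated batch.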

Metrics

| Metric | Pretrained | SFT |
|---|---|---|
| Eval loss | n/a | 2.607 |
| Perplexity (held-out) | 13.62 | 12.90 |
| Token accuracy | 33.8% | 48.5% |
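For readers unfamiliar with these metrics, the sketch below shows the standard definitions: perplexity as the exponential of the mean per-token negative log-likelihood, and token accuracy as the fraction of positions where the predicted token matches the target. The card does not say how its numbers were computed or on which data, so this is an illustration of the usual formulas, with hypothetical function names and made-up inputs.

```python
import math

def perplexity(token_nlls):
    """exp of the mean negative log-likelihood (in nats) over tokens."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def token_accuracy(predicted_ids, target_ids):
    """Fraction of positions where the predicted token equals the target."""
    correct = sum(p == t for p, t in zip(predicted_ids, target_ids))
    return correct / len(target_ids)

# Made-up values for illustration only.
nlls = [math.log(4.0)] * 3  # every token assigned probability 1/4
print(perplexity(nlls))     # about 4: as uncertain as a uniform choice among 4 tokens
print(token_accuracy([1, 2, 3, 4], [1, 2, 0, 4]))  # 3 of 4 positions match
```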

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load in fp32 (recommended under Limitations); the custom architecture needs trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(
    "cmpatino/nanowhale-100m", trust_remote_code=True, dtype=torch.float32
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained("cmpatino/nanowhale-100m")

# Format a single-turn conversation with the model's chat template.
messages = [{"role": "user", "content": "What are 3 benefits of exercise?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer.encode(prompt, return_tensors="pt").cuda()

# temperature and top_p only take effect when sampling is enabled, hence do_sample=True.
output = model.generate(input_ids, max_new_tokens=200, do_sample=True, temperature=0.7,
                        top_p=0.9, pad_token_id=tokenizer.eos_token_id)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))

Limitations

  • Tiny model: 110M params with 129K vocabulary — most capacity goes to embeddings. Generations are often incoherent or factually wrong.
  • Undertrained: Only 5K pretrain + 3K SFT steps. Production models train for 100K+ steps on trillions of tokens.
  • Educational purpose: This model demonstrates the DeepSeek-V4 architecture at small scale. It is not suitable for any production use.
  • fp32 recommended: The Hyper-Connections architecture can produce values that overflow bf16 range at this scale. Use dtype=torch.float32.
  • Custom code: Requires trust_remote_code=True.
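The embedding figure behind the first bullet can be checked from the architecture details: a 129,280-token vocabulary times a hidden size of 320. The short calculation below is a sketch that assumes the card's "41M embeddings" counts a single vocab × hidden matrix. Whether input and output embeddings are tied is not stated in the card.

```python
VOCAB_SIZE = 129_280        # from the architecture details
HIDDEN_SIZE = 320           # from the architecture details
TOTAL_PARAMS = 110_000_000  # "~110M", approximate

# Assumption: one embedding matrix of shape (vocab, hidden).
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
embedding_share = embedding_params / TOTAL_PARAMS

print(f"{embedding_params:,} embedding parameters")   # 41,369,600
print(f"{embedding_share:.1%} of the ~110M total")    # 37.6%
```

Under that assumption the product matches the card's 41M figure and leaves roughly 69M parameters for the non-embedding layers.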

Hardware

Trained on 1× NVIDIA H100 80GB.

License

Apache-2.0
