I trained a 75M parameter LLM from scratch on 18B tokens and it beats a model almost double its size

Reddit r/LocalLLaMA 06/02/26, 05:41 PM Models

Summary

The author trained KeyLM, a 75M parameter LLM, from scratch on 18B tokens. Its instruct version scores slightly above SmolLM-135M-Instruct on IFEval despite having about half the parameters and far less pretraining data.

I trained a small language model from scratch called KeyLM. It is 75M params, decoder-only, and there is a pretrained base, an instruction-tuned version, and a GGUF.

On IFEval (instruction following) the 75M instruct model scores slightly higher than the original SmolLM-135M-Instruct at about half the parameters and a fraction of the training data. (SmolLM was pretrained on 600B tokens and SmolLM2 on 2T tokens, while KeyLM was pretrained on only 18B tokens.)

| Model | Params | IFEval |
|---|---|---|
| KeyLM-75M-Instruct | 75M | 17.85 |
| SmolLM-135M-Instruct | 135M | 17.15 |
| SmolLM2-135M-Instruct | 135M | 26.98 |

The rest of the benchmarks are about what you would expect from a model at this size.

| Bench | Score |
|---|---|
| MMLU (acc) | 24.0% |
| ARC-Easy (acc) | 40.3% |
| ARC-Challenge (acc_norm) | 22.6% |
| HellaSwag (acc_norm) | 31.0% |
| PIQA (acc) | 61.3% |
| WinoGrande (acc) | 48.3% |

The architecture is standard for a model of this size: GQA (8 query / 2 KV heads), RoPE, SwiGLU, per-head QK-Norm, 24 layers, hidden 512, 2048 context, 12,020 ByteLevel BPE vocab, bf16.

Trained on ~18B tokens of public data (FineWeb-Edu, Wikipedia, Reddit, StackExchange, WildChat, UltraChat, LMSYS, OASST2, Cosmopedia), then SFT on smol-smoltalk and smoltalk2.

Here are some sample outputs (KeyLM-75M-Instruct, greedy decoding):

**Explain photosynthesis.**

> Photosynthesis is the process plants use to turn sunlight, water, and carbon dioxide into energy. It also releases oxygen, which people and animals need to breathe.

**What is the difference between weather and climate?**

> Weather is what the atmosphere is doing over a short time, such as today's rain, wind, or temperature. Climate is the long-term pattern of weather in a region.

**What is gravity?**

> Gravity is the force that pulls objects toward each other. It keeps the Moon moving around Earth and Earth moving around the Sun.

**Explain the water cycle.**

> The sun warms water in oceans, lakes, rivers, and soil, causing some of it to evaporate into the air. The water vapor cools and forms clouds through condensation. When clouds become heavy, water falls as rain, snow, sleet, or hail.

**What is inflation in economics?**

> Inflation means prices are rising over time. If inflation is high, the same amount of money buys less than it did before.

**Explain the concept of religion.**

> Religion is a belief system that affects how people live, work, and interact with the world around them. It can shape culture, politics, and values.

**Write a short thank-you note to a teacher.**

> Thank you for your patience and support this year. I appreciate the time you put into helping me learn.

The model is English only and near random on knowledge benchmarks, so it will hallucinate basically all the time.

All three versions (Base, Instruct, GGUF) are published on Hugging Face so you can try it out for yourself:

* Base: [https://huggingface.co/Eclipse-Senpai/KeyLM-75M](https://huggingface.co/Eclipse-Senpai/KeyLM-75M)
* Instruct: [https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct](https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct)
* GGUF: [https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct-GGUF](https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct-GGUF)
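
For readers wondering what the GQA setup (8 query / 2 KV heads) means in practice, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted in the post (hidden 512, 24 layers, 2048 context, bf16) plus one assumption that the post does not state: that the head dimension is hidden / query heads = 64. The names `AttnConfig`, `attn_params_per_layer`, and `kv_cache_bytes` are made up for this illustration and are not from the KeyLM code. Biases and QK-Norm weights are ignored, so treat the results as estimates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnConfig:
    """Attention-related numbers quoted in the post (hypothetical container)."""
    hidden: int = 512          # hidden size, from the post
    layers: int = 24           # number of layers, from the post
    q_heads: int = 8           # query heads, from the post
    kv_heads: int = 2          # key/value heads, from the post
    context: int = 2048        # context length, from the post
    bytes_per_value: int = 2   # bf16 uses 2 bytes per value

    @property
    def head_dim(self) -> int:
        # ASSUMPTION: head_dim = hidden / q_heads. The post does not state it.
        return self.hidden // self.q_heads


def attn_params_per_layer(cfg: AttnConfig) -> int:
    """Weights in the Q, K, V and output projections of one layer (no biases)."""
    q = cfg.hidden * cfg.q_heads * cfg.head_dim
    kv = 2 * cfg.hidden * cfg.kv_heads * cfg.head_dim  # K and V projections
    o = cfg.q_heads * cfg.head_dim * cfg.hidden
    return q + kv + o


def kv_cache_bytes(cfg: AttnConfig) -> int:
    """Size of the K and V caches for one sequence at full context."""
    per_token = 2 * cfg.layers * cfg.kv_heads * cfg.head_dim  # K and V
    return per_token * cfg.context * cfg.bytes_per_value


if __name__ == "__main__":
    gqa = AttnConfig()             # 8 query / 2 KV heads, as in the post
    mha = AttnConfig(kv_heads=8)   # hypothetical full multi-head baseline
    for name, cfg in (("GQA 8q/2kv", gqa), ("MHA 8q/8kv", mha)):
        mib = kv_cache_bytes(cfg) / 2**20
        print(f"{name}: {attn_params_per_layer(cfg):,} attn params/layer, "
              f"{mib:.1f} MiB KV cache at {cfg.context} tokens")
```

Under these assumptions, sharing 2 KV heads across 8 query heads cuts the KV cache to a quarter of what a full multi-head layout with the same head size would need, and trims the attention projection weights per layer as well.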

