Trained a 75M parameter LLM called KeyLM from scratch on 18B tokens, achieving competitive instruction-following scores against larger models while using fewer parameters and less data.
I trained a small language model from scratch called KeyLM. It is a 75M-parameter, decoder-only model, released as a pretrained base, an instruction-tuned version, and a GGUF.

On IFEval (instruction following), the 75M instruct model scores slightly higher than the original SmolLM-135M-Instruct with about half the parameters and a fraction of the training data. (SmolLM was pretrained on 600B tokens and SmolLM2 on 2T tokens; KeyLM was pretrained on only 18B tokens.)

| Model | Params | IFEval |
|---|---|---|
| KeyLM-75M-Instruct | 75M | 17.85 |
| SmolLM-135M-Instruct | 135M | 17.15 |
| SmolLM2-135M-Instruct | 135M | 26.98 |

The rest of the benchmarks are about what you would expect from a model of this size.

| Bench | Score |
|---|---|
| MMLU (acc) | 24.0% |
| ARC-Easy (acc) | 40.3% |
| ARC-Challenge (acc_norm) | 22.6% |
| HellaSwag (acc_norm) | 31.0% |
| PIQA (acc) | 61.3% |
| WinoGrande (acc) | 48.3% |

The architecture is standard for a model of this size: GQA (8 query / 2 KV heads), RoPE, SwiGLU, per-head QK-Norm, 24 layers, hidden size 512, 2048 context, 12,020-token ByteLevel BPE vocab, bf16. It was trained on \~18B tokens of public data (FineWeb-Edu, Wikipedia, Reddit, StackExchange, WildChat, UltraChat, LMSYS, OASST2, Cosmopedia), then SFT on smol-smoltalk and smoltalk2.

Here are some sample outputs (KeyLM-75M-Instruct, greedy decoding):

**Explain photosynthesis.**

> Photosynthesis is the process plants use to turn sunlight, water, and carbon dioxide into energy. It also releases oxygen, which people and animals need to breathe.

**What is the difference between weather and climate?**

> Weather is what the atmosphere is doing over a short time, such as today's rain, wind, or temperature. Climate is the long-term pattern of weather in a region.

**What is gravity?**

> Gravity is the force that pulls objects toward each other. It keeps the Moon moving around Earth and Earth moving around the Sun.

**Explain the water cycle.**

> The sun warms water in oceans, lakes, rivers, and soil, causing some of it to evaporate into the air. The water vapor cools and forms clouds through condensation. When clouds become heavy, water falls as rain, snow, sleet, or hail.

**What is inflation in economics?**

> Inflation means prices are rising over time. If inflation is high, the same amount of money buys less than it did before.

**Explain the concept of religion.**

> Religion is a belief system that affects how people live, work, and interact with the world around them. It can shape culture, politics, and values.

**Write a short thank-you note to a teacher.**

> Thank you for your patience and support this year. I appreciate the time you put into helping me learn.

The model is English only and near random on knowledge benchmarks, so it will hallucinate basically all the time.

All three versions (Base, Instruct, GGUF) are published on Hugging Face so you can try it out for yourself:

* Base: [https://huggingface.co/Eclipse-Senpai/KeyLM-75M](https://huggingface.co/Eclipse-Senpai/KeyLM-75M)
* Instruct: [https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct](https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct)
* GGUF: [https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct-GGUF](https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct-GGUF)
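
For anyone who wants to sanity-check the numbers in the post, here is a rough back-of-envelope sketch in Python. It estimates a parameter budget from the stated architecture (8 query / 2 KV heads, 24 layers, hidden 512, 12,020 vocab) and compares tokens per parameter across the three models. The post does not give the MLP width or say whether the embeddings are tied, so `ffn_dim` and `tied_embeddings` are placeholders, and the head dimension is assumed to be hidden size divided by query heads. Norm weights and biases are ignored, so treat the result as an approximation and not the model's actual configuration.

```python
# Back-of-envelope sizing for a decoder-only model with GQA and SwiGLU.
# Numbers marked "from the post" are stated there; the rest are assumptions.


def attention_params(hidden: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Weights in one GQA attention block (Q, K, V, and output projections)."""
    head_dim = hidden // n_q_heads           # assumption: head_dim = hidden / query heads
    q = hidden * n_q_heads * head_dim        # query projection
    kv = 2 * hidden * n_kv_heads * head_dim  # key and value projections (fewer heads)
    o = n_q_heads * head_dim * hidden        # output projection
    return q + kv + o


def swiglu_params(hidden: int, ffn_dim: int) -> int:
    """Weights in one SwiGLU MLP: gate, up, and down projections."""
    return 3 * hidden * ffn_dim


def total_params(vocab: int, hidden: int, layers: int, n_q_heads: int,
                 n_kv_heads: int, ffn_dim: int, tied_embeddings: bool = True) -> int:
    """Approximate total, ignoring norm weights and biases."""
    embed = vocab * hidden
    lm_head = 0 if tied_embeddings else vocab * hidden
    per_layer = attention_params(hidden, n_q_heads, n_kv_heads) + swiglu_params(hidden, ffn_dim)
    return embed + lm_head + layers * per_layer


def tokens_per_param(tokens: int, params: int) -> float:
    """Pretraining tokens seen per model parameter."""
    return tokens / params


if __name__ == "__main__":
    # From the post: vocab 12,020, hidden 512, 24 layers, 8 query / 2 KV heads.
    # ffn_dim values below are hypothetical candidates, not taken from the post.
    for ffn_dim in (1024, 1536, 2048):
        n = total_params(12_020, 512, 24, 8, 2, ffn_dim)
        print(f"ffn_dim={ffn_dim}: ~{n / 1e6:.1f}M params (tied embeddings assumed)")

    # Token and parameter counts as quoted in the post.
    for name, tokens, params in [
        ("KeyLM-75M", 18_000_000_000, 75_000_000),
        ("SmolLM-135M", 600_000_000_000, 135_000_000),
        ("SmolLM2-135M", 2_000_000_000_000, 135_000_000),
    ]:
        print(f"{name}: {tokens_per_param(tokens, params):,.0f} tokens per parameter")
```

Sweeping `ffn_dim` shows which MLP widths would land near the stated 75M total under these assumptions, and the second loop puts the "fraction of the training data" claim into per-parameter terms.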
Nous Research releases Token Superposition Training (TST), a method that speeds up LLM pre-training by up to 2.5x across models from 270M to 10B parameters, reducing wall-clock time without altering architecture or data.
The author demonstrates that small vertical language models (6B-15B) can outperform top LLMs on niche benchmarks through cost-effective fine-tuning using open-source models and Codex orchestration, achieving results with a $300 dataset.
Built a repository to train a tiny language model (25M parameters) from scratch on 8GB VRAM, with support for MTP but noting limitations of mHC and BitNet.
A developer shares progress on training a 7B parameter open source LLM from scratch using a DeepSeek architecture optimized for low VRAM, with the goal of democratizing AI development and eventually surpassing large proprietary models.