Want to build a custom model

Reddit r/LocalLLaMA News

Summary

A user discusses building a small autocomplete model (25M parameters) as a learning project, mentions hardware constraints (32GB VRAM), data requirements (~100M tokens), and seeks advice on datasets and data formatting for autocomplete-style training.

I've been toying with the idea of building my own model. At this point, the architecture and training pipeline seem fairly well established, and I'm reasonably confident that I could put together a small model from scratch.

Hardware is obviously the limiting factor. I've only got 32 GB of VRAM, so this clearly isn't going to be some flagship foundation model. It may not even end up particularly useful for general tasks, but it sounds like a fun project and a good learning experience.

My current thought is to avoid full chat responses entirely and instead build a small autocomplete model, probably somewhere around 25M parameters. The goal would simply be: given context, predict the next token, sentence, or paragraph.

The biggest challenge seems to be data. My understanding is that a rough rule of thumb is to train on several times the parameter count in tokens, so even a 25M-parameter model would ideally want on the order of 100M+ tokens for experimentation.

For a first run, I was considering something more specialized or entertaining. One idea was a comedy model trained on cleaned transcripts from YouTube to learn setup-to-punchline continuation patterns. Another, more boring, possibility would be a technical model focused on Python, Linux, or cybersecurity.

For those of you who've trained small models before: where are you finding high-quality datasets beyond the obvious choices like Wikipedia, Common Crawl derivatives, or synthetic data generated by frontier models? I'm also curious how people are formatting data for autocomplete-style training versus chat or Q&A datasets.
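To make the token-budget arithmetic and the autocomplete formatting question concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the function names `token_budget` and `pack_for_autocomplete` are hypothetical, the default ratio of 4 tokens per parameter is only chosen to reproduce the 25M to 100M figure above (a commonly cited compute-optimal ratio is closer to 20 tokens per parameter), and byte-level tokenization with an end-of-document id of 256 stands in for a real tokenizer. The packing shown, concatenating documents with a separator and slicing into fixed-length blocks whose targets are the inputs shifted by one, is one common way to prepare plain next-token data; it is not the only one.

```python
from typing import Iterable, Iterator, List, Tuple


def token_budget(n_params: int, tokens_per_param: float = 4.0) -> int:
    """Rough training-token target for a model with n_params parameters.

    tokens_per_param is an assumed ratio, not a fixed rule.
    """
    return int(n_params * tokens_per_param)


def pack_for_autocomplete(
    docs: Iterable[List[int]], block_size: int, eos_id: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Concatenate tokenized documents and yield (inputs, targets) blocks.

    Each document is followed by eos_id. Targets are the inputs shifted
    left by one token, so position i is trained to predict token i + 1.
    A trailing remainder shorter than block_size + 1 tokens is dropped.
    """
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)
        while len(buffer) >= block_size + 1:
            chunk = buffer[: block_size + 1]
            yield chunk[:-1], chunk[1:]
            # Keep the last token of the chunk: it has only been used as
            # a target so far and becomes the first input of the next block.
            buffer = buffer[block_size:]


if __name__ == "__main__":
    print(token_budget(25_000_000))        # post's ratio (4 tokens/param)
    print(token_budget(25_000_000, 20.0))  # heavier, Chinchilla-style ratio

    # Hypothetical byte-level "tokenizer": ids 0..255, with 256 as separator.
    texts = ["setup line. punchline.", "another short joke."]
    docs = [list(t.encode("utf-8")) for t in texts]
    for inputs, targets in pack_for_autocomplete(docs, block_size=8, eos_id=256):
        print(inputs, "->", targets)
```

Unlike chat or Q&A data, there are no role markers or prompt/response templates here: the raw text itself is the training signal, and the only structural token is the document separator.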
Similar Articles

Are small local models for automation a thing?

Reddit r/LocalLLaMA

A Reddit user discusses the potential of small local language models (1B-4B parameters) for automation and scripting, and asks for resources focused on this use case.

What if i really wanna train an AI from scratch?

Reddit r/artificial

A personal reflection on the challenges and allure of training an AI model from scratch, highlighting the difficulties with data, hardware, and scaling, while noting that surprisingly good small models can be trained on modest hardware.

Me train LLM on 8GB from Scratch. Me happy

Reddit r/LocalLLaMA

Built a repository to train a tiny language model (25M parameters) from scratch on 8GB VRAM, with support for MTP but noting limitations of mHC and BitNet.

@harshbhatt7585: https://x.com/harshbhatt7585/status/2063593933314113587

X AI KOLs Timeline

The author shares learnings from training a 160M parameter LLM from scratch, experimenting with architectures like multi-token prediction and hierarchical reasoning models. They emphasize the importance of fast iteration, simplifying ideas, and understanding why architectures work.