Want to build a custom model

Reddit r/LocalLLaMA 06/14/26, 05:32 AM News

small-model custom-model training dataset autocomplete hardware vram

Summary

A user discusses building a small autocomplete model (25M parameters) as a learning project, mentions hardware constraints (32GB VRAM), data requirements (~100M tokens), and seeks advice on datasets and data formatting for autocomplete-style training.

I've been toying with the idea of building my own model. At this point, the architecture and training pipeline seem fairly well established, and I'm reasonably confident that I could put together a small model from scratch.

Hardware is obviously the limiting factor. I've only got 32 GB of VRAM, so this clearly isn't going to be some flagship foundation model. It may not even end up particularly useful for general tasks, but it sounds like a fun project and a good learning experience.

My current thought is to avoid full chat responses entirely and instead build a small autocomplete model, probably somewhere around 25M parameters. The goal would simply be: given context, predict the next token, sentence, or paragraph.

The biggest challenge seems to be data. My understanding is that a rough rule of thumb is to train on several times the parameter count in tokens, so even a 25M-parameter model would ideally want on the order of 100M+ tokens for experimentation.

For a first run, I was considering something more specialized or entertaining. One idea was a comedy model trained on cleaned transcripts from YouTube to learn setup-to-punchline continuation patterns. Another, more boring possibility would be a technical model focused on Python, Linux, or cybersecurity.

For those of you who've trained small models before: where are you finding high-quality datasets beyond the obvious choices like Wikipedia, Common Crawl derivatives, or synthetic data generated by frontier models? I'm also curious how people are formatting data for autocomplete-style training versus chat or Q&A datasets.
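For anyone wanting to make the token-budget arithmetic and the autocomplete formatting question concrete, here is a minimal sketch, not a definitive recipe. It assumes the text has already been tokenized into lists of integer IDs; the names `token_budget`, `pack_documents`, and `to_inputs_targets`, plus the `eos_id` and `block_size` values, are hypothetical and chosen only for illustration. The tokens-per-parameter ratio is left as an input: 4 reproduces the ~100M figure in the post, while the commonly cited Chinchilla-style heuristic of roughly 20 tokens per parameter would suggest closer to 500M. The formatting shown is plain document packing: raw documents are joined with an end-of-text token and cut into fixed-length blocks, with no chat roles or Q&A templates.

```python
from typing import Iterable, List, Tuple


def token_budget(n_params: int, tokens_per_param: float) -> int:
    """Rough training-token target for a given parameter count and ratio."""
    return int(n_params * tokens_per_param)


def pack_documents(
    docs: Iterable[List[int]], eos_id: int, block_size: int
) -> List[List[int]]:
    """Join tokenized documents into one stream and cut it into blocks.

    Each document is followed by eos_id (an assumed end-of-text token).
    Every block holds block_size + 1 tokens so it can be split into
    inputs and next-token targets; consecutive blocks overlap by one
    token. A trailing remainder shorter than a full block is dropped.
    """
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)

    n_blocks = (len(stream) - 1) // block_size
    return [
        stream[i * block_size : i * block_size + block_size + 1]
        for i in range(n_blocks)
    ]


def to_inputs_targets(block: List[int]) -> Tuple[List[int], List[int]]:
    """Shift by one position: the model sees block[:-1], predicts block[1:]."""
    return block[:-1], block[1:]


# Budget arithmetic for the 25M-parameter model discussed in the post.
budget_low = token_budget(25_000_000, 4)    # "several times" the parameter count
budget_high = token_budget(25_000_000, 20)  # Chinchilla-style heuristic

# Toy example: two tiny "documents" of token IDs, with 0 as the assumed EOS ID.
toy_blocks = pack_documents([[1, 2, 3], [4, 5]], eos_id=0, block_size=3)
toy_inputs, toy_targets = to_inputs_targets(toy_blocks[0])
```

The contrast with chat or Q&A data is that nothing here marks who is speaking or where an answer starts; the only structure is the end-of-text separator, so the model learns plain continuation. For the comedy idea, that would mean keeping each transcript segment as one contiguous document so setup and punchline land in the same context window.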


Similar Articles

Are small local models for automation a thing?

What if i really wanna train an AI from scratch?

Me train LLM on 8GB from Scratch. Me happy

@paulabartabajo_: Advice for AI engineers A small Visual Language Model fine-tuned on your custom dataset is as accurate as GPT-5... ... …

@harshbhatt7585: https://x.com/harshbhatt7585/status/2063593933314113587
