Get in here: Community model build thread

Reddit r/LocalLLaMA Models

Summary

A thread proposing a method for creating a community AI model using crowdsourced compute via Branch-Train-Stitch to build a Mixture-of-Experts model from independently trained submodels, with discussion of hardware requirements, participant involvement, and technical challenges.

**You absolutely can create a community model through crowdsourced compute,** and there are at least 2 ways to do it. This thread is an attempt to refocus the earlier thread (which is devolving into pseudo-experts explaining why a pooled approach isn’t possible): https://old.reddit.com/r/LocalLLaMA/comments/1u77xo3/joing_all_gpus_to_train_a_community_model/

It is true that you can’t reasonably create a compute cluster by networking everyone’s rigs together, but that is a straw man of sorts, as you literally don’t have to. The main strategy involves making a MoE through a variant of ‘Branch-Train-Stitch’. In short, you distribute a ‘prototype’ dense model (with a specific shape and architecture, more on that later) to everyone who wants to participate, people train this prototype on their own hardware independently, and then they submit the narrow-domain trained submodels back to the organizers, who stitch the submodels into a large MoE.

There are a number of catches, and decisions that everyone participating would have to agree on.

## Target size of the prototype (or, who gets to participate)

This decision affects how many people could possibly contribute, and should probably be handled by a poll, as it is fundamentally an engagement question. I went through the old subforum hardware poll (https://old.reddit.com/r/LocalLLaMA/comments/1op0j6j/recent_vram_poll_results/). Given that we have literally thousands amongst us with more than 12GB VRAM, we could easily decide on a prototype size around 2B and end up with far more participants than we could reasonably include in the final model.

If we bump it a notch (to 32GB VRAM), we could distribute a 7B prototype, which brings other considerations: the final MoE is likely to end up in the 500B-1T size class, which would literally be unrunnable for the overwhelming majority of the forum; training the router and healing the model post-merge would be ridiculously expensive; and the time window for members to train and submit their submodels would need to be extended to ~8 weeks or so, vs maybe 2 for a 1-2B prototype.

So I guess that is the biggest first question: who is this for? Is this for the forum members to use themselves? Or is it intended to be a world-class frontier model?

## Some considerations, ideas to make this suck less, and misc

* The finished submodels (‘donors’ from here on) will need to have narrow, well-defined scopes. The organizers will likely need to vibe-code a registration portal, where anyone participating declares their intended sub-scope (e.g. graduate signal processing / books dated before the 1900s / etc), and the portal internally checks for overlap with other declared scopes (basically, enforces orthogonality). We will also need everyone to pinky-promise to stay on task, to avoid causing issues when training the router downstream.
* The process of training the prototypes into donors needs to be well structured. It may be worth vibing up a script that we distribute to ensure the structure, numerical data types, tokenizer, and chat template all match, while enforcing a minimum amount of training data (which also needs to be agreed on beforehand).
* Actually, we absolutely will need to provide an ‘insert your data here’ script. Many of the people in this sub may have experience with inference, but just reading through threads here shows that very few have any experience with training. Making this script autodetect users’ hardware and ‘just work’ is likely to be its own headache. Maybe limit it to just the CUDA and Vulkan backends? I have some donor scripts we could adapt for this that autodetect batch size (basically, fill an ever-larger batch -> run forward -> backward -> step until OOM, then back off), preallocate / strictly manage memory, etc.
* Once the organizers collect all of the submissions, the first and last few layers get stripped and the donors get assembled into the meta-model. Traditionally the donor attention weights get averaged, but there may be better techniques available, especially if we collectively decide to play with gated delta nets or similar. I can do a bit of research if nobody else feels like it.
* Next, the router layers get trained with the donors all frozen (and attention/embeds initially frozen too). Because we can’t be sure of the quality or provenance of the donor models, we should not strictly enforce uniform expert utilization, as it is trivial to modify the donor training script to intentionally submit an undertrained or malicious donor. Fortunately, there are at least 4 distinct solutions for this in the literature, two of which I have experience with and could speak to if someone more qualified doesn’t come along.
* The final healing -> RL training stage will require the entire model to be held in VRAM. While we do have a few members amongst us with full H200 rigs, if they don’t opt to participate, these will need to be rented. EDIT: actually the two members here that I know of with H200s both have their rigs on Vast; we might be able to pool a fund, pay our own guys, and keep the entire project in the LocalLLaMA family lol.

I’m out of time and sure I forgot things. May edit this later. In short, we can absolutely pull this off if we collectively decide we want to.

---

EDIT: Actually Branch-Train-Stitch is the only way to get this done - the other approach I was going to propose at the start of this essay was ‘upcycling’, like what NVIDIA outlines here: https://arxiv.org/abs/2410.07524 But we can’t do that, as they generate the experts online, whereas distributing the compute here requires the experts to be trained offline. :P
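
For anyone who wants to see what the batch-size autodetection mentioned in the training-script bullet might look like, here is a minimal sketch of the "grow the batch until OOM, then back off" idea. It is not the author's donor script. The names `find_max_batch_size`, `try_step`, `limit`, and `safety_factor` are made up for illustration, and `MemoryError` stands in for whatever out-of-memory exception the actual training framework raises.

```python
from typing import Callable


def find_max_batch_size(
    try_step: Callable[[int], None],
    start: int = 1,
    limit: int = 4096,
    safety_factor: float = 0.9,
) -> int:
    """Return a batch size that survived a full training step.

    `try_step(batch_size)` is a hypothetical callback that runs one
    forward -> backward -> optimizer step and raises MemoryError on OOM.
    """
    if start < 1 or limit < start:
        raise ValueError("need 1 <= start <= limit")

    last_ok = 0
    batch_size = start
    # Growth phase: keep doubling until a step fails or the cap is reached.
    while batch_size <= limit:
        try:
            try_step(batch_size)
        except MemoryError:
            break
        last_ok = batch_size
        batch_size *= 2

    if last_ok == 0:
        raise RuntimeError(f"batch size {start} does not fit in memory")

    # Back-off phase: leave some headroom below the largest size that worked.
    return max(1, int(last_ok * safety_factor))


def fake_step(batch_size: int) -> None:
    # Stand-in for a real training step: pretend anything above 48 samples OOMs.
    if batch_size > 48:
        raise MemoryError("simulated OOM")


print(find_max_batch_size(fake_step))  # prints 28
```

Doubling keeps the number of probe steps small, and the safety margin is there because memory use during real training is rarely perfectly flat from step to step. A real script would also need to free cached memory after a failed probe, which this sketch leaves out.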