Cached at:
05/13/26, 02:09 AM
# Fine-Tuning TranslateGemma-4B for Welsh on H200: Data Strategy, LoRA Training, and GGUF Deployment
Source: [https://metalglot.com/blog/welsh-translategemma-finetuning-guide/](https://metalglot.com/blog/welsh-translategemma-finetuning-guide/)
Welsh is exactly the kind of language that exposes the limits of generic translation tuning\. Legal phrasing, Senedd debates, government terminology, and dictionary definitions all behave differently, so “just add more bilingual rows” is not a serious strategy\. If the goal is robust English\-Welsh and Welsh\-English translation that still feels like a capable instruction\-following model, the training recipe matters at least as much as the hardware\.
This guide is the practical version of that experiment\. It explains why TranslateGemma\-4B was the right base model, why an NVIDIA H200 was the most pragmatic single\-GPU choice, how the dataset was deliberately rebalanced, why some corpora were downsampled or excluded, what the actual script outputs looked like, and how the final model was merged and converted for local inference\.
If you only want the headline result, it is this: a 5% pilot run on one H200 completed in about 40 minutes, the dataset was kept near a deliberate 70:30 translation\-to\-instruction mix, and the whole point was to improve Welsh translation without turning the model into a narrow “always translate” system\. For more background on why specialized translation models can outperform larger general baselines, see[Inside TranslateGemma](https://metalglot.com/blog/gemma-translate/)\.
One note before the raw logs: this post combines several working drafts and real artifacts from the same project\. I am keeping the commands, console outputs, and JSON blocks intact on purpose, even where they reflect adjacent iterations of the recipe rather than one frozen snapshot\. That is more useful than a polished reconstruction because it shows what the pipeline actually looked like in practice\.
If you are evaluating a similar fine\-tune for another underrepresented language, these are the real decisions this post should help you make:
- Is TranslateGemma the right base model for a language that was not part of the official SFT set?
- Is a single H200 enough for a credible pilot run?
- Does your dataset need more volume, or better balance?
## Why Welsh Was Worth Fine\-Tuning at All
Welsh is not short on bilingual text, but it is unevenly distributed across domains\. Legal text, parliamentary proceedings, government phrasing, and terminology databases all have different value\. That makes Welsh a good example of why low\-resource or mid\-resource translation work is not solved by row count alone\.
Standard TranslateGemma\-4B provides high\-quality fine\-tuned translation for 55 languages, but Welsh was not part of the official fine\-tuned instruction set\. It sits in the broader group of “Tier 2” languages that exist in base Gemma\-3 pretraining but were not given the same direct translation specialization\. For more on that production\-readiness framing, see[TranslateGemma Language Quality Tiers: When to Translate Directly vs\. Pivot Through English](https://metalglot.com/blog/gemma-translate-insights/)\.
That gap is exactly why Welsh is interesting here\. Base Gemma\-3 already shows some zero\-shot Welsh ability, but that is not the same thing as reliable, production\-oriented translation behavior\. This project is a proof\-of\-concept for closing some of that gap with a transparent, reproducible fine\-tuning pipeline and an open repository:[finetuned\-gemmatranslate\-cy](https://github.com/grctest/finetuned-gemmatranslate-cy)\. The scripts in that repository are MIT licensed, so you are free to follow and adapt the workflow, but you should still review the licenses attached to the base model, every dataset you use, and any acceleration packages you depend on\.
## Why TranslateGemma Was the Right Base Model
TranslateGemma\-4B was a good fit for Welsh for a simple reason: the model and the repository already agree on task framing\. Translation rows are rendered through TranslateGemma’s chat template with explicit source and target language codes, while instruction rows are rendered as standard user\-assistant turns\. That matters because the fine\-tune is not trying to teach one generic prompt format to do everything badly\. It is teaching the model in the structure it already expects\.
The training loop also uses completion\-only loss\. In practice, that keeps the optimization pressure on the model’s answer instead of penalizing it for prompt tokens\. Combined with the mixed translation\-and\-instruction recipe, the result is a better chance of improving Welsh translation while preserving the behavior that makes the model usable outside a single narrow prompt shape\.
This distinction matters more than it sounds\. A translation\-only fine\-tune can absolutely improve bilingual mapping, but it can also push the model toward reflexive translation behavior and weaken general prompt following\. For this project, the target was not “a Welsh translation engine and nothing else\.” The target was a translation\-capable LLM that remained useful as an assistant\.
## Why H200 Was the Practical Single\-GPU Choice
The practical training target in this repository is the`H200F`profile\. It is the most aggressive single\-GPU profile in the codebase for LoRA training, and it exists to answer a very specific question: how far can you push a single H200 before you need to redesign the entire run?
The key settings were:
- `per\_device\_train\_batch\_size=12`
- `gradient\_accumulation\_steps=8`
- effective batch size`= 96`
- `max\_seq\_length=2048`
- `bf16=True`
- `gradient\_checkpointing=True`
- `packing=True`
- `optimizer=adamw\_torch\_fused`
- `dataset\_fraction=0\.05`for the logged pilot run shown later in this post
That configuration is a good fit for NVIDIA H200 because it lets the run keep a reasonably large context window, lean on sequence packing, and use the fastest stable attention path available for Hopper\-class hardware\. In the training script, backend selection is explicit: if flash attention is enabled and the machine looks like Hopper, the code first attempts Flash Attention 3 otherwise it falls back to SDPA\.
In other words, the H200 decision was not just about having more VRAM in the abstract\. It was about getting a stable, fast path for the exact software stack used in this repo\. The`H200F`profile is tuned around the H200’s 141 GB of HBM3e memory, which gives enough headroom for larger batch sizes and packed 2048\-token sequences without instantly forcing compromises\. Keeping`2048`here was also deliberate because it stays aligned with the sequence\-length regime Google describes in the[TranslateGemma technical report](https://arxiv.org/abs/2601.09012), rather than inventing a very different fine\-tuning setup for the Welsh run\.
That trade\-off showed up in the pilot run\. Using this profile, a 5% slice of the training set completed in about 40 minutes on a single H200\.
## Scaling the Run: Time, Cost, and Multi\-GPU Trade\-Offs
The 40\-minute pilot run has proven the recipe is viable for a full 100% training run\. A complete pass across the 1\.35M\-row\-scale dataset is a far more substantial amount of data to process, a difference of days vs an hour\. We ran just 5% to prove it all works, and in the future we \(or you\) could run the 100% run with more compute resources\.
Based on the logged benchmarks:
- **Single GPU baseline**: a full run on 1x H200 would take approximately**33 hours**\.
- **Multi\-GPU path**: an**8x H200**cluster using`accelerate`and`deepspeed`could compress that to about**4\.5 hours**\.
That is the real decision point\. If you are validating the data recipe, one H200 is enough\. If you are trying to get to a same\-day full run, multi\-GPU starts to make sense quickly\. The framework is already shaped for that next step, and the same logic should scale further to 12B or 27B TranslateGemma variants once the infrastructure is worth the cost\.
Larger models may require different profiles, and if you run a multi\-gpu setup then some refactoring may be warranted to take best advantage of accelerate and deepspeed\.
We use the hopper range of GPUs for their extensive VRAM, their flash attention v3 support, and bfloat16 support\. We attempted the script with flash attention v4 on blackwell GPUs, but found that the beta release is not yet compatible with our gemma 3 finetuning tech stack; in the future when v4 is fully developed we could evaluate the speed improvements available from upgrading from H200s to B200s for finetuning tasks\.
### What the Run Actually Cost
Financial transparency matters here because fine\-tuning projects often sound more expensive than they really are\. For this run, the practical reference point was spot and on\-demand pricing on[Verda](https://verda.com/)\(formerly DataCrunch\):
Instance TypePricing \(Approx\.\)Full 4B Run \(Estimated\)**1x H200 \(Spot\)**~$1\.20 / hr~$39\.60**1x H200 \(On\-Demand\)**~$3\.40 / hr~$112\.20**8x H200 \(On\-Demand\)**~$27\.20 / hr**~$122\.40**\(at 4\.5h\)That is one of the more interesting outcomes of the project\. At roughly**$122\.50**for a full 8\-GPU fine\-tune, the barrier to a serious Welsh translation experiment is lower than many people expect\. The hard part is less the rental bill than the discipline required to build a dataset that deserves the hardware\.
## Environment Setup and Python Stack
The minimal environment from the repo is still short\. That is useful, because it keeps the setup honest\.
```
python -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch2110
hf auth login
```
For the logged H200 run, the stack was built around CUDA 12\.8\. That matters because the flash\-attention wheel and the GPU path were chosen around that environment rather than treated as a vague “latest CUDA” setup\.
Before data preparation starts, the preparation stage in[01\_prepare\_data\.py](https://github.com/grctest/finetuned-gemmatranslate-cy/blob/main/01_prepare_data.py)validates that`\./local\_model`contains a complete local TranslateGemma snapshot\. That includes tokenizer files, processor configuration, and model weights so later stages can load`AutoProcessor`and`AutoModelForImageTextToText`without failing halfway through the pipeline\.
### What the Core Python Packages Actually Did
These libraries were not incidental dependencies\. Each one maps to a concrete part of the workflow:
- **transformers**: loads TranslateGemma, the tokenizer, and the standard model APIs used during fine\-tuning and inference\.
- **torch**: handles tensor math, GPU execution, memory allocation, and the underlying training runtime\.
- **datasets**: loads, filters, samples, and deduplicates the Welsh\-English corpora efficiently before training\.
- **peft**: implements LoRA so the run trains lightweight adapters instead of updating every model parameter\.
- **trl**: provides`SFTTrainer`, sequence packing, and the fine\-tuning loop used in`02\_finetune\.py`\.
- **huggingface\_hub**: manages authentication and model asset download to the local snapshot\.
- **sentencepiece**: is required for Gemma tokenization and the subword segmentation the model expects\.
- **pillow**: is still needed because TranslateGemma inherits multimodal processor expectations from the PaliGemma lineage, even in a text\-focused workflow\.
- **matplotlib**: supports token\-length visualization in`01b\_analyze\_token\_lengths\.py`, which helps justify the chosen`max\_seq\_length`\.
That package list also explains why this workflow is more than a thin wrapper around one training script\. It is a full data\-prep, fine\-tune, analysis, merge, and inference pipeline\.
### Reusing This Workflow for Another Language
If your goal is not just Welsh, but “how do I fine\-tune TranslateGemma for another language,” the encouraging answer is that the overall method transfers well\. The caution is that the data adapters usually do not\.
In practice,[01\_prepare\_data\.py](https://github.com/grctest/finetuned-gemmatranslate-cy/blob/main/01_prepare_data.py)is the first file most people will need to modify\. Different bilingual corpora rarely share the same field names, split structures, metadata columns, or language\-code conventions\. If you swap Welsh datasets for another language pair, you should expect to edit the dataset\-loading logic, row normalization rules, and any synthetic\-example generation that assumes specific Welsh resources such as TermCymru\.
The reusable part of the recipe looks like this:
- Normalize every source into the same flat contract of task, source text, target text, and language codes
- Keep translation supervision and instruction\-following data as separate design choices, not one blended pile
- Rerun[01b\_analyze\_token\_lengths\.py](https://github.com/grctest/finetuned-gemmatranslate-cy/blob/main/01b_analyze_token_lengths.py)after every major recipe change instead of assuming the Welsh token profile will transfer
- Validate that the language codes and prompt format still match the TranslateGemma template you want to train against
That is the real lesson if you want to extend this article into Breton, Cornish, Gaelic, or anything else: the fine\-tuning mechanics are fairly reusable, but the data\-shaping layer is where most of the engineering work lives\.
## Dataset Strategy: Balance Beat Raw Volume
One of the easiest ways to ruin a translation fine\-tune is to keep adding corpora and assume the model will sort it out\. That was not the goal here\. The target was a dataset broad enough to cover legal, parliamentary, administrative, and terminology\-heavy Welsh, while still preserving enough instruction data to stop the model from collapsing into narrow translation\-only behavior\.
The main target was a roughly`70:30`translation\-to\-instruction mixture\. That ratio mattered more than maximizing rows for its own sake\. It forced every dataset choice to answer the same question: does this improve the model we actually want, or does it just make the dataset bigger?
These were the main translation and terminology sources:
- `techiaith/legislation\-gov\-uk\_en\-cy`: High\-value legal and statutory English\-Welsh text, useful for stable terminology and formal structure\.
- `techiaith/cofnodycynulliad\_en\-cy`: Senedd proceedings, which add public\-sector and parliamentary language common in real Welsh institutional text\.
- `techiaith/bydtermcymru\-tm\-en\-cy`: Terminology memory data that helps cover domain\-specific vocabulary generic corpora often flatten\.
- `AndreasThinks/welsh\-government\-pairs`: Compact but useful aligned public\-sector translation pairs\.
- `TermCymru/TermCymru`: Authoritative terminology data, used both for direct translation supervision and for synthetic instruction\-definition pairs\.
This is also where restraint mattered\. The Cardiff University translation memory and OPUS were both treated conservatively:
- `techiaith/cardiff\-university\-tm\-en\-cy`was capped at`usage=0\.1`even though it is large\.
- `Helsinki\-NLP/opus\-100`was left at`usage=0\.0`for this run\.
That was deliberate\. The issue was not that those datasets were bad\. The issue was that translation volume was already strong enough to swamp the instruction side\. If Cardiff jumped from`0\.1`to`1\.0`, its constructed bilingual rows would rise from`290,600`to roughly`2,906,018`\. If OPUS also moved to`1\.0`, that would add another`289,521`rows\. Before deduplication, that would mean roughly`2,904,939`additional translation rows\.
Using the current run summary as a baseline, translation rows would rise from about`1,120,776`to about`4,025,715`\. To keep the same $70:30$ mixture, instruction rows would then need to reach about $4,025,715 \\times 30/70 \\approx 1,725,306$\. This run currently uses`412,135`instruction rows before rebalancing, so that would imply roughly`1,313,171`extra instruction rows or a far more aggressive downsampling pass\. For this experiment, cleaner and more balanced mattered more than simply larger\.
### Why the Non\-EN<\-\>CY Muri Instruction Data Still Helped
At first glance,`akoksal/muri\-it\-language\-split`looks out of place in an English\-Welsh translation project because much of it is not English\-to\-Welsh or Welsh\-to\-English at all\. That is exactly why it is useful here\. It is not being used as bilingual translation supervision\.
In`01\_prepare\_data\.py`, instruction datasets are mapped into monolingual instruction\-response pairs\. If a Muri slice is tagged with`language: "en"`, it becomes`en \-\> en`instruction data\. If it is tagged with`language: "cy"`, it becomes`cy \-\> cy`\. Danish stays Danish, Italian stays Italian, and so on\. These rows preserve instruction\-following behavior, turn structure, and answer formatting while the model is also being pushed hard toward translation\.
This is one of the most important ideas in the whole recipe\. Not every row needs to be direct translation supervision\. Some rows need to keep the model acting like a useful assistant\. Without that, a large translation fine\-tune can over\-specialize into “always translate” behavior, weaken general prompt following, and reduce usefulness outside narrow translation prompts\.
Welsh\-native instruction data still matters, which is why`muri\_cym`and`locailabs/nemotron\-chat\-welsh`are present\. But the broader multilingual Muri slices help preserve instruction competence as a behavior, not as a bilingual signal\.
### Why TermCymru Did Double Duty
`TermCymru/TermCymru`is one of the most valuable datasets in the recipe because it does more than contribute a glossary\.
The pipeline uses each row in two ways:
- Direct translation pairs are emitted in both directions when the English and Welsh terms are present
- Synthetic instruction\-response pairs are generated from the available English and Welsh definitions
That means TermCymru is not just teaching the model that`X`maps to`Y`\. It is also teaching the model how to answer definitional prompts, explain terminology, and behave sensibly when a user asks for a translation with context rather than just a literal substitution\. Where context fields are available, the script enriches the definitions before generating those synthetic examples\.
The actual breakdown from the run log is:
- direct translations`en\-\>cy`:`154,618`
- direct translations`cy\-\>en`:`154,618`
- synthetic EN\-\>CY instructions:`29,307`
- synthetic CY\-\>EN instructions:`24,232`
That totals`362,775`constructed rows from TermCymru alone\. For Welsh, that is exactly the kind of dataset that improves both terminological precision and instruction retention at the same time\.
## Why the Cleanup Stage Was Not Optional
Data preparation is not a side issue in this project\. It is one of the main reasons the final training run is credible\. Once all datasets are normalized into the flat contract of`task`,`source\_text`,`target\_text`,`source\_lang\_code`, and`target\_lang\_code`, the script does several things before saving`\./processed\_data`:
- Filters empty or null source\-target pairs
- Performs global deduplication across task, source text, target text, and both language codes
- Rebalances the merged pool toward the target`70:30`translation\-instruction mix
- Balances translation directions toward near\-equal`en\-\>cy`and`cy\-\>en`coverage
The run statistics show why this matters:
- Raw constructed rows before deduplication:`1,532,911`
- Duplicates removed:`174,735`
- Rows removed by rebalancing:`3,408`
- Final constructed rows after cleanup:`1,354,653`
- Final translation rows:`948,257`
- Final instruction rows:`406,396`
- Final translation direction counts:`474,108``cy\-\>en`,`474,149``en\-\>cy`
- Held\-out evaluation split:`1,000`rows, stratified by task and direction
Those`174,735`duplicate removals are not cosmetic\. Repeated bilingual pairs can distort the training distribution, overweight specific phrases, and make the model memorize narrow sentence patterns at the expense of broader coverage\. In Welsh translation, where some institutional phrases recur constantly across corpora, global deduplication matters a lot more than people often assume\.
Another useful detail is that the recipe itself is updated with both`\_run\_summary`and`\_constructed\_summary`\. That makes the recipe a record of what the pipeline was intended to build and what it actually built after cleanup\.
## What`01\_prepare\_data\.py`Actually Printed
The console output below is useful for a different reason than the JSON artifact that follows\. This block shows the real dry\-run style output from[01\_prepare\_data\.py](https://github.com/grctest/finetuned-gemmatranslate-cy/blob/main/01_prepare_data.py): the dataset totals the script printed, the interactive confirmation prompt, and the cleanup stages that followed\. I am keeping it because it shows the operational feel of the pipeline rather than just the final recipe state\.
Click to see output from data preparation script\!```
TRANSLATION | techiaith_legislation | 129,452
TRANSLATION | techiaith_cardiff_uni | 290,600
TRANSLATION | techiaith_senedd | 198,151
TRANSLATION | techiaith_bydterm | 155,704
TRANSLATION | welsh_gov_pairs | 21,928
DICTIONARY | termcymru | 362,775
INSTRUCTION | muri_eng | 108,289
INSTRUCTION | muri_dan | 13,500
INSTRUCTION | muri_est | 14,400
INSTRUCTION | muri_cym | 13,500
INSTRUCTION | muri_ita | 21,453
INSTRUCTION | muri_fin | 14,040
INSTRUCTION | muri_fra | 26,604
INSTRUCTION | muri_gle | 13,500
INSTRUCTION | muri_glg | 14,734
INSTRUCTION | muri_jpn | 25,602
INSTRUCTION | muri_kor | 18,186
INSTRUCTION | muri_spa | 34,266
INSTRUCTION | muri_ukr | 7,592
INSTRUCTION | nemotron_cym | 27,673
TOTAL: 1,511,949 (T: 1,105,071, I: 406,878)
Proceed? [y/N] --- Construction ---
2. Deduplicating...
3. Rebalancing...
4. Splitting...
Done. Saved to ./processed_data
```
## The Full`data\_recipe\.json`Artifact
The JSON block below answers a different question from the console output above\. It is the post\-construction recipe artifact with`\_run\_summary`and`\_constructed\_summary`appended, so it acts as the best audit trail for how the recipe was configured and what the pipeline recorded after rebalancing\.
Because this guide combines adjacent working iterations, some totals differ slightly across the raw artifacts\. That is worth stating explicitly rather than pretending the logs came from one perfectly frozen pass\. The important point is that the same design logic holds across them: a large translation pool, a deliberate instruction component, and a cleanup pass that materially changes the final training mix\.
Click to see full data recipe JSON contents```
{
"profile_name": "Welsh",
"translation_data": {
"techiaith_legislation": {
"path": "techiaith/legislation-gov-uk_en-cy",
"usage": 1.0,
"directions": [
"en-cy",
"cy-en"
],
"rows_used": 129452,
"rows_available": 64726,
},
"techiaith_cardiff_uni": {
"path": "techiaith/cardiff-university-tm-en-cy",
"usage": 0.1,
"directions": [
"en-cy",
"cy-en"
],
"rows_used": 290600,
"rows_available": 1453009,
"source_rows_used": 145300
},
"techiaith_senedd": {
"path": "techiaith/cofnodycynulliad_en-cy",
"usage": 1.0,
"directions": [
"en-cy",
"cy-en"
],
"rows_used": 199245,
"rows_available": 104738,
"source_rows_used": 104738
},
"techiaith_bydterm": {
"path": "techiaith/bydtermcymru-tm-en-cy",
"usage": 1.0,
"directions": [
"en-cy",
"cy-en"
],
"rows_used": 155704,
"rows_available": 77852,
"source_rows_used": 77852
},
"helsinki_opus": {
"path": "Helsinki-NLP/opus-100",
"config": "cy-en",
"usage": 0,
"directions": [
"en-cy"
],
"rows_used": 0,
"rows_available": 289521,
"source_rows_used": 0
},
"welsh_gov_pairs": {
"path": "AndreasThinks/welsh-government-pairs",
"usage": 1.0,
"directions": [
"en-cy",
"cy-en"
],
"rows_used": 22479,
"rows_available": 13154,
"source_rows_used": 13154
}
},
"dictionary_data": {
"termcymru": {
"path": "TermCymru/TermCymru",
"usage": 1.0,
"directions": [
"en-cy",
"cy-en"
],
"note": "Split 50/50 between Translation and Chat-Definition mode",
"rows_used": 362775,
"rows_available": 154618,
"source_rows_used": 154618
}
},
"instruction_data": {
"muri_eng": {
"path": "akoksal/muri-it-language-split",
"config": "eng",
"language": "en",
"usage": 1.0,
"rows_used": 108940,
"rows_available": 113395,
"source_rows_used": 113395
},
"muri_dan": {
"path": "akoksal/muri-it-language-split",
"config": "dan",
"language": "da-DK",
"usage": 1.0,
"rows_used": 13500,
"rows_available": 13500,
"source_rows_used": 13500
},
"muri_est": {
"path": "akoksal/muri-it-language-split",
"config": "est",
"language": "ee-EE",
"usage": 1.0,
"rows_used": 14400,
"rows_available": 14400,
"source_rows_used": 14400
},
"muri_cym": {
"path": "akoksal/muri-it-language-split",
"config": "cym",
"language": "cy",
"usage": 1.0,
"rows_used": 13500,
"rows_available": 13500,
"source_rows_used": 13500
},
"muri_ita": {
"path": "akoksal/muri-it-language-split",
"config": "ita",
"language": "it",
"usage": 1.0,
"rows_used": 21453,
"rows_available": 21453,
"source_rows_used": 21453
},
"muri_fin": {
"path": "akoksal/muri-it-language-split",
"config": "fin",
"language": "fi",
"usage": 1.0,
"rows_used": 14040,
"rows_available": 14040,
"source_rows_used": 14040
},
"muri_fra": {
"path": "akoksal/muri-it-language-split",
"config": "fra",
"language": "fr",
"usage": 1.0,
"rows_used": 26604,
"rows_available": 26604,
"source_rows_used": 26604
},
"muri_gle": {
"path": "akoksal/muri-it-language-split",
"config": "gle",
"language": "gle",
"usage": 1.0,
"rows_used": 13500,
"rows_available": 13500,
"source_rows_used": 13500
},
"muri_glg": {
"path": "akoksal/muri-it-language-split",
"config": "glg",
"language": "glg",
"usage": 1.0,
"rows_used": 14734,
"rows_available": 14734,
"source_rows_used": 14734
},
"muri_jpn": {
"path": "akoksal/muri-it-language-split",
"config": "jpn",
"language": "ja-JP",
"usage": 1.0,
"rows_used": 25602,
"rows_available": 25603,
"source_rows_used": 25603
},
"muri_kor": {
"path": "akoksal/muri-it-language-split",
"config": "kor",
"language": "ko-KR",
"usage": 1.0,
"rows_used": 18186,
"rows_available": 18187,
"source_rows_used": 18187
},
"muri_spa": {
"path": "akoksal/muri-it-language-split",
"config": "spa",
"language": "es",
"usage": 1.0,
"rows_used": 34268,
"rows_available": 34281,
"source_rows_used": 34281
},
"muri_ukr": {
"path": "akoksal/muri-it-language-split",
"config": "ukr",
"language": "uk-UA",
"usage": 1.0,
"rows_used": 7592,
"rows_available": 7592,
"source_rows_used": 7592
},
"nemotron_cym": {
"path": "locailabs/nemotron-chat-welsh",
"language": "cy",
"usage": 1.0,
"rows_used": 27679,
"rows_available": 27807,
"source_rows_used": 27807
}
},
"meta_strategy": {
"target_ratio": "70:30",
"eval_size": 1000,
"packing": true,
"max_seq_len": 2048
},
"_run_summary": {
"total_translation_rows": 1106716,
"total_instruction_rows": 407537,
"total_rows": 1514253,
"translation_percentage": "73.1%",
"instruction_percentage": "26.9%"
},
"_constructed_summary": {
"total_translation_rows": 733196,
"total_instruction_rows": 314227,
"total_rows": 1047423,
"translation_percentage": "70.0%",
"instruction_percentage": "30.0%",
"training_split_rows": 1046423,
"eval_split_rows": 1000,
"duplicates_removed": 174622,
"rows_removed_by_rebalancing": 97446
}
}
```
## Token Length Analysis: Why 2048 Was the Right Ceiling
Before committing to a full expensive training run, it was worth measuring what prompt\-completion lengths actually looked like after data preparation\. That is the role of[01b\_analyze\_token\_lengths\.py](https://github.com/grctest/finetuned-gemmatranslate-cy/blob/main/01b_analyze_token_lengths.py)\.
This step matters because`max\_seq\_length`decisions are expensive to get wrong\. If the ceiling is too short, you throw away useful context\. If it is too long, throughput drops and memory pressure rises\. A 5% pass is a cheap way to see whether the dataset shape justifies the chosen context length before paying for a full scan\. We also kept`2048`deliberately because it stays close to the fine\-tuning setup Google describes in the[TranslateGemma technical report](https://arxiv.org/abs/2601.09012), which makes this Welsh run feel like an extension of the original recipe rather than a totally different training regime\.
The two commands used were:
```
python 01b_analyze_token_lengths.py --num-proc 6 --dataset-fraction 0.05 | tee 01b_token_analysis_0.05.txt
python 01b_analyze_token_lengths.py --num-proc 6 --dataset-fraction 1.0 | tee 01b_token_analysis_full.txt
```
The useful result is that the 5% sample was close enough to the 100% scan to be operationally trustworthy\.
ScanSegmentCountMeanMedian95th99th99\.9thMax**5%****Global**65,910308\.3098\.001580**2049**29463239Translation46,192206\.5575\.001013198329863239Instruction19,718546\.66270\.00**2049****2049**22732918**100%****Global**1,318,210306\.9899\.001587**2049**28853294Translation922,747206\.2375\.001013198329433294Instruction395,463542\.08265\.00**2049****2049**22463164With packing enabled,`2048`is a practical compromise\. It is long enough to capture almost all of the useful distribution, compact enough to keep throughput reasonable, and especially efficient when short prompts can be packed into fuller blocks\.
The full\-dataset summary is the clearest justification:
### Global Token Length Distribution Statistics
- **Count**:`1,318,210`
- **Mean tokens**:`306\.98`
- **99th percentile**:`2,049`, which lands almost exactly on the chosen sequence\-length target
Translation rows are much shorter on average than instruction rows, but the instruction side is exactly why this ceiling matters\. Translation segments average around`206`tokens\. Instruction segments average around`542`tokens and hit the`2049`boundary at both the 95th and 99th percentile in the reported scan\. That is one more reason the`70:30`mix matters: it keeps the model’s instruction behavior alive, but it also means you need a context choice that respects those denser examples\.
### 5% token distribution analysis output
This is the exact output from the smaller scan\. It is useful because it shows what the quick validation step actually printed, not just the summary table\.
Click to see 5% token distribution analysis output```
Loading dataset from ./processed_data...
Applying dataset fraction 5.0%: 1,318,210 -> 65,910 samples.
Loaded 65910 records. Loading tokenizer from ./local_model...
Mapping dataset to calculate lengths...
--- Global Token Length Distribution Statistics ---
Count: 65,910
Mean Tokens: 308.30
Median Tokens: 98.00
Maximum Tokens: 3239
95th Percentile: 1580
99th Percentile: 2049
--- TRANSLATION Segment Statistics ---
Count: 46,192
Mean Tokens: 206.55
99.9th Percentil:2986
--- INSTRUCTION Segment Statistics ---
Count: 19,718
Mean Tokens: 546.66
95th Percentile: 2049
99th Percentile: 2049
```
### 100% token distribution analysis output
This is the corresponding full scan\. The important point is not just that it is larger\. It is that it validates the 5% estimate well enough to justify the cheaper exploratory pass\.
Click to see 100% token distribution analysis output```
Loading dataset from ./processed_data...
Loaded 1318210 records. Loading tokenizer from ./local_model...
Mapping dataset to calculate lengths...
--- Global Token Length Distribution Statistics ---
Count: 1,318,210
Mean Tokens: 306.98
Median Tokens: 99.00
Maximum Tokens: 3294
95th Percentile: 1587
99th Percentile: 2049
--- TRANSLATION Segment Statistics ---
Count: 922,747
Mean Tokens: 206.23
99.9th Percentil:2943
--- INSTRUCTION Segment Statistics ---
Count: 395,463
Mean Tokens: 542.08
95th Percentile: 2049
99th Percentile: 2049
```
## Fine\-Tuning Configuration and Pilot Results
The fine\-tuning script in[02\_finetune\.py](https://github.com/grctest/finetuned-gemmatranslate-cy/blob/main/02_finetune.py)takes the flat processed dataset and turns it into prompt\-completion examples immediately before training\. Translation rows and instruction rows are not formatted the same way, because they are not trying to teach the same behavior\.
- Translation rows are rendered through TranslateGemma’s chat template with explicit`source\_lang\_code`,`target\_lang\_code`, and source text
- Instruction rows are rendered as standard user\-assistant turns
- The trainer uses`completion\_only\_loss=True`
- Embeddings are frozen
- LoRA is applied with rank`16`, alpha`32`, dropout`0\.05`, and the standard projection layers \(`q\_proj`,`k\_proj`,`v\_proj`,`o\_proj`,`gate\_proj`,`up\_proj`,`down\_proj`\)
- The logged`H200F`run uses one epoch over a 5% slice of the training set
That last point matters because it changes how to read the training log\. The`H200F`profile explicitly sets`dataset\_fraction=0\.05`, so the output below is a profile\-driven pilot run, not a claim that the full dataset has already been pushed through the same one\-epoch schedule\. That is the right way to validate loss behavior, throughput, prompt formatting, and packing before spending on a full run\.
The pilot curve is encouraging\. Loss drops from`2\.288`to`1\.718`across the logged checkpoints, while`mean\_token\_accuracy`rises from`0\.57`to`0\.6413`\. Final aggregate training loss lands at`1\.864`over one epoch of the 5% slice\.
### `02\_finetune\_log\.txt`
This was the command used to fine\-tune on a 5% slice of the constructed dataset with the`H200F`profile:
```
python 02_finetune.py --profile H200F --num-proc 20 --sft-num-proc 30 | tee 02_finetune_log.txt
```
Here is the profile itself:
```
{
"description": "1x H200 - Flash Attention 3 & Packing Enabled.",
"device_map": None,
"dtype": torch.bfloat16,
"per_device_train_batch_size": 12,
"gradient_accumulation_steps": 8,
"max_seq_length": 2048,
"bf16": True,
"gradient_checkpointing": True,
"use_flash_attention": True,
"packing": True, # Packing combines multiple short rows into robust 2048 blocks
"training_mode": "lora",
"deepspeed": None,
"use_cpu": False,
"learning_rate": 1e-4,
"optimizer": "adamw_torch_fused",
"dataset_fraction": 0.05, # 5% of the training data!
}
```
The full raw log is below\. I am keeping it because it shows the actual progression instead of a simplified chart\.
Click to see the full output from the 5% finetuning script which ran on a H200 GPU\!```
{'loss': '2.288', 'grad_norm': '0.3656', 'learning_rate': '9.297e-05', 'entropy': '1.702', 'num_tokens': '1.955e+06', 'mean_token_accuracy': '0.57', 'epoch': '0.07835'}
{'loss': '2.014', 'grad_norm': '0.2224', 'learning_rate': '8.516e-05', 'entropy': '1.952', 'num_tokens': '3.906e+06', 'mean_token_accuracy': '0.5897', 'epoch': '0.1567'}
{'loss': '1.892', 'grad_norm': '0.2347', 'learning_rate': '7.734e-05', 'entropy': '1.864', 'num_tokens': '5.864e+06', 'mean_token_accuracy': '0.6092', 'epoch': '0.2351'}
{'loss': '1.909', 'grad_norm': '0.2049', 'learning_rate': '6.953e-05', 'entropy': '1.908', 'num_tokens': '7.822e+06', 'mean_token_accuracy': '0.6072', 'epoch': '0.3134'}
{'loss': '1.873', 'grad_norm': '0.2403', 'learning_rate': '6.172e-05', 'entropy': '1.858', 'num_tokens': '9.781e+06', 'mean_token_accuracy': '0.6128', 'epoch': '0.3918'}
{'loss': '1.833', 'grad_norm': '0.2403', 'learning_rate': '5.391e-05', 'entropy': '1.83', 'num_tokens': '1.174e+07', 'mean_token_accuracy': '0.6191', 'epoch': '0.4701'}
{'loss': '1.803', 'grad_norm': '0.2672', 'learning_rate': '4.609e-05', 'entropy': '1.797', 'num_tokens': '1.369e+07', 'mean_token_accuracy': '0.6252', 'epoch': '0.5485'}
{'loss': '1.794', 'grad_norm': '0.2771', 'learning_rate': '3.828e-05', 'entropy': '1.788', 'num_tokens': '1.565e+07', 'mean_token_accuracy': '0.6275', 'epoch': '0.6268'}
{'loss': '1.81', 'grad_norm': '0.3136', 'learning_rate': '3.047e-05', 'entropy': '1.805', 'num_tokens': '1.76e+07', 'mean_token_accuracy': '0.6252', 'epoch': '0.7052'}
{'loss': '1.763', 'grad_norm': '0.3222', 'learning_rate': '2.266e-05', 'entropy': '1.752', 'num_tokens': '1.956e+07', 'mean_token_accuracy': '0.6339', 'epoch': '0.7835'}
{'loss': '1.742', 'grad_norm': '0.3482', 'learning_rate': '1.484e-05', 'entropy': '1.742', 'num_tokens': '2.152e+07', 'mean_token_accuracy': '0.6364', 'epoch': '0.8619'}
{'loss': '1.718', 'grad_norm': '0.3116', 'learning_rate': '7.031e-06', 'entropy': '1.716', 'num_tokens': '2.347e+07', 'mean_token_accuracy': '0.6413', 'epoch': '0.9403'}
{'train_runtime': '2540', 'train_samples_per_second': '4.824', 'train_steps_per_second': '0.05', 'train_loss': '1.864', 'entropy': '1.772', 'num_tokens': '2.496e+07', 'mean_token_accuracy': '0.6317', 'epoch': '1'}
```
## Merge, Inference, and GGUF Deployment
Once LoRA training is done, the remaining steps are mechanically simple but strategically important\. If you want the model to be more than a successful training run on paper, you need a merged artifact and a format that is easy to deploy into real local tooling\. The merge stage lives in[03\_merge\.py](https://github.com/grctest/finetuned-gemmatranslate-cy/blob/main/03_merge.py)\.
```
python 03_merge.py --profile cpu
python llama.cpp/convert_hf_to_gguf.py ./final_merged_model --outfile 4B_cy_q8_0.gguf --outtype q8_0
```
The above commands produce two useful outputs:
- A merged safetensors model that combines the fine\-tuned adapters with the original model weights
- A GGUF build of the merged Welsh TranslateGemma model for local runtimes
That last step is what turns the project from “we fine\-tuned a model” into “we can actually run this thing on workstations and local inference stacks\.” If your interest is more on the deployment ecosystem than the training loop,[Best Open Source TranslateGemma Tools](https://metalglot.com/blog/open-source-translategemma-comparison/)is the broader guide to the local runtime landscape\.
## Testing the Merged Model in Practice
After merging the fine\-tuning output back into the base weights, the safetensors model can be tested with the simple inference script in[04\_inference\.py](https://github.com/grctest/finetuned-gemmatranslate-cy/blob/main/04_inference.py):
```
python 04_inference.py --profile cpu
```
The GGUF path is even more useful operationally because it makes the model easy to drop into existing local workflows\. In MetalGlot, for example, you can download the matching base TranslateGemma GGUF variant, navigate to the model download folder, rename the downloaded file, and replace it with the fine\-tuned GGUF using the same filename\. That makes the Welsh\-tuned model immediately testable through a GUI instead of a terminal\.
If you want to test the result in a broader local ecosystem, the companion guide to[Best Open Source TranslateGemma Tools](https://metalglot.com/blog/open-source-translategemma-comparison/)is the right place to compare runtimes, wrappers, and deployment styles\.
Before moving anywhere near production, the model still needs evaluation against the reference model\. That means some combination of MetricX, BLEU, COMET, or a similarly defensible translation\-quality evaluation method rather than just eyeballing a few nice outputs\.
## FAQ: Common Questions About This Welsh Fine\-Tune
### Why not just use more translation data?
Because more rows were not the main bottleneck\. The real risk was destroying the translation\-instruction balance and letting a few huge corpora dominate the entire recipe\. Cardiff and OPUS were useful, but letting them flood the mixture would have forced a completely different instruction strategy\. More data also would substantially increase the time to process the fine\-tuning task\.
### Why keep non\-Welsh instruction rows in a Welsh project?
Because they were not there to teach Welsh translation directly\. They were there to preserve assistant behavior, turn structure, and instruction\-following competence while the model was being pushed toward translation\.
### Is H200 required for this workflow?
The current H200F profile does require a H200 GPU\. If you adjust the profiles you could run it on a H100, however if you downgrade to A100 or earlier generation graphics cards then you start to lose out on newer flash attention versions and bfloat16 support, as well as a far longer compute time\. Blackwell graphics cards flash attention v4 is not yet supported for this fine\-tuning task\.
### Can I use this article to fine\-tune TranslateGemma for another language?
Yes, but you should expect to reuse the method more than the dataset adapters\. The training flow, token analysis, LoRA setup, merge step, and GGUF deployment path all transfer reasonably well\. The part that usually changes first is[01\_prepare\_data\.py](https://github.com/grctest/finetuned-gemmatranslate-cy/blob/main/01_prepare_data.py), because every new language tends to arrive with different dataset schemas, field names, language codes, and glossary resources\.
### Why convert to GGUF if the merged model already works?
Because deployment is part of the project, not an afterthought\. GGUF makes the model much easier to run in common local tooling and GUI\-based inference stacks, which matters if the goal is practical Welsh translation rather than a one\-off training artifact\.
## Conclusion
The most important lesson from this Welsh TranslateGemma fine\-tune is that the recipe matters at least as much as the hardware\. The H200 and`H200F`profile made the run practical, but the real quality came from curation: choosing strong Welsh institutional corpora, refusing to let a few giant translation memories take over, generating synthetic instruction data from TermCymru, preserving instruction behavior through multilingual Muri slices, deduplicating aggressively, and validating token\-length distributions before committing to training\.
That is also why the recipe, token analysis outputs, and fine\-tuning log are worth publishing in full\. They show that the pipeline is measurable, inspectable, and reproducible rather than hand\-wavy\.
If I extend this run further, the cleanest next step is still not “add every translation pair available\.” It is either adding more high\-quality Welsh instruction data, or expanding the translation side in a way that still preserves the`70:30`behavior target and the cleaned, balanced structure that made this run coherent in the first place\. For Welsh, and probably for many other underrepresented institutional languages, that is the real takeaway\.