@0xLogicrw: MiniMax published a technical blog post detailing the root cause analysis for its M2 series large models' inability to output the person's name "Ma Jiaqi". Starting from a single case study, the investigation ultimately revealed a systematic degradation issue affecting nearly 5% of the entire vocabulary. The root cause was a severe disconnect in data coverage between the two training stages of the large model. In the first stage (pre-training), massive amounts of internet text were used to cre…

X AI KOLs Timeline News

Summary

MiniMax published a technical blog post providing an in-depth analysis of the systematic vocabulary degradation behind its M2 series large models' inability to output specific personal names. The post attributes the problem to parameter drift caused by a coverage gap between pre-training and post-training data, and proposes an effective remediation based on synthetic data covering the full vocabulary.

MiniMax published a technical blog post disclosing the root-cause investigation behind its M2 series large models' inability to output the name "Ma Jiaqi". Starting from a single case, the investigation ultimately uncovered a systematic degradation issue affecting nearly 5% of the entire vocabulary. The fundamental cause was a severe coverage gap between the model's two training stages: in the first stage (pre-training), vast amounts of internet text were used to build a "dictionary" of approximately 200,000 tokens; in the second stage (post-training), curated dialogue data taught the model how to converse, but that dialogue data covered only part of the dictionary. Tokens present in the dictionary but rarely or never seen during post-training were gradually forgotten.

"Jiaqi" is one such token. The tokenizer (which segments text into the smallest units the model can process) encountered the combination "Jiaqi" frequently enough in internet text to merge it into an independent unit, and the model learned this token during pre-training. In the post-training dialogue data, however, there were fewer than 5 samples containing "Jiaqi". As post-training continuously adjusted the model parameters, tokens that were practiced became more accurate, while those that were not drifted with each update. In the end, the model still "knew" Ma Jiaqi and could accurately answer related questions; what it lost was only the ability to write out the name.

Other highly degraded tokens include internet SEO spam terms such as "legendary private servers" and "painless abortion". These terms were pervasive in the internet corpora used for pre-training, so the tokenizer assigned them independent IDs, but they did not appear in the curated post-training dialogue data and were likewise forgotten. The team conducted a full scan of the complete vocabulary and found that approximately 4.9% of tokens showed significant degradation. Japanese was the most severely affected language: 29.7% of Japanese tokens degraded significantly, far exceeding Korean (3.3%), Russian (3.7%), Chinese (3.9%), and English (3.5%).

The severe degradation in Japanese also solved an earlier mystery. The model occasionally mixed Russian or Korean characters into Japanese conversations, and the reason had remained elusive. This analysis showed that after large numbers of Japanese tokens degraded, they "drifted" in the model's internal parameter space into the territory of other languages, so the model mistakenly wrote Russian or Korean where it should have written Japanese.

The remediation was to construct synthetic data covering the entire vocabulary, letting the model practice every token in the dictionary through simple repetition tasks. The effect was immediate: the proportion of Russian characters mixed into Japanese responses dropped from 47% to 1%, and per-token parameter stability across the vocabulary rose from a low of 0.329 to above 0.97 for all tokens.
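To see the "merged into an independent unit" claim concretely, the sketch below checks how a Hugging Face-style BPE tokenizer segments the characters of the name. The model id is only a placeholder for illustration; the excerpt above does not identify the exact tokenizer used in the blog post.

```python
# Minimal sketch, assuming a Hugging Face-compatible tokenizer.
# "MiniMaxAI/MiniMax-M2" is a placeholder model id, not necessarily the
# tokenizer analyzed in the blog post.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MiniMaxAI/MiniMax-M2", trust_remote_code=True)

text = "马佳琪"  # "Ma Jiaqi"
ids = tokenizer.encode(text, add_special_tokens=False)
pieces = tokenizer.convert_ids_to_tokens(ids)

# If the tokenizer merged "Jiaqi" into one unit, the two characters will
# appear as a single token id rather than two per-character ids.
for tok_id, piece in zip(ids, pieces):
    print(tok_id, repr(piece))
```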
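The excerpt does not spell out how the "full scan of the complete vocabulary" was done. A plausible proxy, sketched below, is to compare each token's embedding before and after post-training and flag tokens whose cosine similarity falls below a threshold (the 0.329-to-0.97 "stability" figures read like such a per-token similarity). The tensor names and the threshold are assumptions.

```python
# Hypothetical sketch: flag "degraded" tokens via cosine similarity between
# the pre-training and post-training input embedding matrices.
# `emb_pretrain` and `emb_posttrain` are assumed [vocab_size, dim] tensors
# extracted from the two checkpoints; the 0.5 threshold is arbitrary.
import torch
import torch.nn.functional as F

def degraded_tokens(emb_pretrain: torch.Tensor,
                    emb_posttrain: torch.Tensor,
                    threshold: float = 0.5) -> torch.Tensor:
    sim = F.cosine_similarity(emb_pretrain, emb_posttrain, dim=-1)  # [vocab_size]
    return (sim < threshold).nonzero(as_tuple=True)[0]              # ids below threshold

# Example with random stand-in weights (real use would load two checkpoints):
vocab_size, dim = 200_000, 4096
e0 = torch.randn(vocab_size, dim)
e1 = e0.clone()
e1[:10] = torch.randn(10, dim)  # simulate a few drifted rows
print(degraded_tokens(e0, e1)[:20])
```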
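The remediation is described only as "simple repetition tasks" spanning every token; the following is a minimal sketch of what such full-vocabulary synthetic samples might look like. The prompt template and chat format are assumptions, not MiniMax's actual recipe.

```python
# Hypothetical sketch: build one repetition sample per vocabulary token so
# that post-training touches every token id at least once.
# The chat message format below is assumed, not taken from the blog post.
import json

def build_repetition_dataset(tokenizer, out_path: str = "repeat_full_vocab.jsonl"):
    with open(out_path, "w", encoding="utf-8") as f:
        for token_id in range(tokenizer.vocab_size):
            piece = tokenizer.decode([token_id])
            if not piece.strip():  # skip whitespace/control pieces
                continue
            sample = {
                "messages": [
                    {"role": "user", "content": f"Repeat exactly: {piece}"},
                    {"role": "assistant", "content": piece},
                ]
            }
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")

# Usage (tokenizer loaded as in the earlier sketch):
# build_repetition_dataset(tokenizer)
```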

Similar Articles

MiniMaxAI/MiniMax-M2.7

Hugging Face Models Trending

MiniMaxAI releases MiniMax-M2.7, an open-weight model featuring self-evolution capabilities, advanced agent team support, and strong performance on software engineering benchmarks (56.22% on SWE-Pro, 66.6% medal rate on MLE Bench Lite), with notable applications in production incident recovery and professional work tasks.

@QingQ77: Training a 0.1B end-to-end omnimodal model from scratch. A single set of weights handles text, speech, and image inputs, while outputting text and streaming speech. https://github.com/jingyaogong/minimind-o… MiniMind-O is an omnimodal model with only 0.1B parameters…

X AI KOLs Timeline

MiniMind-O is a newly released end-to-end omnimodal model with only 0.1B parameters, supporting text, speech, and image inputs as well as streaming speech output. The project open-sources the code, weights, training data, and technical report, emphasizing that both training and inference can be run quickly on standard GPUs.

@yidabuilds: https://x.com/yidabuilds/status/2053409619641602286

X AI KOLs Timeline

The author conducted a comparative evaluation of four Chinese AI models: DeepSeek V4, Kimi K2.6, GLM-5.1, and MiniMax M2.7. The analysis covers their strengths and weaknesses in cost, long-context processing, coding stability, and reasoning performance, offering specific recommendations on how to route tasks such as large-document analysis, long-running background jobs, and bulk content generation.

@sanbuphy: K2.6 successfully downloaded and deployed the Qwen3.5-0.8B model locally on a Mac, using the niche Zig language to implement and optimize inference, demonstrating the new model’s generalization ability. After 4,000+ tool calls and 12+ hours of continuous operation, K2.6 iterated 14 times…

X AI KOLs Timeline

K2.6 successfully downloaded and deployed the Qwen3.5-0.8B model locally on a Mac, using the niche Zig language to implement and optimize inference, demonstrating the new model’s generalization ability. After 4,000+ tool calls and 12+ hours of continuous operation, K2.6 iterated 14 times, boosting throughput from ~15 tokens/s to ~193 tokens/s, ultimately achieving 20% faster inference than LM Studio.