@QingQ77: Training a 0.1B end-to-end omnimodal model from scratch. A single set of weights handles text, speech, and image inputs, while outputting text and streaming speech. https://github.com/jingyaogong/minimind-o… MiniMind-O is an omnimodal model with only 0.1B parameters…

X AI KOLs Timeline 05/09/26, 01:59 PM Models

omni-model open-source multimodal speech-processing pytorch lightweight

Summary

MiniMind-O is a newly released end-to-end omnimodal model with only 0.1B parameters. It supports text, speech, and image inputs as well as streaming speech output. The project open-sources the code, weights, training data, and technical report, and emphasizes that both training and inference run quickly on standard GPUs.

Training a 0.1B end-to-end omnimodal model from scratch. A single set of weights handles text, speech, and image inputs, while outputting text and streaming speech. https://github.com/jingyaogong/minimind-o… MiniMind-O is an omnimodal model with only 0.1B parameters, featuring a Thinker-Talker dual-path design. It supports text, speech, and image inputs, and outputs text and streaming speech. The project open-sources the code, weights, training data, and technical report. The core algorithm is written from scratch in PyTorch, and training on the mini dataset takes just two hours on a single RTX 3090.

Cached at: 05/09/26, 04:10 PM

“Great truths are simple”

Similar Articles

@vintcessun: Pretraining can be this cost-effective? Train a usable 1B base model from scratch for ~$1000, slashing compute and data by hundreds of times. The key isn't brute-force compute, but hierarchical recursive architecture plus latent space reasoning, combined with PrefixLM packing and FA3 to maximize efficiency. Sounds insane, but the paper and code are open-sourced.

X AI KOLs Timeline

HRM-Text released a 1B-parameter base model, claiming it can be pretrained from scratch for only ~$1000, cutting compute and data requirements by a factor of hundreds. It combines a hierarchical recursive architecture, latent space reasoning, and PrefixLM packing for efficiency. The paper and code are open-sourced.

@seclink: This 12-billion-parameter model uses a unified Transformer architecture to efficiently handle raw multimodal inputs. It requires only 16GB of RAM to run, making it a perfect fit for devices like the MacBook Pro. It excels in various benchmarks, such as achieving 78.8% on GPQA Diamond and...

X AI KOLs Following

A 12-billion-parameter multimodal model has been released as open source. It features a unified Transformer architecture and requires only 16GB of RAM to run. It performs exceptionally well across multiple benchmarks, supports a 256K context window, and works with over 140 languages.

@FeitengLi: A 99M parameter TTS runs on CPU, faster than a 2B model on A100. Supertone's newly open-sourced supertonic-3 with ONNX Runtime, fully local, can run in browser, on phone, and even on Raspberry Pi.

X AI KOLs Timeline

Supertone released Supertonic 3, an open-source TTS model with 99M parameters that runs faster on a CPU than a 2B model does on an A100. It supports 31 languages and uses ONNX Runtime for fully local inference.

@berryxia: Small model, big wisdom? It's now real! A 7B small model now acts as the boss of top large models like GPT-5, Claude Sonnet 4, Gemini 2.5 Pro. A new paper shows an RL-trained 7B model learned to write natural language subtasks, assign them to different models, precisely...

X AI KOLs Timeline

A new paper proposes training a 7B small model via reinforcement learning as a task scheduler, automatically decomposing subtasks and assigning them to top models like GPT-5 and Claude. It surpasses individual frontier models on several hard benchmarks, demonstrating that end-to-end reward learning can effectively replace manual prompt engineering and multi-agent pipeline design.

@rionaifantasy: Unbelievable! How Can a 34.5M Parameter OCR Beat a 235B Large Model? Let me tell you something ridiculous: I used to believe the future of OCR would inevitably be devoured by ever-larger multimodal large models. But after seeing PP-OCRv6 released by Baidu Wenxin, I've changed my mind. Because it doesn't follow the path of "continuing to pile on parameters..."

X AI KOLs Timeline

Baidu Wenxin releases PP-OCRv6, offering three model tiers: Tiny, Small, and Medium, supporting over 50 languages. The Tiny version is only 1.5MB and can run locally in a browser, with the fastest single-image inference at 97ms, proving that small specialized models can outperform large models on OCR tasks.
