MinT: Managed Infrastructure for Training and Serving Millions of LLMs
Summary
MinT is a managed infrastructure system that enables efficient training and serving of millions of LLMs by keeping base models resident and moving lightweight LoRA adapters, scaling across model architectures, storage, and policy management.
Source: https://huggingface.co/papers/2605.13779 Published on May 13
#2 Paper of the day
Abstract
We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT targets a setting where many trained policies are produced over a small number of expensive base-model deployments. Instead of materializing each policy as a merged full checkpoint, MinT keeps the base model resident and moves exported LoRA adapter revisions through rollout, update, export, evaluation, serving, and rollback, hiding distributed training, serving, scheduling, and data movement behind a service interface. MinT scales this path along three axes. Scale Up extends LoRA RL to frontier-scale dense and MoE architectures, including MLA and DSA attention paths, with training and serving validated beyond 1T total parameters. Scale Down moves only the exported LoRA adapter, which can be under 1% of base-model size in rank-1 settings; adapter-only handoff reduces the measured step by 18.3x on a 4B dense model and 2.85x on a 30B MoE, while concurrent multi-policy GRPO shortens wall time by 1.77x and 1.45x without raising peak memory. Scale Out separates durable policy addressability from CPU/GPU working sets: a tensor-parallel deployment supports 10^6-scale addressable catalogs (measured single-engine sweeps through 100K) and thousand-adapter active waves at cluster scale, with cold loading treated as scheduled service work and packed MoE LoRA tensors improving live engine loading by 8.5-8.7x. MinT thus manages million-scale LoRA policy catalogs while training and serving selected adapter revisions over shared 1T-class base models.
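The abstract's central idea is adapter-only handoff: the shared base model stays resident while only exported LoRA adapter revisions move between training, evaluation, and serving, addressed through a durable policy catalog. The sketch below is illustrative only and not MinT's actual API; the names (AdapterRevision, PolicyCatalog, lora_adapter_fraction) are hypothetical. It shows the shape of such a catalog and works out why a rank-1 adapter pair is a tiny fraction of the weight matrix it adapts, consistent with the "under 1% of base-model size" figure for rank-1 settings.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


def lora_adapter_fraction(d_model: int, rank: int) -> float:
    """Fraction of a d_model x d_model weight matrix that a LoRA pair adds.

    LoRA factorizes the weight update as B @ A, with A of shape (rank, d_model)
    and B of shape (d_model, rank), so the added parameters are
    2 * rank * d_model versus d_model**2 for the frozen base weight.
    """
    return (2 * rank * d_model) / (d_model ** 2)


@dataclass
class AdapterRevision:
    """An exported LoRA adapter revision; the base model is never copied."""
    policy_id: str
    revision: int
    tensors: Dict[str, bytes] = field(default_factory=dict)  # packed LoRA tensors


@dataclass
class PolicyCatalog:
    """Durable, addressable catalog of adapter revisions (illustrative only)."""
    revisions: Dict[str, AdapterRevision] = field(default_factory=dict)

    def export(self, adapter: AdapterRevision) -> str:
        key = f"{adapter.policy_id}@{adapter.revision}"
        self.revisions[key] = adapter  # durable handle, not a merged checkpoint
        return key

    def cold_load(self, key: str) -> Optional[AdapterRevision]:
        # In MinT, cold loading is scheduled service work against a shared engine;
        # here it is just a dictionary lookup.
        return self.revisions.get(key)


if __name__ == "__main__":
    # Rank-1 adapters on a 4096-wide projection add roughly 0.05% of that
    # matrix's parameter count.
    print(f"rank-1, d=4096: {lora_adapter_fraction(4096, 1):.4%} of one weight matrix")

    catalog = PolicyCatalog()
    key = catalog.export(AdapterRevision(policy_id="policy-007", revision=3))
    print("serving from", catalog.cold_load(key))
```

Running the sketch prints about 0.05% for a rank-1 pair on a 4096-wide projection and round-trips one adapter revision through export and lookup, mirroring the export-then-serve path the abstract describes without ever touching base-model weights.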
Similar Articles
Granite 4.1 LLMs: How They’re Built
This article details the technical architecture and training pipeline of IBM's Granite 4.1 LLMs, covering pre-training, SFT, and RL stages. It highlights that the 8B dense model outperforms larger MoE counterparts and notes the release under the Apache 2.0 license.
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
This paper introduces AutoTTS, an environment-driven framework that automates the discovery of test-time scaling strategies for LLMs by formulating it as controller synthesis. It demonstrates improved accuracy-cost tradeoffs on mathematical reasoning benchmarks with minimal computational overhead.
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
MiniCPM-V 4.5 is an 8B multimodal large language model that achieves high efficiency and strong performance through a unified 3D-Resampler architecture, a novel data strategy, and a hybrid reinforcement learning approach. The model reportedly surpasses larger proprietary and open-source models on benchmarks while significantly reducing GPU memory usage and inference time.
Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs
This paper proposes Multi-Stream LLMs, which transition from sequential message-based instruction tuning to parallel stream processing. This approach allows language models to simultaneously read, think, and generate across multiple concurrent data flows, addressing bottlenecks in autonomous agent applications.
SOMA: Efficient Multi-turn LLM Serving via Small Language Model
This paper introduces SOMA, a framework for efficient multi-turn LLM serving that uses small language models adapted via soft prompts and LoRA fine-tuning to reduce latency and cost.