MinT: Managed Infrastructure for Training and Serving Millions of LLMs

Hugging Face Daily Papers

Summary

MinT is a managed infrastructure system that enables efficient training and serving of millions of LLM policies by keeping base models resident and moving only lightweight LoRA adapters, scaling along three axes: model architecture (Scale Up), adapter storage and movement (Scale Down), and policy catalog management (Scale Out).

We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT targets a setting where many trained policies are produced over a small number of expensive base-model deployments. Instead of materializing each policy as a merged full checkpoint, MinT keeps the base model resident and moves exported LoRA adapter revisions through rollout, update, export, evaluation, serving, and rollback, hiding distributed training, serving, scheduling, and data movement behind a service interface. MinT scales this path along three axes. Scale Up extends LoRA RL to frontier-scale dense and MoE architectures, including MLA and DSA attention paths, with training and serving validated beyond 1T total parameters. Scale Down moves only the exported LoRA adapter, which can be under 1% of base-model size in rank-1 settings; adapter-only handoff reduces the measured step by 18.3x on a 4B dense model and 2.85x on a 30B MoE, while concurrent multi-policy GRPO shortens wall time by 1.77x and 1.45x without raising peak memory. Scale Out separates durable policy addressability from CPU/GPU working sets: a tensor-parallel deployment supports 10^6-scale addressable catalogs (measured single-engine sweeps through 100K) and thousand-adapter active waves at cluster scale, with cold loading treated as scheduled service work and packed MoE LoRA tensors improving live engine loading by 8.5-8.7x. MinT thus manages million-scale LoRA policy catalogs while training and serving selected adapter revisions over shared 1T-class base models.
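To make the Scale Down arithmetic concrete, the sketch below estimates the size of a rank-1 LoRA adapter relative to its base model. The layer count, hidden size, and set of adapted modules are illustrative assumptions for a generic 4B-class dense transformer, not MinT's actual configuration.

# Back-of-envelope sizing for a rank-1 LoRA adapter versus its base model.
# All dimensions are illustrative assumptions for a generic dense
# transformer, not MinT's actual configuration.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A rank-r adapter stores two low-rank factors, A (rank x d_in) and
    B (d_out x rank), in place of a full d_out x d_in weight update."""
    return rank * d_in + d_out * rank

LAYERS, HIDDEN, RANK = 36, 4096, 1  # hypothetical 4B-class dense model
BASE_PARAMS = 4e9

# Adapt the four attention projections (q, k, v, o), each HIDDEN x HIDDEN.
adapter_params = LAYERS * 4 * lora_params(HIDDEN, HIDDEN, RANK)

print(f"adapter parameters: {adapter_params:,}")                  # ~1.2M
print(f"fraction of base:   {adapter_params / BASE_PARAMS:.4%}")  # ~0.03%

Under these assumptions the adapter is roughly a million parameters, about 0.03% of the base model, which is why moving exported adapter revisions instead of merged full checkpoints keeps the rollout-to-serving handoff cheap.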

Source: https://huggingface.co/papers/2605.13779 (published May 13)

#2 Paper of the day

Similar Articles

Granite 4.1 LLMs: How They’re Built

Hugging Face Blog

This article details the technical architecture and training pipeline of IBM's Granite 4.1 LLMs, covering pre-training, SFT, and RL stages. It highlights that the 8B dense model outperforms larger MoE counterparts and notes that the models are released under the Apache 2.0 license.

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

Hugging Face Daily Papers

This paper introduces AutoTTS, an environment-driven framework that automates the discovery of test-time scaling strategies for LLMs by formulating it as controller synthesis. It demonstrates improved accuracy-cost tradeoffs on mathematical reasoning benchmarks with minimal computational overhead.

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

Papers with Code Trending

MiniCPM-V 4.5 is an 8B multimodal large language model that achieves high efficiency and strong performance through a unified 3D-Resampler architecture, a novel data strategy, and a hybrid reinforcement learning approach. The model reportedly surpasses larger proprietary and open-source models on benchmarks while significantly reducing GPU memory usage and inference time.