@GitHub_Daily: To dig deep into model research, you can't stay at the application layer; you need to understand how the underlying system is trained and optimized. I stumbled upon LLMSys-PaperList, a carefully curated collection of papers on large model systems. It covers work from 2022 through the latest top-conference papers of 2026 and is organized by category, such as training, inference, multimodality...
Summary
A carefully curated collection of papers related to large model systems, covering training, inference, multimodality, and more. It is continuously updated and includes technical reports, frameworks, and courses, making it a valuable reference for researchers and developers.
Cached at: 06/12/26, 08:58 AM
To study models in depth, you cannot stay at the application layer; you need to understand how the underlying systems are trained and optimized. I stumbled upon LLMSys-PaperList, a carefully curated collection of papers on large model systems. It covers work from 2022 through the latest top-conference papers of 2026 and is organized by direction, such as training, inference, and multimodality. Each paper is annotated with its source and publication venue, so the list works as a continuously updated literature map. GitHub: http://github.com/AmberLJC/LLMSys-PaperList… In addition to academic papers, it includes technical reports from major companies, open-source training and inference frameworks, relevant courses, and technical documentation for mainstream models such as DeepSeek, Llama, and Qwen. If you do research or development on large models, this list is worth bookmarking; it saves a lot of time searching for papers.
AmberLJC/LLMSys-PaperList

Source: https://github.com/AmberLJC/LLMSys-PaperList

# Awesome LLM Systems Papers

A curated list of Large Language Model systems related academic papers, articles, tutorials, slides and projects. Star this repository, and then you can keep abreast of the latest developments of this booming research field.

## Table of Contents

- LLM Systems - Training - Pre-training - Post Training - Fault Tolerance / Straggler Mitigation - Serving - LLM serving - Agent Systems - Serving at the edge - System Efficiency Optimization - Model Co-design - Multi-Modal Training Systems - Multi-Modal Serving Systems - LLM for Systems - Industrial LLM Technical Report - ML Conferences - NeurIPS 2025 - LLM Frameworks - Training - Post-Training - Serving - ML Systems - Survey Paper - LLM Benchmark / Leaderboard / Traces - Related ML Readings - MLSys Courses - Other Reading

## LLM Systems

### Training

#### Pre-training

Before 2024

- Megatron-LM (https://arxiv.org/pdf/1909.08053.pdf): Training Multi-Billion Parameter Language Models Using Model Parallelism
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (https://arxiv.org/pdf/2104.04473.pdf)
- Reducing Activation Recomputation in Large Transformer Models (https://arxiv.org/pdf/2205.05198.pdf)
- Optimized Network Architectures for Large Language Model Training with Billions of Parameters (https://arxiv.org/pdf/2307.12169.pdf) | MIT
- Carbon Emissions and Large Neural Network Training (https://arxiv.org/pdf/2104.10350.pdf?fbclid=IwAR2o0_3HCtTnMxKbXka0OPrHzl8sCzQSSOYp0AOav76-zVWl_pYek2jX8Pk) | Google, UCB

2024

- Perseus (https://arxiv.org/abs/2312.06902v1): Removing Energy Bloat from Large Model Training | SOSP’ 24
- MegaScale (https://arxiv.org/abs/2402.15627): Scaling Large Language Model Training to More Than 10,000 GPUs | ByteDance
- DISTMM (https://www.usenix.org/conference/nsdi24/presentation/huang): Accelerating distributed multimodal model training | NSDI’ 24
- Pipeline Parallelism with Controllable Memory (https://arxiv.org/abs/2405.15362) | Sea AI Lab
- Boosting Large-scale Parallel Training Efficiency with C4 (https://arxiv.org/abs/2406.04594): A Communication-Driven Approach
- Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training (https://openreview.net/pdf?id=uLpyWQPyF9) | ICML’ 24
- Alibaba HPN: (https://ennanzhai.github.io/pub/sigcomm24-hpn.pdf) A Data Center Network for Large Language Model Training
- The Llama 3 Herd of Models (https://arxiv.org/abs/2407.21783) (Section 3)
- Enabling Parallelism Hot Switching for Efficient Training of Large Language Models | SOSP’ 24
- Revisiting Reliability in Large-Scale Machine Learning Research Clusters (https://arxiv.org/abs/2410.21680)
- ScheMoE (https://dl.acm.org/doi/10.1145/3627703.3650083): An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling | EuroSys’ 24
- DynaPipe (https://arxiv.org/abs/2311.10418): Optimizing Multi-task Training through Dynamic Pipelines | EuroSys’ 24
- HAP (https://dl.acm.org/doi/10.1145/3627703.3650074): SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis | EuroSys’ 24
- Demystifying Workload Imbalances in Large Transformer Model Training over Variable-length Sequences (https://arxiv.org/abs/2412.07894) | PKU
- Improving training time and GPU utilization in geo-distributed language model training (https://arxiv.org/abs/2411.14458)
- DeepSeek-V3 Technical Report (https://arxiv.org/abs/2412.19437)
- Megalodon (https://arxiv.org/abs/2404.08801): Efficient LLM Pretraining and Inference with Unlimited Context Length

2025

- Comet (https://arxiv.org/pdf/2502.19811): Fine-grained Computation-communication Overlapping for Mixture-of-Experts | ByteDance
- ByteScale (https://arxiv.org/pdf/2502.21231): Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs | ByteDance
- SPPO (https://arxiv.org/abs/2503.10377): Efficient Long-sequence LLM Training via Adaptive Sequence Pipeline Parallel Offloading
- TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives (https://arxiv.org/abs/2503.20313) | MLSys’ 25
- Every FLOP Counts (https://arxiv.org/abs/2503.05139): Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs | Ant Group
- FlexSP (https://dl.acm.org/doi/abs/10.1145/3676641.3715998): Accelerating Large Language Model Training via Flexible Sequence Parallelism | ASPLOS’ 25
- WeiPipe (https://dl.acm.org/doi/pdf/10.1145/3710848.3710869): Weight Pipeline Parallelism for Communication-Effective Long-Context Large Model Training | PPoPP’ 25
- WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training (https://arxiv.org/pdf/2503.17924) | OSDI’ 25
- Mixtera (https://mboether.com/assets/pdf/bother2025mixtera.pdf): A Data Plane for Foundation Model Training | ETH
- Flex Attention (https://arxiv.org/abs/2412.05496): A Programming Model for Generating Optimized Attention Kernels | MLSys’ 25
- Balancing Pipeline Parallelism with Vocabulary Parallelism (https://arxiv.org/abs/2411.05288) | MLSys’ 25
- SlimPipe (https://arxiv.org/abs/2504.14519): Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training | Kuaishou
- Scaling Llama 3 Training with Efficient Parallelism Strategies (https://aisystemcodesign.github.io/papers/Llama3-ISCA25.pdf) | ISCA’ 25
- Lumos (https://arxiv.org/abs/2504.09307): Efficient Performance Modeling and Estimation for Large-scale LLM Training | MLSys’ 25
- BurstEngine (https://arxiv.org/abs/2509.19836): an Efficient Distributed Framework for Training Transformers on Extremely Long Sequences of over 1M Tokens
- Robust LLM Training Infrastructure at ByteDance (https://sigops.org/s/conferences/sosp/2025/accepted.html) | SOSP’ 25
- Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters (https://sigops.org/s/conferences/sosp/2025/accepted.html) | SOSP’ 25
- Tempo: Compiled Dynamic Deep Learning with Symbolic Dependence Graphs (https://sigops.org/s/conferences/sosp/2025/accepted.html) | SOSP’ 25
- Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training (https://sigops.org/s/conferences/sosp/2025/accepted.html) | SOSP’ 25
- DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism (https://sigops.org/s/conferences/sosp/2025/accepted.html) | SOSP’ 25
- TrainVerify: Equivalence-Based Verification for Distributed LLM Training (https://sigops.org/s/conferences/sosp/2025/accepted.html) | SOSP’ 25
- Collective Communication for 100k+ GPUs (https://arxiv.org/abs/2510.20171): Large-scale collective communication optimization for massive GPU clusters

2026

- Arena (https://arxiv.org/abs/2403.16125): Efficiently Training Large Models via Dynamic Scheduling and Adaptive Parallelism Co-Design | EuroSys’ 26
- Zeppelin (https://arxiv.org/abs/2509.21841): Balancing Variable-length Workloads in Data Parallel Large Model Training | EuroSys’ 26
- RDMA Point-to-Point Communication for LLM Systems (https://arxiv.org/abs/2510.27656): RDMA-based point-to-point communication optimization for distributed LLM systems | MLSys’ 26
- MoEBlaze (https://arxiv.org/abs/2601.05296): Breaking the Memory Wall for Efficient MoE Training on Modern GPUs | MLSys’ 26
- Kareus (https://arxiv.org/abs/2601.17654): Joint Reduction of Dynamic and Static Energy in Large Model Training
- AXLearn (https://arxiv.org/abs/2507.05411): Modular Large Model Training on Heterogeneous Infrastructure | MLSys’ 26
- MoSE (https://arxiv.org/abs/2602.06154): Mixture of Slimmable Experts for Efficient and Adaptive Language Models
- MegaScale-MoE (https://arxiv.org/abs/2505.11432): Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production | EuroSys’ 26
- MegaScale-Data (https://arxiv.org/abs/2504.09844): Scaling DataLoader for Multisource Large Foundation Model Training | EuroSys’ 26
- HetAuto (https://dl.acm.org/doi/10.1145/3767295.3803590): Cross-Cluster Auto-Parallelism for Heterogeneous Distributed Training | EuroSys’ 26
- HARP (https://dl.acm.org/doi/10.1145/3767295.3803603): Orchestrating Automated Parallel Training on Heterogeneous GPU Clusters | EuroSys’ 26
- Crimson (https://dl.acm.org/doi/10.1145/3767295.3803606): Collaborative Parameter Updates for Efficient Pipeline Training of Large Language Models | EuroSys’ 26
- Suika (https://dl.acm.org/doi/10.1145/3767295.3803623): Efficient and High-quality Re-scheduling of 3D-parallelized LLM Training Jobs in Shared Clusters | EuroSys’ 26
- Efficient and Adaptable Overlapping for Computation and Communication via Signaling and Reordering (https://dl.acm.org/doi/10.1145/3767295.3769370) | EuroSys’ 26
- BOOST (https://arxiv.org/abs/2512.12131): BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models | MLSys’ 26
- MTraining (https://arxiv.org/abs/2510.18830): Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training | MLSys’ 26
- ProTrain (https://arxiv.org/abs/2406.08334): Efficient LLM Training via Automatic Memory Management | MLSys’ 26
- DreamDDP (https://arxiv.org/abs/2502.11058): Accelerating Data Parallel Distributed LLM Training with Layer-wise Scheduled Partial Synchronization | MLSys’ 26
- Multipath Collective Communication Beyond Scale-up Networks in GPU Clouds (https://dl.acm.org/doi/10.1145/3767295.3769330) | EuroSys’ 26
- STAlloc: Enhancing Memory Efficiency in Large-Scale Model Training through Spatio-Temporal Allocation Planning (https://dl.acm.org/doi/10.1145/3767295.3769335) | EuroSys’ 26
- Maya: Optimizing Deep Learning Training Workloads using GPU Runtime Emulation (https://dl.acm.org/doi/10.1145/3767295.3769366) | EuroSys’ 26
- Bridging the GPU Utilization Gap: Predictive Multi-Dimensional Resource Scheduling for AI Workloads (https://dl.acm.org/doi/10.1145/3767295.3803579) | EuroSys’ 26
- Reducing the GPU Memory Bottleneck with Lossless Compression for ML (https://dl.acm.org/doi/10.1145/3767295.3803595) | EuroSys’ 26
- Efficient Long-Context LM Training by Core Attention Disaggregation (https://mlsys.org/virtual/2026/oral/3754) | MLSys’ 26
- Zorse: Optimizing LLM Training Efficiency on Heterogeneous GPU Clusters (https://mlsys.org/virtual/2026/poster/3636) | MLSys’ 26
- Unleashing Scalable Context Parallelism via Fully Connected Pipeline (https://mlsys.org/virtual/2026/oral/3822) | MLSys’ 26
- FlexTrain: Scalable Hybrid-Parallel Training for Long-Context LLMs (https://mlsys.org/virtual/2026/poster/3553) | MLSys’ 26
- veScale-FSDP: Flexible and High-Performance FSDP at Scale (https://mlsys.org/virtual/2026/poster/3637) | MLSys’ 26
- HexiScale: LLM Training over Heterogeneous Hardware (https://mlsys.org/virtual/2026/poster/3605) | MLSys’ 26
- FP8-Flow-MoE: Casting-Free FP8 Recipe for MoE without Double Quantization Error (https://mlsys.org/virtual/2026/oral/3737) | MLSys’ 26

#### Systems for Post-training / RLHF

Before 2024

- An Adaptive Placement and Parallelism Framework for Accelerating RLHF Training (https://arxiv.org/pdf/2312.11819) | Ant

2024

- Ymir: (https://tianweiz07.github.io/Papers/24-ics-2.pdf) A Scheduler for Foundation Model Fine-tuning Workloads in Datacenters | ICS’ 24
- HybridFlow (https://arxiv.org/pdf/2409.19256): A Flexible and Efficient RLHF Framework
- ReaLHF (https://arxiv.org/html/2406.14088v1): Optimized RLHF Training for Large Language Models through Parameter Reallocation
- NeMo-Aligner (https://arxiv.org/pdf/2405.01481): Scalable Toolkit for Efficient Model Alignment | Nvidia

2025

- RLHFuse (https://arxiv.org/abs/2409.13221): Efficient RLHF Training for Large Language Models with Inter- and Intra-Stage Fusion | NSDI’ 25
- Systems Opportunities for LLM Fine-Tuning using Reinforcement Learning (https://dl.acm.org/doi/pdf/10.1145/3721146.3721944)
- AReaL (https://arxiv.org/pdf/2505.24298): A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning | Code (https://github.com/inclusionAI/AReaL) | Ant
- StreamRL (https://arxiv.org/abs/2504.15930): Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation
- RL-Factory (https://github.com/Simple-Efficient/RL-Factory): Train your Agent model via our easy and efficient framework
- PLoRA (https://arxiv.org/pdf/2508.02932): Efficient LoRA Hyperparameter Tuning for Large Models
- History Rhymes (https://arxiv.org/abs/2508.18588): Accelerating LLM Reinforcement Learning with RhymeRL
- APRIL (https://arxiv.org/abs/2509.18521): Active Partial Rollouts in Reinforcement Learning to tame long-tail generation
- Seer (https://arxiv.org/abs/2511.14617): Online Context Learning for Fast Synchronous LLM Reinforcement Learning
- SkyRL-Agent (https://arxiv.org/abs/2511.16108): Efficient RL Training for Multi-turn LLM Agent

2026

- Laminar (https://arxiv.org/abs/2510.12633): A Scalable Asynchronous RL Post-Training Framework | EuroSys’ 26
- LoRAFusion (https://dl.acm.org/doi/10.1145/3767295.3769331): Efficient LoRA Fine-Tuning for LLMs | EuroSys’ 26
- HetRL (https://arxiv.org/abs/2512.12476): Efficient Reinforcement Learning for LLMs in Heterogeneous Environments | MLSys’ 26
- ReSpec (https://arxiv.org/abs/2510.26475): Towards Optimizing Speculative Decoding in Reinforcement Learning Systems | MLSys’ 26
- Beat the Long Tail: Distribution-Aware Speculative Decoding for Reinforcement Learning (https://mlsys.org/virtual/2026/oral/3766) | MLSys’ 26
- FLoRIST: Federated Low-Rank Adaptation with Random Subspaces for LLMs (https://mlsys.org/virtual/2026/poster/3617) | MLSys’ 26

#### Fault Tolerance / Straggler Mitigation

Before 2024

- Oobleck: (https://arxiv.org/abs/2309.08125) Resilient Distributed Training of Large Models Using Pipeline Templates | SOSP’ 23
- GEMINI: (https://dl.acm.org/doi/10.1145/3600006.3613145) Fast Failure Recovery in Distributed Training with In-Memory Checkpoints | SOSP’ 23

2024

- FALCON (https://arxiv.org/abs/2410.12588): Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training
- Malleus (https://arxiv.org/abs/2410.13333): Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization
- Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning (https://arxiv.org/abs/2408.14158) | DeepSeek SC’ 24
- Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement (https://arxiv.org/pdf/2407.04656)
- ByteCheckpoint: (https://arxiv.org/abs/2407.20143) A Unified Checkpointing System for LLM Development
- ReCycle (https://arxiv.org/pdf/2405.14009): Resilient Training of Large DNNs using Pipeline Adaptation | SOSP’ 24
- Minder (https://arxiv.org/pdf/2411.01791): Faulty Machine Detection for Large-scale Distributed Model Training | THU
- TrainMover (https://arxiv.org/pdf/2412.12636): Efficient ML Training Live Migration with No Memory Overhead | Alibaba

2025

- The Streaming Batch Model for Efficient and Fault-Tolerant Heterogeneous Execution (https://arxiv.org/pdf/2501.12407)
- Characterizing GPU Res