Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference — Kundu et al., 2024. arXiv
Efficient Training of Large Language Models on Distributed Infrastructures: A Survey — Zhang et al., 2024. arXiv
nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training — Lin et al., OSDI 2024. USENIX
MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale — Choudhury et al., OSDI 2024. USENIX
Metis: Fast Automatic Distributed Training on Heterogeneous GPUs — Um et al., USENIX ATC 2024. USENIX
Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism — Yuan et al., USENIX ATC 2024. USENIX
AMSP: Reducing Communication Overhead of ZeRO for Efficient LLM Training — Zheng et al., 2023. arXiv
FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences — Xu et al., USENIX ATC 2024. USENIX
Ray: A Distributed Framework for Emerging AI Applications — Moritz et al., OSDI 2018. USENIX
Efficient Memory Management for Large Language Model Serving with PagedAttention — Kwon et al., SOSP 2023. ACM DL
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Dao et al., NeurIPS 2022. arXiv
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve — Agrawal et al., OSDI 2024. USENIX
DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving — Zhong et al., OSDI 2024. USENIX
Llumnix: Dynamic Scheduling for Large Language Model Serving — Sun et al., OSDI 2024. USENIX
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management — Lee et al., OSDI 2024. USENIX
ServerlessLLM: Low-Latency Serverless Inference for Large Language Models — Fu et al., OSDI 2024. USENIX
Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead — Brüel-Gabrielsson et al., 2024. arXiv
Cost-Efficient Large Language Model Serving for Multi-Turn Conversations with CachedAttention — Gao et al., USENIX ATC 2024. USENIX
Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs — Xia et al., USENIX ATC 2024. USENIX
StreamBox: A Lightweight GPU Sandbox for Serverless Inference Workflow — Wu et al., USENIX ATC 2024. USENIX
Compute-Optimal Inference for Problem-Solving with Language Models — Wu et al., 2024. arXiv
PAL: Program-Aided Language Models — Gao et al., ICML 2023. PMLR
When Will My ML Job Finish? Toward Providing Completion Time Estimates through Predictability-Centric Scheduling — Faisal et al., OSDI 2024. USENIX
See the Serving, Inference & Memory Efficiency section for complementary deployment patterns such as Compress then Serve, ServerlessLLM, CachedAttention, and Quant-LLM.