Systems, Frameworks & Performance

  • Compute Trends across Three Eras of Machine Learning — Sevilla et al., 2022. arXiv
  • Scaling Laws for Neural Language Models — Kaplan et al., 2020. arXiv (the paper's fitted power laws are restated after this list)
  • Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance — Chowdhery et al., 2022. Blog
  • LLaMA: Open and Efficient Foundation Language Models — Touvron et al., 2023. arXiv
  • The Power of Scale for Parameter-Efficient Prompt Tuning — Lester, Al-Rfou & Constant, 2021. arXiv
  • Best Practices and Lessons Learned on Synthetic Data for Language Models — Liu et al., 2024. arXiv
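
Of the entries above, Kaplan et al. (2020) is the one whose headline result fits on a line. For quick reference, the paper's separate power laws for model size N (non-embedding parameters) and dataset size D (tokens), with its fitted constants — approximate, and specific to that paper's training setup:

```latex
% Loss as a power law in parameters N and tokens D
% (fitted constants from Kaplan et al., 2020; approximate):
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N},
  \quad \alpha_N \approx 0.076,\; N_c \approx 8.8 \times 10^{13}
\qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D},
  \quad \alpha_D \approx 0.095,\; D_c \approx 5.4 \times 10^{13}
```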

Distributed Training & Parallelism

  • Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference — Kundu et al., 2024. arXiv
  • Efficient Training of Large Language Models on Distributed Infrastructures: A Survey — Zhang et al., 2024. arXiv
  • nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training — Lin et al., OSDI 2024. USENIX
  • MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale — Choudhury et al., OSDI 2024. USENIX
  • Metis: Fast Automatic Distributed Training on Heterogeneous GPUs — Um et al., USENIX ATC 2024. USENIX
  • Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism — Yuan et al., USENIX ATC 2024. USENIX
  • AMSP: Reducing Communication Overhead of ZeRO for Efficient LLM Training — Zheng et al., 2023. arXiv (a back-of-envelope sketch of ZeRO's memory accounting follows this list)
  • FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences — Xu et al., USENIX ATC 2024. USENIX
  • Ray: A Distributed Framework for Emerging AI Applications — Moritz et al., OSDI 2018. USENIX
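
Several entries in this group (AMSP in particular) build on ZeRO-style sharding of model states across the data-parallel group. A back-of-envelope sketch of that accounting, following the ZeRO paper's numbers for mixed-precision Adam (2 bytes of fp16 weights + 2 bytes of fp16 gradients + 12 bytes of fp32 optimizer state per parameter); this is illustrative arithmetic, not code from any listed system:

```python
def zero_memory_per_gpu(params: float, dp_degree: int, stage: int) -> float:
    """Model-state memory per GPU in bytes for mixed-precision Adam,
    following the accounting in the ZeRO paper (Rajbhandari et al., 2020):
    2 bytes fp16 weights + 2 bytes fp16 grads + 12 bytes fp32 optimizer
    state (master weights, momentum, variance) per parameter."""
    weights, grads, optim = 2 * params, 2 * params, 12 * params
    if stage >= 1:                 # ZeRO-1: shard optimizer states
        optim /= dp_degree
    if stage >= 2:                 # ZeRO-2: also shard gradients
        grads /= dp_degree
    if stage >= 3:                 # ZeRO-3: also shard the weights
        weights /= dp_degree
    return weights + grads + optim

# e.g. a 7B-parameter model on 64 data-parallel GPUs:
for s in range(4):
    gib = zero_memory_per_gpu(7e9, 64, s) / 2**30
    print(f"ZeRO stage {s}: {gib:,.1f} GiB per GPU")
```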

Serving, Inference & Memory Efficiency

  • PagedAttention: Efficient Memory Management for Large Language Model Serving — Kwon et al., SOSP 2023. ACM DL (a toy sketch of the paging idea follows this list)
  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Dao et al., NeurIPS 2022. arXiv
  • Sarathi-Serve: Taming Throughput-Latency Tradeoff in LLM Inference — Agrawal et al., OSDI 2024. USENIX
  • DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving — Zhong et al., OSDI 2024. USENIX
  • Llumnix: Dynamic Scheduling for Large Language Model Serving — Sun et al., OSDI 2024. USENIX
  • InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management — Lee et al., OSDI 2024. USENIX
  • ServerlessLLM: Low-Latency Serverless Inference for Large Language Models — Fu et al., OSDI 2024. USENIX
  • Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead — Brüel-Gabrielsson et al., 2024. arXiv
  • Cost-Efficient Large Language Model Serving for Multi-Turn Conversations with CachedAttention — Gao et al., USENIX ATC 2024. USENIX
  • Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs — Xia et al., USENIX ATC 2024. USENIX
  • StreamBox: A Lightweight GPU Sandbox for Serverless Inference Workflow — Wu et al., USENIX ATC 2024. USENIX
  • Compute-Optimal Inference for Problem-Solving with Language Models — Wu et al., 2024. arXiv
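
The core idea behind the PagedAttention entry above is OS-style paging applied to the KV cache: fixed-size blocks allocated on demand from a shared pool, instead of one contiguous max-length reservation per request. A toy Python sketch of that bookkeeping (not vLLM's actual implementation; the class and method names here are invented for illustration):

```python
class PagedKVCache:
    """Toy illustration of the PagedAttention idea (Kwon et al., SOSP 2023):
    each sequence's KV cache lives in fixed-size blocks drawn on demand
    from a shared pool, so memory is committed per block rather than for
    the full maximum sequence length."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # shared physical pool
        self.block_tables = {}                       # seq_id -> [block ids]
        self.lengths = {}                            # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; returns (block, offset)."""
        n = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:                 # current block is full
            if not self.free_blocks:
                raise MemoryError("pool exhausted; evict or preempt")
            table.append(self.free_blocks.pop())     # allocate on demand
        self.lengths[seq_id] = n + 1
        return table[-1], n % self.block_size

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):          # 20 tokens occupy 2 blocks, not a max-length slab
    block, offset = cache.append_token(seq_id=0)
cache.free(0)                # all blocks return to the pool on completion
```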

Scheduling & Resource Management

  • Centimani: Enabling Fast AI Accelerator Selection for DNN Training with a Novel Performance Predictor — Xie et al., USENIX ATC 2024. USENIX (a toy cost-arithmetic sketch follows this list)
  • Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement — Graur et al., USENIX ATC 2024. USENIX
  • MuxFlow: Efficient and Safe GPU Sharing in Large-Scale Production Deep Learning Clusters — Zhao et al., 2023. arXiv
  • Caravan: Practical Online Learning of In-Network ML Models with Labeling Agents — Zhang et al., OSDI 2024. USENIX
  • The Infrastructure Powering IBM's Gen AI Model Development — Gershon et al., 2024. arXiv
  • ChameleonAPI: Automatic and Efficient Customization of Neural Networks for ML Applications — Liu et al., OSDI 2024. USENIX
  • PUZZLE: Efficiently Aligning Large Language Models through Lightweight Context Switch — Lei et al., USENIX ATC 2024. USENIX
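
Centimani's contribution is a performance predictor for accelerator selection; what you do with such a prediction is simple cost arithmetic. A toy sketch of that last step, with the predictor assumed given and all device names and prices hypothetical:

```python
def pick_accelerator(predicted_step_time_s: dict, hourly_price_usd: dict,
                     total_steps: int):
    """Toy cost-based selection in the spirit of Centimani (Xie et al.,
    ATC 2024): given a predicted step time per accelerator, choose the
    device that minimizes dollar cost for the whole job. The prediction
    itself (Centimani's actual contribution) is taken as input here."""
    best = None
    for accel, step_s in predicted_step_time_s.items():
        hours = step_s * total_steps / 3600
        cost = hours * hourly_price_usd[accel]
        if best is None or cost < best[1]:
            best = (accel, cost, hours)
    return best

# Hypothetical step times and prices, purely illustrative:
choice = pick_accelerator(
    predicted_step_time_s={"A100": 0.42, "H100": 0.25, "TPUv4": 0.30},
    hourly_price_usd={"A100": 3.0, "H100": 6.0, "TPUv4": 3.2},
    total_steps=1_000_000,
)
print(choice)   # -> (accelerator, total cost in USD, wall-clock hours)
```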

Additional References

  • PAL: Program-Aided Language Models — Gao et al., ICML 2023. PMLR (a minimal executor sketch follows this list)
  • When Will My ML Job Finish? Toward Providing Completion Time Estimates through Predictability-Centric Scheduling — Faisal et al., OSDI 2024. USENIX
  • See Serving, Inference & Memory Efficiency for complementary deployment patterns such as Compress then Serve, ServerlessLLM, CachedAttention, and Quant-LLM.
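
The PAL entry above is the one algorithmic technique in this group: the model emits a Python program for the reasoning steps and an interpreter executes it, so the arithmetic is done by the runtime rather than the model. A minimal sketch of the executor side (the helper name and the answer-variable convention are invented here; PAL's actual prompts typically wrap the program in a solution() function, and any real deployment would sandbox execution):

```python
import contextlib
import io

def run_program_aided(generated_program: str) -> str:
    """Toy executor for the PAL pattern (Gao et al., ICML 2023): run the
    model-written program and read back its result. exec() is UNSAFE on
    untrusted model output; real systems sandbox this step."""
    scope: dict = {}
    with contextlib.redirect_stdout(io.StringIO()):  # silence stray prints
        exec(generated_program, scope)
    return str(scope.get("answer", "no `answer` variable set"))

# A program a model might emit for "Roger has 5 balls, buys 2 cans of 3":
program = """
balls = 5
balls += 2 * 3
answer = balls
"""
print(run_program_aided(program))   # -> 11
```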