Systems, Frameworks & Performance

  • Compute Trends across Three Eras of Machine Learning — Sevilla et al., 2022. arXiv
  • Scaling Laws for Neural Language Models — Kaplan et al., 2020. arXiv (the paper's fitted power laws are restated after this list)
  • Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance — Chowdhery et al., 2022. Blog
  • LLaMA: Open and Efficient Foundation Language Models — Touvron et al., 2023. arXiv
  • The Power of Scale for Parameter-Efficient Prompt Tuning — Lester, Al-Rfou & Constant, 2021. arXiv
  • Best Practices and Lessons Learned on Synthetic Data for Language Models — Liu et al., 2024. arXiv
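
Of the entries above, Kaplan et al. (2020) is the one whose headline result fits on a line. For quick reference, the paper's separate power laws for model size N (non-embedding parameters) and dataset size D (tokens), with its fitted constants — approximate, and specific to that paper's training setup:

```latex
% Loss as a power law in parameters N and tokens D
% (fitted constants from Kaplan et al., 2020; approximate):
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N},
  \quad \alpha_N \approx 0.076,\; N_c \approx 8.8 \times 10^{13}
\qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D},
  \quad \alpha_D \approx 0.095,\; D_c \approx 5.4 \times 10^{13}
```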

Distributed Training & Parallelism

  • Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference — Kundu et al., 2024. arXiv
  • Efficient Training of Large Language Models on Distributed Infrastructures: A Survey — Zhang et al., 2024. arXiv
  • nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training — Lin et al., OSDI 2024. USENIX
  • MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale — Choudhury et al., OSDI 2024. USENIX
  • Metis: Fast Automatic Distributed Training on Heterogeneous GPUs — Um et al., USENIX ATC 2024. USENIX
  • Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism — Yuan et al., USENIX ATC 2024. USENIX
  • AMSP: Reducing Communication Overhead of ZeRO for Efficient LLM Training — Zheng et al., 2023. arXiv (a back-of-envelope sketch of ZeRO's memory accounting follows this list)
  • FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences — Xu et al., USENIX ATC 2024. USENIX
  • Ray: A Distributed Framework for Emerging AI Applications — Moritz et al., OSDI 2018. USENIX
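
Several entries in this group (AMSP in particular) build on ZeRO-style sharding of model states across the data-parallel group. A back-of-envelope sketch of that accounting, following the ZeRO paper's numbers for mixed-precision Adam (2 bytes of fp16 weights + 2 bytes of fp16 gradients + 12 bytes of fp32 optimizer state per parameter); this is illustrative arithmetic, not code from any listed system:

```python
def zero_memory_per_gpu(params: float, dp_degree: int, stage: int) -> float:
    """Model-state memory per GPU in bytes for mixed-precision Adam,
    following the accounting in the ZeRO paper (Rajbhandari et al., 2020):
    2 bytes fp16 weights + 2 bytes fp16 grads + 12 bytes fp32 optimizer
    state (master weights, momentum, variance) per parameter."""
    weights, grads, optim = 2 * params, 2 * params, 12 * params
    if stage >= 1:                 # ZeRO-1: shard optimizer states
        optim /= dp_degree
    if stage >= 2:                 # ZeRO-2: also shard gradients
        grads /= dp_degree
    if stage >= 3:                 # ZeRO-3: also shard the weights
        weights /= dp_degree
    return weights + grads + optim

# e.g. a 7B-parameter model on 64 data-parallel GPUs:
for s in range(4):
    gib = zero_memory_per_gpu(7e9, 64, s) / 2**30
    print(f"ZeRO stage {s}: {gib:,.1f} GiB per GPU")
```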

Serving, Inference & Memory Efficiency

  • PagedAttention: Efficient Memory Management for Large Language Model Serving — Kwon et al., SOSP 2023. ACM DL (a toy sketch of the paging idea follows this list)
  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Dao et al., NeurIPS 2022. arXiv
  • Sarathi-Serve: Taming Throughput-Latency Tradeoff in LLM Inference — Agrawal et al., OSDI 2024. USENIX
  • DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving — Zhong et al., OSDI 2024. USENIX
  • Llumnix: Dynamic Scheduling for Large Language Model Serving — Sun et al., OSDI 2024. USENIX
  • InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management — Lee et al., OSDI 2024. USENIX
  • ServerlessLLM: Low-Latency Serverless Inference for Large Language Models — Fu et al., OSDI 2024. USENIX
  • Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead — Brüel-Gabrielsson et al., 2024. arXiv
  • Cost-Efficient Large Language Model Serving for Multi-Turn Conversations with CachedAttention — Gao et al., USENIX ATC 2024. USENIX
  • Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs — Xia et al., USENIX ATC 2024. USENIX
  • StreamBox: A Lightweight GPU Sandbox for Serverless Inference Workflow — Wu et al., USENIX ATC 2024. USENIX
  • Compute-Optimal Inference for Problem-Solving with Language Models — Wu et al., 2024. arXiv
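
The core idea behind the PagedAttention entry above is OS-style paging applied to the KV cache: fixed-size blocks allocated on demand from a shared pool, instead of one contiguous max-length reservation per request. A toy Python sketch of that bookkeeping (not vLLM's actual implementation; the class and method names here are invented for illustration):

```python
class PagedKVCache:
    """Toy illustration of the PagedAttention idea (Kwon et al., SOSP 2023):
    each sequence's KV cache lives in fixed-size blocks drawn on demand
    from a shared pool, so memory is committed per block rather than for
    the full maximum sequence length."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # shared physical pool
        self.block_tables = {}                       # seq_id -> [block ids]
        self.lengths = {}                            # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; returns (block, offset)."""
        n = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:                 # current block is full
            if not self.free_blocks:
                raise MemoryError("pool exhausted; evict or preempt")
            table.append(self.free_blocks.pop())     # allocate on demand
        self.lengths[seq_id] = n + 1
        return table[-1], n % self.block_size

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):          # 20 tokens occupy 2 blocks, not a max-length slab
    block, offset = cache.append_token(seq_id=0)
cache.free(0)                # all blocks return to the pool on completion
```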

Scheduling & Resource Management

  • Centimani: Enabling Fast AI Accelerator Selection for DNN Training with a Novel Performance Predictor — Xie et al., USENIX ATC 2024. USENIX (a toy cost-arithmetic sketch follows this list)
  • Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement — Graur et al., USENIX ATC 2024. USENIX
  • MuxFlow: Efficient and Safe GPU Sharing in Large-Scale Production Deep Learning Clusters — Zhao et al., 2023. arXiv
  • Caravan: Practical Online Learning of In-Network ML Models with Labeling Agents — Zhang et al., OSDI 2024. USENIX
  • The Infrastructure Powering IBM's Gen AI Model Development — Gershon et al., 2024. arXiv
  • ChameleonAPI: Automatic and Efficient Customization of Neural Networks for ML Applications — Liu et al., OSDI 2024. USENIX
  • PUZZLE: Efficiently Aligning Large Language Models through Lightweight Context Switch — Lei et al., USENIX ATC 2024. USENIX
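
Centimani's contribution is a performance predictor for accelerator selection; what you do with such a prediction is simple cost arithmetic. A toy sketch of that last step, with the predictor assumed given and all device names and prices hypothetical:

```python
def pick_accelerator(predicted_step_time_s: dict, hourly_price_usd: dict,
                     total_steps: int):
    """Toy cost-based selection in the spirit of Centimani (Xie et al.,
    ATC 2024): given a predicted step time per accelerator, choose the
    device that minimizes dollar cost for the whole job. The prediction
    itself (Centimani's actual contribution) is taken as input here."""
    best = None
    for accel, step_s in predicted_step_time_s.items():
        hours = step_s * total_steps / 3600
        cost = hours * hourly_price_usd[accel]
        if best is None or cost < best[1]:
            best = (accel, cost, hours)
    return best

# Hypothetical step times and prices, purely illustrative:
choice = pick_accelerator(
    predicted_step_time_s={"A100": 0.42, "H100": 0.25, "TPUv4": 0.30},
    hourly_price_usd={"A100": 3.0, "H100": 6.0, "TPUv4": 3.2},
    total_steps=1_000_000,
)
print(choice)   # -> (accelerator, total cost in USD, wall-clock hours)
```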

Additional References

  • PAL: Program-Aided Language Models — Gao et al., ICML 2023. PMLR (a minimal executor sketch follows this list)
  • When Will My ML Job Finish? Toward Providing Completion Time Estimates through Predictability-Centric Scheduling — Faisal et al., OSDI 2024. USENIX
  • See Serving, Inference & Memory Efficiency for complementary deployment patterns such as Compress then Serve, ServerlessLLM, CachedAttention, and Quant-LLM.
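
The PAL entry above is the one algorithmic technique in this group: the model emits a Python program for the reasoning steps and an interpreter executes it, so the arithmetic is done by the runtime rather than the model. A minimal sketch of the executor side (the helper name and the answer-variable convention are invented here; PAL's actual prompts typically wrap the program in a solution() function, and any real deployment would sandbox execution):

```python
import contextlib
import io

def run_program_aided(generated_program: str) -> str:
    """Toy executor for the PAL pattern (Gao et al., ICML 2023): run the
    model-written program and read back its result. exec() is UNSAFE on
    untrusted model output; real systems sandbox this step."""
    scope: dict = {}
    with contextlib.redirect_stdout(io.StringIO()):  # silence stray prints
        exec(generated_program, scope)
    return str(scope.get("answer", "no `answer` variable set"))

# A program a model might emit for "Roger has 5 balls, buys 2 cans of 3":
program = """
balls = 5
balls += 2 * 3
answer = balls
"""
print(run_program_aided(program))   # -> 11
```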