Riley is a born-and-raised Vancouverite. Having grown up in Canada, he has hockey in his DNA. A former Data Scientist and now an ML Platform Engineer, he recognizes the value MLOps adds in boosting the productivity of Data Science teams. He currently works at Autodesk, where he is part of the team responsible for building a unified Machine Learning Platform from the ground up. He leads the effort to build a managed training infrastructure, including capabilities for large-scale distributed training, HPC, and workflow orchestration.
Lessons learned from orchestrating large-scale GenAI, ML, and data on Kubernetes.
Metaflow, a Python framework for ML/AI infrastructure originally open-sourced by Netflix in 2019, has come a long way. From its AWS-native roots, it has expanded to support all major clouds and on-prem deployments powered by Kubernetes. Recently, Metaflow gained support for MPI-style parallel compute and distributed training with PyTorch and Ray, opening up many new use cases around GenAI that require clusters of up to hundreds of GPUs.
In this infrastructure-focused talk, we share a number of lessons learned from a diverse set of data-intensive workloads. This session should be informative for platform engineers, data scientists, and ML engineers curious about ML infrastructure, and for anyone interested in building real-world, production-grade ML/AI systems.