Oleg Avdeëv is a co-founder of Outerbounds, a company building a modern, human-centric ML infrastructure stack based on the open-source tool Metaflow. A startup veteran, he spent most of his career before Outerbounds either getting ML from zero to one at companies like Headspace and Alpine.AI, or building tools that let data scientists do that themselves, most recently at Tecton.
Lessons learned from orchestrating large-scale GenAI, ML, and data on Kubernetes
Metaflow, a Python framework for ML/AI infrastructure originally open-sourced by Netflix in 2019, has come a long way. From its AWS-native roots, it has expanded to support all major clouds as well as on-prem deployments powered by Kubernetes. Recently, Metaflow gained support for MPI-style parallel compute and distributed training with PyTorch and Ray, opening up many new GenAI use cases that can require clusters of hundreds of GPUs.
In this infrastructure-focused talk, we share a number of lessons learned from a diverse set of data-intensive workloads. The session should be informative for platform engineers, for data scientists and ML engineers curious about ML infrastructure, and for anyone interested in building real-world, production-grade ML/AI systems.