In this post, we are happy to explore human-centric machine-learning infrastructure with Outerbounds, our Gold sponsor. As pioneers in the field, Outerbounds has revolutionized data science and AI with Metaflow, a tool that simplifies complex ML infrastructures, allowing experts to focus on innovation and problem-solving.
In 2023, with the rise of GenAI, Outerbounds positioned Metaflow to support these advanced workloads. Their collaboration with Autodesk on the Autodesk Machine Learning Platform (AMP) highlights their commitment to advancing AI capabilities. This interview will delve into how Metaflow is shaping the future of ML and GenAI, tackling the industry's most pressing challenges.
Join us as we uncover the insights and strategies of Outerbounds in this ever-evolving tech landscape.
What is human-centric ML infrastructure, in your opinion? How does Metaflow help achieve this?
Modern data, ML, and AI apps require a thick stack of infrastructure, ranging from data warehouse integrations and scalable compute layers to workflow orchestration and versioning. Metaflow provides one human-friendly API to the full stack, which allows data scientists to focus on modeling and business logic, and engineers to provide stable infrastructure. We want to enable data scientists and machine learning engineers to deliver business value by doing what they do best and using the tools they love while having easy access to all the infrastructure they need.
In 2023, GenAI took over, and it now has the attention of a very broad audience. Do Metaflow and Outerbounds support GenAI workloads?
Absolutely. With Metaflow and Outerbounds, data scientists and MLEs can experiment, innovate, and develop AI-powered software using cutting-edge foundation models. You can train and fine-tune your models on GPU clusters and deploy them to production with ease. Look no further than Autodesk’s Riley Hun, whom we’ll be speaking with at Scale By the Bay (SBTB).
Riley has been at the forefront of using Metaflow at Autodesk to build the Autodesk Machine Learning Platform (AMP). Autodesk is a global software provider renowned for its design solutions across various industries, including architecture, manufacturing, education, 3D art, entertainment, and more.
AMP at Autodesk has not only used Metaflow to manage typical ML workflows but has also explored its integration with Ray, a distributed computing framework. This allows users to create Ray clusters on AWS Batch using multi-node parallel jobs. The results of these tests show the potential for scaling training jobs across GPU nodes effectively, enabling distributed training with tools such as PyTorch, TensorFlow, Hugging Face, and DeepSpeed. You can find out more about this in Riley’s talk below:
What are the biggest challenges in ML and GenAI infrastructure today?
One of the biggest challenges is that compute needs are becoming more heterogeneous. There are many different types of workloads: some need GPUs, some need distributed training, and all kinds of data processing workloads still need CPUs. A key question is how to serve all of them using open-source software, Kubernetes, and cloud resources.
Metaflow recognizes the fact that compute never exists in isolation: functions are not islands; they form interconnected workflows. They don’t spin cycles in a void; they process data that needs to flow through functions effortlessly. And you want to track, debug, and observe everything along the way.
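As a rough illustration of the idea above (steps that form a workflow, pass data to one another, and leave a trace you can inspect), here is a minimal Python sketch. It does not use Metaflow and is not Metaflow's API: `Step`, `Run`, and `execute` are hypothetical names invented for this example, and the `cpu` and `gpu` fields only stand in for the per-step resource requests a real orchestrator would act on.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Step:
    """One unit of work plus the compute it asks for (hypothetical fields)."""
    name: str
    func: Callable[[Dict[str, Any]], Dict[str, Any]]
    cpu: int = 1
    gpu: int = 0


@dataclass
class Run:
    """Keeps every artifact and a log line per step so the run can be inspected."""
    artifacts: Dict[str, Any] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)


def execute(steps: List[Step]) -> Run:
    """Run steps in order; each step reads earlier artifacts and adds its own."""
    run = Run()
    for step in steps:
        produced = step.func(dict(run.artifacts))  # pass a copy of the data so far
        run.artifacts.update(produced)
        run.log.append(
            f"{step.name} cpu={step.cpu} gpu={step.gpu} -> {sorted(produced)}"
        )
    return run


def load(artifacts: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for reading from a data warehouse.
    return {"rows": [1.0, 2.0, 3.0, 4.0]}


def train(artifacts: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for a GPU training job: the "model" is just the mean.
    rows = artifacts["rows"]
    return {"model_mean": sum(rows) / len(rows)}


def evaluate(artifacts: Dict[str, Any]) -> Dict[str, Any]:
    # Mean absolute error of the toy model on the same rows.
    errors = [abs(r - artifacts["model_mean"]) for r in artifacts["rows"]]
    return {"mae": sum(errors) / len(errors)}


run = execute([
    Step("load", load, cpu=2),
    Step("train", train, cpu=4, gpu=1),
    Step("evaluate", evaluate),
])

for line in run.log:
    print(line)
```

The point of the sketch is only the shape: heterogeneous steps declare what they need, data moves between them as named artifacts, and the run keeps a record. A production framework adds scheduling, persistence, and versioning on top of this.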
Having recently released Metaflow extensions for Ray, DeepSpeed, PyTorch, TensorFlow, and MPI, we’ve gained significant insight into a diverse set of compute-heavy and data-intensive workloads, which inspired the talk we’re giving at SBTB. We hope to see you there!
In their talk, "Lessons Learned from Orchestrating Large-Scale GenAI, ML, and Data on Kubernetes", Oleg Avdeëv and Riley Hun will share insights from their extensive experience with Metaflow. This Python framework, originally developed at Netflix and open-sourced in 2019, has evolved considerably. It has moved beyond its AWS-native origins to support all major cloud platforms and on-premises deployments via Kubernetes. Recent enhancements in Metaflow, including support for MPI-style parallel computing and distributed training with tools like PyTorch and Ray, have significantly broadened its capabilities, especially for complex GenAI workloads that demand extensive GPU resources. The talk offers platform engineers, data scientists, and ML engineers a deep dive into the intricacies of ML infrastructure and the practicalities of deploying real-world, production-grade ML/AI systems. Their presentation is scheduled for November 14 at 11:45 AM in the Air Room on the 4th floor.
More insights are available in the video:
Be sure to visit Outerbounds' sponsor booth on the 14th and 15th of November. There, you can delve deeper into their ML and AI work, ask questions after the presentation, and interact with their team of experts. The booth offers more than just information: expect interactive displays and live demos that showcase the forefront of ML and AI technologies.
See you by the Bay!