
Jonas Mueller: Keys to Data-Centric AI: Asking AI to Improve its own Dataset

Updated: Jul 13, 2023

Jonas Mueller is Chief Scientist and Co-Founder at Cleanlab, a company providing data-centric AI software that improves datasets via automation. Previously, he was a senior scientist at Amazon Web Services, developing AutoML and Deep Learning algorithms that power ML applications at hundreds of the largest companies. Before that, he completed his PhD in Machine Learning at MIT.

Jonas has published over 30 papers in top ML and Data Science venues (NeurIPS, ICML, ICLR, JASA, Annals of Statistics, etc.). This research has been featured in Wired, VentureBeat, Technology Review, World Economic Forum, and other media. He helped create the fastest-growing open-source software for AutoML and Data-Centric AI, and at MIT he taught the first-ever course on data-centric AI.

Keys to Data-Centric AI: Asking AI to Improve its own Dataset

In machine learning projects, one typically starts by exploring the data and training an initial baseline model. While it’s tempting to move straight on to different modeling techniques, the emerging science of data-centric AI offers systematic techniques that use the baseline model to find and fix dataset issues. Improving the dataset in this manner can drastically improve the initial model’s performance without any change to the modeling code! These techniques work with any ML model, and the improved dataset can be used to train any type of model. Such automated data improvement has been instrumental to the success of AI organizations like OpenAI and Tesla.
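To make the idea of using the baseline model to find dataset issues concrete, here is a minimal sketch of one common approach to flagging likely label errors. It is an illustration, not necessarily the method presented in the talk. It assumes you already have out-of-sample predicted class probabilities from the baseline model (for example, from cross-validation); the function name `find_likely_label_errors` and the toy data are hypothetical.

```python
from collections import defaultdict


def find_likely_label_errors(labels, pred_probs):
    """Return indices of examples whose given label looks suspicious.

    labels: list of integer class ids (the given, possibly noisy labels).
    pred_probs: list of per-class probability lists, assumed to be
        out-of-sample predictions from the baseline model.
    Indices are ranked with the least trusted label first.
    """
    # Per-class threshold: the average probability the model assigns to
    # class k over the examples that are labeled k.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for y, p in zip(labels, pred_probs):
        sums[y] += p[y]
        counts[y] += 1
    thresholds = {k: sums[k] / counts[k] for k in counts}

    flagged = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes the model predicts at or above that class's threshold.
        confident = [k for k in thresholds if p[k] >= thresholds[k]]
        # Flag the example if the model is confident about some class
        # but the given label is not among those classes.
        if confident and y not in confident:
            flagged.append((p[y], i))

    # Lowest probability for the given label comes first.
    flagged.sort()
    return [i for _, i in flagged]


# Toy data (hypothetical): example 2 is labeled 0, but the model
# strongly predicts class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]
suspects = find_likely_label_errors(labels, pred_probs)
```

Because the function only needs labels and predicted probabilities, it is independent of the kind of model that produced them. Flagged examples are candidates for review or relabeling rather than confirmed errors.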

This talk shows how data-centric AI can be operationalized across a wide variety of datasets (image, text, tabular, etc.). I will introduce novel algorithms that automatically detect common issues in real-world data, including label errors, bad data annotators, outliers, and ambiguous examples. Once identified, such dataset problems can be easily addressed to significantly improve trained models. Thousands of data scientists have started using data-centric AI software implementing such principles, and results from a few case studies will be presented.
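As a rough illustration of what detecting bad data annotators can look like, the sketch below scores each annotator by how often they agree with the majority label on examples labeled by several people. This is a simple baseline chosen for illustration, not the algorithm from the talk; the function name `annotator_agreement` and the toy annotations are hypothetical.

```python
from collections import Counter, defaultdict


def annotator_agreement(annotations):
    """Score each annotator by agreement with the majority label.

    annotations: list with one dict per example, mapping
        annotator id -> label given by that annotator.
    Returns a dict mapping annotator id -> fraction of their labels
    that match the majority label. Examples with a single annotator
    or a tied vote are skipped.
    """
    agree = defaultdict(int)
    total = defaultdict(int)
    for votes in annotations:
        if len(votes) < 2:
            continue
        counts = Counter(votes.values())
        top_label, top_count = counts.most_common(1)[0]
        # Skip examples where the top vote count is shared (a tie).
        if list(counts.values()).count(top_count) > 1:
            continue
        for annotator, label in votes.items():
            total[annotator] += 1
            if label == top_label:
                agree[annotator] += 1
    return {a: agree[a] / total[a] for a in total}


# Toy data (hypothetical): annotator "c" often disagrees with the others.
annotations = [
    {"a": "cat", "b": "cat", "c": "dog"},
    {"a": "dog", "b": "dog", "c": "cat"},
    {"a": "cat", "b": "cat", "c": "cat"},
    {"a": "dog", "c": "cat"},  # tied vote, skipped
]
scores = annotator_agreement(annotations)
worst = min(scores, key=scores.get)
```

A low score marks an annotator whose labels deserve a closer look. It does not prove the annotator is wrong, since the majority can also be mistaken on ambiguous examples.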

