Louis Brandy is the Vice President of Engineering at Rockset. Prior to Rockset, Louis was Director of Engineering at Facebook. During his time there, he was an early engineer and manager in Facebook’s Site Integrity organization where his team built much of the anti-abuse infrastructure that powers Facebook’s spam fighting, fraud detection, and other online, real-time classification systems. He also worked on Facebook's RPC and service discovery ecosystem and built and supported the C++ infrastructure teams responsible for the overall health of the Facebook C++ codebase, working on compilers, sanitizers, linters, and core (and open-source) libraries like folly, jemalloc, and fbthrift.
Challenges at the intersection of ML and real-time data: lessons learned fighting spam at Facebook
Spam-fighting at scale occupies a unique niche at the intersection of real-time data infrastructure and high-powered anomaly detection and machine learning. When these disciplines collide, each presents a host of interesting new challenges to the other.
This talk draws on my experience building spam-fighting infrastructure at Facebook and working on real-time data at Rockset to walk through some of these challenges and explore some of the mistakes engineers make when crossing from one side to the other. Challenges to be discussed include:
Spam-fighting tends to require low-latency everything. Every part of a data system that supports ML needs to be designed with latency in mind.
Large volumes of continuously arriving data need to be queryable quickly. This requires streaming data to be indexed as it is ingested so that it can power ML features. Spammers act quickly, and their previous actions need to show up in the current classification.
Fast queries. Most spam is best stopped synchronously, before it’s ever written to any system. Classifications must be quick. Features need to be generated quickly, or pre-computed. This runs into the classic “materialized view” problems of traditional databases, except in an ML context.
Hybrid queries. The most valuable queries tend to combine ML or anomaly detection techniques (e.g. vector search) with traditional SQL database techniques (e.g. WHERE clauses).
Development loop. It’s always a good idea to make your development loop as tight as possible, but this is even more crucial in adversarial or time-critical situations. Every aspect of the orchestration and training of ML workflows becomes latency sensitive as well.
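To make two of these challenges concrete, here is a minimal Python sketch. It is not code from Facebook or Rockset, and every name in it (`SlidingWindowCounter`, `hybrid_query`, the `account_age_days` and `embedding` fields) is hypothetical. The first part keeps a per-user event count up to date at ingest time, so a classifier can read a pre-computed feature instead of scanning history. The second part applies a WHERE-style filter and then ranks the surviving rows by cosine similarity, a brute-force stand-in for a real vector index.

```python
from collections import defaultdict, deque
import math


class SlidingWindowCounter:
    """Per-key event count over the last `window_seconds`, updated on ingest."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # key -> timestamps, oldest first

    def _evict(self, key, now):
        # Drop timestamps that have fallen out of the window.
        timestamps = self._events[key]
        while timestamps and timestamps[0] <= now - self.window_seconds:
            timestamps.popleft()

    def record(self, key, timestamp):
        # Assumes timestamps arrive in non-decreasing order per key.
        self._events[key].append(timestamp)
        self._evict(key, timestamp)

    def count(self, key, now):
        if key not in self._events:
            return 0
        self._evict(key, now)
        return len(self._events[key])


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_query(rows, query_vector, predicate, top_k):
    """Filter rows with a WHERE-style predicate, then rank by similarity."""
    candidates = [row for row in rows if predicate(row)]
    candidates.sort(
        key=lambda row: cosine_similarity(row["embedding"], query_vector),
        reverse=True,
    )
    return candidates[:top_k]


# Feature lookup: how many actions did this user take in the last 60 seconds?
counter = SlidingWindowCounter(window_seconds=60)
for ts in (0, 10, 20, 90):
    counter.record("user_a", ts)
recent_actions = counter.count("user_a", now=100)

# Hybrid query: accounts younger than 7 days, most similar content first.
rows = [
    {"id": 1, "account_age_days": 2, "embedding": [1.0, 0.0]},
    {"id": 2, "account_age_days": 400, "embedding": [1.0, 0.0]},
    {"id": 3, "account_age_days": 1, "embedding": [0.0, 1.0]},
    {"id": 4, "account_age_days": 3, "embedding": [0.9, 0.1]},
]
matches = hybrid_query(
    rows, [1.0, 0.0], lambda row: row["account_age_days"] < 7, top_k=2
)
```

The counter does its work on the write path, which is the trade-off behind the "materialized view" comparison above: reads become cheap, but every new feature adds ingest-time cost and state to maintain.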