Boyang Jerry Peng: Latency goes sub-second in Apache Spark Structured Streaming

Anastasia Khomyakova
Nov 10, 2023

Updated: Dec 7, 2023

Boyang Jerry Peng is currently a Staff Engineer at Databricks, working extensively on Apache Spark Structured Streaming. Before joining Databricks, he was a Principal Software Engineer at Splunk, working on streaming and messaging projects, especially with Apache Pulsar. Jerry is a committer and PMC member of the Apache Pulsar, Apache Storm, and Apache Heron projects. Before Splunk, he worked at Streamlio (acquired by Splunk), Citadel, and Yahoo on distributed systems and stream processing. Jerry has been working in the area of distributed systems and stream processing since his days in grad school at the University of Illinois, Urbana-Champaign.

Latency goes sub-second in Apache Spark Structured Streaming.

Apache Spark Structured Streaming is the leading open-source stream processing platform. It is also the core technology that powers streaming on the Databricks Lakehouse Platform, and it provides a unified API for batch and stream processing. As adoption of streaming grows rapidly, diverse applications want to take advantage of it for real-time decision-making. While Spark's design enables high throughput and ease of use at a lower cost, it has not been optimized for sub-second latency.

In this talk, we will focus on the improvements we have made around offset management to lower the inherent processing latency of Structured Streaming. These improvements primarily target simple, stateless operational use cases such as real-time monitoring and alerting. Extensive evaluation of these enhancements indicates that latency has improved by 68-75% (or as much as 3X), from 700-900 ms to 150-250 ms, at throughputs of 100K events/sec, 500K events/sec, and 1M events/sec.

The entire day's conference presentations are available here: