After earning his Ph.D. in bioinformatics from UCSF, Russell Spitzer took his love of big data to DataStax. There he has worked on all aspects of integrating Cassandra with other Apache technologies like Spark, Hadoop, and Solr. Now his main focus is on the integration of Cassandra with Apache Spark via the Spark Cassandra Connector.
In advance of his Scale By the Bay talk "To Spark or Not to Spark", we spoke with Russell about his work at DataStax, the three trends that are shaping the future of Big Data, and whether Spark may or may not be the right fit for one's data problems.
Welcome to Scale By the Bay! Please tell us more about yourself: how did you get interested in Spark and what was the turning point when you decided to join DataStax?
I decided to join DataStax in 2013. I was just finishing up my Ph.D. in Bioinformatics at UCSF and was looking into distributed computing. During my research, I was always really interested in the growth of distributed computing platforms but didn't have a lot of time to investigate them. When I finished, I knew that I was more excited about the "informatics" part than the "bio" part of my work, so I applied to work at any distributed software company that would have me.
DataStax hired me and I began learning about Apache Cassandra. After spending a few years testing and working on the integration with Apache Hadoop, we learned about a little project called Apache Spark. My team leader Piotr prototyped an integration with our enterprise offering and I was hooked.
What's your current role and what exciting things are you working on at the moment?
I currently work as a software engineer, mainly focusing on our Spark Cassandra Connector. The new DataSource V2 API that is coming out in Spark is presenting us with a lot of challenges and opportunities going forward!
What's the biggest challenge that you face in your work and how are you addressing the challenge?
One of the most difficult things I have to deal with is API compatibility. Since some of our main users are enterprise customers, it is extremely important that we maintain our API for users as much as possible. Since we are dealing with Cassandra, Scala, and Spark, this is a challenge as the various versions all change. Trying to stay compatible with everyone is very difficult.
What's the biggest thing that is misunderstood about Spark?
I think the biggest misconception is that Spark will instantly make any job faster. Like all technologies, Spark has a sweet spot where it is effective and many places where it's not. If I could boil my whole talk down to one concept: if your data is on one machine, and you can process it on one machine, use one machine.
What are the three trends that will shape the future of the industry?
The AI movement will definitely bring some big changes to our industry. I am not quite as strong a believer that it will change everything, but I think it will definitely open up a lot of new opportunities.
The number of new programmers: I'm always amazed at how young the industry is, and with the fast rise of bootcamps, the number of new programmers is growing incredibly quickly.
Privacy: I think the general public is finally learning about the vast amount of data being collected and we'll be seeing even more oversight of data collection in the future.
What will you talk about at Scale By the Bay and why did you choose to cover this subject?
I'm going to focus on why Spark may or may not be the right fit for your data problems. I think there are a lot of users out there who hear about how great Spark is (and it is great) but don't realize that it may not be great for their particular use-case. I want to help folks find the best option for them.
Who should attend your talk and what will they learn?
I'm hoping that anyone who is wondering about whether they should use Spark will stop by! Hopefully, they will leave with more confidence that they can make an informed decision.
Don't miss Russell Spitzer and his talk "To Spark or Not to Spark" at Scale By the Bay on Thursday, November 14th. Book your ticket now.