This project is best explained by the presentation below. If you want to learn how to run quantitative analytics at scale, it’s well worth a watch.
Our team recently completed a challenging yet rewarding project: building a scalable and portable risk engine using Apache Beam and Google Cloud Dataflow. This project allowed us to delve deeper into distributed computing and explore the practical application of these technologies in the financial domain.
The Goal:
Our primary objective was to create a risk engine capable of handling large volumes of data and complex calculations efficiently. We wanted to leverage the power of cloud computing for scalability and explore the portability benefits of Apache Beam. This project was driven by our shared interest in learning and applying these technologies.
Modernizing the Traditional Approach:
Traditional risk engines often struggle with scalability and performance limits. We aimed to modernize that architecture using a distributed computing approach, so the engine could handle growing data volumes and increasingly complex risk models.
Why Apache Beam and Dataflow?
- Apache Beam: Beam’s unified programming model and support for multiple runners (including Dataflow, Flink, and Spark) made it an ideal choice for portability. Its DoFn abstraction gave us a structured way to define per-element processing logic (see the sketch after this list).
- Google Cloud Dataflow: Dataflow’s fully managed service and integration with the Google Cloud ecosystem offered a convenient and scalable platform for running our pipeline.
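To make the DoFn idea concrete, here is a minimal sketch of the kind of per-element transform we have in mind. The element types and CSV layout are hypothetical, not our production schema.

```java
import org.apache.beam.sdk.transforms.DoFn;

// Minimal DoFn sketch: one CSV line in, one parsed value out.
// The field layout (notional in the third column) is hypothetical.
public class ParseNotionalFn extends DoFn<String, Double> {
  @ProcessElement
  public void processElement(@Element String csvLine, OutputReceiver<Double> out) {
    String[] fields = csvLine.split(",");
    out.output(Double.parseDouble(fields[2]));
  }
}
```

A DoFn like this is wrapped in a ParDo and applied to a PCollection, which is what lets the runner parallelize the work across workers.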
Implementation Details:
- Java and C++ Integration: One of the biggest hurdles was integrating our existing C++ quant library into the Java-based Beam pipeline. We used JNI for this, but quickly realized the importance of an out-of-process execution strategy so that C++ errors could not crash the entire pipeline. We implemented a “dead-letter” queue to handle failures gracefully (see the sketch after this list).
- Serialization with Protobuf: Protobuf proved invaluable for efficient data exchange between Java and C++. Its compact representation and cross-language support simplified communication and testing.
- Modular Design with Beam I/O: Beam’s I/O connectors let us integrate cleanly with various data sources (like CSV files for testing) and sinks. This modularity kept the pipeline code focused on the core risk calculations (a skeleton follows the list).
- Testing with Protobuf: Protobuf’s schema definitions enabled consistent unit testing across both Java and C++ components, simplifying the debugging process.
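To illustrate the out-of-process and dead-letter points above, here is a hedged sketch of the pattern. `RiskRequest` and `RiskResult` stand in for protoc-generated Protobuf classes, and `NativePricer` is a placeholder for the wrapper around the C++ library; the failure routing itself uses Beam’s standard multi-output TupleTag mechanism.

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.TupleTag;

// RiskRequest / RiskResult stand in for protoc-generated Protobuf classes,
// and NativePricer is a placeholder for the out-of-process C++ wrapper.
public class PriceWithDeadLetterFn extends DoFn<RiskRequest, RiskResult> {
  public static final TupleTag<RiskResult> RESULTS = new TupleTag<RiskResult>() {};
  public static final TupleTag<RiskRequest> DEAD_LETTER = new TupleTag<RiskRequest>() {};

  @ProcessElement
  public void processElement(@Element RiskRequest request, MultiOutputReceiver out) {
    try {
      // Hand the serialized request to the native process and parse its reply.
      byte[] reply = NativePricer.price(request.toByteArray());
      out.get(RESULTS).output(RiskResult.parseFrom(reply));
    } catch (Exception e) {
      // Anything the C++ side rejects (or dies on) goes to the dead-letter output
      // instead of failing the bundle.
      out.get(DEAD_LETTER).output(request);
    }
  }
}

// Wiring in the pipeline (both outputs come back in a PCollectionTuple):
//   requests.apply(ParDo.of(new PriceWithDeadLetterFn())
//       .withOutputTags(PriceWithDeadLetterFn.RESULTS,
//                       TupleTagList.of(PriceWithDeadLetterFn.DEAD_LETTER)));
```

The dead-letter PCollection can then be written to its own sink for inspection and replay, which is what kept individual C++ failures from taking down a run.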
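On the I/O side, here is a small skeleton of how the pipeline’s edges stay separate from the risk logic. The file paths are illustrative placeholders.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

// Keeps the pipeline's edges (source and sink) apart from the risk transforms.
public class RiskPipelineEdges {
  static void build(Pipeline pipeline) {
    // Source: plain CSV files while testing (path is a placeholder).
    PCollection<String> lines = pipeline.apply("ReadTrades",
        TextIO.read().from("gs://example-bucket/trades/*.csv"));

    // ... core risk transforms would go here, independent of where the data came from ...

    // Sink: write results back out as text (prefix is a placeholder).
    lines.apply("WriteResults", TextIO.write().to("gs://example-bucket/results/risk"));
  }
}
```

Swapping the CSV source for a different connector only touches these two `apply` calls, not the calculation logic in between.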
Exploring Different Runners:
While Dataflow was our primary target, we also experimented with Apache Flink running locally to evaluate its portability. This hands-on comparison highlighted the trade-offs between a fully managed service (Dataflow) and the flexibility of an open-source runner (Flink).
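In practice, switching runners did not require changes to the pipeline code itself; only the options (and the runner dependency on the classpath) change. A sketch, with placeholder project and bucket names:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerSelectionExample {
  public static void main(String[] args) {
    // The same pipeline runs on either runner; only the flags differ, e.g.:
    //   Dataflow:    --runner=DataflowRunner --project=my-gcp-project \
    //                --region=europe-west1 --tempLocation=gs://example-bucket/tmp
    //   Local Flink: --runner=FlinkRunner
    // (project, region, and bucket names above are placeholders)
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);

    // ... apply the same transforms as before ...

    pipeline.run().waitUntilFinish();
  }
}
```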
Performance Observations:
We conducted performance tests with simulated data, and the results were encouraging. While both Dataflow and Flink performed well, Dataflow’s auto-scaling capabilities shone when dealing with fluctuating data volumes.
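For reference, this is roughly how autoscaling is enabled on the Dataflow runner; the worker cap below is illustrative rather than a number we tuned.

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Hedged sketch: let Dataflow size the worker pool itself instead of pinning it.
public class AutoscalingOptionsExample {
  static DataflowPipelineOptions autoscalingOptions(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
    options.setMaxNumWorkers(20); // upper bound; illustrative value only
    return options;
  }
}
```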
Dabbling in Streaming Analytics:
We also explored Beam’s windowing capabilities for streaming risk calculations. We started with fixed windows and experimented with different triggering mechanisms. Handling late-arriving data and synchronizing different data streams proved to be interesting challenges.
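As a concrete example of the setup we experimented with, here is a sketch of fixed windows with late firings and a bounded allowed lateness. The element type and durations are placeholders, not the exact values we used.

```java
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// RiskUpdate is a placeholder for whatever element type flows through the streaming pipeline.
public class RiskWindowing {
  static PCollection<RiskUpdate> window(PCollection<RiskUpdate> updates) {
    return updates.apply("FiveMinuteWindows",
        Window.<RiskUpdate>into(FixedWindows.of(Duration.standardMinutes(5)))
            // Emit a pane at the watermark, then re-fire for each late element...
            .triggering(AfterWatermark.pastEndOfWindow()
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            // ...but only accept data up to ten minutes late.
            .withAllowedLateness(Duration.standardMinutes(10))
            // Late panes refine (rather than replace) earlier results.
            .accumulatingFiredPanes());
  }
}
```

Choosing accumulating panes meant late data updated the previously emitted risk figures, which is why reconciling those refinements across streams was one of the more interesting challenges.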
Lessons Learned:
This project was a fantastic learning experience. We gained valuable insights into:
- The power of Apache Beam for building portable data pipelines.
- The scalability and ease of use of Google Cloud Dataflow.
- The complexities of integrating different programming languages (Java and C++).
- The importance of proper serialization and error handling in distributed systems.
Next Steps (at the time):
We planned to continue expanding this project, exploring more advanced windowing strategies, and experimenting with different data sources and risk models. This project solidified our interest in distributed computing and its potential for solving complex problems.