Scaling One Developer - Building a Python Stream Processor
Many of our blog posts are meant to illustrate features or applications of our library, but I thought it might be time to take a step back and write about our philosophy when building Bytewax, and what we hope to achieve.
Scaling one developer
In 2022, I had a chance to hear Peter Wang's talk at Data Council.
His entire talk is worth your time, but one of the themes that resonated with me was the idea of how much a single developer could accomplish on their own. How can we as a community create tools that enable a single domain expert or data scientist to have the maximum impact in their work?
Peter makes the point that Python has played a central role in the data ecosystem for a long time. Python provides a powerful workflow for interacting with data, building sophisticated models and iterating on analysis.
PyO3
One of Python's strengths, also mentioned in Peter's talk, is its ability to be extended in many other languages. Invoking a Python program in 2023 may internally execute code written in C++, Fortran or, in the case of Bytewax, Rust.
PyO3 provides ergonomic Rust bindings for Python, and allows us to implement our library as a native Python extension. Thanks to tools in the PyO3 ecosystem like Maturin, installation of Bytewax requires nothing more than `pip install bytewax`.
Timely Dataflow
Timely Dataflow is a Rust framework for managing and executing dataflow computations. It is akin to a distributed data-parallel compute engine, which scales the same program up from a single thread on your laptop to distributed execution across a cluster of computers. A quote from the Timely Dataflow book invokes a familiar theme:
> Timely dataflow arose from work at Microsoft Research, where a group
> of us worked on building scalable, distributed data processing
> platforms. Our experience was that other systems did not provide
> both expressive computation and high performance. Efficient systems
> would only let you write restricted programs, and expressive systems
> employed synchronous and otherwise inefficient execution.
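To make the "same program, more workers" idea concrete, here is a toy sketch in plain Python. It is not Timely Dataflow's or Bytewax's API: the function `run_wordcount` and its `worker_count` parameter are hypothetical names invented for illustration, and the "workers" are simulated in a single process. It only shows the underlying idea of routing each key to one owner so that the result does not depend on how many workers there are.

```python
from collections import Counter
from zlib import crc32


def run_wordcount(lines, worker_count=1):
    """Count words, splitting state across `worker_count` simulated workers.

    Hypothetical helper for illustration only; not part of any library.
    """
    # Each simulated worker owns a disjoint slice of the key space.
    partitions = [Counter() for _ in range(worker_count)]
    for line in lines:
        for word in line.lower().split():
            # Route every occurrence of a word to the same worker,
            # using a stable hash so routing is deterministic.
            owner = crc32(word.encode("utf-8")) % worker_count
            partitions[owner][word] += 1
    # Merge per-worker state; a given word lives in exactly one partition.
    merged = Counter()
    for part in partitions:
        merged.update(part)
    return dict(merged)


lines = ["to be or not to be", "that is the question"]
single = run_wordcount(lines, worker_count=1)
scaled = run_wordcount(lines, worker_count=4)
```

Here `single` and `scaled` hold the same counts, because the program logic never refers to how many workers exist. A real engine would also handle networking, progress tracking, and recovery, none of which this sketch attempts.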
Kubernetes
Bytewax Dataflows are designed to run natively on Kubernetes using waxctl, allowing developers to deploy and scale dataflows easily. In addition, Bytewax integrates with familiar observability frameworks like OpenTelemetry to place a lower operational burden on domain experts and smaller organizations that don't typically have a dedicated infrastructure team.
Bytewax
We now live in a world where the core technologies of Timely Dataflow and Python can be brought together harmoniously with PyO3/Maturin, and deployed natively on Kubernetes.
Bytewax's goal is to take these low-level primitives and build in knowledge of dataflow patterns, delivery semantics, cluster deployments, monitoring, failure recovery, and rescaling. I believe we have succeeded in creating a framework that feels approachable and Pythonic while giving our users access to the power of a distributed stream-processing engine.
Hello World
The most gratifying thing about creating tools for developers is seeing them used to create things that you didn't anticipate. If you are working in this space, have questions, or just want to share what you have been working on, please come join us on Slack!
If you would like to support our work, please consider starring our repo on GitHub.