Bytewax's MAD Map for Real-time Python Builders

Discover Python's Secret Weapons for Next-Gen Data Handling! 🔥

Jun 20, 2024

Why we built the Bytewax MAD Map

One of Bytewax's biggest draws is its seamless integration with the Python ecosystem. However, we struggled to find a comprehensive overview of libraries and tools relevant to real-time Python developers. While there are excellent overviews of the data landscape, such as Matt Turck's MAD Landscape and Yujian Tang's monthly updated LLM App Stack, these resources don't specifically address the needs of Python developers working with real-time data. This gap led us down the rabbit hole to create our own MAD (Machine Learning, AI, Data) Map.

The Bytewax MAD Map began as a slide presentation listing about 20 Python libraries. After we shared this initial list on LinkedIn, the community's response was overwhelmingly positive, and the list quickly expanded with more tools. Thanks to continuous community feedback, further research, and the efforts of the Bytewax team, the MAD Map has now grown to include over 100 tools.

Introduction

The idea of the Bytewax MAD Map for Real-time Python Builders is to give an overview of Python libraries and other tools that can be used alongside Bytewax for real-time IoT, Machine Learning, and GenAI use cases. Rather than attempting to cover every tool available, we focused on the libraries and tools our developers most frequently use. You might notice a preference for open-source tools; this is intentional and reflects the values of our developer community.

Data Acquisition

Data Sources

Data sources are the origin points from which data is collected. They can be divided into real-time sources, which provide immediate data streams, and at-rest sources, which store historical data. Both types of sources can be combined to build comprehensive real-time applications.

Real-time Data Sources

Real-time data sources provide immediate data streams, which are crucial for applications requiring up-to-the-second information.

Examples of real-time data sources are:

IoT Devices
IoT Sensors
Website Clickstreams
Mobile Apps
APIs
Events
Enterprise Applications
Financial Market Data

... and of course, there are many more!

At Bytewax, we maintain a list of awesome public real-time datasets and sources, which can be found here: GitHub Repo

Data Sources at Rest

Data sources at rest include various storage solutions that hold data that can be queried and processed as needed. These sources can store historical data, which is essential for comprehensive analytics and often provides the necessary context for streaming data.

Examples of data sources at rest:

Relational Databases
Files & Object Storage
NoSQL Databases
Cloud Storage
Datalakes
Data Warehouses
Document Storage
Media Archives
Log & Config Files
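
As a rough illustration of combining the two kinds of sources, the sketch below enriches a stream of events with context looked up from data at rest. The device names, fields, and the `enrich` helper are hypothetical and exist only for this example; a plain dictionary stands in for a database or file.

```python
# At-rest source: hypothetical device metadata, e.g. loaded once from a database.
device_locations = {"sensor-1": "warehouse", "sensor-2": "loading dock"}


def enrich(events, lookup):
    """Attach stored context to each event as it arrives."""
    for event in events:
        location = lookup.get(event["device"], "unknown")
        yield {**event, "location": location}


# Real-time source: a small list stands in for a live event stream.
live_events = [
    {"device": "sensor-1", "temp": 21.5},
    {"device": "sensor-9", "temp": 18.0},
]
enriched = list(enrich(iter(live_events), device_locations))
```

Events from devices missing in the lookup are kept and marked as "unknown" rather than dropped, which is one possible design choice among several.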

Message Brokers & Queues

Message brokers and queues are systems that facilitate the reliable transfer of data between applications, ensuring consistent and orderly data delivery.

Kafka → GitHub Repo
Redpanda → GitHub Repo
RabbitMQ → GitHub Repo
MQTT → GitHub Repo
Pulsar → GitHub Repo
NATS → GitHub Repo
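
To make the producer and consumer pattern concrete, here is a minimal sketch that uses Python's standard-library `queue.Queue` as an in-process stand-in for a broker. It is only an analogy: none of the tools listed above are used, and the sample readings are made up. Real brokers run as separate services and add features such as persistence and delivery across machines.

```python
import queue
import threading


def produce(q, readings):
    """Publish each reading, then a sentinel that marks the end of the stream."""
    for reading in readings:
        q.put(reading)
    q.put(None)


def consume(q):
    """Read messages in arrival order until the sentinel shows up."""
    received = []
    while True:
        message = q.get()
        if message is None:
            break
        received.append(message)
    return received


broker = queue.Queue()  # stand-in for a broker topic or queue
readings = [{"sensor": "a", "temp": 21.5}, {"sensor": "b", "temp": 19.0}]

producer = threading.Thread(target=produce, args=(broker, readings))
producer.start()
received = consume(broker)
producer.join()
```

The producer and consumer never call each other directly; they only share the queue, which is the decoupling a message broker provides between applications.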

Streaming Protocols

Streaming protocols are frameworks that enable continuous data streams with low latency, allowing for real-time data transfer.

Websockets → GitHub Repo
Requests → GitHub Repo
Server-Sent Events - A server push technology that allows servers to send real-time updates to the client over a single HTTP connection, ideal for live notifications and updates.
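
Server-Sent Events arrive as plain text: `event:` and `data:` fields, one per line, with a blank line closing each event. The following is a simplified sketch of a parser for that format, not a complete client. It ignores the `id` and `retry` fields, and the `parse_sse` name and the sample payload are illustrative assumptions.

```python
def parse_sse(lines):
    """Yield (event, data) tuples from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one leading space after the colon is dropped
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


sample = [
    ": keep-alive\n",
    "event: price\n",
    'data: {"symbol": "ABC", "price": 10.5}\n',
    "\n",
    "data: first line\n",
    "data: second line\n",
    "\n",
]
events = list(parse_sse(sample))
```

Events without an explicit `event:` field fall back to the default type `message`, and multiple `data:` lines are joined with newlines.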

Databases

Databases are foundational systems for storing and querying data efficiently, supporting structured data storage and complex queries.

PostgreSQL → GitHub Repo
SQLite → GitHub Repo
MySQL → GitHub Repo
SQL Server → GitHub Repo
ClickHouse → GitHub Repo

Datalakes

Datalakes provide scalable storage for large datasets, allowing raw data to be stored in its native format and processed as needed.

LanceDB → GitHub Repo
Delta Lake → GitHub Repo
Deeplake → GitHub Repo

Data Warehouses

Data warehouses offer optimized storage for structured data and support complex queries, providing powerful analytics capabilities.

File System & Object Storage

File systems and object storage solutions store unstructured data and provide versatile storage and retrieval capabilities.

Hadoop HDFS → GitHub Repo
Amazon S3 - An object storage service offering scalable, high-availability storage with comprehensive security and compliance capabilities.
Azure Blob Storage - A Microsoft cloud-based object storage solution optimized for storing massive amounts of unstructured data, with seamless integration into the Azure ecosystem.
Google Cloud Storage (GCS) → GitHub Repo

Webscraping

Webscraping tools enable data extraction from web sources, gathering information from websites for processing and analysis.

Scrapy → GitHub Repo
BeautifulSoup → GitHub Repo
Selenium → GitHub Repo

Data Processing

Data Manipulation

Data manipulation libraries provide tools for cleaning, transforming, and validating data, ensuring it is in the correct format and structure for analysis.

NumPy → GitHub Repo
pandas → GitHub Repo
Polars → GitHub Repo
DuckDB → GitHub Repo
Pydantic → GitHub Repo

Time Series Analysis

Time series analysis tools are used for forecasting and analyzing trends in time-dependent data, providing insights into patterns and future trends.

Darts → GitHub Repo
Prophet → GitHub Repo
Kats → GitHub Repo
tsfresh → GitHub Repo

Statistical Analysis

Statistical analysis tools offer advanced statistical functions for complex data analysis, supporting a wide range of statistical tests and models.

SciPy → GitHub Repo
Pingouin → GitHub Repo
statsmodels → GitHub Repo

Natural Language Processing

Natural Language Processing (NLP) libraries facilitate text analysis and processing, enabling applications to understand and interpret language data.

spaCy → GitHub Repo
NLTK → GitHub Repo
transformers → GitHub Repo

Orchestration

Orchestration tools manage and schedule complex workflows, ensuring that tasks are executed in the correct order and at the right time.

Mage → GitHub Repo
Airflow → GitHub Repo
Dagster → GitHub Repo
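
The core idea of running tasks in the correct order can be sketched with the standard library's `graphlib` module (Python 3.9+). This is not how any of the tools above are configured; the task names are hypothetical, and a real orchestrator would add scheduling, retries, and monitoring on top.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "train": {"clean"},
    "report": {"clean", "train"},
}

executed = []
for task in TopologicalSorter(dependencies).static_order():
    # A real orchestrator would run the task here; this sketch only records it.
    executed.append(task)
```

Because every task appears after all of its dependencies, `report` cannot start before both `clean` and `train` have finished.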

Stream Processing

Stream processing frameworks handle continuous data streams, enabling real-time data processing and analytics at scale. Unlike traditional batch processing, which processes data in large, static chunks, stream processing deals with data as it arrives, allowing for immediate insights and actions.

Bytewax → GitHub Repo
PySpark → GitHub Repo
PyFlink → GitHub Repo
Quix → GitHub Repo
Faust → GitHub Repo
Pathway → GitHub Repo
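
The difference between batch and stream processing can be shown in a few lines of plain Python. This sketch deliberately avoids the API of any framework listed above; the function names and readings are made up for illustration.

```python
def batch_mean(values):
    """Batch style: wait for the full dataset, then compute once."""
    return sum(values) / len(values)


def streaming_mean(stream):
    """Streaming style: emit an updated mean after every arriving item."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


readings = [10.0, 20.0, 30.0, 40.0]

final_result = batch_mean(readings)  # available only after all data is in
running = list(streaming_mean(iter(readings)))  # one result per arriving item
```

The streaming version keeps only a count and a running total as state, so it can produce an answer after each item without storing the full history. Stream processing frameworks manage this kind of state for you, along with scaling and recovery.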

Machine Learning

Edge ML

Edge ML tools enable machine learning on edge devices, allowing for real-time data processing and analytics at the source.

TensorFlow Lite → GitHub Repo
edge-ml → GitHub Repo

Machine Learning

Machine learning libraries provide algorithms for data analysis and prediction, which is essential for building and deploying machine learning models.

River → GitHub Repo
Theano → GitHub Repo
scikit-learn → GitHub Repo
XGBoost → GitHub Repo
CatBoost → GitHub Repo

Deep Learning

Deep learning frameworks enable the development of advanced neural networks for complex tasks such as image recognition and natural language processing.

PyTorch → GitHub Repo
Keras → GitHub Repo
TensorFlow → GitHub Repo
fast.ai → GitHub Repo

Computer Vision

Computer vision libraries facilitate image and video analysis, enabling applications to interpret and process visual data.

YOLO → GitHub Repo
scikit-image → GitHub Repo
OpenCV → GitHub Repo

GenAI

Vector Embedding Models

Vector embedding models offer pre-trained models for various NLP tasks, converting text into numerical vectors for analysis.

Hugging Face → GitHub Repo
Voyage AI → GitHub Repo
Jina → GitHub Repo
Mixed Bread → GitHub Repo

The go-to resource for vector embedding models is Hugging Face's MTEB Leaderboard, which showcases the performance of various models on the Massive Text Embedding Benchmark (MTEB).

Large Language Models

Large language models provide powerful text generation and understanding capabilities, performing tasks such as summarization and conversation generation.

OpenAI → GitHub Repo
Anthropic → GitHub Repo
Mistral → GitHub Repo
Gemini → GitHub Repo

Choosing the right large language model depends on the specific requirements of your use case. This is a very dynamic space, so performance leaderboards such as the LLM Leaderboard or the Hugging Face Open LLM Leaderboard are a good starting point for selecting an LLM.

Orchestration Frameworks

Orchestration frameworks help in building complex workflows involving multiple AI models and data sources, managing interactions between components.

LangChain → GitHub Repo
haystack → GitHub Repo
LlamaIndex → GitHub Repo

Experiment Tracking

Experiment tracking tools record, compare, and manage machine learning experiments, keeping a history of model versions, parameters, and performance metrics.

W&B → GitHub Repo
Comet → GitHub Repo
MLflow → GitHub Repo
DVC → GitHub Repo

Sinks

Vector Databases

Vector databases are optimized for storing and querying high-dimensional data, enabling efficient similarity searches and other vector operations.

LanceDB → GitHub Repo
Milvus → GitHub Repo
Qdrant → GitHub Repo
Weaviate → GitHub Repo
Zilliz → GitHub Repo
Pinecone → GitHub Repo
Chroma → GitHub Repo
Elastic → GitHub Repo

A good comparison of the different vector databases on the market can be found in Superlinked's Vector Database Comparison.
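
To show what a similarity search does, here is a brute-force sketch in plain Python using cosine similarity. The three-dimensional vectors and document IDs are invented for the example. The databases above typically rely on specialized index structures rather than comparing a query against every stored vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the IDs of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical embeddings; real ones usually have hundreds of dimensions.
index = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.6, 0.8, 0.0],
    "doc-c": [0.0, 1.0, 0.0],
}
nearest = top_k([1.0, 0.05, 0.0], index)
```

The query points almost exactly along the first axis, so `doc-a` ranks first, followed by `doc-b`, which shares part of that direction.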

Feature Stores

Feature stores manage and serve machine learning features, providing a centralized repository for feature data.

Hopsworks → GitHub Repo
Feast → GitHub Repo
Vertex AI → GitHub Repo
Tecton → GitHub Repo
Amazon SageMaker Feature Store → GitHub Repo

Streaming Sinks

Streaming sinks enable real-time data storage and retrieval, allowing for immediate access to processed data.

Redis → GitHub Repo
Google Dataflow → GitHub Repo
Azure Stream Analytics → GitHub Repo
Amazon Kinesis → GitHub Repo
ClickHouse → GitHub Repo

Relational Databases

Relational databases provide structured data storage and querying capabilities, supporting complex queries and data analysis.

SQLite → GitHub Repo
MariaDB → GitHub Repo
PostgreSQL → GitHub Repo
MySQL → GitHub Repo

NoSQL Databases

NoSQL databases offer flexible data models, enabling efficient unstructured and semi-structured data handling.

MongoDB → GitHub Repo
Cassandra → GitHub Repo
DynamoDB → GitHub Repo
Firestore → GitHub Repo

Application Layer

Data Visualization

Data visualization libraries enable the creation of interactive charts and dashboards, allowing for effective visualization of data insights.

Bokeh → GitHub Repo
Plotly → GitHub Repo
Rerun.io → GitHub Repo
Matplotlib → GitHub Repo
Seaborn → GitHub Repo
Redash → GitHub Repo
Superset → GitHub Repo

Logging & Monitoring

Logging and monitoring tools provide capabilities for tracking, monitoring, and alerting, ensuring the performance and reliability of applications.

Grafana → GitHub Repo
Prometheus → GitHub Repo
Elastic → GitHub Repo
Logstash → GitHub Repo

Data Applications

Frameworks for data applications enable the development of interactive data-driven applications, facilitating the deployment of real-time data solutions.

Streamlit → GitHub Repo
FastAPI → GitHub Repo
Flask → GitHub Repo
Count → GitHub Repo

Model Serving

Model serving tools facilitate the deployment and serving of machine learning models, enabling real-time inference and predictions.

Ray Serve → GitHub Repo
TensorFlow Serving → GitHub Repo
BentoML → GitHub Repo

🤔 Which of your favorite Python tools are we missing? Let us know which logo you would like to see on our MAD Map.

A guest post by

Jonas

Chief of Staff @ Bytewax.io

Bytewax
