Data Pipeline Architecture

Messaging-Based Data Pipeline

End-to-end streaming architecture processing 400M+ daily events with sub-65s latency

This project implements a production-grade streaming data pipeline built on the Kappa architecture. It ingests high-volume event data through Apache NiFi, streams it through Kafka topics, processes it in real time with PySpark Structured Streaming, and sinks the results into Apache Hive for analytics and reporting.
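The processing stage can be pictured as a Structured Streaming job that consumes from Kafka and lands Hive-readable files. The sketch below is illustrative rather than the project's actual job; the broker address, topic name, schema, and warehouse paths are all hypothetical stand-ins.

```python
# Illustrative sketch: read JSON events from a Kafka topic with PySpark
# Structured Streaming and append them to a date-partitioned, Hive-readable
# parquet location. All names (topic, broker, paths) are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = (
    SparkSession.builder
    .appName("event-pipeline")
    .enableHiveSupport()
    .getOrCreate()
)

# Hypothetical event schema.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # hypothetical broker
    .option("subscribe", "events")                    # hypothetical topic
    .load()
)

# Kafka delivers bytes; parse the value column as JSON into typed columns.
parsed = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .withColumn("dt", F.to_date("event_ts"))
)

query = (
    parsed.writeStream
    .format("parquet")                            # Hive-readable output
    .option("path", "/warehouse/events")          # hypothetical path
    .option("checkpointLocation", "/checkpoints/events")
    .partitionBy("dt")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```

Checkpointing to a durable location is what lets the job resume from its last committed Kafka offsets after a restart.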

The pipeline is orchestrated by Apache Airflow DAGs that manage scheduling, monitoring, and failure recovery. Everything runs in Docker containers for reproducible deployments. The system sustains 5K events/sec steady-state throughput with peaks up to 12K events/sec, maintaining 65s p95 latency across the full pipeline.
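An orchestration layer like the one described could look roughly like the following Airflow DAG. This is a hedged sketch, not the project's actual DAG: the DAG id, schedule, and health-check scripts are invented for illustration, and the `schedule` argument assumes Airflow 2.4+.

```python
# Illustrative sketch of an Airflow DAG for pipeline monitoring and failure
# recovery. All ids, schedules, and script paths are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Retries with backoff are one simple form of automated failure recovery.
default_args = {
    "owner": "data-eng",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="pipeline_health_checks",   # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="*/10 * * * *",           # hypothetical: every 10 minutes
    catchup=False,
    default_args=default_args,
) as dag:
    check_kafka_lag = BashOperator(
        task_id="check_kafka_lag",
        bash_command="python /opt/pipeline/check_lag.py",  # hypothetical script
    )
    validate_hive_partitions = BashOperator(
        task_id="validate_hive_partitions",
        bash_command="python /opt/pipeline/validate_partitions.py",  # hypothetical
    )
    # Only validate downstream output after the lag check passes.
    check_kafka_lag >> validate_hive_partitions
```

Keeping the checks in a containerized Airflow deployment, as the project does with Docker, makes the scheduler itself reproducible alongside the pipeline.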

Key Highlights

400M+ Daily Events Processed
65s P95 End-to-End Latency
5K/s Sustained Throughput
12K/s Peak Throughput
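A p95 figure like the one above is derived from per-event end-to-end latency samples. A minimal, self-contained sketch using the nearest-rank percentile method (the sample values are made up):

```python
# Compute a p95 latency from end-to-end latency samples (seconds)
# using the nearest-rank method. Sample values are illustrative.
def p95(samples):
    """Return the 95th-percentile value via the nearest-rank method."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank: the ceil(0.95 * n)-th smallest value (1-indexed).
    rank = -(-95 * len(ordered) // 100)  # ceiling division
    return ordered[rank - 1]

latencies = [42, 58, 61, 63, 64, 65, 59, 55, 48, 62]
print(p95(latencies))  # → 65
```

Production systems typically compute this over sliding windows (e.g. via Grafana on streamed metrics) rather than over a static list, but the percentile definition is the same.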

Architecture Details

Tech Stack

Apache Kafka
PySpark
Apache NiFi
Apache Hive
Apache Airflow
Docker
Grafana
Python