Messaging-Based Data Pipeline
End-to-end streaming architecture processing 400M+ daily events at a 65s p95 latency
This project implements a production-grade streaming data pipeline built on Kappa architecture. It ingests high-volume event data through Apache NiFi, streams it through Kafka topics, processes it in real-time with PySpark Structured Streaming, and sinks the results into Apache Hive for analytics and reporting.
The pipeline is orchestrated by Apache Airflow DAGs that manage scheduling, monitoring, and failure recovery. Everything runs in Docker containers for reproducible deployments. The system sustains 5K events/sec steady-state throughput with peaks up to 12K events/sec, maintaining 65s p95 latency across the full pipeline.
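The Kafka-to-Hive path described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the production job: the broker address `kafka:9092`, topic name `events`, payload schema, checkpoint path, and table name `analytics.events` are all placeholders.

```python
# Minimal PySpark Structured Streaming sketch: Kafka source -> Hive sink.
# Broker, topic, schema, checkpoint path, and table name are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = (SparkSession.builder
         .appName("event-pipeline")
         .enableHiveSupport()
         .getOrCreate())

# Assumed JSON payload for incoming events.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")  # placeholder broker
          .option("subscribe", "events")                    # placeholder topic
          .option("startingOffsets", "latest")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# foreachBatch appends each micro-batch to a Hive-managed table.
def write_to_hive(batch_df, batch_id):
    batch_df.write.mode("append").saveAsTable("analytics.events")  # placeholder

query = (events.writeStream
         .foreachBatch(write_to_hive)
         .option("checkpointLocation", "/checkpoints/events")  # enables recovery
         .start())
```

The checkpoint location is what lets Spark resume from the last committed Kafka offsets after a restart.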
Key Highlights
400M+
Daily Events Processed
65s
P95 End-to-End Latency
5K/s
Sustained Throughput
12K/s
Peak Throughput
Architecture Details
- Kappa Architecture: Single streaming path for both real-time and batch processing, eliminating the complexity of Lambda's dual pipeline.
- Event-Time Processing: Watermarking and late-arrival handling ensure correctness even with out-of-order events.
- Fault Tolerance: Kafka consumer offset tracking and Spark checkpointing deliver effectively exactly-once processing within the streaming job, while Airflow retries recover failed pipeline runs.
- Containerized Deployment: Full Docker Compose setup for local development and testing, production-ready for Kubernetes.
- Monitoring: Grafana dashboards tracking throughput, latency percentiles, consumer lag, and pipeline health.
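The event-time processing point above can be sketched in PySpark as follows. This is illustrative only: the built-in `rate` source stands in for the real Kafka stream, and the 10-minute watermark and 5-minute windows are assumed values, not the project's actual settings.

```python
# Sketch of event-time windowing with a watermark (assumed parameters).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

# The built-in rate source substitutes for the parsed Kafka stream here;
# its timestamp column plays the role of the event-time field.
events = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 10)
          .load()
          .withColumnRenamed("timestamp", "event_time"))

# Watermark: tolerate events arriving up to 10 minutes late (assumed value).
# Later arrivals are dropped, which bounds aggregation state while keeping
# window counts correct for out-of-order events inside the threshold.
counts = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(window(col("event_time"), "5 minutes"))
          .count())
```

Without the watermark, Spark would have to hold every window's state forever to accommodate arbitrarily late data.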
Tech Stack
Apache Kafka · PySpark · Apache NiFi · Apache Hive
Apache Airflow · Docker · Grafana · Python