Messaging-Based Data Pipeline
End-to-end streaming architecture processing 400M+ daily events at a 65s p95 latency
This project implements a production-grade streaming data pipeline built on Kappa architecture. It ingests high-volume event data through Apache NiFi, streams it through Kafka topics, processes it in real-time with PySpark Structured Streaming, and sinks the results into Apache Hive for analytics and reporting.
The pipeline is orchestrated by Apache Airflow DAGs that manage scheduling, monitoring, and failure recovery. Everything runs in Docker containers for reproducible deployments. The system sustains 5K events/sec steady-state throughput with peaks up to 12K events/sec, maintaining 65s p95 latency across the full pipeline.
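The Kafka-to-Hive path described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the production job: the broker address `kafka:9092`, topic name `events`, payload schema, checkpoint path, and table name `analytics.events` are all placeholders.

```python
# Minimal PySpark Structured Streaming sketch: Kafka source -> Hive sink.
# Broker, topic, schema, checkpoint path, and table name are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = (SparkSession.builder
         .appName("event-pipeline")
         .enableHiveSupport()
         .getOrCreate())

# Assumed JSON payload for incoming events.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")  # placeholder broker
          .option("subscribe", "events")                    # placeholder topic
          .option("startingOffsets", "latest")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# foreachBatch appends each micro-batch to a Hive-managed table.
def write_to_hive(batch_df, batch_id):
    batch_df.write.mode("append").saveAsTable("analytics.events")  # placeholder

query = (events.writeStream
         .foreachBatch(write_to_hive)
         .option("checkpointLocation", "/checkpoints/events")  # enables recovery
         .start())
```

The checkpoint location is what lets Spark resume from the last committed Kafka offsets after a restart.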
Key Highlights
400M+
Daily Events Processed
65s
P95 End-to-End Latency
5K/s
Sustained Throughput
12K/s
Peak Throughput
Architecture Details
- Kappa Architecture: Single streaming path for both real-time and batch processing, eliminating the complexity of Lambda's dual pipeline.
- Event-Time Processing: Watermarking and late-arrival handling ensure correctness even with out-of-order events.
- Fault Tolerance: Kafka consumer offset tracking and Spark checkpointing deliver effectively exactly-once processing within the streaming job, while Airflow retries recover failed pipeline runs.
- Containerized Deployment: Full Docker Compose setup for local development and testing, production-ready for Kubernetes.
- Monitoring: Grafana dashboards tracking throughput, latency percentiles, consumer lag, and pipeline health.
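The event-time processing point above can be sketched in PySpark as follows. This is illustrative only: the built-in `rate` source stands in for the real Kafka stream, and the 10-minute watermark and 5-minute windows are assumed values, not the project's actual settings.

```python
# Sketch of event-time windowing with a watermark (assumed parameters).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

# The built-in rate source substitutes for the parsed Kafka stream here;
# its timestamp column plays the role of the event-time field.
events = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 10)
          .load()
          .withColumnRenamed("timestamp", "event_time"))

# Watermark: tolerate events arriving up to 10 minutes late (assumed value).
# Later arrivals are dropped, which bounds aggregation state while keeping
# window counts correct for out-of-order events inside the threshold.
counts = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(window(col("event_time"), "5 minutes"))
          .count())
```

Without the watermark, Spark would have to hold every window's state forever to accommodate arbitrarily late data.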
Tech Stack
Apache Kafka · PySpark · Apache NiFi · Apache Hive
Apache Airflow · Docker · Grafana · Python