Data Pipeline Architecture

Messaging-Based Data Pipeline

End-to-end streaming architecture processing 400M+ records/month with sub-65s latency

This project implements a production-grade streaming data pipeline built on a Kappa architecture. It ingests high-volume event data through Apache NiFi, publishes it to Kafka topics, processes it in real time with PySpark Structured Streaming, and writes the results to Apache Hive for analytics and reporting.
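As a rough illustration of the Kafka to PySpark to Hive leg described above, here is a minimal sketch. It is not the project's actual code: the broker address, topic name, Hive table, checkpoint path, event schema, and the 10-second trigger are all hypothetical placeholders. Running `start_stream` also assumes a Spark session with Hive support and the Kafka connector package available.

```python
"""Sketch of the Kafka -> PySpark Structured Streaming -> Hive leg.

Every constant below is a hypothetical placeholder, not a value from the project.
"""

KAFKA_BOOTSTRAP = "kafka:9092"              # assumed broker address
TOPIC = "events.raw"                        # assumed topic that NiFi publishes to
HIVE_TABLE = "analytics.events"             # assumed Hive target table
CHECKPOINT_DIR = "/tmp/checkpoints/events"  # assumed checkpoint location
# Assumed event schema, written as a Spark DDL string.
EVENT_SCHEMA = "event_id STRING, event_time TIMESTAMP, payload STRING"


def kafka_source_options(bootstrap: str = KAFKA_BOOTSTRAP, topic: str = TOPIC) -> dict:
    """Return options for Spark's built-in Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def start_stream(spark):
    """Start the streaming query and return its handle.

    Requires pyspark, the spark-sql-kafka connector, and Hive support on the session.
    """
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.sql.functions import col, from_json

    raw = spark.readStream.format("kafka").options(**kafka_source_options()).load()

    # Kafka delivers the payload as bytes in the `value` column; parse it as JSON.
    events = (
        raw.select(from_json(col("value").cast("string"), EVENT_SCHEMA).alias("e"))
        .select("e.*")
    )

    def write_batch(batch_df, batch_id):
        # Append each micro-batch to the Hive table.
        batch_df.write.mode("append").saveAsTable(HIVE_TABLE)

    return (
        events.writeStream
        .foreachBatch(write_batch)
        .option("checkpointLocation", CHECKPOINT_DIR)
        .trigger(processingTime="10 seconds")  # assumed micro-batch interval
        .start()
    )
```

The checkpoint location is what lets the query resume from the last committed Kafka offsets after a restart, which is why it appears even in a sketch this small.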

The pipeline is orchestrated by Apache Airflow DAGs that handle scheduling, monitoring, and failure recovery. Every component runs in a Docker container for reproducible deployments. The system absorbs bursty event ingestion while holding p95 end-to-end latency at 65s.
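To make the p95 figure concrete, the sketch below shows one common way such a number can be computed: the nearest-rank 95th percentile over per-record latencies, where latency is the gap between an event's creation time and the time it lands in the sink. The function name and the sample data are illustrative assumptions; the project may measure latency differently (for example, from Grafana dashboards).

```python
from datetime import datetime, timedelta


def p95_latency_seconds(samples):
    """Nearest-rank 95th percentile of (event_time, sink_time) pairs, in seconds."""
    latencies = sorted((sink - event).total_seconds() for event, sink in samples)
    if not latencies:
        raise ValueError("no samples")
    # ceil(0.95 * n) in integer arithmetic, giving a 1-based rank.
    rank = (95 * len(latencies) + 99) // 100
    return latencies[rank - 1]


# Synthetic data only: 100 records with latencies of 1..100 seconds.
t0 = datetime(2024, 1, 1)
samples = [(t0, t0 + timedelta(seconds=s)) for s in range(1, 101)]

p95 = p95_latency_seconds(samples)
meets_target = p95 <= 65.0  # 65 s is the target quoted on this page
print(p95, meets_target)
```

A percentile is used rather than a mean because bursty ingestion produces a long tail, and the mean hides how slow the slowest records are.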

Key Highlights

400M+ Records Processed / Month
65s P95 End-to-End Latency
Kappa Single-Path Streaming Architecture
E2E Exactly-Once Delivery Semantics
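The page does not say how exactly-once delivery is achieved. One widely used approach with Structured Streaming is to pair replayable Kafka offsets and checkpointing with an idempotent sink: `foreachBatch` on its own is at-least-once, but the `batch_id` it passes can be used to skip a micro-batch that was already committed. The class below is a hypothetical in-memory stand-in that shows the idea, not the project's sink.

```python
class IdempotentSink:
    """In-memory stand-in for a sink that commits each micro-batch at most once."""

    def __init__(self):
        self.rows = []
        self.committed_batches = set()

    def write_batch(self, batch_id, rows):
        """Append rows unless batch_id was already committed. Return True if written."""
        if batch_id in self.committed_batches:
            return False  # replayed batch after a failure: skip it
        self.rows.extend(rows)
        self.committed_batches.add(batch_id)
        return True


sink = IdempotentSink()
sink.write_batch(0, ["a", "b"])
sink.write_batch(1, ["c"])
retried = sink.write_batch(1, ["c"])  # simulated retry of batch 1 after a crash
print(sink.rows, retried)
```

In a real sink the rows and the batch ID would have to be committed atomically (for example, in one transaction or one partition overwrite); otherwise a crash between the two steps can still duplicate or drop data.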

Architecture Details

Tech Stack

Apache Kafka, PySpark, Apache NiFi, Apache Hive, Apache Airflow, Docker, Grafana, Python