writing

thoughts on building ML systems and evaluating frontier models, plus lessons from production.

Building an LLM Evaluation Framework from Scratch

Why off-the-shelf benchmarks aren't enough, and how I built a custom evaluation harness for frontier models with consistency scoring, calibration measurement, and chain-of-thought quality analysis.

Fine-Tuning LLMs for Healthcare with LoRA: A Practical Guide

How I adapted Mistral-7B for clinical text extraction using QLoRA, achieving 67% exact-match accuracy with just 12 GB of GPU memory. Practical lessons on rank selection, target modules, and failure modes.

What I Learned Building a Multimodal RAG System for 500K Scientific Papers

Architecture decisions, FAISS optimization tricks, and why agent evaluation is harder than agent building. Lessons from processing 2.3M figures, 890K tables, and 410K equations.

Designing a Streaming Pipeline That Handles 400M+ Events Per Day

Lessons from building a Kappa architecture with Kafka, PySpark, and Airflow. How we cut P95 latency from 180s to 65s, and why fault tolerance is harder than it sounds.

Building DreamStudio: Orchestrating 4 AI Models for Real-Time Cinematic Generation

How I connected Gemini Live, Imagen 4, Veo 3.1, and Lyria 2 into a unified creative pipeline. Lessons on multi-model orchestration and real-time generation UX.

Using Computer Vision to Make NYC Bike Lanes Safer

How Bike Lane Sentinel uses object detection and lane boundary analysis to automate enforcement reporting for illegal vehicle encroachment in NYC bike lanes.