Thoughts on building ML systems, evaluating frontier models, and lessons from production.
Why off-the-shelf benchmarks aren't enough, and how I built a custom evaluation harness for frontier models with consistency scoring, calibration measurement, and chain-of-thought quality analysis.
fine-tuning: How I adapted Mistral-7B for clinical text extraction using QLoRA, achieving 67% exact-match accuracy with just 12 GB of GPU memory. Practical lessons on rank selection, target modules, and failure modes.
rag & agents: Architecture decisions, FAISS optimization tricks, and why agent evaluation is harder than agent building. Lessons from processing 2.3M figures, 890K tables, and 410K equations.
data engineering: Lessons from building a Kappa architecture with Kafka, PySpark, and Airflow. How we cut P95 latency from 180 s to 65 s, and why fault tolerance is harder than it sounds.
generative ai: How I connected Gemini Live, Imagen 4, Veo 3.1, and Lyria 2 into a unified creative pipeline. Lessons on multi-model orchestration and real-time generation UX.
computer vision: How Bike Lane Sentinel uses object detection and lane boundary analysis to automate enforcement reporting for illegal vehicle encroachment in NYC bike lanes.