Multimodal RAG for Scientific Literature
Agentic retrieval system for multi-turn queries across 500K+ papers with text, figures, tables, and equations
Scientific papers contain more than text. Figures, tables, equations, and cross-references carry critical information that traditional text-only RAG systems miss entirely. This project builds a multimodal retrieval-augmented generation system that handles all four modalities with separate embedding strategies and unified agentic orchestration.
The system uses LangGraph to implement an agent that plans retrieval strategies, decides which modalities to search, and synthesizes results across multiple retrieval steps. Multi-turn conversations are supported natively: the agent maintains context and refines its searches based on previous results. A three-layer evaluation framework measures retrieval quality, grounding accuracy, and cross-modal consistency.
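The plan → retrieve → synthesize loop described above can be sketched as plain Python. This is a hypothetical, dependency-free stand-in for the real LangGraph nodes: the `plan` heuristic, `retrieve` placeholder, and all names here are illustrative assumptions, not the production implementation.

```python
from dataclasses import dataclass, field

MODALITIES = ("text", "figure", "table", "equation")

@dataclass
class AgentState:
    query: str
    history: list = field(default_factory=list)   # prior turns, kept for multi-turn context
    results: dict = field(default_factory=dict)   # modality -> retrieved hits

def plan(state: AgentState) -> list[str]:
    """Decide which modality indices to search (keyword heuristic stand-in)."""
    q = state.query.lower()
    chosen = ["text"]  # the text index is always searched
    if any(w in q for w in ("figure", "plot", "image")):
        chosen.append("figure")
    if any(w in q for w in ("table", "dataset", "benchmark")):
        chosen.append("table")
    if any(w in q for w in ("equation", "formula", "derivation")):
        chosen.append("equation")
    return chosen

def retrieve(state: AgentState, modality: str) -> list[str]:
    """Placeholder for a per-modality FAISS search."""
    return [f"{modality}-hit-for:{state.query}"]

def synthesize(state: AgentState) -> str:
    """Fuse hits across modalities into one grounded answer (stand-in)."""
    hits = [h for hits in state.results.values() for h in hits]
    return f"answer grounded in {len(hits)} retrieved items"

def run_turn(state: AgentState) -> str:
    for modality in plan(state):
        state.results[modality] = retrieve(state, modality)
    answer = synthesize(state)
    state.history.append((state.query, answer))  # carried into the next turn
    return answer

state = AgentState(query="Which figure shows the ablation results?")
print(run_turn(state))  # searches the text and figure indices
```

In the real system each of these functions is a LangGraph node, and routing between them is expressed as graph edges rather than a fixed loop.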
Key Highlights
500K+
Papers Indexed
94%
Recall@10
<100ms
Retrieval Latency
2.3M
Figures Indexed
Architecture Details
- Multi-Modal Embeddings: SciBERT for text, CLIP for figures, TAPAS for tables, OCR + embeddings for equations. Each modality has its own FAISS index with optimized parameters.
- LangGraph Agent: Planning, retrieval, and synthesis nodes with explicit decision-making about query decomposition, modality selection, and multi-step retrieval.
- FAISS Optimization: IVFPQ with nprobe=64 for the sweet spot between recall (94%) and latency (<100ms). Separate indices per modality for optimal parameters.
- Evaluation Framework: Retrieval metrics (recall@k, MRR), grounding evaluation (claim-document alignment), and cross-modal consistency checks.
- Scale: 500K+ papers, 2.3M figures, 890K tables, 410K equations indexed and searchable.
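The recall/latency trade-off that `nprobe` controls can be illustrated with a toy inverted-file (IVF) index in pure Python. This is a deliberately simplified sketch: the production system uses FAISS IVFPQ with `nprobe=64`, while the random centroids, dimensions, and cluster counts below are illustrative assumptions.

```python
import math
import random

random.seed(0)
DIM, NLIST = 8, 16

def dist(a, b):
    return math.dist(a, b)

class ToyIVF:
    """Minimal IVF index: vectors are bucketed by nearest centroid."""

    def __init__(self, nlist=NLIST):
        # Random centroids stand in for trained k-means centroids.
        self.centroids = [[random.random() for _ in range(DIM)] for _ in range(nlist)]
        self.lists = {i: [] for i in range(nlist)}

    def add(self, vec):
        i = min(range(len(self.centroids)), key=lambda c: dist(vec, self.centroids[c]))
        self.lists[i].append(vec)

    def search(self, query, k=5, nprobe=4):
        # Visit only the nprobe closest clusters: fewer probes means lower
        # latency but risks missing neighbours stored in unvisited lists.
        order = sorted(range(len(self.centroids)),
                       key=lambda c: dist(query, self.centroids[c]))
        candidates = [v for c in order[:nprobe] for v in self.lists[c]]
        return sorted(candidates, key=lambda v: dist(query, v))[:k]

index = ToyIVF()
for _ in range(500):
    index.add([random.random() for _ in range(DIM)])

q = [0.5] * DIM
few = index.search(q, nprobe=1)       # fast, may miss true neighbours
many = index.search(q, nprobe=NLIST)  # exhaustive: every list probed
```

Raising `nprobe` can only improve the best match found, at the cost of scanning more candidate vectors per query; tuning it per modality is why each index keeps its own parameters.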
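The first layer of the evaluation framework, retrieval metrics, reduces to two small functions. This is a generic sketch of recall@k and MRR over ranked results and relevance-judged gold sets; the document IDs and queries below are made up for illustration.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mrr(queries):
    """Mean reciprocal rank of the first relevant hit per query."""
    total = 0.0
    for ranked, relevant in queries:
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)

queries = [
    (["d3", "d1", "d7"], {"d1"}),        # first relevant hit at rank 2
    (["d2", "d5", "d4"], {"d4", "d9"}),  # first relevant hit at rank 3
]
print(recall_at_k(["d3", "d1", "d7"], {"d1"}, k=2))  # -> 1.0
print(mrr(queries))  # -> (1/2 + 1/3) / 2
```

Grounding evaluation and cross-modal consistency checks build on top of these per-query scores and are not shown here.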
Tech Stack
LangGraph, SciBERT, CLIP, TAPAS, FAISS, OCR, Python, Hugging Face