Multimodal RAG for Scientific Literature
Agentic retrieval system for multi-turn queries across 500K+ papers with text, figures, tables, and equations
Scientific papers contain more than text. Figures, tables, equations, and cross-references carry critical information that traditional text-only RAG systems miss entirely. This project builds a multimodal retrieval-augmented generation system that handles all four modalities with separate embedding strategies and unified agentic orchestration.
The system uses LangGraph to implement an agent that plans retrieval strategies, decides which modalities to search, and synthesizes results across multiple retrieval steps. Multi-turn conversations are supported natively: the agent maintains context and refines its searches based on previous results. A three-layer evaluation framework measures retrieval quality, grounding accuracy, and cross-modal consistency.
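The plan → retrieve → synthesize loop described above can be sketched as plain Python. This is a hypothetical, dependency-free stand-in for the real LangGraph nodes: the `plan` heuristic, `retrieve` placeholder, and all names here are illustrative assumptions, not the production implementation.

```python
from dataclasses import dataclass, field

MODALITIES = ("text", "figure", "table", "equation")

@dataclass
class AgentState:
    query: str
    history: list = field(default_factory=list)   # prior turns, kept for multi-turn context
    results: dict = field(default_factory=dict)   # modality -> retrieved hits

def plan(state: AgentState) -> list[str]:
    """Decide which modality indices to search (keyword heuristic stand-in)."""
    q = state.query.lower()
    chosen = ["text"]  # the text index is always searched
    if any(w in q for w in ("figure", "plot", "image")):
        chosen.append("figure")
    if any(w in q for w in ("table", "dataset", "benchmark")):
        chosen.append("table")
    if any(w in q for w in ("equation", "formula", "derivation")):
        chosen.append("equation")
    return chosen

def retrieve(state: AgentState, modality: str) -> list[str]:
    """Placeholder for a per-modality FAISS search."""
    return [f"{modality}-hit-for:{state.query}"]

def synthesize(state: AgentState) -> str:
    """Fuse hits across modalities into one grounded answer (stand-in)."""
    hits = [h for hits in state.results.values() for h in hits]
    return f"answer grounded in {len(hits)} retrieved items"

def run_turn(state: AgentState) -> str:
    for modality in plan(state):
        state.results[modality] = retrieve(state, modality)
    answer = synthesize(state)
    state.history.append((state.query, answer))  # carried into the next turn
    return answer

state = AgentState(query="Which figure shows the ablation results?")
print(run_turn(state))  # searches the text and figure indices
```

In the real system each of these functions is a LangGraph node, and routing between them is expressed as graph edges rather than a fixed loop.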
Key Highlights
500K+
Papers Indexed
94%
Recall@10
<100ms
Retrieval Latency
2.3M
Figures Indexed
Architecture Details
- Multi-Modal Embeddings: SciBERT for text, CLIP for figures, TAPAS for tables, OCR + embeddings for equations. Each modality has its own FAISS index with optimized parameters.
- LangGraph Agent: Planning, retrieval, and synthesis nodes with explicit decision-making about query decomposition, modality selection, and multi-step retrieval.
- FAISS Optimization: IVFPQ with nprobe=64 for the sweet spot between recall (94%) and latency (<100ms). Separate indices per modality for optimal parameters.
- Evaluation Framework: Retrieval metrics (recall@k, MRR), grounding evaluation (claim-document alignment), and cross-modal consistency checks.
- Scale: 500K+ papers, 2.3M figures, 890K tables, 410K equations indexed and searchable.
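The recall/latency trade-off that `nprobe` controls can be illustrated with a toy inverted-file (IVF) index in pure Python. This is a deliberately simplified sketch: the production system uses FAISS IVFPQ with `nprobe=64`, while the random centroids, dimensions, and cluster counts below are illustrative assumptions.

```python
import math
import random

random.seed(0)
DIM, NLIST = 8, 16

def dist(a, b):
    return math.dist(a, b)

class ToyIVF:
    """Minimal IVF index: vectors are bucketed by nearest centroid."""

    def __init__(self, nlist=NLIST):
        # Random centroids stand in for trained k-means centroids.
        self.centroids = [[random.random() for _ in range(DIM)] for _ in range(nlist)]
        self.lists = {i: [] for i in range(nlist)}

    def add(self, vec):
        i = min(range(len(self.centroids)), key=lambda c: dist(vec, self.centroids[c]))
        self.lists[i].append(vec)

    def search(self, query, k=5, nprobe=4):
        # Visit only the nprobe closest clusters: fewer probes means lower
        # latency but risks missing neighbours stored in unvisited lists.
        order = sorted(range(len(self.centroids)),
                       key=lambda c: dist(query, self.centroids[c]))
        candidates = [v for c in order[:nprobe] for v in self.lists[c]]
        return sorted(candidates, key=lambda v: dist(query, v))[:k]

index = ToyIVF()
for _ in range(500):
    index.add([random.random() for _ in range(DIM)])

q = [0.5] * DIM
few = index.search(q, nprobe=1)       # fast, may miss true neighbours
many = index.search(q, nprobe=NLIST)  # exhaustive: every list probed
```

Raising `nprobe` can only improve the best match found, at the cost of scanning more candidate vectors per query; tuning it per modality is why each index keeps its own parameters.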
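The first layer of the evaluation framework, retrieval metrics, reduces to two small functions. This is a generic sketch of recall@k and MRR over ranked results and relevance-judged gold sets; the document IDs and queries below are made up for illustration.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mrr(queries):
    """Mean reciprocal rank of the first relevant hit per query."""
    total = 0.0
    for ranked, relevant in queries:
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)

queries = [
    (["d3", "d1", "d7"], {"d1"}),        # first relevant hit at rank 2
    (["d2", "d5", "d4"], {"d4", "d9"}),  # first relevant hit at rank 3
]
print(recall_at_k(["d3", "d1", "d7"], {"d1"}, k=2))  # -> 1.0
print(mrr(queries))  # -> (1/2 + 1/3) / 2
```

Grounding evaluation and cross-modal consistency checks build on top of these per-query scores and are not shown here.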
Tech Stack
LangGraph, SciBERT, CLIP, TAPAS, FAISS, OCR, Python, Hugging Face