[Figure: Multimodal RAG Architecture diagram]

Multimodal RAG for Scientific Literature

Agentic retrieval system for multi-turn queries across 500K+ papers with text, figures, tables, and equations

Scientific papers contain more than text. Figures, tables, equations, and cross-references carry critical information that traditional text-only RAG systems miss entirely. This project builds a multimodal retrieval-augmented generation system that handles all four modalities with separate embedding strategies and unified agentic orchestration.
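The per-modality embedding idea can be sketched as a simple dispatch table. This is a minimal illustration with deterministic stub encoders; the stub function, dimension, and routing names are assumptions, while a real system would back each entry with a learned model (e.g. SciBERT for text, CLIP for figures, TAPAS for tables).

```python
# Sketch of per-modality embedding dispatch. The _stub_embed function is a
# hypothetical stand-in for learned encoders (SciBERT / CLIP / TAPAS / etc.).
import hashlib

EMBED_DIM = 8  # illustrative dimension, not the production setting

def _stub_embed(content: str, salt: str) -> list[float]:
    """Deterministic stand-in for a learned encoder."""
    digest = hashlib.sha256((salt + content).encode()).digest()
    return [b / 255.0 for b in digest[:EMBED_DIM]]

# One embedder per modality; each lambda marks where a distinct model plugs in.
EMBEDDERS = {
    "text":     lambda c: _stub_embed(c, "scibert"),
    "figure":   lambda c: _stub_embed(c, "clip"),
    "table":    lambda c: _stub_embed(c, "tapas"),
    "equation": lambda c: _stub_embed(c, "latex"),
}

def embed_chunk(modality: str, content: str) -> list[float]:
    """Route a chunk to the embedder registered for its modality."""
    if modality not in EMBEDDERS:
        raise ValueError(f"unknown modality: {modality}")
    return EMBEDDERS[modality](content)
```

The same content embedded under different modalities lands in different vector spaces, which is why each modality gets its own index downstream.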

The system uses LangGraph to implement an agent that plans retrieval strategies, decides which modalities to search, and synthesizes results across multiple retrieval steps. Multi-turn conversations are natural: the agent maintains context and refines its search based on previous results. A three-layer evaluation framework measures retrieval quality, grounding accuracy, and cross-modal consistency.
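The plan → retrieve → synthesize loop described above can be sketched in plain Python. The real system expresses this as a LangGraph state graph; the keyword-based planner and the retrieval stub below are illustrative assumptions, not the production logic.

```python
# Pure-Python sketch of the agent's plan -> retrieve -> synthesize loop.
# In the actual system this is a LangGraph graph; here the planner and
# retriever are hypothetical stand-ins.
from dataclasses import dataclass, field

@dataclass
class AgentState:
    history: list[str] = field(default_factory=list)  # prior user turns

def plan(query: str) -> list[str]:
    """Decide which modalities to search for this query (assumed heuristic)."""
    modalities = ["text"]  # text is always searched
    if "figure" in query or "plot" in query:
        modalities.append("figure")
    if "table" in query or "results" in query:
        modalities.append("table")
    return modalities

def retrieve(query: str, modalities: list[str]) -> list[str]:
    """Stand-in for per-modality FAISS lookups."""
    return [f"{m}-hit for '{query}'" for m in modalities]

def answer(state: AgentState, query: str) -> str:
    """One agent turn: plan, retrieve, then synthesize with prior context."""
    hits = retrieve(query, plan(query))
    context = " | ".join(state.history[-2:])  # carry recent turns forward
    state.history.append(query)
    return f"[context: {context}] " + "; ".join(hits)
```

Keeping the conversation history inside the state object is what makes follow-up turns like "and the figure?" resolvable: the agent refines its next retrieval against what was already asked.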

Key Highlights

500K+ Papers Indexed
94% Recall@10
<100ms Retrieval Latency
2.3M Figures Indexed
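The Recall@10 figure above follows the standard definition: the fraction of relevant documents that appear in the top-k retrieved results. A minimal sketch, with illustrative data:

```python
# Recall@k: fraction of relevant items found in the top-k retrieved list.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Compute recall at cutoff k for a single query."""
    if not relevant:
        return 0.0  # convention: no relevant docs -> recall is 0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)
```

Averaging this over a held-out query set yields the reported aggregate number.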

Architecture Details

Tech Stack

LangGraph · SciBERT · CLIP · TAPAS · FAISS · OCR · Python · Hugging Face
View on GitHub · Read the Blog Post