RAGFlow supports a broad range of document formats that covers most enterprise use cases: PDFs (including scanned PDFs via OCR), Microsoft Word documents (DOCX), Excel spreadsheets, PowerPoint presentations, plain text files (TXT), Markdown, images (PNG, JPG, etc.), HTML/web pages, JSON, and structured data formats such as CSV. It handles both native digital documents and scanned or image-based documents.
For PDFs, RAGFlow offers multiple parser options: DeepDoc (the default, with OCR and layout recognition), MinerU (converts PDFs to machine-readable formats), and Docling (open-source processing). DOCX support includes table and image extraction, and both DOCX and Markdown support Q&A parsing, which is useful for documentation and FAQ knowledge bases. For images, DeepDoc performs OCR to convert visual content into searchable text. Web pages are parsed to extract text while handling the HTML structure. CSV and other structured data formats can be imported for tabular knowledge bases. The document engine (Infinity, at v0.6.1 as of RAGFlow v0.24.0) continues to improve format coverage and parsing accuracy.
RAGFlow can ingest documents through several methods: web UI upload, programmatic APIs, directory monitoring, or cloud storage integration. Because it handles these formats natively, you do not need to convert everything to TXT or PDF before ingestion. Documents are processed in their original formats, which preserves structure and semantics. This breadth matters for enterprise RAG, where knowledge is spread across heterogeneous sources (emails, PDFs, databases, web documentation, spreadsheets), and it removes a common preprocessing bottleneck.
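To make the format list above concrete, here is a minimal sketch of how a client-side script might sort a mixed batch of files before uploading them. Everything in it is an assumption for illustration: the `FORMAT_FAMILIES` table, the family names, and the `classify` and `plan_ingestion` helpers are hypothetical and are not part of RAGFlow's API. The Q&A and OCR hints simply restate what the text above says about DOCX, Markdown, PDFs, and images.

```python
from pathlib import PurePath

# Hypothetical mapping built from the formats listed above. It is not
# RAGFlow's internal table, and the family names are illustrative only.
FORMAT_FAMILIES = {
    ".pdf": "pdf",
    ".docx": "word",
    ".xlsx": "spreadsheet",
    ".csv": "spreadsheet",
    ".pptx": "presentation",
    ".txt": "text",
    ".md": "markdown",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".html": "web",
    ".json": "json",
}

# Families the text above describes as supporting Q&A parsing.
QA_CAPABLE = {"word", "markdown"}

# Families the text above describes as OCR candidates
# (images always, PDFs when they are scanned).
OCR_CANDIDATES = {"image", "pdf"}


def classify(path: str) -> dict:
    """Return the format family and parsing hints for one file path."""
    suffix = PurePath(path).suffix.lower()
    family = FORMAT_FAMILIES.get(suffix, "unsupported")
    return {
        "path": path,
        "family": family,
        "qa_parsing": family in QA_CAPABLE,
        "may_need_ocr": family in OCR_CANDIDATES,
    }


def plan_ingestion(paths):
    """Group paths by format family, keeping unsupported files separate."""
    plan = {}
    for p in paths:
        info = classify(p)
        plan.setdefault(info["family"], []).append(info)
    return plan
```

Calling `plan_ingestion(["faq.docx", "scan.pdf", "archive.zip"])` would group the first two files under their families and place the third under `"unsupported"`, so a script could upload the supported files in their original formats and report the rest.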
In production environments, storing and retrieving embeddings efficiently requires purpose-built infrastructure. Zilliz Cloud handles this as a managed vector database service, while Milvus offers the same capabilities for self-hosted deployments.
