Yes, UltraRAG can answer questions from images. UltraRAG is designed as a multimodal Retrieval-Augmented Generation (RAG) framework, and recent versions, particularly UltraRAG 2.1, add first-class support for visual inputs. This allows it to retrieve information from images and use that information to answer user queries.
Specifically, UltraRAG 2.1 introduced native support for text, vision, and cross-modal inputs across its Retriever, Generator, and Evaluator modules. The update includes a dedicated "VisRAG Pipeline" that can parse documents such as local PDFs, extract both textual content and charts, and build cross-modal indexes. This enables "image-to-text" and "text-to-image" hybrid retrieval, which suits tasks like analyzing scientific papers or answering questions from technical manuals, documents that often contain embedded images, diagrams, and charts. The framework's modular architecture makes it straightforward to integrate the models and tools needed to process and understand visual information within a RAG workflow.
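The core idea behind "image-to-text" and "text-to-image" hybrid retrieval is that text chunks and images are embedded into one shared vector space, so a query in either modality can retrieve entries of the other. The sketch below illustrates that idea only; it does not use UltraRAG's actual API, and `_pseudo_embed` is a hypothetical stand-in (a real system would use a CLIP-style vision-language encoder, with image entries embedded from pixels rather than from a description string).

```python
import hashlib
import math
import random

DIM = 64

def _pseudo_embed(description: str) -> list[float]:
    # HYPOTHETICAL stand-in for a shared text/image encoder: a
    # deterministic unit vector derived from a description string.
    seed = int.from_bytes(hashlib.sha256(description.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

class CrossModalIndex:
    """Minimal cross-modal index: text and image entries share one
    vector space, so a text query can surface image entries and
    vice versa (the hybrid retrieval described above)."""

    def __init__(self):
        self.entries = []  # (doc_id, modality, vector)

    def add(self, doc_id: str, modality: str, vector: list[float]) -> None:
        self.entries.append((doc_id, modality, vector))

    def search(self, query_vec: list[float], top_k: int = 1):
        # Cosine similarity; vectors are already unit-normalized.
        scored = [(sum(x * y for x, y in zip(vec, query_vec)), doc_id, modality)
                  for doc_id, modality, vec in self.entries]
        scored.sort(reverse=True)
        return scored[:top_k]

index = CrossModalIndex()
# Entries extracted from a parsed PDF: one text chunk, one chart image.
index.add("pdf-p3-text", "text", _pseudo_embed("training loss discussion"))
index.add("pdf-p4-chart", "image", _pseudo_embed("GPU memory usage over time"))

# A text query retrieves the chart image: image-to-text/text-to-image
# retrieval reduces to nearest-neighbor search in the shared space.
score, doc_id, modality = index.search(_pseudo_embed("GPU memory usage over time"))[0]
print(doc_id, modality)  # → pdf-p4-chart image
```

The deliberate simplification here is that the "image" embedding comes from its caption text, which guarantees alignment for the demo; in a real VisRAG-style pipeline the alignment is learned by the encoder.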
To support these multimodal capabilities, UltraRAG provides tools for managing knowledge bases that span diverse document formats, and its design allows flexible configuration and orchestration of components for scenarios where visual data matters. For instance, a vector database such as Milvus can be integrated into an UltraRAG pipeline to store embeddings of multimodal content (including image features) and perform fast similarity searches, so the system can efficiently find and use visual information relevant to a given question.
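To make the vector-database role concrete, here is a simplified sketch of the retrieve-then-prompt step. It uses a plain in-memory store in place of Milvus (a real deployment would hold the same vector-plus-metadata records in a Milvus collection), and all class and field names are illustrative, not UltraRAG's actual API:

```python
import math
from dataclasses import dataclass

@dataclass
class Record:
    vector: list   # embedding of a text chunk or an image region
    modality: str  # "text" or "image"
    source: str    # e.g. "manual.pdf#page=4"
    content: str   # the text chunk, or a caption/reference for an image

class InMemoryVectorStore:
    """Illustrative stand-in for a vector database such as Milvus:
    stores embeddings with metadata and answers top-k cosine searches."""

    def __init__(self):
        self.records: list[Record] = []

    def insert(self, record: Record) -> None:
        self.records.append(record)

    def search(self, query: list, top_k: int = 2) -> list[Record]:
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)
        return sorted(self.records, key=lambda r: cosine(query, r.vector),
                      reverse=True)[:top_k]

def build_prompt(question: str, hits: list[Record]) -> str:
    """Assemble a generator prompt that interleaves retrieved text and
    image references, as a multimodal RAG pipeline would."""
    lines = [f"Question: {question}", "Context:"]
    for r in hits:
        tag = "[IMAGE]" if r.modality == "image" else "[TEXT]"
        lines.append(f"{tag} ({r.source}) {r.content}")
    return "\n".join(lines)

store = InMemoryVectorStore()
store.insert(Record([1.0, 0.0], "text", "manual.pdf#page=2",
                    "The reset button is on the rear panel."))
store.insert(Record([0.0, 1.0], "image", "manual.pdf#page=4",
                    "Diagram: rear panel connector layout"))

query_vec = [0.1, 0.9]  # toy embedding of the question
prompt = build_prompt("Where is the reset button?", store.search(query_vec))
print(prompt)
```

Swapping the in-memory class for a Milvus client changes only where the vectors live; the pipeline shape (embed, search, assemble a multimodal prompt for the generator) stays the same.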
