Multimodal RAG with Milvus and GPT-4o
Webinar
About the session
We've seen an influx of powerful multimodal capabilities in many LLMs. In this tutorial, we'll vectorize a dataset of text and images into the same embedding space, store them in Milvus, retrieve all relevant data given an LLM query, and input multimodal data as context into GPT-4o.
Topics Covered
- Multimodal embedding models
- Milvus multi-vector hybrid search
- Multimodal generative models
- Demo of multimodal RAG using Milvus and Zilliz
Today I'm pleased to introduce the session, Multimodal RAG with Milvus and GPT-4o, and our guest speaker, Jiang Chen. Jiang is Head of AI Platform and Ecosystem at Zilliz. He has years of experience in data infrastructures and cloud security. Before joining Zilliz, he served as a tech lead and product manager at Google, where he led the development of web-scale semantic understanding and search indexing that powers innovative search products such as short video search. He has extensive industry experience handling massive unstructured data and multimedia content retrieval.
He has also worked on cloud authorization systems and research on data privacy technologies. Jiang holds a master's degree in computer science from the University of Michigan. Welcome, Jiang. Hi. Thanks for the introduction.
Really excited to be here. First, a short introduction of myself. My name is Jiang, and I come from a search and data infrastructures background. Right now I'm working on the ecosystem and developer experience of Milvus. Here are my email, LinkedIn, and Twitter.
So if you'd like to follow me on those social channels, I'll be really happy to connect and chat afterwards. Today my sharing is about Multimodal RAG with Milvus and GPT-4o. I believe many in the audience have already heard about RAG; multimodal RAG is still a relatively new thing, and it has been advancing quickly in the past few months.
At Zilliz, we are the creator of the high-performance, highly reliable vector database called Milvus, and also Zilliz Cloud, which is a fully managed Milvus on the cloud. We have done a lot of research and experiments with multimodal RAG, and we have seen it applied in many of our customers' use cases. So today I will share the background and history of multimodal, specifically information retrieval around multimodal data, and then we'll have a quick demo of multimodal RAG with Milvus in action.
So without further ado, let's jump right into the topic. Talking about multimodal, we have to talk about information retrieval, because that is kind of the starting point of why multimodal has become so interesting. Information retrieval in the multimodal setup kind of started from CLIP, which is an architecture that can embed both the text modality of information and the image modality of information into the same semantic latent space. Using this approach, we can search, say, an image with text, and we can do the opposite, which is to search text based on an image.
The idea behind this approach was pretty much to align the text encoder and the image encoder into the same latent space. What preceded this were machine learning models, specifically deep learning models, that can classify images into different semantics. However, the limitation there is that you can only predefine a finite list of categories to classify into. For example, you can define a limited list of semantics such as cat, car, dog, and then train a model so that the classifier can detect with what probability a picture looks like a dog or a cat.
However, this has its limitation, because it is not so flexible in practice. A lot of times we just want to know the semantics of an image without a predefined list of semantics to start with. With some smart and careful training techniques, the CLIP model aligns the text encoder and image encoder, specifically, it aligns the generated vector embeddings of both models into the same latent space, so that if the information or semantics behind a piece of text and a picture are the same, they will have a very close distance in the latent space. This approach doesn't necessarily grasp the essence of the semantics behind the image; to my understanding, it is more about aligning two distributions into one. Another limitation of this approach is that it only supports searching from one modality of information at a time.
For example, if I want to search with a piece of text plus an image, this architecture won't do that. With this as a foundation, there were improvements and newer research on how to conduct information retrieval on text-plus-image information. But before we jump to the next stage of information retrieval with multimodal, here is a concrete example of what CLIP can do. Given an image, it can score the image against a set of semantics and then output the correct semantic with the highest probability.
For example, consider the closeness of this picture of an airplane to a list of semantics such as "a photo of a bird" or "a photo of a car". The model is trained so that it aligns the vector of this picture to the text piece "a photo of an airplane" and makes sure they have the closest distance in the latent space, rather than to the other, incorrect semantics. So, as mentioned previously, this approach is more like training a classifier than something that can essentially grasp the semantics of either image or text, let alone text-plus-image data. Another nuance about this model is that it actually requires two models to do this cross-modality search: you need to embed the text with the text model and the image with the image model, even though together they form a pair of CLIP models.
The newer approach actually both combines the encoding of image and text data and applies some smarter training techniques to make the model better grasp the semantics behind that information. A representative of this approach is called VISTA, and the model trained with this approach is called Visualized BGE. This is work done by a research institution, and the novelty is that it establishes an in-depth fusion of text and image data compared to the previous approach. It can actually fuse the information from both image and text, and it still works for image-only or text-only data.
Another property is that it preserves the strong performance of the general-purpose text embedding model that it utilizes in this architecture. Here is a visual illustration of the model architecture. A typical example is that it's given both image and text data, which we call multimodal data here. For the image data, it uses a Vision Transformer encoder to extract a token representation of the image. A metaphor for this, an analogy, is that it kind of converts the image into a text-like representation of the same information.
Not exactly, but to some extent it converts the image, represented as a set of pixels, into a set of tokens carrying information that the pre-trained text encoder can actually comprehend. For the text data there's nothing special; it just uses a regular text tokenizer to tokenize the text into a set of tokens. What's magical here is that it then concatenates the tokens of the visual content and the tokens of the text content expressed in natural language, and uses the pre-trained text encoder to encode this information into one single vector. What's generated after this grasps the semantics behind both the image and the text. Moreover, it combines the semantics of the image and text, so that it can do this combined image-and-text retrieval.
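To make the fusion step easier to picture, here is a purely conceptual sketch; the function and module names are illustrative stand-ins, not the actual VISTA or Visualized BGE code:

```python
# Purely conceptual sketch of the fusion step; the names are illustrative and this
# is not the real VISTA / Visualized BGE implementation.
def embed_multimodal(image, text, vit_encoder, text_tokenizer, text_encoder):
    image_tokens = vit_encoder(image)             # ViT maps pixels to a sequence of token embeddings
    text_tokens = text_tokenizer(text)            # ordinary text tokenization, nothing special
    fused_sequence = image_tokens + text_tokens   # concatenate the two token sequences
    return text_encoder(fused_sequence)           # pre-trained text encoder pools them into one vector
```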
With this architecture, it overcomes the limitation of CLIP, which can only accept one modality of data at a time even though the model itself is cross-modality. And because the text encoder used here is a general-purpose text embedding model, it inherits that model's strong capability of understanding the semantics of everything in the world. So instead of just mechanically aligning the embeddings of image and text, it can actually do a better job of comprehending the semantics behind the image and text. Here we have a concrete example of the capability of this model. This is also an illustration of how the training data is generated offline.
But let's look at this example here. The purpose of this model is that, given a piece of text which is either related to this image or which describes how I want to edit this image, the retrieved result will look like the edited image. In this example the text actually just describes the image, "man on balcony", a print by some artist. If we just input this into the Visualized BGE model, it will retrieve something similar to this image, because they are pretty much talking about the same thing.
It's not really adding any extra information, so that's not very interesting. What's interesting is when the text describes something slightly different from this image, but still related to it. The purpose of this model is to retrieve that kind of information, and the construction of the training data also aligns with this principle: it uses a large language model and visual content generation techniques to construct synthetic data to help build the training dataset. It does this by first collecting a list of images with captions which describe what each image is about.
These come from an existing dataset, and then it develops a prompt which lets a large language model be creative in altering the content of the image. For example, in this case the altered information is still "man on balcony", but with a flock of birds. With this, it generates another synthetic image with that information. It also uses the large language model to summarize the new image caption into an editing instruction, to simplify and shorten it, and then uses a filter to remove things that don't really make sense. What's left is a synthetic dataset of a piece of content and an image, where the content describes what the image is about.
With this technique of constructing training data, it is able to produce a large amount of samples for training, and only with this large scale of samples is it possible to train a model that can reliably do the job we set out at the beginning, which is to retrieve a piece of content based on both the image description and the image itself. So this is one of the approaches in multimodal retrieval. The other approach, developed by Google recently, has arguably even more impressive results. It applies training techniques that utilize data at web scale, by scraping data from the public web and then doing data mining to produce the training dataset.
I'll go over the training technique in a bit, but let's look at some of the impressive results from this model. The idea is similar: given a query image and a piece of instruction expressed in natural language, it is able to find the image that expresses the combination of the information from both modalities. In the first example, I just want to find the identical image, with essentially no editing instruction, and it is able to retrieve the exact same image. This indicates that it has a pretty good baseline, although that itself is not doing anything interesting. What's interesting is that, given an instruction to compare its height to the world's tallest building, it is actually able to retrieve an image which shows a height comparison of similar buildings.
This is very impressive compared to other multimodal retrieval techniques. The other result here just shows another individual building that looks similar to this one, which is also a skyscraper, but it is not as close in terms of semantic closeness to the combination of both the image and text information. And this one is even more impressive: it instructs the model to find a picture looking from inside this building. I'm not sure there is exactly a picture of the view from this luxury hotel, but it looks like the model is able to exclude results like this one, which shows the same building but viewed from the outside.
And in the other one, I think it actually grasps some of the essence of the information, or shows some kind of intelligence: the instruction is to find the other attractions. I think this hotel is in the UAE, so this is probably the Palm Islands, one of the famous attractions in the UAE. This indicates the model has some ability to reason and comprehends the semantics behind both the image and the instruction pretty well. In order to train a model like this, it requires a substantial amount of data, which is either manually labeled or mined from larger data sources such as the World Wide Web. Google actually utilized their deep expertise in managing web-scale data.
The way they construct the training examples is by first crawling public web pages' HTML and then extracting the images on the same webpage; because they are on the same webpage, they are semantically close to each other or have some kind of connection. It then uses a vision language model, sorry, a large language model, to comprehend the text descriptions on that webpage. Given that all of the information resides on the same webpage, it's very likely that the text is describing the relationship between two images appearing on that page. In this case, it is able to find the text piece on the webpage which describes that the bottom picture is a battery or a charger, actually a charger for the camera in the top picture. It then uses a large language model to rephrase and summarize the text piece from the original webpage to generate the instruction,
which describes with what piece of text a human being would get from the top picture to the bottom picture. This process is then run at a large scale by scraping the whole web and doing more sophisticated grouping and cleaning, so that images with similar semantics are grouped together and the instructions are generated as described previously. Eventually it is able to produce a large set of high-quality training data, which looks like this example here: it describes with what kind of transformation or editing the right image will match the left image, or vice versa. During the training phase, it uses the left image and the instruction to find the image on the right.
With correction and large-scale training, it is able to distill this information into an embedding model that can achieve this job. So that is the theory behind multimodal retrieval. To apply it in reality, or in production, we actually need infrastructure that can store the vectors generated from these multimodal embedding models and retrieve them efficiently at scale. Moreover, we can plug this information retrieval stage into the large language model generation phase, the so-called retrieval-augmented generation approach. Instead of a traditional retrieval-augmented generation setup, which only retrieves text information and sends it to a large language model, this approach can retrieve image information and send it to a vision language model, or a multimodal large language model, and then build interesting applications.
In terms of vector storage and retrieval, we can use Milvus, which provides two forms of deployment. It has the lightweight Milvus Lite, which can be deployed locally on a laptop or in notebooks, and in today's demo we'll actually use Milvus Lite to showcase this multimodal retrieval and multimodal generation. It also has more performant Docker and Kubernetes deployments, and the fully managed Milvus on Zilliz Cloud, which is more suitable for larger-scale retrieval workloads. The Milvus project is also integrated with many frameworks in the RAG ecosystem.
For example, in LlamaIndex there's a multimodal retrieval component, and working with these frameworks in the open-source community makes it easier to develop more comprehensive RAG and other AI-powered applications. Once we've chosen the infrastructure stack, we need to design a data model that can fit the multimodal information into a database like the Milvus vector database. There are two ways of designing this data model, depending on what kind of embedding model you are using. For example, if you are using single-modality embedding models, you can still handle this workload by applying a hybrid search technique: it stores the vector embeddings of image and text separately.
These vectors can be generated by two embedding models. For example, you could have a Vision Transformer, which is only able to comprehend visual content, and then use a traditional text embedding model to embed the text information; or you could use CLIP. Either way, you store them in separate vector fields. At retrieval time, you embed the target information. For example, you could have two pieces of target information:
one in text, the other in an image. You embed them separately to get two query vectors, query the two vector fields separately, and then combine the results with merging techniques such as reciprocal rank fusion (RRF) or a weighted ranker. You merge them into one single list, say by RRF, and end up with a list of IDs. With that information you can combine the results and present them in your application's UI, or you could send them further to a multimodal large language model for additional comprehension; a sketch of this flow is below.
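For reference, a minimal sketch of what that two-vector hybrid search could look like with pymilvus (assuming pymilvus 2.4 or later; the collection name, field names, and vector dimension are illustrative, not taken from the webinar):

```python
from pymilvus import connections, Collection, AnnSearchRequest, RRFRanker

# Connect to Milvus Lite (a local file) and open a collection that has two vector
# fields, "text_vector" and "image_vector"; names and dimension are illustrative.
connections.connect(uri="hybrid_demo.db")
collection = Collection("products")
collection.load()

text_query_vector = [0.0] * 512    # placeholder: replace with the text model's embedding
image_query_vector = [0.0] * 512   # placeholder: replace with the image model's embedding

# One ANN request per modality, each with its own query vector.
text_req = AnnSearchRequest(
    data=[text_query_vector],
    anns_field="text_vector",
    param={"metric_type": "COSINE"},
    limit=20,
)
image_req = AnnSearchRequest(
    data=[image_query_vector],
    anns_field="image_vector",
    param={"metric_type": "COSINE"},
    limit=20,
)

# Reciprocal Rank Fusion merges the two ranked lists into a single top-10 list.
results = collection.hybrid_search(
    reqs=[text_req, image_req],
    rerank=RRFRanker(),
    limit=10,
    output_fields=["image_url"],
)
```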
The other technique is to use a multimodal model that can take both text and image information as input and embed them into one single vector, so that this vector combines the information, or semantics, behind both pieces of information. At search time you don't need hybrid search anymore: you just use the same model, supply the query image and query text to the embedding model, generate one single query vector, and do a plain vector search in Milvus. A slight variation is that if you use the CLIP model, which technically is two models, you can still do this, but you need to be a bit more careful in defining the identifier of each entity. For example, the data structure can have one field which is either an image URL or text.
You pretty much treat the image and the text as two different pieces of information and store them in different rows, and for each row the vector can come from either the text or the image. During search, you can still search on the single vector field, but the returned result could be either text or image. For some applications this actually makes sense, but be careful to align the data structure design with your application's requirements, as in the sketch below.
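A minimal sketch of that CLIP-style data model, assuming pymilvus's MilvusClient; the field names, collection name, and the 512-dimension value (typical for CLIP ViT-B/32) are illustrative:

```python
from pymilvus import MilvusClient, DataType

client = MilvusClient("clip_demo.db")            # Milvus Lite: data persisted to a local file

schema = client.create_schema(auto_id=True, enable_dynamic_field=False)
schema.add_field("id", DataType.INT64, is_primary=True)
schema.add_field("vector", DataType.FLOAT_VECTOR, dim=512)        # one vector field for both modalities
schema.add_field("content", DataType.VARCHAR, max_length=2048)    # raw text, or the image URL
schema.add_field("modality", DataType.VARCHAR, max_length=16)     # "text" or "image"

index_params = client.prepare_index_params()
index_params.add_index(field_name="vector", index_type="AUTOINDEX", metric_type="COSINE")

client.create_collection(collection_name="clip_items", schema=schema, index_params=index_params)
```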
Okay, so let me check the questions before I jump into the live demo. I think we're good here; we have just one. I think it means pre-trained: how does the pre-trained text encoder understand the semantics of both image and text? Okay.
Going back a bit to the training phase: I wouldn't say it has the ability to understand text and image. It is through the training process, which aligns its distributions across the two modalities, that it appears to have the ability to comprehend the semantics of image and text. Let me give some intuition behind this. Just as a human being can learn something by looking at an enormous amount of examples, the training process for a deep neural network is kind of mimicking that. As long as the training data aligns the labeled output to the input, the training data is essentially telling the model, okay, the target looks like this.
So with this image, which is an island in the water, given this piece of text, the same island at sunset, you are supposed to point out this image. Just by repeatedly doing this, it approaches a state where it looks like it has understood the semantics behind both the image and the text. This process is general to any kind of training in deep neural networks. What's special about this one is that the model architecture is designed so that it can take two pieces of information as input, and then aligns the output not only to one modality of the input, but to a combination of information from both modalities. I'm not sure if that answers the question here.
Feel free to raise a follow-up question in the Q&A. Okay, I'll jump to the demo now. Let me set up and pre-run this so that we don't need to wait too long. This demo is supposed to do a multimodal search. In this example, we have an image, which is a picture of a cat, and an instruction in text, which is to find an earphone with the theme of this image.
It is then able to retrieve things that combine the semantics of both inputs. Specifically, I think this is the second-best result returned from the retrieval: it's actually a pair of headsets with cat ears, so very close to our intention here. In this demo, I will present how to implement this kind of multimodal retrieval, and at the end, I will have GPT-4o do re-ranking, which further enhances the results, so that the target result is actually ranked first.
GPT-4o is able to help us further enhance the retrieval quality. Before doing anything, we'll install a bunch of dependencies, including pymilvus, which is the client library for Milvus, and other utilities we'll use to process the images, download the embedding model weights, and then do inference. In this example we'll use the Visualized BGE model, which we introduced previously. In order to do inference with this model, we need to download the FlagEmbedding framework, which is a convenient wrapper for it. We also need to download some image data as the pool of target candidates that we want to search from; here we'll download the public Amazon Reviews dataset.
We'll also need to download the weights of the Visualized BGE model. As you can see, this model isn't that big; it's about 300 to 400 megabytes. Okay, now, with everything in place, let's build the encoder, which is the thing used to encode the text and image information into a single vector embedding. It uses the previously downloaded Visualized BGE model; formally it's called Visualized BGE English v1.
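The encoder is essentially a thin wrapper around FlagEmbedding's Visualized BGE class. A hedged sketch follows; the exact import path, constructor arguments, model name, and weight filename are assumptions and may differ slightly from the version used in the webinar:

```python
import torch
from FlagEmbedding.visual.modeling import Visualized_BGE   # import path may vary by FlagEmbedding version

class Encoder:
    def __init__(self, model_name: str, model_path: str):
        # model_name is the underlying BGE text model; model_path points to the
        # downloaded Visualized BGE weights (the ~300-400 MB file mentioned above).
        self.model = Visualized_BGE(model_name_bge=model_name, model_weight=model_path)
        self.model.eval()

    def encode_query(self, image_path: str, text: str) -> list:
        # Fuse an image and a text instruction into one query vector.
        with torch.no_grad():
            return self.model.encode(image=image_path, text=text).tolist()[0]

    def encode_image(self, image_path: str) -> list:
        # Image-only embedding, used when indexing the candidate pool.
        with torch.no_grad():
            return self.model.encode(image=image_path).tolist()[0]

# Model name and weight filename are assumptions; match them to what was actually downloaded.
encoder = Encoder("BAAI/bge-base-en-v1.5", "./Visualized_base_en_v1.5.pth")
```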
Now we're able to load the data. Let me try this. Okay, so here we're sampling 900 images from this dataset and then generating vector embeddings through this Visualized BGE model. Once these are generated, we'll store all of the vector embeddings in the Milvus vector database.
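Roughly, the sampling and embedding step looks like this, reusing the encoder sketched above; the image folder path is an assumption:

```python
import glob
import random

# Assumed location of the unpacked Amazon Reviews images; adjust to wherever the
# download step placed them.
image_paths = glob.glob("./images_folder/*.jpg")
sampled_paths = random.sample(image_paths, 900)   # 900 candidate images, as in the demo

# One embedding per image, keyed by its path (the path doubles as the identifier later).
image_dict = {path: encoder.encode_image(path) for path in sampled_paths}
```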
What happens here is that it reads the image folder and samples 900 images, and this encode_image call is an abstraction over the FlagEmbedding framework. Okay, with all of the embeddings generated and stored in memory, now we can load them into the Milvus vector database. Here we use Milvus Lite, which is pretty convenient: you only need to specify a local file, and it will persist the data into that file. You pretty much don't need to do anything other than defining the vector collection where you want to store the data.
And then just ingest the data. Let me run this. As you can see, it's pretty fast to ingest 900 entities; it's not that big of a load. To start Milvus Lite, you just need to import MilvusClient from pymilvus and define some parameters, like the dimension of the vectors and the name of the collection where you want to store them, and then you instantiate the MilvusClient with the name of a local file, milvus_demo.db.
With this newly created Milvus client, you can create a collection with the specified name and dimension, and we'll also enable the dynamic field here, which is used to store the metadata. If you want more efficient metadata filtering, which in some cases you'll need, for example when you want to do a semantic search on something but only for things that satisfy some criteria, like the publish time being greater than some value or the author being someone specific, then you can disable the dynamic field and define the schema formally. Say, you could define a data model like this:
you can define a publish timestamp in the schema, and then create what we call a scalar index, which is an index on a non-vector field. Then, when you do the search, you can ask for semantic similarity to something but only where the publish timestamp is greater than some Unix epoch. With that index in place, this can be done in a very efficient way compared to the dynamic schema, which doesn't have a scalar index. But for convenience, let's just use the dynamic field so that we can skip defining the schema formally. Then we just call insert on this client with all of the vectors generated previously, specifying the collection name, and all of the 900 vectors are ingested. A sketch of this client setup and ingestion is below.
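Put together, the Milvus Lite setup and ingestion described here looks roughly like this; the collection name is illustrative, and image_dict is the path-to-embedding mapping built earlier:

```python
from pymilvus import MilvusClient

dim = len(next(iter(image_dict.values())))          # embedding dimension, taken from the model output

milvus_client = MilvusClient("milvus_demo.db")      # Milvus Lite persists everything to this local file

milvus_client.create_collection(
    collection_name="multimodal_rag_demo",          # collection name is illustrative
    auto_id=True,
    dimension=dim,
    enable_dynamic_field=True,                      # metadata goes into the dynamic field, no formal schema
)

# One row per image: the vector plus the image path kept as a dynamic field.
milvus_client.insert(
    collection_name="multimodal_rag_demo",
    data=[{"image_path": path, "vector": vec} for path, vec in image_dict.items()],
)
```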
Okay, now we're ready to do a multimodal search on the data stored in the vector database. Let's select an image of a leopard and define the instruction as: I want an earphone that looks like this image. We encode both pieces, the text and the image, with the multimodal embedding model and generate a single vector. Then we use this single vector as the target vector and use the Milvus search API to conduct a search. Other than that, you just need to specify some parameters for the search, for example which fields in the schema you want the search to return. In this case, we care about the image path, which is here.
We treat it as the identifier of the image. Then there's the limit, which defines the top-K, how many results you want the search to return. In this case, nine means we want the nine candidate images that are closest to the target in terms of semantics. You can also define search parameters, which are the nitty-gritty details of vector search, such as the metric type; we want it to be cosine, though there are other metric types like IP (inner product). With this, we just need to extract the retrieved images from the search result and print them; the call looks roughly like the sketch below.
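A sketch of the search call; the query image path is an assumption, the instruction is a paraphrase of the one in the demo, and encoder and milvus_client come from the earlier sketches:

```python
query_image = "./leopard.jpg"                               # assumed path to the query image
query_text = "an earphone with the theme of this image"     # paraphrase of the demo instruction

# Fuse image + instruction into a single query vector with the same multimodal model.
query_vec = encoder.encode_query(image_path=query_image, text=query_text)

results = milvus_client.search(
    collection_name="multimodal_rag_demo",
    data=[query_vec],                                # one query vector in, one ranked list out
    output_fields=["image_path"],                    # the image path acts as the identifier
    limit=9,                                         # top-K: the nine closest candidates
    search_params={"metric_type": "COSINE", "params": {}},
)

retrieved_images = [hit["entity"]["image_path"] for hit in results[0]]
print(retrieved_images)
```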
So here, let's run it. This prints the image paths of the top nine images semantically similar to the query defined by the image and text. Let's look at what they are. Here, we combine the images into a five-by-five, sorry, a panoramic view, and also display the image we used as the target. I'll run this again, but here is what it looks like.
So it's actually a three-by-three panoramic view: on the left is the leopard image, and on the right are the top nine images, which look like this image but also carry the semantics of an earphone. Let's run this again. Okay.
Okay, a fun thing about live demos: I actually ingested the data twice, so I'm seeing duplicates in the results. Let me see, do I have time to fix it? Actually, I do want to fix it. Let me do this.
I'll just do a hack here by defining a different persistent file, which pretty much creates another Milvus vector database, and then I'll ingest the data again so that there won't be duplicates. Okay, let's run this again. Okay, here we go.
So now we have nine unique images, which look like a combination of the semantics here. However, if you look a bit closely, some of them look more similar to our target than others. For example, this one just looks like a pair of headsets, nothing special and nothing related to a leopard; it doesn't really have the fancy pattern. So let's use GPT-4o to do re-ranking. GPT-4o is very intelligent in the sense that it can comprehend the semantics behind this image information.
We have composed a single image which contains the target image and the set of candidate images with their ranking, and in the instruction to the large language model we tell it to do a re-ranking based on semantic closeness to both the text and the image. Let's see how GPT-4o performs on this workload. Most of the code is actually just for calling GPT-4o. What's interesting here is that, given the information describing the background, it specifically tells the model to provide a new ranked list of indices from most suitable to least suitable.
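A hedged sketch of that re-ranking call, using the OpenAI Python client; the prompt wording is a paraphrase of the instruction described above, not the exact prompt from the demo:

```python
import base64
from openai import OpenAI

openai_client = OpenAI()    # expects OPENAI_API_KEY in the environment

def rerank_with_gpt4o(panoramic_image_path: str, instruction: str) -> str:
    # The composed panorama (query image on the left, numbered candidates on the right)
    # is sent as pixels, not as vectors, together with the re-ranking instruction.
    with open(panoramic_image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    prompt = (
        "You see a query image on the left and nine numbered candidate images on the right. "
        f"The user's instruction is: '{instruction}'. "
        "Rank the candidates by semantic closeness to both the query image and the instruction. "
        "Explain your reasoning, then output the new ranked list of indices, most to least suitable."
    )
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```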
With this, let's run the generation from the large language model and display the results here. It actually returns this one, not so impressive in this case. Because there is some randomness behind the generation, let's see how it does the second time. But first, let's look at what kind of objective we are trying to achieve here: we're pretty much having GPT-4o do the reasoning.
We ask it to rank the nine images on the right based on its understanding of the semantics of the image on the left, and to give the reasoning behind it. One tricky technique in large language model prompting is that if you let it think out loud, it will usually do the reasoning a bit better. And it does give the reasoning, which is that the most suitable item is number two, because it looks more like an earphone and also carries the information from the image. I won't read all of them.
Let's regenerate this, actually, let's regenerate it here, and see how it performs the second time. Okay. This time it gives a slightly different result, which I think is better than the first one. That means it is not always stable in terms of performance as a language model, but this time it does give a more reliable result, and this pattern arguably is similar to the pattern of a leopard's skin.
And it is definitely a pair of earphones. I'm pretty sure if I generated again, it would likely give a different result. Okay, still the same one. So let's play with it a bit and change the query to another one.
Last time we were talking about earphones; this time how about we look at phone cases. I want a phone case that looks like the pattern of a leopard, but it should be a phone case rather than an animal. Let's run this and see what happens. This time it retrieves some results, and I believe they will be pretty different. Okay, here we go.
So the model does give, pretty much, all phone cases. To my human eyes, number six is actually a bit more promising because it has this beautiful pattern. This is already an impressive result: even though number six is not ranked first, it's still within a very limited set of results, and it gives one promising result. We just need to push it a bit further and rank the best one at the top, because we don't want to risk showing unrelated results to our end user.
So let's do the same thing: re-ranking with GPT-4o, and then print the final result here. Okay. This time, a very impressive result. It does pick this phone case, which is, first of all, a phone case, but it also features a leopard-print pattern that closely matches the intention behind both the text and the image.
Okay. That's pretty much what I wanted to share. I'll check if there are any questions from the audience. We do have some questions. So, is CLIP basically two models?
Can you explain again how that affects how you do retrieval in Milvus? Yes. CLIP is a cross-modality model, which means it can map the information of both text and image to the same latent space. But if you look at the details, it does use two different encoders: for images, CLIP uses the image encoder, and for text, the text encoder. Those are technically two different models with different model weights.
It affects the design of the retrieval system in two aspects. One is the embedding phase: you need to load two copies of model weights into your GPU memory, which is pretty much using two models. Then, if you are given a piece of text, you use that part of GPU memory and that particular model to do the inference, and otherwise the other one. That's one piece. The other piece is data storage.
Here, I actually didn't show how to store the embedding results of CLIP exactly, because if I were using CLIP, I would very likely design the data model so that either an image or a text is a unique entity, and identify them separately. The way I would design the data model is to have one single vector field which stores the CLIP embedding vector, either from the text model or from the image model.
Then I would have a shared non-vector field, probably called something like image URL or text, and in that field I would just store a string, which can be either the raw text or, say, the URL of the image. I would probably also have another field which defines the type of this piece of information, whether that's text or image, using an enumerator or just a number to indicate it. Then, during ingestion time, I would first detect whether this is an image or text, and then decide which model to use for inference.
We then generate a vector embedding and store it in the vector field, and based on whether it's text or image, we store the single content field with the representation of that information: for an image the representation is its URL, and for text it is the text itself. There's a follow-on to that question: do you mean that CLIP encodes images into one vector space and text into a completely different vector space? No, exactly not. Even though it uses two different sets of model weights for encoding, it actually encodes the text and image into the same latent space. That's why it can do cross-modality search, where you can search for an image of a dog with the piece of text "dog".
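Tying those two answers together, a rough sketch of that ingestion-time branching with CLIP; the encode functions are placeholders for the two CLIP encoders, and the collection matches the schema sketched earlier:

```python
# Sketch of ingestion into the single-vector-field CLIP schema sketched earlier.
# encode_image_clip / encode_text_clip are placeholders for the two CLIP encoders.
def ingest_item(client, item: str, modality: str) -> None:
    if modality == "image":
        vector = encode_image_clip(item)    # item is an image URL or path -> CLIP image encoder
    else:
        vector = encode_text_clip(item)     # item is raw text -> CLIP text encoder
    client.insert(
        collection_name="clip_items",
        data=[{"vector": vector, "content": item, "modality": modality}],
    )
```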
Great. Here's another question for you. How does GPT-4o know the semantics of our custom model? Is it because it understands the raw vector data, or because the embedding types are the same? Wow, that's a very good question. It actually does not understand the vectors; it understands the image pixels.
Here, in the re-rank with GPT-4o phase, we actually send this literal image, a set of pixels, to GPT-4o, rather than, say, ten vectors. So it doesn't know about the vectors per se, but it's still pretty impressive, because first, there's a very complex instruction, down below, telling it, okay, you need to re-rank the things on the right of the image based on the semantics of such and such. Secondly, it can connect the information expressed in the instruction to this image and comprehend the image, even though the image is just a set of pixels. To me it's still pretty impressive that it can do such a great job of, say, picking number six out of these nine images. Even its level of comprehension of the spatial information is very impressive: it knows that the candidates are numbered 0, 1, 2 horizontally rather than vertically.
That's kind of, I don't know, futuristic already. The other thing to mention is that even though GPT-4o can do this very impressive job of selecting the related content, doing it at scale would be extremely expensive. So even though, seemingly, the multimodal retrieval model doesn't do as good a job as GPT-4o, it still has a strong advantage in that you can do this at scale. You can save time by paying some storage cost through offline indexing: using the cross-modality model to index the images into vectors and storing them in the vector database. At retrieval time, the retrieval is close to instantaneous, compared to GPT-4o only selecting one out of nine.
I think last time it took about five seconds to do that kind of generation, or reasoning. For vector search, let's check, it is probably less than a second, because there are only 900 candidates to search from. Yeah, it says zero seconds.
Yeah, I mean, it's pretty fast; I believe it is around 10 milliseconds or so. So there is still a lot of benefit in applying the traditional technology of search, which is that you do a lot of work offline through indexing, and then online, during query serving time, it can be very efficient and applicable to large-scale serving. Only for some highly valued use cases will it be worth using GPT-4o and burning your cash on large language model tokens to do this kind of fine-grained processing of information. Thank you.
There's a request: can we make this demo code available? Oh, yes. This is publicly available on the Milvus website. You can go to the Milvus docs, or you can jump directly to it; by scrolling down a bit, you'll find RAG, image search, and also multimodal examples. This will navigate you to a hosted online demo which you can play with, and in the docs we have the instructions on how to implement this yourself, which covers what I've just shown. You can also deploy this with things like Streamlit, so that you have a frontend experience.
I just need to click on this and it will take you to GitHub, so that you can deploy a demo with a UI experience similar to this on your local laptop or notebook. And we can go ahead and share those links out in the follow-up email with the recording as well, so keep an eye out for those. Yes. And on this online demo, you can just try a preset list of examples.
For example, here you can search for a toy version of this machine, and you can also do re-ranking, which will shuffle the ranking of the search results. So, we've kept everyone over by a couple of minutes; we've got one last question and then we'll call it a day. Is there any image generation using DALL·E-like models in multimodal RAG? For instance, if you ask for earphones like a leopard, it's only searching. Why is it not generating a completely new image using the searched images of a leopard and earphones? That's a long question.
Let me see. It's in the Q&A box. Yeah. So first of all, you can take a more creative approach to this kind of demo by searching for something and then sending the search results to a stable-diffusion kind of model to generate something out of it. Let me check the other part of the question, one sec. As to why it's not generating a completely new image: the reason the demo I showed wasn't generating anything is that I wasn't telling it to generate. I was doing something a bit different from generation; well, essentially it is also generation, but it is not generating new information. It is just generating a piece of text which tells me the best image, out of the things shown in that three-by-three box, that matches the image and my instruction best.
So essentially it is still generating insight, but not generating another image per se. So, something we might see on an e-commerce website, like looking for a specific type of product using different inputs, correct? Yes. A more creative version of this in a real-world application could be: you retrieve some images similar to the semantics expressed by both the image and the text, and then you apply your business logic to generate something, so that you inspire the user to buy something really interesting, or generate something for artists to draw from. Jiang, thank you so much for this great session. Thank you to all of you who have joined us.
We will send out the recording as well as the links he shared today. Thank you for staying up late; I know it's late in your time zone, so we really appreciate it, and we hope to see all of you on a future webinar. Thanks so much. It's a pleasure.
Thanks everybody.
Meet the Speaker
Jiang Chen
Head of Ecosystem and Developer Relations
Jiang is currently Head of Ecosystem and Developer Relations at Zilliz. He has years of experience in data infrastructures and cloud security. Before joining Zilliz, he had previously served as a tech lead and product manager at Google, where he led the development of web-scale semantic understanding and search indexing that powers innovative search products such as short video search. He has extensive industry experience handling massive unstructured data and multimedia content retrieval. He has also worked on cloud authorization systems and research on data privacy technologies. Jiang holds a Master's degree in Computer Science from the University of Michigan.