Webinar
Exploring Sparse and Dense Embeddings: A Guide for Effective Information Retrieval with Milvus
Join the Webinar
About the Session
Sparse vectors are very high-dimensional but contain few non-zero values, making them suitable for traditional information retrieval use cases. Typically, the dimensions represent different tokens in one or more languages, with the value assigned to each indicating its relative importance in that document. Dense vectors, on the other hand, are embeddings from neural networks whose values, when combined in an ordered array, capture the semantics of the input data. These vectors are typically generated by text embedding models and are characterized by most or all elements being non-zero.
What you'll learn:
- Ins and outs of both sparse and dense vectors
- Differences between sparse and dense vectors
- When you’d want to use one over the other (or both in conjunction)
- Examples of how to use both in Milvus
Okay, so today I'm pleased to introduce today's session, Exploring Sparse and Dense Embeddings. Our guest speaker is Frank Liu. Frank is the Director of Operations and Head of AI and ML at Zilliz. Frank worked as an ML software engineer at Yahoo. His passion for ML extends beyond the workplace.
In his free time, he trains ML models and experiments with unique architectures. Frank holds BS and MS degrees in electrical engineering from Stanford University. So welcome, Frank. Thank you, Christy. I appreciate the introduction.
I want to thank everybody today for coming to listen to me ramble about sparse and dense embeddings, and especially the folks who came to my last webinar, I want to say maybe about a month ago. That one was purely about dense embeddings, more an introduction to embeddings. Why are they important? How could you potentially use them inside of a vector database? What are some things to watch out for, maybe some pitfalls, how to select the right embedding model, and so on and so forth. Today's talk is very much going to be an extension of that. I want to talk a little bit about sparse and dense embeddings, and it's a really, really exciting time when it comes to both Zilliz and Milvus as well, because with the release of Milvus 2.4, we have the capability to do not just dense vector search, as we had demonstrated many, many times in the past, but also sparse plus dense, what folks today call hybrid search, as well.
So, again, welcome everybody. Thank you so much for joining, and let's hop right into it. I want to give everybody a quick refresher first, and I know we have some folks here who are pretty new to vector search, pretty new to vector databases, and to sparse vectors in particular as well. I want to talk a little bit about the motivation very, very briefly before we dive into everything.
And really, if you think about it, vectors unlock unstructured data. Oftentimes we talk about these embedding models. We talk about these dense embedding models and the capability to use them to perform really, really rich semantic search, not just over documents, but also over things like images, video, and audio as well. Once you have those vectors, you can store them inside of a vector database such as Zilliz Cloud or Milvus, and do really, really large-scale approximate nearest neighbor search. And each vector is a great representation of your input data.
Right. Now, if we move forward a little bit here, folks who were on my last webinar probably remember this slide in particular. This is a little bit old now, but I think the idea is still the same. Embedding models really are workhorses, and this is a comparison between downloads for Llama 2 on Hugging Face and downloads for a fairly popular, but a little bit outdated, sentence transformer model called all-MiniLM-L6-v2. I think there's a really, really stark contrast in terms of the number of downloads: 5 million for this embedding model and about 800,000 for Llama 2 7B, right? So I think it just goes to show you there's a lot of interest in embedding models.
They're used everywhere. And it's important for us to be able to understand not just how dense embeddings work, but also how sparse embeddings work as well. What are some of the instances where we might want to use them? And most excitingly, and I think most relevantly, how might you use them inside of Milvus 2.4? So, moving forward from there, you know, there's a lot of embedding models.
Again, this is a slide from the previous webinar, just to give everybody a quick rehash of some of the material that I covered there, in case you didn't have time to go through that presentation just yet. And then there's lots of ways to visualize these dense embeddings as well. This is from Arize's Phoenix library. It's a great tool for understanding distributions and shifts in the distribution of your data, for looking at individual embeddings and understanding, okay, hey, maybe which particular clusters are a little bit problematic for you. But this is a great example of what dense embeddings really look like in this high-dimensional space, right? They occupy at least the three dimensions.
This is, you know, dimensionality-reduced to the three dimensions they occupy. They can occupy really any point in this high-dimensional space. And that's the whole idea behind dense embeddings. And then there are a lot of ways to generate these dense embeddings as well. Recurrent neural networks are one of the options.
You also have Sentence-BERT, the idea of being able to take two sentences and compare them with cosine similarity. And this in particular is a bi-encoder. It gives you one embedding per individual text, and it could be a short sentence or a longer-form document, depending on the token length for the BERT model that you're using. And really, once we have these dense embeddings, it's really, really nice to be able to store them inside a vector database and to be able to do semantic search across them. And that's really the basis for a lot of applications that leverage semantic search out there, right? Retrieval-augmented generation: finding documents that match your prompt and inserting them into your large language model, right? Things like personalized search or recommendation as well can all be done with text embedding models.
But I want to move forward to really the meat of the topic today, which is sparse embeddings, right? Sparse embeddings are not a new topic. They've been around for a long, long time, and really, I think they are in some ways complementary to dense embeddings. I'll go over a little bit of that later in this session as well. But the whole idea behind sparse embeddings is this: with dense embeddings, for every dimension that you have, let's say you have 768 dimensions, there is a non-zero value for every single dimension. Maybe you'll have some dimensions that do have zero values, or if you use, let's say, ReLU-activated dense embeddings, maybe half of those values will be zero.
But still, the vast majority of those values will be non-zero, right? They will have some value to them. That's the whole idea behind dense embeddings. But with sparse embeddings, you might have much, much higher dimensionality. You could have, let's say, 10,000, 20,000, maybe even 50,000 dimensions. But the big difference between sparse embeddings and dense embeddings is that very, very few of those dimensions are actually activated.
Very few of those dimensions are actually non-zero. And that's the whole idea behind sparse embeddings. We'll get into a little bit of the differences and how you can generate these sparse embeddings as well. But I want to say first, you know, dense embeddings are excellent. They're great, right? You have all these different embedding models.
They generate really great dense embeddings for you to use in your vector database. So why do we need sparse embeddings, right? Why are we interested in that? The reason is that dense embeddings lack lexical information. Whereas they're great at understanding the semantics of your input data, understanding the intent behind, let's say, your sentence or your prompt, they lack that keyword information. They lack that lexical information that oftentimes so many applications really, really need and require. And I'll give you a couple of examples of that, right? When we talk about search, I mean, imagine how you search for things.
Let's say on Google or on Bing, what we are really doing is lexical search. We're doing keyword search, and I'll give you a couple of examples of that, right? Here are some common Google searches out there today. Rubik's cube algorithm: what is the algorithm for solving a Rubik's cube? How many cups are there in a quart? How to tie a tie. We're all very, very keyword-focused when it comes to Google search, and when it comes to search in general. If we didn't have these keywords, if we didn't have these, let's say, nouns or adjectives to describe what it is that we're trying to search for, what could we say for that first one instead of "Rubik's cube algorithm"?
Maybe we could say something like, how do I solve a three by three by three rotatable cube? Something like that, right? In that instance, yes, semantic search and pure dense search would be much, much better. But oftentimes today, we are used to keyword, lexical search, and that is really where sparse vectors come into the picture, right? So vector databases such as Milvus today not only support dense vectors, but they support sparse vectors as well, to really give you a nice combination of lexical and semantic search. And again, as I mentioned already, keywords play a really, really important role in how we do not just Google searches, but also searches in general today, right? It's very, very keyword-based. I gave a couple of examples here already, but hopefully that gives you a general idea of what I'm going after here. Lexical search, at the end of the day, is really, really superior for out-of-domain data.
So if I have an embedding model that, let's say, is trained on legal documents, but then all of a sudden I start getting a lot of financial documents as my input data, that is when lexical search, that is when keyword-based search with sparse vectors such as TF-IDF or BM25, is really, really going to outshine that dense embedding model that I have, simply because that dense embedding model was not trained on that kind of data, right? So, moving forward, there are many ways to combine sparse and dense embeddings as well. One of the ways is just a simple weighted average, but I think another common one today is something called reciprocal rank fusion, RRF. I won't get too much into the details here, but the idea is that you have, here, three different ways of ranking your documents. You have BM25 with title boosting, BM25 with content boosting, and then you have semantic search from those dense vectors.
And RRF is simply a way for you to be able to rank the importance of each document given your different methods of ranking, right? It's a commonly used method in information retrieval, not just for text retrieval, but for other modalities of information retrieval as well. And again, there are many, many different ways that you can combine sparse and dense embeddings. Weighted average is another simple way that I already mentioned. If you have a score for, let's say, TF-IDF or BM25 sparse vectors, and you have a score for dense vectors, you can simply say, I will weight the dense embeddings with, let's say, 0.6 and the sparse vectors with 0.4, and now add them together.
Yeah, that's one way to do it, right? And then RRF is another one, and there are many, many other ways to do it out there as well, but I won't get into all those today. The reason why I mention RRF is because we are going to use that in a little bit, when we demo sparse-dense vector search in Milvus 2.4. So stay tuned for that for sure.
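[Editor's sketch] The two fusion strategies described above can be illustrated in a few lines of plain Python. This is a minimal sketch, not the code from the talk: the document ids, the example scores, and the smoothing constant `k=60` (a commonly used default for RRF) are assumptions made for illustration.

```python
from collections import defaultdict

def weighted_sum(dense_scores, sparse_scores, w_dense=0.6, w_sparse=0.4):
    """Combine per-document scores with fixed weights (0.6 / 0.4 as in the talk)."""
    doc_ids = set(dense_scores) | set(sparse_scores)
    return {
        d: w_dense * dense_scores.get(d, 0.0) + w_sparse * sparse_scores.get(d, 0.0)
        for d in doc_ids
    }

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists (best first): score(d) = sum of 1 / (k + rank)."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Return document ids ordered by fused score, highest first.
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical rankings from three retrievers, mirroring the slide's setup.
bm25_title = ["doc_a", "doc_b", "doc_c"]
bm25_content = ["doc_b", "doc_a", "doc_c"]
dense = ["doc_b", "doc_c", "doc_a"]

fused_ranking = reciprocal_rank_fusion([bm25_title, bm25_content, dense])
combined = weighted_sum({"doc_a": 0.9, "doc_b": 0.2}, {"doc_a": 0.1, "doc_b": 0.8})
print(fused_ranking)
print(combined)
```

Note that RRF only looks at ranks, so it does not require the sparse and dense scores to be on a comparable scale, whereas the weighted sum does.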
And then I want to go over very, very briefly some algorithms for generating sparse embeddings. While I mentioned some already, I'm only going to go over what I believe to be the two most important ones today, or the most relevant ones at least. There are many algorithms that are offshoots from these, but these are pretty well-known-ish, and they help in understanding a lot of these other sparse embedding algorithms as well. The first is what I like to call pure statistical sparse embeddings.
So, something like TF-IDF. For TF-IDF, I've got some fancy-looking equations here, but really, there are only two components. There's the first component, which is the term frequency component, and then the second component, which is the inverse document frequency. Sounds fancy, I know, but no worries. Term frequency just means the number of times that a particular term appears in a document, and inverse document frequency means the inverse of how often that term appears across all the documents in my corpus.
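[Editor's sketch] The two components can be made concrete with a small pure-Python example. It assumes the classic unsmoothed form, tf(t, d) x log(N / df(t)), which may differ from the equation on the slide; the three documents are hypothetical, and libraries typically use smoothed variants, so exact values will differ.

```python
import math
from collections import Counter

# Hypothetical three-document corpus (not the documents used in the demo).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    """Sparse TF-IDF vector for one document, as a {term: weight} dict."""
    tf = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }

vectors = [tf_idf(tokens) for tokens in tokenized]
print(vectors[0])
```

A term that appears in every document ("the") ends up with weight zero, while a term unique to one document ("mat") gets the highest weight in that document.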
So again, this equation might look a little funky, might look a little fancy, but we'll go over an example of this. We'll have three simple documents in a demo later on, I'll use a TF-IDF vectorizer, you'll be able to see what those vectors look like, and then we'll go from there. And then there is what I like to call learned sparse embeddings as well. And "learned" is a bit of a, you know, it's sort of both learned and lexical, I think, is a better way to put it. And this in particular is actually a representation.
This is a good drawing, or a diagram, from one of the engineers at Zilliz, of SPLADE, and SPLADE is what you could think of as a learned sparse embedding. The idea behind SPLADE is that it leverages BERT. If you remember from that last webinar that we had, BERT is trained with what is called masked language modeling. So at every single token position, it gives a distribution of probabilities of what could potentially be at that token. So let's go over this example here very closely, right? In this case, I'm running the sentence "Milvus is a vector database built for scalable similarity search" into a BERT model. And the output of that BERT model is actually not just a single token.
It's actually a distribution across different tokens. So let's take the word "built" in "a vector database built", for example, right? In addition to the word "built", it could also be "created". That is another possibility that the BERT model gives me. Now, what I do in this instance is take "created", which has a bit of a score assigned to it, and I take the aggregate of those scores at the very end across my entire vocabulary, and that ends up being my sparse vector. So you see here, it's not just giving me pure lexical keyword search, it's also doing a bit of term expansion.
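[Editor's sketch] The aggregation step described above can be sketched with made-up numbers. This is an illustration only: the six-term vocabulary and the logits are hypothetical, and the pooling shown, the maximum over token positions of log(1 + ReLU(logit)), is an assumption based on one published SPLADE variant (another variant sums over positions instead).

```python
import math

# Hypothetical tiny vocabulary; a real BERT vocabulary has tens of thousands of terms.
vocab = ["milvus", "vector", "database", "built", "created", "banana"]

# Hypothetical masked-language-model logits: one row per input token position,
# one column per vocabulary term.
logits = [
    [4.0, 0.5, 0.2, -1.0, -1.2, -3.0],  # position of "milvus"
    [0.3, 3.5, 1.0, -0.5, -0.7, -2.5],  # position of "vector"
    [0.1, 1.2, 3.8, -0.2, -0.4, -2.0],  # position of "database"
    [-0.5, 0.0, 0.4, 3.0, 1.5, -2.8],   # position of "built"
]

def splade_like_weights(token_logits):
    """Per-term weight: max over positions of log(1 + ReLU(logit))."""
    n_vocab = len(token_logits[0])
    return [
        max(math.log1p(max(row[j], 0.0)) for row in token_logits)
        for j in range(n_vocab)
    ]

weights = splade_like_weights(logits)
# Keep only the non-zero entries: this dict is the sparse vector.
sparse_vector = {vocab[j]: round(w, 3) for j, w in enumerate(weights) if w > 0}
print(sparse_vector)
```

Here "created" receives a non-zero weight even though it never appears in the input, which is the term expansion effect, while an unrelated term ("banana") stays at zero and is dropped from the sparse vector.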
So I'm expanding the term, in this case "built", and not just "built", actually, but the rest of these tokens too. I'm expanding all those tokens, and that gives me more information about what could potentially be in that slot. This is really one of the really important distinctions, right? That's why I call it a learned sparse embedding: because it gives you other terms that are similar to "built" that could also be in that particular slot. And that's why models like SPLADE and these other learned sparse embeddings are so powerful: they give you both a lexical and a bit of a semantic representation as well, right? It does give you a little bit of the meaning behind the original text. Alright, so moving forward, I want to talk very, very briefly about ColBERT too, on the topic of sparse vectors.
ColBERT is not a sparse embedding model. I actually would argue that ColBERT is not a model alone. It's more of a way of doing information retrieval. But the reason why I want to mention ColBERT is because it is very popular today, and it is one of the ways that you can actually augment your search and retrieval solutions.
So, vector databases. I want to take a quick step back from talking about sparse embeddings and dense embeddings and talk about vector databases. Previously, vector databases supported one vector, right? One vector corresponds to one piece of unstructured data. Great, right? We can do approximate nearest neighbor search over it, we can do filtered search over it, and it becomes a really great way of doing semantic search, recommendation, and so on and so forth. But with the introduction of sparse vectors, not only are we saying, hey, we're giving you the capability to do lexical search as well, we're also introducing the capability to do something called multi-vector search.
So with one document, or one piece of unstructured data, you can have multiple vectors associated with it. And that's really why I want to talk very, very briefly about ColBERT in this presentation, even though it doesn't really have anything to do with sparse vectors. If we go back to the example of BERT, one of the ways that you can compare two sentences is by taking both sentences and throwing them into a BERT model with a separator token, right? Then you have a fully connected layer, a dense layer, on top of that. And the output of that would be a score, between zero and one, or negative one and one, or whatever you want. It would be a score, but this is a really, really inefficient way of doing nearest neighbor search. Now, typically, you would use this kind of method to do re-ranking. For folks who are unaware of what re-ranking is:
It's essentially saying, once I have my top k, once I have my top 100 pieces of unstructured data, I'm going to re-rank them and maybe pull out, let's say, the top 10 or maybe the top 20. And that will really give me the most relevant results, right? That's what re-ranking is. Now, done in the traditional or naive way, if you want to think of it that way, it's very, very expensive. I've got to compare the query to all of my documents, and for all of them do this huge inference pass through BERT. And especially if my documents are very long, because transformer encoders and decoders are quadratic relative to the number of tokens that are input into them, it ends up taking a while, and a lot of compute as well.
And that's where ColBERT comes in, not to, say, fix things necessarily, but it significantly improves the speed with which you can do re-ranking, with which you can get relevant results. The idea behind ColBERT is that you still have your query, you still have your document, but now, for every output token embedding in the query, not a sentence embedding or a long-form embedding, but every token embedding, I compute a maximum similarity score. I compare it with every token embedding that I get from my document as well, right? And then I sum all those up, and that gives me a score. So now, if all those query tokens are very relevant to, let's say, one or more tokens in the document, then I probably have a very, very strong match. This paradigm is what ColBERT leverages to give you really, really solid results.
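[Editor's sketch] The scoring rule described above (for each query token embedding, take the maximum similarity against all document token embeddings, then sum) can be written in a few lines. The two-dimensional embeddings below are made up for illustration, and dot product is assumed as the similarity; a real model produces one much higher-dimensional vector per token.

```python
def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_embs, doc_embs):
    """Late-interaction score: sum over query tokens of the best-matching doc token."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)

# Hypothetical token embeddings (one vector per token).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_relevant = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
doc_irrelevant = [[-0.7, 0.1], [0.1, -0.6]]

score_relevant = maxsim_score(query, doc_relevant)
score_irrelevant = maxsim_score(query, doc_irrelevant)
print(score_relevant, score_irrelevant)
```

Because document token embeddings can be computed and stored ahead of time, only the query has to pass through the model at search time, which is where the speedup over the naive cross-encoder approach comes from.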
Now, again, I want to emphasize, right, ColBERT is not a sparse embedding model. ColBERT, at the end of the day, leverages BERT. It is actually more than a dense embedding model. It's a multi-vector dense embedding model, you can think of it that way. So queries and documents are no longer represented with a single embedding.
They're now represented with multiple token embeddings, in some cases hundreds or even thousands of token embeddings, depending on the size of your document and the size of your query. But because we are moving in the direction of not just dense vectors, not just sparse plus dense vectors, but more of a multi-vector solution, I thought it was a good time to talk very briefly about ColBERT and what the future potentially looks like for Milvus as well, right? So again, if you have any questions about any of this, feel free to stick them into the chat, and we'll get to them towards the end of this presentation.
But now it is demo time. Sorry about that, folks. I just dropped off there for a sec. So let me go back to my slides. Alright, can everybody see that okay? Christy, can we see that? Okay.
Okay, cool. Again, sorry about that. I've had choppy Wi-Fi over the past couple of days. So anyway, we'll get right back into it. The idea behind BGE-M3 is that it's multilingual.
It's multifunctional and multi-granular as well. By granularity, we mean that it supports both really, really short phrases and short sentences, and very, very long documents as well. It's got up to about 8,200 tokens of length, right? So this is the idea behind BGE-M3. And I want to focus a little bit of extra time on BGE-M3's sparse vectors in particular. It is not exactly the same as SPLADE, but it is SPLADE-like in that sense.
There is actually an extra weight term. It does use ReLU, the same as SPLADE. There's an extra weight term that it multiplies all of the output tokens by, and then it'll do a bit of summation based off that. So I want to emphasize it is SPLADE-like, it is not exactly the same as SPLADE, but I'm happy to link the paper a little bit after this for folks who are interested in it as well. So let's get right into it, right? I'll stop sharing right here.
Bring this notebook up. So it's going to be a really, really simple example. What I want to do for this particular demo notebook is go over a very, very quick TF-IDF example first. And then what I'm going to do after that is use BGE-M3 to generate both sparse and dense vectors for these three documents. We're going to index them into Milvus, and then we're going to use reciprocal rank fusion to actually do this sparse-dense hybrid search over all these documents.
Right? Before we go there, though, I'm just going to do a quick example of TF-IDF. So for this notebook, you'll require PyMilvus. One of the really interesting features of the latest version of PyMilvus, PyMilvus 2.4, is that it includes what we call the model library. With the model library, you can actually perform inference directly on your machine. It's essentially a wrapper around commonly used models, and it gives you the capability to, from just Python, generate your embeddings, insert them into Milvus, and have a great time. We can also use scikit-learn.
scikit-learn has the TfidfVectorizer inside of it. We're first going to compute TF-IDF sparse vectors over these three documents. So I'm going to run that first. I'm also going to do all of my imports in one big import right here. And because I have all of these already, you'll see that this runs pretty quickly, pretty efficiently.
So the first thing I'm going to do is create a TfidfVectorizer, and then what I'm going to do from there is fit that to the docs and transform them as well. So we can do X equals fit_transform on the docs, and then, oops, actually it might be easier if I do that: X is now a sparse matrix of three by 27, right? And what does that actually mean? The first thing I'm going to show you is what you get with the, quote unquote, feature names here. What's going on here is that all the tokens that are in our docs are now part of a dictionary, part of this TfidfVectorizer's dictionary. So you see it includes "1956", which is in the first sentence, but it also includes "England" right here, which is in the last sentence. It also includes "born", for example, which is in the last sentence, excuse me, the last document.
Where's that one? Right here. So I've taken all of the tokens, all the words that appear throughout all my documents, and generated a vocabulary based on that, or excuse me, the TfidfVectorizer has generated it based on that. And from there, what I'm going to do now is print the dense representation of X, and X, again, is the sparse vector representation of all three of my documents. I want to just draw your attention to the first one here and show the folks in this webinar today what these values really mean.
So "1956", right? It appears only once throughout all of these docs, and only in the first sentence. So there's a, quote-unquote, high score that's generated for that. If we look at the last value here, this is "was", and "was" is actually not only in the first document right here, but also in the second and third documents too. So that one is a more common term and is assigned a lower score by the TfidfVectorizer because it appears in multiple documents, right? So even though it appears once in each of these documents, it's more common throughout the entire corpus. So that's why it's given a lower weight.
And you can extrapolate this to the other documents as well, right? So, again, I've represented this sparse vector set as dense vectors, and this is a really good high-level explanation of what each of these individual values is. Okay? So that is the TfidfVectorizer out of the way. Now what I want to do is something pretty special as well. We talked about BGE-M3 here very, very briefly. What I want to do is use BGE-M3 to generate sparse and dense vectors.
I'm going to put them into Milvus, and I'm going to use the same documents here, right? And again, I'm only using three documents because BGE-M3 is a bit heavyweight. I'm on a bit of an old laptop, and I don't want this to take too much time. So, as I mentioned, we're going to use BGE-M3, we're going to generate these vectors, and we're going to insert them into Milvus. So, really cool. But before we do that, I'm going to create a query first.
So my query is going to be, uh, query research. This is going to be my query, right? What I'm going to do with this query is try to match it against these three documents with this hybrid sparse-dense search, and hopefully get the most relevant ones at the end of this. Now, we talked very, very briefly about the model capabilities of the latest version of PyMilvus. So just to show you what that's like, oops, so we're on the latest version, 2.4.0.
Now, Milvus 2.4.0 also supports sparse and dense vectors. So you want to make sure that you have both of these available on your machine.
I have an instance of Milvus standalone that is up and running already on my machine. Again, it is 2.4, so I will be able to connect to it with this version of PyMilvus, no problem. What I'm going to do now is create what's called a BGE-M3 embedding function. We can go right here, say embedding function, and then we're going to set FP16 to false.
So I'm just going to use good old 32-bit floats, and I want to set the device to CPU. The reason I'm going to do this is because I don't have a GPU on this machine, and I want to keep things pretty simple as well. And then we can also take a look at the dimensionality of the dense vectors.
Yes. Let's see what that looks like. So, depending on whether or not you have used this before, it will go and fetch the files that you need. It will fetch the Torch model that you need directly from Hugging Face. Because I've already done that, it just works in this case.
No problem. And we can actually take a look at that, so it's 1024, so pretty decent. So let's get rid of that. What I'm going to do now is use this embedding function, this ef, to compute my embeddings. So let's do docs embeddings.
Oops, yeah, docs. And I'm going to do query embeddings with ef. Because I have only one query, I'll put it in a list first, and then I'll call the embedding function on that. So we'll do this really quick. It's actually much quicker than I thought it would be.
So I probably could have added some more docs in there. But we can take a real quick look at the docs embeddings and what they look like. It shows you that now we have these dense embeddings, right? Because I have three documents, I've generated three different dense embeddings, and we also have these sparse embeddings. That's what's really, really cool about not just this model, but also Milvus's new model capabilities as well, right? You can generate these very, very easily on demand, just on your own laptop, from the comfort of your Python interpreter. From here, you'll see that our vocabulary is actually fairly large, and what we've done is generate a 3 by 250K sparse array of float32s, right? And you'll see there are 43 stored elements, and this is in Compressed Sparse Row format.
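[Editor's sketch] For readers unfamiliar with Compressed Sparse Row (CSR) storage, the layout can be shown in plain Python. The tiny matrix below is hypothetical (it is not the 3 by 250K array from the demo); the three lists mirror the `data`, `indices`, and `indptr` arrays that CSR implementations such as SciPy's expose.

```python
# Hypothetical 3-row sparse matrix, written as one {column: value} dict per row.
rows = [
    {3: 0.25, 17: 0.5},
    {},                           # a row with no stored elements
    {0: 0.1, 17: 0.3, 42: 0.9},
]

def to_csr(sparse_rows):
    """Flatten rows into CSR arrays: values, column indices, and row pointers."""
    data, indices, indptr = [], [], [0]
    for row in sparse_rows:
        for col in sorted(row):
            indices.append(col)
            data.append(row[col])
        indptr.append(len(data))  # where the next row starts in data/indices
    return data, indices, indptr

def get_row(data, indices, indptr, i):
    """Recover row i as a {column: value} dict from the CSR arrays."""
    start, end = indptr[i], indptr[i + 1]
    return dict(zip(indices[start:end], data[start:end]))

data, indices, indptr = to_csr(rows)
print(len(data), "stored elements; indptr =", indptr)
```

Only the non-zero entries are stored, so the memory cost scales with the number of stored elements (43 in the demo) rather than with the full 3 by 250K shape.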
Now, I'm not going to do the same thing as I did up here, where I called todense. That would probably be a little bit too much. It'd be a lot of elements, where, as I mentioned, only very, very few are non-zero. But we'll just move forward with these sparse vectors, and you'll see just how easy it is to use these sparse vectors inside Milvus. So from there, I'm going to do a bit of copy-pasting.
Uh, what this is doing is, uh, for folks who are familiar with Milvus and familiar with Zilliz Cloud, which I hope is most, um, and if there are folks who aren't, please go onto zilliz.com/cloud. You know, spin up an instance, go play around with it. Or again, if you're interested in the sparse vectors, go and download Milvus 2.4, uh, download the standalone version or the cluster version, um, and go around and play with this, right? So what I'm gonna do here is, I've already done all the imports that I need, so I've imported connections already.
I'm going to connect to, uh, the local instance of Milvus running on my laptop. So hopefully that works. Yes, it did. And then from here, uh, there's actually quite a bit, so I've sort of pre-typed it already. Um, I'm actually going to generate some fields, or excuse me, I'm gonna generate my schema.
So my schema in this case is actually composed of, uh, a primary key. It's composed of the original text, right? So I'm storing the original document inside of my vector database as well. I'm doing this primarily for simplicity, so that way I don't have to, you know, go back and reference, uh, another data store for my original documents, and I'm storing the sparse vector and the dense vector as well. Now, one really, really important thing to know is that we've introduced a new data type inside of Milvus 2.4, called sparse float vector, right? And this is really the data type that you want to use, but moving forward, you'll be able to see that we can actually create indexes on both of these.
Now, with these fields, I can then define a schema for my collection. And again, I've imported this already. CollectionSchema is imported right there. So using these fields, I can define a schema. And then from there, with that schema, I can then create a collection, which in this case, I've called sparse dense demo.
Let's see if that works. No problem. Uh, and then from here, what we're gonna do, uh, I'm actually gonna type this out, so hopefully it'll be a little bit easier to remember. We're actually gonna create an index for both the sparse column as well as the dense column. So I'm gonna, let's do sparse index.
And what we're gonna do here is we're gonna specify two things, the index type and, uh, the metric type as well. Index type, we're gonna call it, uh, sparse inverted index. So what that's doing is it's building an inverted index across all of my sparse elements. And then here we're gonna specify the metric as well. That will just be inner product, okay? So you'll see, just as we have an inner product metric for dense vectors, we can use an inner product metric for sparse vectors as well. I'm also going to specify what I want for the dense index.
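As a rough illustration of what an inverted index over sparse elements with an inner product metric does, here is a minimal sketch in plain Python. It is not Milvus's implementation; the function names and the toy vectors (dimension index mapped to weight) are assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_index(sparse_docs):
    """Map each non-zero dimension to the (doc_id, weight) pairs that use it."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(sparse_docs):
        for dim, weight in vec.items():
            index[dim].append((doc_id, weight))
    return index


def inner_product_search(index, query, top_k=2):
    """Score only documents that share a non-zero dimension with the query."""
    scores = defaultdict(float)
    for dim, q_weight in query.items():
        for doc_id, d_weight in index.get(dim, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical sparse vectors: {dimension: weight}
sparse_docs = [{3: 0.5, 17: 0.25}, {17: 0.5, 42: 1.0}, {8: 0.75}]
index = build_inverted_index(sparse_docs)
results = inner_product_search(index, {17: 1.0, 42: 0.5})
```

The third document never gets touched because it shares no dimension with the query, which is the property that makes this structure a good fit for very high-dimensional, mostly-zero vectors.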
So hopefully this should be pretty, pretty familiar to a lot of you folks out there. Uh, for index type in this case, I'm going to use flat indexing. And again, the reason I'm doing that is just to keep things simple, because there are so few documents. If you have, uh, let's say a lot of documents, or if you have a lot of vectors stored in your vector database, you can use something like HNSW or IVF_PQ, something along those lines. From there, we'll define metric type as L2.
You could also do cosine as well. Uh, it depends on what you really want to do. I would play around with it. Uh, there's a lot of different ways to do it there. So from there, we can actually then create, um, indexes across all of that, right? So what I'll do is, I have my collection that is defined here.
So collection.create_index, and then I have two fields. First is the sparse vector field, and I'm going to do sparse index. So basically I'm specifying this as the index parameters I wanna create. I'm doing the same thing for dense vector as well. Hopefully that, uh, oh, let's see. Something happened there.
So give me one sec while I debug this real quick. What am I doing wrong here? Oh, perhaps I didn't drop the original collection. Let me see. Drop the original collection. All right, let's try that again.
Hopefully that works this time around. So there we go. So the reason why it wasn't working a little bit earlier is because I'd actually run through this demo already on my notebook. I'd created a collection called sparse dense demo already, and it was trying to recreate those indexes that already existed. So, again, uh, always remember to start from a fresh environment every time you do these demos. Lesson learned.
So going from there, you know, that's really all you need to actually get the collection and the indexes up and running. From there on, it's actually pretty easy, right? So what we're gonna do now is create the entities that we're gonna insert. These are just gonna be our docs right now. They're also going to be the sparse and dense embeddings.
So we're gonna do docs embeddings sparse, docs embeddings dense. These are gonna be our entities. And then we're going to insert them into the collection and then flush. And what this will do is, as we've defined the schema up here to be the text, the sparse vector and the dense vector, it is taking these, compiling them into, uh, into a list with three, quote unquote, columns.
And then it is taking all of these and inserting them into the collection that we've just created. Calling flush just means that we're gonna seal this. So from here on, um, you know, it's pretty much all that you really need. Uh, and I also have some pre-made code here as well, which I'm gonna copy paste into this, uh, into this cell right here. What we're gonna do here is, um, given these parameters that we have, uh, defined here, we're actually going to create these ANN search requests.
They're going to be across the sparse and dense columns, right? And then what we can do is we can actually combine these together into a single, uh, what we like to call a sparse dense hybrid search. So what we're gonna do here, I'm gonna copy paste this as well. I'm gonna show you guys exactly what's going on here. So using these ANN search requests, I can then add them into a new function called hybrid search. And this search, it will actually take a reranker.
Now, if you remember from my slides here, we talked about reciprocal rank fusion. It's essentially a way of being able to take multiple ways of ranking your documents, uh, and combine them together. It's an information retrieval method. Uh, we use an RRF ranker here, and then we're actually going to, let's set the limit to two, and then we're gonna give it the output fields. So again, running through it very, very quickly, what we've done is we've created the search requests.
The search requests are across both the sparse and dense columns. Inside Milvus 2.4, we have these params here that we've, uh, defined as well. And then using these search params, using these requests, we're actually just going to do a hybrid search. It's really just as simple as that.
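For readers who want to see what reciprocal rank fusion does with the two result lists, here is a minimal standalone sketch. It is not the Milvus ranker itself; the function name, the document IDs, and the smoothing constant `k=60` (a commonly used default) are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each doc earns 1 / (k + rank) per list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the sparse and the dense search, best match first.
sparse_ranking = ["doc_b", "doc_a", "doc_c"]
dense_ranking = ["doc_a", "doc_c", "doc_b"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
top_two = fused[:2]  # mirrors asking the hybrid search for two results
```

Because only ranks are used, the sparse and dense scores never need to be on the same scale, which is one reason this kind of fusion is a convenient default.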
So let's run this and see, hopefully it doesn't error out this time. Well, no, it is giving me some error. So let's see what's going on here. Oh, uh, that's a really, really simple mistake here.
I forgot to load the collection. I'm going to do load, and then let's rerun this. Hopefully it'll, uh, work. Alright, let's take a look at our results. So you can see, uh, based on this hybrid search, it'll sort of return.
Um, it says that "Alan Turing was the first person to conduct substantial research in AI" is actually the closest sentence, which, you know, even though there are three documents, it's actually pretty good, right? Again, our query is "Who started AI research?", these are our three documents, and, you know, it is indeed the most relevant, right? But even though there are only three, I think this is just more of an example of some of the amazing things that you can do with this sort of new hybrid sparse and dense search in a really, really scalable vector database. And also, you know, the idea that vector databases are becoming more than just a vector database, right? They're also becoming these engines for search, for retrieval, um, and really more broadly, ways of being able to do really, really, you know, complex analysis across not just text, but other types of unstructured data as well. So, that is all I wanna talk about for this session today. Uh, I hope the folks, uh, listening out there enjoyed my rambling and enjoyed the demo here today.
Um, uh, as well as the slides I presented. I'd be happy to take some questions right now. Um, I don't know, Christie, if we have any, uh, in the chat. Yes. The first simple one is just, can you share the link to your notebook? I can, absolutely. Um, and, you know, we'll figure out a way to do this, uh, shortly after the webinar.
Okay, sure. So, yeah, this was really great stuff, Frank. A lot of information really quickly. Um, I really liked your explanation of, um, the different algorithms for generating, um, sparse embeddings, from the TF-IDF type to SPLADE to the new, um, BGE-M3. Um, I really like the new Milvus, um, API, that's something new.
Um, and also, um, you know, you produced your two vectors, and Milvus 2.4 supports how many vectors? I think three now. Well, Milvus 2.4, well, Milvus has always supported both, uh, binary, so sort of Boolean vectors, as well as dense vectors, so float vectors. Now, in this case, in addition to, uh, sort of Boolean vectors or binary vectors, whatever you wanna call them, uh, we also support sparse vectors as well.
So now it supports three different types of vectors. Uh, and again, the sparse vectors, they are mainly used for text, but there are other modalities as well that are starting to adopt sparse vectors. Um, generally I think modalities that are more discrete, so things like molecular search, uh, things like, uh, you know, obviously text search benefit more from sparse vectors, but I have seen them used for computer vision as well. I'm not fully convinced about those just yet. But there's definitely value to sparse vectors when it comes to, uh, semantic text search, or when it comes to lexical text search, excuse me.
Right, right. So yeah, we support three different types of vectors: sparse, dense, and binary. And as well, each collection can now have multiple vectors, um, as we saw in this demo. Um, so a question I see: what is the difference between a dense embedding and a dense retrieval? Yeah, a dense embedding is simply a vector, or it's a, you know, it's a vector of, let's say, fixed dimensionality, typically for a single collection, for a single model. Uh, and there's, you know, just a series of numbers inside of that.
Now, when I talk about dense retrieval, what I'm talking about is using a large collection of those dense vectors to figure out what are the most relevant vectors to my query vector. So again, I have a single query vector here, and then I have, let's say, you know, a hundred million of these dense vectors. Now, the individual elements themselves are dense vectors, and dense retrieval is the process of using a single dense vector, comparing it with my hundred million dense vectors, and picking out the most relevant ones. Okay. Um, I think I saw another question here.
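The distinction between a dense embedding and dense retrieval can be sketched in a few lines of Python. This is a brute-force illustration under assumed toy data: the corpus, the two-dimensional vectors, and the function names are hypothetical, and a real vector database would use an index rather than scanning every vector.

```python
import heapq


def inner_product(a, b):
    """Similarity between two dense embeddings of the same dimensionality."""
    return sum(x * y for x, y in zip(a, b))


def dense_retrieve(query, corpus, top_k=2):
    """Compare one query vector against every stored vector, keep the best."""
    scored = ((inner_product(query, vec), doc_id) for doc_id, vec in corpus.items())
    return [doc_id for _, doc_id in heapq.nlargest(top_k, scored)]


# Each value is a dense embedding; the dict as a whole is the collection.
corpus = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5]}
top = dense_retrieve([1.0, 0.25], corpus)
```

Each entry in `corpus` is a dense embedding; the call to `dense_retrieve` is the retrieval step that ranks those embeddings against the query.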
Um, oh, um, I think this is an example about, um, metadata. Somebody said, can you point me to example code where I can constrain my RAG docs to a defined set of URLs and a defined set of PDF docs? So maybe mention something about metadata versus what we're doing here, Frank. Yeah, so in this particular demo notebook, what I've defined is these field schemas, uh, and these field schemas are simply individual columns inside a database. But I also could have done something like field schema, call it metadata zero, right? And then this could be some data type. Uh, it could be, you know, let's say like a max.
Yeah. And this metadata, now I have this sparse vector, you know, I have all these documents. The metadata could be, you know, something like, um, the book that it was from, or the original URL it's from. So it could be something that looks like, let's say, http, you know, example.com/, uh, alan, whoops.
It could be the URL that the original document is from as well. And that is, you know, when we talk about metadata and we talk about using all of that in conjunction with, let's say, the poster's question about PDFs here, uh, that is really sort of what we talk about. So you have different ways of being able to construct your schema. You have many different field types that you can use. And combining all of them together is, uh, is a great one there.
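As a sketch of how a metadata field can constrain which documents are eligible for retrieval, the snippet below filters a list of entities by their source URL before any ranking happens. Everything here is hypothetical: the field name `source`, the URLs, and the helper `restrict_to_sources` are made up for illustration, and in a vector database this kind of constraint would normally be expressed as a filter on a scalar field rather than in application code.

```python
# Hypothetical entities: text plus a metadata field holding the source URL.
entities = [
    {"id": 0, "text": "Chunk from a web page.", "source": "http://example.com/page-a"},
    {"id": 1, "text": "Chunk from a PDF.", "source": "http://example.com/report.pdf"},
    {"id": 2, "text": "Chunk from another site.", "source": "http://example.org/other"},
]


def restrict_to_sources(entities, allowed_sources):
    """Keep only entities whose metadata matches the allowed set of sources."""
    allowed = set(allowed_sources)
    return [entity for entity in entities if entity["source"] in allowed]


allowed = ["http://example.com/page-a", "http://example.com/report.pdf"]
candidates = restrict_to_sources(entities, allowed)
candidate_ids = [entity["id"] for entity in candidates]
```

Only the surviving candidates would then be scored against the query with the sparse and dense vectors.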
And I think a question some people have is, when they see this new, um, what's newly supported now with the BGE-M3 sparse and dense, what kind of accuracy gains typically are you getting versus just dense? That is a great question. Um, I would say it depends heavily on your data. And again, I think, you know, it sort of goes back a little bit to, I hope everyone can see my screen still, to, you know, some of the stuff that I was talking about here, which is, you know, I think keyword search has a lot of value because we live in a world where, uh, things are very, very much dominated by keywords. Uh, and I think they will continue to be dominated by keywords.
And the value that sparse vectors provide, first of all, it's great for out-of-domain data. So if you have a dense embedding model that is, you know, I gave the example earlier, it's trained on financial data, uh, or trained on legal data, but then I'm applying it to a different domain of data, sparse vectors are great for that, right? Uh, they're very, very generalizable, and they give you the capability to say, hey, I have a keyword in my query. I have the keyword in my document. Uh, I know for a fact that, you know, that document is probably somewhat relevant, right? Where I think sparse vectors fall short is that, um, unless you use these sort of learned sparse vectors that give you a little bit of semantic meaning, such as SPLADE, um, or, you know, BGE-M3 sparse vectors, you end up having, I think, you know, dense vectors will always continue to change and improve.
Uh, whereas if you're not using these learned sparse vectors, if you're using, let's say, TF-IDF or BM25, that is pretty much static. You know, those algorithms have been around for a long time in information retrieval, and those aren't really changing that much. So I think that's typically the downside that you see. So it's a trade-off. It's something that we as humans have handcrafted that works really, really well, versus something that has the potential to get better and be, you know, tunable to your own personal dataset. It's really, really hard to say how much improvement you'll see from sparse vectors or, uh, from using BGE-M3.
But, um, I would say test on your own dataset, right? Um, and it's always interesting to see, you know, the different types of improvements that we could get. So off the top of your head, Frank, if you were to give some advice for people on how to think about, do I need, should I have just dense vectors? Should I go for sparse and dense? What do you recommend for people? Um, well, look, I say this, right? If you're using just text data, if you have just a text-based RAG, I would use both sparse and dense. If you don't mind the extra sort of, oh, there is a little bit of extra overhead that is involved, uh, and obviously a little bit more computation as well, if that extra computation doesn't really impact your business or your organization that much, use both, right? Uh, it's very rare. You know, you tune it a little bit, and I think you'll find that it's very, very rare that sparse vectors are really gonna hurt performance.
Uh, and again, the reason is because you have different ways of being able to combine the two, right? You can weight more heavily towards sparse vectors, you can weight more heavily towards dense vectors. You can always find a good middle ground that works for you. Okay? Where I would consider potentially maybe avoiding sparse vectors is if you have, uh, you know, pure, let's say, modalities that are continuous. So when I say continuous, I mean, you know, things like video or potentially audio. And the reason why I say that is because it's hard to turn those into sparse vectors without bucketing them, without sort of classifying them, right? Uh, if you think about ImageNet, how we classified images into a thousand categories, it's not a very, very accurate way of doing things.
Uh, now there are algorithms out there that can actually take images and turn them into sparse vectors. Uh, but again, I haven't tried them, uh, very much. I don't know how good they are from a retrieval perspective, uh, but for sure, you know, sparse vectors are great if you have just, uh, a text modality. Okay? And another question came in: if I understand the concept correctly, sparse vectors are better for sentences involving keywords, some out of the box, out of, um, or some out-of-domain keywords. I work for an enterprise environment with many abbreviations and entity names, like names of buildings and business units.
Are sparse vectors going to help retrieval, search and RAG more than simple off-the-shelf, uh, vectors would? Well, I mean, I would argue that sparse vectors would help even more, right? So if you have abbreviations and entity names and names of buildings, these are probably not things that, you know, a generic embedding model has seen very, very often, uh, especially these very, very domain-specific ones, right? And if you take a sparse vector, because sparse vectors, you know, take the keywords as they come, so to speak, uh, it is much easier for sparse vectors to be more applicable, you know, let's say if you use something like BM25, right? It's much easier for them to be more generalizable. If you have a pre-trained embedding model, uh, such as OpenAI's, you know, uh, embedding model oh oh three, uh, you're really bound by the tokenizer. So, you know, it's got a fixed vocabulary. And with something like BM25 or TF-IDF, uh, you can be very, very flexible with that vocabulary, right? And you can still get the results that you want. It's easy to sort of redefine that exactly as I did in, uh, in sort of these cells right here.
So you just simply fit and transform those documents, uh, and you're good to go. Right? That sounds good. Um, well, I don't see any more questions, um, coming through. Um, oh, okay. Somebody just, okay.
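The "fit and transform" idea from the answer above can be sketched with a small hand-rolled TF-IDF in plain Python. This is an illustration, not the code from the notebook: the `fit` and `transform` helpers, the whitespace tokenizer, the smoothed IDF formula, and the example documents with made-up abbreviations are all assumptions.

```python
import math
from collections import Counter


def fit(docs):
    """Learn a vocabulary and smoothed IDF weights from your own documents."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = {tok: i for i, tok in enumerate(sorted({t for d in tokenized for t in d}))}
    doc_freq = Counter(t for d in tokenized for t in set(d))
    n_docs = len(docs)
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0 for tok in vocab}
    return vocab, idf


def transform(text, vocab, idf):
    """Turn text into a sparse vector {dimension: weight} over the learned vocabulary."""
    counts = Counter(t for t in text.lower().split() if t in vocab)
    return {vocab[tok]: count * idf[tok] for tok, count in counts.items()}


# Hypothetical enterprise documents full of in-house abbreviations.
docs = ["HQ2 houses the FinOps unit", "FinOps reports to the CFO office"]
vocab, idf = fit(docs)
query_vec = transform("where is HQ2", vocab, idf)
```

Because the vocabulary is learned from the documents themselves, an abbreviation such as the made-up "HQ2" gets its own dimension, and rarer terms receive a higher weight than terms that appear in every document.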
So a question. Um, how would you go about combining sparse vector and dense vector retrieval? How would you weigh each retrieval's results? Or at least do it systematically? I think, great question. We have that built in, but yeah, let's talk about that some more. Good question. Um, honestly, I would just start with some weighted averaging.
Uh, so, you know, again, you have your sparse and your dense vectors. Um, you know, there's pretty easy ways to do it from there, but I think RRF is a pretty solid way of doing it. Uh, so again, we talk about that right here in this slide. Uh, if you look on the Milvus docs, you'll see, you know, some of the different ones that you can use for your ranking. Now, what works best for you is gonna depend on your application. Um, I would try it out and, uh, go from there.
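The weighted averaging mentioned as a starting point could look roughly like the sketch below. It assumes min-max normalization so that sparse and dense scores are comparable before mixing; the function names, the scores, and the `sparse_weight` parameter are hypothetical and chosen only for illustration.

```python
def min_max(scores):
    """Rescale scores to [0, 1] so the two retrievers are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_fusion(sparse_scores, dense_scores, sparse_weight=0.5):
    """Blend normalized sparse and dense scores; missing docs count as 0."""
    sparse_n, dense_n = min_max(sparse_scores), min_max(dense_scores)
    fused = {
        doc: sparse_weight * sparse_n.get(doc, 0.0)
        + (1.0 - sparse_weight) * dense_n.get(doc, 0.0)
        for doc in set(sparse_n) | set(dense_n)
    }
    return sorted(fused, key=lambda doc: fused[doc], reverse=True)


# Hypothetical raw scores from each retriever.
sparse_scores = {"a": 2.0, "b": 6.0, "c": 4.0}
dense_scores = {"a": 1.0, "b": 0.0, "c": 0.5}
sparse_heavy = weighted_fusion(sparse_scores, dense_scores, sparse_weight=0.75)
dense_heavy = weighted_fusion(sparse_scores, dense_scores, sparse_weight=0.25)
```

Shifting `sparse_weight` changes which retriever dominates the final order, which is the knob to tune on your own data.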
But RRF is used pretty widely, and, uh, it doesn't hurt to start with that as a baseline. And it's fast. That's an advantage. Um, okay. Alright. Well, um, I think we had a great audience today.
Thank you for, uh, sticking around and asking great questions. So, um, you'll get, um, an email in a few days with this recording. Um, and we'll add some links. Um, also, that Hugging Face link that I posted in the chat has a link to some of our Milvus-supplied, um, example code as well. Um, so thank you, everybody.
Thanks folks. Frank.
Meet the Speaker
Frank Liu
Director of Operations & ML Architect at Zilliz
Frank Liu is the Director of Operations & ML Architect at Zilliz, where he serves as a maintainer for the Towhee open-source project. Prior to Zilliz, Frank co-founded Orion Innovations, an ML-powered indoor positioning startup based in Shanghai and worked as an ML engineer at Yahoo in San Francisco. In his free time, Frank enjoys playing chess, swimming, and powerlifting. Frank holds MS and BS degrees in Electrical Engineering from Stanford University.