Choosing the Right Vector Database: A Practical Guide

Webinar

Zilliz Webinar - Zoom

What will you learn?

In this webinar, we'll take a step back and simplify the process of selecting the perfect vector database for your specific use case. Join us as we break down the key factors to consider, demystify complex terminology, and provide actionable insights to help you make an informed decision. Whether you're a seasoned developer or new to the field, this session will equip you with the knowledge and confidence to navigate the landscape of vector databases easily.

Topics Covered:

Which performance requirements to consider for each use case
How to evaluate different vector databases for compatibility with your needs
Comparing scalability and flexibility across different vector database solutions

Transcript

Today we're going to talk about how to pick a vector database. We already have a couple of questions, and they're perfect. I don't know if somebody is trying to tee these up for me and make my job easier, but I do appreciate it. The first question is a really simple one, Li: what is a vector database? And can you answer that in just 30 seconds?

Okay.

I don't have a very good explanation, but a vector database is a database for vector storage and vector search.

There we go. Perfect. Simple answer. And then there's another question, which is: what are the key differences between a graph database and a vector database? We're going to answer that towards the end.

That's because we have a lot of material that I want to make sure we cover first, and then I think the key differences will make a little more sense. So instead of going deep into that now, we're going to lay the groundwork so that by the time we get to that question, it'll be super obvious. Alright? This is also meant to be very interactive, so if you have questions, I will do my best to watch for them.

And Saachi, my friend here, is also watching them. But let's go and get started. The other thing I want to make sure we all recognize is that today is Pi Day. For those of you who might not be familiar with Pi Day, it's March 14th, 3.14, in honor of this very mysterious number. It's always interesting to me that its digits go on forever, and yet circles are everywhere in the world. So hopefully you had a little bit of pie, pizza pie or apple pie or cherry pie, in honor of this.

And let's get started on the topic. The approach we're taking today is that I don't want to just dictate my opinions or Lee's opinions on how you should pick a vector database. I think it's really important for us to help everybody understand some of the key things in a vector database that you need to consider so that it works really well with your particular use case. Everybody's a vector database nowadays. Hooray, that's so awesome.

You would think that I would be a little bit upset, but actually I'm really happy that everybody is interested in vector embeddings, because they are so powerful and they've already made a pretty significant impact on the work that we do. The more we can learn about them, and the more developers we can get using them, the more insights I think we'll be able to get. So that's really great, but it makes it confusing for a lot of you to know which vector database you should consider and what you should look at. So we're going to start off with this question: what are the primary use cases for vector databases? I think everybody knows what RAG is, retrieval augmented generation, but Lee's going to tell you that that's just one of many, many, many. And Lee, if you can, tell everybody not only what the use cases are, but also what particular features and requirements people should consider in a vector database for these very specific use cases.

Alright, so we got some questions about what a vector database is.

So I think I need to introduce some concepts here. With a vector database, we extract embeddings, which are vectors, from unstructured data like images, video, and audio. These are stored in the vector database, and we can search them using another vector. We call this similarity search. As for the use cases, I can simply categorize them into two different parts.

The first is online. For online use cases, the most typical one, and the hottest one, is RAG. RAG works with an LLM. You have a prompt, and you retrieve some relevant information stored in the database. That is combined with your query to become the prompt that goes into the model, to help the large model understand more information.
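As an illustration of the RAG flow just described, here is a minimal sketch of the prompt-assembly step only. The function name `build_rag_prompt`, the prompt wording, and the sample chunks are all hypothetical; the retrieval itself is assumed to have already happened in a vector database.

```python
def build_rag_prompt(question, retrieved_chunks, max_chunks=3):
    """Combine the user's question with retrieved context into one prompt."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks[:max_chunks])
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )

# Hypothetical chunks, as if returned by a vector search over private documents.
chunks = [
    "Refunds are accepted within 30 days.",
    "Shipping takes 5 business days.",
]
prompt = build_rag_prompt("What is the refund window?", chunks)
print(prompt)
```

The resulting string is what would be sent to the large model; how many chunks to include is a tuning choice, not something the talk specifies.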

Maybe it's private information, more recently updated information, or a company's own information. We call this RAG. RAG is the hottest topic, but besides RAG, many other things happen inside a vector database. In the online cases, for example, you do image search: you try to search for similar products on Amazon, or you do a picture search on Google. Behind the scenes, image search is happening.

And behind this image search is vector search. This is where we should use a vector database. There are also document search and fraud detection. Fraud detection means that something risky has happened: some transaction may be very suspicious and you need to find it. We cannot match it exactly against things in our database, but we can match it by semantic similarity. That is how we find it.

Those are the online cases. For online cases, I'll answer the question that was asked: what features do we need to satisfy these scenarios? First, of course, is performance. Performance is always important.

That is true no matter what kind of database it is, and especially for online workloads. Some people think that performance doesn't matter for RAG because the large model is much slower. But consider that you will have many tenants: many different users call the large model, and that in turn relies on the database.

So we also need very high throughput and very high performance. The second one is what I'll call real time, meaning index building speed: how fast we can ingest data into the vector database and serve it.

That is another important thing. Beyond performance, there are some other features. For example, strong scalability: online services always have peak and valley traffic, and we need to take advantage of this by scaling out and scaling in to save money. Also, since you're online, it's very important that you never have any problems.

So, fault tolerance, which includes backup, failure resistance, and all these kinds of things. It's very important, just as in a traditional database. Also observability: we have a monitoring system and a tracing system, which can help us discover problems as fast as possible.

Also operability, as I call it, which means that once we find a problem, we have a good design and good tools to fix it. Those are the most important features for the online cases.

For offline scenarios, we have, for example, data mining. In the machine learning build stages, you try to find similar data to help you tune a model or fine-tune a big model, whatever. So you need to find similar data, and a vector database can always help with that.

There is also data cleaning work. I noticed that some large model training processes need tons of data, and they need to deduplicate it. The deduplication is not based on a word-by-word or letter-to-letter match. It's semantic deduplication. So we need to use vector search here as well.
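To make the idea of semantic deduplication concrete, here is a small sketch that drops items whose embeddings are nearly identical by cosine similarity. The greedy strategy, the 0.95 threshold, and the toy two-dimensional embeddings are assumptions for illustration; at real scale the inner scan would be replaced by a vector index lookup.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_dedup(embeddings, threshold=0.95):
    """Greedy dedup: keep an item only if no already-kept item is too similar."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy embeddings: the second is a near-duplicate of the first.
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept_indices = semantic_dedup(embeddings)
print(kept_indices)
```

This compares every item against every kept item, which is quadratic in the worst case; that cost is exactly why a batch-capable vector search is useful for offline cleaning.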

The most important feature here is, I'd say, batch search performance. It's a little bit different from online, where we search one by one. In the offline case, we always do a super large batch search. Other features matter too, like capacity: how much data you can store, because offline is always big.

And some rich search semantics for analysis, like range search, filtered search, and group by. These rich search semantics, the kind found in traditional OLAP databases, are also needed in a vector database.

So let's untangle that a little bit. What you're saying is that, for some of these online use cases, and let's just go from the top, with RAG, there's a lot of chatter out there that says performance doesn't matter because your LLM is super slow. But you're saying that no, performance actually does matter in the RAG use case.

So it's something people should consider. And then, if I heard you correctly, there are some use cases, like the anomaly detection use case you described, where having a very precise answer is really important. Having all the data presented in the results is also important, because we're trying to find that outlier, right? We're trying to find the thing that is causing the anomaly. We want everyone to keep in mind that a lot of the time what we're doing here is what we call an approximate nearest neighbor search. But in that anomaly detection use case, we want to be able to find what that actual thing is.

So it's really important to understand whether your vector database has all the capabilities to get you to that outlier. And I think you also mentioned some of these offline use cases. It's not just about doing that initial search; it's about then taking that data and doing more with it, using it to further train the model. So there are so many different use cases. And in this brief moment, I think we've just heard that there are also lots of different situations and lots of different ways to search that data.

And there are also a lot of different expectations you should have about the performance of the database. Is that correct?

Yeah, definitely. Actually, this is just a rough summary. There are many other corner cases.

They are just not as hot or as popular. For example, I know some biotechnology companies also use vector search. They're doing protein prediction, and they want to find similar proteins through vector search, because you can always translate a protein into a sequence of letters. So from a biology perspective, this is very interesting.

Yeah, actually, I met with a user just two days ago who's doing all that.

And what did he say to me? He said it in such an eloquent way. He said that when you realize that proteins are a sequence of amino acids, you start to realize that these models can actually learn the protein grammar, which I hadn't even thought about from that perspective. So in that particular case, we do want to get a very relevant search result, because they're ultimately trying to build a really optimal protein for a number of different kinds of use cases.

Some of it could be, I think he was telling me, using these proteins instead of oils in plastics. Or some of it could be in medications. It's just a completely different world from software engineering.

Yes.

Fabulous.

Okay, moving on. Why does how a vector database handles scaling matter? We can scale up or scale out, but why does this matter, and why should we think about it in regard to our particular use cases?

Okay, first I want to introduce a basic concept: what I already mentioned as scaling out or scaling up. I would call it horizontal scaling or vertical scaling.

And as for why it's important: data keeps growing. As mentioned, biology companies use large models and vector databases, and so do other industries. They use these large neural network models, and once you get into models, you cannot avoid vectors. And when you have vectors, and more and more of them, your vector database definitely needs to scale. There are different kinds of scaling methods.

Let me get this the right way around, since I mixed them up at first: horizontal scaling means adding more hosts to the cluster, and vertical scaling means making a single host bigger and bigger. Horizontal scaling is unlimited, since you can add as many hosts as you want.
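One way to picture horizontal scaling for search is scatter-gather: each host holds a shard, answers the query over its own data, and the partial results are merged. The sketch below simulates this in a single process with plain dictionaries standing in for hosts; the shard layout and function names are hypothetical and this is not how any particular vector database implements it.

```python
import heapq

def search_shard(shard, query, k):
    """Exact top-k on one shard; returns (squared_distance, item_id) pairs."""
    scored = [
        (sum((q - x) ** 2 for q, x in zip(query, vec)), item_id)
        for item_id, vec in shard.items()
    ]
    return heapq.nsmallest(k, scored)

def search_cluster(shards, query, k):
    """Scatter the query to every shard, then merge the partial results."""
    partial = []
    for shard in shards:
        partial.extend(search_shard(shard, query, k))
    return heapq.nsmallest(k, partial)

# Two shards, as if the data were split across two hosts.
shards = [
    {"a": [0.0, 0.0], "b": [5.0, 5.0]},
    {"c": [1.0, 0.0], "d": [9.0, 9.0]},
]
top2 = search_cluster(shards, [0.0, 0.0], k=2)
print([item_id for _, item_id in top2])
```

Adding capacity here means appending another shard to the list, which is the "add more hosts" idea; vertical scaling would instead mean making one dictionary (one host) hold everything.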

Vertical scaling is limited, because one single host can never grow unlimitedly big, right? Since our topic is how to pick your vector database, a bit more on that: many vector databases, most of them, usually support only vertical scaling, since it's the easiest way to implement. Some other databases and search systems also usually do only vertical scaling. It makes things easier.

And when you scale vertically, I think the performance improvement will be linear in how much you scale, since there is less overhead. The problem is that it can never grow super big, whereas horizontal scaling can always support more. The other problem with vertical scaling is that, at the very beginning, you can't know how big your data and your database will be in the future. If you underestimate it, you will suffer: you'll find that it's so crowded you cannot scale any more.

Yeah. And it's interesting, because it's something that we came up with, not me but the industry, 20 years ago or maybe more: the idea of horizontally scaling. And yet every time we introduce a new type of database, we fall into the same trap. We all start with a database that scales vertically, and then we realize over time, oh dear, the number of items we're trying to deal with is actually really large, and now we need to figure out how to do it horizontally.

So it's not a new way to address the challenges of scale. It's just that, with a vector database, it's something you need to think about. When we're building prototypes, maybe we have a hundred thousand vectors and we think it doesn't matter. But it's really, really easy to get to a hundred million vectors. And it's really easy to get to a billion, don't you think, Lee?

Yeah. Yeah.

And this world is changing, right? Models have grown super fast within the last few years. Maybe at the beginning of 2023 you would say, I'll have just 1 million or 2 million vectors. But nowadays, most of our use cases count in billions instead of millions.

That's right. That's right. Yeah.

And I think if anybody in this session is building out a RAG solution, you're probably seeing that even as you're chunking up your text and your PDFs, it can start to grow really quickly. Let's just do some simple math. Take the example of a medium-sized e-commerce site. There are a bunch of items being sold, and each item can easily have maybe five pictures, a video, a bunch of user reviews, and a product description. So it's easy to see how a single item you're selling could have 30 to 50 vectors of all different kinds, right? Then multiply that by the number of items you have on the e-commerce site, and boom, you're at a hundred million, or 300 million. It really does add up a lot quicker than I think a lot of people realize.
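The back-of-the-envelope math above can be written out as follows. The catalog sizes (2 million and 6 million items) and the 768-dimension float32 embedding width are assumptions chosen for illustration; only the 30 to 50 vectors per item figure comes from the discussion.

```python
def total_vectors(num_items, vectors_per_item):
    """Total vectors if every item contributes the same number of embeddings."""
    return num_items * vectors_per_item

def raw_size_gib(num_vectors, dim, bytes_per_value=4):
    """Raw float32 storage in GiB, before any index overhead."""
    return num_vectors * dim * bytes_per_value / 2**30

# Assumed catalog sizes; 50 vectors per item is the upper figure from the talk.
small_catalog = total_vectors(2_000_000, 50)   # 100 million vectors
large_catalog = total_vectors(6_000_000, 50)   # 300 million vectors

# Assumed embedding width of 768 dimensions.
print(f"{small_catalog:,} vectors, {raw_size_gib(small_catalog, 768):.0f} GiB raw")
print(f"{large_catalog:,} vectors, {raw_size_gib(large_catalog, 768):.0f} GiB raw")
```

Even before index overhead, the raw vectors for the smaller catalog run to hundreds of GiB under these assumptions, which is why the scaling model is worth deciding early.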

Okay. Actually, before we go to the next question, there's a good question from the audience: is distributed computing applicable to both horizontal and vertical scaling?

Sorry, let me take a look. Yeah, of course. When it comes to distributed computing, the description is a bit vague, because no matter whether it's horizontal or vertical, both of them are distributed.

So I think the answer is yes.

Yeah. As Lee mentioned, if you're going to go up, you're eventually going to hit a ceiling. That may be fine depending on your use case. So, once again, think about how many vectors you have today and how many you think you're going to have tomorrow, or in a year or two.

If vertical scaling is good enough, then put that in your checklist for determining which vector database is going to work for you. If you think you're going to get to a much larger scale, it's better to understand that and make that decision early on than to have to move a large number of vectors from your vertical scaling solution to a horizontal one. Okay, let's go on.

So, I think there are a lot of terms thrown around that try to describe the performance and the unique characteristics of vector databases. Can you describe for us what relevancy means, what scalability means, and what efficiency means in the world of vector databases?

Yeah. So, relevance. Vector search is not an exact search. We always say top K, the top K most similar, right? But what you get back may not be exactly the true top K, and that is the reason we have this concept called relevance, or recall. It's a trade-off: we sacrifice some accuracy, or relevancy, to improve performance.

We can do this because a vector database is a database behind AI. There is a model there, so the results don't need to be exactly the same for the model to perform very well. That's relevancy. As for scalability, as we just mentioned, it actually doesn't differ from a traditional database. In terms of architecture, you need horizontal scaling and vertical scaling, and you have shards and replicas. These concepts are basically the same. Then there is efficiency.

By efficiency, do you mean performance and cost?

Yeah.

Okay. It depends on the use case. The heaviest and most important part inside a vector database is the index. The different kinds of index you choose help you do similarity search.

The index will occupy most of the resources, and this affects the efficiency and the cost. You can put it either on disk or in memory, and you can quantize it or compress it. Each choice gives a different kind of efficiency, and which is best depends on your use case.

That's great. So let's dig into relevancy a little bit more with the next question.
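As a rough illustration of what quantizing a vector buys you, the sketch below applies simple per-vector scalar quantization to 8-bit integer codes. This is one basic scheme chosen for illustration, not the method of any specific index; stored as int8, each value takes 1 byte instead of the 4 bytes of a float32, at the price of a small reconstruction error.

```python
def quantize_int8(vector):
    """Map floats to integer codes in [-127, 127] using one scale per vector."""
    scale = max(abs(x) for x in vector) / 127 or 1.0  # avoid a zero scale
    codes = [round(x / scale) for x in vector]
    return codes, scale

def dequantize(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]

vector = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_int8(vector)
restored = dequantize(codes, scale)

# The error per value is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(vector, restored))
print(codes, max_error <= scale / 2 + 1e-12)
```

The trade-off mirrors the one discussed for relevancy: less memory and faster distance computation in exchange for slightly less exact distances.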

What kinds of search are available in a vector database? You mentioned that we're trying to find something that is as close as possible, right? The top K. What does top K even mean?

In vector search, we have metrics for evaluating similarity: L2, which is just the Euclidean distance, IP (inner product), and cosine. If two vectors are very close under one of these distances, they are close in what we call the vector space. We find the K closest vectors to the query, and that is what we call top K search. As for what kinds of search there are: top K search, which we also call ANN search, is the very basic one.
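The three metrics and the top K idea can be sketched in a few lines. This is an exact, brute-force version for illustration only (an ANN index would approximate the same result faster), and the toy vectors and names are made up.

```python
import heapq
import math

def l2_distance(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    """Inner product (IP): larger means more similar."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Inner product of the normalized vectors: larger means more similar."""
    norms = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / norms

def top_k_l2(query, vectors, k):
    """Exact top-K: the k ids with the smallest L2 distance to the query."""
    return heapq.nsmallest(
        k, vectors, key=lambda item_id: l2_distance(query, vectors[item_id])
    )

vectors = {"cat": [1.0, 0.0], "kitten": [0.9, 0.1], "car": [0.0, 1.0]}
nearest = top_k_l2([1.0, 0.0], vectors, k=2)
print(nearest)
```

Note the direction of each metric: with L2 you keep the smallest values, while with IP or cosine you keep the largest.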

Actually, that's more vector search than vector database. Many libraries can do this, right? Faiss, for example. You don't need a database on top of it. When it comes to a database, I think we need richer search semantics. For example, we have filtered search. For filtered search, you can treat your collection like a MySQL table.

You have a vector column, and you do a vector search on that vector column. You can also have other columns. These can be scalar columns, such as strings or integers. For example, you have a picture, and you have a description of it, a label, and the date it was taken. All of this data can be used as a filter for the vector search. For example, you want to find the top K closest pictures, but you don't want anything that includes, say, a red hat. This is called filtered search.
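The red hat example can be sketched as a scalar filter applied before the vector ranking. The row layout, the `label` field, and the filter-then-rank order are assumptions for illustration; real systems may filter before, during, or after the index search.

```python
import heapq

def filtered_search(query, rows, k, predicate):
    """Apply the scalar filter first, then rank the survivors by L2 distance."""
    candidates = [row for row in rows if predicate(row)]
    return heapq.nsmallest(
        k,
        candidates,
        key=lambda row: sum((q - x) ** 2 for q, x in zip(query, row["vector"])),
    )

# Hypothetical rows: one vector column plus a scalar label column.
rows = [
    {"id": 1, "label": "red hat", "vector": [0.0, 0.0]},
    {"id": 2, "label": "blue hat", "vector": [0.1, 0.0]},
    {"id": 3, "label": "green scarf", "vector": [0.5, 0.5]},
]
hits = filtered_search(
    [0.0, 0.0], rows, k=2, predicate=lambda row: row["label"] != "red hat"
)
print([hit["id"] for hit in hits])
```

Row 1 is the closest vector to the query but is excluded by the filter, so the result is the two nearest rows among the rest.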

I think that is a very important search semantic in vector search. Range search is another one. Range search means that instead of the top K most relevant results, we want the vectors within a certain distance range, using L2 or IP as the distance metric, as mentioned. In some cases, you want to find similar items, but you don't know how many you want, and you're not sure how much similar data you have.

So maybe you just have a range. Say I define that anything within a distance of two is very similar: then I get everything within two and nothing else. That is range search. Next, group-by search. Group-by search means that, just like in a traditional database, we group the results by some field and return them by group.
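The range search described above, "everything within a distance of two", can be sketched as follows. It is a brute-force illustration using L2 distance; the radius of 2.0 echoes the spoken example and the toy vectors are made up.

```python
import math

def range_search(query, vectors, radius):
    """Return every (id, distance) within `radius` of the query, nearest first."""
    hits = []
    for item_id, vec in vectors.items():
        dist = math.dist(query, vec)  # Euclidean (L2) distance
        if dist <= radius:
            hits.append((item_id, dist))
    return sorted(hits, key=lambda hit: hit[1])

vectors = {"a": [0.0, 1.0], "b": [3.0, 0.0], "c": [1.0, 1.0]}
within_two = range_search([0.0, 0.0], vectors, radius=2.0)
print([item_id for item_id, _ in within_two])
```

Unlike top K, the number of results is not fixed in advance: it can be zero, or it can be every vector in the collection, depending on the radius.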

For example, if you do a similarity search on a Snoopy picture, what you may want is the top three different kinds of dog. You hope to get Goofy, some other dog, and Snoopy. But what happens is that the results are all exactly Snoopy, because you have so many Snoopy pictures there. In this case, you want to say, I want to group by something. So you group by the name.

If it is Snoopy, just give me one. Then finally you get Snoopy, Goofy, and some other dogs. This is just an example of what we call group-by search. You can also group by a vector.
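Here is a small sketch of the Snoopy example: rank by distance, but keep only the nearest row from each group. The `name` field, the toy vectors, and the third dog ("Pluto") are hypothetical stand-ins for "some other dog".

```python
import math

def grouped_search(query, rows, k, group_field):
    """Return the nearest row from each group, up to k distinct groups."""
    ranked = sorted(rows, key=lambda row: math.dist(query, row["vector"]))
    seen, results = set(), []
    for row in ranked:
        if row[group_field] in seen:
            continue  # this group already contributed its best match
        seen.add(row[group_field])
        results.append(row)
        if len(results) == k:
            break
    return results

rows = [
    {"name": "Snoopy", "vector": [0.0, 0.0]},
    {"name": "Snoopy", "vector": [0.1, 0.0]},
    {"name": "Snoopy", "vector": [0.0, 0.1]},
    {"name": "Goofy", "vector": [1.0, 0.0]},
    {"name": "Pluto", "vector": [2.0, 0.0]},
]
top3 = grouped_search([0.0, 0.0], rows, k=3, group_field="name")
print([row["name"] for row in top3])
```

A plain top 3 on the same data would return three Snoopy rows; grouping by name returns one row per dog instead.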

Instead of grouping only on scalar data, such as a string, you can also group on the vector, which means that if the distance is super close, you treat the items as the same. These are search semantics borrowed from traditional databases. Another one is the iterator. Just like on Google, you search for something and then go to the next page, next page, next page. With vector search, too, you want the top 10, then the top 10 to 20, then the top 20 to 30, this kind of thing. We call it iterator search. Sparse vectors are another topic.

Vector search is about semantic search: the model extracts semantic information from the unstructured data and turns it into a dense vector. Sparse means the vectors are extracted by BM25, TF-IDF, these kinds of traditional statistics-based methods. It's not a model; it's an algorithm. You get a very sparse vector, which carries more information about the keywords.

It's about statistics. It is actually complementary to the dense, semantic one. One focuses on context, on the semantics; the other focuses on the keywords. And then you can combine them. So finally, it's about how to hybridize these two different kinds of search.

For example, you have multiple vectors, one sparse and one dense, and the question is how to combine them to make your search better. That is also an important thing to support inside a vector database.

Yeah. And that last point, making your search results better, is going to depend on what you're trying to search for and what's in your database, right? You can't answer that question with just one way of doing a search.

You mentioned a whole bunch of different ways we might do searches. You mentioned that we can filter first, on the scalar data or the metadata, so that you're limiting the population you're going to search against. That's the first one. You also mentioned something super fundamental, which is that we're trying to find similar things.

That's why we're getting these top K results. We're trying to find the top five similar items. And sometimes, with your data, the results are not going to be what you expected when you ask for the top five. That's why you also need to consider the ability to do what we call a range search, where you basically put a range around that data.

So, within this area, I want to find the results. And the top K may be outside of that area, or all inside of that area. You also mentioned group by, and I liked your example of looking for Snoopy, or looking for dogs like Goofy and Snoopy. What are you really trying to find when you're doing that search? Are you trying to find just Snoopy, or are you trying to find dogs similar to Snoopy? There are a lot of different questions that could be asked. The other thing I think you hinted at with group by is that, when you think about a RAG solution, we're doing a semantic similarity search on the chunks of the data, and then what are we actually trying to surface? Are we trying to surface just the chunks, or do we want to group by the actual PDF, the original corpus of data, to say, okay, we want to see all the similar docs, while the search itself is based on the chunks?

So you might want to group by that.

Yeah, similar to a traditional database, maybe we can also have some aggregation afterwards. You group by the ID of the PDF, and then you aggregate the scores for each paragraph. You just add them together, or you just take the max, and then you do a top K on that.
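The aggregation just described, grouping chunk hits by document and combining their scores with either a sum or a max before taking the top K, might look like this. The `(doc_id, similarity)` input shape and the sample scores are assumptions for illustration.

```python
from collections import defaultdict

def aggregate_by_document(chunk_hits, k, how="max"):
    """Rank documents from chunk hits; higher similarity is better."""
    per_doc = defaultdict(list)
    for doc_id, score in chunk_hits:
        per_doc[doc_id].append(score)
    combine = max if how == "max" else sum
    doc_scores = {doc_id: combine(scores) for doc_id, scores in per_doc.items()}
    return sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]

# Hypothetical chunk-level hits: (PDF id, similarity score).
chunk_hits = [("a.pdf", 0.9), ("b.pdf", 0.6), ("b.pdf", 0.5), ("c.pdf", 0.2)]
by_max = aggregate_by_document(chunk_hits, k=2, how="max")
by_sum = aggregate_by_document(chunk_hits, k=2, how="sum")
print(by_max, by_sum)
```

The two aggregators can disagree: max favors the document with the single best chunk, while sum favors a document with several reasonably good chunks.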

Yeah, that's a really great point, because when I first heard the word group, the first thing I thought of was aggregators. So there's a possibility of adding that aggregation as well. And then, what else? Oh, yeah, you mentioned hybrid search. I think in the industry, when you look at all the vector databases, everybody has their own definition.

So it's important for you to tease out what hybrid search means. And hybrid search... go ahead.

Oh, yes, sorry to interrupt. For now, most definitions of hybrid search come from a very typical recommendation system setup that uses keyword search and vector search. The keyword search is a traditional, statistics-based search. You run that and the vector search, take the top K from each, and combine the results.

We have a reranking model, this kind of thing, to produce the final result. We call that hybrid search. On our side, we call it hybrid search when you want to combine the results from two vectors, two columns of vectors. It can be a sparse vector and a dense vector, or a dense vector and another dense vector.
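As one concrete way to combine two result lists, the sketch below uses reciprocal rank fusion (RRF). RRF is swapped in here as a simple, widely used fusion rule; the talk itself mentions a reranking model and does not name a specific method. The document ids are made up, and `k=60` is a conventional default constant for RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores = {}
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-3 results from a dense search and from a sparse search.
dense_hits = ["doc1", "doc2", "doc3"]
sparse_hits = ["doc3", "doc1", "doc4"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Because RRF uses only ranks, it sidesteps the problem that dense and sparse scores live on different scales; a learned reranker is the heavier alternative.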

For example, you can have a video: you get one embedding from the frames, the visual data, and then you get another vector column from the audio data. Then you can combine the search results to find the video you want.

Yeah, that's a really great example. Unfortunately, when you're looking for a vector database, you can't just take the name of a feature at face value.

So you can see that underneath hybrid search is a whole bunch of different things. So as you're trying to choose between these vector databases, really dig into it and figure out what hybrid search means. Is it just metadata filtering? Is it dense and sparse vectors? What are they actually talking about? And then make your choice based on what your feature requirements actually are. It's an unfortunate thing in the industry, but it's important, and we wanna make sure that we highlight that. One other thing that you said at the top of this question was, when it comes to approximate nearest neighbor search, you can just use a vector index like Faiss, which is not a database.

So how would you describe the difference between an index and a vector database?

All right, so yeah, this is a very, very classic question. Faiss is a collection of libraries, and the relationship between Faiss and, for example, Milvus is something like the one between a B-tree and MySQL: with MySQL you have more than the B-tree. So you have, as mentioned, the scalability, and also data persistence, fault tolerance, monitoring, and all these kinds of things. A library is just the algorithm, and when you compare an algorithm with a database, there's a lot of gap there.

I love that. B-tree and MySQL, that's a perfect way to describe the difference between a library index like Faiss or HNSW and a vector database. And do you think that there are some situations where maybe a library is just good enough for people?

I guess so. If you're just doing some experiments, and you work on a small amount of data, like less than 1 million vectors, feel free to just work with a library, since you are not serving production. You don't need this kind of fault tolerance, or this kind of monitoring system, or a security system.

Yeah. And so then the next question is, what's the difference between a similarity metric and an index? Because you talked about Euclidean, L2, you talked about inner product, you talked about cosine, right? Those were your similarity metrics. But then we have these libraries or indexes like HNSW and Faiss, et cetera. So what's the difference between the two? All these words just sound like a lot of confusion, right?

The metric just decides how close two vectors are. For example, given two vectors, you can use IP, which means you multiply them dimension by dimension and add the products together as the distance, or you can use L2, the Euclidean distance. And the index is a data structure, a way to organize these vectors so that you can search them as fast as possible.

And the top-K search is based on the metric. For example, if you use IP, the results will be the most similar from the IP perspective. If you use L2, you get the top K from the L2 perspective. So yeah, they are two different kinds of things.

Cool.
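To make the metric-versus-index distinction concrete, here is a small sketch under stated assumptions: the three metrics are written out by hand, and the "index" is just a brute-force scan over a dictionary of made-up vectors. A real index such as HNSW would replace the scan with a faster data structure, but the metric would still define what "closest" means. Vectors are assumed to be non-zero for the cosine case.

```python
import math

def inner_product(a, b):
    # Multiply dimension by dimension and add the products together.
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    # Euclidean distance between the two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Inner product of the vectors after normalizing their lengths.
    return inner_product(a, b) / math.sqrt(inner_product(a, a) * inner_product(b, b))

def top_k(query, vectors, k, metric):
    """Brute-force top-K over a {name: vector} dict under the chosen metric."""
    if metric == "L2":
        key = lambda name: l2_distance(query, vectors[name])         # smaller is closer
    elif metric == "IP":
        key = lambda name: -inner_product(query, vectors[name])      # larger is closer
    elif metric == "COSINE":
        key = lambda name: -cosine_similarity(query, vectors[name])  # larger is closer
    else:
        raise ValueError("unknown metric: " + metric)
    return sorted(vectors, key=key)[:k]

vectors = {"a": [1.0, 0.0], "b": [3.0, 3.0], "c": [0.0, 1.0]}
query = [1.0, 0.0]
nearest_l2 = top_k(query, vectors, 1, "L2")   # "a" coincides with the query
nearest_ip = top_k(query, vectors, 1, "IP")   # "b" wins because it is longer
```

The same data and the same query give different winners under L2 and IP, which is why the metric has to be chosen when the index is built.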

Awesome. And when you look at your actual index building, or when you're building out your queries, you'll actually see that a lot of these things are requirements. And so hopefully that's gonna make a lot more sense as you're learning how to use these various vector databases that are out there. Okay, let's go on.

Okay. All right. So everybody talks about, oh, wow, we support HNSW, so everyone knows what that acronym is now if they're in the world of vector databases. But there are so many other indexes. Is HNSW good enough? Should people know about the other ones? Are they important?

I think HNSW is famous because it's easy to get and also easy to understand, and it works as a general solution. But that does not mean that HNSW itself is the best one. It is the most well-known one, but not the best one.

So from the pure algorithm perspective, HNSW is mutable, which means that you can add data to it, you can delete data from it, and you can search it. And there are some other graphs that are immutable. Oh, and just in case people don't know what HNSW is: HNSW is a graph data structure. Each vector is a point, a vertex, inside the graph, and we connect these vertices with edges. You have multiple layers, and you iterate over the graph to get closer to the query. Some other graphs, like Vamana from DiskANN, and NSG, are immutable, and they also have better performance. Also, some quantizers can be applied to HNSW, or to other graphs or other algorithms.
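The graph walk described here can be sketched on a single layer. This is a deliberately simplified illustration, not HNSW itself: real HNSW keeps multiple layers and a candidate list rather than a single current node. The graph, the vectors, and the function name are made up for the example.

```python
import math

def greedy_search(graph, vectors, query, entry):
    """Greedy walk on a neighbor graph: keep moving to the neighbor closest
    to the query until no neighbor is closer than the current vertex."""
    current = entry
    best = math.dist(vectors[current], query)
    while True:
        nearest = min(graph[current],
                      key=lambda n: math.dist(vectors[n], query),
                      default=None)
        if nearest is None or math.dist(vectors[nearest], query) >= best:
            return current  # local minimum: no neighbor improves on current
        current, best = nearest, math.dist(vectors[nearest], query)

# Five vertices on a line, each connected to its immediate neighbors.
vectors = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [2.0, 0.0], 3: [3.0, 0.0], 4: [4.0, 0.0]}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
found = greedy_search(graph, vectors, query=[3.2, 0.0], entry=0)
```

The walk only inspects the neighbors of the vertices it visits, which is where the speed comes from; the price is that it can stop at a local minimum, which is why the result is approximate.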

And also, if you change to GPU, HNSW might not work, because GPU is a totally different thing. There we have CAGRA from NVIDIA, which is now supported in Milvus and is well suited to GPUs. Also, HNSW is not good for a big K.

Graph-based algorithms are actually not very good for a big K. If your K is a couple of thousand, or 10,000 or more, a graph can never satisfy your needs. And if you do batch search, a graph is also not a very good choice. So it is a general choice: most people will not have that big a K or that big a batch, and they don't need that super-high performance.

And they will not run on GPU. So I think HNSW is a good starting point. And for some richer semantic features, like filtering and sparse vectors, pure HNSW has some problems, so you need some specific support for them. And from the use-case perspective, I have a concept I call CAP for vector search. It's not the one from distributed systems; it's about the vector search area.

It's capacity, accuracy, and performance. It's a triangle: you can never satisfy all of them. Only two of them can be satisfied. And HNSW typically goes for accuracy and high performance.

And the other two combinations also have some use cases. For example, if you want a very big storage capacity and high accuracy, you may choose a disk-based index. And if you want performance and capacity and you don't care as much about accuracy, that can happen too. In some use cases, people search with a big K, get the results, and pass them to a machine learning model to do a fine re-ranking.

In this case, they don't care as much about the accuracy. So IVF with SQ or PQ, this kind of solution, can be a great one.

That's a really great point. So I just wanna reiterate to everybody: just like with databases, we have these trade-offs.

So with the famous CAP theorem, you just can't have it all. If your use case needs consistency, or availability, or P, which is partition tolerance, you have to make some choices, right? And it's really easy to make those choices if you think about what your use cases are. And so there is a similar kind of triad with vector search as well. Once you figure out what the most important thing is, whether it's accuracy, recall, performance, or being able to search with that metadata, as Li just mentioned, then it's not just choosing the database, but also choosing the index that's gonna be able to match the requirements of your particular use case. And so HNSW is a great index, but just make sure that it matches what you're trying to accomplish.
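The capacity-versus-accuracy corner of that triangle can be illustrated with scalar quantization (the SQ in IVF-SQ). The sketch below is a hand-rolled 8-bit quantizer under simple assumptions: every component lies in a known range `[lo, hi]`, and each float is stored as one byte. It is not the encoding any specific library uses.

```python
def sq8_encode(vec, lo, hi):
    """Map each component in [lo, hi] to one of 256 levels (one byte each)."""
    scale = (hi - lo) / 255
    return bytes(min(255, max(0, round((x - lo) / scale))) for x in vec)

def sq8_decode(code, lo, hi):
    """Reconstruct approximate floats from the one-byte codes."""
    scale = (hi - lo) / 255
    return [lo + c * scale for c in code]

vec = [-1.0, -0.3, 0.0, 0.42, 1.0]
code = sq8_encode(vec, lo=-1.0, hi=1.0)
approx = sq8_decode(code, lo=-1.0, hi=1.0)

# One byte per dimension instead of four for float32: 4x less memory,
# paid for with a reconstruction error of at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(vec, approx))
half_step = (1.0 - (-1.0)) / 255 / 2
```

Storing four times as many vectors in the same memory is the capacity gain; the bounded reconstruction error is the accuracy given up, which is why a re-ranking step on a larger K is often paired with this kind of index.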

Yeah. Excellent. Okay. So a vector database is a database, right? So what about the ingestion side? What do we gotta do on that side? What kind of pre-processing is required for a vector database?

Okay, so I have two interpretations of this. The first is about a question people always ask: where do the vectors come from? Usually you have a model, and you put your picture or your document into the model.

It can be BERT, it can be from OpenAI, or whatever large-model company. And you get the embedding vectors out and put them into the vector database. That looks like a very long, very tedious process. But luckily we have some tools to help, for example LangChain and LlamaIndex, which are famous for it. They can combine the large language model, the vector database, and vector extraction, all these kinds of things together.

And also, some vector databases already include some models, not necessarily large ones, inside the database, so they can take unstructured data in directly, and some of them do not. So it depends. That's the first interpretation: how to do the ingestion pre-processing.

The other interpretation is about pure vectors streaming in, and how we support that: vectors keep coming in, and we search on them. Actually, it's pretty hard to satisfy data freshness and efficiency at the same time. If you ingest data and you want very high visibility, meaning you want to search on it immediately, it depends on the database. Some databases use HNSW directly, so when data comes in, they try to add that point to the HNSW graph. That takes time, which means that you cannot search what you just streamed in immediately.

I call this sacrificing freshness to reach efficiency. In Milvus, by design, we have two layers. When data comes in, we build a very fast index to serve data freshness. And then, in the background, we build a graph-based index or some other index to serve efficiency.

Yeah, that is nice.
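The two-layer idea can be sketched as a toy class. This is an illustration of the general pattern only, not Milvus's implementation: the class and method names are hypothetical, the "growing" tier is scanned by brute force so new vectors are searchable immediately, and `seal()` stands in for the background job that would build a graph or other index over older data (here the sealed tier is also scanned by brute force to keep the example short).

```python
import heapq
import math

class TwoTierCollection:
    def __init__(self):
        self.growing = {}  # fresh vectors, visible to search right after insert
        self.sealed = {}   # older vectors; a real system would index these

    def insert(self, key, vector):
        self.growing[key] = vector

    def seal(self):
        """Background step (simulated): move fresh data into the sealed tier."""
        self.sealed.update(self.growing)
        self.growing.clear()

    @staticmethod
    def _scan(tier, query, k):
        # Top-k of one tier as a sorted list of (distance, key) pairs.
        return heapq.nsmallest(
            k, ((math.dist(vec, query), key) for key, vec in tier.items()))

    def search(self, query, k):
        # Search each tier separately, then merge the two sorted result lists.
        merged = heapq.merge(self._scan(self.sealed, query, k),
                             self._scan(self.growing, query, k))
        return [key for _, key in list(merged)[:k]]

coll = TwoTierCollection()
coll.insert("a", [0.0, 0.0])
coll.insert("b", [5.0, 5.0])
coll.seal()                      # "a" and "b" are now in the sealed tier
coll.insert("c", [1.0, 1.0])     # "c" is fresh and not yet sealed
fresh_result = coll.search([0.9, 0.9], k=2)
```

Because every search merges both tiers, the freshly inserted vector shows up in results before any index rebuild has happened.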

I'm not quite sure whether that...

No, that was perfect. So, as you mentioned, after people learn about vector databases, their first question is, how do I get those embeddings? You can use your own models to generate those embeddings and then insert them into a vector database. Some vector databases have a mechanism to actually create those vector embeddings, so look for that when you're choosing a vector database.

You can also use other tools like LangChain, LlamaIndex, Semantic Kernel, et cetera, that have really strong ties to large language models. So they can also help you generate those embeddings before you store them in a vector database. So it all depends on what your use case is and what your stack is. But these are things for you to think about when you're choosing a vector database.

Okay, really cool. Okay, what about security? Because these vectors look like nothing to me. I don't even know if I can decipher them. So why do I have to worry about security? Or maybe I don't.

I think security always matters, no matter what kind of database you are. The vectors may be hard for you to understand, but maybe they're easy for a large language model to understand. Maybe in the future we'll have a magical interpreter that can understand them with some context. So in all these cases, it's super important to have security assurance inside the database. And another point is that, beyond the vector data, we also have some structured data, the strings and so on that we mentioned as the metadata.

And we also need to keep that secure. So when it comes to what kind of security features: first, the database needs to get some security certifications, like SOC 2, ISO, and GDPR, this kind of thing. Another thing is data security. Data security is about how you do data isolation, access control, role-based access control, and user authentication and authorization. And there's also cyber security, like IP address access control, private link, encryption in transit, all this kind of thing. And this is another thing I want to point out as a difference between a library like Faiss and a database, just to recall that earlier question.

So security is definitely one of them.

Yeah, you're absolutely right. I was being a little facetious when I said, does it even matter? When you're choosing a vector database and you plan to put it into production, you're gonna have to talk to your internal security teams. And a lot of times they still don't know what vector embeddings are. So you're gonna have to explain what a vector embedding is, and their natural first reaction is gonna be, oh, well, somebody is definitely gonna be able to reverse engineer what that vector embedding is.

And if there's any kind of company confidential information in there, they're gonna be really, really concerned. It's not something that's easy to do, but there are some papers out there that claim that they can do that kind of reverse engineering. So security is actually really important when it comes to your vector embeddings, and your security teams will let you know that for sure. And then, as Li mentioned, all the traditional security questions are gonna come from your security teams anyway. So when you're looking at these vector databases, you're just gonna have to put that on the list.

It's gonna be really critical when you're choosing your vector database. So we have a few minutes left, and I wanna get to some of the questions from the audience. Jaylynn asks, can you please share some examples of currently available vector databases?

Okay, so among the open source ones, Milvus must be the most famous one. And Qdrant and Vespa, those are what come to my mind first. And some lightweight ones, like Chroma and LanceDB. And some closed source ones, like Pinecone. And also, of course, there are the managed services built on the open source ones.

And then also, every database that's out there, whether it's a NoSQL or a SQL-based database, I don't think there is a database out there that hasn't adopted a vector index.

And so it is really important that everybody understands what a vector database is. Our definition of a vector database is that it's purpose-built for vectors from the get-go. It supports a number of the features that we chatted about today. So not only does it support one or many vector indexes, but it also supports the similarity metrics. It also handles the life cycle of vectors in a particular way. And then it's also important to understand that trying to find those similar vectors is computationally heavy.

And so vector databases will typically focus on trying to be very performant on that side of the equation. Whereas some of the traditional databases were built for different purposes, right? So it may work in your situation, but just go in eyes wide open and make sure that you articulate your use cases. Okay. Ali asks, which LLM works the best with RAG currently, in your opinion, Li?

So this is a hard question to answer, since the models keep updating, right? We know that recently Claude 3 did a great job on the benchmarks, and GPT-4 is always a good choice, and Llama is widely adopted in open source.

I don't think I have a very specific opinion.

Yeah, Ali, sorry about that. Just keep in mind, there's a lot of LLMs that are out there. Of course the most famous right now is GPT-4 from OpenAI. It has a really simple API for you to interface with, but it does come with a cost. So if you're building something and it's not in production, you might wanna go to Hugging Face and look at the leaderboard there to see what other LLMs are out there.

There's some LLMs that can actually fit on your laptop, so you can just prototype; you might wanna consider downloading those. There's also a bunch of open source LLMs that might be sufficient. Once again, think about what your use case is and see which model might be sufficient. So don't just go with the market leader. And then unfortunately, as Li mentioned, these models are changing constantly. So it's a race, isn't it?

Yes.

Every day it seems like somebody new is coming out. I think Anthropic now has Haiku, and then a couple weeks ago, Voyage AI was talking about how they're really good with code, right? Because code is also very specific. So unfortunately, for all of us, it feels like a million people running by us constantly. So it's hard to catch up. It's hard to keep up.

Okay. ESH asks, is a collection a set of vectors which could have one or more indices? Can you index the same set of data with different indices for different use cases?

So it depends on the database, right? In Milvus, a collection is a concept that is not only about vectors. It's a table; as mentioned, you can treat it as a MySQL table, and one column is vectors. And from the latest version, Milvus 2.4, it can have multiple columns of vectors and do hybrid search on them. And you can index one column of vectors with one specific algorithm. So again, it depends on the database you choose.

But you bring up another good point that we didn't even think about. Yeah, even the word collection is very different from vector database to vector database.

So you have to dig into that to understand what it specifically means. And I think you also have to ask yourself what your multi-tenancy requirements are for your application, because you can use collections or partition keys or a number of these things to help you with that. And so unfortunately you're gonna have to dig into that, 'cause nobody's using these terms consistently. Yeah. Okay.

So I wanna go back. Armando had a comment question. He wanted to know, can you do a hybrid search with two dense embeddings?

The answer is yes, but it also depends on the vector database. From Milvus 2.4, we support it, so you can expect it in the release.

Yeah, I think I've seen really, really intense workarounds from our users to be able to do that, because it is definitely something that they want.

With the next release of Milvus, we are allowing you to store multiple vectors in a single row, a single entity, so you can do a search against that. We're just trying to make things a lot easier. But even before implementing this, what we saw was people trying to build that themselves, doing these crazy kinds of queries, which you can...

You can have it in a couple of collections.

Yeah, you can have them in a couple. So it's just inefficient. So, back to making sure performance is still important, let's do it so that you can get the best performance. Okay, so we did that. Abe, you had a really good comment earlier, and I do wanna share with everybody.

Our goal after today's talk is to turn this into a paper with lots of visuals. So for those of you that are more visual learners, we'll have a way for you to grasp this information via those means. And if you have other questions that you want Li to answer, just shoot us a note. Li is obviously passionate about vector databases, but also, with his background, super passionate about databases in general, probably coming from his time spent at Carnegie Mellon.

You can also hear him speak on the Carnegie Mellon podcast. I think that was maybe five or six months ago, where he goes really deep into the database side as well. Let me see if we have any last questions. Here we go.

I have a very large collection in Milvus. I'll be performing either substructure or superstructure search. Should I drop the index every time I perform a search, or build two collections?

Can I ask what is meant by substructure and superstructure search? I'm not quite sure about the meaning behind them.

Yeah. So if you can give us a little more detail, we can dig into that a little bit more. But hopefully we don't have to.

You don't have to do what you're asking. Dropping indexes and all that, that sounds just too intense. But let's do this: shoot us an email. You can send me an email at chris dola@zillows. com.

You can also ping us in the Discord channel. You are listed as anonymous, so I wouldn't be able to respond back to you, but if you can send back a more detailed question, that would be cool. We'll try to get that answered. Let me just check one more time.

Is there anything...

Oh, sorry, I think I have the answer. So it means you must have binary data, right? You must have binary vectors. So those are kinds of metrics for binary data. And if you mean that you want to use two different kinds of metrics, then yes, definitely, for now you need to rebuild the index. And if you don't wanna drop and rebuild, you definitely need to have two collections, or, from 2.4, you can have two columns.

So let us know. It sounds like, depending on what your situation is, whether you're talking about binary vectors or about what we mentioned earlier, having two vectors in a row, there might be two different answers here, so we need to get to what the question really is. All right, cool.

Well, thank you everybody. Like I mentioned, Saachi will post a recording as quick as she can. She's gonna do a quick edit. If you have any other questions for Li, you can reach him in our Discord channel. You can also reach him on GitHub, in the GitHub discussions area.

And also, we will turn this into a paper with lots of visuals to make this even more clear. And if you have even more questions, we can always do a follow-up webinar in whatever format you like. So if everybody likes this kind of format, where we just do a Q&A, we can do this again. Or if you prefer a more traditional webinar style, we can do that. We wanted to try something a little different this time.

So let us know what works best for you and your learning. And we hope that this information helps you decide what your requirements are for vector databases, and whichever vector database you choose to match your requirements, we hope that you are very successful in the cool things that you're building. Any last words, Li?

No, I'm so glad to share all these kinds of things with you.

And actually I have way more words to say, but it looks like I don't have any time now, so hopefully...

Another time. Yeah. All right. Thanks everybody. We'll see you guys again soon.

Bye-Bye.

Meet the Speaker

Join the session for live Q&A with the speaker

Li Liu
Principal Engineer
Li Liu is the Principal Engineer at Zilliz, leading the vector searching research and development. Before joining Zilliz, he was a Senior Engineer at Meta, designing and shaping numerous advertising stream data frameworks. With a Master's degree from Carnegie Mellon University, he boasts extensive experience in databases and big data. Li Liu's expertise in technology and innovation continues to drive advancements in vector searching, leaving a lasting impact on the field.
