Webinar
Vector Database 101: A Crash Course
What will you learn?
Vector databases experienced a flood of interest in 2023. They play a critical role in the most popular LLM application: chatbots. But did you know that chatbots are only a small portion of what vector databases can be used for? They can also be used for product recommendations, chemical analysis, reverse image search, and more.
Why? Because the primary use case of vector databases is to allow you to work with unstructured data.
Topics covered:
- Why unstructured data is hard to work with
- An overview of common use cases that benefit from vector databases
- How to get your unstructured data into a vector database
- What to consider when evaluating and building your own vector database
- How to avoid common mistakes when getting started with a vector database
Today I'm pleased to introduce today's session, Vector Database 101: A Crash Course, and our guest speaker, Yujian Tang. Yujian is a developer advocate here at Zilliz. He has a background in software engineering, working on AutoML at Amazon. Yujian studied computer science, statistics, and neuroscience, and has research papers published at conferences including IEEE Big Data.
He enjoys drinking bubble tea (I can confirm this), spending time with his family, and being near water. Welcome, Yujian. Oh, thank you, Emily. Hello, everybody.
So today we're gonna be covering Vector Database 101. The basics of what we're gonna cover today: we're gonna start with an overview of the problem that vector databases solve and how they let you work with unstructured data. Then we're gonna dive into some use cases and how they work, and then we're gonna go through a demo at the end. So first, a little bit about me. As Emily mentioned, I'm a developer advocate here at Zilliz, and I love bubble tea. On screen there to your right, there's a QR code that will take you to my LinkedIn if you would like to connect and ask me questions later.
A little bit about Zilliz before we get started. Zilliz was founded in 2017. We're based in Redwood Shores, California, and we are the maintainers of a bunch of open source projects. The ones that are most relevant to what we're talking about today are Milvus, which is our open source vector database, and GPTCache, which is our open source semantic cache. Okay, so today we're gonna cover these topics in this order.
So we're gonna do vector databases 101, kind of an overview of the types of data that you get to work with and some of the use cases, and then how vector databases work, and then a demo. Okay, so Vector Databases 101: why vector databases? What's the point? I think the first thing to talk about here is that vector databases are meant to work with unstructured data. And unstructured data is 80% of the data in the world. Vector databases are the only type of database that does this. And the way that these are different from, let's say, any other database, like a SQL database or NoSQL database, is essentially the search pattern.
The search types, right? So in SQL databases, in NoSQL databases, you're doing a lot of key-to-key matching. You're trying to find, like, hey, I'm looking to select X, Y, Z from something where X, Y, Z, something like that, right? So you're doing a lot of key-to-key matching. In vector databases, what you're actually gonna be doing is a lot of vector comparisons, and vectors are these representations of unstructured data that are really just long series of numbers. And so vector databases are specifically built to work with this type of unstructured data, including text, images, videos, and audio, by turning it into numbers and then allowing you to work on that quantitatively. So let's cover some examples of why different kinds of unstructured data might be hard to work with. First we're gonna look at something around text, and then we'll look at something around images.
And so this is gonna be an interactive portion of the presentation, so get ready to type into your chat. Okay, so here, when we're looking at text, we're doing some sort of keyword search. One thing that you might miss with regular keyword search that you might be able to pull with vector search would be something like the semantic meaning, the context, and the user intention, right? So for example, here you have an apple, and you could mean an apple like the fruit, or you could mean Apple like the company. And then there's something about rising dough, proofing bread.
And then there's car tires, right? If you wanna change your car tire, are you looking to get instructions on how to actually change the car tire? Or are you looking to know when you should be changing your car tire? So with text, there's a lot of meaning and context and things outside of just the text itself. And so doing this keyword search is not always gonna be good enough. A, you may not get the right results back, but B, you might not even get the right intention back, right? So then this part is the interactive part of the presentation. Here I've got some pictures of a very, very famous celebrity, my favorite celebrity, Taylor Swift. And what I would like to know from you guys is which one of these pictures do you think is not Taylor Swift? So I'm gonna take a pause here, and I'd love for you guys to put some number into the chat.
Either one for this one right here, the very red one; two for the second one here; three for the one where she's got the black top on; and then four for this last one where she's wearing this little sparkly dress. Let's see: 2, 4, 2, 3. This emoji, that's not really a number. Okay, looks like there's a lot of guesses out here, a lot of disagreement.
I'm gonna let some of these answers trickle in before I tell you guys. So hopefully this exercise gives you an idea of what it's like to perform the operation that a vector database might perform. In this case, the operation is similarity search for images. And those of you who said two, by the way, you guys are correct. Two is the one that's not actually Taylor Swift. All of the other ones are Taylor Swift.
I see there's a lot of threes and even a four out there, but I guess everybody knows that this one here is Taylor Swift. This must be a very iconic picture. Okay, so these vectors that are numbers, how do they work? Where do we get them? Let's start with our knowledge base, which in the examples before would've been some sort of text, or would've been the pictures of Taylor Swift. In an enterprise setting, probably for work for you, it's probably your actual internal documents. So what happens with this data is it gets passed into a deep learning model.
And it's really important here that the deep learning model that you have is the right type of model. It's gotta be trained on the right type of data. If you're working with images, you gotta have a model that's been trained on image data. If you're working on text, it's gotta be trained on text data. If you're working on images and you're specifically looking to identify types of cats, you gotta have that in the data, right? And so what happens is you take your image data and you run it through a deep learning model, and then you cut off the last layer.
So the last layer of a deep learning model typically does the prediction, right? Let's say, for example, you have an image model. It'll typically do some prediction like, yes, there is a cat in this picture, or no, there's not a cat in this picture. But what we actually want to know is what the model has learned about that data, and we don't want it to give us a prediction. We want to just know what it's learned, so we have the numerical representation, so that later on when we want to work with that data, we have a way to quantitatively work with it. So the way we do that is we cut off the last layer and we take the output from the second to last layer.
So why the second to last layer and not any of the other layers in the middle? It's because as you pass data through a model, each layer will learn something new about the data. And the second to last layer contains all of the semantic information without doing the prediction. And that is basically what a vector is. Then you take that vector and you put it into a vector database like Zilliz or Milvus, and later on you can query and compare other information to that vector. It's important to also understand that the vectors that you compare have to all be the same length, the same dimension, and we'll cover that.
We'll take another look into that at a later slide. Okay, so how do you use vector databases, right? How do you actually use these? Here are nine use cases for vector databases. The most popular one that we've seen over the last few months, over the last year basically, has been RAG, retrieval augmented generation, right? It's right here, it's in the top corner. If you've heard of RAG, drop something in the chat, let me know what you've heard about it, or if there's any questions you have about RAG, drop that in the chat. And outside of RAG, there's a bunch of other things as well.
So there's recommender systems. For example, product recommendations is actually one of the most common use cases, and one of the most common use cases in production for Milvus, right? Products are complicated things that have text and images and reviews and all of these different things attached to them. And so what recommender systems will do is compare these products and these users, and use a vector database to quantify that and compare that. Other examples, right? The text search, we just talked about that. The image search, we just talked about that. Video similarity search.
You know, if you can do video-to-video search, that'd be great. Audio similarity search. Anomaly detection is another really common one: how different are two user actions? Is it really the same user? And this is super important when it comes to things like fraud detection, right? And then the last one that we have listed on here is multimodal search. This is something that we're gonna be working on more in 2024. We're gonna be creating some tutorials on how to do multimodal search.
So let's look at some examples. In this slide here, there is a QR code to the right, and there's a bunch of pictures right here, right? These pictures are all paintings, and I believe all of the ones here are impressionist paintings. And these are all the search pictures. So what this notebook walks you through is essentially how you ingest these pictures and compare other pictures to them. Pictures do take a while, so this notebook will take you somewhere between 12 and 15 minutes to run.
But check it out later. And then another example that we have here is text search, right? We just looked at image search, and now we're gonna take a look at text search. And later on, for the demo that we're gonna walk through, we're gonna be doing something with text as well. In this text search, what we did was we just scraped Wikipedia. Basically, I scraped a page from Wikipedia, The Nightmare Before Christmas, and I asked, what is the plot? And it says the plot of The Nightmare Before Christmas revolves around Jack Skellington, the Pumpkin King of Halloween Town, who becomes tired of the same routine of Halloween and discovers Christmas Town.
And then it goes on and tells you about the rest of the plot of The Nightmare Before Christmas. Okay, so how do vector databases actually work? What's going on beneath the surface? First, let's look at an example entry. This is what you might store in one entry in a vector database. The two most important pieces of this entry are the ID and the embedding. The ID is the way that the vector database is able to uniquely identify your entry.
And then the embedding is the actual vector embedding. This is what gets compared when we go and search through and compare vectors, right? So this is why earlier I was saying vector databases are a specific type of database that are optimized for a specific type of use case. And that use case is to compare embeddings and to do this kind of high-compute task. Then in addition to those two fields, we also have a bunch of metadata fields in this example. And these are fields that you can filter on, basically.
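As a rough illustration of the entry described above, here is a minimal sketch of entries with an ID, an embedding, and metadata, plus a search that filters on metadata before comparing vectors. The metadata field names (`publication`, `reading_time`) and all values are made up for illustration, and the plain-Python search only stands in for what a vector database does internally.

```python
import math

# Hypothetical entries: only "id" and "embedding" come from the talk;
# the metadata field names and all values are invented for this sketch.
entries = [
    {"id": 1, "embedding": [0.1, 0.9], "publication": "Towards Data Science", "reading_time": 5},
    {"id": 2, "embedding": [0.8, 0.2], "publication": "Towards Data Science", "reading_time": 12},
    {"id": 3, "embedding": [0.2, 0.8], "publication": "Other Blog", "reading_time": 4},
]


def search(query, entries, max_reading_time):
    """Filter on a metadata field, then rank the remaining entries by L2 distance."""
    candidates = [e for e in entries if e["reading_time"] <= max_reading_time]
    return sorted(candidates, key=lambda e: math.dist(query, e["embedding"]))


results = search([0.1, 0.85], entries, max_reading_time=10)
result_ids = [e["id"] for e in results]
print(result_ids)
```

Entry 2 is excluded by the metadata filter before any vector comparison happens; the remaining entries are ordered by how close their embeddings are to the query.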
And this example here is actually from a demo app that I made for Chat Towards Data Science. And actually this uses Zilliz Cloud. Okay, so let's look at perhaps the core functionality behind the vector database. This is the very basic functionality. This is semantic similarity.
This is what the vector database does for you. So in this example, we have four words that are all set up as two-dimensional vectors. Before we get into the example, I want to be clear that this is a toy example. You'll never see companies using two-dimensional vectors for words. That'd be really weird. And then I also want to point out here that queen and woman, and king and man, have the same value on the first axis, or the first dimension.
And all that means here is that these words have the same value on that dimension. It doesn't tell us anything about what that dimension means. It just says that these words relate the same way along that dimension. So it could be, for example, that 0.3 correlates to words with five letters and 0.5 maps to words with either three or four letters, right? So I just wanna point out this dimension doesn't necessarily correspond to a discrete concept in our world, I guess. So let's take a look at the math behind this, right? Queen minus woman plus man equals king is the idea behind this slide. So queen is (0.3, 0.9), and woman is (0.3, 0.4). And so the difference there when you subtract them, which in this example is 0.5 on the second dimension (by the way, this is called Manhattan distance, when you just do the subtraction, and this is also usually not implemented), is what we add man to. Man is (0.5, 0.2), and we'll see that adding that together gives us (0.5, 0.7), which maps directly to king.
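The arithmetic on this slide can be checked with a few lines of code. This is a minimal sketch using the four toy vectors from the talk; the helper names are hypothetical, and real word embeddings have hundreds of dimensions rather than two.

```python
import math

# The four two-dimensional toy vectors from the slide.
words = {
    "queen": (0.3, 0.9),
    "woman": (0.3, 0.4),
    "man": (0.5, 0.2),
    "king": (0.5, 0.7),
}


def subtract(a, b):
    """Element-wise a - b."""
    return tuple(x - y for x, y in zip(a, b))


def add(a, b):
    """Element-wise a + b."""
    return tuple(x + y for x, y in zip(a, b))


# queen - woman + man
result = add(subtract(words["queen"], words["woman"]), words["man"])

# Look up the stored word whose vector is closest to the result.
closest_word = min(words, key=lambda w: math.dist(words[w], result))
print(result, closest_word)
```

The lookup uses a nearest-neighbor comparison instead of exact equality because floating-point arithmetic rarely lands on exactly the stored value, which is also how a vector database approaches the problem.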
So the idea behind this example is that you can do math on things that are not originally numbers using vectors. And in the example we used words. Okay, so let's look at some of the actual similarity metrics, right? The way that we're doing the math in this slide, I called this Manhattan distance. So let's look at distance metrics that we actually do use in vector similarity search. Manhattan distance is just, you know, q_i minus p_i, basically. So this is Euclidean distance, and this is probably very familiar to most of you. If you've taken anything beyond algebra two or geometry, you probably have an idea of this line, I guess this concept, called a hypotenuse in right triangles.
And essentially that's what the L2 or Euclidean distance is measuring. It's x minus y squared, or, you know, x one minus x two squared plus y one minus y two squared, all under a square root. And in practice in Milvus, we actually don't do the square root, because that's just extra calculations, and all the distances actually come out to the same rank order. Sorry.
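To make the metrics above concrete, here is a small sketch of Manhattan distance (written in its usual form, the sum of absolute differences), Euclidean distance, and squared Euclidean distance. It also illustrates the point about skipping the square root: ranking by the squared distance gives the same order as ranking by the true L2 distance. The point labels and coordinates are made up.

```python
import math


def manhattan(p, q):
    """L1 distance: sum of absolute differences per dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))


def squared_euclidean(p, q):
    """L2 distance without the final square root."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def euclidean(p, q):
    """L2 distance: the straight-line distance between two points."""
    return math.sqrt(squared_euclidean(p, q))


query = (0.5, 0.7)
points = {"a": (0.3, 0.9), "b": (0.5, 0.2), "c": (0.3, 0.4)}  # made-up data

rank_l2 = sorted(points, key=lambda k: euclidean(query, points[k]))
rank_squared = sorted(points, key=lambda k: squared_euclidean(query, points[k]))
print(rank_l2, rank_squared)
```

Because the square root is a monotonically increasing function, dropping it never changes which neighbor is closest, only the reported distance values.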
Okay, so the next one is IP, inner product. Earlier we just saw Euclidean, which measures distance in space, right? Basically, if you have a right triangle, it measures direct distance in space. IP is a little more complicated than that. What this measures is the projection of one line onto another. So if you were to imagine here, instead of saying that for these two points we're gonna do the hypotenuse, what we're actually gonna do is say we're projecting from the origin to these two points, and the IP is the projection of one point onto the other. So it's like, how do you project that at a right angle? And this essentially measures not just the angle difference between two vectors, but also the space distance between two vectors.
So this measures difference in orientation and magnitude, and L2 is just magnitude. Now the next one is cosine, which just measures orientation, right? So we have magnitude, orientation, and orientation and magnitude. The way that you think about cosine is basically just the difference in angle between two vectors. Vectors can be thought of as both points and lines that point to points. And so basically with cosine and IP, you're thinking about them as the lines that point to points.
And with cosine, you're just thinking about how big that angle is between my two vectors. One thing you'll notice here that's kind of interesting is that cosine and IP look very similar. You see IP is the sum of a_i b_i, and cosine has that up top in the numerator. So cosine is actually just normalized IP. And so if your vectors are all normalized, which means the magnitude of your vectors is all one, then you should just be doing IP, and you get the exact same value as cosine.
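The relationship between cosine and IP described above can be sketched as follows. The two vectors are arbitrary examples; the point is only that once both vectors are scaled to length one, the inner product and the cosine similarity coincide.

```python
import math


def inner_product(a, b):
    """IP: the sum of a_i * b_i."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Magnitude (length) of a vector."""
    return math.sqrt(inner_product(a, a))


def cosine(a, b):
    """Cosine similarity: the inner product divided by both magnitudes."""
    return inner_product(a, b) / (norm(a) * norm(b))


def normalize(a):
    """Scale a vector so that its magnitude is one."""
    n = norm(a)
    return tuple(x / n for x in a)


a = (3.0, 4.0)  # arbitrary example vectors
b = (1.0, 2.0)

cos_ab = cosine(a, b)
ip_raw = inner_product(a, b)
ip_normalized = inner_product(normalize(a), normalize(b))
print(cos_ab, ip_raw, ip_normalized)
```

On the raw vectors the inner product also reflects magnitude, so it differs from the cosine; after normalization the two agree up to floating-point rounding.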
So when it comes to picking which of these metrics to use, you need to be thinking about, basically, do I need to measure how far apart these concepts are, or how far apart maybe the meaning behind some of these concepts is, and are my vectors normalized, and things like that. Now, it's important also to note that these all have the same rank order, so no matter which one you use, the rank of the vectors that you get back should pretty much be the same for all of them. Okay, so now let's look at some of the ways that you can access indexes. Indexes are the way that you reach your data. So the way that you find your data, the way that you search your data, is determined by the index.
So what we just talked about was the distance metric, which tells us how we measure how far our data points are from each other. And then this tells us how we look for points in our data. So inverted file index, or IVF, is probably the most intuitive vector search method. The way that you can think about this is basically: I'm doing a bunch of k-means. I say there's gonna be some number of centroids, and then we just cluster: we find all of the vectors that are closest to those centroids, and we create these clusters. And then what happens at search time is first we search the centroids for the closest centroids, and then inside of each centroid we search the actual points.
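The IVF idea described above can be sketched in plain Python: cluster the vectors with a basic k-means, keep a list of member vectors per centroid, and at search time compare the query only against the points in the closest cluster or clusters. This is a simplified illustration and not how Milvus implements IVF; the function names, the naive initialization, the sample data, and the `n_probe` parameter name are assumptions made for this sketch.

```python
import math


def nearest(point, candidates):
    """Index of the candidate closest to point, by L2 distance."""
    return min(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))


def assign(vectors, centroids):
    """Build the inverted lists: for each centroid, the indexes of its vectors."""
    lists = [[] for _ in centroids]
    for idx, v in enumerate(vectors):
        lists[nearest(v, centroids)].append(idx)
    return lists


def build_ivf(vectors, n_clusters, iterations=10):
    """Basic k-means; returns (centroids, inverted lists)."""
    centroids = [tuple(v) for v in vectors[:n_clusters]]  # naive initialization
    dim = len(vectors[0])
    for _ in range(iterations):
        lists = assign(vectors, centroids)
        centroids = [
            tuple(sum(vectors[i][d] for i in members) / len(members) for d in range(dim))
            if members else centroids[c]  # keep the old centroid for an empty cluster
            for c, members in enumerate(lists)
        ]
    return centroids, assign(vectors, centroids)


def search_ivf(query, vectors, centroids, lists, n_probe=1, top_k=1):
    """Search the closest n_probe clusters, then rank only the points inside them."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [i for c in order[:n_probe] for i in lists[c]]
    return sorted(candidates, key=lambda i: math.dist(query, vectors[i]))[:top_k]


# Made-up data with two obvious clusters.
vectors = [(0.1, 0.1), (0.9, 0.9), (0.2, 0.1), (0.8, 0.9), (0.1, 0.2), (0.9, 0.8)]
centroids, lists = build_ivf(vectors, n_clusters=2)
hits = search_ivf((0.82, 0.9), vectors, centroids, lists, n_probe=1, top_k=2)
print(lists, hits)
```

With one cluster probed, the query is compared against three points instead of six. The trade-off is that a true nearest neighbor sitting just across a cluster boundary can be missed unless more clusters are probed.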
And we can actually do that by either searching all the points or going through a quantized version, which I'll talk about in a further slide. The next index that is of importance is HNSW, or hierarchical navigable small worlds. This is a graph index. Essentially what this does is it takes all of the points in your vector space and inserts them into a graph. And as it's inserting a point into the graph, it assigns it a uniform random variable.
And depending on the value of that variable, you're gonna be assigned a layer. So all points go into layer zero, and then if your uniform random variable's within a certain range, you go to layer one. And then once again, if your variable's within a certain range, you go to layer two. Now, it's important to note that all points in layer one are in layer zero, and all points in layer two are in layer one. And what happens at search time is you start at the topmost layer, and you go and find the closest one, and you drop down, you find the closest one, and you drop down again and again and again.
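The layer assignment described above can be sketched as follows. This covers only how points are distributed across layers, not the graph edges or the search itself. It uses the level formula from the HNSW paper, floor(-ln(u) * mL) with u drawn uniformly at random, which is one way of mapping ranges of a uniform random variable to layers; the function names, the value `m=16`, and the seed are assumptions for this sketch.

```python
import math
import random


def assign_top_layer(rng, level_multiplier):
    """Draw a uniform random number and map it to the highest layer for a point."""
    u = 1.0 - rng.random()  # uniform in (0, 1], which avoids log(0)
    return int(-math.log(u) * level_multiplier)


def build_layers(point_ids, m=16, seed=0):
    """Return a list of layers; layer 0 holds every point, higher layers hold fewer."""
    rng = random.Random(seed)
    level_multiplier = 1.0 / math.log(m)
    top = {pid: assign_top_layer(rng, level_multiplier) for pid in point_ids}
    n_layers = max(top.values()) + 1
    # A point whose top layer is L appears in every layer from 0 up to L.
    return [[pid for pid in point_ids if top[pid] >= layer] for layer in range(n_layers)]


layers = build_layers(list(range(1000)))
print([len(layer) for layer in layers])
```

Each layer is a subset of the one below it, which is why a search can start among the few points in the top layer and refine its position as it drops down.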
So HNSW is very accurate because it saves all of the distances in the graph. And it's very fast because all distances are saved. The trade-off here is that HNSW takes up a lot of space, as you can imagine. You have to store things multiple times. Okay, the next part of indexing that you wanna know about is quantization.
So this is what we call scalar quantization, which is quantization along one dimension. And quantization, I pretty much just think of this as bucketing, right? So if we have the list of reals, now we have the list of integers. Maybe you have negative three to three as real numbers. And now you say all the values from negative three to negative 2.5 are negative three.
And all the values from negative 2.5 to negative 1.5 are negative two. So it's just bucketing. What this does is allow you to save and store a lot less information on disk, but you get a lot less granular of a search.
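The bucketing described above can be written as a one-line function. This sketch maps each real value to its nearest integer bucket, matching the ranges given in the talk; the sample vector is made up. Production scalar quantization typically maps floats onto a small fixed-width integer range using a scale and offset rather than rounding to whole numbers, so treat this as the idea only.

```python
import math


def quantize(x):
    """Map a real number to its nearest integer bucket (halves round up)."""
    return math.floor(x + 0.5)


vector = [-2.7, -2.5, -1.6, 0.2, 2.9]  # made-up values between -3 and 3
quantized = [quantize(x) for x in vector]
print(quantized)
```

`math.floor(x + 0.5)` is used instead of the built-in `round` because `round` sends exact halves to the nearest even integer, which would not match the buckets described in the talk.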
And then next is product quantization, which is essentially scalar quantization across the entire index. So it's not just one vector; it's horizontal and vertical. Product quantization saves a lot more space than scalar quantization, and it also is a lot less accurate when it comes to search. Milvus allows you to combine scalar quantization or product quantization with IVF and HNSW. Let's say you combine HNSW with scalar quantization: then you get the nice, like, oh, I don't have to have a lot of memory on disk, but I also get the nice accuracy that HNSW provides, right? So you can take these different methods, combine them, and do different things with them to get better results.
Okay, so I'm gonna jump into the demo. Before I jump into the demo, I think we should pause here for some questions. Let's see. Oh, okay.
None of these questions need to be answered at the moment. Okay, cool. So let's look at the demo. In a second, we're gonna look into a code demo. Basically what we're gonna do in this code demo is download some embedding models.
And this is one of the pieces that has been asked about a lot. This is definitely an FAQ over the last year: how do I get my data into Milvus or Zilliz? So we're gonna walk through that example here with three different embedding models. But an even easier way is to use Zilliz Pipelines, which is something that we created pretty recently, because we kept hearing so much about people wanting to get their data into a vector database. So once we download the three models, we're gonna make up a dataset. I will have a link so that you can just grab the dataset that I'm using.
And then we're gonna embed that data with two different models, and we're gonna ingest both of those datasets into Milvus. This is just bulk insert, basically. And then we're gonna query Milvus using that third set of embeddings. So here, I'd love to get your idea on what you think will happen with these embeddings. Drop some comments in the chat and let me know.
While we wait for them to respond, we had a question just come in. It says: the rank order of all three metrics are the same. Is that true? It seems not true to me. The rank order of all three metrics are the same.
I don't have a mathematical proof for this off the top of my head. Mm-hmm. But we can at least easily see that the cosine and the inner product ones are the same, because they are a simple transformation where all of the cosines are basically divided by their magnitudes. For L2, I don't have a transformation for this off the top of my head, but there is a way to transform that equation of x_i minus y_i squared and compare that to the cosine a b evaluation.
So the rank order for them should all be the same, definitely all the same if you have normalized data. What are those different methods? I think, was that about the metrics? Yes, I believe so. Ah, okay. That makes sense. Okay.
Has everybody had a chance to scan the demo and is able to see the notebook that we're gonna be working with? I'll give this another minute or so. And if you have ideas for what might happen with comparing different embeddings and different embedding models, you should drop them in the chat. Mm. Okay. I think this is long enough to get this, so I'm gonna pull up my notebook, and we can also drop the link to the notebook that the demo links to in the Zoom. So let me drop that in the Zoom. Okay.
And then we will, all right, cool. So I'm gonna be doing some copying and pasting and some live coding, some actual live coding. The first thing we're gonna do is copy and paste this. We're gonna install Milvus, PyMilvus, and sentence transformers. These are the three libraries that we're gonna need for this project.
I just did this earlier, so I'm really hoping that nothing breaks. And if this is too small for the audience, just drop us a note in the chat and we can have Yujian zoom in the screen a little bit. So let us know if it looks good to you. Oh, I'll zoom in a little bit. Here we go.
Great. So this is what we're installing right now, and this is taking surprisingly long. Actually, I'm not surprised. Oh, I also need to install pip, but we can do that later. We can install pip at a later date. Okay, so I'm also going to get a bunch of imports here.
So here we're gonna import SentenceTransformer from the sentence transformers library. This is how we're gonna get the embedding models. Then we're gonna import the default server from milvus; this is Milvus Lite. This allows us to have a server that we can spin up directly from our notebook. And then I'm going to get PyMilvus.
And from PyMilvus, I'm gonna import connections, which allows us to connect; utility, which allows us to work with the schemas; and then FieldSchema, CollectionSchema, DataType, and Collection. So Milvus creates collections, and inside of collections there are schemas. Inside of schemas there are fields; basically, fields are what are used to define a schema, and the fields are defined by data types. Okay.
And then here, I also have time. This is mainly just for measuring time. I don't know if we'll actually use it in this specific use case, I guess. All right, so let's see how this does. Is this taking a really long time to import? I always get nervous whenever any of my Python notebook cells take longer than 10 seconds to run.
It's like, what's going on? It's the magic of live demoing. It is the magic of live demoing, but it looks like it's working. It's definitely executing some commands here, at least. Wow, 30 seconds. This is taking quite a while to run.
This is quite unusual. Oh, there we go. Okay, cool. Nice. Okay, so the next thing we're gonna do is start the default server, and then connect to the default server.
So let's do that. default_server.start() basically starts the server, and then connections.connect. Here you'll see that this is localhost, and this is where the server starts by default. And then we also have a method that listens for what the port is that the server started on. I think by default this is 19530.
Okay, here we go. Ah, nice. Okay. Uh, right before I got nervous, 10 seconds. Okay.
So the next part here: I don't expect you to actually be able to type all this this quickly, but you can probably copy and paste it this quickly from the notebook. We're gonna grab three different sentence transformers. You'll see that we're getting the multilingual MiniLM L12, and you'll see that these all have the same base model. So this is the base model from sentence transformers.
And then the other two are fine-tuned. These are fine-tuned on different datasets by different organizations. Okay, so what we're gonna do here is rename these. So this is actually not MiniLM v2 quantized, but probably a different one. But we'll just leave the name here.
So basically what I'm doing here is giving two collection names, and I'm also defining this thing called dimension. Dimension is the size of your vector embeddings. It's the number of dimensions, the number of numbers, essentially, that the embedding model produces for your vector embedding. All of these paraphrase multilingual MiniLM L12 v2 models have a dimension of 384. And you can find this by either, A, playing around with it inside the code, just getting the output and seeing how long that vector is.
Or you can read the model card. And then after we define these, we're just gonna go in with utility and make sure that we have a clean slate, and drop them if they're currently in the, what do you call it, not collection, database. So now we're gonna define our data. I copied and pasted a bunch of this data. This is inspired by Speak Now (Taylor's Version), although really, I think I made this initially when she came out with her version of Speak Now. Actually no, that doesn't sound right. I'm not entirely sure. Anyway, I pulled some lyrics from these four songs, and you can see that they're all basically sentences, and there's like 50 of them. So this is how we're gonna do some embedding comparisons, and we'll just see how long this is.
I believe there's 51 of them. And so with 51 sentences, this is very much a toy example. There's not a lot of results that you can get back. There's only 50 of them. So just keep that in mind.
Okay, so now let's define some important parts of our schema. What we're gonna do here is define the schema for our collections, right? We have these two collections, with two collection names for them, and we're gonna give them the same schema. In our schema, what we're doing is defining the fields. If you remember, earlier in the presentation I was talking about how there's only two fields that really need to be defined, and those are the ID and the embedding. The rest of the fields are metadata fields.
So here we define the ID and the embedding. And unlike the example, I'm gonna use an auto-incrementing integer as my ID. Then here I have to tell the database that the dimension of the vector is 384. And then I just say enable_dynamic_field equals true. This allows me to insert basically whatever metadata I want without having to predefine it in the field schema or in the collection schema.
And actually, we just wrote a blog about this, so we should put a link to that in the chat. There is a blog that tells you a little bit more about how dynamic schema works and what it is. The next part here is we're gonna get the actual embeddings. And so the way we do that is, you see, we have these models, right? Remember, we got these models up here. Where are the models? The models are up here. And then what we're gonna do is call this function called encode.
Essentially, we're gonna create a dictionary here that maps all of the sentences to their vector embeddings. This should take just a little while... oh, wow, that was really fast. That's nice, okay. So this gives us our first two sets of embeddings. We have these two models, we're getting all of the embeddings from them, and we have them mapped. Now we want to define how we're going to create the index on these two sets of vectors. Remember, we talked about the vector indexes and the distances.
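The sentence-to-embedding dictionaries described above can be sketched as follows. The demo used real sentence transformer models; to keep this runnable without downloads, toy_encoder is a hypothetical stand-in that just counts characters into a small vector, and the model names and sentences are made up.

```python
def toy_encoder(seed):
    """Build a stand-in encoder; a real run would call a model's encode() instead."""
    def encode(sentence, dim=8):
        vec = [0.0] * dim
        for position, char in enumerate(sentence.lower()):
            # Each character bumps one slot; the seed makes the two "models" differ.
            vec[(ord(char) * seed + position) % dim] += 1.0
        return vec
    return encode


models = {"model_a": toy_encoder(3), "model_b": toy_encoder(7)}
sentences = ["I pulled some lyrics", "from these four songs"]

# One dictionary per model, each mapping sentence -> embedding.
embeds = {
    name: {sentence: encode(sentence) for sentence in sentences}
    for name, encode in models.items()
}
```

Keeping one dictionary per model makes it easy to insert each set into its own collection later.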
And so here, when we create the index, we're gonna have to keep that in mind. So IVF, right? This was the first index that we talked about, the most intuitive one, the one that's basically just creating a bunch of clusters. And then flat: flat means not quantized. The other examples that I gave were SQ and PQ, scalar quantization and product quantization. Those are good for reducing memory consumption, but since we have a very small number of vectors, that doesn't matter here. And we're using L2 as our metric type.
And then for our parameters, nlist, for IVF, describes how many clusters we want. Since we only have about 50 data points, four clusters is probably a fine number. Okay? So now we wanna create these indexes on these fields, and then load the collections into memory so we can work with them. Okay? Now that we've done that, we're gonna want to insert a bunch of data. Remember, up here we created these two dictionaries of sentences. And up here we created sentences, which is our list of sentences.
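As a rough illustration of the IVF and nlist settings described above, the sketch below clusters a handful of vectors into nlist groups with plain k-means in pure Python. This is not how Milvus builds its index internally; build_ivf and the toy vectors are hypothetical, and the index_params dict follows the shape commonly passed to pymilvus index creation, so treat its exact keys as an assumption.

```python
import math

# Index settings from the talk (dict shape is an assumption about pymilvus usage).
index_params = {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 4}}


def l2(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, nlist, iterations=10):
    """Cluster vectors into nlist inverted lists with plain k-means."""
    centroids = [list(v) for v in vectors[:nlist]]  # naive init: first nlist vectors
    lists = [[] for _ in range(nlist)]
    for _ in range(iterations):
        lists = [[] for _ in range(nlist)]
        for idx, vec in enumerate(vectors):
            nearest = min(range(nlist), key=lambda c: l2(vec, centroids[c]))
            lists[nearest].append(idx)
        for c, members in enumerate(lists):
            if members:  # keep the old centroid if a cluster ends up empty
                dim = len(centroids[c])
                centroids[c] = [
                    sum(vectors[i][d] for i in members) / len(members)
                    for d in range(dim)
                ]
    return centroids, lists


# Toy 2-D vectors standing in for the 384-dimensional sentence embeddings.
vectors = [[0, 0], [10, 10], [0, 10], [10, 0], [0, 1], [10, 9], [1, 10], [9, 0]]
centroids, lists = build_ivf(vectors, nlist=index_params["params"]["nlist"])
```

"Flat" in this picture means the vectors in each list are stored as they are, with no quantization applied.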
So what we're gonna do is create an insertion function. This takes a list of dictionaries, mainly because it is meant to be used as a batch insert, and I'm using it here as a single insert. But you can use this to insert hundreds of data points at once. I could probably batch insert all 50 of these at once. Basically, what we're doing here is saying: here's the sentence, here are the embeddings.
And notice that I don't define ID in the insert, and that's because ID is gonna auto-increment. But notice that I also define this new field called sentence here. The way I'm able to do this without having defined it before is the dynamic schema. Okay? So basically we're creating a bunch of things to insert, then we insert them, and then we call flush on the collection. As for what flush does, Milvus is a distributed system that holds a bunch of data in memory on its nodes.
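The row shape described above can be sketched like this. build_rows and the sample vectors are hypothetical; the commented-out insert and flush calls assume a live pymilvus Collection object named collection, which is not created here.

```python
def build_rows(embedding_by_sentence):
    """Turn a {sentence: vector} dict into row dicts for a batch insert."""
    return [
        {"embeddings": vector, "sentence": sentence}  # no "id": it auto-increments
        for sentence, vector in embedding_by_sentence.items()
    ]


embedding_by_sentence = {
    "first lyric line": [0.1, 0.2, 0.3],
    "second lyric line": [0.4, 0.5, 0.6],
}
rows = build_rows(embedding_by_sentence)

# With a live pymilvus Collection named `collection` (hypothetical here),
# the demo's next steps would be roughly:
#   collection.insert(rows)
#   collection.flush()
```

Building all rows first and inserting them in one call is the batch usage mentioned in the talk; the demo inserted them one at a time.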
So what flush does is this: initially your data sits in memory on the nodes, and flush writes that memory out to permanent storage and creates the actual index on that data. Okay? So now we're gonna find a way to search these embeddings. For the search embeds, I'm gonna once again create a dictionary. I'm gonna take some sentences from the sentences list, the fifth and the sixth sentences here.
And I'm gonna encode them using the model that we didn't use before. Then I'm gonna map them and put them into a list, so that when we search Milvus, we can just give it this list. Okay, here we go. This is a little bit of a complicated section, so I'm gonna pause here and explain it. As I was saying earlier, time was basically just there for me to time the experiments.
What we're gonna do is give it the data. Data is the data that we're searching with, so we pass it this list of vector embeddings. Then we tell it the approximate nearest neighbor search field: the ANNS field is embeddings. That's the embeddings field, the field that we compare across. In order to search it, we have to use the same metric type, L2. And then in the parameters, we have to tell it how many clusters to look at. We only have four clusters.
Let's just look at two of them. And let's say we only want the top three results, and the output field, the metadata field that we wanna pull from the database, is sentence. So we'll do that here for this collection, and then we'll do the exact same thing for the other collection. See how this is a one-to-one repeat: search, data, embeddings, L2, nprobe of two to probe two clusters, limit of three, output the sentence. So let's see.
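The search settings just described (L2 metric, probe two of the four clusters, keep the top three) can be sketched in pure Python. This is only an illustration of the idea, not Milvus internals: ivf_search and the toy 2-D data are hypothetical, and the search_params dict follows the shape commonly passed to pymilvus searches, so treat its keys as an assumption.

```python
import math

# Search settings from the talk (dict shape is an assumption about pymilvus usage).
search_params = {"metric_type": "L2", "params": {"nprobe": 2}}


def l2(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ivf_search(query, centroids, lists, vectors, nprobe, limit):
    """Scan only the nprobe closest clusters and return the top `limit` hits."""
    by_distance = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    probed = by_distance[:nprobe]
    candidates = [i for c in probed for i in lists[c]]
    ranked = sorted(candidates, key=lambda i: l2(query, vectors[i]))
    return [(i, l2(query, vectors[i])) for i in ranked[:limit]]


# Toy 2-D data: four clusters of two vectors each (hypothetical values).
vectors = [[0, 0], [0, 1], [10, 10], [10, 9], [0, 10], [1, 10], [10, 0], [9, 0]]
centroids = [[0, 0.5], [10, 9.5], [0.5, 10], [9.5, 0]]
lists = [[0, 1], [2, 3], [4, 5], [6, 7]]

hits = ivf_search([1, 1], centroids, lists, vectors,
                  nprobe=search_params["params"]["nprobe"], limit=3)
```

Only the two nearest clusters are scanned, so vectors in the other two clusters can never appear in the results. That is the speed-for-recall trade that nprobe controls.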
So time here: you can see the search time for Milvus is insanely fast. And then we can see what the actual values are. This is the first one, and we'll see what the query sentence is and then what the nearest neighbors are. I probably should have separated this into these three and then these three. But you'll see that the query sentence is, ooh, I don't know what song this is.
Oh, I know, this is Speak Now. So this is the query sentence, and these are the result sentences. And you'll see that these are not the same. That tells you that these two embedding models have very different embedding spaces, because none of these three sentences is the same sentence as the query sentence, which is what you would expect to see if you were using the same embedding model. And once again, for these three, none of the nearest neighbors is the query sentence either.
So that tells you that these two embedding models have quite different embedding spaces, despite the fact that they share the same base model. Okay, so now let's look at the other one, and this will be the last thing that we take a look at today. Where's my... ah, there we go. Okay, so the other one looks almost the same. You'll see that this one is actually the same result.
This one is different. And oh, hey, here we go: we actually get the same sentence back here on the third one. So this is interesting. This tells you that these models were able to map these two sentences into similar spaces, but not the exact same one.
And this also tells you that here there are actually a couple of crossovers between these first two, right? Right here, these are the same two. And then on this one, you can see that there's once again no repeat of the same sentence. So this really demonstrates how important your embedding model is to the results that you'll get from working with your vector database. Okay, cool. So that's pretty much it for this demo.
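The effect seen in the demo, where a query embedded with one model does not find itself in a collection built with another, can be reproduced with a deliberately tiny sketch. Everything here is hypothetical: the two "models" are lookup tables over three placeholder sentences, and model_b is simply model_a rotated by 90 degrees, which is far simpler than how real embedding spaces differ.

```python
import math


def l2(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vec, index):
    """Return the stored sentence whose vector is closest to query_vec."""
    return min(index, key=lambda sentence: l2(query_vec, index[sentence]))


# Two toy "models": same geometry, but model_b's space is rotated by 90 degrees.
model_a = {"line one": [1.0, 0.0], "line two": [0.0, 1.0], "line three": [-1.0, 0.0]}
model_b = {"line one": [0.0, 1.0], "line two": [-1.0, 0.0], "line three": [0.0, -1.0]}

index = model_a  # the collection was built with model_a's vectors

same_model_hit = nearest(model_a["line one"], index)   # query embedded with model_a
cross_model_hit = nearest(model_b["line one"], index)  # query embedded with model_b
```

With the same model, the query sentence comes back as its own nearest neighbor. With the other model, a different sentence wins, even though both spaces encode the same relationships between sentences.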
I think we can take questions now. Where did you specify the embedding models as different? Okay, so I will go over that again. See here, we get the sentence transformers, here, and then later on we get another sentence transformer. What this does is load a specific sentence transformer, and you'll see that I loaded three sentence transformers here. With these sentence transformers, what I do with the data is encode it with each of these transformers to embed it (this is the embed step, by the way) and store it in Milvus.
And then when I do the search, you'll see that we use the last one to do the last embedding. So that's where we specify the different embedding models. Are there any PyMilvus embeddings utilities geared towards time series data as opposed to image or text? Or do the same functions apply? I think the first thing to answer here is that PyMilvus does not have embeddings utilities. The embedding here was done completely with Hugging Face. Zilliz Cloud does have a way to let you just drag and drop your data.
You don't have to worry about the embeddings. I think it only works on text data right now, but I'm sure there are plans to have images, audio, time series, and whatever other kinds of data in the pipeline. And the tool is also called Pipelines. It's available for free to anyone who is using a Zilliz Cloud free tier or a serverless cluster.
So that is one way that you can get your embeddings in. When it comes to time series versus text versus image data, what matters is the model itself and what it was trained on. So if you have a model, or know of any models, that were trained on time series data specifically, then that would be the type of embedding model you would need to embed time series data. Any other questions, feel free to drop them into the Q&A tool at the bottom of the screen. Yujian, you've talked to a lot of people, and you've helped a lot of people get started with Milvus. Where do you feel like most people struggle, and what tips do you have for them? So, I'm gonna stop sharing here.
I think we've covered this a little bit. Most people have trouble getting their data into Milvus. This is something that I get a lot, and something I've been thinking about: how do we best address this kind of challenge? One of the things that I think makes it difficult for people to get started is that there's a lot of misinformation, and the way things are framed almost makes it seem like there are image vectors or there are text vectors. But that's just not true, right? The vectors themselves are really just numbers. This is something I tried to get across in the presentation a few times: you take the numbers, the output from the second-to-last layer, and those numbers are the embeddings. I didn't show the embeddings when we did the demo, but image and text are just two types of unstructured data.
And when you put them into vectors, they all become just series of numbers. I think the reason there's confusion around this is that image vectors and text vectors are typically different lengths. And I just used the terms image vectors and text vectors despite the fact that I was just talking about how these things don't exist. But models trained on image data typically have a different length for their second-to-last layer than models trained on text data. So I think that's one of the biggest challenges that I see. The other question I get a lot is people asking about LLMs versus embedding models, whether there's any difference, or, when it comes to RAG, people ask: what does the LLM do? Can I do this with just a vector database? And then I also get the reverse: I have an LLM, why do I need a vector database? So I'll just answer both of those.
So, with a vector database, here is why you would put an LLM on top to do RAG. For example, say I ask the question (oh, I don't have a good question off the top of my head): what's the most similar fruit to an apple? And in my vector database, I have a bunch of text stored. What it'll do is pull the most semantically similar text back, which will probably be things like: what's the most similar fruit to an orange? What's the most similar vegetable to a tomato? Or things like that, or "tomatoes are fruits." But it wouldn't be able to say, what I actually want is to find the thing that's most similar to the word apple. That is what the LLM would do: it would break down that query and say, here's what you should actually search for that's semantically similar. And then the other way around, why would you put a vector database under an LLM for RAG? The LLM doesn't have access to your data.
It doesn't know what you are working with, and it doesn't have the context of what you're working with. So with a vector database, you take your data, you vectorize it, you insert it into the vector database, and then during RAG you pull your data back out through retrieval. Retrieval augmented generation is literally generation that's augmented by data retrieved from the vector database. You put that in as context in your LLM prompt, and then you get some sort of more human-readable response. Thank you.
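The last step described above, putting retrieved data into the LLM prompt as context, can be sketched as follows. build_rag_prompt, the prompt wording, and the sample passages are all hypothetical; in a real app the passages would come from a vector database search and the prompt would be sent to whichever LLM you use.

```python
def build_rag_prompt(question, retrieved_passages):
    """Assemble an LLM prompt from a question plus retrieved context."""
    context = "\n".join(f"- {passage}" for passage in retrieved_passages)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )


# Passages as they might come back from a vector search (made-up examples).
retrieved = ["Tomatoes are fruits.", "Apples and pears are both pome fruits."]
prompt = build_rag_prompt("What's the most similar fruit to an apple?", retrieved)
```

The vector database supplies the context lines, and the LLM turns them into a readable answer, which is why the two pieces complement each other.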
Rishi's got some questions. They're running into an issue with the Google Colab, so there are some details in the chat for you to take a quick look at. Yeah, that's just a dependency issue. You probably want to pip install PI there, and maybe see what libraries you have installed there.
Rishi. And then we have a question: is Milvus LLM agnostic, or does it have better performance with some specific type of LLMs? Yeah, Milvus is totally LLM agnostic. LLMs and vector databases are entirely decoupled. So no, it doesn't have better performance with any specific LLM; you can bring whatever. Great question though.
Yeah, same with embedding models. It's also entirely embedding-model agnostic. You can use whatever embedding models you want and you will get the same results; well, not the same results, but the same quality of results. So, Yujian, you've done a bunch of projects. Some of them are image-based, some of them are text-based, a lot of them are Taylor Swift based. Taylor Swift.
When you go to start a new project, how do you think about choosing your embedding model? Do you continuously use the same ones? What is your research approach? Because I know there are so many options out there now, especially in the open source model world. So how do you think about choosing, and what are you looking for? Yeah, that is a great question. I don't do too much experimentation with embedding models anymore. The main reason is that I spent a few months basically just playing around with embedding models last year. That's actually how the demo we just walked through came about: I was like, oh, it'd be cool to compare some embedding models and see how different and how similar they are.
What I actually look for when I do my testing, or what I was looking for when I was doing more experimentation, was: does this make sense? So for example, when I was working with images, the paintings one, what I thought about with "does this make sense" was: if I give it a similar painting, I'd better get back, in the top three, the painting that I expect. It should look similar. I think the one I did the test on was the Sea of Galilee, and I was like, there better be a boat and there better be some water. And then with text it's more like, does the text that comes back make sense? A lot of it is just human judgment. Now, there are actually eval methods that people have come up with where they use LLMs to evaluate the results from the LLMs and the results from the embedding models.
But the challenge with that is that that eval method evaluates the entire RAG app, the entire stack, and not just your embedding model. And I think the thing with embedding models is that it's actually very hard to judge whether your embedding model is the right embedding model. The real answer for what you'd be doing with embedding models, especially in enterprise production, is to create your own embedding model or fine-tune an open source one, and make sure you put in the data types that you want to relate to each other, so you get one that operates in your latent space. Awesome. We have another question.
Is there a way to visualize the vector DB to create a knowledge map? Yes. If you literally just look at the HNSW index, that is basically a knowledge map. But another way you could do it is to take all the vectors, download them, map them into a three-dimensional space, and look at that on, ooh, what is the thing that lets you do that? I think Matplotlib lets you plot a three-dimensional space. I think you can do that in Matplotlib. So you can just download them, map them, plot them in three-dimensional space, and see what the clusters look like.
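The talk does not say how to map the vectors into three dimensions. One common choice is PCA, and the sketch below assumes that choice, implemented with power iteration in pure Python so it runs without extra libraries. project_to_3d and the sample vectors are hypothetical; for real data you would more likely use a numerical library.

```python
import math
import random


def project_to_3d(vectors, iterations=200, seed=0):
    """Project vectors onto their top three principal directions (power iteration)."""
    n, dim = len(vectors), len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dim)]
    centered = [[v[d] - means[d] for d in range(dim)] for v in vectors]
    residual = [row[:] for row in centered]  # found directions get removed from this copy
    rng = random.Random(seed)
    components = []
    for _ in range(3):
        direction = [rng.uniform(-1, 1) for _ in range(dim)]
        for _ in range(iterations):
            # One power-iteration step: multiply by X^T X, then renormalize.
            scores = [sum(r[d] * direction[d] for d in range(dim)) for r in residual]
            direction = [sum(scores[i] * residual[i][d] for i in range(n)) for d in range(dim)]
            norm = math.sqrt(sum(x * x for x in direction)) or 1.0
            direction = [x / norm for x in direction]
        components.append(direction)
        # Deflate: remove the found direction so the next pass finds a new one.
        for r in residual:
            score = sum(r[d] * direction[d] for d in range(dim))
            for d in range(dim):
                r[d] -= score * direction[d]
    return [[sum(c[d] * comp[d] for d in range(dim)) for comp in components] for c in centered]


# Made-up 4-D vectors standing in for downloaded embeddings.
vectors = [[1, 2, 3, 4], [2, 1, 0, 3], [0, 0, 1, 1], [4, 3, 2, 1], [1, 1, 1, 5]]
points = project_to_3d(vectors)
```

Each entry of points is an (x, y, z) triple. With Matplotlib installed, these could go to a scatter plot on axes created with projection="3d" to see what the clusters look like.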
That's actually the basis for a lot of eval methods, and it's similar to what a lot of open source eval libraries do. Doesn't Galileo have some visualization tools as well? Yes. So Galileo, Arize Phoenix, TruLens, and WhyLabs' whylogs all do very similar things that will basically allow you to do that visualization. We have a great question.
What is Milvus built on top of? Is it MongoDB? Aha, yes. You know, actually I heard that Atlas was in talks to use Milvus for their vector search, that MongoDB was in talks to use Milvus for their vector search. But Milvus is actually built from the ground up; it's purpose-built specifically for vectors, right? MongoDB is built as a NoSQL database, which is built for key-value search, and they have to implement vector search on top of that in a way that can run without having to do, like, oh, I'm gonna search through all my entries. So we are built natively to work with vectors, and we're built on top of things like Kafka or Pulsar, and MinIO. Milvus is modeled as a pub/sub system, so you publish data, and it's just like streaming data that goes out.
It's essentially like having a Medium account: you're publishing things on your Medium account, and then there's the subscriber, which is like people who are maybe adding your Medium account or your Medium publications to their list or something like that. So yeah, that's how you can think of the way that Milvus is built. We will probably spend some more time doing a deep dive into Milvus at a later session. But Milvus is not built on top of any existing database. It is built from the ground up.
And I'll just add to that: the reason the founders and the original creators took that approach was really to be able to build it for scale. I think there are a lot of new emerging vector search tools that probably work fine for very small data sets, which was never really the intention of Milvus. Milvus is really meant for billion, 10 billion scale. So if you have a hundred thousand vectors, a lot of options work. But if you're talking 1 billion, 5 billion, 10 billion scale, having something purpose-built for vector data is really important.
So that's kind of where we specialize. Let's see, a couple more. What does a vector database expert mean? Same as SQL Server expert, or is it completely different? Completely different. These have nearly nothing to do with each other. SQL databases aren't even built with the same intention that vector databases are built with, right? SQL databases are built with the intention of letting you store entities with known attributes that you want to be able to create relationships between at query time. Vector databases are meant for you to store the semantic meaning of entities as a list of numbers and allow you to compare these vectors at query time. So the intention behind these two kinds of databases is entirely different.
And maintaining the databases is probably also a little bit different, although you could argue that containerization, scaling, auto scaling, and all these things are very similar when it comes to doing DevOps on the backend. Do you have some benchmark in terms of performance comparisons among other vector DBs? You wanna tell them about VectorDBBench? Yes. Yes, I do. There's VectorDBBench, which was actually in the list of open source projects that Zilliz maintains, on the slide that I did not mention. But oh, thank you, Emily.
Yes. So we have an open source benchmarking tool that you can use to compare all the vector databases, and you can bring your own data. It's open source, and you should check it out. Well, I think that's just about all we have time for. It looks like we've gotten through all of today's questions.
Thank you everyone for joining us today. We hope to see you on a future webinar. You can check out zilliz.com/event for all of our upcoming sessions. We've got one on text embedding a little bit later in the month.
Yujian, thank you again for doing such a great session, and we hope to see you all next time. Thanks. Thanks for coming, everyone.
Meet the Speaker
Join the session for live Q&A with the speaker
Yujian Tang
Developer Advocate at Zilliz
Yujian Tang is a Developer Advocate at Zilliz. He has a background as a software engineer working on AutoML at Amazon. Yujian studied Computer Science, Statistics, and Neuroscience with research papers published to conferences including IEEE Big Data. He enjoys drinking bubble tea, spending time with family, and being near water.