About the Session
What is a vector embedding? What is a vector index? What is a vector database? Get the answers to these questions, plus best practices for vector search, in this session.
We start with a quick introduction to neural networks which leads into vector embeddings. Then we cover vector indices, how they work, and how to choose one. We wrap up with what makes a vector database, and what makes Milvus stand out.
You’ll need:
- A basic understanding of computer science fundamentals
- A basic understanding of software development
Today I am pleased to introduce today's session, Vector Search Best Practices, and our guest speaker. Yujian Tang is a Developer Advocate here at Zilliz. He has a background as a software engineer working on AutoML at Amazon. Yujian studied computer science, statistics, and neuroscience, with research papers published to conferences including IEEE Big Data. He enjoys drinking bubble tea, spending time with his family, and being near water. Thank you so much for joining us. Welcome, Yujian.
Thanks for the introduction, Emily, and thanks to everybody for joining. Today I'm going to be talking to you about vector search best practices. We're going to learn about vector embeddings: what they are, how to use them, when to use them, what to do with them, and how you can use a vector database. So, a little bit about me.
My name is Yujian Tang. I put a QR code up here; if you happen to have your phone handy, you can scan it and find me on LinkedIn, which is where I'm most active. If you aren't on LinkedIn, you can connect with me via email or Twitter. And, as Emily was saying, my background is in software engineering.
I've been at Amazon, I've been at IBM, and I've worked on many ML systems. A little bit about Zilliz: you can find us on Twitter as @zilliz_universe and on LinkedIn as Zilliz. You can also join our Slack at milvus.io/slack.
Or you can find Milvus, the open-source vector database, on GitHub. An important point here: Zilliz is the company; Milvus is the open-source vector database maintained by Zilliz. Okay, so let's get into it.
What are we going to be covering today? These four basic topics: why do I need a vector embedding, what is a vector embedding, how do vector databases work, and how can you get started with a vector database. Okay, so let's start with the first one: why do I need vector embeddings? Unstructured data is everywhere. This is something I'm sure many of us have heard many times now: 80% of data is unstructured, and 90% of that is never analyzed.
As we move into this more machine learning and AI-forward industry, with LLMs dominating the scene, there is going to be more and more unstructured data, and that unstructured data is going to start getting analyzed. And this is why you would need a vector embedding: vectors can represent unstructured data that doesn't fit a predefined data model. You can use them to represent text, images, video, audio, even molecules; many, many different things. That's part of why vector embeddings are so important. So where can vector databases help? What can vector databases actually help with? These are some of the areas where vector databases get used a lot.
Many of these are more conventional applications, such as image similarity search, video similarity search, audio similarity search, text similarity search, and so on. Later on I'm going to show some code for image similarity search and text similarity search. We won't necessarily get into all of the details, but I'll show you enough code to understand and get started, and the link to the notebooks will be dropped either in the chat or afterwards.
Okay. So these are some of the larger, more interesting use cases for vector databases that I've seen around. Number one is adding data to LLMs. For example, LLMs right now are not trained on completely up-to-date data, and they're not trained on your internal private data. So if you want to use an LLM in production at enterprise scale, you're going to want to add some data to it.
And these are some examples of tools you can use in this space. GPTCache, for example, is a caching tool for GPT that we made. It's open source and has almost 5,000 GitHub stars; pretty popular.
Then there's LlamaIndex, which, if you've heard of it, is a data framework for accessing your data with LLMs. I've built quite a few things with it, and I'm sure some of those links will also be available. And then there's OSS Chat, which you can check out at osschat.io; it basically allows you to chat with open-source software.
OSS Chat is an example of an app that adds data to LLMs following something we call the CVP stack; I'll touch a little more on that later. Basically, it demonstrates how a vector database can be used to add your internal or domain-specific data (in this case, open-source software documentation), then query that data and essentially chat with it using a chatbot and a vector database. Other things vector databases can be used for include product recommendations.
This is very, very common and very popular. One of the reasons, and I'll probably touch more on this later as well, is that you don't need super high accuracy for product recommendation, because you're going to find the product you want as long as it shows up within a couple of clicks. So this is an interesting use case where there are many different ways to do it, but vector databases are particularly good at finding the most similar products.
Then there's reverse image search, which is similar to the product recommendation category: you have a picture of something and you want to find something similar to it. I've created a mini fashion AI project that does this with articles of clothing. In the example here, it's food: it returns the closest images of food based on the image you give it.
The thing that has made vector databases much more popular recently is generative AI, these LLMs. That falls into the text search engine and question-answering bot category. So let's zoom in on this and take an example use case to understand how you, or an enterprise, could use a vector database to create a generative AI app. For example, say you have hundreds of thousands, maybe millions, of pages of proprietary internal documentation, so that your staff, your customers, the people who need this information, can get what they need done.
If your customers come and ask you a question, most of the time you're going to need a customer service rep to handle it. But you can also automate a lot of the easy questions, the ones that can be answered almost automatically, using a vector database and an LLM. And it'll be faster, because searching through those hundreds of thousands or millions of pages normally requires internal expertise from the customer service rep, or whoever is doing the searching. So instead of having someone spend that much time, you can use an LLM to understand the customer's query, put that into the vector database, and find the correct, most similar documentation. Okay, so what is a vector embedding? I've talked a lot about vector databases, how you can use them, and why you need them: there's a lot of unstructured data.
There's a lot of data that needs to be represented somehow, and vector embeddings are basically the answer to that. So let's get an introduction to vector embeddings. I love this image; it's from a very seminal paper on word embeddings.
What is a vector embedding, and why is it important? Basically, vectors allow you to access knowledge that you don't have a structure for. In this example, we see the words queen, king, woman, and man, and what I want to show you is that you can do math on words using vector embeddings. Ideally, from there we can extrapolate: you can also do math on images, audio, sentences, paragraphs, whatever. But for now, we start with this: look at queen, king, woman, and man.
Notice how queen and woman have the same first number, 0.3, and king and man have the same first number, 0.5. So when you subtract woman from queen, that first number becomes zero. If you then add man, the first number becomes 0.5 and the second number becomes 0.7, and now you have the word king.
So what this image shows is that with vector embeddings you can do this math on words: queen minus woman plus man equals king. This is a very, very simplified example. Traditionally, vectors are going to live in a much, much higher dimensional space; here I'm showing you a two-dimensional one.
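To make that concrete, here's the same arithmetic as a tiny NumPy sketch. The first components are the 0.3/0.5 values quoted above; the second components are made up so the arithmetic works out, since these are toy numbers rather than real model embeddings.

```python
import numpy as np

# Toy 2-D "embeddings"; real models produce hundreds of dimensions.
queen = np.array([0.3, 0.9])
woman = np.array([0.3, 0.4])
man   = np.array([0.5, 0.2])
king  = np.array([0.5, 0.7])

# queen - woman + man should land on (or near) king.
result = queen - woman + man
print(result)                     # [0.5 0.7]
print(np.allclose(result, king))  # True
```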
If you go and use something like sentence transformers, you'll have 768 dimensions; we'll talk a little more about that later. So how are these vector embeddings generated? What determines how many dimensions they have? What determines what they are? What's really important here is the model you pick: a vector embedding is just the output of the second-to-last layer of the deep learning model you decide to run. In a traditional deep learning model, you have some input, some hidden layers, and then an output layer. That output layer is usually some sort of prediction or classification.
Maybe it's a classification; maybe it's predicting the next word, like GPT's decoder architecture; maybe it's finding the part of speech, or some other predictive task. When you move one layer back, you have a layer of neurons, usually hundreds of them, and they produce an output that gets fed into that last layer. That output is our vector embedding, as sketched below.
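As a minimal sketch of that idea (my illustration, not the speaker's code; the layer sizes are made up), here's how you might grab the second-to-last layer's output from a toy PyTorch model:

```python
import torch
import torch.nn as nn

# A toy classifier: input -> hidden layers -> output (prediction) layer.
model = nn.Sequential(
    nn.Linear(32, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),  # second-to-last layer: 128-dim output
    nn.Linear(128, 10),              # last layer: 10-class prediction
)

# Drop the final prediction layer so the forward pass stops at the penultimate output.
embedder = nn.Sequential(*list(model.children())[:-1])

x = torch.randn(1, 32)   # one fake input example
embedding = embedder(x)
print(embedding.shape)   # torch.Size([1, 128]): this is the vector embedding
```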
So how can you make vector embeddings? One of the most important things to choose is the model you're going to use. You can spend hundreds of hours optimizing your indexing for perfect recall or perfect speed; well, perfect isn't a real thing, but really good recall and really good speed. None of it matters if your vector embeddings are wrong. So, for example, here are some places where you can get vector embeddings. OpenAI has its embeddings; these are 1536-dimensional, so slightly bigger.
Hugging Face has just so many models you can use; it's a hub for that. PyTorch also has some. And then there's ResNet-50, which is available on both Hugging Face and PyTorch, in many formats.
But ResNet-50 is an image model, and this is something you need to keep in mind as you create these vector embeddings: when I'm working with images, I have to use an image model to get my vector embedding; when I'm working with text, I have to use a text model. And one thing that has come up a lot from talking with people working with these is using the right model for the right type of text. For example, if your text is a story or a novel, then sentence transformers, something that works on and is trained on complete sentences, would work really well for you.
But if your text is in, let's say, CSV format, then sentence transformers may not be the best idea, because it doesn't know what to do with the commas between all the words, or the \r\n kind of stuff, the special characters that go into a CSV, or a PDF. There are many, many formats that text comes in, and you have to think about that as you create these vector embeddings as well. Okay, so let's take a look at what vector embeddings look like.
This is a question I get a lot, because, you know, what does a vector database look like? What does it look like inside a vector database? I say things about vector databases sometimes and people are just like: okay, cool, this is going over my head, I have no idea what you're talking about. Is this a key-value store? Is it just a bunch of numbers? What's going on? So this is a screenshot straight from Zilliz Cloud, and it shows you what could be inside a vector database.
This is from our example dataset, which is on Medium publications, and it shows what one entry looks like: the ID and the vector. This is the important part to know: the vector is really just a long series of numbers. And then there's some other metadata here.
So if you need metadata, Zilliz and Milvus also support putting metadata into the database. Okay: how do vector databases work? Oh, I see there are a few things in the chat. "Embeddings are specific to models; how would you search through embeddings of two different models?" You'd have to search through two different sets of embeddings; two different vector databases, basically.
Okay, so we're going to go over some vector indexing methods. This is the "how vector search works" part, the vector search best practices part. The first thing we're going to look at is Approximate Nearest Neighbors Oh Yeah, or Annoy.
This is an algorithm that came out of Spotify. Essentially, what it does is cut your space in half, many, many times, until you reach some predefined parameter, such as the number of leaf nodes, or the number of nodes in a leaf node. And every time it cuts your space in half, it builds up a binary tree.
The first thing it does is choose two random points; it doesn't matter what the points are. Then you classify the rest by finding the points closest to one or the other, and that's your first halving; then you do it again within each half. So that's Annoy. It's out of Spotify, and it's a pretty decent index: it's not too space-heavy, because you're basically just saving the binary tree, and it's not too computationally heavy, because you only compare against a few nodes once you traverse the tree.
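As a rough sketch of what using Annoy looks like in practice, assuming the open-source `annoy` package (my example, not from the talk; parameters are illustrative):

```python
import random
from annoy import AnnoyIndex

dim = 64
index = AnnoyIndex(dim, "angular")  # "angular" is roughly cosine distance

# Index 1,000 random vectors; real usage would index model embeddings.
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])

index.build(10)  # build 10 trees; more trees -> better recall, bigger index

query = [random.gauss(0, 1) for _ in range(dim)]
ids, dists = index.get_nns_by_vector(query, 5, include_distances=True)
print(ids, dists)  # the 5 approximate nearest neighbors and their distances
```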
The inverted file index (IVF) is the next one we'll cover. This is probably the most intuitive vector indexing method. If you've taken an intro machine learning class, you probably know what k-means is: a clustering algorithm that essentially says, given some number of centroids, find the clusters by grouping the points closest to each centroid. And that's basically what IVF does.
It finds some number of centroids and groups the points around them; that happens at indexing time. Then at query time, when you come in and search with a vector, it first finds the closest centroid, then goes into that centroid's cluster and finds the closest points within it. So once again, the memory footprint here is pretty good; you don't have to store too much for the indexing. And based on the way it searches, you will most likely get the right results back as well.
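Here's a hedged sketch of an IVF index using the FAISS library (FAISS comes up again in the Q&A later; this example and its parameters are mine, not the speaker's):

```python
import numpy as np
import faiss

dim, n = 64, 10_000
vectors = np.random.random((n, dim)).astype("float32")

nlist = 100                         # number of k-means centroids (clusters)
quantizer = faiss.IndexFlatL2(dim)  # used to assign vectors to centroids
index = faiss.IndexIVFFlat(quantizer, dim, nlist)

index.train(vectors)  # run k-means to find the centroids
index.add(vectors)    # assign each vector to its nearest centroid's list

index.nprobe = 8      # search the 8 nearest clusters instead of just 1
distances, ids = index.search(vectors[:1], 5)
print(ids)            # approximate 5 nearest neighbors of the first vector
```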
Then there's Hierarchical Navigable Small World, or HNSW. This is a really, really popular vector indexing method, and one of the reasons is that it is extremely accurate: you get close to one hundred percent recall. The reason comes from how it works: at index time, as you insert your data, it creates a graph index over all of your nodes.
From that, you can see we have to keep track of more information than just the nodes themselves, so it's actually very memory-expensive. But one of the reasons it's popular right now is that even though it's memory-expensive, memory is not terribly expensive to buy, and people enjoy the accuracy. So how does HNSW work? At insertion time, every time you have a data point you want to put into your graph index, you draw a uniform random variable, and based on how that variable compares to some predefined parameter, the point goes into a certain set of layers.
For example, say my predefined parameter is 0.9. When I insert a data point, I generate a uniform random variable. If it's between 0 and 0.9, the point goes into layer zero only. If it's between 0.9 and 0.99 (90% of the remaining range), it goes into layer zero and layer one. If it's between 0.99 and 0.999, it goes into layers zero, one, and two, and so on. So which layers your point lands in is probabilistic, which is interesting: it means HNSW is actually non-deterministic when it builds the graph, unless you set the random seed. The number of layers you end up with is also probabilistic, by the way. Now, at query time:
you go into the topmost layer and ask, what's the closest point in this layer? Then you drop down and ask, what's the closest point in this layer? And you drop down and do it again. So the only real calculations you have to do are very short, because you already have the graph index built and you know the closest nodes to each node in the index. So HNSW is a really accurate index that takes up a little more memory, but with fast query time; something like the sketch below.
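Here's a hedged sketch of HNSW in practice, using the `hnswlib` library as one common open-source implementation (my example; the parameters are illustrative):

```python
import numpy as np
import hnswlib

dim, n = 64, 10_000
vectors = np.random.random((n, dim)).astype("float32")

index = hnswlib.Index(space="l2", dim=dim)
# M controls graph connectivity; ef_construction trades build time for quality.
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(vectors, np.arange(n))

index.set_ef(50)  # ef at query time: higher -> better recall, slower search
labels, distances = index.knn_query(vectors[:1], k=5)
print(labels)     # the 5 approximate nearest neighbors
```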
Okay, so how can you get started with a vector database? We're going to cover a couple of different use cases, but first let's cover what similarity search actually is, because that's what a vector database essentially handles for you. This is the main reason you would use a vector database: if you have things where you need to find how similar they are to each other, that's when you reach for one. But if you have key-value data, for example a list of users, then you probably don't want a vector database, because there you want the exact response back. If I give you a username and it's only similar to another username, and you give me that password back, I'm not going to be able to verify my user, right? So vector databases are used for similarity search applications. The way this works: step one, you start with your unstructured data, some number of images, videos, documents, audio, whatever, and you transform it into a vector.
And this is the part I was talking about earlier that's important: you've got to pick the right embedding model when you transform your data into a vector. Once you've transformed it, you store it in your vector database, and that's it; that's all you need to do to put your data in. From there, what you're concerned with is: how do I get my vectors back? Essentially, you perform a query, and when you perform the query, you use the same model as before. You have to use the same model.
That was a good question to ask earlier: what if you have two different models? You'd need two different sets of vector embeddings, two different vector searches, two different collections, something like that. So once you get the embedding for your query, you run it against the vector database (there's a slide I forgot to cover on why you'd want a purpose-built vector database; I'll come back to it in a second) using one of these indexes, such as IVF or Annoy or HNSW, and then you get your results back and see the most similar vectors. Typically you'd ask for some number of results back, maybe a top five or a top three; something like the sketch below.
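To make the whole pipeline concrete, here's a hedged end-to-end sketch using PyMilvus's `MilvusClient` interface. It assumes a recent `pymilvus` with Milvus Lite for local storage; the collection name and fields are made up for illustration:

```python
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim text embeddings
client = MilvusClient("demo.db")                 # local Milvus Lite file

client.create_collection(collection_name="docs", dimension=384)

docs = ["Milvus is a vector database.", "Bubble tea is delicious."]
client.insert(
    collection_name="docs",
    data=[{"id": i, "vector": model.encode(d).tolist(), "text": d}
          for i, d in enumerate(docs)],
)

# Query with the SAME model you used at insert time.
results = client.search(
    collection_name="docs",
    data=[model.encode("What is Milvus?").tolist()],
    limit=3,
    output_fields=["text"],
)
print(results)
```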
Zilliz Cloud automatically returns the top 100 by default, so you can also go look at that if you'd like. Now, here's the slide I skipped earlier, and I don't know how I skipped it, but it's important to understand. Why do you want a purpose-built vector database? What's the point, when I've just covered all of these indexes you can go use on their own? You don't need a vector database to use them; you can implement them yourself.
You can find them in open-source repos. These are called vector search algorithms, or vector search libraries, and they provide high-performance vector search instead of just brute-forcing. So compared to these, what does a vector database have that makes it important? One of the most important things, at least with Milvus and Zilliz, is something called dynamic schema: you can insert new fields at insertion time without having to define them in your schema. This is really cool; it's what you can do with NoSQL, right? With SQL databases you're going to have issues if you play around with the table schema too much, but with NoSQL you can insert more or less whatever you want, and we have that dynamic schema ability. Number two, I think, is the filtering.
You can filter on your metadata, which lets you perform more specific tasks. Let's say, for example, I want to find the most common answers to a question I get asked a lot, but I only want the answers that have come up in the last month. Maybe the question is something like, what's the most popular LLM? I store all these answers in my vector database, and then I say: okay, in the last month, what was the most popular LLM? That's one of the great things about filtering: you can select on metadata. Or, with the Medium example, let's say I only want articles from the publication UX Collective or The Startup; I can filter on that when I do my vector search. Naturally, one of the questions that comes up about filtering is: do you filter the vectors before the search, or after? The actual answer is that you filter the vectors during the search. This is a really cool part of Milvus, and it's an extremely effective and efficient search pattern: Milvus creates a bit mask, and as the search runs, it checks your metadata against that bit mask to decide whether to even consider doing the distance calculations, as sketched below.
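Continuing the hypothetical `MilvusClient` sketch from above, a filtered search might look like this (the `publication` field and the filter expression are illustrative, not from the talk):

```python
# Only vectors whose metadata matches the filter are considered;
# Milvus applies this as a bit mask during the search itself.
results = client.search(
    collection_name="docs",
    data=[model.encode("design systems").tolist()],
    limit=5,
    filter='publication == "UX Collective"',
    output_fields=["text", "publication"],
)
```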
"Are you recording? Where can I get the slides?" Yes, we are recording; these will be sent out later. Other things that make a vector database important include the very basic things: we manage the scaling, horizontally and vertically, if you want a bigger CPU or GPU. Also, and this is specific to Milvus again, we have GPU support with NVIDIA, and billion-scale storage. If you want multiple nodes, multiple instances, multiple servers for your vector database, we can do that too.
Plus sharding for your streaming data, backups, lifecycle management, multi-tenancy features like role-based access control for enterprise users, and really high TPS (transactions per second). These are the kinds of things that a purpose-built vector database like Milvus supports. "Do vector databases support ACID?" Yes. Well, actually,
Milvus gives you a couple of different types of consistency, but I'm pretty sure atomicity is pretty much always there. For consistency there's, what's it called, eventual consistency, strong consistency, session consistency, and a fourth one that's on the tip of my tongue; I don't remember what it's called.
But basically, they give you different types of consistency. So the short answer is yes, vector databases support ACID. Okay, so how do we implement this vector similarity search that I just showed you? How do we implement it in practice? We start with some sort of knowledge base: some images, some audio, some videos, some text, something like that. And we run it through a deep learning model. You have to pick the right model, the model that corresponds to your data.
If you're using sentence data, use sentence transformers; if you're using CSV data, use something else. And then, remember, you cut off the last layer and take the output, and those are your vectors, which you can put into Milvus or Zilliz or whatever. One more comment before I get into the code. Actually, I'm going to stop after this for some questions, before I get into the code, so we have some time at the end.
But I will also do the code quickly enough that we can get some questions in. This is a vector database benchmarking tool; once again, the QR code is here, so scan it. It's an open-source GitHub repo, and if you download it, a lot of the benchmarking is already set up.
But you can also do your own benchmarking: if you want to evaluate different vector databases, we allow you to do that. Okay, so before I get into my code, I'm going to take some questions. And here, sorry, this is another QR code you can scan that will take you to the Milvus GitHub. Milvus is an open-source vector database. Once you go there, you can download it, star it, check it out. Questions? Okay. Oh, thank you, Chris.
Yeah, so Christoph just... oh, okay, cool. Emily, thank you for that. "Is it true that vector search is linear boundary only?" I actually don't know what "linear boundary only" means. Vector search essentially takes some vector metric, like L2, or IP (inner product), or cosine, where cosine is just normalized inner product, and
Cosign is just normalized inner product, right? And, um,you're comparing these using one of these vector,you're comparing two vectors using one of these vector metrics and you're justreturning the closest ones, um, based off of the metric that you chose. And you also have to ensure that when you insert,you choose the same metric as when you search. 'cause that makes the most sense. If a model publishes a new version, will the dimensions change? Or if not,will they be compliant to the embeddings I already have in my database?Or do I need to recompute all embeddings and store them again? Uh,so this is actually, I mean, I, most times dimensions won't change,but if it does, like they'll put it into the model card. Um, so for example,like the sentence, transformers model has had 768 for a long time,and it will probably continue to do that.
"If a model publishes a new version, will the dimensions change? If not, will they be compatible with the embeddings I already have in my database, or do I need to recompute all embeddings and store them again?" Most of the time, dimensions won't change, and if they do, it will be in the model card. For example, the sentence-transformers models have had 768 dimensions for a long time and will probably continue that way, and OpenAI has 1536 and will probably continue that as well. I think it's uncommon for people to change the model architecture and call it the same model; usually they'd publish a paper saying, hey, we've published this new model, because people want publications. And if the dimensions don't change, will they still be compatible with the embeddings you already have? Yes.
Do you want to recompute them? Up to you; you can check your performance and see if it matters. "Is model training or parameter creation required before we can use the vector database?" You don't have to train any models. You can find models online through these resources: you can go to Hugging Face, which has tons of models, or you can use OpenAI's embeddings, which only work for text.
You can also use ResNet-50, which is on Hugging Face and on PyTorch, and that's for images. So you don't need to train, but you do need a model. You can train your own if you have a specific type of data. "What do you think about the Universal Sentence Encoder from Google for vector similarity?" I think you can use it; I don't have any particular comments on it.
It works, right? Okay, cool. It looks like we don't have any more questions on this presentation, so I'm going to go ahead and slide over into the code. We're going to take a look at two pieces of code. There are tutorials available that describe what they're doing, but here I just want to cover them so we can look at picking a model, why that model is important, and how this changes across different types of text.
So in this example, we are working ontext data. And what,essentially what we're doing here is we're gonna find some,we found some dataset, it has some, this is where you can download it. Um,this is all, this will all be available, uh,so you can kinda replicate this yourself, but we have some dataset, uh,it is a set of White House speeches. And so the first thing we need to do is,as any good data scientist does, we read in our dataset and we clean it up. So when you read it, you'll actually see that, oh, this is,I didn't start this survey.
Okay, let's see if I can start the server here. When you start looking at the data, you can see there's a lot to clean up. So I've done a bunch of things here: dropped some NAs, checked the speech lengths, and so on, and cleaned up the data. Now it looks like data you could actually use, and you want to make sure the data you put in your vector database is data that makes sense.
Otherwise it's not very helpful. Okay, so from here we could start putting things into the vector database, but that's not actually the part I want to show you. Where's the important one? Ah, here we go.
What I've done here is use sentence-transformers, specifically the model all-MiniLM-L6-v2. This is a popular model, one version of the sentence transformer family. And this goes toward what you were asking earlier, Christoph: when a model publishes a new version, most of the time the dimensions won't change. I thought all of the sentence transformer models were at 768...
...and, oh, this one is actually 384. I was wrong: this one's 384, though there are many that are 768. So I guess as they publish new ones, the dimensions do sometimes change.
And that's basically what I wanted to show you about picking a model for your text data; that's pretty much all there is to this one. I don't need to run it much longer. Let me restart this one.
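The notebook itself isn't reproduced in this transcript, but based on the description, the embedding step presumably looks something like this sketch:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

speeches = ["Good afternoon, everyone.", "Today we announce a new initiative."]
embeddings = model.encode(speeches)

print(embeddings.shape)  # (2, 384): this model produces 384-dim vectors
```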
The other notebook is paintings: vision models, reverse image search, reverse painting search. In this one, we're downloading some files once again, and this is where we use PyTorch. Last time we used sentence-transformers, which is actually from Hugging Face; this time we use PyTorch, which is a different integration. We have all of these integration examples available on Zilliz Cloud as well.
In this one we have a different dimension, 2048, which is quite a few more dimensions. Basically, we start our vector database, create our field schema and so on, and put all the images in. And then here is where we pick the model: we go into PyTorch and load up some version of ResNet-50. You'll see that I actually have this model cached here.
That's because I've already downloaded the model. If you need to download it, you may have to do an SSL step first, just to give yourself a context that's able to download it. The couple of important things I want you to see here are: first, this gives you a way to remove the last layer (keeping the second-to-last layer's output) using Python; and second, we've defined the dimensionality up here, because we knew ahead of time that this specific model has a dimensionality of 2048.
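Again, the notebook code isn't in the transcript, but the described step (load ResNet-50, cut off the last layer, get 2048-dim features) presumably looks something like this hedged PyTorch sketch; exact loading details may differ:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a pretrained ResNet-50 (newer torchvision uses the weights= argument).
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.eval()

# Chop off the final classification layer to expose the 2048-dim features.
extractor = nn.Sequential(*list(resnet.children())[:-1])

image_batch = torch.randn(1, 3, 224, 224)  # a fake image tensor
with torch.no_grad():
    features = extractor(image_batch).flatten(1)
print(features.shape)  # torch.Size([1, 2048])
```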
If you don't know that ahead of time, you can either read the model card or, if there isn't much information on the model, run a sample through and check it yourself. And here, once again (where is the 384? ah, here it is), we also knew the dimensionality of the sentence transformers model before we set that up. You'll also note that you handle this kind of model differently than the PyTorch model, and that's because of the library itself: PyTorch is a library for building and working on neural networks, so it lets you cut these layers, whereas Transformers from Hugging Face isn't mainly focused on you building the networks, but rather on using their models.
So they let you just call an encode function on the data to get your vector embeddings. That's mainly what I wanted to point out here. The difference between these text and vision models is important to note, especially when it comes to dimensionality: you cannot compare text data that is 384 dimensions with image data that is 2048 dimensions. Actually, as Emily mentioned earlier, we're having a webinar on the 27th about multimodal models.
Multimodal models will let you do vector search across text, images, audio, whatever the model is trained on. So yeah, that's pretty much all I wanted to cover here. I've left quite a bit of time at the end for questions. If there aren't a lot of questions, I can also step through the notebooks themselves and we can take a look at how they work.
Okay, doesn't look like there are... oh, okay. "Please post a link to the notebook." Yes, I'll go get those; I should have had them pulled up.
Let me just show you... "While we let Yujian pull up those links: if you can drop your questions into the Q&A panel at the bottom, it helps us not lose them in the chat window." Okay, sent to the chat window, to everyone. So this one is reverse painting search; I'm not going to go over it, because it takes something like 12 minutes to get all the embeddings. But this one should be the image search... no, the text search.
Yes, so this one's the text search, and it will let you work through all the text data. So those are the two notebooks, and they both come with this URL setup; you should be able to run through them without changing anything to get them to work. "What are the main advantages of Milvus compared to Pinecone, Vespa, and other players?" Great question. I'll let you check out the vector database benchmarking tool to see how Milvus compares to other vector databases.
I can say everything I want up here; I can say, oh, Milvus is the greatest thing ever, but you're not going to believe me unless you go and test it out yourself, so I really suggest you do. I will tell you that the way I see Milvus having an advantage over other vector databases is that the scale Milvus works at is ridiculous. I think Vespa might also work at billion scale, but it's definitely not as fast, it takes longer to load, and it has a higher cost per query per second. Once again, though, this is just what I see as the main advantages of Milvus.
Please check out the vector database benchmarking tool and see for yourself. Okay, Tom Tofi asks: "Can you elaborate on utilizing vector databases for anomaly detection?" Yes, actually. We were just talking with a company called Galileo about this. Anomaly detection is saying: oh, this data point looks really weird, it looks like an outlier. And the way vector databases work, you're essentially creating all these clusters. Remember the IVF image I showed, which creates all these clusters? Annoy does too, and even HNSW does it; the image just didn't show it.
So when you query, if the vector you're querying with is really far out there, a really large distance away, then you know that it's an anomaly. Okay, Christoph Busler: "Not sure how related it is: when using large texts, it is often recommended that they be broken up into pieces, with embeddings created for each piece. What is the best granularity: page, paragraph, sentence?" This is a use-case-by-use-case thing, unfortunately; I can't tell you the best granularity for every single piece of text. But I will say that sentence-transformers is mainly trained to work on sentences.
So I would suggest taking a look at that. "I actually have something to add to this; sorry to jump in." That's okay. "We have a webinar coming up. It's not on our website yet,
It's not on our website yet,but it hopefully will be, um, by the end of next week. So we're doing a webinar,um, August, let me get the date for you. Um,August 24th with Lance from Lang Chain, and he's actually gonna talk about, um,how to sort of chunk, uh, data and or chunk text and sort of, uh,maintain a lot of the, um, context within sort of those documents. So that's definitely worth checking out if you are interested. OhYeah, I actually talked with Lance about this, um, recently as well.
What Lance told me LangChain is doing is this thing called smart chunking. From what I can tell right now, it looks like it's still rule-based: you give it some chunk size and you give it some overlap. And that's part of the thing: for context, you do want some overlap between your chunks of text. But yeah, the chunk size is completely up to you.
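As a toy illustration of the chunk-size-plus-overlap idea (this is a naive splitter of my own, not LangChain's actual implementation):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks that overlap, so neighboring
    chunks share some context."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("some very long document " * 100)
print(len(chunks), repr(chunks[0][:40]))
```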
Tom Tofi follows up: "Specifically, I'm interested in comparing unstructured key performance indicators for wireless networks"; that was an add-on to the anomaly detection question. Okay. I mean, yeah, you can put your KPIs into vectors and store all of them, and then as new ones come in you can ask: hey, is this one really far off from the usual? If it is, it's probably an anomaly.
Mohammed: "Can you please give a clue on using Zilliz for gene prioritization based on similarity?" Wow, this is the first time I've gotten this question. This is cool; you're working on gene similarity. I mean, you can get gene vector embeddings. I don't know anything about genes, to be honest; I know that we've sequenced the entire human genome and that it's pretty small.
But you can get your genes, get an embedding for each of them, and essentially what Zilliz can help you do is find the gene most similar to the one you're searching with, among the genes you've already created vector embeddings for. Alex: "Can you suggest the best encoder for PDFs? Why not sentence transformers, or have I just misunderstood what you said?" PDFs are a weird, weird data type. PDFs are actually particularly challenging, and this is something we're working on as well: how can we ingest PDFs into a larger database? PDFs contain many mixed types of data: images, graphs, and your text. Can you use sentence transformers for a PDF? The answer is yes, but you will have to extract your sentences out of the PDF. If you just pass in raw PDF format, it's not going to work.
So you have to remember to extract your sentences into some sort of workable format. I don't have an answer for the best encoder for PDFs, to be honest; this is a real problem that many people are working on. Tom Tofi: "I'd like to copy those links in chat; can you please email them to me?" Sure, I can, yes. Alex: "Like with a LangChain loader, for instance?" Sure, yeah.
You could use a LangChain loader to load PDF documents, if that's what you'd like to do. Next question: "How big should the amount of data be to benefit from using Zilliz's database instead of FAISS, storing the embeddings in Firestore, for example?" It's not just about the data size; you also want to look at things like the different ways to index. FAISS is an indexing library, that's correct. I don't know what "FAISS node" is, maybe that's specifically for Node, but FAISS provides specific indexing algorithms.
But as we were saying earlier, vector databases provide many more things. For example, if you have metadata and you want to do some filtering, you're going to want something like Zilliz; if you want to handle tons of transactions, you're probably going to want something like Zilliz. If you're building a POC, a just-for-fun kind of app where you've got a hundred embeddings and they're all pretty small, it's chill, use whatever you want: use FAISS, use HNSW, it doesn't matter. But when you get into scale, when you really want to use the vector database for more complex queries, not just comparing vectors, that's when you want Zilliz or Milvus or something like that. Gil: "Do you use open-source engines like FAISS, Lucene, et cetera, or do you use a proprietary engine?" I'm going to interpret this as asking about the indexes.
We use many publicly available ones as well as ones we've created ourselves, and you can choose among them. One of the cool things about Milvus is that it lets you choose your index type, unlike many other vector databases, which will go unnamed. (To Tom: you can click on the link and copy it from the URL bar; you can also download the notebook or copy it to your Google Drive.) Yeah, exactly.
Alex: "Can you expand on the topic of hybrid search, please, and how Milvus supports it?" Yeah, sure. You'll actually see this term used multiple ways. The way some people use it, it means searching across vector types (BM25, for example). The way other people use it, it means searching unstructured and structured data together, which is something I see a lot of questions about in industry. I've talked to people who say: one of the things we really want to figure out is how we can query key-value structured data along with the vectors, their unstructured data.
But the way I see it is: you have your vector data, you can store metadata with it, and you can filter over that metadata. That's what I think of as the real hybrid search, filtering over the data that comes with your vector. You can think about other ways to do it, but that's how I think about it. And the reason it's efficient when it goes in and does a search is that it applies a bit mask to all of the metadata: if an entry doesn't match the filter's bit mask, it just skips it and doesn't do the calculations at all.
That's why it's efficient. "Do you support approximate search of binary data, not vectors, based on Hamming distance?" I don't know the answer to this, actually. Emily, do you know? No? Okay, I'm going to say no.
Sam Corbett: "Do you use GPU for index creation only, or do you use it during the search phase as well?" Yeah, the GPU is used during the search phase as well, because essentially it's NVIDIA-provided GPU support for Milvus that lets Milvus use the GPU instead of the CPU to do the calculations. So yes, it gets used at query time. Francisco: "The metadata capability is meant to bring you closer to standard relational databases. Postgres now has vectors in it, bridging the same gap in the opposite direction. What is Postgres missing in its implementation versus a vector database? Why would you choose one over the other?" So, I don't know if the metadata capability is really meant to bring you closer to standard relational databases.
You can see it that way, but I think the metadata capability is really solving the issue of wanting to tag your data: to record when it happened, where it's from, or some other tag. That's how I see metadata. I don't think it really gets you that much closer; it's not like you can do a full relational query, like "select from this table where whatever."
But yes, Postgres does have vectors; pgvector is a thing. And what is Postgres missing in its implementation versus a vector database? Well, it's pretty early on. We have been around since 2017, so we've had a long time to work on Milvus, and we have a pretty active community. In fact, the PyMilvus SDK you saw me use earlier was actually contributed by a community member.
Um,but Postgres is, well, one thing, Postgres only has one type of indexing method,so Postgres, right? So we're, we have these multiple indexing methods. I believe Postgres is only I V F. Um,and it's also like, it's not optimized for vector search, right?So what is,one of the things about vector search that makes it so different from this,like searching relational databases,is that you have to do a ton of calculations, right? These vectors are,you know, these, these long, long numbers, there's long series of numbers. You have to do a a lot of calculations on these numbers. And so if you have an implementation that's not,that is just basically brute forcing this or, you know, using, uh, you know,something that isn't optimized for this kind of calculations,you're gonna have a slower, uh, you're gonna have worse performance basically.
Why would you choose one versus the other? If I'm doing anything with similarity search, I'm going to use a vector database; I'm going to use Milvus. If I have a lot of key-value data, such as a user database, I would definitely use Postgres or some other relational database. But I really think that as we move into this world of LLMs and more machine-learning-focused applications, we're going to be looking at more and more unstructured data and more vector embeddings, and vector databases are going to be really important for that. "Vector search is not for finding exact matches, right?" It can find exact matches, in fact. For example, in this one I got an exact match right here. What is this, who is this painting by? Degas, yeah.
So in this one I got a match for my Degas painting, and there are actually two of them in there. You'll see that you get the distance back, and this distance is basically zero. So you can get exact matches as well, but vector search is primarily used for similarity search. Cool, that's all that's in the Q&A.
Oh, and that's from Emily. Oh, what is this? "Vector search with PostgreSQL." Wow, okay, that's a funny picture. "Postgres actually has a powerful add-on to manage vectors." Okay, that's not a question, so I don't really know what to tell you there. We've got five minutes left if anybody else has... oh, what happened to my screen? Can you guys see it? "We're seeing your..." there we go.
Okay. I don't know what happened to my screen there for a second. Since it looks like we've wrapped up the rest of the questions, we just want to thank everybody for joining us today. Yujian, thank you for such a great session.
We definitely covered a lot of material, so everybody keep an eye out: we'll send out a link to the replay, so if there's anything you missed or want to go back and review, you'll have time to do that. Thanks so much, and we hope to see you at a future webinar. Thanks, Yujian. Thanks, guys.
Meet the Speaker
Yujian Tang
Developer Advocate at Zilliz
Yujian Tang is a Developer Advocate at Zilliz. He has a background as a software engineer working on AutoML at Amazon. Yujian studied Computer Science, Statistics, and Neuroscience with research papers published to conferences including IEEE Big Data. He enjoys drinking bubble tea, spending time with family, and being near water.