Tutorial: Building a Semantic Text Search Application
What will you learn?
In this hands-on tutorial, we’ll introduce embeddings and vector search from both an ML- and application-level perspective. We’ll start with a high-level overview of embeddings and discuss best practices around embedding generation and usage.
Then we’ll use this knowledge to build a semantic text search application. Finally, we’ll see how we can put our application into production using Milvus, the world’s most popular open-source vector database.
What you’ll need:
- Python 3.9 or above
- A basic understanding of vectors and databases
What you’ll learn:
- What a vector database is
- What semantic similarity is
- How to use a vector database to find similar texts
I'm pleased to introduce today's session, Building a Semantic Text Search Application, and our guest speaker, Yujian Tang. Yujian is a developer advocate here at Zilliz. He has a background as a software engineer working on AutoML at Amazon. Yujian studied computer science, statistics, and neuroscience, with research papers published at conferences including IEEE Big Data. He enjoys drinking bubble tea, spending time with his family, and being near water. Welcome, Yujian.
Thank you for that introduction, Emily.
Thank you, everybody, for being here today. So today we're gonna be talking about how you can build a semantic text search application. This is gonna be a hands-on workshop, so you should have an IDE ready, and you should have Python 3.9 ready to go.
I use VS Code, so I suggest using VS Code. Once I go over the concepts of what semantic search over text includes, we're gonna dive into the code, and I will set up the virtual environment and do the installs. So all you need is your IDE and Python 3.9. Okay, so let's get into it.
Today we're gonna be talking about semantic search over text. My name is Yujian Tang. I'm a developer advocate at Zilliz.
You can connect with me by emailing me at yujian@zilliz.com or going to my LinkedIn or my Twitter. I'm more active on LinkedIn, and that QR code there links to my LinkedIn. If you are on your phone, you can scan that. And then a little bit about the company, Zilliz.
We are the maintainers of Milvus, the open-source vector database, and you can follow us on Twitter at @zilliz_universe. You can find us on LinkedIn at Zilliz. You can join our Slack at milvusio.slack.com, or you can check out the GitHub for Milvus, the vector database itself, at github.com/milvus-io/milvus.
Okay, so today we're gonna be covering three sections of concepts. We're gonna talk about what semantic search is, then what a vector embedding is, and then how you can do semantic text search. And at the end we're gonna do a workshop where I'm gonna walk you through the code. So what is semantic search? The idea behind semantic search is semantic similarity, which is that some words are similar in their meaning.
This paper is one of the first papers in this space; this is from word2vec. It's basically showing the four words queen, king, woman, and man. The idea of showing you these vectors is that if you take queen, subtract the word woman, and add the word man, you should end up with the vector king. Semantic search takes that idea of having words as vectors and applies it at a large scale: we can find words that are similar to each other by looking at the distances between the vectors.
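The arithmetic described here can be tried on toy data. The sketch below is only an illustration: the three-dimensional vectors are hand-picked (hypothetical) so that the analogy works exactly, whereas real word2vec vectors are learned and have hundreds of dimensions. The helper names `euclidean`, `nearest`, and `analogy` are made up for this example.

```python
import math

# Hypothetical 3-D "embeddings", hand-picked for illustration only.
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "water": [0.5, 0.5, 0.5],
}


def euclidean(a, b):
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(target, vectors):
    """Return the word whose vector is closest to `target`."""
    return min(vectors, key=lambda word: euclidean(vectors[word], target))


def analogy(start, minus, plus, vectors):
    """Compute start - minus + plus and return the closest known word."""
    target = [
        s - m + p
        for s, m, p in zip(vectors[start], vectors[minus], vectors[plus])
    ]
    return nearest(target, vectors)


print(analogy("queen", "woman", "man", VECTORS))
```

Semantic search is the same `nearest` lookup, run against many more vectors and with an index so that it does not have to compare against every entry.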
It is important to note that today the power of vector embeddings is actually much stronger than it was before. Now we have much more complex vector embeddings, and we can embed more than just words: we can do phrases, sentences, paragraphs. For example, in this slide here, if we were to do semantic search on the word protein, we could see that the three closest words would be ions, reactive, and glucose, or maybe molecule. So that leads us to vector embeddings.
This is what we use to do math on our words. A vector embedding is basically a set of numbers that represents the semantic meaning of your word. This is a screenshot that I took from the Zilliz Cloud website that shows you what some of the data in your vector database would look like. I really wanna direct your attention here to this title vector column. This is a vector representation of the titles.
Here are the titles you see in the column next to it, and here we have a bunch of floats. You can see that this is like point something, something, something, comma, point something, something, something. In this case, we have 784 numbers representing a title. In our demo, we're going to have a slightly different vector size.
The vector size is gonna depend on the model that you choose to create your vector embeddings. So how are these vector embeddings actually generated? Like I said, they're generated from a model. If you have a neural network model like this, you would usually input your words, phrases, or sentences, and you would get some sort of output. Maybe your output would be something like the part of speech, or the type of entity (named entity recognition), or some sort of classification. The way that vector embeddings are generated is we put the words or phrases through the model and we take the penultimate layer. We take the second-to-last layer, and that layer contains some numbers that represent your word.
In the example that we show here, this would be a layer of 784 neurons, each of which gives you some number that represents your word. Some tools that you can use to generate vector embeddings include OpenAI, which has OpenAI embeddings. You can go to Hugging Face; this is what we're gonna use in our examples today. We're gonna use sentence transformers from Hugging Face. Or you can use Cohere, which also has a way for you to generate embeddings.
So how do you do semantic text search? We know what semantic search is, we know what semantic similarity is, and we know that we can do math on words using vectors. What you first want in your semantic text search is some sort of knowledge base, some sort of documents. Today we're gonna be using a dataset from Kaggle, and it's also available on Google Drive. Both of those links will be dropped in the chat. Then you take your data and you put it into your deep learning model, you take your vectors from the second-to-last layer, and you store them in Milvus or another vector database.
For this example we're gonna be using Milvus, but there are other vector databases as well. Before we dive into the code, I'm gonna cover a little bit about what the code's gonna do. First thing we're gonna do is download and import our data. Oh, Emily, can you share the Colab notebook? All the code that you're gonna see today is available in a Colab notebook, and I'm going to be walking you through most of the notebook and what it does.
So the first thing we're gonna do is download and import the data. Then we're going to take a look at the data, explore it, and make sure we have an understanding of what kind of data we're working with. That's always really important. Then we're gonna clean the data up, just to make it usable. And then we're gonna spin up a vector database using Milvus Lite.
Milvus Lite is a vector database that you can use right in your Jupyter notebook. You don't have to install anything separately; it's gonna be done through pip install. Then we're going to get the embeddings from sentence transformers, populate the database, and query the vector database with a couple of query examples. So yeah, this is what we're gonna go through. And that's the last slide.
I'm going to open up my IDE now. Okay, so this is VS Code, and basically I've created a completely new environment. If you have a workspace, you just make a directory there with some sort of name and you dive into it.
So first thing we're gonna do is go in here and create an IPython notebook. I'll call this semantic_text_search.ipynb. Then we're going to set up a virtual environment, so virtualenv sts. I'm using a version of Python 3. Then we're gonna activate this, so source sts/bin/activate.
So now I'm in this environment and I can pip install the libraries we're gonna need. So pip install pymilvus.
Yujian, sorry to interrupt. Can you bump up the text size just a little bit?
Oh, yes. Yeah, I can make this look bigger.
How's that?
Thank you.
Okay. So we're gonna pip install PyMilvus. PyMilvus is the client that you use to interact with the Milvus database. And there's actually going to be a new Milvus client coming soon.
It's gonna be released tomorrow. Then we're also gonna install sentence-transformers; gdown, which is how we're gonna download the dataset from Google Drive; and milvus, which is Milvus Lite. Oh, and for Python environments in the notebook, I also need to install ipykernel. Once this runs, we should see these installed. Okay, the next part we're gonna do here is download the dataset.
I'm just gonna copy and paste the code for this, and I'll talk you through it as the libraries are installing. We're gonna start by importing gdown. This is our Google Drive downloader. Then you're gonna get the URL of the Google Drive link, which should be in the chat, and you're gonna tell it to output to a zip file locally. Then you just download it using gdown.
Then we're gonna import zipfile, and we're gonna use it to read the zip file and extract it into a folder that I call white_house_2021_2022. Oh, right. So the dataset that we're working with is a White House speeches dataset from 2021 to 2022. I specifically chose this dataset because ChatGPT doesn't have access to it yet.
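The download-and-extract step might look roughly like the sketch below. The function names are made up for this example, and the Google Drive URL is deliberately left as a parameter because the real link was only shared in the webinar chat. `gdown` is imported inside the function so the extraction helper can be used on its own.

```python
import zipfile
from pathlib import Path


def download_from_drive(url, output):
    """Download a shared Google Drive file to `output`.

    Requires `pip install gdown`; the URL is whatever link you were given.
    """
    import gdown  # third-party, imported lazily

    return gdown.download(url=url, output=output)


def extract_archive(zip_path, target_dir):
    """Unzip `zip_path` into `target_dir` and list what was extracted."""
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(target_dir)
    return sorted(p.name for p in Path(target_dir).iterdir())


# Usage (paths are assumptions, not taken from the notebook):
# download_from_drive(drive_url, "./white_house_2021_2022.zip")
# extract_archive("./white_house_2021_2022.zip", "./white_house_2021_2022")
```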
Later we're gonna be building a large language model application, that's next month, and we're gonna use ChatGPT in that. So first thing we're gonna do is download this, and it's gonna unzip it and get the data. Now that we have this, we want to take a look at the data and actually clean it up. I'm once again just copying and pasting the code from the Colab here.
You have the notebook, so you should be able to copy and paste and follow along. We're gonna use pandas for this. I think pandas is actually already installed; if not, I'm gonna install it, but it looks like it's already installed. Then we're just gonna read the CSV and take a look at what it looks like.
Okay. We can see here a title, a date-time, a location, and the speech. Right off the bat, we notice that there are some issues with the speech. There are these NaN values, these null entries, and these are completely useless to us because they don't provide any semantic meaning. Then there are also these \r\n sequences, which are used to format the speech in the document but are totally useless to us. So we're going to start off by just dropping the nulls, and then we'll take a look at our data frame and see that we don't have any nulls in the speeches anymore. And if there were any nulls in the titles, the date-times, or the locations, they've also been dropped. We can see that we still have a bit of an issue here, where we have this time as a speech.
This is due to errors in the data, and you'll come across this kind of stuff often. So we're gonna clean this data up. I'm going to call this one cleaned_df, and we're gonna set it equal to df.loc where the speech's string length is greater than 50. This just makes sure that we have more than 50 characters in our speech column, so we're not just looking at some sort of time. And now we'll see that we don't have anything that looks like just a time anymore.
We'll also see that this was 1,000 rows and now it's 637, so there was quite a lot of dirty data in this dataset. The last thing we wanna do, and I'll just copy and paste this, is to go through the speech as a string and replace these \r\n values that are completely useless to us. Then we will pull up what a speech looks like to make sure that we don't have any of these \r\n characters. So now we see, "6:47 PM. The President: Well, thank you very much," et cetera, et cetera. Okay, and there's no \r\n in here like there is in here.
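The three cleaning steps can be sketched without any dependencies, as below. This is not the notebook's code: the notebook does the same thing with pandas, and the rough pandas equivalents are noted in the comments. The column name `Speech` and the choice to replace "\r\n" with an empty string are assumptions.

```python
MIN_SPEECH_CHARS = 50  # threshold mentioned in the talk


def clean_rows(rows):
    """Apply the talk's three cleaning steps to a list of dict rows."""
    cleaned = []
    for row in rows:
        # 1. Drop rows with any missing value (pandas: df.dropna()).
        if any(value is None for value in row.values()):
            continue
        # 2. Keep only speeches longer than 50 characters
        #    (pandas: df.loc[df["Speech"].str.len() > 50]).
        if len(row["Speech"]) <= MIN_SPEECH_CHARS:
            continue
        # 3. Remove the "\r\n" formatting sequences
        #    (pandas: df["Speech"].str.replace("\r\n", "")).
        cleaned.append({**row, "Speech": row["Speech"].replace("\r\n", "")})
    return cleaned


sample = [
    {"Title": "Remarks A", "Speech": "Thank you all.\r\n" + "x" * 60},
    {"Title": "Remarks B", "Speech": None},        # null speech: dropped
    {"Title": "Remarks C", "Speech": "6:47 PM"},   # just a time: dropped
]
print(len(clean_rows(sample)))
```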
I'm just gonna cover this very briefly as a data cleaning thing. If you wanna be able to filter by date-time later on, you're gonna want to convert the date-time, which right now is a string, into Unix time. We'll just see what that looks like. We'll see the Unix time column has all of these date-times as numbers, and we'll see the date-times have been converted from strings into actual datetimes. The next thing we're gonna do is actually start up the vector database, and we're gonna start by establishing some constants that we're gonna use throughout. The reason we establish these constants is best practice: we don't want to be changing numbers in multiple places.
No magic numbers, right? So we're gonna name our collection. A collection is kinda like a table; it's basically a set of data that we put into Milvus. Our collection name is gonna be white_house_2021_2022, and our dimension size is gonna be 384. This dimension size is derived from the output dimension of the sentence transformer model that we're gonna use.
Then we're gonna have a batch size of 128. This just means that we're gonna be inserting 128 rows at a time. You'll see that we have 637 rows, so we're gonna be doing five batches. Basically we're gonna be doing four full batches of 128.
That'll get us up to 512, and then we'll have one at the end of 125. And then top K is used to determine how many results we want to return from our query to the vector database. In this case we have three, which means that when we query the vector database, we're gonna get the top three results. Okay. Now we want to spin up Milvus, so the first thing we're gonna do here is from milvus import default_server.
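The constants and the batch arithmetic described just before the server import can be written down as a short sketch. The constant names and the exact spelling of the collection name are assumptions; `batch_sizes` is a made-up helper that only exists to check the "four full batches plus one of 125" claim.

```python
COLLECTION_NAME = "white_house_2021_2022"  # spelling is an assumption
DIMENSION = 384    # output size of the sentence transformer used later
BATCH_SIZE = 128   # rows inserted per batch
TOPK = 3           # results returned per query


def batch_sizes(n_rows, batch_size=BATCH_SIZE):
    """Return the size of each insert batch for `n_rows` rows."""
    full, remainder = divmod(n_rows, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


print(batch_sizes(637))
```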
This is the Milvus Lite server that runs directly in your notebook. Then from PyMilvus, the client side, we're gonna import connections, which allows us to actually connect to the server. And then we're gonna import utility. The utility import is mainly so that we can look into what the server is doing. It allows us to get the server version, and we're also gonna use it to check if we have an existing collection by the same name and drop it if need be.
So I'm gonna import these and then start the default server. Yep, this only takes a few seconds. Once you start the default server, it says, "Milvus Lite: welcome to use Milvus," and you can check the version here. And then we will connect to it.
So we'll use connections from PyMilvus and call connections.connect, and the host is gonna be localhost, so it's gonna be 127.0.0.1.
The port: we're gonna pass a port because you can also adjust the port. And if you ever move to Zilliz Cloud, you're gonna need to know what the port is. So for the port, we're gonna use default_server.listen_port. Okay. Now I'm just gonna show you how to use utility here: utility.get_server_version. We'll get the server version and see that we're using 2.2.8-lite. Now we're gonna clear our collection, just in case we have an existing collection with the same name. We don't want to use that.
We want to use an entirely fresh collection. If you have a collection by the same name and you don't wanna drop it, you can just check utility.has_collection with the name to see if it's true. But in our case, what we're gonna do is call utility.drop_collection. Okay. And now we're gonna move into building the schema.
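Put together, the server start-up and collection reset described above might look like the following sketch. It assumes the `milvus` (Milvus Lite) and `pymilvus` packages from the 2.2.x era used in the talk; the two function names are made up, and the imports live inside `start_and_connect` so that `reset_collection` can be exercised without a running server.

```python
def start_and_connect():
    """Start Milvus Lite in-process and connect the PyMilvus client to it."""
    from milvus import default_server          # Milvus Lite
    from pymilvus import connections, utility  # client side

    default_server.start()
    connections.connect(host="127.0.0.1", port=default_server.listen_port)
    print(utility.get_server_version())
    return default_server, utility


def reset_collection(utility, name):
    """Drop `name` if it already exists so the demo starts fresh.

    `utility` is the pymilvus.utility module (or anything exposing
    has_collection / drop_collection). Returns True if something was dropped.
    """
    if utility.has_collection(name):
        utility.drop_collection(name)
        return True
    return False


# Usage:
# server, utility = start_and_connect()
# reset_collection(utility, "white_house_2021_2022")
```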
This is a really important part. I'm gonna walk through this, and once we create the schema, I'm gonna pause for a couple of minutes to see if anybody has any questions about the code that we've gone over so far. So you still need a schema in a vector database. If you are familiar with databases, you know what a schema is. If you think back to taking computer science at college, you probably remember normalized database schemas from SQL and things like that.
You don't need to have your database in first, second, third, or fourth normal form here. What you really need to do with the schema for a vector database is define it so that you have the metadata, because sometimes you want to have more information than just the vector itself. It's kind of useless if you just have a vector and you're like, "Ah, this vector is close to other vectors." Great, but it doesn't mean anything, right? So you wanna have some information along with your vector so that you know how you can use it, and so that if someone else comes along later and wants to use the database, they can see some context. So the way we do this is we go from pymilvus.
We're gonna import four things. We're gonna import FieldSchema, which is the object used to define a field in the schema. Then CollectionSchema, which is the object that we use to define the schema for the collection. DataType, which tells us what kind of data goes into a particular column. And then Collection, which is the actual collection itself.
So the first thing we're gonna do is make a list of fields. The first thing we need in our list is an ID: FieldSchema with name equals "id", dtype equals DataType.INT64, is_primary equals True, and auto_id equals True. If you're familiar with working with most SQL or NoSQL databases, you are probably familiar with is_primary and auto_id.
auto_id is basically auto-increment, and is_primary is basically saying that this is the primary key. One of the things about Milvus is that you don't actually need to have auto_id be true. You can have your primary key be something that is not auto-incremented. Then we're gonna add the next part, which is gonna be the title.
Title, and that's gonna be DataType.VARCHAR, with a max length of 500; I don't think there are actually gonna be any 500-character strings here. VARCHAR is just the string type. If you're familiar with SQL, it's the same thing.
Instead of VARCHAR(255) or VARCHAR(500), we use VARCHAR with a max length as our way to define a string data type. Next, we're gonna make the date. You can also make this the Unix date if you would like, but I'm just gonna make this a string with a max length of 100. Obviously we don't have any 100-character date strings. You can make this anything that holds the number of characters in an expected date string.
Now we'll do FieldSchema again, and we will do location. This is the location that we saw earlier, like "location not determined," or the Eisenhower Executive Building, or Princeton, New Jersey, something like that. This dtype is once again gonna be a VARCHAR, and this is gonna have a max length of, we'll just say, 200. And then finally we're gonna make the FieldSchema for the embedding.
So this is gonna be "embedding", and dtype equals DataType.FLOAT_VECTOR, and it's gonna be of dimension DIMENSION. This is gonna be our embedding of the actual text of the speech itself. And yeah, this is how we create the fields for our schema.
Then we create the actual schema by wrapping CollectionSchema over these fields we've just created. And then we create the collection by setting collection equal to Collection with the name of our collection that we defined earlier and the schema that we just defined.
Yujian, we've got a couple of questions from the audience. Is it data or date?
Which one? Ah, this one. Yes, it should be date.
Yes. Good catch. Okay, that was a typo. Are there any other questions about the code right now?
No, I think we're okay. We've got some for the end, but we just had a couple of people ask about that.
Okay. Yes, thank you. That was a good catch. I did that on purpose to see if you're paying attention.
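The schema walked through before the question break can be sketched as follows. The field names and lengths follow the talk (with `date` spelled as corrected); keeping the field definitions in a plain `FIELD_SPECS` table is a choice made for this example, not something the notebook does, and `build_collection` assumes PyMilvus 2.2.x and an already-open connection.

```python
DIMENSION = 384  # must match the embedding model's output size

# (field name, DataType member name, extra FieldSchema keyword arguments)
FIELD_SPECS = [
    ("id", "INT64", {"is_primary": True, "auto_id": True}),
    ("title", "VARCHAR", {"max_length": 500}),
    ("date", "VARCHAR", {"max_length": 100}),
    ("location", "VARCHAR", {"max_length": 200}),
    ("embedding", "FLOAT_VECTOR", {"dim": DIMENSION}),
]


def build_collection(name):
    """Create a Milvus collection from FIELD_SPECS (needs a live connection)."""
    from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

    fields = [
        FieldSchema(name=field_name, dtype=getattr(DataType, type_name), **extra)
        for field_name, type_name, extra in FIELD_SPECS
    ]
    schema = CollectionSchema(fields=fields)
    return Collection(name=name, schema=schema)


# Usage:
# collection = build_collection("white_house_2021_2022")
```

Only the `embedding` field is searched; the other fields are the metadata that gives each vector its context.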
Okay, so the next thing we're gonna do is define the index. The index is really important because this is how you actually query the vectors. So the first thing we're gonna do is define our index parameters, and this is gonna be a dictionary: index_type is gonna be IVF_FLAT, metric_type L2, and params, which I'm gonna explain in a second, with nlist. Okay, so what we're defining here is the index type, which in this case is IVF_FLAT.
This determines how we actually create and query the index. IVF_FLAT means that we're gonna use the inverted file index, which is just one way that you can create a vector index. You can use other indexes such as HNSW; if you join the Slack, you'll get a link to that in my welcome message. What else is there? There's Annoy.
There's just straight brute force. But this is a very basic index type that's pretty much used for anything that's simple enough and not something that you need to do at scale, but that also provides nice latency. The metric type of L2 is the L2 norm for the vectors. The other metric type that we provide is IP, inner product, and that's basically your cosine similarity when the vectors are normalized. And then nlist just means the number of centroids; we're looking at basically 128.
Now we're gonna call collection.create_index, and we're gonna give it the field name "embedding" and index_params equals index_params, and then we're gonna load the collection. Okay. Here we've just finished creating our index, and I'll pause here to see if there are any questions about the index type.
There are. Do you want me to read them out to you, or are you on top of it?
Oh, I just opened the thing. "How do you decide if we should pick IVF_FLAT, but not IVF_SQ8 or HNSW?" I mean, I just picked IVF_FLAT here because it's easy.
You can do scalar quantization, or you can do hierarchical navigable small worlds. It is up to you. For example, with scalar quantization you're gonna have a smaller index, but it's not gonna be as accurate; HNSW is gonna use more memory, but it's more accurate. So it's up to you how you wanna do that. "Do you always prefer L2 to cosine similarity?" No, that is also up to you, depending on what you need. "What does load do?" It loads the collection so that we can use it, basically. If you want a more technical deep dive into what exactly load does, I would just read the function on GitHub.
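The index definition discussed in this section might look like the sketch below. The parameter values are the ones named in the talk; the function name `index_and_load` is made up, and `collection` is expected to be a PyMilvus `Collection` (or anything with the same `create_index` and `load` methods).

```python
INDEX_PARAMS = {
    "index_type": "IVF_FLAT",   # inverted file index
    "metric_type": "L2",        # Euclidean distance
    "params": {"nlist": 128},   # number of clusters (centroids)
}


def index_and_load(collection, field_name="embedding", index_params=INDEX_PARAMS):
    """Build the vector index on `field_name`, then load the collection."""
    collection.create_index(field_name=field_name, index_params=index_params)
    collection.load()


# Usage:
# index_and_load(collection)
```

Swapping `index_type` for another supported index such as HNSW or IVF_SQ8 also means changing `params`, since each index type takes its own build parameters.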
Okay, now we're gonna get the vector embeddings and actually populate the database. We pip installed sentence-transformers earlier, and now we're gonna load that up. That's the model from Hugging Face. So from sentence_transformers import SentenceTransformer, and we're gonna create our transformer. This is gonna be a SentenceTransformer, and I'm just gonna pick this one.
It's all-MiniLM-L6-v2. There is no specific reason why I'm picking this other than it is a popular one and it works well for general-purpose use. There are many available embedding models, and you can use basically any one that you want. This is a very popular one to use. OpenAI also has one.
But yes, that's just why I'm picking this. Okay, I see there are a couple of questions. "What do you think of the Google...?" I'll answer that later. "It seems like you need people familiar with linguistics." You don't need to be familiar with linguistics.
You can have somebody for that, but we can also address that towards the end. I'm only gonna take questions about the code for now. Okay. So now we're gonna create a function that's going to embed and insert our data all at once.
This function does two things: it takes the raw data and turns it into a vector embedding, and then it inserts it into the collection. It's embed_insert of data, and we're gonna create our embeddings from transformer.encode of data[3]. I just want to note here that I'm creating this function with how we're going to insert this data in mind. data[3] is gonna be the fourth entry in a list of lists, and that's gonna be the text of the speech. You'll see that we create that list of lists later.
So you need to create this function with the second function in mind. We're actually gonna be changing this up with the Milvus client; it's gonna be easier to do, more intuitive, but this is how it works at the moment. I will take questions after I do the next section as well. So for the insert, we're gonna do data[0], data[1], data[2], and then the embeddings.
So this is basically vector for vector in the embeddings. Then we're gonna call collection.insert and insert our list. So now we need to create the data batch. In this case, we're gonna create a list of lists. It's gonna be a list of four lists, because we have four data types.
We have the title, the date, the location, and the speech. We're creating a list of lists because we're batch inserting 128 of them at a time. So data_batch equals four empty lists, and then for the title, the date, the location, and the speech in zip of the data frame's columns: title, then this one should be date-time, and this one was supposed to be location.
And this one's gonna be the speech itself. Then we're gonna append all of these separate values into our list of lists: data_batch[0] appends the title as a string, data_batch[1] appends the date... ah, I wrote data again. Wow.
Okay. Clearly, I like to write the word data more than I like to write the word date. And data_batch[3] appends the speech. Okay. So then if the length of any of the lists in our data batch, so data_batch[0], mod the batch size equals zero, we will call embed_insert on the data batch, and then we'll reset data_batch to an empty list of lists.
You can also just check if it's equal to the batch size, but this is kind of the standard in machine learning; it's just how people usually write it. Then at the end, if our data batch is not empty, we want to insert the remaining data, and we're gonna call flush to index and load all of the data points. Actually, we'll just call it in the same block. Okay.
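The embed-and-insert function and the batching loop described above can be sketched as below. One deliberate difference from the live-coded version: the transformer and collection are passed in as arguments rather than read from notebook globals, and the rows arrive as `(title, date, location, speech)` tuples instead of being zipped from data frame columns. `transformer` is assumed to behave like a `SentenceTransformer` (an `encode` method taking a list of strings) and `collection` like a PyMilvus `Collection`.

```python
BATCH_SIZE = 128


def embed_insert(data, transformer, collection):
    """Embed the speeches in data[3] and insert one batch of columns."""
    embeddings = transformer.encode(data[3])
    columns = [
        data[0],                    # titles
        data[1],                    # dates
        data[2],                    # locations
        [x for x in embeddings],    # one vector per speech
    ]
    collection.insert(columns)


def insert_in_batches(rows, transformer, collection, batch_size=BATCH_SIZE):
    """Insert (title, date, location, speech) rows `batch_size` at a time."""
    data_batch = [[], [], [], []]
    for title, date, location, speech in rows:
        data_batch[0].append(str(title))
        data_batch[1].append(str(date))
        data_batch[2].append(str(location))
        data_batch[3].append(str(speech))
        if len(data_batch[0]) % batch_size == 0:
            embed_insert(data_batch, transformer, collection)
            data_batch = [[], [], [], []]
    # Insert whatever is left over after the last full batch.
    if len(data_batch[0]) != 0:
        embed_insert(data_batch, transformer, collection)
    collection.flush()
```

The `id` column is absent from the inserted lists because the schema lets Milvus generate it.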
So this should take about 20 or 30 seconds. "What is the best way: insert data and then index, or index first and then insert data?" I'm not sure I understand your question; maybe you can clarify later. But generally you insert your data and then create the index, and you can continue to insert your data and update your index as you need. One of the nice things about Milvus is that everything is done in 512-megabyte chunks.
So you never really need to re-index. You can just continue inserting your data, you can delete, and you don't need to re-index because everything gets queried in parallel and it's fast enough, because each chunk is small enough. I'm not sure what the question is, but we're gonna move on and do the querying now. This is the last part here. We're gonna create some code to query our existing database.
This import is unnecessary; I'm just importing time to show you how long this takes, just to demonstrate how quickly this is done with IVF_FLAT, which is not even the fastest way to do things. Then the search terms. I'm just gonna copy these because this is not terribly important. Okay, I'm gonna be searching for "The President speaks about the impact of renewable energy at the National Renewable Energy Lab" and "The Vice President and the Prime Minister of Canada both speak." So just keep that in mind for what we're searching for.
We'll see words similar to those appearing in the speeches, and the titles that the query returns will be for speeches that are similar to the words I've just queried for. So now we need to create a function called embed_search. This will embed the search data using sentence transformers so we can search for it. The first thing we're gonna do is create the embeddings, where we use the transformer to encode the data. Then we're gonna return a list of... sorry, this is wrong.
We're gonna return a list of embeddings. Now I'm gonna set a start time; this will tell us when we start our query, and then at the end we'll have an end time for when we end our query. So the first thing we'll do is create our results with collection.search. This is how you actually query the data. The data that we're gonna search for is in search_data.
Oh, I need to create search_data. Sorry: search_data equals embed_search of the search terms. Okay. So data is gonna be the search_data we just made. And then the field that we're gonna be searching, the approximate nearest neighbor field, is embedding.
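The embed_search helper described above can be sketched roughly as follows. This is a minimal sketch, not the exact notebook code: the model name is an assumption (it should be whatever sentence-transformers model was used to embed the stored documents), and the optional model argument is added here only so the function can be exercised without downloading a model.

```python
from typing import List, Sequence


def embed_search(data: Sequence[str], model=None) -> List[List[float]]:
    """Encode each search term and return a plain list of embedding vectors."""
    if model is None:
        # Assumption: the model name below is illustrative only. Use the same
        # model that produced the embeddings already stored in the collection.
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer("all-MiniLM-L6-v2")
    embeds = model.encode(list(data))
    # Return one list of floats per query rather than a single 2-D array.
    return [[float(v) for v in vec] for vec in embeds]


search_terms = [
    "The President speaks about the impact of renewable energy at the National Renewable Energy Lab.",
    "The Vice President and the Prime Minister of Canada both speak.",
]
# search_data = embed_search(search_terms)
```

The last line is left commented out because it loads the real model; uncomment it once sentence-transformers is installed.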
And then the parameter is gonna be the metric type, L2. And we're just gonna look into the top 10 centroids for IVF, so that's nprobe: nprobe 10. Okay, cool. And then after this, we're going to set the limit, the top K.
So this means we're only gonna get the top three results. And then output_fields: we're just gonna look for the title. There we go. That will give us our results, and then we're going to loop through the results that we got with enumerate. We're only looking at two search terms, and we're gonna print out the search term.
I call this title in the Colab notebook, but you can really call this whatever you want. It's just the search term. Actually, this probably goes at the end: the search time. And this will give us end minus start.
And so this is just the time it took to do all of the searches. And then we're gonna print out the results. We're gonna get the title, and then this is just a spacer, basically. And then we're gonna get the distance of the actual result. Oh, okay. So that was very fast.
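Putting the query steps just described together, a sketch of the timed search might look like this. It assumes a pymilvus Collection that is already loaded, with a vector field named embedding and a scalar field named title as in the walkthrough; the function name run_search and its defaults are illustrative, not the notebook's own.

```python
import time


def run_search(collection, search_terms, search_data, top_k=3, nprobe=10):
    """Run one batched vector search and print the hits for each search term."""
    start = time.time()
    results = collection.search(
        data=search_data,            # one query vector per search term
        anns_field="embedding",      # the vector field to search over
        param={"metric_type": "L2", "params": {"nprobe": nprobe}},
        limit=top_k,                 # keep only the top_k closest hits
        output_fields=["title"],     # also return the stored title
    )
    elapsed = time.time() - start    # time for all searches in the batch

    for term, hits in zip(search_terms, results):
        print("Search term:", term)
        print("Search time:", elapsed)
        print("Results:")
        for hit in hits:
            # Each hit carries the requested output field and its L2 distance.
            print(hit.entity.get("title"), "----", hit.distance)
        print()
    return results, elapsed
```

With L2 as the metric, a smaller distance means a closer match, so the first hit printed for each term is the most similar title.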
And we'll see that for the search term "The President speaks about the impact of renewable energy at the National Renewable Energy Lab," the first result is "Remarks by President Biden during a tour of the National Renewable Energy Lab," which is kind of what we're looking for. And then for the search term "The Vice President and the Prime Minister of Canada both speak," the first result is "Remarks by Vice President Harris and Prime Minister Trudeau of Canada before bilateral meeting." So that's just showing that that's what we're looking for.
And the search time was very, very low. And then at the end, you want to shut down your server so you don't have it continue running: default_server.stop(). And that is it. That's how you create your own semantic text search application. And now I will take questions.
Okay. Does this database allow pre-filtering on non-vector fields before executing the actual vector search? A date range, for example? Yes. So you can search with a filter. I didn't demonstrate that here, but you can, and I can show you some examples.
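As a rough illustration of what a filtered search could look like with pymilvus, the search call accepts a boolean expression through its expr parameter. The sketch below assumes the collection has an integer scalar field named year; that field is hypothetical and was not part of the collection built in this session, and both function names are made up for the example.

```python
def year_range_expr(start_year, end_year, field="year"):
    """Build a filter such as 'year >= 2021 and year <= 2022'.

    The field name is hypothetical: it must match a scalar field that
    actually exists in your collection schema.
    """
    return f"{field} >= {int(start_year)} and {field} <= {int(end_year)}"


def filtered_search(collection, query_vectors, expr, top_k=3, nprobe=10):
    """Vector search restricted to entities that match a boolean expression."""
    return collection.search(
        data=query_vectors,
        anns_field="embedding",
        param={"metric_type": "L2", "params": {"nprobe": nprobe}},
        limit=top_k,
        expr=expr,                   # scalar filter applied with the vector search
        output_fields=["title"],
    )
```

A call would then look like filtered_search(collection, search_data, year_range_expr(2021, 2022)).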
There will probably be some links to find, but you can ping me after and I can find the links for you. Basically, you can pre-filter for, let's say, a specific date range or a specific location or something like that. So yes, it does allow that. We had a question earlier in the session. I just wanna capture that really quick.
How do we measure the quality of some embeddings? How do you measure the quality of some embeddings? That's a kind of broad question. I don't actually know how to answer that; I don't know what that means. Usually, for the quality of vector embeddings, you can use something like the popularity of the model as an indicator. I'm not sure if that's exactly what you mean, but that's the best I can understand from the question.
Do you work with LlamaIndex, GPT Index, or LangChain? You know, that's a great question, because guess what, I have a couple of blog pieces. Actually, the blog piece for LlamaIndex was just published this morning, I think. Emily, do you know where it is? I do. I will get you that link in just one second. Yes, that would be great. So yeah, we do, and in fact, if you come to the June one, you will see how to create a Q&A over a set of documents using LlamaIndex.
So yes, we do use LlamaIndex and LangChain, and I'll even show an example of how to use it. What's your opinion on the Pinecone database? Can't say. Open source is the way to go. Let's see, there's a couple more. What do you think of Google Universal Sentence Encoder? It's fine. It's good. You can use it.
I wouldn't say that there's any vector embedding neural network or engine, or whatever you wanna call it, that is necessarily better than any other. I mean, yes, there are some that are necessarily better. So for example, if you're using an RNN, like a very, very basic RNN that cannot always accurately identify part of speech, you're probably better off using a different network to get your vector embeddings. But as long as you're using something that is good enough, that is able to predict things pretty accurately, you're good. And okay, wait, I wanted to also address this, the linguistics one.
I don't know what you mean by familiar with linguistics, but you should understand at least that there are some words that are similar to other words, and the quality of the output of your semantic text search is really decided by you. Or it's decided by, I guess, you, but also maybe your users, right? The metric is qualitative, right? It's different from quantitative metrics. But in this case, we could show that the first result that we got back made sense. And as long as it makes sense to you or your users, that is a good indicator that your index is fine, that your distance metric is fine, and that you are getting some set of results that are good. Is there anything else? Which open source LLM do you prefer? Which open source LLM do I prefer? That's a good question.
I don't really know. I don't have one that I would say is specifically amazing. Maybe Claude, I don't know. The LLMs are all at a certain level.
The LLMs are all pretty similar and, you know, my honest opinion is I'm not very satisfied with any of them right now. We have a little bit of time left. Any additional questions that we've got for you, Yujian, today? Drop them into the chat or into the Q&A panel. We'll give it just one more minute in case people are typing. Last call on questions.
Thank you all for joining us. We hope you enjoyed the session. As I mentioned at the top of the hour, we will send out the recording of this, so if there's any section you need to review in more detail, you'll certainly have the opportunity to do that. And if you have any questions about where to find links to the documentation or any of the other materials that we referenced during the session, please just reply to the email with the replay. We're happy to track stuff down for you.
Okay. I'm gonna answer the last question from Ashish. Can you talk a little bit more about the nprobe argument? Yeah. In terms of IVF_FLAT, nprobe is basically how many of the clusters you wanna look at to look for the closest vector. So the way the parallel search works is basically that you have these different segments.
They're all 512 megabytes. In our case, we definitely didn't have 512 megabytes of data, so we only have one segment. The query goes in, it gets to nprobe, and it says, okay, we're gonna look into the 10 closest centroids, and we're gonna find the top three from each segment. And then it kind of unifies them at the end.
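To make the role of nprobe concrete, here is a toy, pure-Python sketch of the IVF idea: vectors are clustered into nlist buckets, and a query only scans the nprobe buckets whose centroids are closest. It is an illustration under simplifying assumptions (plain k-means, a single segment, no parallelism) and is not how Milvus implements IVF_FLAT internally; all function names are made up for the example.

```python
import math
import random


def l2(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(vectors, centroids):
    """Put each vector's index into the bucket of its nearest centroid."""
    buckets = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        buckets[nearest].append(i)
    return buckets


def build_ivf(vectors, nlist, iters=10, seed=0):
    """Cluster the vectors into nlist buckets with a few rounds of k-means."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, nlist)]
    dim = len(vectors[0])
    for _ in range(iters):
        for c, ids in enumerate(assign(vectors, centroids)):
            if ids:  # keep the old centroid if a bucket ends up empty
                centroids[c] = [
                    sum(vectors[i][d] for i in ids) / len(ids) for d in range(dim)
                ]
    return centroids, assign(vectors, centroids)


def ivf_search(query, vectors, centroids, buckets, nprobe, top_k):
    """Scan only the nprobe buckets with the closest centroids."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in buckets[c]]
    return sorted(candidates, key=lambda i: l2(query, vectors[i]))[:top_k]
```

With nprobe equal to nlist, every bucket is scanned and the result matches a brute-force search; smaller nprobe values scan fewer vectors, which is faster but can miss true neighbors that fall in unprobed buckets.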
How do we decide the best nlist and nprobe values for given data? Yeah, so these are more or less things that you test for, to be honest. nlist is mostly going to be something that affects the size of your index, and it affects the indexing speed, although indexing speed really isn't a problem with Milvus.
So mainly the thing you're gonna think about is the size of your index, right? And then nprobe is going to be something that affects the latency of your search. And you can see that that's really not a problem either, 'cause we had our search in like 0.009 seconds. Was that 10 milliseconds? It's less than that, about 9 milliseconds. It's very, very quick.
And so you can have nprobe go up, and as nprobe goes up, you'll probably have higher accuracy. But at a certain point it's probably not worth having higher accuracy for higher latency. And then as nlist goes up, it's gonna take more memory, but you might be able to have better organized data. Can you recommend an embedding model for multilingual use, in Spanish? I'm sorry, I don't know anything that works in Spanish.
Ashish says that the Cohere model is good for Spanish, I think. What is his workstation like? What am I working with? I'm working on a MacBook Pro. This is like, I don't know, 2022, 2021. I don't even know which version this is. Let me just click here.
I'm working with a MacBook Pro on version 12.5, 2021, 16 gigabytes of memory, and using the Apple M1 chip. But for reference, this also works on a 2013 or 2014 Acer Aspire E, which is my Windows laptop that I have also played around with. But I totally recommend Macs in terms of programming. I just think they're nicer. In a rather general way,
Do we have any metrics to decide which index we should use for indexing? Like which index you should use in terms of IVF, HNSW, whether or not you want to use scalar quantization or product quantization? What you're looking at there is really a set of trade-offs: the size of your index, the accuracy of your results, and latency when you do your query. But in most cases, when you're creating the index, you're looking at the size of the index and the results. So quantization, like scalar quantization or product quantization, makes your index smaller, but it also makes your accuracy lower. And then HNSW is really good for accuracy, right? And even its query speed is usually pretty good, assuming you don't get any freak random numbers.
But that makes your index size really big. A lot of people use HNSW now because it's cheap to buy memory. But it depends on what you're working with and what you need. Looks like most people's questions have been answered. Ah, will quantum computers make things better for those dimensions? I also have no idea what this word here is, but, okay.
Will quantum computers make things better for those dimensions you just said? What dimensions do you mean? Can you clarify that? No, I'm not asking you to clarify the word. Okay. Geeky, that makes sense.
But I'm asking what you mean by dimensions. What dimensions are we talking about? The size of the index, or the number of embeddings? The size of the index. Okay. The size of the index. Well, will quantum computing make it better for the size of the index? You know, I haven't given that much thought.
I would say quantum computing will make your queries faster. In terms of the size of the index, that's mainly memory. That's your ROM, your RAM stuff. I would say quantum computing, as far as I know, mostly affects your speed of queries. And in terms of Milvus, yeah, it'll probably make it a little bit faster, but you can actually try a huge one. There's a reverse image search one as well.
I did a reverse image search notebook, which, I didn't send you the link, Emily, but I can find it. The reverse image search notebook uses a much bigger data set, and it takes like 12 minutes or something to load the data. But even so, you can go through it and, I'm gonna send the link here, you can see there that the time it takes is still incredibly small. It's like 60 milliseconds or something like that. Any other model you suggest other than OpenAI? Because OpenAI has a lot of limits right now in the free version.
What do you mean? Like the LLMs? I mean, OpenAI, the GPT model, is kind of the default model for all the tools that are surrounding this ecosystem right now. But you could probably use Bard or PaLM or Claude, or there's another one that was released recently that I can't remember the name of. Those are kind of the popular ones. All right, we are just about at the end of the hour.
We wanna thank everyone for joining us today. Thank you for all the great questions. Yujian, thank you for putting on such a great session. We will see you all next time. If you wanna see our calendar of upcoming webinars, including Yujian's next tutorial, which will be on June 15th, you can head over to zilliz.com/event.
And we will catch you guys next time. Alright, thanks Emily. Thanks everybody for coming. I hope this is helpful.
Meet the Speaker
Join the session for live Q&A with the speaker
Yujian Tang
Developer Advocate at Zilliz
Yujian Tang is a Developer Advocate at Zilliz. He has a background as a software engineer working on AutoML at Amazon. Yujian studied Computer Science, Statistics, and Neuroscience with research papers published to conferences including IEEE Big Data. He enjoys drinking bubble tea, spending time with family, and being near water.