Tutorial: Diving into Text Embedding Models
Webinar
About the Session
In this tutorial and presentation, we'll dive into transformer-based embeddings for long-form text, highlighting some of the theory around why BERT-esque models perform much better than recurrent neural networks (RNNs). From there, we'll go into some examples utilizing the "sentence-transformers" library, showcasing its use in a variety of tasks such as sentiment classification, clustering, and RAG.
What you'll learn:
- The theory behind long-form text embeddings
- Using sentence-transformers in an actual application
I'm Christy Bergman, and I'm a developer advocate here at Zilliz. Today I am pleased to introduce today's session, the Text Embedding Models tutorial, and our guest speaker, Frank Liu. He's Director of Operations and Head of AI/ML here at Zilliz. Welcome, Frank. Thanks, Christy.
Yeah, great to be here. And I'll get started right now. Is that cool with you guys? They can't speak, I think. Okay.
I just wanted to double check with the two of you, but anyway, thank you. Thank everybody here in this session for coming today. Today I want to talk more broadly about text embedding models.
And I want to give everyone a heads up: this is more of a beginner-to-intermediate session. So if you're really into deep learning, or if you want to understand really deeply how embedding models work, which embedding model you should pick for your application, and in particular some of the really interesting cutting-edge, bleeding-edge research that's being done on embedding models today, I will talk a little bit about that. In particular, I'll talk a little bit about some of the things that you want to watch out for as you pick an embedding model. But this is not, let's say, a research deep dive, or a dive into a lot of arXiv papers, so on and so forth.
This is just a very high-level overview of text embeddings and some of the things you'll want to watch out for. And in particular, we're going to go over a tutorial on how to generate embeddings with sentence-transformers later on, towards the end of the presentation. So I'm going to divide this into two parts. There's going to be a tutorial at the end, and before that I want to talk a little bit more about theory and a little bit more about, you know, how did we get to where we are today, the history of these embedding models.
So I'm going to talk a little bit first about what embeddings are and how we got to where we are today when it comes to them, right? And if you really think about embedding models, how they relate to large language models, and how you can use them inside of a vector database, I think they are really, really underrated. One of the things I like to tell folks is that embedding models really are the workhorses of AI applications. And for proof of that, you don't have to look far; it's on Hugging Face, right? If you look at all-MiniLM-L6-v2, which is a pretty popular text embedding model, it's got over 5 million downloads last month. And if you look at Llama 2, which is a very popular large language model, obviously there are some barriers to access and there are other ways to use Llama 2 aside from Hugging Face as well.
But Hugging Face remains one of the most popular. It has something like 700,000, less than a million, downloads on Hugging Face last month. So I think, given the sheer number of embedding models out there and the number of ways to generate embedding data, hopefully it's clear to you guys that embedding models are really, really important for us, especially if we think about applications, not just ones that leverage retrieval augmentation, not just ones that take your data and stuff it into a vector database, but also if we think about other forms of data as well. If we think about images, you can embed those. If you think about videos, audio, you can embed those as well. You can even embed things like graphs.
Me? Uh, what's that, Christy? Okay, anyway, I'll keep going. But if you really think about it, we can embed a variety of different types of data, and that's why embedding models are really, really important for us, and that's why we really want to understand how they work. Now, this particular talk, this particular webinar, is going to be focused on text embedding models, right? It's going to be focused in particular on one component, the retrieval component of retrieval-augmented generation, and how you get your data into a vector database and get it back out as well, right? So I'm going to continue from here. I'm going to talk a little bit about the history of embeddings first, and hopefully this will come across as pretty interesting to some of you.
I'm actually a computer vision guy. So, you know, back in the day you'd have these things called SIFT features, or corner detectors. I think the one in the lower right is Harris corner detection. And this is the way that you generated embeddings. Back then they were called feature descriptors, in the old days, in the early 2000s and the late 2000s as well.
And you would use these so-called handcrafted features or handcrafted algorithms. You'd figure out a way to represent portions of an image. In the case of SIFT features or corner detectors, you'd figure out ways to represent portions of that image and use those features, stuff them into a bag, and then all those features would be a representation for that image. Obviously pretty inefficient.
You have many, many features for a single piece of data. And in the case of, let's say, TF-IDF, that is the equation; some of you might recognize it. The one in the upper right-hand corner is TF-IDF. A lot of these vectors were very, very large and sparse as well.
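For reference, the standard TF-IDF weighting the slide is pointing to looks like this (one common variant; exact formulas differ between implementations):

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}
```

where tf(t, d) is how often term t appears in document d, df(t) is the number of documents containing t, and N is the total number of documents. Stacking one weight per vocabulary term is what makes these vectors so large and sparse.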
So that's the really old days, right? And today, I think we have much more efficient ways of generating embeddings. This is actually a screenshot from Arize Phoenix, a project that I really, really like. What it shows is that it takes these really high-dimensional embeddings that we generate from neural networks and gives you a visualization of what they look like in three dimensions using UMAP. And I won't go into the intricacies of UMAP here in this presentation, but really you can think of it as taking this high-dimensional space and compressing all of it so that we can visualize exactly what's going on in this really, really big embedding space.
But nowadays, unlike before, where we had many, many feature descriptors per image, or these really long sparse vectors that represented text, we can have very compressed, fixed-length vectors that represent images, text, a variety of different modalities, all together in the same space, right? And that's what's really awesome about where we are today when it comes to text embeddings, and embedding models beyond that as well. So from there, hopefully we now have a good understanding of a bit of the history. And, you know, the history of embeddings and the history of vector search is very, very long. Vector databases have really become popular in the past year or two, but vector search has been around for a long, long time.
And I want to talk about some of the ways that we can generate text embeddings today, right? I'm going to start with recurrent neural networks, also going old school as well. The idea behind recurrent neural networks is that you can take this sort of blue box here; this is your neural network. You give it an input hidden state, and then you give it a token. That token will typically be represented as an embedding or as a sparse one-hot input that you give to this recurrent neural network, and it gives you some sort of output.
There are many, many different flavors of recurrent neural networks. You can, for example, give it a single hidden state and remove all hidden states after that. You can make it sort of an encoder-decoder architecture, where you don't necessarily read the output of the neural network at every single step; you wait until you give it, let's say, a separator token, and then ask it to complete the sentence, right? You can use it in ways that we use transformers today as well. But the idea is that it is a single model, and you continually feed the data back into it in order to get the right output, or hopefully the desired output, at the very end.
And in particular, for the embeddings that you typically get from these recurrent neural networks: while you could, depending on the cell type that you have, use one of the internal embeddings (I have seen that done before), typically you want the hidden state, which in this diagram is represented as h of t. That hidden state is fed back into the recurrent neural network, and that is what gives the recurrent neural network a bit of memory.
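For concreteness, the recurrence being described, in its simplest (vanilla RNN) form, looks like this; LSTM and GRU cells add gating but keep the same feedback structure:

```latex
h_t = \tanh\left(W_{xh} x_t + W_{hh} h_{t-1} + b_h\right), \qquad y_t = W_{hy} h_t + b_y
```

The hidden state h_t is both the summary of everything seen so far and the memory that gets fed back in at the next step, which is why it is the natural thing to use as a text embedding.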
And this is a really old-school way to generate text embeddings. Today we have much more powerful ways of doing so, specifically with transformer encoders. And, you know, this session is not meant to be a deep dive into transformers. It's not meant to be a deep dive into BERT versus GPT, or self-attention, or all the different variants of self-attention, or the different ways that you can train transformers. But really, from a high level, you can think of the transformer encoder as taking a sequence of tokens and giving you a sequence of embeddings at the output.
And both of these sequences, at the input and the output, are going to be the same length. So you get one embedding per token, and that obviously has a lot of restrictions, right? It's very inefficient in terms of representation. Oftentimes, as I mentioned earlier, you want a long piece of text, or a whole image, to be represented by a single embedding. It's much more efficient for you to perform search and retrieval that way.
And if we fast forward a little bit: the transformer paper came out in 2017, and a little bit beyond that, I think in 2018 or 2019, there was BERT. BERT really, well, I guess it didn't introduce the idea of pre-training, but it applied the idea of bidirectional pre-training to the transformer encoder architecture. And again, there are lots of great articles out there about how BERT works and about some of its intricacies as well. For those who are interested, I definitely recommend checking out the BERT paper or reading some of the online tutorials about it. But really, at the end of the day, they demonstrated how, by pre-training a transformer encoder, you can get really powerful results on a variety of different applications that leverage natural language understanding. And again, the transformer encoder and BERT use masked language modeling.
They don't use causal modeling. So they're not meant to do next-token prediction like GPT or Bard or Llama; they're meant to take a sentence, or take long-form text, and give you a representation, to really help you, or help the computer, understand what is going on in that sentence. Right? But there's still a problem with BERT, right? It still uses the transformer encoder architecture, which means that, if we look at this diagram here, it does look a little complicated, but it's okay: there are input tokens, and the number of embeddings at the output is going to be the same as the number of tokens at the input. And the way that BERT, or any variant of the transformer encoder, typically tries to figure out whether or not two documents are related is that it will take two sentences, or two pieces of long-form text, add a separator in there, and then add a classification layer at the end, across all those tokens, excuse me, across all those embeddings. And it will give you a value between zero and one, or negative one and one, as to whether or not these two sentences are related.
That's really inefficient, because it means that we have to do one inference for every pair of sentences, every query-document pair, or every symmetric or asymmetric pair. There are a lot of inefficiencies present in this entire process, right? If you think about transformer inference, that's not a cheap thing to do. And if I have hundreds of thousands of documents, and I want to compare an input prompt, if I'm doing retrieval augmentation, or if I want to do document search, I'm comparing my input query to a hundred thousand documents, I have to do inference across an ML model a hundred thousand times. You know, I could parallelize all that, but that requires a lot of computational power, and it's still a very, very inefficient way to do things, even though we have a better method of really understanding whether or not two sentences or two pieces of text are related to each other.
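To make that cost concrete, here's a minimal sketch of the pairwise setup just described, using the CrossEncoder class from sentence-transformers; the model name is just an illustrative choice, and the point is that every query-document pair needs its own forward pass:

```python
from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, document) pair with a full transformer
# forward pass, so scoring one query against N documents costs N inferences.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative model

query = "how do vector databases work?"
documents = [
    "Milvus is an open-source vector database built for similarity search.",
    "BERT is a bidirectionally pre-trained transformer encoder.",
    "A SIFT descriptor represents a local image patch.",
]

scores = cross_encoder.predict([[query, doc] for doc in documents])
print(scores)  # higher score means the pair is judged more related
```

With a hundred thousand documents, that loop becomes a hundred thousand transformer inferences per query, which is exactly the inefficiency bi-encoders avoid.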
So, we fast forward a little bit more, and I think Sentence-BERT also came out in 2019, or was it 2020? Don't quite remember, don't quote me on that. It just takes a regular pre-trained transformer encoder and adds two things onto it: the first is pooling, and the second is a new objective function. In the case of Sentence-BERT, there are actually three different objective functions that you can use.
You have a classification objective function, you have cosine similarity, and you also have triplet loss. This particular diagram shows cosine similarity; it shows how you would do regression across two sentences, across two pieces of text. It is not a very difficult thing to understand, but I think it really reshaped the way that we generate embeddings today.
And I would argue that even today, in 2024, most text embedding models are based on some flavor of Sentence-BERT. Obviously, you have a lot of exceptions out there. There are a number of really cutting-edge, bleeding-edge papers; I've seen some that go back to using convolutions to help you generate sentence embeddings. But most very popular models today, including the one that I showed in this diagram here a little bit earlier, and including the all-MiniLM one I mentioned, are flavors, I would argue, of Sentence-BERT.
And what Sentence-BERT does is give you the capability to really efficiently generate embeddings. As you train a Sentence-BERT model, or a sentence-transformers model, what happens is you have these sentence pairs. They will go through a stock BERT, and then the output will undergo pooling at the very end. This is very, very important, because if you recall, in the original BERT architecture, in the original transformer encoder, we get one embedding for every token, which is pretty inefficient from a retrieval standpoint. Now, we pool across all those token embeddings, and then we fine-tune, or retrain, the entire neural network on some new objective that lets us perform retrieval much, much better. So if I have two sentences with 20 tokens each, instead of having 40-some token embeddings that I need to store inside of my vector database, or that I need to do similarity search over, now I just have two embeddings, because I've pooled the outputs of these two results.
And I'm retraining the whole model using the objective that I want: a cosine similarity objective, a similarity objective that I can very efficiently run in a retrieval system such as a vector database, or any other type of vector search solution out there. And that is really the core of Sentence-BERT, right? Oh, and before I move any further: this is called a bi-encoder, and it gives you one embedding per text. And that is really what we want. We want a compressed, single-embedding representation for long-form text, right?
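As a rough sketch of the bi-encoder just described, this is how you could assemble one yourself in sentence-transformers: a transformer encoder followed by a pooling layer, so each text independently becomes one fixed-length vector. The base model and mean pooling here are just example choices:

```python
from sentence_transformers import SentenceTransformer, models

# Token-level encoder: produces one embedding per token, like stock BERT.
word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256)

# Pooling layer: collapses the token embeddings into a single sentence embedding.
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode="mean",
)

# The bi-encoder: each piece of text goes in on its own and comes out as one
# fixed-length vector that can be stored and searched in a vector database.
bi_encoder = SentenceTransformer(modules=[word_embedding_model, pooling_model])
embedding = bi_encoder.encode("Long-form text becomes a single dense vector.")
print(embedding.shape)  # (768,) for a BERT-base backbone
```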
And if you think about it, at the very beginning, our goal, what we really want to do, is to take a piece of data, in this case long-form text, and transform it into a compressed representation, a single embedding, and have that embedding be representative of what we're trying to capture, and hopefully have it be good for our application, right? And if we fast forward, I sort of gave a sneak peek about some of this, but if we look at text embedding models today, you'll see that most of them are some flavor of Sentence-BERT. They have the typical self-supervised pre-training, the masking and next-sentence prediction that you have from BERT, and then they use some sort of contrastive loss or triplet loss to train the model. And a lot of the changes that you see in these models are actually mostly related to data, but occasionally you do see changes in model architecture or training strategy. You know, you might leverage something like sparse attention; I think RoBERTa does.
Or, instead of masking random tokens, they will mask certain tokens only, or you might use a different objective function altogether. But at the end of the day, the way that a lot of these models operate, their foundation is in Sentence-BERT, in sentence-transformers. And that's what makes sentence-transformers so important for us today when it comes to these text embedding models. I think another thing to really watch out for, and I'll talk about this a little bit in the upcoming tutorial, is that all of this is very highly data dependent.
You can have what are called symmetric or asymmetric embeddings; sorry, symmetric or asymmetric text. Asymmetric is, let's say, prompt-to-document matching, whereas symmetric text is just trying to understand if two sentences mean the same thing, if they have the same semantic meaning. And ideally, you have different models for different domains of text as well. I think Voyage AI does this, or plans to do this.
You should have an embedding model for, let's say, the finance industry. If you've ever read analyst reports, you know that these reports use different language, right? They don't quite read like something that you would find in the New York Times, or the Wall Street Journal, or some of these news outlets. And likewise, arXiv papers also read differently than financial reports and other documents. Just as it takes time for us to adjust to these types of writing styles, you also need models that adjust to these writing styles or forms of writing as well. And last but not least, with text embedding models, even though some of them might have a 32K token length or a 1K token length, what you are doing is taking some really long text and trying to condense or compress it into a single fixed-length embedding. Typically it's 384 or 768, or, as we'll see in the upcoming tutorial, 1024, that is the dimensionality of the embedding.
That dimensionality is roughly equal to one paragraph, right? So the sweet spot is around a hundred to 200 tokens, and that's roughly equal to one paragraph, right? I want to warn folks: you really want to play around with your chunk size. There's no one-size-fits-all strategy, but from my experience, for most models, I would say somewhere between 100 and 200 tokens is a good cutoff for how much you want to chunk your data, right?
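As a minimal sketch of that chunking advice (this isn't from the notebook; the 150-token target is just an illustrative middle of the 100-200 range, and a real pipeline would usually add overlap and sentence-aware splitting):

```python
def chunk_by_tokens(text, tokenizer, max_tokens=150):
    """Split text into chunks of roughly max_tokens tokens each."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    return [
        tokenizer.decode(ids[start:start + max_tokens])
        for start in range(0, len(ids), max_tokens)
    ]

# Usage with a SentenceTransformer model's underlying Hugging Face tokenizer:
# chunks = chunk_by_tokens(long_document, model.tokenizer, max_tokens=150)
# chunk_embeddings = model.encode(chunks)
```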
So there are a lot of things to watch out for when it comes to text embeddings, and at the end of the day it's helpful to ask which ones are the best. I actually just pulled this screenshot yesterday; if you go to the MTEB leaderboard on Hugging Face, this is the retrieval tab. It will show you the performance of all of these different models, and it'll rank them based on an average over a variety of different datasets, right? So, for example, duplicate questions and so on and so forth. And at the end of the day, definitely treat these with a grain of salt. Some models perform better simply because they're much bigger. A lot of these models were trained on a lot of the same data; they might have some differences in architecture, but at the end of the day, most of them are based on Sentence-BERT, or sentence-transformers, as we'll see in the upcoming tutorial.
And when it comes to this, I talked about it a little bit before in a previous AI for developers meetup: if you ask me which embedding model is the best for my application, I would say it's very highly dependent on your application requirements, right? The first thing is, what are you interested in doing? What is your domain of data, and what are you interested in doing? Are you interested in classification? Are you interested in retrieval, re-ranking? There are different models for different portions of your application, right? That's very important to understand. The second thing is, do I have enough data to fine-tune my own model? And if I do, why don't I go ahead and take something that's pre-trained and fine-tune against it, right? I can take, let's say, some of the ones on the MTEB leaderboard and fine-tune against them. How much latency are we okay with, right? If we are very latency sensitive, we'll probably want to pick a smaller model, like GTE or BGE small. And are we okay with false positives? If not, we want something a little bit more specialized, a little bit more domain specific. As I alluded to earlier, oftentimes you have financial documents or legal documents.
You want something that's tuned to certain industries or certain verticals, so that you can get the best performance out of not just your vector database, but also out of your embedding model, right? A lot of this is very important to think about as you develop your application and as you figure out what you want to use. If there's one takeaway from this presentation, from this particular session, it is that there is no one right answer. A lot of it is very dependent on training your model, if you have data scientists internally, or picking a model if you don't. There are a lot of trade-offs involved, and it is, at the end of the day, an engineering question, right? What model should I pick? Which one's best for my application? So, sort of coming up on the halfway mark, I have probably about 20 minutes left. I do have a quick notebook that I'd like to show everybody here. It probably won't take the full amount of time.
So we'll have a little bit more time at the very end, where I can answer any questions that are out there. But I will actually swap over to a different tab; I will take some questions at the very end as well. Good question, Frank, in the meantime: how do you apply SBERT variations to images? Ah, that is a great question.
SBERT is typically for text only, but there are a lot of multimodal models out there. So, for example, any model that is trained on CLIP; there are pre-trained vision transformers on the CLIP dataset, and there's also a section in the SBERT documentation on applying multimodal models to images. You just have to be very careful that you have a model that can support both images and text if you plan to use sentence-transformers for images, right? That's the only thing that I would caution. If I have a little bit of time at the very end, I'll see if I can potentially show how you might want to do that. But yeah, just be cautious that you'll want to use a multimodal model.
You're not going to want to use a pure text embedding model. One other thing that I'll add is that your pure text performance, if you use something like CLIP, is going to be a little bit worse than if you used a non-multimodal model. So, all right, there are going to be sort of three sections to this particular notebook. It's very short and sweet. There are also a couple of different random examples I'll throw in here.
I'm happy to take some examples from the audience as well if you guys want to try generating different embeddings. And then I have one last blurb at the very end about sort of putting everything together, right? I know a lot of folks in this particular session probably use LlamaIndex or LangChain, and I want to talk a little bit about some of that at the very end as well. But before we do that, I'll also make this notebook available online somewhere, or we'll send out an email at the end of this with the notebook link. You'll need two libraries for this particular tutorial: sentence-transformers and Milvus. And one of the interesting things about Milvus is that there is an embedded version.
Not a lot of folks know this; it's called Milvus Lite, and you can actually just pip install milvus and get a vector database that you can run inside a Jupyter Notebook, right? So, I have both of these installed already, so this is going to be pretty quick. And what I'm going to do here first is create a model, a SentenceTransformer model. It's very easy, very straightforward to use, and that's one of the things that I love about sentence-transformers. If I come back here and look at the MTEB leaderboard, I'm actually using one of the models that's on there, that's considered one of the top models right now, as we'll see later on.
This model also has its own limitations, right? But what I'm going to do first is instantiate the model. It really is that simple. And if you haven't downloaded this model already, because it is a pre-trained model, it will go and actually download the weights for the model before continuing. So in this case, because I already have the model downloaded, I can call it and figure out, okay, what is the max sequence length? In this case, I see that it's 512. And then to turn text into an embedding, it's honestly really simple, right? I can just simply encode it. And there are some things I can do to play around with it.
So, for example, I can encode it and ask it to return a tensor rather than a NumPy array. We're not going to worry too much about that in this instance. And you'll see that it returns an embedding, and if I do .shape on this embedding, it is of dimensionality 1024. OpenAI embeddings, I believe, are 768; you have other models out there that can be 1536. In this particular instance, for ember-v1, this is an embedding model that gives you 1024-dimensional vectors, which is also a pretty standard dimensionality.
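Put together, the notebook steps described so far look roughly like this. This is a sketch: the exact model ID, llmrails/ember-v1, is my assumption based on the 1024-dimensional, 512-token model being described, so swap in whatever model you're actually using:

```python
from sentence_transformers import SentenceTransformer

# Assumed model ID ("ember-v1"); downloads the weights on first use.
model = SentenceTransformer("llmrails/ember-v1")

print(model.max_seq_length)  # 512 for this model

# Encode a single sentence into a dense vector (a NumPy array by default).
embedding = model.encode("Zilliz is a vector data store that is amazing.")
print(embedding.shape)  # (1024,)

# You can also ask for a PyTorch tensor instead of a NumPy array.
tensor_embedding = model.encode(
    "Same sentence, different container.", convert_to_tensor=True
)
```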
But one of the things that I want to do now is take this embedding, right? And I want to show you guys the power as well as the limitations of dense embeddings, and why it is so important, depending on your application, to have your own set of data and be able to fine-tune these pre-trained embeddings, or even take just stock BERT and fine-tune over your own data. So I have five sentences here, right? The first is "Zilliz is a vector data store that is amazing." The second is "Unstructured data can be semantically represented with embeddings." The third is related to singular value decomposition.
The fourth is related to chess. And the fifth is, I guess, a sentence about pragmatism: it doesn't matter if a cat is black or white, so long as it catches mice. And what I'm going to do is compare, using this model, all of these sentences together. So I'm going to ask sentence-transformers to encode all these sentences with the model, and then, using a utility function that sentence-transformers provides, I'm going to compute the cosine similarity between all those sentences and the original sentence as well.
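That comparison looks roughly like this, reusing the model from the sketch above and the util.cos_sim helper from sentence-transformers; the third and fourth sentences are placeholders standing in for the ones on screen:

```python
from sentence_transformers import util

sentences = [
    "Zilliz is a vector data store that is amazing.",
    "Unstructured data can be semantically represented with embeddings.",
    "Singular value decomposition factorizes a matrix into three matrices.",  # placeholder
    "The queen is the most powerful piece in chess.",                         # placeholder
    "It doesn't matter if a cat is black or white, so long as it catches mice.",
]
query = "Zilliz is an awesome vector database."

sentence_embeddings = model.encode(sentences)
query_embedding = model.encode(query)

# Cosine similarity: values closer to 1.0 mean more semantically similar.
scores = util.cos_sim(query_embedding, sentence_embeddings)[0]
for sentence, score in zip(sentences, scores):
    print(f"{score:.2f}  {sentence}")
```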
So let's do this, right? Let's try that right there. Cosine similarity in this case is the inverse, the opposite, of cosine distance. So the higher the result is, the more related two sentences are. And we'll see that that actually makes a lot of sense. It makes a lot of semantic sense.
"Zilliz is a vector data store that is amazing" versus "Zilliz is an awesome vector database": these two mean pretty much the same thing, and it makes a lot of sense that the cosine similarity in this case would be 0.93, very close to one. Again, cosine similarity is a value that you typically get that determines how similar two sentences are to each other.
And then I can also see that it continually decreases. "Unstructured data can be semantically represented with embeddings" and "Zilliz is an awesome vector database" are not that related, but they're still okay. So in this case, I get a score of 0.57 according to this particular model. And then it continues to go down and down.
It continues to go down because these sentences, these pieces of longer-form text, aren't related to my initial embedding, right? Very, very simple. Five minutes, uh, five minutes left. Yeah, okay. Alright. So pretty simple.
And I think there's a lot of information that's encoded into these dense embeddings already. I think we can see that. But I also want to go through two other examples that continue to show the strength and also the limitations of these embeddings. The first I'm going to do is, let's see if I can remember these: "I like green eggs and ham."
Let's do that one. And then "I enjoy consuming eggs and ham that are green." And if I do this, it actually says that, okay, this is very close to one. Again, one is the maximum value; a similarity of one means that they're identical. And this is very close, meaning that the model believes that these two are very, very similar.
I can actually do this as well, right? So I can actually do that, and you'll see that it gives me a cosine similarity of one. But I also want to show the limitations of this. So I'm going to do this cosine similarity: the first sentence is "Let's eat, Chris," and the second sentence is "Let's eat Chris." And these two, because of the punctuation that's in here, actually mean very, very different things.
So the first one, you could think of it as me talking to someone named Chris and saying, hey, let's go grab lunch, let's go grab dinner, let's go grab something to eat. And the second sentence, because I've removed that comma, means something very, very different. It's probably not something that you would say on a regular basis. But if we look at the output of the model, sure, it's not that high, it's less than 0.9, which in this case is pretty good, but it's still a pretty high value, even though these two sentences are very semantically unrelated, right? If I say "let's eat, Chris" versus "let's eat Chris," these two things mean very different things.
So why did I go through both these examples? To show you the power of these embeddings, but also to help you understand that there are a lot of nuances here too, and that if you want something that really fits well with your application, you're going to want to fine-tune your own model, potentially with data like this, right? You're going to want to understand that this is something that is really important to do if you want to extract maximum value out of your embeddings, right? So just something to keep in mind as you work with these dense text embeddings moving forward. And then I want to show a quick example of how to fine-tune your model. It's pretty easy. Sentence-transformers provides this InputExample class, and what you can do is just give it a set of training examples.
You can stream these examples from disk as well if you have a very large dataset of these examples. And then you can simply run the training. I'm only using two examples here because, first of all, I don't have a GPU on this laptop, and second of all, if I had a lot it would take a long, long time. But you'll see that after training, the shape is the same, but the results here are going to be a little bit different, right? So if I rerun this, you'll see that the weights have indeed updated a little bit. This is still going to be the same, but these results, the cosine similarities, are now a little bit different.
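That fine-tuning step, with the classic InputExample plus model.fit API in sentence-transformers, looks roughly like this; the two pairs and their target similarity labels are purely illustrative:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, losses

# Two toy training pairs with target cosine similarities (illustrative values).
train_examples = [
    InputExample(texts=["Let's eat, Chris.", "Let's grab some lunch, Chris."], label=0.9),
    InputExample(texts=["Let's eat, Chris.", "Let's eat Chris."], label=0.1),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

# One epoch over two examples: just enough to see the weights (and the
# resulting cosine similarities) shift a little.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=1)
```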
It shows you that these are actually working. The last thing that I want to go over, and I know I only have a minute left, is inserting these into a vector database. I really want to show you guys, hey, you can use Milvus, you can use Milvus Lite, to insert these very easily and be able to conduct really large-scale similarity search across all of them. In this case, what I'm doing is simply saying: I am going to import the default server from Milvus, and I'm going to start it, right? A very simple, very easy way to get an instance of your Milvus vector database up and running. Once that's done, I can actually create a connection to it.
I can use it as if it were a standard database. I'm going to create a collection, and then I'm going to insert all of those embeddings that I computed earlier into the vector database, into this collection called "default", right? And then you'll see here, I've successfully inserted all of these; it's got a particular timestamp. Once I'm done with all of it, I can stop the server and clean up the data as well, right? I should have reversed these two calls, but it's okay. Yeah, that was a bit of a whirlwind with this particular notebook.
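The Milvus Lite portion just described looks roughly like this, assuming the embedded milvus package plus the pymilvus client; the collection name, field names, and the order of the teardown calls are illustrative and may differ from the actual notebook:

```python
from milvus import default_server
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

# Start the embedded Milvus Lite server inside the notebook process.
default_server.start()
connections.connect(host="127.0.0.1", port=default_server.listen_port)

# A minimal collection: an auto-generated ID plus a 1024-dim float vector field.
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1024),
]
collection = Collection("default_collection", CollectionSchema(fields))

# Insert the embeddings computed earlier (one column of 1024-dim vectors).
collection.insert([sentence_embeddings.tolist()])
collection.flush()

# Tear down: stop the server, then remove the on-disk data.
default_server.stop()
default_server.cleanup()
```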
Happy to take any questions that you guys have now; please post them in the chat there and we'll go from there. All right, thank you, Frank. Are you able to see the Q&A? I can read some off for you. Gimme one second, I can see it.
So the first question is from Keen; I hope I didn't butcher your name there, I apologize. The question is symmetric versus asymmetric: I see it talked about for dynamic queries retrieving stored documents. How do I do it in reverse? For example, which library subject headings are most appropriate for this article? That is a great question.
So there's some pretty interesting recent work that's been done on using a large language model to create prompts or queries for a particular document. Because a large language model essentially encodes a lot of information, a "blurry JPEG of the web" is one of the things I've heard large language models called, you can actually use the knowledge stored in the LLM to generate prompts or generate queries. And it actually is really, really good at that. So if I have a document, I could chunk up that document and ask my LLM to generate potential queries.
You know, I could say: you're an expert retriever, generate potential queries or prompts that I can use to search for this document, or generate potential queries or prompts that a human would write so that this document would be returned, something like that, right? You want to play around with your prompts, but it is actually a great way to generate data, quote-unquote unsupervised, without labeled data, using an LLM. The second question is from Frank Greco: are embeddings cross-language? In other words, can I use embeddings from English to generate Italian completions? The answer: it depends on the model. For most embedding models, unless they have seen enough Italian and English text, they're not going to be able to just understand Italian right out of the box. But if a model has seen both English and Italian text and understands that there are a lot of relationships between these two languages, then even if you have English text in a particular domain, let's say legal documents, and your training data doesn't have that same type of data in Italian, as you create embeddings related to legal documents that are written in Italian, it will actually be reasonably performant.
So again, if you don't have Italian in your training data, you shouldn't expect any kind of results from a pure English embedding model, simply because the tokenizers have never seen those Italian words. But even if you have different domains of data, as long as there's a way for the model to map both of these languages together, you can actually get pretty reasonable results. Great question, by the way. Is there a limit on the sentence length that SBERT can encode? I think the answer is no; I'll have to double check. But just keep in mind that because transformers are quadratic, at least dense attention is quadratic relative to input length, you probably don't want to go too long. Here, in this case, I'm actually going to try it live, so let's see if it works.
The max sequence length is 512. I can actually make it 1024, and I can redo all these embeddings, basically. And you'll see that yes, you can do it, but as the token length increases you get, first of all, diminishing returns when it comes to the quality of your embeddings, or you may even get worse embeddings. And secondly, unless you have a lot of CPU or GPU memory, you're going to run into those limits as you have more and more tokens. Again, because transformers, they're not necessarily inefficient, but they're quadratic relative to input length.
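That live adjustment is just an attribute change on the model; whether the longer context actually helps depends on how the model was trained, and the quadratic attention cost still applies:

```python
# Bump the maximum sequence length from 512 to 1024 tokens and re-encode.
model.max_seq_length = 1024
long_embedding = model.encode(very_long_text)  # very_long_text is a placeholder string
```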
How is punctuation tokenized to be integrated with letters or words in embeddings? It depends on the tokenizer, and for most tokenizers, punctuation is actually its own separate token. So I wouldn't be able to answer that for you here, Brian, unfortunately. Again, it depends on the model and the tokenizer that's used. But typically, for most tokenizers, you can think of something like four to six characters as one token. Another question from Keen, a follow-up on asymmetric embeddings.
I don't want to generate queries or prompts from the article; I want to match the article against preexisting subject heading phrases. Would an LLM be necessary for this? Okay, I see what you're saying. So my understanding, and correct me if I'm wrong, is that you have one article and then you have a set of preexisting subject headings or phrases. Would an LLM be necessary? If what you're doing is just trying to match these two together, you can use a standard embedding model, right? The great thing about sentence-transformers is that it doesn't matter what the order of the input is, so to speak, because what you're doing at the end of the day is embedding these two sets of documents, these two corpora of documents, together.
And for every piece of long-form text, you actually get a single embedding out of it. So an LLM would not be necessary. You don't have to do anything specialized in that case; you can just match them regularly, and you'll be able to get the most relevant subject headings or titles given a particular document. Alright, I think that is all for the questions, unless we have more, Christy.
I see a question: what kinds of applications are there for embeddings besides semantic search, besides semantic similarity? Ah, yeah, that is a great question. So if you actually go to the MTEB leaderboard, you'll see that there are quite a few, a variety of different models for different applications. There's clustering, summarization, re-ranking, retrieval, Q&A, and so on. So there's a lot of different things that you can use embeddings for beyond matching prompts to documents, right? Whatever your application calls for, you want to understand that really well and pick an embedding model from that particular leaderboard with that in mind. And then there was a question about multilingual embeddings.
Like, if you speak non-English, will these embeddings work for you? Yeah, so multilingual embeddings are trained so that, first of all, the tokenizer can understand a variety of different types of characters. And these multilingual embeddings, we sort of touched upon them earlier, but even if you have a domain of data in one language and you try to extend that to another language, even if that domain in the second language hasn't been seen in the original training data, it'll still perform reasonably well. So yeah, there are a lot of multilingual embeddings out there; sentence-transformers provides quite a few as well. I didn't dive too much into that here today, in this session, but there definitely are a lot of those out there and they're very, very powerful as well.
Okay. And what if you want to embed images instead of text? Yeah, great question. It depends on what you want to do. If all you want to do is embed images, I would actually take a look at PyTorch Image Models, timm; you can just pip install timm, and that'll give you a lot of pre-trained image embedding models. If you are looking to do some multimodal search, you're a little bit more limited, but any model that's trained with the CLIP objective on LAION, L-A-I-O-N, I'm not sure if I'm pronouncing that right, will do really, really well for your particular application.
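For the image side, here's a minimal sketch of both options just mentioned: timm for pure image embeddings, and a CLIP-style model through sentence-transformers for multimodal (image plus text) search. The specific model names and the image path are illustrative choices, not the speaker's:

```python
import torch
from PIL import Image
import timm
from timm.data import resolve_data_config, create_transform
from sentence_transformers import SentenceTransformer

# Option 1: pure image embeddings with a timm backbone.
# num_classes=0 strips the classifier head so the model returns pooled features.
backbone = timm.create_model("resnet50", pretrained=True, num_classes=0)
backbone.eval()
transform = create_transform(**resolve_data_config({}, model=backbone))

image = Image.open("photo.jpg").convert("RGB")  # placeholder path
with torch.no_grad():
    image_embedding = backbone(transform(image).unsqueeze(0))  # shape (1, 2048)

# Option 2: multimodal embeddings in a shared space via a CLIP model
# exposed through sentence-transformers.
clip_model = SentenceTransformer("clip-ViT-B-32")
img_emb = clip_model.encode(Image.open("photo.jpg"))
txt_emb = clip_model.encode("a photo of a dog")
```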
All right, and I think we have time for one last question in the Q&A window. Yeah, I see it. How can we identify the most relevant paragraph in a document in response to a specific question or search query, particularly when we are converting the entire document into a single embedding? This is the beauty of sentence transformers, and this is the beauty of text embeddings. When you convert an entire document into an embedding, you lose sort of the internal context, so you can't pull out a very specific paragraph using that long-form, compressed embedding.
But what you can do instead is, if you're interested in paragraphs, you can simply chunk your text into paragraphs, split it into paragraphs, and embed each of those paragraphs. You don't have to embed the entire document, right? That's the beauty of text embeddings: they're not fixed length relative to the number of tokens, and you can also match short-form text with long-form text. All right, I think that's all we have time for today. So thank you, Frank.
And we posted a link to the Discord; hope to see you all online there. And thank you for joining. Thanks, everyone, and thanks for hosting, Christy.
Meet the Speaker
Frank Liu
Director of Operations & ML Architect at Zilliz
Frank Liu is the Director of Operations & ML Architect at Zilliz, where he serves as a maintainer for the Towhee open-source project. Prior to Zilliz, Frank co-founded Orion Innovations, an ML-powered indoor positioning startup based in Shanghai and worked as an ML engineer at Yahoo in San Francisco. In his free time, Frank enjoys playing chess, swimming, and powerlifting. Frank holds MS and BS degrees in Electrical Engineering from Stanford University.