What will you learn?
Retrieval augmented generation (RAG) is the most popular style of large language model application to emerge from 2023. The most basic style of RAG works by vectorizing your data and injecting it into a vector database like Milvus for retrieval to augment the text output generated by an LLM. This is just the beginning.
One of the ways that we can extend RAG, and extend AI, is through multilingual use cases. Typical RAG is done in English using embedding models that are trained in English. In this talk, we’ll explore how RAG could work in languages other than English. We’ll explore French, Chinese, and Polish.
Topics covered:
- How RAG works
- How to embed text
- How multiple languages can interact in the embedding model and LLM
I'm pleased to introduce today's session, Introduction to Multilingual RAG, and our guest speaker, Yujian Tang. Yujian has a background as a software engineer working in AutoML at Amazon. He studied computer science, statistics, and neuroscience, with research papers published to conferences including IEEE Big Data. He enjoys drinking bubble tea, can confirm, spending time with his family, and being near water. Welcome, Yujian.
Thank you, Emily. Yes, I love bubble tea. Okay, hello everyone. My name's Yujian.
Today we'll be talking about multilingual RAG. I'm currently a senior developer advocate at Zilliz, and this slide is basically for you to find different ways to contact me. The best way to contact me is through LinkedIn, which is available through the QR code there on the right-hand side of your screen. As Emily was saying, my background is in software, working on machine learning.
I did a bunch of research in machine learning that was published in papers, and I've been working in NLP since 2021. I've been building a lot of RAG apps and AI agents over the last year or so, and today we're going to talk about how you can build multilingual RAG, which is RAG on different languages. So, the contents of our talk today: we'll start with a RAG review, just a review of what RAG is. This will be a short commentary on different RAG topics.
Then we'll go into LLMs and embedding models. We'll cover a short history of LLMs and compare them to embedding models. This is something that came up a lot in my talks last year; people were asking a lot about the differences between LLMs and embedding models, so we're going to talk a little bit about those differences. Then we're going to talk about vector databases.
Vector databases are critical to building RAG applications. And then we're going to go into the demo. So the primary reason that you would use RAG is basically because you want to inject your data into an LLM, right? When you're working with an LLM, you're typically working with some sort of generalized model, and it doesn't have access to the data that you want it to have access to; your private data is generally not going to be used to train an LLM. So in order to get an LLM to work on your data, you need to use RAG, which allows you to do this kind of factual recall.
It's also good for cost optimization, because you can pull your context in and play around with your context windows and how many tokens you're using, and that's really important for production. So this is a general picture of a RAG setup, right? We're going to start at the bottom left-hand corner, where it's your data. Basically, in a RAG setup, you take your data and run it through an embeddings model.
The model will generate vectors, and those vectors go into a vector database like Milvus. Then at query time, when you actually use your application, your query goes to an LLM. The LLM decides what needs to be searched in the vector database, and it sends that search once again through an embeddings model, which turns the text of that search into a vector. That vector goes to the vector database, Milvus searches for the closest vectors, and then returns those vectors to the LLM along with the metadata, which includes the text. And that is how the LLM is able to answer your question.
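For reference, here is a minimal sketch of that flow without any framework, assuming sentence-transformers and numpy are installed; the model name and the two documents are illustrative placeholders, not the ones used later in the demo.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative embedding model; any text embedding model works for this sketch.
model = SentenceTransformer("all-MiniLM-L6-v2")

# "Your data": a couple of documents, embedded and kept alongside their vectors.
documents = [
    "Milvus is an open-source vector database built for similarity search.",
    "Seattle is a city in the Pacific Northwest known for rain and coffee.",
]
doc_vectors = model.encode(documents, normalize_embeddings=True)

# Query time: embed the question, find the closest document by inner product.
question = "What is Milvus used for?"
q_vector = model.encode([question], normalize_embeddings=True)[0]
best_doc = documents[int(np.argmax(doc_vectors @ q_vector))]

# The retrieved text is then pasted into the LLM prompt as context.
prompt = f"Answer using this context:\n{best_doc}\n\nQuestion: {question}"
print(prompt)
```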
So this is just a basic overview of RAG; that's really all we're going to cover there. Now, let's go into some differences between LLMs and embedding models. In order to understand the motivation for RAG and how it works, we should have some history, some background about LLMs. LLMs are a type of neural network, and neural networks started out back in the seventies. The idea is that you're feeding in some numbers, doing a bunch of different calculations, and getting some sort of prediction, some sort of regression or classification, some sort of output about that input, right? So this is a picture of a basic deep neural net: essentially you take an input, which is going to be a series of numbers, you pass it through some layers where calculations happen, and then you pass it out to an output layer.
The output layer will give you some sort of prediction about your input. Now, as neural networks evolved, we began to see that certain types of neural networks were really good for certain types of applications. When it comes to text, when it comes to language, one type of neural network that turned out to be very helpful was the recurrent neural network. The picture here is just one layer, or one neuron of a layer, of a recurrent neural network, and you can see that when this is unfolded, what's going on is that you're taking inputs over time and reusing them.
So a recurrent neural network is able to keep track of tokens over time, and because it has this ability to keep track of tokens over time, it's able to have some context, a context window, to look at text. But RNNs still had a problem: they could only keep context up to a certain length. So in order to solve this problem, we introduced transformer models. Transformer models take an input, and essentially that input goes into an encoder, and the encoder creates a hidden state, a set of matrices, a set of vectors.
That hidden state then gets combined with additional input, usually an attention matrix. So you take two matrices and put them into a decoder, and the decoder produces some output. This was really helpful for solving the problem of short context windows, and it allowed us to have much, much longer context windows. You'll still hear context windows talked about with LLMs, right? Some models have a context window of 8,000.
Some models have a context window of 4,000. Some models have a context window of a million. And context windows are basically determined by the way the model is structured. This leads us to GPT, which is one of these state-of-the-art models, and GPT uses a decoder-only architecture. Essentially what that does is take your sentence, your words, calculate a token and positional embedding, and then output a predicted next token, right? So what GPT is doing is giving you a prediction for what the next token is most likely to be.
So it is generating text, but that text is still a prediction, okay? In this example, if we take "the chicken walked" and send it to GPT, it will produce "across the road" and then an end of sentence, a period. What this is essentially saying is that if we give GPT the words "the chicken walked", it's going to predict that the next most likely word is "across". Then with "the chicken walked across", the next most likely word is "the", and again, with "the chicken walked across the", the next most likely word is "road". And that would be the end of the sentence. And this is because it is predicting from the data that it has.
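As a small illustration of next-token prediction, here is a sketch using the Hugging Face transformers pipeline; GPT-2 stands in for GPT here purely because it is small enough to run locally.

```python
from transformers import pipeline

# GPT-2 is a small decoder-only model; greedy decoding shows the "most likely next tokens" idea.
generator = pipeline("text-generation", model="gpt2")
result = generator("The chicken walked", max_new_tokens=5, do_sample=False)
print(result[0]["generated_text"])
```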
GPT is trained on a large corpus of data found online, and typically when you look online, you're going to see a lot of things about the chicken walking across the road, as opposed to walking across anything else. So that's an LLM. What about embedding models? Embedding models are the way that we generate vectors, and you can really generate vectors from anything, any sort of unstructured or structured data.
You can generate them from images, from videos, from text, from audio, from DNA sequences; all of these different things can be turned into an embedding. But the secret is that instead of taking the last layer, the prediction layer — earlier we talked about how neural networks give a prediction at the end — embeddings are the inner representation of some sort of data from a neural network. As you pass data into a neural network, each layer does some calculation and gives you a quantifiable representation of that data, and that representation carries semantic meaning.
So the last layer, or the second-to-last layer, of the neural network outputs what the network has learned from that data without giving you a prediction. And that's a vector, and that's what you put into vector databases to do semantic similarity search. So just to recap this section: LLMs are large models, that's why they're called large language models. They generate text via predictions. They have some sort of reasoning capability, or at least they seem to have some sort of reasoning capability.
And the general architecture is based on the transformer architecture. Embedding models are smaller, they don't need to be based on transformers, and they do not generate any text, nor give you any predictions. Essentially, you are taking the output of an inner layer of a neural network as your embedding. Okay, so let's talk about vector databases. Vector databases are meant to find semantically similar data.
So if I give you these three sentences: "Apple made profits of 97 billion in 2023", "I like to eat apple pie for profit in 2023", and "Apple's bottom line increased by record numbers in 2023", and I ask you to find which two of these sentences are most alike, what would you say? This is an interactive presentation.
If you think it's sentences one and two, please put that in the chat. If you think it's sentences one and three, please put that in the chat. If you think it's sentences two and three, put that in the chat. Right? So prior to vector databases, the way that search across text was done was through keyword search. With keyword search, you would look for specific keywords.
If I were to look for the keywords Apple, profit, and 2023, I would get back the first two sentences. But actually — oh, I see none of you have answered this question — actually the first and third sentences are the ones that are most similar. Oh, wow. Okay.
Thank you. Yeah, someone has answered the question. That's great. There's another interactive portion coming up on the next slide, so I hope you're ready. So actually, the first and third sentences are the most semantically similar.
So if you were to search for the most similar sentences, you would want the first and third ones back. But if you were to do keyword search, as we would have done prior to vector databases, you would have gotten the first two. So this is an example of how vector databases help. Now, it's not just text data that you can use this on; you can also use this on image data.
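Before moving on to the image example, here is a quick sketch of that comparison with an embedding model, assuming sentence-transformers is installed; the model name is an illustrative choice, not the one used later in the demo.

```python
from sentence_transformers import SentenceTransformer, util

sentences = [
    "Apple made profits of 97 billion in 2023.",
    "I like to eat apple pie for profit in 2023.",
    "Apple's bottom line increased by record numbers in 2023.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities: sentences 1 and 3 should score highest together,
# even though sentence 2 shares more keywords with sentence 1.
print(util.cos_sim(embeddings, embeddings))
```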
So here I've got four pictures of one of my favorite artists, one of the most popular artists in the world. And the question I have for you is: which one of these is the least like the others? This is Taylor Swift one, this is Taylor Swift two, this is Taylor Swift three, and this is Taylor Swift four. I would love for you to just take a guess at which one you think is the least like the others. Great. Okay, everybody says Taylor Swift two is the least like the others. So I don't know if you've seen this presentation before; maybe next time we'll mix it up. But it is Taylor Swift two — Taylor Swift two is the fake Taylor Swift.
So this is just an example of how you, as a human, can do this kind of similarity comparison across different pictures of people, but that's not something that's built into machines. In order for machines to be able to do that, they need to be able to compare things mathematically. And that's how vectors work. Vector embeddings are long strings of numbers that allow you to mathematically compare things that we don't normally see as numbers. So this is an example of how that works.
In this example of semantic similarity, we've got four words: queen, woman, man, and king. If you don't take anything else away from this slide, the one thing to remember is: math on words, math on things that aren't originally numbers. The other important thing to take away is that queen and woman have the same value along the first dimension, and so do king and man.
But this doesn't actually tell us what that first dimension means. It only tells us that those words have the same relationship on that dimension. Just because the first dimension matches for queen and woman, and for king and man, it doesn't mean the first dimension means gender or sex or anything like that. It just means those words have the same value on that dimension. So let's take this mathematical example, and keep in mind this is a toy example: you'll never see two-dimensional vectors in production, and you'll never see people use Manhattan distance in production.
We will talk later about some distances that you will see, but you'll almost never see Manhattan distance in production. So what happens here: if we subtract woman from queen, we have (0.3, 0.9) minus (0.3, 0.4), and we get (0, 0.5). And if we add (0.5, 0.2), which is man, we get (0.5, 0.7), which is king. So the idea behind this slide is basically math on words. Oh, we've got a question: what role does a vector's dimensionality play in how similar the data will be? Vector dimensionality — I think you can think of it as a level of granularity in comparing similar data.
But really, and we'll see this later on — it does matter how many dimensions you have, but by the time you get to hundreds or thousands of dimensions, it's pretty similar; you get diminishing returns as you increase dimensionality. What's actually important is the data that your model was trained on. Okay, so let's take a look at some of the similarity metrics that people do use. We've just said that Manhattan distance is almost never used. There are three types of distances that are used often on dense vectors. The first is Euclidean distance; this is L2.
Basically what you're doing here is capturing the distance in space, the magnitude of the distance in space. For those of you who have taken geometry, which I assume is all of you, you'll remember the Pythagorean theorem, where you're essentially calculating a hypotenuse, calculating distances in space. That's basically the same idea behind Euclidean distance, right? If you think of vectors as points and draw a triangle between them, the distance between them is the hypotenuse. Okay, so that's one method. Next, we have inner product.
Inner product is a little bit less intuitive, but it's actually very nice. It's a very simple, straightforward, and I'd say pretty way to get distance. Inner product measures the projection of one vector onto another. In Euclidean distance, we think of the vectors as points in space; with inner product, we think of the vectors as lines from the origin.
So when you project, you're getting a measure of both the magnitude and the angle, the orientation difference between two vectors. That's inner product: it measures the projection, and you can see from the equation that it's super simple and computationally inexpensive. Next we have cosine similarity. Cosine similarity is the inner product divided by the product of the magnitudes of the vectors.
So it's a normalized version of inner product, and you would never use it unless you are using models that are specifically trained on cosine similarity. Why? Because it's more computationally expensive and it abstracts out the actual magnitude of the vectors. Cosine similarity measures the angle between two vectors, right? So we've just seen magnitude, magnitude plus angle, and just the angle. From this you can think about which measure has more degrees of granularity, and you will probably naturally come to the conclusion that inner product has the most.
Inner product or L2 are pretty much my go-to similarity metrics. Now, cosine is very popular and used a lot online, and the reason for that is that cosine was one of the first metrics people started measuring with when NLP became popular in the late 2010s. But as many people tend to do, many people use this measure blindly because of its popularity. I would encourage you to make sure that if you're going to use a metric, it's the right metric for your embedding model.
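For reference, here is a small numpy sketch of the three measures, reusing two of the toy two-dimensional vectors from the queen/woman/man/king slide; in practice, your embedding model's documentation tells you which metric it was trained for.

```python
import numpy as np

queen, man = np.array([0.3, 0.9]), np.array([0.5, 0.2])

euclidean = np.linalg.norm(queen - man)                  # L2: magnitude of the gap in space
inner_product = float(queen @ man)                       # IP: projection, magnitude plus angle
cosine = inner_product / (np.linalg.norm(queen) * np.linalg.norm(man))  # angle only
print(euclidean, inner_product, cosine)

# On unit-normalized vectors, inner product and cosine similarity are the same number.
q_n, m_n = queen / np.linalg.norm(queen), man / np.linalg.norm(man)
print(float(q_n @ m_n), float(q_n @ m_n) / (np.linalg.norm(q_n) * np.linalg.norm(m_n)))
```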
So just a review of metrics: Euclidean measures spatial distance, cosine measures orientational distance, and inner product measures both. And when you have normalized vectors, your inner product and cosine are the same. Okay, and now we've got indexes. Indexes are a way to access data, and this is really important, because if you think about vectors with hundreds or thousands of dimensions, that's a lot of calculations to do.
You want an efficient and effective way to do that. Some ways to do that include the inverted file index (IVF), which is basically just k-means; it's a clustering algorithm, right? What you're doing here is finding clusters and centroids, and that's basically it; you're creating a Voronoi diagram. To create the index, you run k-means, and then when you use the index, you search for the closest centroids, retrieve all the vectors in those centroids' clusters, and look for the closest vectors within them. Next we have Hierarchical Navigable Small Worlds, HNSW, which is a graph algorithm.
Essentially, you build a graph at build time, and then at search time you go in and search through pieces of the graph. HNSW uses an exploration factor, which is basically a cutoff for a uniform random variable. So if I were to use a cutoff of 0.9, then at insertion time, at build time, every single vector would get inserted into layer zero, and then vectors above 0.9 would also get put into layer one. Vectors above 0.99 would also be put into layer two, and so on. At search time, you start at your highest layer and work your way down. Because this is a graph, the search is very fast; you have to do fewer computations to get the distances between the existing vectors.
And it's also quite accurate, because you're able to store all of these vector values quite easily. The next two we're going to talk about are quantization. Quantization you can think of as basically a bucketing algorithm, right? If you were to take real numbers and turn them into integers, that's an example of bucketing, and that's basically how you can think of quantization. So instead of 0.0432 you would have zero, and instead of 7.1 you'd have seven, okay? So that's what quantization is. Product quantization is scalar quantization across two dimensions: not just across the values within a vector, but also across the values within a block of vectors.
Basically, product quantization is able to compress much further, but it's not quite as accurate. Okay, so let's just review the indexes real quick. We have IVF, which is an intuitive index, k-means; it takes a medium amount of memory, since you're just holding the centroids, and it's pretty performant. Then you have HNSW, which is graph based.
It takes a lot more memory, but it's very performant. Then you have FLAT, which is basic brute force through all the vectors: one hundred percent recall, which means it performs best in terms of accuracy, but it's quite slow and gives you no savings on memory. SQ is scalar quantization, your bucketing across one dimension: low on memory, not as accurate. PQ, product quantization, is even lower on memory and even less accurate. Okay, so with that overview done, let's go and get into the demo.
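Before the demo, here is a sketch of how these index and metric choices show up in code with pymilvus, assuming a local Milvus on port 19530 and an existing collection with a vector field named "embedding"; the collection name, field name, and parameter values are illustrative.

```python
from pymilvus import connections, Collection

connections.connect(host="127.0.0.1", port="19530")
collection = Collection("chinese_cities")  # assumed existing collection

# HNSW graph index with inner product as the similarity metric.
collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "HNSW",
        "metric_type": "IP",
        "params": {"M": 16, "efConstruction": 200},
    },
)

# An IVF_FLAT alternative would cluster vectors into centroids instead, e.g.:
# index_params={"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}}
```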
Before I get into the demo, I'm going to cover what we're going to build today. We're going to look at three different code examples, and you'll notice that all the code examples are extremely similar; the main difference is the data collection for them. What we've done, basically, is taken Milvus, LangChain, and OpenAI and put together some RAG applications that work on three different languages: one in French, one in Chinese, and one in Polish.
Here is the QR code that you can scan; it is also available in the chat. And I will take some questions before we get into the demo, if there are any pressing ones. I see someone asking about AI agents and cost intensiveness; the LLM is probably the most expensive part. Okay, if there are no questions regarding the presentation — oh, there is, okay.
What's the best approach for dealing with documents that contain multiple languages themselves? That is a really good question. What you can actually do — and I didn't do this in the example, because it's a rather complex way to do this — is create an agent that will perform your embedding for you and decide which embedding model to use based on the language you're looking to embed in the document, as long as all the embedding models have the same dimensionality. On the retrieval side, the Milvus collection side: as long as all your embeddings are the same size, the same dimensionality, you can compare them; they're comparable. If they're different, then you can't compare them.
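Here is a hedged sketch of that idea — detect a chunk's language and route it to a per-language embedding model. The langdetect package is one option for the detection step, and the model names below are placeholders to swap for your own MTEB picks, not recommendations.

```python
from langdetect import detect
from sentence_transformers import SentenceTransformer

# One model per language; all models you want to compare inside a single Milvus
# collection must produce vectors of the same dimensionality.
model_names = {
    "zh": "paraphrase-multilingual-MiniLM-L12-v2",  # placeholder: your Chinese pick
    "fr": "paraphrase-multilingual-MiniLM-L12-v2",  # placeholder: your French pick
}
default_name = "paraphrase-multilingual-MiniLM-L12-v2"
models = {}

def embed_chunk(text: str):
    lang = detect(text)[:2]                  # e.g. "zh", "fr", "en"
    name = model_names.get(lang, default_name)
    if name not in models:                   # lazy-load each model once
        models[name] = SentenceTransformer(name)
    return models[name].encode(text)
```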
Okay, great. I think we can move on to the demo. Okay, let's make this smaller; I'm going to need that in a second anyway. Okay, so let's start with the basic RAG example.
I don't actually have a basic RAG example, do I? Okay. So we're going to start by looking at the Chinese example. And like I said, the most difficult part of building these RAG applications is actually collecting the correct data. What we're doing is building these RAG applications across a bunch of cities, where we're going to ask questions about the cities. So we need to be able to scrape the city data from Wikipedia in order to give that data to the LLM, via RAG, to answer questions on.
We're doing Atlanta, Beijing, Berlin, Boston, Cairo, Chicago, Copenhagen, Houston, Karachi, Lisbon, London, Moscow, Munich, Paris, San Francisco, Seattle, Shanghai, Tokyo, and Toronto. Is anybody coming from any of these cities? I saw someone was coming from Seattle earlier. I'm also in Seattle, so go Seattle. If you're coming from any of these cities, let me know; I'm curious. So the first thing that we had to do for the Chinese version — and we also had to do this for the French and Polish versions, although the French version was actually quite easy, you'll see — these are the English names of the cities.
What we had to do is convert these into the correct language. I had originally tried to just scrape them from the Wikipedia API; if you use en.wikipedia.org, that's the English one, and when I used the English names I was able to scrape them all. But when I pointed at zh.wikipedia.org and tried to scrape these cities, I just got no data back. So, like I was saying earlier when someone asked about the multilingual stuff, the most important thing is your data and your data collection.
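For reference, here is a rough sketch of that scraping step using the `wikipedia` Python package, which is one way to pull page text; the key point is that zh.wikipedia.org wants the Chinese page titles, so the city names are translated first. The city subset and file layout here are illustrative.

```python
import wikipedia

wikipedia.set_lang("zh")  # point at zh.wikipedia.org instead of en.wikipedia.org

# English name -> Chinese page title (illustrative subset of the full city list).
cities_zh = {"Beijing": "北京", "Shanghai": "上海", "Tokyo": "东京"}

for english_name, chinese_title in cities_zh.items():
    page = wikipedia.page(chinese_title)
    with open(f"{english_name}.txt", "w", encoding="utf-8") as f:
        f.write(page.content)
```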
So, after translating all of these city names into Chinese, I then went and scraped them from Wikipedia like this, and I stored them into a directory that has all the text files for the Chinese cities. Okay, and then I built RAG on it. Oh, this block of text is actually wrong, okay? For this example, if you would like to follow along, you can install pymilvus, langchain, sentence-transformers, tiktoken — I don't think you actually need that one for this example — and openai. The standalone script is in the GitHub repository, so you can just use zsh or bash, whatever shell you're using, start the file, and you're up and running.
You should run this now, because it takes a while to install; LangChain is a very big library, sentence-transformers is decently big, and OpenAI is all right. So the first step we're going to take here is to load our environment variables.
In this case, we're just going to load our OpenAI API key; all we're doing is calling load_dotenv and getting the API key. Next we're going to get our LLM. For this example, we're just going to use the OpenAI LLM from LangChain; I believe this is 3.5-turbo.
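Here is a minimal sketch of that setup, assuming a .env file containing OPENAI_API_KEY and the older-style langchain OpenAI wrapper the talk refers to.

```python
import os
from dotenv import load_dotenv
from langchain.llms import OpenAI

load_dotenv()                                   # pulls OPENAI_API_KEY from .env
openai_api_key = os.environ["OPENAI_API_KEY"]

llm = OpenAI(openai_api_key=openai_api_key)     # base OpenAI LLM from LangChain
```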
And this is just the base OpenAI LLM. Now we're going to get the Hugging Face embeddings. Hugging Face is a model hub that has a lot of embeddings, and we're also going to import Milvus, which is the vector store that we're using. The first thing we're going to do is get our embeddings. In this example, I'm using this specific embedding model, TownsWu/PEG.
The reason I'm using this model is that if you go onto Hugging Face and look at the MTEB leaderboard, you'll see that this is one of the leading models for Chinese embeddings. There are some other models that are better, but they are just a lot bigger, and not better enough to justify the computational expense. Then we're going to import some things from LangChain to make this possible. We're going to use the character text splitter; this is how we split the data into chunks, right? If we go in and look at our data, we'll see the data looks like this, and it doesn't make sense to embed this entire thing.
So what we want is to be able to split this up into reasonably sized chunks, and that's where the character text splitter comes in. Then we've also got this LangChain Document class, which is basically just a way to create entries in vector databases for LangChain. OS is just the operating system module; we use it for basic file functions.
Here, what I'm doing is listing out all these files so we can see them. We can see that we have all these Wikipedia files, all .txt files. Then I create an empty list of file texts, and this list will eventually be populated by Documents; this is what we're going to insert into Milvus.
Once we have this list, we go through all the files, open each file, read each file, and then create a character text splitter with some chunk size and overlap. This is really up to you to play around with, but I find that 512 and 64 is a decent default. In English, that's about one paragraph, and the overlap is about one short sentence.
Once we have the character text splitter, we split all the file texts with it, and then we loop through all those split texts and create Documents out of them. We create this thing called page_content, which contains the text; that text gets stored with your vector as metadata, and it's incredibly important for actually answering questions. Then we also add the doc title and the chunk number.
This just lets us know which document it was in and where in the document it was. You'll also see that sometimes there are pauses — the character text splitter has a set of tokens it will split on and it lets you complete sentences, which means sometimes you'll have larger chunks. Once you have all these chunks and they're all vectorized, you put them into a vector database like Milvus. So you'll see that we have that list of Documents here, called file_texts, we use our embeddings function, and then we connect to Milvus this way.
I have Milvus running on localhost on port 19530; that's what we did here with the standalone start. Then we give it a collection name; in this case, I've called this collection chinese_cities, because these are the cities. Maybe it makes more sense to call it cities_chinese, but the choice has already been made. And once we have that vector database, we've connected to it, and we've inserted the data, we use it as a retriever.
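Putting those ingestion steps together, here is a condensed sketch; the directory name, collection name, and embedding model mirror what the talk describes, but treat them as assumptions rather than the exact notebook code.

```python
import os
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema import Document
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Milvus

embeddings = HuggingFaceEmbeddings(model_name="TownsWu/PEG")   # Chinese embedding model from MTEB
splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=64)

file_texts = []
for filename in os.listdir("chinese_cities"):                  # assumed data directory
    with open(os.path.join("chinese_cities", filename), encoding="utf-8") as f:
        chunks = splitter.split_text(f.read())
    for i, chunk in enumerate(chunks):
        file_texts.append(
            Document(page_content=chunk,
                     metadata={"doc_title": filename, "chunk_num": i})
        )

vector_db = Milvus.from_documents(
    file_texts,
    embeddings,
    connection_args={"host": "127.0.0.1", "port": "19530"},
    collection_name="chinese_cities",
)
retriever = vector_db.as_retriever()
```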
A retriever is just an object in LangChain that allows you to retrieve data from the vector database. Now, the next part is where we come up with a prompt. Prompts are the way we interact with LLMs. In this case, we're going to create a chat prompt template, and we're going to tell it: you are an assistant for question-answering tasks; use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say you don't know. Use three sentences maximum and keep the answer concise. This is just a basic RAG prompt. Then we say: answer in Chinese. Then we give it the question and the context, and we tell it to generate the answer. The question and the context are both passed in as if you're using an f-string in Python.
Then we create a prompt template, and then we create the LangChain chain. In this chain, the first thing we do is give it context from the retriever; notice this is a function. And we take the question and treat it as a RunnablePassthrough, which is just a function that passes the text through. That's all it is.
Once we get the context and the question, we put them into the prompt, which is how we fill in these values here, and then we pass that to the LLM, and the LLM's output gets passed to the string output parser. So in this example, I ask: what landmarks should I visit in Tokyo? And it tells me something in Chinese. I actually can't read Chinese, but it looks like you should go to Tokyo's... yeah, I can't read this. Okay.
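For reference, here is a sketch of that prompt and chain wiring in LangChain's LCEL style, assuming the `retriever` and `llm` objects from the earlier steps; the prompt text follows the basic RAG prompt described above.

```python
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

template = (
    "You are an assistant for question-answering tasks. Use the following pieces of "
    "retrieved context to answer the question. If you don't know the answer, just say "
    "that you don't know. Use three sentences maximum and keep the answer concise. "
    "Answer in Chinese.\n"
    "Question: {question}\nContext: {context}\nAnswer:"
)
prompt = ChatPromptTemplate.from_template(template)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}  # retrieve context, pass the question through
    | prompt
    | llm
    | StrOutputParser()
)

print(chain.invoke("What landmarks should I visit in Tokyo?"))
```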
And this question right here is basically "what should I visit in Tokyo?", but I used Google Translate to translate it into Chinese. Actually, we can just take this answer, put it into Google Translate, and see what it says. So this says you should go to Fuji-Hakone-Izu National Park and the Tokyo Skytree. And this probably says something very similar, but you'll see that if you ask questions in Chinese versus English, you'll get different responses, and you'll see this in the Polish and French ones as well.
The reason this happens is that the embedding model is embedding something slightly different. Even though the question is similar in the two languages, the actual text, the actual tokens, are different, so the embedding model will produce different vectors. I believe the answer for the Chinese question says Tokyo Tower, the Imperial Palace, and Sensoji Temple. So it doesn't even mention Mount Fuji, which is kind of interesting.
Okay, so that basically covers the basics of how you should be building these kinds of applications. Now let's take a look at the French one. You'll see in the French one, the only city name we had to change was Beijing to Pékin, which is the French name for Beijing. So that's the only change.
Then here we used fr.wikipedia instead of zh or en, and everything else is the same. You'll see that we scraped all this text, and you can see it's all in French — oh, Moscow is probably different as well. And now when we build the RAG application — this block is also wrong, okay — all these steps are exactly the same. You can literally take this template and apply it to whatever data you want. The only thing you need to ensure is that you're using the right data.
You can use this template and add whatever you want. The main things you need to make sure you change are the embeddings. You'll notice I didn't use the same embedding model as last time; I'm just using the default Hugging Face embeddings, which do decently well in French. Actually, the best embeddings for French are the OpenAI embeddings; I found that out later.
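The per-language changes are small enough to show in one place; here is a sketch of the pieces that differ for the French build, with the rest of the template unchanged (the paths and names are illustrative):

```python
from langchain.embeddings import HuggingFaceEmbeddings, OpenAIEmbeddings

# Swap the embedding model per language.
embeddings = HuggingFaceEmbeddings()       # default model, does decently well on French
# embeddings = OpenAIEmbeddings()          # the option the speaker later found best for French

data_dir = "french_cities"                 # illustrative path to the scraped French pages
collection_name = "french_cities"          # separate Milvus collection per language
answer_language = "Answer in French."      # swapped into the prompt template
```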
Then change the directory that you're giving as your file directory. You'll see we're just making sure we're getting all the files here and then opening them up. Same thing with chunk size and chunk overlap; that's still something you can play around with. Make sure you're using the right file directory here as well, and make sure you change your collection name so that you get the right data back.
And of course, in your prompt, you should tell it what language you want it to answer in. So instead of answering in Chinese, we now want it to answer in French. This is all the same. And this time I've asked: tell me a historical fact about Karachi, and it responds with something like this. Let's see.
What's interesting about Karachi: Karachi was first mentioned in Theophrastus's history of plants in the third century BC. It was occupied by the British in the early 19th century and became the capital of Sindh in 1839. In 1876, the future founder of Pakistan, Muhammad Ali Jinnah, was born and buried in Karachi. Well, I don't think he was born and buried in the same year, but there you go; that's an example of LLM hallucination.
Now, if I ask the same question in French, it gives me a slightly different response, and this is once again because the embedding model is going to embed the two questions differently. This response says: Karachi is a city that was founded by the British in the early 19th century and became the capital of Sindh. It was an important economic center and experienced rapid growth, notably thanks to its port. Since the 1980s, the city has been the scene of ethnic and religious conflicts.
And in 2012, it was the site of the deadliest industrial fire in history. The last one is the Polish example. In Polish, once again, you'll find that you have to go convert a lot of these city names. There are actually a few ways you can do that. There is a Python package called translate, and that package will directly translate some of these city names for you.
The other way is to go to the Polish Wikipedia — in this case it's pl.wikipedia — and just put in the English name, and you'll get redirected to the city name most of the time. In this example, we do almost exactly the same thing once again. All we do is change up our embeddings. Here we found another embedding model; once again, using the MTEB leaderboard, you can find the best embedding models, pick your embedding model from that leaderboard, and go from there.
So this is the embedding model that is best for Polish. Everything else is just changing the name to Polish, and you'll see from the files that we're making sure we're getting all of the right cities as well. Then we just do the exact same thing here: change the name of the collection, tell it to answer in Polish, and run it the same way. We ask which sports teams are in Chicago, and I believe the answer is the Chicago Bears, the Chicago Bulls, the Chicago Blackhawks, the White Sox, and the Cubs.
And I think this should give the same answer this time, despite the fact that we asked in Polish, but it's going to be slightly different phrasing — and I don't know Polish, so I don't know what the phrasing is. Maybe a different order. Is it a different order? No, same order: Cubs, White Sox, Blackhawks, Bulls, and Bears. So yeah, there you go.
This is the basics of how you build multilingual RAG applications. You take your RAG application and build it in the exact same way; the main difference is that you want to ensure you're using the right embedding model, which you can find on the Hugging Face leaderboards. You'll use a different embedding model for each language, or you can go and look for embedding models that do well on multiple languages. The Hugging Face leaderboard has multiple language options available.
From those options, you can pick and choose the embedding models that do decently well on all the languages you need. So that's it. How do we know that this is the data that you input using RAG and not just the normal GPT answer? This is a really good question, by the way. The way you would ensure you're getting the right answer is through the templates. GPT is not very well known for following your templates, but the instruct-tuned models are. So what I would suggest, if you don't believe me, is — I mean, I can try to run it now.
We'll see how it works, but you should just run it yourself and see: hey, if I don't give it any of these texts, what will it say? I don't have the exact example pulled up here, but I can go look for it. I do have an example somewhere of the model saying "I don't have this in my context." But let me answer any questions before I go poking around in my workspace for these models. The other question is: didn't we use RAG to overcome hallucination? No, this is a misinterpretation of RAG. RAG is not used to overcome hallucination.
RAG is used to minimize hallucination. Hallucinations are always a risk of using LLMs, because LLMs are predictive machines and all they do is predict the next token. There's no way to have zero hallucination; there's just no way. You can have 99.9 or 99.99% of the answer coming from your data, but there's actually no way to completely overcome hallucination. And if you think about it, that kind of makes sense, because you, as a person, are hallucinating all the time. Every time we remember our memories, we're actually just creating new memories and calling it "oh, this is what happened in the past," but it's actually happening in your brain right now.
Does the LLM have to be trained in the same language as the embeddings? And if yes, what about less common languages? LLMs are typically trained across huge swaths of internet data, and they will usually have exposure to all sorts of languages. Now, some LLMs are specialized to perform better in some languages than others. For example, the Qwen LLM is specialized for English and Chinese. The Mixtral model from Mistral, the French company, is specialized in French and English.
I believe it can also do the other Romance languages, Italian and Spanish, decently well. And this is a really good question: what about less common languages? This is something that people are working on right now. The internet is 55% English, despite the fact that far less than 55% of the world speaks English. So less common languages, such as Arabic or Gaelic, are not languages that these LLMs have had as much exposure to, and so they don't have the same ability to answer and process those questions.
So this is a really, really important problem that a lot of people are working on, and I suspect that as we move forward with the technology, this is going to become more prevalent, and more people are going to start creating datasets specifically for languages that are less common. Last chance to get your questions in before we wrap up today's session; you can use the Q&A tool at the bottom, or you can drop them into the chat. So, Yujian, what was the hardest part of this project? Collecting the data, and also getting the embedding models to work. I actually had to go through and test a few different embedding models, because some of them are really big and I'm running on limited RAM. Actually, this is probably a big problem for a lot of people: access to the right hardware. If you don't have, say, 128 gigabytes of RAM, you can't run most LLMs locally.
There are some tools out there that are looking to help people do that, like llama.cpp and LM Studio; there's a bunch of tools out there trying to help people do this kind of stuff. But in general, the hardest part is hardware limitations and then collecting data. I fumbled around with the Chinese data for so long, I didn't even know why it wasn't working, and then I was like, oh, okay, I have to actually get the Chinese names in here; I can't just use English and have it automatically convert. Thanks, Rob. Concerning the language agent: what is your opinion on the most efficient approach to detect language and let the agent do its job in choosing the embedding model? And what's the best practice for storing them in Milvus — each language in a separate collection? Yes, great question.
This is a very complex task. There are machine learning models that are able to detect language, and basically you have to run such a model to detect the language, which is, once again, additional cost overhead. But unless you already know what language it is and can tag it, this is probably the most effective way to do it: detect the language. Then, in terms of storage in Milvus, you can only store vectors of the same length in the same collection. So I would suggest you look for embedding models that can embed multiple languages, or — if you want to be able to compare these different languages anyway — ensure that all the embedding models you're working with produce vectors of the same length.
Otherwise, yes, you should store them in separate collections, and then you need to make sure that when you're working with this in production, you're using the same embedding models to embed the data and to query those collections. So this is a very complex task that you're doing, and I imagine you're going to run into a lot of bugs. Oh, great question: what are the best vector databases today? Well, I work at Zilliz, so I'm going to tell you that Milvus is the best vector database.
This is not really — you know, my opinion is, if you want to ask this question, you should ask someone who works on LLMs, or someone who works on something other than vector databases, because guess what I'm going to tell you: it's Milvus. And I'm going to tell you why, okay? Milvus is very flexible. You can use multiple different types of vector similarities and multiple different types of indexes, and it's super, super scalable. If you go and look at other vector databases, they don't let you use many types of indexes or many types of similarity metrics. In fact, this is one thing I've been railing about for a while: cosine similarity is super popular and totally useless.
There was even a paper that came out earlier this year saying cosine similarity doesn't even measure similarity, and I was like, ha, I'm right; I've been talking about this for a while. It's an expensive metric and it's not very useful. So you can use different distance metrics and different indexes in Milvus, and that's what makes it really great.
Emily, I can see your lips moving, but — yeah, I was on mute. We have just a few minutes left, so if there are any last-minute questions, we'll give you just a second; type them in quickly. Oh, there's a request to share the paper on the cosine similarity note. Let's see if I can find it.
If Yujian can't find it on the call, we can send it in the follow-up email with the recording later. Oh, I already got it. Yep, I found it; I found it very fast. I've searched for it many times.
Because guess what? I do not like cosine similarity. Yes, I am a man of strong opinions. Any other questions from the audience today? Thank you to all of you who have joined us for the session. Hopefully you enjoyed it and learned a couple of extra languages along the way.
I'm just going to stall for time for a minute, but I'm not seeing anything else in the chat, so we'll let everybody out a few minutes early; you can get to your next meeting or head to lunch, or wherever you may be in the world. Yujian, thank you so much for this session. It's been really fun.
And hopefully we'll catch you all on a future webinar. Thanks, Emily. Thanks, everyone, for coming.
Meet the Speaker
Yujian Tang
Developer Advocate at Zilliz
Yujian Tang is a Developer Advocate at Zilliz. He has a background as a software engineer working on AutoML at Amazon. Yujian studied Computer Science, Statistics, and Neuroscience with research papers published to conferences including IEEE Big Data. He enjoys drinking bubble tea, spending time with family, and being near water.