Webinar
Building a RAG App with Mixtral: A Step-by-Step Guide
All right, welcome everybody. My name is Chris and I work here at Zilliz, and today I am your host for this session with Yujian. Yujian is one of our extraordinary developer advocates.
And so, Yujian, I'll let you take it away. Before you do, just a reminder to everybody: this session is being recorded, so you can listen to it later on, and we'll send you the links. Please put your questions in the chat or the Q&A, or have your little AI agents take notes, whatever you like. All right, take it away. All right.
Thanks for that intro, Chris. Hello everybody. Today we're going to be talking about how you can do RAG without OpenAI, and in particular, how to do RAG with Milvus, Mixtral, OctoAI, and LangChain. And this is a little bit about me.
My name is Yujian. I'm a senior developer advocate here at Zilliz. That QR code on your right-hand side will take you to my LinkedIn page, where you can follow me and keep up with all of the AI things that I post. A little bit about my background: I've been in AI/ML for a while, and in software since 2013.
I published my first ML papers back in 2017 and 2018, and I worked on AutoML at Amazon before I got into startups and developer advocacy. I found that I really enjoy writing papers, making videos, and giving talks like this. So let's get into what we're going to talk about today. First I'm going to do a brief overview of what we're going to cover and what we're going to build, and then I'm going to talk about the different pieces of the tech stack.
We're going to talk about Milvus, which is an open-source vector database, and the only open-source distributed vector database. Then we're going to talk about Mixtral, which is one of the more recent models from Mistral. Then we're going to talk about OctoAI, which is the LLM provider we're using to access Mixtral. And then we're going to talk about LangChain, which, unless you've been living under a rock, you probably already know about.
And then we're going to go into a demo; I'm going to do a code demo at the end. So be prepared, have your VS Code ready, and just know that we're going to walk you through some code. You don't have to write all the code as I do it, but it'll probably be helpful for you to at least have something up so you can follow along. Okay, so let's do the overview first.
The tech stack we're covering today is Milvus, then Mixtral via OctoAI, and LangChain. Milvus is our vector database. A vector database is the database that lets you interact with, analyze, and use unstructured data that you weren't able to use before. Essentially, what we're doing is quantifying qualitative data using vector embeddings. And then for the LLM piece of the RAG app: the LLM is your core piece, one of the shining pieces of RAG.
We're going to be using Mixtral, which was developed by Mistral AI, and we're going to access it through OctoAI. OctoAI is actually based here in Seattle, where I'm based. I would love to know where you're all based as well, so drop something in the chat telling me where you're from. And then to orchestrate everything, we're going to be using LangChain. LangChain is a framework for generative AI that's really built around how you can orchestrate and put together a bunch of pieces to build RAG.
Oh, I see there's someone from SF. Okay, West Coast, very much West Coast. Generative AI is very popular here on the West Coast, so that makes sense.
Oh, someone's in France. Awesome. Okay, Toronto, Cincinnati.
Wow, that's surprising. India, New York. Okay, I see we have a very diverse audience here. Washington, DC, and more of the Bay. This is great.
San Jose, yes. Connecticut, Mexico. Wow.
Okay, super cool. What a diverse audience. More Paris people: you're going to have to talk to the Mistral people.
Okay, you folks who are in Paris, you've got to talk to the Mistral people for me. Okay, so what is RAG? Let's take a look at what this architecture looks like. This is a very, very simplified architecture; it's obviously not everything that's going to be built. But essentially, the idea behind RAG is you're going to take your data and embed it via some sort of embedding model (the embedding piece is missing from this diagram), and then you're going to put that into Milvus as your vector database. So your data goes into a vector database like Milvus, and Milvus sits there as your, let's say, source of truth.
And your LLM, in this case provided by OctoAI, is going to talk to your vector database. At query time, when you have a question, you're going to ask that question to the LLM. And the LLM is going to say: okay, here's the question, now how do I get the answer? What it will do is execute some sort of semantic search, and the way it executes semantic search is by giving the vector database, in this case Milvus, what it needs to search for.
Milvus will do a search for that and return the top-K results: top five, top three, top ten, whatever. The LLM will take that as the context, answer the question, and give you a response. You'll see that I've got the LangChain logo here: LangChain is what we're going to use to execute, or orchestrate, all of this back and forth between you, your data, the vector database, and the LLM. Oh, I also see that we have someone from Slovenia.
Wow, very cool. So let's take a look at these pieces of the tech stack. The first thing we're going to look at is Milvus. Don't be afraid:
I know this is a very complicated-looking architecture diagram, but I'm going to walk you through the pieces you need to know; you don't need to understand the entire diagram to follow what's going on. So the first thing to know is: what are the pieces here? What is being done underneath the surface? There are three separations of concern when it comes to doing any kind of search, and in this case we're doing vector search. Those separations of concern are querying, which is basically "how do I find my data?" (you can also think of it as "how do I get my data out?"); data, which is "how do I get my data in?"; and indexing, which is "how do I build the map of how I'm going to find my data?" Indexing exists in every kind of database, but it's particularly important for vector search, because vector search is very, very computationally expensive. We're going to cover what vector search looks like in a moment.
But the idea behind vector search is that you're comparing these long lists of numbers and trying to find the ones that are closest. These nodes are separate in Milvus because they need to be scaled separately. You will never, well, maybe I won't say never, but it's very unlikely that you'll be in a situation where you're doing millions of queries and also building millions of indexes at the same time. On top of this, as your data comes in, indexes are built on segments of data as they arrive. Why is that? If you think about it, an index is basically a map of your data.
So if you put a bunch of data in at once and then build a map on it, that's great. But what happens if you need to grow that data, shrink it, or start replacing some of it? So Milvus has a dynamic way to build indexes, and that's the idea behind these segments. It's also much more scalable: if you think about, say, hundreds of gigabytes of data, it's a much lower big-O to search a few segments in parallel as opposed to searching the entire thing linearly.
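To make that last point concrete, here's a toy sketch of the segmented-search idea. This is not how Milvus is implemented internally, just an illustration with made-up sizes of why per-segment search parallelizes and merges cheaply:

```python
import numpy as np

# Toy illustration: 8 segments of 10,000 vectors each, 128 dimensions.
rng = np.random.default_rng(0)
segments = [rng.standard_normal((10_000, 128)) for _ in range(8)]
query = rng.standard_normal(128)

def top_k_in_segment(segment, query, k=10):
    # Brute-force L2 distance within one segment; each segment can be
    # searched independently of the others (and therefore in parallel).
    dists = np.linalg.norm(segment - query, axis=1)
    idx = np.argsort(dists)[:k]
    return [(i, dists[i]) for i in idx]

# Search every segment, then merge the small per-segment candidate lists.
candidates = [hit for seg in segments for hit in top_k_in_segment(seg, query)]
global_top_10 = sorted(candidates, key=lambda pair: pair[1])[:10]
```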
And then the last thing that's interesting to know is that the way data gets put in is modeled as a pub/sub system. Data comes in, the message store publishes it, and the query node and the data node subscribe to the message store to keep track of the data coming in. The data node will basically say: okay, enough data has come in, there's a segment, we need to flush it; tell the coordinator to call the index node to build an index on that segment. So that's the basic idea behind Milvus. Now let's take a look at what some of this data looks like, so we have a better understanding of what's going on.
So this is what an entry into Milvus looks like. I actually took this from Zilliz, but essentially you need the top two entries: the ID and the embedding. The ID is how we find your data, the exact data point that you want; that's how you can do deletes and upserts. And the embedding is how you can compare data. Earlier I said that what vector databases like Milvus do is let you make quantitative comparisons of qualitative data.
That's what the embedding is for, and that's why what you see there is just this huge list of numbers. The rest of this is what we call metadata. Metadata is data you want to store alongside your ID and your embedding, and it helps define what kind of data you stored. Perhaps it's the sentence (in this case we've got a paragraph), or the chunk of text being embedded.
We've also got a date, in case you want to know when it was published, and the publication that published it. So these are the pieces of vector data you would store in Milvus. And remember, the two most important pieces for you to know are the ID and the embedding; everything else is metadata, and that is optional.
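In client code, that entry shape could be expressed with pymilvus roughly as follows. This is a minimal sketch: the collection name, dimension, and metadata field names are illustrative, not the exact schema from the slide:

```python
from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType,
)

connections.connect(host="localhost", port="19530")

# The two required pieces: an ID and an embedding. Everything after
# them is optional metadata (names here are made up for illustration).
fields = [
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=768),
    FieldSchema("paragraph", DataType.VARCHAR, max_length=2048),
    FieldSchema("publication", DataType.VARCHAR, max_length=256),
]
collection = Collection("articles", CollectionSchema(fields))

# Insert one entry, column by column: ID, embedding, then metadata.
collection.insert([
    [0],                      # id
    [[0.1] * 768],            # embedding (dummy values)
    ["Some chunk of text."],  # paragraph metadata
    ["The Example Times"],    # publication metadata
])
```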
Okay, so now that we've talked about this idea of quantifying qualitative data, let's take a look at how that actually works. This is very much a toy example, let's preface it with that. You're never going to see people using two-dimensional vectors: here, two numbers represent a word, meaning the word is represented in two dimensions, and you'll never have two-dimensional data in production. You'll also never use Manhattan distance, which is what I'm about to show you, in production. But the main thing I want you to take away from this slide is that vector embeddings allow you to do math on words, math on things that were not originally numbers.
And before we get into the math, there's one other thing I have to point out, and it's really critical that you understand it. You'll see that queen and woman, and king and man, have the same value along that first dimension. All this tells us is that these words have the same relationship along that dimension in the vector embeddings. It does not tell us what that dimension means. It does not mean that a value of 0.3 corresponds to some sort of gender or sex characteristic, or that a value of 0.5 corresponds to a gender or sex characteristic. It just means that queen and woman have the same meaning along this first axis. Okay, so let's look at the math. Basically, what we're showing is the idea that queen minus woman plus man equals king. We'll take queen, which is (0.3, 0.9), and woman, which is (0.3, 0.4), and when we subtract them we get (0, 0.5).
If you'd like to venture a guess in the chat at what word might correspond to (0, 0.5), I would love to know; I'll tell you what I think it means later on. And if you take that and add the word man, which is (0.5, 0.2), you get (0.5, 0.7), which happens to correspond to king. So the only thing you need to take away from this slide is that vector embeddings let you do math on things that were not originally numbers. And Milvus lets us compare unstructured data at scale; that's what I was talking about earlier with all the different nodes and all the different ways you do search.
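The slide's toy arithmetic, written out in code with the exact numbers from the slide:

```python
import numpy as np

# Two-dimensional toy embeddings from the slide (never used in practice).
queen = np.array([0.3, 0.9])
woman = np.array([0.3, 0.4])
man   = np.array([0.5, 0.2])

print(queen - woman)        # [0.  0.5]  <- the mystery direction to guess
print(queen - woman + man)  # [0.5 0.7]  <- the vector for "king"
```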
Okay, so nobody has commented anything about what they think that (0, 0.5) value means, so I'll tell you what I think it means. This doesn't necessarily mean that's what it is, because we clearly can't see it marked here, but it would probably mean something like whatever the difference between queen and woman is in your head. So probably something like royalty. Okay, so what about Mixtral? What is... oh, cool.
Someone said "royal" in the chat. Great job. Okay, what about Mixtral? What is Mixtral? Mixtral is a big, big model from Mistral. Mistral AI is based in Paris.
So those of you who are from Paris, go say hi to them. And Mixtral is a mixture-of-experts model. It's not just one LLM; essentially, pretend we have multiple LLMs. In this case we have eight, and only two of them are activated at a time. Each of these LLMs is basically a Mistral 7B model.
And Mixtral has an interesting way, unique at the time, no longer unique, to interact with language, in that it offers multiple languages: it understands English, French, Italian, German, and Spanish. If you think about it, that kind of makes sense: all of these languages have very similar roots, and they're all from very similar areas. I won't say it's easy, but it makes sense that a model can understand all of them. Okay, now what about OctoAI? OctoAI is the serving tool we're going to use.
OctoAI provides a few different things, and the way I think about it is basically generative AI as API endpoints. Most of us are probably familiar with how APIs work at this point; they've been around for a long time. When they first started coming out, we had API gateways and all this stuff, and people asked: what's an API? It's a black box. You send a request, it sends you a response. You can think of OctoAI the same way, and what it does in this case is give you the ability to use generative AI.
So OctoAI has a few options: media gen (Stable Diffusion or ControlNet), text gen (Mixtral, which we're going to use, plus Gemma, Smaug, and Llama), and it also lets you do compute with bring-your-own-model. What we're really going to look at is text generation via Mixtral. There are three parameters you can tune in this text generation. There are really more than three, but the ones you want to think about here are the temperature, the top-p, and the max tokens.
Temperature and top-p kind of do the same thing: how creative do you want this output to be? Temperature is more like how creative you want the output to be; top-p is how varied you want the output to be. You can think of LLMs as statistical models, and these two parameters tune the variation of the statistics of the statistical model behind the LLM. And then there's max tokens, which just says: I only want you to produce 200 tokens, 500 tokens, 1,000 tokens, whatever. We saw earlier that Mixtral can work with 32,000 tokens. We're not going to work with 32,000 tokens here; that's generally more expensive if you have to handle a lot of tokens.
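Here's a toy numeric illustration of what those two sampling knobs do to a token distribution (this is the standard definition of temperature and nucleus sampling, not anything OctoAI-specific):

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5, 0.1])  # scores for 4 candidate tokens

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

print(softmax(logits / 0.5))  # low temperature: sharper, more deterministic
print(softmax(logits / 1.5))  # high temperature: flatter, more "creative"

# top_p keeps only the smallest set of tokens whose cumulative probability
# exceeds p, then renormalizes and samples from that set.
probs = softmax(logits)
order = np.argsort(probs)[::-1]
cutoff = np.searchsorted(np.cumsum(probs[order]), 0.9) + 1
kept = order[:cutoff]  # the "nucleus" sampled from when top_p = 0.9
```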
So we get the ability to adjust how many tokens we need, and this also constrains how the LLM should produce a response. Okay, so the last piece of this puzzle is LangChain. LangChain is the most popular LLM orchestration framework, and essentially the idea behind it is that you chain a bunch of functions together.
One, or some, of these functions can be LLMs. You can treat LLMs as if they are functions; that's one of the core ideas behind LangChain. LangChain has popular plugins for pretty much all the popular tools you can think of: Milvus, OctoAI, OpenAI. And the idea is that you focus on chaining all of these results together, which is why we call LangChain an orchestration framework. Okay, so the next part of this is the code demo.
We're going to spin up the code demo and walk through the code. So if you want...
Hey, Yujian, before you do, there was just one question I wanted to get answered, from Nikola: is there an unhosted alternative to OctoAI?
Is there an unhosted alternative to OctoAI? Yes, you can host these models yourself. Actually, I'm working on something that plays around with this using BentoML. The other thing you can do is go to Hugging Face (I think Torch Hub has some as well) and download an LLM there. Depending on how much memory your GPU or CPU has, you can host an LLM directly on your GPU or CPU, so basically locally. That is an unhosted alternative to OctoAI.
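For the Hugging Face route, a minimal sketch might look like this. The model ID is just an example, and you'd need enough GPU or CPU memory for the weights (device placement here assumes the accelerate package is installed):

```python
from transformers import pipeline

# Downloads the weights from the Hugging Face Hub and runs them locally;
# device_map="auto" places the model on a GPU if one is available.
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.1",  # example model ID
    device_map="auto",
)
result = generator("Who was Leonardo da Vinci?", max_new_tokens=128)
print(result[0]["generated_text"])
```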
Okay, and if you've enjoyed this presentation, I urge you to scan this QR code with your phone, go to the Milvus GitHub, and give us a star. We'd be very happy about that. I see one other question here that I'll answer, which is: how do you join the hackathon? My son was doing it in college. You can join the hackathon by going to our events page; you'll see the hackathon posted there, and you can just click join, or RSVP. You do have to be in Seattle for these hackathons. Cool.
So I'm going to pull up my code. Aha, great. I'm going to zoom in just a little bit here, and we're going to make this smaller. Okay, so this is a code sample of how you do what I was just talking about: building a RAG app on top of Milvus, Mixtral, OctoAI, and LangChain. I've titled this "MMO RAG," which just has a nicer ring to it than "MMOL."
You know, like MMORPGs; I was trying to make a joke here. So the first thing we need to do is actually run Docker Compose. I have Milvus on Docker Compose, and we'll have a link for you in the chat for how you can do that. Basically, once you have the Docker Compose file, you should just be able to run docker compose up -d, and it will start your... what are they called? Containers. I have these running locally, and now I have a way to work with Milvus with persistent storage outside of my notebook here.
So the first thing we're going to do... this is a live example, and live examples are always a bit nerve-wracking, but let's hope this goes well. Here are all the pip packages you'll need if you're going to run along with me. You'll see milvus in there; you actually don't need that one for this setup, it's for Milvus Lite, which can run directly in your notebook if you'd rather do that.
Then you'll need LangChain, of course, sentence-transformers to do the embeddings, and tiktoken. Tiktoken is for chunking; we'll talk about chunking when we get to it, but chunking is basically the way you split up your data so you can access it in a reasonable manner. And you'll need the OctoAI SDK if you're using OctoAI. If you're doing what I was talking about earlier, with either Hugging Face or, if you're familiar with it, BentoML, you can use either of those to run your LLM locally. So here in this first section, all we're going to do is get some imports from LangChain. The first thing we're going to get is the LLMChain; this is LangChain's core piece of framework.
Basically, the core thing LangChain runs on is chaining LLMs and LLM outputs together. Then we'll get the PromptTemplate. Prompts are a very important way to work with your LLM; basically, the prompt is how you tell the LLM what to do. And then we'll get the OctoAIEndpoint, so we can use Mixtral from OctoAI.
Then here, what I'm doing is just loading my environment variables. You can load them however you want, I just prefer this method, and we get the OctoAI API token. Then we're going to make a template, and this is just an example template.
It's just something we're going to use to make sure Mixtral is working. Basically, all we're saying here is: "Below is an instruction that describes a task. Write a response that appropriately completes the request," then instruction, question, response, and so on. Then we turn that into a PromptTemplate so LangChain can understand it. Then we get our LLM: from the LangChain OctoAIEndpoint, we say, here's where we're going to ping, and here are the model keyword arguments we need. In this case, we're using Mixtral 8x7B, which is just their Mixtral model.
And we're going to use the instruct version, which means it's fine-tuned to take instructions; it's not a general chat model. And we're going to use it in FP16. FP16 is 16 bits. Classically, if you're familiar with data types in programming languages, you're probably familiar with floats and doubles; a double is 64 bits, and that's the kind of precision you'd classically see vectors and models using.
But as we've grown these models to huge sizes, we've run into the problem of models that are too big, and we've come up with ways to compress them without losing much of the performance. One of those compression methods is to drop the precision down from double, which is FP64, to single, which is FP32, or half, which is FP16. And then we'll just say: let's only use 128 tokens. Let's not run anything crazy; we're not going to do a very expensive test run here.
Then we'll give it some sort of temperature, some ability to create, and a prompt that tells it what it is: "You are a helpful assistant. Keep your responses limited to one short paragraph if possible." And then we'll give it a question. This is just a test; there's no RAG being done here. It's just using the LLM: who was Leonardo da Vinci?
This is just using the LLM who was Leonardo da Vinci, right?And then we'll say like, here's your prompt, here's the LM,and then we're gonna invoke that, um, uh, uh,LM chain basically,and it should, oh boy, I didn't define prompt, did I?Uh, oh,cool. Okay. So that works. Uh, so I didn't run that cell above,so make sure you run all the cells. Um, so this tells usthat Leonardo da Vinci was an Italian polymath who's oftenregarded as one of the greatest painters in history,et cetera, et cetera, et cetera.
Um, Mona, Lisa, blah, blah. So basically this is like, yeah,Leonardo da Vinci is that guy. Um, oh, please provide a link to the code. Oh, yeah, yeah, yeah. So I can, I can get that, uh,GitHub, ok, two AIvis, okay.
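Pulling those cells together, the smoke test looks roughly like this. Import paths, the endpoint URL, and the model-kwargs shape reflect the LangChain and OctoAI versions of the time, so treat this as a sketch rather than the exact notebook:

```python
import os
from dotenv import load_dotenv
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.llms.octoai_endpoint import OctoAIEndpoint

load_dotenv()  # pulls OCTOAI_API_TOKEN from a .env file

template = """Below is an instruction that describes a task. Write a response
that appropriately completes the request.
### Instruction: {question}
### Response:"""
prompt = PromptTemplate.from_template(template)

llm = OctoAIEndpoint(
    octoai_api_token=os.environ["OCTOAI_API_TOKEN"],
    endpoint_url="https://text.octoai.run/v1/chat/completions",  # assumed URL
    model_kwargs={
        "model": "mixtral-8x7b-instruct-fp16",  # the instruct FP16 variant
        "max_tokens": 128,       # keep the test run cheap
        "temperature": 0.75,
        "messages": [{
            "role": "system",
            "content": "You are a helpful assistant. Keep your responses "
                       "limited to one short paragraph if possible.",
        }],
    },
)

chain = LLMChain(llm=llm, prompt=prompt)
print(chain.invoke("Who was Leonardo da Vinci?"))  # no RAG yet, just the LLM
```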
And where is the chat? There you go, that is the link to the code. Okay, so now we're going to start working with the RAG portion. For RAG, we need to do embeddings and we need a vector database, so we're going to use the OctoAI embeddings endpoint from LangChain, and Milvus as a vector store from LangChain.
The embeddings endpoint part is pretty simple: we just give it the endpoint, and LangChain creates an embeddings function out of it. For the next part, we're going to do a little bit of document manipulation. The link I sent is just the notebook,
but if you go outside of the notebook to the repo at large, you'll see there's a data folder, and we're going to work with that data folder. Here, all we're doing is importing the CharacterTextSplitter, which is one way you can split your data into chunks; Document, which is the way LangChain deals with documents; and os, your operating system module. The first thing we'll do is get a list of all of the data we're going to be working with, and I'll print it out so you can see what we're looking at.
Excuse me. Okay, so we have a bunch of text documents: Chicago, Washington DC, Cambridge (Massachusetts), Houston, Seattle, Toronto, San Francisco, and Boston. These are the cities we're going to be working with, scraped straight out of Wikipedia. What we're going to do is take these files and put them into a text chunker.
So I wrote this function. I was hoping to find a very easy way to do this with LangChain, but I couldn't find one; so if you find an easier way to do this with LangChain, I'm very interested in knowing, please send it over to me. Basically, all we're going to do is loop through all of these files.
We open each file and read it, so now we have the file text as a string, and then we create a text splitter. LangChain has the ability to split text, so we create a text splitter that uses the tiktoken encoder, with, let's say, a chunk size of 512 and an overlap of 64. Chunk size and chunk overlap are definitely things you should consider when working with your data; this is how you get your data into a reasonable format to work with. What chunk size and overlap mean here is basically how many tokens we're going to be dealing with.
If you think about it, 512 tokens is maybe a paragraph or so, a paragraph to a paragraph and a half, and 64 is about a sentence. So basically what we're saying is: chunk up our data into these paragraphs, and make sure consecutive chunks overlap by about a sentence. That way, without being able to see exactly what the chunks look like, we can programmatically guarantee some context overlap between each of our chunks. And context is very important for your data. So now that we have this text splitter, we take the text and run the splitter on the string of the entire file we just read in. Then we enumerate through that list of chunks.
So I get both the position of the chunk and the chunk itself, and I append each one as a Document, which is how LangChain understands these entries. And I record the title of the file: where is this data from? You'll see that this is done pretty naively, so "Washington DC" will actually give us "Washington D," but everything else works. So that's kind of funny.
And then we record a chunk number. And so... did I run this one? I did. Okay, this runs and tells us that some of the chunks are longer than specified. That's based on the tiktoken encoder understanding that sometimes we want complete sentences, so sometimes we need a longer chunk than specified.
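Reconstructed, the chunking loop looks something like this. The data-folder path and the metadata field names follow the repo layout as described in the talk, so treat them as assumptions:

```python
import os
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema import Document

file_texts = []
for file_name in os.listdir("./data"):
    with open(os.path.join("./data", file_name)) as f:
        file_text = f.read()

    # Chunk by tokens: ~a paragraph per chunk, ~a sentence of overlap.
    text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=512, chunk_overlap=64,
    )
    texts = text_splitter.split_text(file_text)

    for i, chunked_text in enumerate(texts):
        file_texts.append(Document(
            page_content=chunked_text,
            metadata={
                # Naive title extraction: a file like "Washington D.C..txt"
                # ends up titled "Washington D".
                "doc_title": file_name.split(".")[0],
                "chunk_num": i,
            },
        ))
```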
Okay, so now we want to put these chunks of data into Milvus. I'm going to run this, because it usually takes at least 20 seconds or so. Basically, what we're doing here is creating a collection in Milvus (collections are kind of like tables), and we're creating it from this document set; that's why these are called documents. The document set we're using is this list of file texts. And we give it the embeddings function, because vector databases, vector stores, store embeddings, so we need a function that can turn all of this text, all of the page content,
into embeddings. That's the OctoAI embeddings function we set up earlier, and by the way, GTE-Large is the model being used. Then we just need some connection arguments. I have Milvus running on Docker Compose locally, so the host is just localhost; this could also be 127.0.0.1,
and for the port we use the default, 19530. And the name of the collection is just going to be "cities." Okay.
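That step, sketched in code (import paths and the embeddings endpoint URL are assumptions based on the LangChain/OctoAI versions of the time):

```python
from langchain.embeddings.octoai_embeddings import OctoAIEmbeddings
from langchain.vectorstores import Milvus

# GTE-Large is the embedding model served behind this endpoint.
embeddings = OctoAIEmbeddings(
    endpoint_url="https://text.octoai.run/v1/embeddings",  # assumed URL
)

# Embeds every Document's page_content and stores it in a Milvus
# collection named "cities" on the local Docker Compose instance.
vector_store = Milvus.from_documents(
    file_texts,
    embedding=embeddings,
    connection_args={"host": "localhost", "port": 19530},
    collection_name="cities",
)
```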
Now let's take a look at what one of these file texts looks like. Basically: Document, page_content equals "Chicago...", et cetera. Chicago is the first one read in, and this is what it looks like. Oh, it looks like someone sent something in the chat about the text splitters.
Okay, cool, I'll have to try that out. Thanks, Max. Okay. Now we need to be able to use this vector store. We have a place where all these vectors are being stored;
now we have to be able to use them. So what we do is turn it into something called a retriever. The retriever basically gives you a way to query the vector store, to retrieve data from it. Then we create another template and another prompt. The template this time says: "Answer the question based only on the following context," and this context is going to be the context we grab from the vector database. Then we also give it a question, and we turn it into a PromptTemplate. Okay, so now we create the LangChain chain.
This is the part where LangChain really shines; we're creating a chain out of LangChain. The things we're going to get here are RunnablePassthrough, which basically lets you pass text straight through as a function, and StrOutputParser, which parses the output as a string. Then we create a chain where we give it the context, and the context comes from the retriever we made earlier.
The question is passed through RunnablePassthrough; it's basically given directly. Then we say: here's the prompt (remember, the prompt is this prompt right here), and then we pipe that prompt to the LLM. This is the pipe operator, just like in CLIs. And then we pipe that to the StrOutputParser, so we can parse the output of the LLM as a string.
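Assembled, the RAG chain he's describing looks roughly like this (reusing the llm and vector_store from the earlier cells; import paths reflect the LangChain version of the time):

```python
from langchain.prompts import PromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

retriever = vector_store.as_retriever()

template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = PromptTemplate.from_template(template)

# The pipe chains the pieces together, just like a CLI pipe:
# retrieve context -> fill the prompt -> call the LLM -> parse to a string.
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print(chain.invoke("How big is the city of Seattle?"))
```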
And then we'll ask something. Let's ask: how big is the city of Seattle? "Big" is kind of a weird term, as you'll see in a second, because it could mean the population size or it could mean the area of the city.
And you'll see here: the answer for Boston was the area of the city, but the answer for Seattle is the population size. And maybe it gives the area as well? Oh, no, no, it just gives the population size.
So maybe we can also ask: how many square miles is the city of Seattle? And it'll give us... oh. Well, I guess it doesn't provide the information. That's very interesting. I guess Wikipedia doesn't have that, or if it does, we were unable to convey it to Mixtral. Okay. So one of the other things we can do is something much more interesting.
Let's actually leverage one of the interesting, unique points about Mixtral, which is its multilinguality; it's able to work in multiple languages. And since Mistral is based out of Paris, we're going to ask it to do French: "You are a helpful assistant who responds in French and not English. Answer the question in French based only on the following context." You'll see that basically what I did was copy and paste the original portion and attach the word "French" in front of everything, so we get the French versions. The French chain will use the French prompt and the French LLM. Then we ask: how big is the city of Seattle? The French chain will call the French LLM, ask the question, and get it to respond in French.
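The French chain is just the same pipeline with a French-instruction prompt swapped in; a sketch, assuming the retriever and llm defined above:

```python
french_template = """You are a helpful assistant who responds in French and
not English. Answer the question in French based only on the following context:
{context}

Question: {question}
"""
french_prompt = PromptTemplate.from_template(french_template)

french_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | french_prompt
    | llm
    | StrOutputParser()
)
fr_1 = french_chain.invoke("How big is the city of Seattle?")
print(fr_1)  # same answer as before, but in French
```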
And the response in French is basically the same as the response in English. Now, I don't actually read French, but I'm pretty sure this says the city of Seattle has a population of 750,000 inhabitants as of 2022, and the metropolitan region of Seattle comprises 4.02 million inhabitants. I don't know what this next part is saying, honestly; something about growth at 21.1%, something like that.
"Can sentence-transformers be used as an embedding tool?" Yes, it can. In fact, I would say that Hugging Face's sentence-transformers library, which we imported earlier, is my favorite embedding tool, because it's free, you can host it locally, and it's easy to use. So the answer is yes, and you can use a bunch of different models with it. It's just that since we're doing everything hosted here, I figured I would pass this on to the hosted version. Cool.
No open questions. "...which makes it the 15th biggest." Ah, thank you, Damien. Yes: which makes it the 15th biggest metropolitan region in the United States, and then something about growth rates. And so that's it.
That is how you can build a RAG app without using OpenAI. And if you'd like, we can also ask the French chain some other questions and get some other responses back in French. So I'll open the floor for people to put some different questions down, and we can see if we get something interesting. "You can translate French to English." Well, I can't translate French to English, but maybe other people can. "I've noticed ChatGPT can translate a few languages;
is that recommended, or is there a specific translation mode?" This is a good question, or a good point. I've actually played around with this a little more, and you can ask questions in multiple languages and get back responses in multiple languages. And it's not just the French, German, Italian, Spanish, and English that are advertised; you can play around with Hungarian or Turkish or other languages like that as well. But I have no real insight into why you can do that.
The only response I really have is: hey, LLMs are magic. Sufficiently advanced technology is indistinguishable from magic. "How does it work if tokenization is different for different languages?" Yeah, see, I want to know that too. That's a great question, and unfortunately, I just don't know.
Well, you can ask the Mistral people. I could ask the Mistral people. I do know that one of the experts in Mistral's mixture of experts is basically this translation kind of thing. "Is there an evaluation of the correctness of the translations, something like that?" There isn't, at least not widely known. You basically just have to find someone who understands French.
I was building this with Ri, and Ri speaks French, so he said: yes, this is what it is. And I said: cool. "Does the LLM treat the latest index with a higher weight?" I'm not sure I understand the question; can you expand a little on what "higher weight" means, or what "latest index" might mean? "Take the output of the French and ask the model to translate to English."
Oh, that's a great idea. That's brilliant. Okay, so let's do chain.invoke, translate to English. Okay, wait.
So let's go back and reinstate one of the old ones. We'll get rid of this, and we'll just say: English chain, prompt equals prompt. So the prompt will be... let's see what it says up there.
Where'd it go? "Below is an instruction that describes a task. Write a response that..." Okay, let's say it translates the entry into English. Entry, entry. And then what we can do is question, or entry, equals... oops, let's copy and paste this. Maybe if I just do... what's fr one? Okay, let's just do entry equals fr one. "fr one is not defined"? Are you serious? Did I just type this wrong? Oh, fr_1. Yes.
Okay. So let's see what it says, let's see if it translates the entry into English. Ah: "Thank you for the information about Seattle's population. To summarize, Seattle had a population of approximately..."
Okay, so yes, this basically translates it into English. And yes: between 2010 and 2020, Seattle experienced notable growth of 21.1%, ranking it among the fastest-growing cities in the country; these statistics highlight Seattle's increasing importance in the national landscape. Yeah, that's right.
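Cleaned up, that ad-hoc translation cell amounts to something like this (reusing the llm, LLMChain, and PromptTemplate from the smoke test; fr_1 is the French answer from the French chain above):

```python
translate_template = """Below is an instruction that describes a task. Write
a response that appropriately completes the request.
### Instruction: Translate the entry into English.
### Entry: {entry}
### Response:"""
translate_prompt = PromptTemplate.from_template(translate_template)

translate_chain = LLMChain(llm=llm, prompt=translate_prompt)
print(translate_chain.invoke({"entry": fr_1}))  # French answer back to English
```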
So if you live in the US, you should come to Seattle and come to all my AI events. Okay. "How are the embeddings stored in the vector DB adding to the value of the response we see from the LLM? Can you show with a sample what the response would look like if we didn't have them?" Well, as we saw earlier, if the embedding is not in the vector DB, then it cannot answer the question. So that is the answer to your question about having the embeddings in the vector database. Does that answer your question, Anjo? Awesome.
I mean, we have to remember that yes, these models are large, but they don't have every single answer, at least not now. Some examples are easier to show than others, but in this case, one of the things is that because this is an instruct model, it's fine-tuned specifically to follow the given instructions. That means even if the model has the answer, it won't answer if the context is not given from the vector database. I'm actually pretty sure that, theoretically, Mixtral should know how big the city of Seattle is. But because we don't have the answer stored in the vector database, it's an instruction-tuned model, and we don't provide that context, it says: hey, I don't have that information, so I can't answer your question. "If I can't find it in my data, is there a way to allow the LLM to find it for me?" I assume this is a follow-up to Anjo's question.
I think I was just talking about this. The answer is yes, you can, and there are two ways. One, you can just rely on the LLM's training data, which may or may not be true and may or may not be up to date. Or two, you can build an agent that will go and scrape the web for you. Conveniently, I recently built an agent and published a blog with a tutorial on it.
We can drop that link in the chat for you. "Which vector DB can cater to financial data?" Well, I think the thing with financial data is that you care a lot about security and privacy, and latency, probably. And probably latency, yes. So what you probably need is something you're going to deploy yourself, on premises; you don't want to send any data around. So you should use Milvus, deploy it on Docker, and put it into the same VPC as everything else you're doing. But even within financial data, there are specific use cases underneath it that you also have to think about, right? Yes, yes.
It could be, say, evaluating whether someone is qualified for a mortgage, or it could be predicting the stock market, and those are two use cases with different sensitivities. Either way, you probably want to deploy something on-prem, and Milvus is open source with a permissive license, so you can do that. Or if you're trying to find anomalies, then you definitely want to make sure you have very strong consistency; that's going to be more important,
because you want to make sure you look at all the data. Yes, and that gives me a perfect segue to talk about the consistency levels in Milvus. Milvus is a distributed system, which means it has instances and replicas, and Milvus has four levels of consistency that you can choose from.
With financial data, you probably want the strongest level, called strong consistency, which guarantees that everything is read after write. "Can you give more weight to some chunks or documents in the vector DB for prioritized retrieval?" You can. What you can do is do this manually: you give each entry a piece of metadata that says, here's the weight, and then you re-rank things by the weight they get.
You can retrieve extra information and re-rank by weight, or you can simply filter on the weights. You can say: this is not something I find very important, so I don't need it in this particular query. So that's one way you can do this. "Is that some chunks per document that he was asking, or just overall chunks or documents?" It says "chunks/documents," so I'm going to guess both. "I was just wondering if that new grouping search could also help;
I think you might be able to do something with that as well." Group by, yes. I don't know if there's a feature to see how many chunks are relevant via group by, but group by will let you search documents rather than chunks, which is really nice if you want to get unique documents back. And maybe you can do something similar where you use group by and also see how many chunks are getting thrown out, which tells you how much of the document is relevant to your question.
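A toy sketch of the manual re-ranking idea just described, assuming each chunk was stored with a hypothetical "weight" metadata field and reusing the vector_store from the demo:

```python
# Over-fetch candidates, then re-rank by metadata weight. With an L2 metric,
# a lower score means a closer match, so divide by (1 + distance).
hits = vector_store.similarity_search_with_score(
    "How big is the city of Seattle?", k=20,  # retrieve extra candidates
)
reranked = sorted(
    hits,
    key=lambda hit: hit[0].metadata.get("weight", 1.0) / (1.0 + hit[1]),
    reverse=True,
)[:5]  # keep the top 5 after weighting
```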
"That was an easy thing to translate; try 'the bat broke along the handle,' translate to French and then back to English." I think we'll take this one offline; let me answer the other questions first. "I don't want my intern to see the CFO data. How do I control that in a RAG system?" That's a really good question, and it's particularly relevant to enterprises. Milvus has this thing called role-based access control,
and that's probably how you would do it. You would say: interns don't get to see the CFO data, but the CFO can see the intern data. That's something you'd want to do through Milvus. "Can you factor in time in the queries?" I think so; I'm not entirely sure I understand the question. Perhaps, if you want to say, "I only want context after a certain time," then remember I showed the data earlier with the publication date: you could say, only show data after this publication date.
You can use hybrid search, or metadata filtering, to do that. So I think the answer is yes, if I understand the question correctly. Martin has a little more info: "if I have RAG, I want newer data to be more relevant." So yes, basically you can filter on the newer data.
You can also give chunks dynamic weights based on how new the data is. You can say: it has to be published within the last 30 days, the last 10 days, the last 10 hours, whatever. So yes, you can do that; it just requires a little bit more tuning. And the question "in news, can the latest news be treated with more relevance?" has exactly the same answer as the one I just gave: yes, it just requires a little bit more tuning.
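A sketch of the metadata-filtering version of this, assuming a hypothetical "publication_date" field stored with each chunk (here as an integer day count) and that the retriever passes a Milvus boolean expression through:

```python
# Only retrieve chunks whose (hypothetical) publication_date metadata
# is newer than some cutoff; the expr string uses Milvus filter syntax.
recent_retriever = vector_store.as_retriever(
    search_kwargs={"expr": "publication_date > 19700"},  # illustrative cutoff
)
```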
Next, the incremental index build: "Can Milvus do an incremental index build without needing to rebuild the index every time we add a new, say, daily portion of data?" So this is exactly what I was saying earlier.
Milvus doesn't need to rebuild the index. You just build the index on the new data, and then when you search, both the new data and the old data are searched. If you want to delete or upsert data, it's deleted out of the old segments and written into the new segments, and it can all be searched at the same time. So the answer is yes, and that's actually how it's done automatically. "How can I do a bulk insert into Milvus, or copy a DB from another install?" Milvus just recently added Parquet support, so you can export your data into a Parquet file and insert it into Milvus.
In this way, you actually skip the write-ahead log, the pub/sub system; the data is published directly into Milvus, and the indexes are created on it as your data arrives. "Also, there's a Milvus backup, right? You could do it that way too." Yes, there's also a way to do a Milvus backup. "What is the best way to structure and store chunks for efficient vector retrieval?" Wow, this is a really good question, and the answer is: nobody knows.
This is something you're going to have to test, and it's going to depend on what your data looks like. For example, if your data is time-sensitive, you're going to want to include publication dates, how long ago it was published, things like that. If your data is very long and in document form, you're going to want chunks that are paragraph-sized. If your data is in conversational form, you'll want to split it up and just have different sentences in there. There are many ways to structure this and store the chunks, and it really all depends on what your data looks like and what you need to do with it. You have a bunch of questions in the chat too.
Oh, "Milvus has good ACLs for financial data too, but you probably need encryption and a local install." Yep. And precision. Okay. "Did you mention a link to a chatbot? Can you share?" I think I mentioned the link to this repo; it's in the chat, and I will share it again.
Okay, and here is the repo this is in. "What is the max batch size Milvus supports during batch insertion of vectors? Does it scale based on system and network requirements?" Yes, it's going to be based on your system and network requirements, and it's going to depend on how large and fast your CPUs are, which CPUs you're using, how many instances you have, all of these different things. But as far as I know, you can insert a gigabyte at a time if you want. "How large can Milvus grow?" Milvus is basically meant to be very, very big.
There are more than 50 projects with over a billion vectors in production. It can grow as large as you want, basically; as long as you have the hardware and the DevOps to support it, it's pretty much indefinite. "How would I put this on a website for people to use who can't work with this code-based interface?" Actually, we have a user, a company, that built a kind of point-and-click way of building a chatbot. They're called MindStudio, so I would take a look at theirs.
They built it so that non-developers can build chatbots, and they have a pretty good UI, so check out how they're doing it. "How large can Milvus grow?" Yep, I just addressed this one.
"Can Milvus store tabular data?" Milvus stores data basically in JSON format, and tabular data can be represented in JSON format, so my answer is yes; you just have to transform it into the right format. Wow.
That was a lot of questions, Yujian. Yes, yes, it was. So, before... oh, here we go. One more. Okay.
We'll take one more question before we go: "How do I convert the information from a relational DB to a vector DB, splitting it into an embedding part and a metadata part automatically?" Normally you'd have to write the pipeline for that yourself. Oh, but we have Zilliz pipelines. Zilliz pipelines can probably do something like this: as long as you pull the data out of your relational database, you can run it through Zilliz pipelines (that goes into Zilliz Cloud), and essentially it will put it into a vector DB and do the splitting automatically for you.
And there are, I think, six models right now that support it. Yeah, there are a lot of models. And it's still free, so try it while it's still free. Cool, I think we got all the questions.
So if anybody has any more questions, hit Yujian up in the Discord channel, or his favorite is LinkedIn; he's always there, so you can always pop some questions in there. And come join us at one of the meetups, or join Yujian at one of his hackathons if you're in Seattle. Definitely go and see him.
He's also known as the guy with the hot pink pants; he wears the pink sweatpants, so you can find him really easily. And if you have any other ideas for content you want us to create, just let us know, and we're happy to build it so you can all start building really cool solutions. Awesome. The recording will be sent once Ms. Saatchi, who is on the call with us, does her final edits, and she'll send a link to everyone.
And we'll make sure you also get the link to the repo, so you can play around with it. If you have any other ideas for that notebook, just let Yujian know. Cool. All right folks, have a great day.