Advanced RAG Optimization To Make it Production-ready
Webinar
About this webinar
We explore effective strategies for optimizing your RAG setup to make it production-ready. We will cover practical techniques such as data pre-processing, query expansion & reformulation, adaptive chunk sizing, cross-encoder reranking, ColBERTv2 rerankers, and ensemble retrieval to enhance the accuracy of information retrieval in RAG systems. We will also dive into evaluating RAG performance using relevant tools and metrics, providing you with a comprehensive understanding of how to optimize and assess your RAG pipeline.
Topics covered
- How to enhance the accuracy of information retrieval in RAG systems
- Advanced techniques to optimize your RAG system for production, including query transformation, ColBERT rerankers, and more
- Evaluating RAG performance using relevant tools and metrics
Well, today I'm pleased to introduce today's session, Advanced RAG Optimization to Build and Ship Production-Ready RAG, and our guest speaker, Aravind, who is a data and product enthusiast with 15 years of experience in data, analytics, and product. He's currently co-founder at Krux AI, makers of the open-source tool RAGBuilder (ragbuilder.io).
He's worn multiple hats in his career: analyst, technology architect, data engineer, product manager, and was previously leading cross-functional teams at Cult Fit and at Meta, or, as you may know it, Facebook. So, welcome Aravind, and take it away. Really looking forward to the presentation. Thank you so much, Stefan, for the intro. Hey, everyone, super excited to be here to be talking to you about RAG optimization.
How many people do we have in, Stefan? Just curious. So it looks like we currently have 24 attendees. Awesome. I think it's Thursday morning, or probably evening, or probably late night in some places, so I really appreciate all of you folks dialing in.
This is a topic that's super close to my heart, and I'm really excited to go through some of the advanced RAG optimization techniques with you all. Before we get started, Stefan already introduced me, but really quickly: I'm Aravind, I'm co-founder at Krux AI. We are building in the RAG optimization space, and we are building an open-source tool.
So I'm super excited to talk a little bit about that as well toward the end of the talk. If any of you want to connect over LinkedIn, there's a QR code on the screen; feel free to reach out, happy to chat. Now, today we are going to talk about advanced RAG optimization. Before we begin, though, I'd love to hear from some of you: what's been the biggest challenge when you have been trying to build RAG? I'm assuming a lot of you have tried building some sort of RAG, either a POC or a prototype, and maybe some of you have also tried to create production-grade applications based on RAG. So I'm really curious to see what some of the challenges are that you all have faced. Data scale, accurate retrieval, chunking efficiently.
Yeah. Yeah. RAG pipeline flow; how do you design the RAG pipeline flow, right? Working with Fed-approved LLMs. Okay, interesting.
Adding a layer of reasoning. Wow, that's super cool, super interesting. Passing the correct contextual data. Yeah.
So, as a lot of you already know, RAG has a lot of moving parts. There are probably a few of you who are new to this topic, so I'll just take 30 seconds to give a quick primer on what RAG is. RAG is a very interesting technique. It stands for retrieval-augmented generation. It's a technique that allows you to connect external knowledge sources or data sources to LLMs in a way where, when a user query comes in, you don't have to rely on the LLM's training data: you dynamically fetch the relevant context from the data source you have connected, put it in the prompt, and answer the user's question. So, for example, assume you're a business and you want to create an AI chatbot on top of your website which can answer customer queries about your products and services.
Now, you can't take an off-the-shelf LLM; it's not going to be aware of your proprietary data, your products and services, so it's not going to be able to answer any customer queries. That's where RAG comes into play. All you do is take this knowledge base that you have, maybe a bunch of PDF documents or textual documents about your products and services, and apply this engineering pattern: you chunk those documents, embed them using some sort of embedding model, store them in a vector database, and then, when a user query comes in, you dynamically retrieve the right pieces of context relevant to the query, put them in the prompt, and send it to the LLM. And now the LLM can answer the question the user has. It's a super interesting technique.
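In code, that naive pattern boils down to roughly the following. This is a minimal sketch: the embedding model name is just a common default, and the final generate() call is a placeholder for whatever LLM client you use.

```python
# Minimal naive-RAG sketch: chunk, embed, retrieve by similarity, prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works

# Offline: chunk your documents and embed each chunk (normally into a vector DB).
chunks = [
    "Our premium plan includes 24/7 support.",
    "Refunds are processed within 5 business days.",
    "The basic plan supports up to 3 users.",
]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

# Online: embed the query and fetch the top-k most similar chunks.
query = "How long do refunds take?"
query_vec = model.encode([query], normalize_embeddings=True)[0]
scores = chunk_vecs @ query_vec            # cosine similarity (normalized vectors)
top_k = np.argsort(scores)[::-1][:2]

# Put the retrieved context in the prompt and send it to the LLM.
context = "\n".join(chunks[i] for i in top_k)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# answer = llm.generate(prompt)            # placeholder: use your own LLM client
```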
It opens up a lot of interesting use cases. It looks like a lot of you are already working with RAG; I'd love to hear more about some of the use cases you are working on. But overall, RAG has a lot of moving parts. As you can see, there's data ingestion, there's retrieval, there's re-ranking, there's generation, and so on. And so there are a lot of different failure points within RAG. The more components or moving parts you have in any engineering system, the more failure points you have and the higher the likelihood that it's not going to work. So typically in RAG, given the number of moving parts, there are a lot of different failure points. For example, starting with the user question itself: the question may not be clear enough, or it may be ambiguous, and that may cause a failure. Or maybe the data you've ingested has not been stored in the most optimal format. I think some of you called out ineffective or inefficient chunking; that can affect the accuracy of the RAG. Similarly, the embedding model plays such an important role in the accuracy of the RAG. The embedding model you're using may not have sufficient semantic expression ability, or maybe you're working in a very specific domain, and as a consequence your RAG is not performing well. Similarly, in the vector database, there are a bunch of different indexing techniques.
I think this is a Zilliz event, so I assume all of you are familiar with Milvus, and it supports a lot of different indexing techniques. So what's the right indexing technique to use for your use case and for the kind of data you're working with? Similarly, on the retrieval side, you might end up with low precision or low recall. And then finally, toward generation: even if you manage to get the right context, you might not get the answer in the right format, or you might have incorrect specificity, or you might have an incorrect answer. So overall, there are a lot of different failure points in RAG, and a naive RAG gets you only so far. When I say a naive RAG, it's basically the barebones version of RAG: you just have a vector embedding of all the data you've stored, and you're doing a simple semantic similarity search on top of that data. So that's the overall backdrop of why we're having this discussion. We want to look at what you can do to make your RAG production-ready, to make it reliable and robust enough that you can put it in front of your customers. That's a very broad topic.
I'm going to spend maybe a minute giving you a broad-strokes overview of the different techniques at a high level. Then we'll take a subset of those techniques and look at them in some detail, and if you have any questions about any specific technique, I'm happy to take that offline and spend time with you. Overall, these techniques sit across four different buckets. First, there's a bunch of techniques that come under pre-retrieval: before you even look at retrieval, in the data ingestion phase, what are some of the things you can do to improve RAG accuracy and performance? You can improve information density, and you can optimize chunking.
You can do a bunch of different transformations on your data so that, when you do retrieval, it's more effective. Within retrieval itself, there are a bunch of different techniques: query routing, recursive retrieval, fusion retrieval, and so on. A lot of you have probably also heard about graph RAG, where you combine graph-based retrieval with vector search. And then in post-retrieval, after you retrieve, there are techniques you can apply to improve the relevance and information density of the retrieved results even further, so that you get the best answer correctness out of the RAG. There's obviously re-ranking; I think a lot of you might have heard about re-ranking.
We'll look at some of the techniques there. Then there are a couple of other things, like contextual compression, corrective RAG, et cetera. And finally, in generation, there's a bunch of things you can optimize as well: the top-k chunks that are used in the prompt; prompt optimization, of course (a lot of you are probably familiar with DSPy); and things like Self-RAG, among others. So let's look at a couple of techniques within each of these. And then what I want to spend time with you on is how you navigate this space; there are just so many techniques and so many moving parts of RAG.
How do you navigate this space, given a data set and a use case? What's the right approach to build a really good, production-grade, reliable, robust RAG that you can ship to production? Now, within pre-retrieval optimization, there's one very basic, fundamental technique that I don't see a lot of teams using, so I want to talk really quickly about it. With RAG, you're typically working with a lot of unstructured data, and unstructured data is generally in a form that's not optimized for a RAG setup, not optimized for search and retrieval. Think of it as paragraphs and paragraphs of information where the factual density is very low: you might have one relevant fact, but it's spread across three different paragraphs. So one very basic problem that arises when you take unstructured data in its original form is that you end up with a lot of chunks in the LLM context, even if you have a very efficient retrieval approach. That increases the probability of an incorrect response: the more chunks you have, the higher the probability of an incorrect response. And more importantly, you end up with higher token usage, and your cost is really high. So one very basic thing to do here is to improve the information density, where you get rid of irrelevant information and noise and do information deduplication. One interesting case study I came across related to this approach is that of a financial institution that wanted to create a customer-facing chatbot on top of their products and services. This company had a lot of HTML pages that they were using as the data source. One very basic thing to do there is to programmatically remove all the CSS and HTML tags. So the first box you see here is the raw HTML; the natural and obvious thing to do is to get rid of the HTML tags.
But even with that, the information density was low. So what these folks did is use GPT-4 to do one level of data processing: they sent this raw form of the data through GPT-4 to improve the information density. They used a prompt along the lines of: "You are helping build a knowledge base; for each paragraph, condense it to the most factual and relevant information." That's what this company ended up doing, and as you can see, while the accuracy went up, what was really interesting was the cost reduction: they saw a 4x reduction in token cost because of this approach. One caveat here: since you're using an LLM to do this information pre-processing, there's a risk of information loss, because the LLM may hallucinate. So it's recommended that you do this with a really powerful, GPT-4-grade LLM. So that's number one.
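Here's a minimal sketch of that kind of LLM condensation pass. The prompt is a paraphrase of the one described in the talk, and the model name and helper function are illustrative assumptions, not the institution's actual pipeline.

```python
# Sketch: use a strong LLM to boost information density before chunking/embedding.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

CONDENSE_PROMPT = (
    "You are helping build a knowledge base for retrieval. Rewrite the passage "
    "below, keeping only factual, relevant information and removing boilerplate, "
    "navigation text, and repetition.\n\n{passage}"
)

def condense(passage: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # use a GPT-4-grade model to limit information loss
        messages=[{"role": "user", "content": CONDENSE_PROMPT.format(passage=passage)}],
        temperature=0,
    )
    return resp.choices[0].message.content

# dense_chunks = [condense(p) for p in stripped_html_paragraphs]  # then embed these
```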
Another technique that's relevant in pre-retrieval optimization is query transformation. We humans are inherently bad at asking questions; we typically don't phrase them in the right way. So you'll often have ambiguous or poorly worded queries. And a lot of the time, your RAG is facing customers who are probably not very technical; your end users are going to be laypeople, so their queries are often going to be ambiguous, poorly worded, or really complex. How do you navigate that? You do some kind of query transformation: you take the overall context, or a slight bit of history of the conversation, along with the query you have, and you use another LLM to rewrite it into a more relevant and more effective query, and then use that for retrieval. Hence, pre-retrieval optimization. In this particular example, a customer was having a conversation with the chatbot about interest rates for CDs, and then suddenly the conversation moved to credit cards: which credit card is good for travel? And now the customer asks, "Tell me more about the interest rate for that." This query alone is going to be very bad at performing retrieval, because there's "interest rate" in the query, but we don't know which product the customer is talking about. So it's very important to do this kind of pre-processing. You would use a prompt along the lines of: "You're examining a conversation between a customer and a chatbot; construct a search query that will be used to retrieve the relevant documentation." What it would do is take the history of the conversation and convert it into a better query. So that's query transformation. In the case of complex queries, it makes sense to break those down into subqueries, perform retrieval on each subquery individually, and then combine the results.
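A minimal sketch of that rewriting step might look like the following; the prompt wording and model name are assumptions for illustration.

```python
# Sketch: fold recent conversation history into a standalone retrieval query.
from openai import OpenAI

client = OpenAI()

REWRITE_PROMPT = (
    "You are examining a conversation between a customer and a chatbot. "
    "Construct a standalone search query that will retrieve the documentation "
    "needed to answer the customer's latest message.\n\n"
    "Conversation:\n{history}\n\nLatest message: {message}\n\nSearch query:"
)

def rewrite_query(history: list[str], message: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": REWRITE_PROMPT.format(history="\n".join(history), message=message),
        }],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# "Tell me more about the interest rate for that" (after a credit-card exchange)
# becomes something like "travel credit card interest rates", which retrieves
# far better than the raw follow-up question.
```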
On that note, let's talk about ensemble retrieval, or fusion retrieval. The idea behind this technique is that, a lot of the time, one specific retrieval approach is not going to be sufficient to answer the user's question in the most effective way possible. So you want to leverage the power of multiple techniques. As an example, in this diagram we have a vector index, which is a dense retrieval method: when a user query comes in, we do a semantic similarity search on it and get the top-k results. But maybe that alone is insufficient. Maybe the user had certain abbreviations or acronyms in the query, or some really important keywords that are not semantically represented in our vector database. So what you want to do is combine the vector search with a keyword search, using some kind of sparse retrieval method; BM25 is the most popular one.
Basically, you use the same query in parallel: you send it to this other retriever, which does BM25-based keyword search, and maybe you end up with a slightly different subset of results there. And then you combine the results. Now, how do you combine the results? You can use a technique called reciprocal rank fusion; there's a minimal implementation sketched below.
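Here's that minimal, self-contained implementation of reciprocal rank fusion. The k=60 constant is the commonly used default from the information retrieval literature, not something specified in the talk.

```python
# Reciprocal rank fusion: score(d) = sum over result lists of 1 / (k + rank(d)).
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """rankings: each inner list holds doc IDs from one retriever, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a BM25 result list with a vector-search result list:
bm25_hits   = ["doc3", "doc1", "doc7"]
vector_hits = ["doc1", "doc4", "doc3"]
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# doc1 and doc3 rise to the top because both retrievers agree on them.
```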
It's a very popular algorithm, nothing new; it's been around for a while and is well known in the information retrieval world. Essentially, you calculate a weighted score that combines the results from both retrievers, and then send them to the LLM. There's a case study here — well, not a case study exactly, but an experiment that was done by Microsoft. As you can see, there's a comparison of keyword search versus vector search versus hybrid search, and the hybrid search clearly performed better. That's kind of obvious: we're combining the strengths of two different methods. What's interesting in this case is that they also had a fourth option, where they took the hybrid retrieval but also performed a rerank on top of it.
And that was even better: almost a 20-percentage-point improvement in the overall search results. So that's fusion retrieval. Now, speaking of re-ranking, there's one really interesting technique, and probably a lot of you have already heard about it: cross-encoder re-ranking. What you do there is a first-pass retrieval using the bi-encoder approach. The bi-encoder approach is where you embed the document separately, as an offline process, compressing it into one single vector; that's done as a pre-processing step, like we discussed earlier. When your query comes in, you do the same thing: you pass it through the model, and now you have one single vector representation of that query as well. And then you basically just do a cosine similarity search.
You end up with a bunch of documents that are very similar to the query that has come in. However, the problem with this approach is that, since we're taking an almost page-long or paragraph-long piece of information, embedding it, and compressing it into one single vector representation, there's possibly some loss of semantic meaning. So when we do the similarity search, it's probably not going to be as effective. The flip side of this is the cross-encoder approach. With a cross-encoder, you send the query as well as the document together to the model, which basically performs a classification to identify whether the two texts are similar or not. That method preserves a lot of the good aspects of transformers: it has very semantically rich information with which to figure out, with a very high level of accuracy, whether one document is related to the other, the other being the query in our case; assume document B is the query. Now, the problem with the cross-encoder is that it's very expensive to run, and the reason, like I said, is that you have to pass the document and the query together through the transformer; it's an early-interaction model. So it's super expensive. When you're working with, say, a billion documents, this is just not going to scale: when a user query comes in, who's going to wait half an hour for the answer? So what do you do? One approach is a two-pass method: you use the bi-encoder to come up with, say, a hundred candidate documents that are potentially the most similar to the query, and then you use the cross-encoder on top of those hundred documents. So you're only re-ranking a very small subset of the overall documents with the cross-encoder. Here's roughly what that second pass looks like in code.
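A sketch of the two-pass pattern, assuming the sentence-transformers CrossEncoder API and a common public reranker checkpoint (neither is mandated by the talk):

```python
# Two-pass reranking: cheap bi-encoder retrieval gets ~100 candidates,
# then an expensive cross-encoder re-scores just that small subset.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    # The cross-encoder scores each (query, document) pair jointly,
    # preserving the token-level interactions a bi-encoder throws away.
    pairs = [(query, doc) for doc in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

# reranked = rerank(query, first_pass_docs)  # first_pass_docs from the bi-encoder
```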
With that, you have a much better likelihood of finding the right documents, the right context, to answer the question. Here's a block diagram that shows it: you have millions of documents, you perform a semantic or keyword search, you fetch a hundred results as a first pass, and then you take the query and those hundred documents and perform cross-encoder re-ranking. Now you have a re-ranked result set; again, maybe you take a subset, the top five or top ten, and pass it to the LLM, and your results are going to look much, much better. Now, that's great, but like I said, the cross-encoder rerank is quite expensive. So a third approach here is to use ColBERT. The ColBERT approach is very interesting: what you do in this case is a middle ground between the bi-encoder and the cross-encoder. In the bi-encoder, there's no interaction between the query and the document.
With a cross-encoder, there's very heavy interaction between the query and the document; we call it early interaction. With the ColBERT approach, you do a late-interaction approach. What does that mean, exactly? In the ColBERT approach, you try to preserve the token-level embeddings. You don't take the document and compress it into one single vector; rather, you create token-level embeddings and retain them at the token level. When a query comes in, you again embed it at the token level. Then, for every token in the query, you figure out which tokens across this entire corpus are semantically similar, and you perform a MaxSim (maximum similarity) operation, which is very efficient and easy to do. And I see Stefan has just pinged a bunch of documents about using ColPali, which is ColBERT on top of visual data, with Milvus.
Thanks for that, Stefan. Just to quickly show you an example of this: let's assume this is your query, and each of these is a token; say there are three tokens in your query, and you have these documents. What you're doing is figuring out, for every token in the query, what the MaxSim is over all the document tokens.
And so you see that it's 9.7 in this case, 8.4 in this case, and 8.5 in this case. You take the MaxSim for each query token and basically aggregate that. What this helps with is that you're still retaining a lot of the semantic richness of your document: by keeping it at the token level, you preserve that semantically rich information. But at the same time, this is a much more efficient operation, because you're just doing a maximum operation, which is very easy and cheap to compute. So it's a great technique, and it's showing some great results. I'll come to the flip side in a second; first, below is a tiny numeric sketch of that MaxSim scoring.
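Here random vectors stand in for learned token embeddings; the point is the shape of the computation, not the numbers.

```python
# ColBERT-style MaxSim: for each query token, take the max similarity over all
# document tokens, then aggregate. Random vectors here are just for shape.
import numpy as np

rng = np.random.default_rng(0)
query_tokens = rng.standard_normal((3, 128))    # 3 query tokens, 128-dim each
doc_tokens   = rng.standard_normal((20, 128))   # 20 document tokens

# Normalize so the dot product equals cosine similarity.
q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)

sim = q @ d.T                              # (3, 20) token-to-token similarities
maxsim_per_query_token = sim.max(axis=1)   # best-matching doc token per query token
score = maxsim_per_query_token.sum()       # the document's relevance score
print(maxsim_per_query_token, score)
```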
Now, here's a real example of the ColBERT model. Assume this is your question, and this is the document you have. As you can see, the transformer-to-transformer match will be the highest, because they're the same word. But then you have cartoon matching with animated, when matching with August and 1986, and come out matching with released. So it's basically doing token-level matching, so to speak, and that makes it highly relevant and highly accurate in the context of a RAG. Now, the flip side of this is that it takes a lot of storage space: because you're storing embeddings at the token level, storage can be a really big challenge. That's where ColBERTv2 came into play. ColBERTv2 uses a centroid-based approach: since you have this semantically rich information at the token level, you can do clustering on top of those token-level embeddings to create centroids, then do a first-pass match against those centroids to figure out which tokens are closest, and from the tokens you get to the documents. That results in a much more efficient approach. A rough sketch of that two-stage idea follows.
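This sketch is heavily simplified relative to the real ColBERTv2 index (which also compresses residuals); the data shapes and scikit-learn k-means are stand-ins for intuition only.

```python
# Sketch: cluster token embeddings into centroids, route query tokens to their
# nearest centroids, and only fully score documents owning tokens in those clusters.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
doc_token_vecs = rng.standard_normal((1000, 128))   # all token embeddings in the corpus
token_to_doc = rng.integers(0, 50, size=1000)       # which document each token came from

kmeans = KMeans(n_clusters=32, n_init=10).fit(doc_token_vecs)

def candidate_docs(query_token_vecs: np.ndarray, n_probe: int = 2) -> set[int]:
    # First pass: for each query token, find its closest centroids, then collect
    # the documents whose tokens live in those clusters.
    dists = kmeans.transform(query_token_vecs)       # distances to each centroid
    docs: set[int] = set()
    for row in dists:
        for c in np.argsort(row)[:n_probe]:
            docs.update(token_to_doc[kmeans.labels_ == c].tolist())
    return docs  # run the full MaxSim scoring only on these candidates
```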
So, that's cool, but we've talked about a lot of different approaches and techniques. Let me pause here and ask you folks really quickly: how are you figuring out which is the right technique to use? For example, how do you figure out the right chunking approach, or the right chunk size for the data set you're working with? What technique are you using, or how are you figuring out what value to use: what chunk size, what chunking strategy, what kind of retrieval? Can you type in the chat and share how you're figuring out the optimal setup for your RAG? Trial and error. Yeah. Not there yet. Okay, got it. So trial and error is the most common way; I mean, there's really no other way, right? How do you even know which chunking strategy or chunk size is going to be better for you? Someone is mentioning here that they chunk based on relevance and create separate documents based on departments or the topic of the information. Cool.
Yeah. But there's no way to generalize, right? Like, "hey, if you're working with financial documents, use chunking approach X, Y, Z": it's just not possible to make that kind of generalization, because every data set is unique, and there's no one-size-fits-all RAG. So creating an optimized RAG is actually really, really hard, and it takes a lot of trial and error, which takes a lot of time and effort. As an example, let's say you're working on a specific RAG use case where you have to pick from five different chunking methods, five different chunk sizes, five different embedding models, and so on. Does anyone know how many RAG configurations this would produce, in terms of the combinations? It's a math question. Let's see how many of you can guess. Five raised to the seventh. That's right.
That's right. Yeah. So this would create five to the power of seven, which is 78,125 different RAG configurations. And if it took you just five minutes to try each one out, it would still take you about 271 days of nonstop trial and error. Here's the arithmetic, for the curious.
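```python
# 5 options across each of 7 RAG components, 5 minutes per trial.
configs = 5 ** 7                          # 78,125 configurations
minutes = configs * 5
print(configs, round(minutes / 60 / 24))  # 78125 configurations, ~271 days
```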
So clearly it's impractical to figure out the optimal RAG setup for your data set and your use case using a manual trial-and-error approach. This is why we ended up building RAGBuilder. RAGBuilder is a toolkit that performs hyperparameter optimization on the different moving parts of the RAG. For example, it's going to figure out the right chunk size for your data set by trying out a bunch of different chunk sizes, chunking strategies, embedding models, and so on. With every trial, it figures out what's working and what's not, and converges on the most optimal set of values across each of these parameters. Additionally, it comes with a bunch of predefined RAG templates, and it has the ability to do synthetic test data generation using RAGAS. I don't know if all of you are familiar with RAGAS, but it's another open-source library, specifically focused on RAG evaluation. So there's synthetic test data generation; we support graph RAG as well; and we're working on a couple of interesting roadmap items, like the ability to bring your own component, a cloud-hosted option, et cetera. It's completely open source.
So do check it out. Maybe I can do a real quick demo so that you folks can get a feel for what it looks like. Awesome: Stefan is going to cover RAGAS in our upcoming webinar. Fantastic.
Cool. So installation is very straightforward. If you go to ragbuilder.io, there's a one-line command you can copy-paste, and you're good to go. It basically installs a couple of brew libraries and does a pip install of RAGBuilder.
Once you've installed RAGBuilder, you just go to your terminal and type in "ragbuilder", and it spins up a uvicorn FastAPI app that runs locally; it's going to bring up the browser in just a bit. Like I said, all of it runs locally: if you look at the IP address here, it's 127.0.0.1, so it's running locally.
So if you're concerned about data leaving your system or your platform, this is a pretty neat solution: unless you're using a third-party AI service like OpenAI or Cohere, data will remain in your system. Now, let's say we're building a RAG. I click on New Project and give it a description. Let's say I'm building a documentation chatbot; for a second, let's assume we're LangChain, we have a ton of documentation, and we want to make it easier for our developers to access it via an AI chatbot. So let's create a documentation chatbot.
Now, I've already scraped the LangChain documentation, so there are hundreds of markdown files that I've put in this directory. But this field is pretty versatile: you can point it to a URL, to a directory (which can obviously hold multiple files), or to a specific file. Now, if your data set is large, it'll automatically scan it and tell you that your data set is kind of large; I've set the threshold here really low.
That's why, even for roughly a one MB file, it's saying the data set is large. But if your data set is large, you want to start with sampling, so that you don't burn a lot of tokens and you get to iterate faster. So that's a pretty nifty thing. Like I said, we come with a bunch of predefined templates, and with these templates you can go from zero to one really quickly: if you want to create a graph RAG or a hybrid RAG, it's very easy to do that. There's a bunch of templates out here. But where this tool really shines is the second option, where you can create a custom RAG configuration from the ground up. Let me show that to you real quick.
When we go next here, you can tailor the individual components at a more granular level on this screen. For example, if you wanted to consider chunk sizes between 500 and 3,000, you can just use the slider, and it's going to figure out the optimal chunk size within that range. It's going to try out all of these different chunking approaches and figure out which one is optimal. If you want to consider multiple embedding models, you can do that; alternatively, if you have a very specific embedding model you're working with, you can use that. We support Hugging Face, and if your model is hosted on a cloud, you can do that too.
Yeah, let's select Milvus. We support a bunch of different vector databases, but obviously Milvus is pretty good. It also supports a bunch of retrievers. We spoke about several retrieval optimization approaches; for example, we spoke about the multi-query retriever, where it breaks the query down into multiple versions, retrieves documents based on each version of the query, and then does reciprocal rank fusion on the results. You can do all of those things: you have the BM25 retriever, vector similarity search, and so on.
It's going to combine all of those, or try them individually, and figure out the optimal retriever setup. Then there's obviously your top-k parameter. And within re-ranking as well, like we just discussed, there are so many different techniques: there's your cross-encoder reranker, obviously, like Cohere, BGE, Jina, and so on. But there's also the ColBERT reranker, the one we just talked about, and a bunch of other options. And then finally, your LLM as well.
So we can select the LLM of your choice: if you want to use a hosted model, or if you want to use Ollama for a model you're running locally, like a Llama 3.1 or 3.2, you can do that as well. Like I said, we do Bayesian optimization, so it's not going to do a grid search or a brute-force search over all of these parameters. It's going to be intelligent: it starts with a random set of values, but figures out with each run what's working and what's not.
How does it do that? By running RAG evaluation on the fly. Basically, you provide a data set; think of it as a golden data set. I don't know if all of you are familiar with the term, but a golden data set is just the set of question-answer pairs you can use to evaluate a RAG and figure out its accuracy. A golden data set typically covers all the different scenarios; it's comprehensive enough. If you have that kind of data set, you just bring it in and point the tool to the file that has it. If you don't have that data set, which in the majority of cases you don't have lying around, then like I said, we can generate it synthetically for you from your source data set. In this case, I had the LangChain documentation as the source data set, and we can generate synthetic test data to simulate what questions users might ask and what the ground truth is for those questions. So we can generate that synthetically.
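For reference, here's a minimal sketch of scoring a RAG pipeline with the RAGAS library mentioned above. The API shown matches the ragas 0.1.x interface and has changed across releases, so check the version you have; evaluation also needs an LLM configured (an OpenAI key by default). The sample rows are made up.

```python
# Sketch: evaluate question/answer/context triples with RAGAS metrics.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

eval_data = Dataset.from_dict({
    "question":     ["How do I install RAGBuilder?"],
    "answer":       ["Run the one-line install command from ragbuilder.io."],
    "contexts":     [["Installation: copy the one-line command from ragbuilder.io ..."]],
    "ground_truth": ["Use the one-line install command on ragbuilder.io."],
})

result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)  # per-metric scores between 0 and 1
```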
Meet the Speaker
Join the session for live Q&A with the speaker
Aravind Parameswaran
Co-Founder at Krux AI
Aravind Parameswaran is a data & product enthusiast with 15+ years of experience in data, analytics & product. He's currently co-founder at Krux AI, makers of the open-source tool RAGBuilder (ragbuilder.io). He has worn multiple hats in his career: analyst, technology architect, data engineer, product manager, and was previously leading cross-functional teams at Cult Fit and at Meta/Facebook.