About the session
Retrieval Augmented Generation (RAG) enhances chatbots by incorporating custom data in the prompt. Using large language models (LLMs) as judges has gained prominence in modern RAG evaluation. This talk will demo Ragas, an open-source automation tool for RAG evaluations. Christy will talk about and demo evaluating a RAG pipeline using Milvus and RAG metrics such as a context F1-score and answer correctness.
Topics Covered
- Foundation Model Evaluation vs RAG Evaluation
- Do you need human-labeled ground truths?
- Human Evaluation vs LLM-as-a-judge Evaluations
- Overall RAG vs RAG component Evaluations
- Example of different Retrieval methods with Evaluation
- Example of different Generation methods with Evaluation
So today I am pleased to introduce today's session, RAG Evaluation with Ragas, and our guest speaker, Christy Bergman. Christy is a passionate Developer Advocate at Zilliz. She previously worked in distributed computing at Anyscale and as a Specialist AI/ML Solutions Architect at AWS. Christy studied applied math, is a self-taught coder, and has published papers, including one with ACM RecSys. She enjoys hiking and bird watching. Welcome, Christy, and let's get started with our session today.
Thank you, Saachi. Good morning, everybody. I'm Christy Bergman, a Developer Advocate at Zilliz, the maintainers of the open-source vector database Milvus, which I'll be using in today's demo. And this is my LinkedIn.
Please feel free to follow me or connect with me. So just a few words about us: Zilliz is the maintainer of the open-source vector database Milvus, and you can think of vector databases as either open source or closed source. Our company has a foot on both sides. We have open-source Milvus, which was donated to the Linux Foundation. It has a very generous Apache 2.0 license, which means you can build on top of open-source Milvus and even charge money for your products.
So that's a Linux Foundation managed project, which means we can't change that licensing even if we wanted to. And then Zilliz Cloud is the managed Milvus, which runs on AWS, GCP, or Azure. We do have free-tier versions of Zilliz Cloud as well, so you can get started on either side for free.
People have probably heard of these three pillars of generative AI. There's the computation: the CPUs, GPUs, TPUs, and even Groq LPUs. Then the models are super important: for example, the GPTs from OpenAI and Meta's Llama 3, which I'll be demoing today. And then there are the embeddings, which we think are the most important thing, because that's where the unstructured data comes in.
So we think there's a lot of opportunity in unstructured data, and I'm going to be jumping into that today. Also keep your eyes out for the Unstructured Data Summit in Q4 of this year. Alright, so my talk today: I'm going to do a quick introduction to RAG, then talk about a couple of challenges with RAG, and then do a demo of RAG evaluation methods.
Up until just a few weeks ago, I was able to go into Gemini and ask it, for example, to list three politicians born in New York, and one of the names on the list would be Hillary Clinton. But if you go to Wikipedia, you can see that she was born in Chicago.
So that's an example of a hallucination. And then as of a couple of weeks ago, I noticed that it changed, and I can't get anything about elections or political figures anymore. So why do these LLMs hallucinate? Because they are trained on sequences of words, or tokens, to predict next tokens or masked tokens. For example, if you were to predict "the chicken walked across...", it's probably "the road". But these large language models are trained on so much data, possibly including Reddit, for example, and you can imagine there are all kinds of unusual things inserted into that training data. And then where do the vectors come from? The vectors come from deep learning neural networks — not the very last layer, which is usually a classification layer mapping the inputs into, say, nine different classes,
but the next-to-last layer. The reason for that is researchers have learned that this is where all of the trained weights' knowledge is stored. You can think of the whole neural network as one big function f(x), where the x's are the unstructured data inputs and f is the transformation through all those trained weights into that last embedding layer: f(x) = y. The y's are vectors, and those y's represent the knowledge of the model. You take those vectors, store them in a vector database, and then you can retrieve them.
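As a minimal sketch of that idea (not the exact model or code from the talk), here is how you might pull embedding vectors out of an off-the-shelf encoder with the sentence-transformers library; the model name is just an example:

```python
from sentence_transformers import SentenceTransformer

# Any encoder works here; BAAI/bge-large-en-v1.5 is one example from Hugging Face.
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

sentences = [
    "Milvus is an open-source vector database.",
    "HNSW is a graph-based vector index.",
]

# encode() returns fixed-length vectors from the model's embedding layer (y = f(x)).
vectors = model.encode(sentences, normalize_embeddings=True)
print(vectors.shape)  # (2, 1024) for this particular model
```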
So you can use them for RAG. This is the RAG pattern: you take your data and embed it using an embedding model, then you store those vectors in a vector database, and typically that part is done offline first. Then you deploy your chat app, with the knowledge base already filled in.
The user comes along in real time and asks a question. That question needs to be encoded using the same embedding model. And because vector calculations are super fast, the approximate nearest neighbors can be calculated very quickly on the fly, in real time. Whatever is retrieved from the vector database we call the context, and we stuff it along with the question into the prompt.
Then that prompt goes to another model, the generative model — what we usually just call the LLM, like the ChatGPTs — and hopefully you get back a more reliable answer.
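Here is a hedged, minimal sketch of that end-to-end pattern, using pymilvus' MilvusClient and the OpenAI chat API; the collection name, file path, and prompt wording are illustrative assumptions, not the talk's actual code:

```python
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer
from openai import OpenAI

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")  # example embedding model
milvus = MilvusClient("rag_demo.db")                      # Milvus Lite, local file
llm = OpenAI()                                            # needs OPENAI_API_KEY

# 1. Offline: embed the documents and store them in the vector database.
docs = ["Milvus supports L2, IP, and COSINE distance metrics.", "HNSW is a graph index."]
milvus.create_collection(collection_name="docs", dimension=1024)
milvus.insert(
    collection_name="docs",
    data=[{"id": i, "vector": embedder.encode(d).tolist(), "text": d} for i, d in enumerate(docs)],
)

# 2. Online: embed the question with the SAME model, retrieve nearest neighbors.
question = "What distance metrics does Milvus support?"
hits = milvus.search(
    collection_name="docs",
    data=[embedder.encode(question).tolist()],
    limit=2,
    output_fields=["text"],
)
context = "\n".join(h["entity"]["text"] for h in hits[0])

# 3. Stuff context + question into the prompt and generate an answer.
answer = llm.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
)
print(answer.choices[0].message.content)
```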
Alright, so let's talk about a few challenges with RAG. Number one is choosing your embedding model. A common place to get started is the Hugging Face MTEB leaderboard: you can sort it by the retrieval task, see the top-ranked models, and choose one. I don't want to talk too much about this, but one surprising development this year is that people used to think the longer the embedding vector, the more accurate it would be. With OpenAI's announcement in February of this year, they showed that using a technique called Matryoshka Representation Learning (MRL) during training, you can take an embedding model and use either 512 or 1536 dimensions and get essentially the same accuracy from the smaller-dimensional vector.
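As a hedged sketch of what that looks like in practice (assuming the text-embedding-3-small model and the OpenAI Python client), you can simply request a truncated dimension at embedding time:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

text = "How do I create an IVF_FLAT index in Milvus?"

# Full-size vector for text-embedding-3-small: 1536 dimensions.
full = client.embeddings.create(model="text-embedding-3-small", input=text)
# MRL-trained models let you ask for a shorter vector with nearly the same retrieval quality.
short = client.embeddings.create(model="text-embedding-3-small", input=text, dimensions=512)

print(len(full.data[0].embedding))   # 1536
print(len(short.data[0].embedding))  # 512
```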
This is important because if you've got a vector a third of the length and you're still getting the same accuracy, that means a smaller amount of data to store, and it also means lower latency, or faster response times. Another challenge is choosing your index. If you go to our docs, you'll see a whole bunch of different indexes to choose from depending on your use case. FLAT means you're going to do an exhaustive search.
That's good if you've got very small data; it's a brute-force search. The "approximate" part of ANN, approximate nearest neighbors, is because normally, for very large data, the search is stochastic, meaning you use a data structure for the index. I have a few slides on this.
For example, IVF_FLAT uses clustering as its data structure, and you specify how many clusters you want to search. The HNSW index type uses Hierarchical Navigable Small World graphs — hierarchical graphs with a sparse graph at the top. As you enter the top layer there aren't many nodes to search, so the nearest-neighbor search there is super fast; then you move to the next layer, which is a little denser, and continue on that way. Because you're jumping from layer to layer, the overall search is very efficient — it's O(log n), actually one of the most efficient search algorithms for large data. But there are different reasons why you might choose different indexing strategies.
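Here is a hedged sketch, not code from the talk, of what choosing an index looks like with pymilvus; the collection and field names and the parameter values are illustrative assumptions, and index support varies by deployment (Milvus Lite supports fewer index types than a full server):

```python
from pymilvus import MilvusClient

client = MilvusClient("rag_demo.db")  # Milvus Lite; a server URI works the same way

# HNSW: graph index, fast and accurate for large collections.
hnsw = client.prepare_index_params()
hnsw.add_index(
    field_name="vector",
    index_type="HNSW",
    metric_type="COSINE",
    params={"M": 16, "efConstruction": 200},
)

# IVF_FLAT: cluster-based index; nlist clusters at build time, nprobe searched at query time.
ivf = client.prepare_index_params()
ivf.add_index(
    field_name="vector",
    index_type="IVF_FLAT",
    metric_type="L2",
    params={"nlist": 128},
)

# Attach whichever index suits your data size and latency budget.
client.create_index(collection_name="docs", index_params=hnsw)
```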
And then finally, the one I'm going to delve into in my demo today is chunking. You can think of chunking in today's RAG pipelines as something like the data featurization step that used to be part of the traditional data science pipeline, where features like red, green, blue need to be mapped to 1, 2, 3, for example. The way you do that featurization of turning data into numbers can change the accuracy of your model. Today's equivalent in RAG is this chunking step, especially for text data. It's still a little bit of an art, so I'll jump into it a bit more.
The first entry point is to think about what kind of data you have. Depending on whether it's conversation chats, or documents (which I'm going to demo today), or question-and-answer pairs, or lecture data, there are different techniques that are the most effective way to handle that kind of data. I'll talk about three chunking methods today. The first chunking method depends on the structure of your data: for example, if you've got HTML data, which I have, the HTML page probably has headers in it, which give it natural structure. It is kind of tricky, though, because a lot of webpages, including our own Milvus docs, don't have straightforward H1s and H2s, which I'll show in my demo — so maybe this technique doesn't work for everybody. But at a high level, the idea is that a webpage probably has a title and headers,
and within each header you've got text. If you just take the naive text and only embed those chunks, you lose some context. Whereas if you put the headers into the text — and they're usually very small, so you're not losing much of the chunk size — it's like a little hack where you get a cheap, short way of putting more context into each chunk. Another technique is called small-to-big, where you have the small chunks of text and you also map them to the parent documents of those chunks. So maybe you've got 512 as your chunk size, and you map each chunk to its parent document. The idea is that you do vector retrieval at the small level, then document retrieval at the big level, and you send that parent document to the LLM for generation.
Okay, I'm going to skip over here. This is another fun slide to think about when you're trying to optimize your RAG pipeline. We had another speaker, from LangChain, at one of our unstructured data events, and he spoke about these needle-in-a-haystack experiments. In particular, an arXiv paper showed that just adding more documents into the LLM's context window isn't always better. The y-axis on the chart from the paper is accuracy, and the x-axis is the number of documents you add, as well as the position of those documents. In general, the more documents you add, the more you risk losing accuracy.
But there's a curious curve there, which led people to do more experiments — the needle-in-a-haystack experiments — showing that you get the most accurate retrieval at the beginning and at the end of the context. And then Lance Martin, who came and spoke to us in San Francisco, said there's even a bit more refinement to that: the highest accuracy is really at the end. And then finally there are the LLMs themselves to think about in your RAG pipeline. I really like to follow Maxime Labonne on Twitter. What he does is take the LMSYS leaderboard for foundation models and make a nice visualization where green are the open-source models and red are the closed-source models.
People were pretty excited tracking this for a while, because it seemed like the gap in accuracy between the two kinds of models was decreasing. And then we finally crossed over that gap this year with Llama 3 — actually, no, sorry, the crossover was Claude. And just a few words about evaluation: we've got encoder models and decoder models, which are different types of models.
For example, I might be using an encoder model from Hugging Face, and I might be choosing GPT-4o from OpenAI. The thing to keep in mind is that the foundation models on the much-tweeted-about leaderboards have performance that's closely tracked by researchers, but a model — whether an embedding model or a generative model — that appears high on a research leaderboard might not be the model that performs best on your particular system, because your particular data is different. It might be that a much lower-ranked model performs just as well as a higher-ranked one, which means you might be able to get by with smaller embeddings, or a local model, to save money and latency in your system.
This is why evaluations are really important to do on your own data, in your own RAG system. And it is a tricky topic. Some things to keep in mind: these LLM evaluation methods are based on a concept that came out in 2023, LLM-as-a-judge, from UC Berkeley's Sky Lab, with Ion Stoica. This really changed things, because before that paper it was thought that evaluation needed to be done with human annotations.
And that was really expensive. It used to be that people thought you needed thousands of data points — thousands of human-labeled data points. But the paper showed that GPT-4 is actually about as good as a human. And this is because there's stochasticity even among humans: not all humans will agree with each other on which answer, A or B, is better. In fact, humans tend to agree with each other about 80% of the time.
And if you use an LLM as a judge, or as a critic, in these evaluation methods, they also agree with each other — and with humans — about 80% of the time. So the concept is that you can replace a human with an LLM. This has really changed evaluation and made it possible to do evaluations in an automated way, which I'm going to show. And then there's also the number of data points: you don't need thousands of human-labeled data points anymore.
That means you can get by with maybe 20 human-labeled data points, which makes evaluation a lot more approachable. There are a few caveats about LLM-as-a-judge, though. We know that if you ask it to score different types of things — is the answer correct, is the style the way you want, is it complete or comprehensive — LLMs are very good at the first two categories, but they're not so good at the completeness type of score. So you should not ask that type of question in an LLM-as-a-judge evaluation. The other thing we know about LLMs as judges is that if you ask for a score on a range of, say, one to ten — which, as most people who work with surveys know, isn't actually a good scoring system, but say that was the scoring system — then humans tend to give midpoint answers, as you can see in the charts on the left, whereas LLMs tend to answer at the extremes.
So either zeros, or the maximum scores. That's taken into account in Ragas, the open-source package I'm going to use: it scores between zero and one, on a continuous scale. So this is the explodinggradients GitHub project, the open-source Ragas, which I'm going to use. What they've done is treat RAG both as a whole system — evaluating the entire RAG system end to end — and component-wise.
Component-wise, you can think of that RAG diagram as a set of four components. There's the user's query; then there's what gets retrieved from the nearest-neighbor search, the chunks that should answer that question — and you can measure whether the retrieved chunks cover the main points of the question or not. Then you've got the LLM, the generative model, which generates the output, and you can measure whether the LLM was faithful to the context, for example.
And then you've got the ground-truth answer. Between all of these components of a RAG system, you can get different metrics out. Right — my demo. Maybe I should make this a little bigger. Okay. So I have this demo; I can share the GitHub link later.
It's on our GitHub — if you search for Milvus Bootcamp, the notebooks are out there on GitHub. So here's my RAG diagram: bring in the data, pass it through an embedding model, save the vectors; a question comes along, and a similarity search runs between the embedding of your question and your knowledge base of vectors. What's retrieved from the knowledge base you stuff into the prompt along with the question, and you generate an answer.
The data I'm going to use for my evaluation is our own documentation — the milvus.io docs. They're public webpages of technical documentation. I load them and download them locally, just so I don't have to re-download them every single time. Then I clean them up, and I see a total of 22 documents, and I can see the website URL for each one of them.
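The notebook's exact loading code isn't shown here, but a hedged sketch of that step might look like the following, using LangChain's WebBaseLoader and a local cache file (the URLs and the cache path are placeholders):

```python
import os
import pickle
from langchain_community.document_loaders import WebBaseLoader

# A couple of example pages; the real notebook loads ~22 milvus.io doc pages.
urls = [
    "https://milvus.io/docs/overview.md",
    "https://milvus.io/docs/metric.md",
]

CACHE = "milvus_docs.pkl"

if os.path.exists(CACHE):
    # Reuse the locally saved copy instead of re-downloading every run.
    with open(CACHE, "rb") as f:
        docs = pickle.load(f)
else:
    docs = WebBaseLoader(urls).load()  # one Document per page, with the URL in metadata
    with open(CACHE, "wb") as f:
        pickle.dump(docs, f)

print(len(docs), [d.metadata.get("source") for d in docs])
```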
I also want to give a lot of kudos for this demo — let me move my Zoom window out of the way; okay. I'll give out this link as well: it's a Medium article that I wrote. And I want to give a shout-out to Greg Kamradt. His original article, "5 Levels of Text Splitting", is what the chunking section is based on, and I'm going to show three of those levels.
So there's the splitting that depends on your document type (HTML), there's small-to-big, and there's semantic chunking — that's what I'm going to show today in my demo. Alright, let me jump down here. I'm going to skip how I initialized my LLM pipelines using LangChain — skip all this. I have some canned questions.
Okay, so here's the first chunking method, small-to-big. The idea, as I showed in the slides, is that you've got the small chunks, which you split recursively. Recursive splitting means you've got a fixed length and a fixed overlap, and you just go dumbly through all your text and split it into 512-length chunks with a 10% overlap. Then all of those small chunks belong to a parent document.
So we're going to keep track of that parent document. In LangChain, for example, you just need to define these two things, the splitter and the retriever, and split with the child splitter into the fixed lengths. You can see the child chunk lengths are, as expected, around 512, with a few variations. Then you have your parent documents, and those should be something bigger.
I chose 1586 as the parent chunk size, and if you look at the parent chunk sizes, they are pretty close to 1586. Then what you need is two stores: I'm going to use Milvus as my vector store for the child chunks, and a built-in LangChain document store for the parent chunks.
What you do is define a retriever with the child and parent stores, and add those documents — my HTML webpages from our Milvus technical docs. Now you can ask it a question. For example, I ask it something like "what's the distance metric?", pass it to the retriever, and enumerate the result outputs. I can see that it did retrieve chunks relevant to the question that was asked. Same thing with parent documents:
I can see that it retrieves very long parent documents. Then you stuff them into a context to send to the LLM. Here I'm taking advantage of Lance Martin's words of wisdom: I reverse the context so that the most relevant of the top-k results come last, and then I get back an answer.
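A hedged sketch of that small-to-big setup with LangChain and Milvus might look like this — the chunk sizes follow the numbers in the talk, but the class wiring, connection arguments, and prompt assembly are illustrative assumptions rather than the notebook's exact code:

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Milvus
from langchain_community.embeddings import HuggingFaceEmbeddings

# Child chunks: fixed 512 length with roughly 10% overlap; parents: ~1586.
child_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=51)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1586, chunk_overlap=0)

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")

vectorstore = Milvus(                       # Milvus holds the small child chunks
    embedding_function=embeddings,
    collection_name="child_chunks",
    connection_args={"uri": "./milvus_demo.db"},
)
docstore = InMemoryStore()                  # LangChain's in-memory store holds the parents

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs)               # `docs` = the 22 loaded web pages

question = "What distance metrics does Milvus support?"
parent_docs = retriever.invoke(question)    # vector search on children, returns parents

# Most relevant context last, per the needle-in-a-haystack observation.
context = "\n\n".join(d.page_content for d in reversed(parent_docs))
```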
Okay, it looks like I have about 15 more minutes of demo, so I'll jump into the next technique, which is semantic chunking. This is really pretty interesting, and all thanks go to Greg for coming up with it. Basically, the concept is that you take your fixed-length chunks, calculate a cosine distance between adjacent chunks, and plot those distances. You look for outliers in the adjacent distances, and that tells you, intuitively, where you should make the chunk cuts. He has these nice graphs where he shows: okay, this would be chunk 1, 2, 3, et cetera, based on those more intuitive outlier distances.
So that's what this is doing. I'm not going to code it from scratch — I actually do have a notebook where I tried scratch-coding all of this — but for this demo I'm going to use LangChain's experimental SemanticChunker library.
You just initialize it with a particular embedding model. I kind of skipped over that in my code earlier: I'm using a Hugging Face open-source embedding model. Then at this point you just call it, because it's using statistics to calculate where the chunk boundaries should go. And I see that my 22 docs were split into 87 semantic docs in this case.
I can look at my chunk lengths, and I do see that something was not able to chunk — it might be a webpage that's only code. Actually, I already looked at it: it was a webpage that was only code, and it was not able to split very well. But most of the other webpages were able to split into fairly large chunks. You can try it with percentiles or with standard deviations as your breakpoints.
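A hedged sketch of that step, assuming LangChain's experimental SemanticChunker and the same Hugging Face embedding model (the exact options in the notebook may differ):

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")

# Breakpoints are chosen statistically from the cosine distances between adjacent chunks;
# "percentile" and "standard_deviation" are two of the supported threshold types.
semantic_splitter = SemanticChunker(embeddings, breakpoint_threshold_type="percentile")

semantic_docs = semantic_splitter.split_documents(docs)  # `docs` = the 22 loaded pages
print(len(semantic_docs))                                # ~87 chunks in the talk's run
print([len(d.page_content) for d in semantic_docs[:5]])  # inspect chunk lengths
```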
And again, you follow the same steps: I can use Milvus as my vector store and test it with a question — ask the question, get back my results, and loop through what was retrieved. Here I asked about IVF_FLAT again, and I just combined those chunks and put them into a context that I can send to the LLM. I also tried HTML chunking, and this is where I have to say that our particular docs have a very tricky structure for the H1s and H2s. You kind of have to look at the way your documents are coded up,
and mine are not nice, neat H1/H2/H3s — they're weird things I had to figure out how to parse — so I kind of ended up scratching the generic approach and writing a whole bunch of custom code to parse out our HTML docs. With that, I can split the 22 initial docs into 63 HTML chunks, and you can see the H1, H2, H3 that come out of them, along with the HTML chunk lengths I got out.
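For pages that do have clean headers, a hedged sketch of the generic header-based approach (using LangChain's HTMLHeaderTextSplitter, not the custom parser from the talk) could look like this, including the "headers into the text" trick described earlier:

```python
from langchain_text_splitters import HTMLHeaderTextSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]
splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

html_chunks = splitter.split_text(raw_html)  # raw_html = one page's HTML source (assumption)

# Prepend the header hierarchy to each chunk so the small chunks keep their context.
for chunk in html_chunks:
    headers = " > ".join(chunk.metadata.get(name, "") for _, name in headers_to_split_on)
    chunk.page_content = f"{headers}\n{chunk.page_content}"
```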
Alright, so those are the three chunking methods I tried. Let me go back — I've still got about half an hour left of this webinar. All right.
So, at a high level, what we're doing with our RAG pipeline is playing around with chunking methods: I did fixed-length chunking, I did small-to-big, and I did semantic chunking. Some other levers you have to pull for optimization, besides the chunking strategy, are swapping out the embedding models or swapping out the generative AI models. In my demo, I swapped out two different embedding models: an OpenAI model and a Hugging Face model.
From Hugging Face, I'm using BAAI bge-large. The way I did my experiments is I just ran through this notebook twice: once with the Hugging Face model, and another time with OpenAI's text-embedding-3-small at the smaller dimension, 512, since it gets the same accuracy as 1536. So those are my two iterations of the embedding model.
And then I also iterated on the LLMs — I actually tried out six different LLMs: Mixtral, Llama 3 from Meta, GPT-3.5, and others. Let me jump to my code where I try those out. And I'm going to add a table of contents to make this easier,
because I realized, scrolling around myself, that it's kind of hard to navigate — so I'll add a table of contents and check it in again. Alright, so when I do the LLMs, you need to create your system prompt. And here's where I iterated with Llama 3 via Ollama.
You just follow the instructions on their website: download Ollama, pip install ollama, and pull a Llama 3 model. If I run this here, I can see that it's running llama3:latest, that it's in GGUF format, which means it's formatted for CPU, and that it has the highest level of quantization, which is kind of interesting to see. That makes sense, because I have the poor person's laptop: an M2, an older Apple laptop, with only 16 gigabytes of memory, so I needed this to be the smallest model possible.
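A hedged sketch of that local setup with the ollama Python package (after running `ollama pull llama3` on the command line); the prompts here are placeholders:

```python
import ollama

# Inspect the pulled model: the details include the GGUF format and quantization level.
info = ollama.show("llama3")
print(info)

# Ask the locally served Llama 3 a question.
resp = ollama.chat(
    model="llama3",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": "What is the default distance metric in Milvus?"},
    ],
)
print(resp["message"]["content"])
```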
So I can send my prompt to Llama 3 and get back an answer — something about the default index being cosine. I can also send it to a bigger Llama, because the Llama I ran locally with Ollama was pretty small. Here I'm using the 8-billion-parameter model, so now it's bigger, and I'm using Anyscale, just because I heard Anyscale endpoints were good.
So I tried that out. And this is worth pointing out: look at the latency. My local Ollama took three seconds; if you use an endpoint, it takes a little less. That's why a lot of people use endpoints: even though running an open-source model locally is sometimes cheaper, sometimes there's a latency concern.
So there are all these different trade-offs to think about in your RAG pipeline. I also tried it with OctoAI, just to see what the difference was — I actually don't see a clear answer right there; I see clear answers of L2 here. I also tried Groq, and this is amazing:
0.37 seconds — it's super fast — and I do see a clear answer there. I also tried Claude; I'm not going to demo that here because it was very expensive to try out. And I can try out Mixtral.
Again, I think I used Anyscale endpoints to try Mixtral, and the Mixtral I tried was the 8x7B Instruct. I do see an answer of L2, and the latency is not too bad. And then let's try OpenAI, of course — GPT-3.5
Turbo is what I'm using, because I don't need the super fancy model just for a simple question. And I do see two answers there: L2 and IP.
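One way to run that kind of side-by-side comparison is to time each call; here is a hedged sketch using the OpenAI client, which many hosted providers (Groq, Anyscale, OctoAI, and so on) can also serve through OpenAI-compatible endpoints — the base URL, model names, API key, system prompt, and question are assumptions to adapt to your own providers:

```python
import time
from openai import OpenAI

SYSTEM_PROMPT = "Answer the question using only the provided context."  # placeholder

def timed_answer(client: OpenAI, model: str, prompt: str):
    """Return (answer, seconds) for one chat completion call."""
    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content, time.perf_counter() - t0

question = "What distance metrics does Milvus support?"  # placeholder

# OpenAI itself.
answer, secs = timed_answer(OpenAI(), "gpt-3.5-turbo", question)
print(f"gpt-3.5-turbo: {secs:.2f}s")

# An OpenAI-compatible provider: swap base_url, api_key, and model name.
other = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_KEY")
answer, secs = timed_answer(other, "llama3-8b-8192", question)
print(f"groq llama3-8b: {secs:.2f}s")
```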
Alright, so now we're ready for evaluation. I'm going to use Ragas — I've got links to the paper, the code, the docs, and a blog. All you need to do is pip install ragas. They do use Hugging Face Datasets behind the scenes; people ask me what the tricky parts are, and one is just to keep in mind that they use Datasets. And you do need those four components.
So I have four questions here; I have the ground-truth answer for each question; then I have the contexts that were retrieved for each of the different chunking methods; then I have the contexts retrieved using a different embedding model; and I should have six different columns of answers for the six different models. This is the kind of data you need for the evaluation.
In production you should have about 20 questions; just to be quick, I only have four questions here. I put those into a CSV, and then you need to convert them into a Hugging Face dataset, which I have some code here to do — assemble them in terms of what's the ground truth, what were your contexts, what are your answers. Then you need to know what you're evaluating, because there are different metrics — recall from the high-level chart that depending on whether you're evaluating the context or the LLM, different metrics are possible.
So you import all those metrics, and then you have to define, depending on whether you're evaluating the contexts or the answers, which metrics to use. The other tricky part — it's not tricky once you know it — is that anything supported through LangChain is also supported in Ragas. So, to save money, because this can get expensive, I replaced the default OpenAI model with my local Ollama model, and I'm using the open-source BAAI model as my embedding model. You need to set those wrappers to be able to change the models.
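A hedged sketch of that evaluation setup — column names and the exact wrapper arguments vary a bit between Ragas versions, and the question, ground-truth, context, and answer lists are placeholders for the data assembled above:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    context_precision,
    context_recall,
    faithfulness,
)
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import HuggingFaceEmbeddings

eval_ds = Dataset.from_dict({
    "question": questions,          # list[str], the 4 (ideally ~20) eval questions
    "ground_truth": ground_truths,  # list[str], one human-labeled answer per question
    "contexts": contexts,           # list[list[str]], retrieved chunks per question
    "answer": answers,              # list[str], the generated answer per question
})

result = evaluate(
    eval_ds,
    metrics=[context_precision, context_recall, faithfulness, answer_correctness],
    llm=ChatOllama(model="llama3"),                                        # local critic / judge
    embeddings=HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5"),
)
print(result.to_pandas())
```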
And what I found, at least from eyeballing my RAG pipeline results, is that I'm very happy using Llama 3 — even the super-quantized version on local Ollama — number one because I'm saving all that money. There are six different metrics to calculate per question, and you saw I had six LLMs, two embedding models, and three chunking methods — call it around a dozen permutations, with six metrics each, so 72 LLM calls. That gets expensive very fast, and I'm only using four questions.
So imagine if you were using the full 20 questions. For me, eyeballing the evaluations that came out of this, I'm very happy and would stick with Ollama, because for my evaluation I'm not too worried about real-time performance. I run it through, and it calculates scores for all the different chunking methods, and I output a final metric. Rather than outputting three different metrics to measure the context, since I have precision and recall I just combine them, down here in my code, into a traditional F1 score.
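That combination is just the usual harmonic mean; a minimal sketch of the calculation (not necessarily the notebook's exact helper):

```python
def retrieval_f1(context_precision: float, context_recall: float) -> float:
    """Combine context precision and recall into a single F1-style retrieval score."""
    if context_precision + context_recall == 0:
        return 0.0
    return 2 * context_precision * context_recall / (context_precision + context_recall)

# Example: precision 0.9, recall 0.75 -> F1 ~ 0.818
print(retrieval_f1(0.9, 0.75))
```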
That's my F1 calculation, and I return an average retrieval F1 as my final score, which is what I'm going to report as my scoring method for the context. Here you can see the outputs for the different chunking methods; just to make it easier, I output them in a slightly easier format to read. Meanwhile, Ragas does give you all of this output for your questions, for each different context type you had. So here's my recursive context:
I get the different scores — precision, recall, and my calculated F1 — for each question; and then the same for the HTML context, the parent (small-to-big) context, and the semantic context. I even tried summarizing the semantic context, because it was so big. So I tried all these different permutations, and in the end you can see that it was the parent context — the raw parent context — that had the highest accuracy, followed by the very straightforward recursive context. And when I did this across the chunking strategies, the embedding models, and the LLMs, the punchline is that I get the most bang for my buck by changing the chunking strategy — in my case about 84%, which is a really huge difference.
I got a 20% improvement by changing the embedding model, and a 6% improvement by changing the LLMs. So that's the punchline, and I think, Saachi, I'll take it back to you. Yeah, thank you so much for sharing your demo, Christy. You have quite a few questions in the chat and the Q&A. Okay.
About what you shared — so first: what is the optimal chunking for RAG models? Yeah, I think what we saw is that there isn't going to be a straightforward answer. Number one, it depends on the shape of your data: do you have text data like I had, or some other kind, like conversation data or lecture data? There are different strategies — if it's conversation, adding memory is the biggest bang for your buck.
For mine, you can see it was the small-to-big, or parent, chunking that worked best. But I think it's really going to depend on your data, and that's why this hard task of evaluation is so important — you won't know until you're able to evaluate your different chunking strategies. Yeah. Thank you.
And then a follow-up question they had: what about cost in RAG — where's the optimal area? I think you kind of answered this, but anything to add? Yeah, the costs. You do need to think about what kind of model you use. I'm very bullish, just from what I've seen, on the open-source models right now — I very much like Llama 3 as a critic — but you do need to try out different models as your critic to see which ones evaluate better.
And there is a trade-off between serving a local model, which tends to have somewhat slower performance, versus a commercial endpoint, where there's a huge cluster behind what they're serving — but then you're paying for that cluster through API calls. So I guess I would say: if you're doing this individually for demos, definitely subscribe — that 20 bucks a month is quickly going to be worth it. If you're a big business, then of course you're going to hit some ceilings there,
but at least for an individual or a small business that needs a commercial endpoint, I think the 20 bucks a month is worth it. Okay. Thank you. And then someone in the Q&A tool asked about the distribution-of-scores chart, where the LLM favors high and low answers.
Does this mean using LLMs as classifiers is not ideal? And what would be some methods for using an LLM to classify a response consistently? So these are just known findings — this is what has come out of the research papers, for example that LLM-as-a-judge paper, and it's a pretty well-accepted technique now. The thing is, using the concept of LLM-as-a-judge comes with so many benefits versus going back to thousands of hand-curated answers. I think the benefits are really strong; you just need to keep in mind that there are known caveats when using an LLM as a judge, or as a critic, instead of a human critic.
So the fact that AI tends to prefer extremes, versus the midpoints humans prefer — just keep that in mind. Ragas has done that: they've created metrics and scoring systems that avoid these known pitfalls. But I would say don't throw out the baby with the bathwater — don't just say, "Hey, I'm going to skip LLM-as-a-judge as a technique" and go back to hand-labeled data points.
Okay, thank you. And then someone's asking if Ragas is a Milvus product — no, it is not. But Christy, do you want to talk a little bit about how you can access Ragas? Sure, yeah.
So I think I have a link in my notebook. This is their technical documentation — and since they're public webpages, I could also just pull these pages in and create a little RAG chatbot for you to ask Ragas questions, like I did for the Milvus questions. But yeah, the code is over here; the GitHub org is called explodinggradients,
and the project is Ragas. So no, it is not Milvus; it's not the same. Okay, thanks for sharing. We'll have the link as well in an email, and the code that she shared is actually in our Milvus Bootcamp.
I'll provide a link to that soon, while I ask the next question, which is: is Ragas just for Python deployments? Let's see — actually, I'm not sure about that answer. I would have to look at the Ragas documentation myself, but from the getting-started docs it looks like it is Python.
Yes, that's my read right now — I'd have to look into it more closely, but offhand it looks like it's Python. Okay, great, thanks for clarifying. And then our next question is: rather than doing all the chunking, is it a better ROI to label and pre-process all your data and put it into proper chunks beforehand, and instead worry about other parts of the pipeline? So my takeaway — and this is also what I've heard from other engineers, and I think there's even a white paper published by a consortium of AI companies, the AI Alliance — is that there was agreement that the biggest bang for your buck comes from the retrieval step.
You saw that when I swapped out six different LLMs, the answers really didn't vary that much. However, if you change the chunking and the embedding models, there's a bigger improvement. So it is more bang for your buck to focus on the retrieval step than the generation step. Now, within the retrieval step, my particular case was outside the typical range: in that white paper they ran a whole bunch of experiments over lots and lots of data, and typically they saw around a 30% improvement from chunking and something really small from the LLM. In my case, I saw a huge improvement by changing my chunking. So yes, I definitely think it's worth looking at the retrieval step.
And you saw it's super easy to just change the LLM — you typically change one line, the name of the model, and then you can call endpoints with different models — but it takes more work to change your retrieval and your chunking. Okay, thank you. And then: you mentioned you can swap embedding models. Just to confirm, you can only use one embedding model per Milvus collection — for example, you cannot mix embedding models in one collection?
Like for example, you cannot mixembedding models in one collection. Just it's, uh, clarification on that thatYes, yes, that is, um, I feel like that is a pain pointthat, uh, when you,and it's part of just the theory of, of vector retrieval isthat if you're going to create a knowledge baseand then you want super fast, uh,nearest naval calculations within a vector space, uh,based on unknown questions that are gonna come in, uh,they do all need to be in the same vector space in order forthat distance metric to make sense. So you're, you're basically calculating either co-signdistance metric or L two or, you know, productor some kind of distance metric of between vectors. So they do all have to be in the same vector space. So they do all have to becreated with the same embedding model.
Okay. This is kind of related to that: how is it possible to mix embedding models with different sizes, and does this mean all the internal neural networks are constructed the same? So what they've done is Matryoshka Representation Learning (MRL), which has to be done during the training of the embedding model itself to allow what they call MRL embeddings. I think they named it after those Russian matryoshka dolls, where you have dolls within dolls within dolls. The idea is that the vectors themselves have sub-vectors within the larger vectors, and more sub-vectors within those. By being aware of those layers at training time, the embedding model lets you choose different-sized vector outputs. In this particular case — the OpenAI embedding model from the February 2024 announcement, whose name escapes me — the two possible dimensions were 1536 or 512, with the same accuracy.
So it's up to you; for me, I'd rather just save the space if I'm going to get the same accuracy. Hopefully I answered that question. Yeah, let us know if you have any follow-ups in the comments.
Someone is wondering — maybe you might know this, since you've met some of the co-founders of Ragas — what the "as" in Ragas stands for, after "RAG"? Good question; I have no idea. I know RAG; I don't know why they call it Ragas. Maybe they tried to get the name "rags" and that was too generic,
so they just — I don't know, I'm just guessing. Potentially, yeah, that's what I think too. Last question here: how is Ragas calculating these scores, for example context relevance? Is an LLM evaluating it, or is it something like the cosine similarity used by vector search? Yes — that's one question.
Feel free to answer that and then I'll ask their follow-up. Okay, yeah, sure. So they are using an embedding model and calculating distances, and they are also using an LLM as a judge. Those are the two main components when you're using Ragas: the choice of your embedding model and your critic model.
So you do need to choose those. In terms of what exactly they're doing, I would refer you to their paper — and actually I tried to summarize it in my blog; there's also a part one of this blog. I would point to both of my blogs as follow-ups, Saachi — maybe we can send those out. My first blog is about how Ragas works, and their main concept — somebody highlighted it — is that when you statistically sample factual answers, they should be more similar to each other than hallucinated answers. That's what they're going off of.
And then they do use LLM-as-a-judge to calculate all of these different metrics you can see here. So: taking segments of the question, taking little factoids out of the question and seeing whether they're covered in the factoids of the context — that's precision, roughly the coverage — and recall is how many of the points in the ground truth were also mentioned. That's why statisticians typically calculate an F1 score out of precision and recall. And you need the ground-truth answer for recall, versus precision.
Sorry, I probably said that in reverse. Yeah — well, their next question was: is it possible to produce any of these scores without a ground truth specified? Right. Yeah, so it is possible — I have seen some people use Ragas without a ground truth.
It just means you'll be missing these three metrics, for example: you can't get recall, and you won't be able to score whether the answer was correct or similar in style. Okay. Great.
So we have one minute left if anyone has any last-minute questions. I think I see one here: is this process pre-production evaluation, or production evaluation and monitoring? Yeah, I would do this pre-production, as part of your development cycle, before you deploy an app — because you probably want to iterate and figure out, for your company's data, what your best levers are. And there's even another layer on top of all of this: agentic RAG, where now, besides the LLM evaluation, you're adding an LLM that could decide, for example, which chunking strategy, which embedding model, and which LLM to use. I think it's still early days for agentic RAG, but it's getting a lot of buzz as kind of the future — having LLMs make even more decisions for you.
Yeah. Okay, thank you so much, Christy. Just to end, I see some speculation on what Ragas might mean.
One guess is "assessment", and another said that a raga is an Indian musical melody style, so they may have gotten the idea from there and seen that "RAG" was part of the name.
But yeah, thank you so much. We will send the recording, the slides, and some of Christy's blogs explaining more about this topic after the webinar has ended. Thank you all for joining today, and we'll see you next time. All right. Thank you. Bye.
Bye.
Meet the Speaker
Christy Bergman
Developer Advocate
Christy Bergman is a passionate Developer Advocate at Zilliz. She previously worked in distributed computing at Anyscale and as a Specialist AI/ML Solutions Architect at AWS. Christy studied applied math, is a self-taught coder, and has published papers, including one with ACM RecSys. She enjoys hiking and bird watching.