RAG Evals: Statistical Analysis of Retrieval Strategies
Today I'm pleased to introduce today's session, RAG Evals: Statistical Analysis of Retrieval Strategies, and our guest speakers. Jason Lopatecki is co-founder and CEO of Arize AI. Jason is joined today by his colleague Sally-Ann DeLucia, an ML Solutions Engineer at Arize. Welcome, Jason and Sally-Ann.
Hey, thank you so much for having us. We'll do a quick intro. I'm a co-founder here at Arize, the technical founder. Arize is an AI/ML observability company, and we've been measuring the performance of ranking and retrieval systems for a number of years now. This last year, with the growth of LLMs and RAG, has been an area of a lot of work and a lot of deployments for us. What we're sharing with you today is really notes from the trenches: what it's like building a lot of these systems, the metrics we've been using, and some things we've learned. So it's great to be here.
And I'll let Sally-Ann introduce herself.
Yeah, awesome. I'm Sally-Ann, and I'm an ML solutions engineer at Arize. My background is in deep learning, so it's been really exciting over the last year to see how all of that has been applied and transformed into the world that we now know as LLMs.
So, really excited to get into this. Jason, should we get started?
Let's do it. Hop in.
All right. Awesome.
So we do have a really informative and exciting session for you all on benchmarking and analyzing your retrieval. This is super important because we're seeing a lot of different approaches out there for how to retrieve your context and different chunking strategies, but what we really don't see is a lot of standards and guidelines that tell you what you should do or help you evaluate the systems you've built. That's something that at Arize we've been thinking a lot about, and as Jason said, these are notes from the trenches: some of the learnings we've found and things we've put together. So we're going to be sharing with you some of our guidelines for evaluating your RAG systems. I first want to introduce you all to the five pillars of observability.
What these five pillars allow us to do is get visibility into how our LLM applications are performing and then ultimately how to improve them. That's the goal at the end of the day: having these super high-performing, reliable models in production. So these pillars are going to be crucial for doing just that. For today's conversation, we're going to focus on the evaluation piece, where our goal is to evaluate the LLM's output by using a separate evaluation LLM. So let's start off by reviewing some of our building blocks here.
Before we get into the specifics of what an eval is and how we do evaluation, I think it's really important that we understand the building blocks that make up a successful evaluation. So starting off with what evals are: we've built an eval library, and it allows you to generate evals for either chains or individual spans. In terms of retrieval, there are two things we want to evaluate. The first one is on our chain, and that's going to be our QA eval. This is really the overall performance of our RAG system.
The second piece is our retrieval. Is our retrieval actually correct? This is super important because RAG works by retrieving contexts based on similarity metrics, so we definitely want to make sure that whatever we're retrieving is correct. So if you're using Milvus, this is how you would evaluate your retrieval. We're going to focus on discussing the building blocks for evaluating these and how you can leverage them to evaluate your entire system.
And I would say the reason we call them building blocks is that we deeply believe NDCG, MRR, and other traditional retrieval metrics are really ideal for evaluating retrieval systems. But to build up to those, you need some amount of labels or evaluations to build them off of. So when we're talking about building blocks, these are going to be building blocks towards your traditional search and retrieval metrics, which, listening to a lot of these podcasts, I don't hear people talking about. That is how you really want to look at these systems, and these are going to be the core bricks that we use in that foundation.
Yeah, I'd say that's a great way to describe it.
So again, focusing on this retrieval section here, there are some fundamentals that we have put together at Arize for evaluating your LLMs in production, and we've applied them to our open source library and our evals. When you're thinking about evals, we suggest using a golden dataset. In our framework, we have provided some golden datasets, but we've also seen a ton of our customers just building their own dataset.
That's because it's really important that you trust these; at the end of the day, you want to be able to trust the evals you're using. So build a golden dataset, which is really just a really good example of your data: good queries, good retrieval, good answers. And the benefit, as Jason was just mentioning, is that you can use this golden dataset to apply the metrics we're familiar with, like precision, recall, and F1. This allows us to understand more specifically how well our model is performing, versus just using an average of scores or something like that. The second thing...
Oh, yeah, my little note on this, on the left side: if you're listening to the evals ecosystem right now, a lot of people have grabbed onto an average of a number, like correctness or something like that.
And those of us who've lived in the trenches realize that averaged numbers don't really tell you much. You can have really skewed data that skews your number, and you don't know if you're measuring an outcome or measuring the weird skews in your data. So where we've ended up is that for our task-based evals, we use precision, recall, and F1 a lot. If there's one thing you take away from the building blocks section, it's to use a binary decision to say whether something is relevant or not, not a single number. And then also build trusted datasets, so you can make sure your evals are correct and you feel good about your numbers before using them.
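As a minimal sketch of that idea, the snippet below scores binary eval labels against a golden dataset using precision, recall, and F1. The label lists are made up for illustration, and the function name is hypothetical rather than part of any library.

```python
def precision_recall_f1(golden, predicted):
    """Score binary eval labels (1 = relevant, 0 = irrelevant) against golden labels."""
    pairs = list(zip(golden, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # eval agreed: relevant
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # eval said relevant, golden disagrees
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # eval missed a relevant item
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical golden labels and LLM-eval labels for eight chunks.
golden = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(golden, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because the labels are binary, a skewed dataset shows up as a gap between precision and recall instead of being hidden inside one averaged score.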
Yeah, exactly. And you actually touched on the second one here, which is task-based. I think that's also important: it's taking a stand on binary labels, but also on the fact that evals should be task-based. There are a lot of evals out there that are model-based. In our approach, we're suggesting using a template and a task to define our performance.
This is really important because, at the end of the day, we're just adapting these foundation models for specific tasks and then putting them into production. So when we're evaluating them, we really should be task-specific. That's why we believe the task-based approach is the best approach.
I think I saw you do a bit of a writeup on the OpenAI evals.
Mm-hmm.
Evals is kind of this broad word. Can you describe for a second maybe the confusion in the ecosystem between the OpenAI evals, model-based evals, and how they're used for Hugging Face leaderboards and stuff?
Yeah, absolutely. Definitely check out the piece; it gives a much broader rundown of all the different evals. But basically, the evals that you see with OpenAI are judging the model on how well it's able to do against a specific task-based dataset.
So it's giving you more of a general idea of how well this LLM is able to generalize, how well it's able to answer questions, things like that. It's not looking at the specific tasks that you are doing; it's looking at the general capabilities of the model. And it's meant to be more of a starting point for development, right? There's a million models out there. It's impossible to keep track when everybody comes out with a new model saying, "Hey, we have this state-of-the-art model." Those evals allow you to compare model to model, but they don't give you true insight into how well a model is going to perform in production when you apply it to a specific task. And that's what we're proposing here.
Cool. Great.
Yeah. And then the last thing here is that you want your evals to be able to run in different environments and as fast as possible. I think this is kind of a given.
A lot of times we're experimenting with different frameworks. I have LangChain and LlamaIndex stated here, but we really want it to be easy to compare and make these metrics useful regardless of what framework or environment we're working in. Now that we have the fundamentals, you're probably asking, how does this all fit in when I'm actually running something in production? Well, here's an example for you all. We have our LLM span or chain, and there's some eval library. We have Phoenix, which is ours, in parentheses there.
What we're really doing is having the data that's flowing through your LLM also flow through the eval library to create an eval. Maybe here you're switching out the template that you're using or the LLM that you're using to do the evaluation, but this is the general setup, and the end result is going to be the eval that was produced by that library. We'll get into the different components of these libraries next, as well as what those results actually look like.
And I would add that Phoenix is open source.
We're not opinionated if you want to use your own evals or you have evals that you think are good. All we're trying to do is tell you a little bit from the trenches about how to use these to get some NDCG- and MRR-type retrieval metrics from your system.
Absolutely. I see our evals as laying the groundwork.
It's almost a summary of everything that we've experienced; we've put it together into this package. And like you said, we encourage people to try them out, but also build your own and let us know what you find.
Cool.
Okay. Cool.
So how do evals work? In this example, we're looking to run a retrieval eval on our retrieval span. If you remember a few slides back, we had our overall chain, which had our QA, but then we had a retrieval chain that was specific to retrieval, and this is the eval that we would use for that. So we're taking our inputs, the user query and the document, and we're running our eval library. In this case, again, we're using Phoenix to get our evals.
This template is really important to focus on, because it's what we're using to instruct the LLM that evaluates our system on how to actually do the evaluation. We're asking GPT-4 or GPT-3.5 Turbo to determine whether this chunk contains information that is relevant to answering our question. We're just trying to understand whether our retrieval is relevant or irrelevant. So that's where the binary label comes into play.
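To make the template idea concrete, here is a minimal sketch of a relevance prompt and a parser that turns the evaluating LLM's reply into a binary label. The template wording is hypothetical and is not the template shipped in Phoenix; `build_prompt` and `parse_label` are illustrative names, and the call to the LLM itself is left out.

```python
# Hypothetical relevance template: an illustration, not the actual Phoenix template.
RELEVANCE_TEMPLATE = """You are comparing a reference text to a question.
[Question]: {query}
[Reference text]: {reference}
Does the reference text contain information that can help answer the question?
Answer with a single word: "relevant" or "irrelevant"."""


def build_prompt(query, reference):
    """Fill the template for one (query, chunk) pair."""
    return RELEVANCE_TEMPLATE.format(query=query, reference=reference)


def parse_label(raw_output):
    """Map the evaluator's reply to 1 (relevant), 0 (irrelevant), or None if unparseable."""
    text = raw_output.strip().lower().strip(".\"'")
    # Exact match matters: the word "irrelevant" contains "relevant".
    if text == "relevant":
        return 1
    if text == "irrelevant":
        return 0
    return None


prompt = build_prompt("How do I reset my password?", "Open settings and choose Reset password.")
print(prompt)
print(parse_label("Relevant."))
```

Returning `None` for anything unexpected keeps malformed replies out of the metrics instead of silently counting them as irrelevant.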
The label is a plain yes or no: is it relevant? One of the reasons we're doing this, as we've been talking about, is to really benchmark our system. The output of this, as Jason alluded to, is where we can start to do some analysis on the actual quality of our retrieval. Understanding retrieval performance is really important because it's really difficult to do. Before having some of these newer eval frameworks, it was sometimes impossible to know if your retrieval was good or not. You might be digging through your responses, looking at your retrieved context, and kind of just guessing.
With this, you get to know with a lot more detail and certainty whether or not your retrieval is performing well.
Yeah, the technical thing I would add here is that you are using embeddings to find similar documents, which is a rough, coarse search approach. It's semantic search. It's finding things that are close in that latent space, but it's very rough in terms of how it grabs documents. All dimensions are treated the same, and basically it's just looking at distance.
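For illustration, the coarse embedding search described here can be sketched as a nearest-neighbor lookup by cosine similarity. The two-dimensional vectors are toy values standing in for real embeddings, and a vector database would use an index rather than this brute-force loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; every dimension is weighted equally."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k):
    """Return the indices of the k chunks closest to the query, best first."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)),
        reverse=True,
    )
    return [i for _, i in scored[:k]]


# Toy embeddings (assumed values, not produced by a real embedding model).
chunk_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
nearest = top_k([1.0, 0.0], chunk_vecs, k=2)
print(nearest)
```

Nothing in this lookup checks whether a chunk can answer the question; it only ranks by closeness, which is why a second, LLM-based relevance check adds information.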
What we're doing here is a much more subtle check. You're using GPT-4 to look and see whether that embedding-distance search really returned something that would answer the question. So a chunk might look relevant. We've seen examples where there's a bunch of words, "platform," "Arize," tons of words that are semantically similar to the chunk returned. But does the chunk have a chance to answer the question? That's the core thing that GPT-4 is answering.
Exactly, and I think this ties to the fact that similar is not always more relevant. This gives you real insight to see, okay, we know it was the most similar, but was it actually relevant to us at all? So I want to review this retrieval eval in a little more detail, breaking it down. We see here that we have our query, and then we have some chunks that were retrieved from our vector database. This might be Zilliz or Milvus, as an example.
Then we also have that template that we saw on our last slide. What we're doing is applying this template to every single one of these chunks, relative to that query, to evaluate whether or not the chunk is relevant. When you're in production, you might not do this for every single chunk on every single run; you might opt to sample just a percentage. But the goal is really to get a view into how well things are doing.
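A rough sketch of that loop is below: a judge is applied to every chunk of each sampled query. The `judge` argument stands in for the evaluating LLM, and `keyword_judge` is a deliberately crude placeholder so the example runs offline; both names and the `sample_rate` parameter are assumptions for illustration.

```python
import random


def evaluate_retrievals(records, judge, sample_rate=1.0, seed=0):
    """Label every chunk of each sampled query.

    records: list of (query, [chunks]) pairs.
    judge: callable(query, chunk) -> 1 (relevant) or 0 (irrelevant).
    sample_rate: fraction of queries to evaluate on this run.
    """
    rng = random.Random(seed)
    results = []
    for query, chunks in records:
        if rng.random() >= sample_rate:
            continue  # not sampled on this run
        results.append((query, [judge(query, chunk) for chunk in chunks]))
    return results


def keyword_judge(query, chunk):
    """Placeholder for the LLM judge: relevant if any query word appears in the chunk."""
    return int(any(word in chunk.lower() for word in query.lower().split()))


records = [
    ("reset password", ["To reset your password, open settings.", "Our office is in Berlin."]),
]
labels = evaluate_retrievals(records, keyword_judge)
print(labels)
```

Sampling by query rather than by chunk keeps each query's label list complete, which ranking metrics need.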
Leveraging another LLM is a really great way to automate this. I'll touch on this later on as well, but this is not to say that it's a replacement for human labeling. We see the best teams out there doing a combination of both, but the idea is to try to get as much data as you can around this, and automating with an LLM really empowers you to do this efficiently.
Yeah.
And there was a question asked: with an LLM eval, how do you account for the mistakes of the LLM that does the evaluation? I think it's hitting the same topic that you just hit there, which is that I really do see the top teams right now using a mixture of both. You have human labels that say, is this relevant or not? Those can't scale, but they're really something you believe; you feel really good about the results. LLM evals scale much, much bigger, much broader, and can be proxies for production when you can't get a labeler on it. It's actually great for getting a good gut check.
There are teams that are just starting. If you're a hacker and you just want to get a quick gut check of your system, these evals are actually a pretty good proxy. We don't think they're perfect, but for GPT-4 we do have reference test data that gives you an idea of how good they are.
And then you build your own intuitions and datasets to believe it yourself.
Yep, exactly. So if you take away anything about what an eval is, again, it's automating proxy metrics. That's what we're really doing here. And it's definitely not, like you said, a replacement for human labeling.
Yeah. Cool. And then for our QA eval, the second eval that we'll do for RAG systems, we again have our query and context, but this time we're also going to include our generated response, because this is ultimately what we're going to evaluate with this eval. We're asking the LLM whether or not the question is answered correctly based on the retrieved context. So you can see here, again, whether the chunk had all of the information needed to create a correct response.
These are some of the Arize LLM evals and golden datasets that are available within our framework. Again, today we're focused on the retrieval and the QA evals, but there are many others that you can use to benchmark your various tasks, and we definitely encourage you to check them out. As Jason said, you can use them as a guideline for creating your own. We definitely want to see what the community is able to come up with. So, exciting stuff.
I think now we're going to get into some more of the fun stuff. Jason's going to walk you through a live example, and then we'll break that down a bit more after. So Jason, you want to take it away?
Yeah, I'll share here. First off, there was a question from the team around performance. For the eval metrics, and we have a bunch more rolling out, what we do is we have test datasets, as Sally-Ann said. There's a whole benchmark phase where you decide whether this thing is good enough for you to use as a proxy metric for performance.
Again, it doesn't replace the hand labelers who might go out and help you label in production, or even while building, but there are some performance metrics here: precision, F1, and recall for different models and model types. For this one, GPT-4 and 3.5 Turbo don't have a big difference. For some of these others, like Q&A, I see big differences. So we help you gut check the template plus the model.
Is it good enough for the task? But I encourage you to build your own confidence.
Hey Jason, if we could just pause right there on that table. Somebody asked about evaluating the LLM when you're doing retrieval. This is an example where you can see the comparison of the different LLMs and how they're performing. I just want to point that out.
Yeah. And this is like a pre-production benchmark. You get a feeling for your evals and how well they're doing. You save off your own golden datasets and feel good about your templates and your models. And then it runs in production.
You have a proxy performance metric, and you can also do hand labeling on top of that to keep a gut check on your metric, right? But again, this is a proxy metric that scales out more than hand labeling. As you think about it, Arize Phoenix is open source, a pretty awesome solution, which we're showing you with evals. It also runs as a couple of lines of code with LlamaIndex and LangChain. You literally drop this in here for retrieval. There's a little link; you click the link and it runs locally.
There it is. It doesn't send your data anywhere; it's completely local execution, where you can see your queries and see your retrieval. We're going to talk a bit about retrieval, but I can look at this re-ranking and the number of LLM calls in here. I have a re-ranking output here. I can drop in and see the templates that are used by LlamaIndex and LangChain.
So there's a way in which you can collect the data, and all this data gets collected here. I can easily compare one retrieval approach to simpler approaches. You can see there are probably six LLM calls on this one. Or take the very simplest approach here, where there is no re-ranking and it's just a simple retrieval. You can see the chunks retrieved and performance metrics.
So it's a really good debugging tool: open source, with one line of integration with LlamaIndex and LangChain. What's really nice about it is you can bring all this data back into a dataframe. We'll release this notebook after, but this is an example on the Arize docs where all the metrics that you see Sally-Ann running, with the queries and MRR and NDCG and these evals, you can just run yourself and build your own confidence that this works for your system. You can also just point it at your docs.
If you have your docs and your questions, this same script will run. We'll provide it after the call.
Awesome. Let me share my screen again. All right.
Yeah. Anything else you wanted to chat about in terms of that before we...
Yeah, there's another question that dropped in, and we're going to go through a little bit of this: in your experience, what are some ways to improve retrieval if the eval results turn out not so great? Would you focus on better document processing or better similarity metrics? I think the answer is that we've seen a lot of different failure scenarios. So first off, I would say the goal of Phoenix is to actually get you to examples of where those problems are. As for examples of problems I've seen: I've seen people with tons of duplicate chunks, where all the chunks in the context window are the same thing, because they have French and English pages, different webpages for different countries, but they're all English language, and all the chunks were overlapping. So there are simple things like that. It can get as complex as what Sally-Ann's going to go through, which is that your K is too big: you're returning 10 and it's killing performance, and the model you're using really needs a small number of contexts.
You can get examples where your chunk sizes are too small, which is another one Sally-Ann will run through here. I've seen examples of just returning garbage: there's some messiness in your docs, tons of stray characters, and you're retrieving chunks that don't have any English in them. So there are lots of things around cleanliness of data and setting up the retrieval system with the right parameterization, which is what the script helps you do, and then just giving you tools to debug, because there's a million failure modes, I think.
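Two of the failure modes just described, duplicate chunks and chunks that are mostly stray characters, can be checked with very little code. This is a minimal sketch under assumed heuristics: exact match after whitespace and case normalization for duplicates, and a hypothetical 60% letter ratio as the garbage threshold. Real pipelines would likely need fuzzier matching.

```python
def find_duplicate_chunks(chunks):
    """Return indices of chunks whose normalized text repeats an earlier chunk."""
    seen, duplicates = set(), []
    for i, chunk in enumerate(chunks):
        key = " ".join(chunk.lower().split())  # collapse whitespace, ignore case
        if key in seen:
            duplicates.append(i)
        seen.add(key)
    return duplicates


def looks_like_garbage(chunk, min_alpha_ratio=0.6):
    """Flag chunks that are mostly non-letter characters (threshold is an assumption)."""
    stripped = "".join(chunk.split())
    if not stripped:
        return True
    letters = sum(ch.isalpha() for ch in stripped)
    return letters / len(stripped) < min_alpha_ratio


# Hypothetical retrieved context for one query.
chunks = [
    "Pricing starts at $10.",
    "pricing  starts at $10.",
    "|||---###===|||",
    "Contact support by email.",
]
print(find_duplicate_chunks(chunks))
print([looks_like_garbage(c) for c in chunks])
```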
Yep, absolutely. That question we just got in the chat is one of the most common questions we get asked. Again, part of the motivation behind all this is actually having information to guide that, because there are so many different pieces that could be causing poor performance.
As Jason mentioned, we're going to get into specifics on identifying these so that you can strategically decide what to do about them. Cool. So now we've discussed some of those building blocks and how we're building these evals. You've even seen Jason go into the Phoenix tool. But I think actually discussing how to benchmark these is a super important piece, because this is where complexities start to come up.
So we're returning to our retrieval eval. We have our queries here, two queries, zero and one, and then we also have the various chunks that were returned. We've already discussed going through our eval and running it on each of these chunks. What we're really doing here is generating these labels using our eval template.
If you remember, in our eval template we were asking, is the retrieved context irrelevant or relevant? That's what these zeros and ones indicate: one is a relevant piece of context, zero is an irrelevant piece of context. For our first query, we returned three relevant pieces of context, but our second query is not so good: we only have one piece of relevant context. What we can do with these labels, and this is the power that Jason was talking about in the beginning, is start to ask, okay, what's my precision at 4, for example, or what's my MRR for these? Once you have your retrieval set up, you have your queries, and you're confident in your evals, you can really start to apply these traditional search and retrieval metrics, which makes it a lot easier.
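As a sketch of how binary labels feed the traditional metrics, the snippet below computes precision at k, MRR, and NDCG for two queries. The counts follow the example (three relevant chunks for query 0, one for query 1), but the positions of the ones and zeros are assumed for illustration.

```python
import math


def precision_at_k(labels, k):
    """Fraction of the top-k retrieved chunks labeled relevant."""
    return sum(labels[:k]) / k if k else 0.0


def reciprocal_rank(labels):
    """1 / rank of the first relevant chunk, or 0.0 if none is relevant."""
    for rank, label in enumerate(labels, start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(labels, k):
    """Discounted cumulative gain of the ranking, normalized by the ideal ordering."""
    dcg = sum(l / math.log2(r + 1) for r, l in enumerate(labels[:k], start=1))
    ideal = sorted(labels, reverse=True)[:k]
    idcg = sum(l / math.log2(r + 1) for r, l in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0


# Relevance labels in retrieval order; positions are hypothetical.
labels_by_query = {0: [1, 1, 1, 0], 1: [0, 0, 1, 0]}

mrr = sum(reciprocal_rank(l) for l in labels_by_query.values()) / len(labels_by_query)
for query_id, labels in labels_by_query.items():
    print(query_id, precision_at_k(labels, 4), round(ndcg_at_k(labels, 4), 3))
print("MRR:", round(mrr, 3))
```

With these assumed positions, query 1 is penalized by both MRR and NDCG because its only relevant chunk sits at rank three.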
And it makes you more confident in what you're actually evaluating in terms of your performance. We talked about just having an average score. If you have skewness in your data or anything like that, you can't really understand the overall performance. This allows you to do that, and probably in a way that you're more comfortable with.
These methods have been around for quite a while. We do have some really great example notebooks for this using Phoenix. We'll be sure to share them with you, and I definitely encourage you to check them out because they make life a lot easier. In terms of our second eval method, our Q&A, we're going to take a look at the query, the chunk, and the answer. So this is an additional level of analysis that should be done.
The first part is, was our retrieval good? But now we're trying to say, was our answer correct or incorrect? Another piece of this is, did the LLM know the full answer based on the retrieved context? Sometimes it has part of the answer, but the full answer is not included in the context, and this eval will also help you understand that. There's a lot to be learned on this QA portion, because there's a lot that can affect the QA, and we don't really know exactly what defines good retrieval. So definitely expect more to come on this. But one thing that we are fairly certain about is that there is some kind of correlation between good retrieval and good QA results.
Anything to add to this, Jason, before we move forward?
Yeah, there's a big question in the webinar chat which hits on exactly what you hit on: are NDCG and MRR the things you should be looking at, given the way an LLM works? Again, build your own intuition, use our scripts, do whatever you think you should do. But what we've seen is correlations between good retrieval results and good Q&A evals. Overall, did you get the question right? When you have good retrieval, you have good Q&A. I would say there's so much blocking and tackling that NDCG and MRR are really helpful with.
That's like the example I gave where I just had chunks full of characters. The way we found it is that MRR was kind of zero for a bunch of these, and we were like, well, what's going on? So it's good for getting your blocking and tackling right. As you get to really clean data and decent retrieval, you're probably working on other things, like looking at these overall Q&A evals to understand whether the next set of tweaks is working. But I definitely think it's useful. I think it's very useful for capturing your core setup problems in your retrieval system. But it's not the only thing.
And I think that's the point here.
Yeah, for sure. And even for the specific problem they're discussing there, the lost-in-the-middle problem, I do think the evals have power in helping identify that, because you're able to see which of the chunks are really providing good answers or which of them are really relevant. If you have your relevant ones in the middle, and then you notice with your Q&A that we're getting a bad response, that tells you right there that we are experiencing that lost-in-the-middle problem.
So there's definitely value in the evals for that problem as well. Cool. We've been talking a lot about the evals. I want to switch to a parallel way of evaluating models, and that is some scripts we have that help you sweep through retrieval setups. I think this is super cool because you're able to point to your docs and sweep through the different options.
So that question we got before, what do I do, or what's the best setup, that question we get all the time: this is really helpful for that. You can set the different chunk sizes that you want to iterate across, you can set up different Ks, and then you get the results. I've mentioned it a few times, but this really is one of the most common questions we get: how to set up retrieval, and which option is best. What we're allowing teams to do is sweep through these setups and then set up the system with some amount of rigor, because you have these results to power the decisions you're making in terms of setup.
This might be a little more relevant for development use cases, but nonetheless I think it's super cool, and we're going to discuss some of the results in just a bit.
Yeah, and I would say we've just started releasing some of these scripts this week. It's still early, so feedback is welcome. This version of the scripts works with LlamaIndex; we have a LangChain version coming. And I think it's been pretty helpful.
The one caveat is that, in general, evals are slow. I would say the slowest thing is actually the querying and question-and-answer systems themselves. Running through hundreds and hundreds of questions QA-wise to do these different things does take time; this sweep here might take over 24 hours to run. It's not super expensive.
Relatively speaking, that is; again, it depends on your budget. But the slowness is natural in a lot of these systems right now.
For sure. And we are experimenting with that piece that is slow. Even just in the implementations, re-ranking takes much longer than an original implementation of just pulling chunks in. So that will definitely be a factor as well.
And to be honest, here's the thing that I've found is slow right now.
So our library, the Phoenix library for evals, is designed to fill the pipe as much as possible; it forks a bunch of calls to OpenAI. LangChain and LlamaIndex kind of do one call after the other, so those tend to be the gating calls right now. Those approaches are fairly slow: they typically wait for one to return before they kick off the next query. For sure.
For sure. At least they don't fill the pipes. Yep. Not in the same way, at least.
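The difference between issuing calls one after another and "filling the pipe" can be sketched with asyncio. This is only an illustration of the idea, not how Phoenix, LangChain, or LlamaIndex are actually implemented; `fake_llm_call` is a hypothetical stand-in that just sleeps, and `max_in_flight` is an assumed knob.

```python
import asyncio
import time


async def fake_llm_call(question: str, delay: float = 0.05) -> str:
    """Hypothetical stand-in for a network call to an LLM; it only sleeps."""
    await asyncio.sleep(delay)
    return f"answer to {question}"


async def run_sequential(questions):
    # Wait for each call to return before starting the next one.
    return [await fake_llm_call(q) for q in questions]


async def run_concurrent(questions, max_in_flight: int = 8):
    # Keep up to max_in_flight calls outstanding at the same time.
    semaphore = asyncio.Semaphore(max_in_flight)

    async def bounded(q):
        async with semaphore:
            return await fake_llm_call(q)

    return await asyncio.gather(*(bounded(q) for q in questions))


def timed(coro_fn, questions):
    """Run one of the strategies above and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(questions))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    qs = [f"q{i}" for i in range(8)]
    _, t_seq = timed(run_sequential, qs)
    _, t_con = timed(run_concurrent, qs)
    print(f"sequential: {t_seq:.2f}s, concurrent: {t_con:.2f}s")
```

With real APIs the concurrency limit would also have to respect provider rate limits, which this sketch ignores.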
Awesome. So let's discuss these individual sweeps in a little more detail, and what we're looking at here, starting with a chunk size sweep. It's probably what you would imagine: here we're varying the size of our chunks.
We have 500 and 1,000 token sizes here, and then we're able to evaluate these to see our relevance or correctness, or really just the overall performance of using this chunk size. You can see here we're holding K fixed at four. So in this example we're not varying K, we're just varying our chunk size. For K it's very similar, just reversed. Now we're varying K: we have four and six as our K values, and we're comparing those.
And we're keeping our chunk size at 500, so holding that fixed, and again we can sweep through this to get overall performance across the different K values. K is what we're loosely calling your context window here: it's the number of chunks that you're going to return. So definitely an important one.
And then we can also look at the retrieval transformation. That's what we were talking about earlier with original, re-rank, or HyDE. There are different transformations that you can try and compare. Here we're taking a look at original, and again we're able to compare, and we can swap these out for different parameters as well. It gives you actual results that we can analyze to determine the best strategy.
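As a rough illustration of the sweep just described, the grid over chunk size, K, and retrieval method can be written as a simple loop. This is a sketch under assumptions, not the released scripts: `sweep` and `dummy_evaluate` are hypothetical names, and a real `evaluate` would rebuild the index, run the test questions, and score them with evals.

```python
from itertools import product


def sweep(chunk_sizes, k_values, methods, evaluate):
    """Call evaluate() once per combination and collect one row per setup."""
    rows = []
    for chunk_size, k, method in product(chunk_sizes, k_values, methods):
        score = evaluate(chunk_size=chunk_size, k=k, method=method)
        rows.append(
            {"chunk_size": chunk_size, "k": k, "method": method, "score": score}
        )
    return rows


def dummy_evaluate(chunk_size, k, method):
    # Placeholder only: returns a constant instead of a real eval score.
    return 0.0


rows = sweep(
    chunk_sizes=[500, 1000],
    k_values=[4, 6],
    methods=["original", "rerank", "hyde"],
    evaluate=dummy_evaluate,
)
print(len(rows))  # one row per (chunk size, K, method) combination
```

Because every combination triggers a full run over the question set, the number of rows is also a quick estimate of how long a sweep will take.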
So this sweeping method is one answer to the question we were asked about how to find a way to improve our retrieval. This might be one of the ways you can do that, along with the evals we were talking about earlier. So let's talk about the results that you actually get and what you can make of them. This is a set of results from a chunking and retrieval eval that we did for a chatbot we built on the Arize docs. So this is a chatbot that's powered by our Arize documentation. We're testing on some questions that we usually get from our customer base.
And I really want to preface all of this by saying that you might not see the same insights in your use case, right? You're using different docs, so insights are going to vary. I don't want you to come away from this thinking, "Well, Sally Ann and Jason said this is how it is." This is just for our documentation, so definitely try it out with your own. But the idea is that you are going to get insights. So if we look here, we have four different chunk sizes, and we're also looking at two different methods and three different values of K.
So there's a lot going on. We have all of the evals that we were just talking about in these results. If you look at all of the re-rank ones, focusing on this upper left-hand corner here, this is chunk size 100 with the two different retrieval methods. I think something worth noticing here is that we're definitely getting a bit better performance in terms of retrieval.
So again, this is retrieval, not question answering. We're seeing better performance there with re-rank. That's one insight we can see right away. And then you can see that perhaps, as we get larger in our chunk sizes or use more K, our performance goes down a little bit. Yeah.
And the expectation is that as your K gets bigger, your retrieval performance does drop. And it doesn't drop quite as drastically; the slope here is not as drastic as you would estimate if it were not working better. But sure, let's keep going on the other ones. Cool. Yeah.
Okay. Did you want to discuss these, or did you mean the other results? I think the next... The next results? Yeah, the next slide. So on this one I want to point out that the metric we were using is precision at K, but of course we can also look at MRR. One thing to note about MRR is that it's more affected by the order of the retrieved context.
So that's definitely something to keep in mind. It is a little bit relative, but just something that I think is worth noting. And we're seeing the same or similar patterns that we saw on our previous slide with precision at K. You can see here, again, re-rank is just slightly better. It's hard to see on these charts because they're not next to each other, but we're still seeing that.
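To make the two metrics concrete, here is a minimal sketch of precision at K and MRR computed from per-chunk relevance judgments (1 for relevant, 0 for not). The function names are illustrative rather than taken from any library, and the relevance lists in the usage lines are toy inputs.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k retrieved chunks that were judged relevant."""
    if k <= 0:
        return 0.0
    return sum(relevance[:k]) / k


def reciprocal_rank(relevance):
    """1 / rank of the first relevant chunk, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(relevance_lists):
    """Average reciprocal rank over a set of queries."""
    if not relevance_lists:
        return 0.0
    return sum(reciprocal_rank(r) for r in relevance_lists) / len(relevance_lists)


# Same two relevant chunks, different order: precision at K does not change,
# but the reciprocal rank does, which is why MRR is more order-sensitive.
relevant_first = [1, 1, 0, 0]
relevant_last = [0, 0, 1, 1]
print(precision_at_k(relevant_first, 4), precision_at_k(relevant_last, 4))
print(reciprocal_rank(relevant_first), reciprocal_rank(relevant_last))
```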
And then similarly, we're seeing maybe worse performance as we increase our K or move to larger chunk sizes. I do think these are the two metrics that we really recommend using for this retrieval use case. We definitely see places where we have better retrieval and that then does lead to better QA. But there's also one other component that we might want to consider. Before I do that, I did see... oh, it's just the chat.
I thought maybe we had some questions coming in. Yeah. And I would say again, some of these are really good: MRR on clusters, which is what we do in Phoenix, or MRR overall, on an item-by-item basis, can be really helpful for troubleshooting individual problems. And then it's correlated with the next thing we're going to talk about, which is overall Q&A performance. Absolutely, yeah, this is just a small piece of what Phoenix really has to offer in terms of troubleshooting.
But another component that I'm sure everyone here has thought about when building LLM systems is latency. This is a really big thing to consider, because as you choose a more complex approach, there are trade-offs to expect between your performance and latency. So we really have this job of trying to balance these. We mentioned that re-rank is more computationally expensive, so it's really no surprise that, being a little more involved, it comes with higher latency. So that might be something we need to balance against using re-rank to get better retrieval.
But if latency is super important, we might want to opt for a different method. But yeah, Jason, did you have something to add? Yeah, my addition here: these are Cohere re-rank calls within LlamaIndex. And I think for anyone who is putting out anything that interacts with a human, you almost wouldn't be able to use the ones on the right. With K of four or six and small chunks, maybe 500 chunk size... I mean, I would probably go with the smaller window. You're going to see in the Q&A performance that the simple method and smaller windows are probably the stronger thing to go with.
So I think it's important not to get caught up in the fancy new retrieval algorithm or retrieval approach that someone has put out there. It's: how well is it doing, and how much delay does it add? For sure. And before you make that change, you might see this really fancy new method and be quick to say, "Let me put that in my system." But this allows you to actually see what impact that's going to have in terms of latency and end performance. So you can see: is it even going to help me improve my performance? This is definitely useful for that.
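One way to put a number on "how much delay does it add" is to time each retrieval method over the same queries and compare summary statistics. The sketch below is an assumption-laden illustration: `plain_retrieve` and `slow_rerank_retrieve` are hypothetical stand-ins (the second one just sleeps to mimic an extra re-rank call), and the p95 uses a simple nearest-rank approximation.

```python
import statistics
import time


def measure_latency(retrieve, queries):
    """Return per-query wall-clock latencies in seconds for a retrieval callable."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        retrieve(query)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Median and approximate (nearest-rank) 95th percentile of the latencies."""
    ordered = sorted(latencies)
    p95_index = max(0, round(0.95 * len(ordered)) - 1)
    return {"median": statistics.median(ordered), "p95": ordered[p95_index]}


def plain_retrieve(query):
    # Hypothetical stand-in for a plain embedding lookup.
    return [query]


def slow_rerank_retrieve(query):
    # Hypothetical stand-in for retrieval plus an extra re-rank round trip.
    time.sleep(0.02)
    return [query]


queries = [f"question {i}" for i in range(5)]
print("plain:  ", summarize(measure_latency(plain_retrieve, queries)))
print("re-rank:", summarize(measure_latency(slow_rerank_retrieve, queries)))
```

Reading these numbers next to the retrieval and QA scores for the same setup is what lets you judge whether the extra quality is worth the wait.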
So these last few slides have been looking at this from a retrieval perspective, but as we talked about before, there's that second component, the Q&A performance: how well are we doing overall? That's what we have summarized here for you. Now, there are a lot of different visualizations here, and we're looking at a lot of different components. We have chunk size, methods, and our K values, and then we're looking at the percentage of incorrect answers from our QA evals. And here there's kind of a sweet spot in the middle, around 300 or 500.
You can see it here with this pretty low incorrect percentage. It's a little bit better than the 100-size or the 1,000-size chunks. And something you're able to do here is start asking questions when you see these results. One question I had when I was looking at these came from this graph here, where we're at chunk size 1,000, we're using original, and we have K of 10, which is a pretty big context window. And so this makes me start thinking: I have a lot of incorrect responses here.
Perhaps I'm overflowing the window; maybe this is the lost-in-the-middle problem, where relevant context is getting lost and forgotten by the LLM. These results allow you to ask those questions, dig into them, and then give you the information to balance these different components and really optimize your system. I think one of the more interesting patterns, and we're checking whether it's consistent, because we're running this both on our docs and on the docs of a bunch of customers who are using it: going from K of four to six to 10, performance pretty consistently gets worse, independent of retrieval method. Mm-hmm.
So part of me wants to dig in deeper and confirm that result is real. It does seem to push toward keeping the window simple and keeping the retrieved documents high quality: the more garbage you put into the window, the worse your results are going to be. We'll see how well this holds as we test on more docs and get more experience. For sure. I'm sure you feel the same way, Jason, but I imagine all of this is going to change probably 10 more times in the coming months and years as this space continues to evolve. So it's an exciting time, in that we get to try all of this out and find the methods that work best.
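The "percentage of incorrect" view discussed above is just a group-by over QA eval labels. A minimal sketch, assuming each eval record is a tuple of (chunk size, method, K, label) with label "correct" or "incorrect"; the records below are toy placeholders, not results from the session.

```python
from collections import defaultdict


def percent_incorrect(records):
    """Map each (chunk_size, method, k) setup to its percentage of incorrect answers."""
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for chunk_size, method, k, label in records:
        key = (chunk_size, method, k)
        totals[key] += 1
        if label == "incorrect":
            wrong[key] += 1
    return {key: 100.0 * wrong[key] / totals[key] for key in totals}


# Toy records for illustration only.
toy_records = [
    (500, "original", 4, "correct"),
    (500, "original", 4, "incorrect"),
    (1000, "original", 10, "incorrect"),
    (1000, "original", 10, "incorrect"),
]
for setup, pct in sorted(percent_incorrect(toy_records).items()):
    print(setup, f"{pct:.0f}% incorrect")
```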
Alright, cool. So we've covered a lot of information in the last little bit here. We've understood what evals are, we've gotten to the building blocks, and we now know how to look at the results and what kind of results we can expect from running evals. I think an important thing to touch on here is the challenge that we're seeing in production when it comes to some of these RAG systems.
And that really is the fact that a lot of our customers, and a lot of applications and organizations out there, have a lot of documents. This could yield millions of chunks in your vector database, but at the same time we need to limit the number of chunks to something reasonable so that we have reasonable retrieval. And we also need to get the right pieces of context from our knowledge base. So it's a really interesting and complex problem that we have on our hands. And as we talk about AI memory and using LLMs on customer data or your own data, it's really going to be this problem that we're dealing with, and we're going to be dealing with it for a while.
It's likely that this is really the future of these LLM ecosystems. So what we're really talking about here is: how do I manage how many chunks I have, how well is my retrieval actually doing with those chunks, and how do I get the data into the prompt in the best way possible? That's the problem we have at hand. Anything to add to this before we talk about where we're headed? Yeah, my only point, and we're going to talk about some of the things people are doing to solve this, is that fundamentally, as you think about the next five years, the context window of LLMs is going to get bigger, but it's not going to be infinite. And you're going to have documents, you're going to have data you want to give it.
And that data is big, and there's typically a lot of it. So this problem of how to feed the LLM good data, and the right data, into the context window, I think it's bigger than just this RAG moment right now. How to make LLMs work with your data is a fundamental problem over the next X number of years. And thinking about this is something we do a lot.
And one of the things we're seeing right now is simple ways, kind of the hammer method, to help with this, and that's the next thing we'll throw out. Yep. So what Jason's talking about here is that what we're really finding is that a combination of semantic search with structured filtering capabilities might be a potential solution to this. The idea is that we're adding in metadata. Here we've added location metadata: we have US, we have France, and we might have other countries mixed in there.
What this allows you to do is filter for a specific set of data and then do your semantic search. If you think about traditional query methods, SQL for example, people like it because it's really efficient at filtering down your data and getting just the data that you need. And I think we believe that as vector databases continue to evolve, we're going to see these filters being integrated more and more, and it might be the fix for how we can manage these millions and millions of chunks and make sure that we're getting the best possible pieces of context for our user query. Yeah.
I think this is the point: as long as the search method is a distance approach looking at embeddings in latent space, it is a very coarse search. You're essentially searching within a distance sphere in latent space. And with that coarse method, the more chunks you have, the more likely you're just not going to get the right thing back. So as long as that search method remains simple, fast, and distance-based, and it could get more complex, but right now it is what it is. One of the nice solutions, an easy, simple one that everyone gets, and it's one you probably already have, is breaking up your data.
You're not just stuffing all your data in, but breaking it up by some easy groups: by product category, by region, or, as e-commerce sites do, by the filters someone might have as part of the search. So it's recommended to think this through, especially as you're building bigger systems. For sure. And going back to that question we got early on about how you improve your retrieval, this might be another method you try out. Cool. Awesome.
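The filter-then-search idea can be sketched in plain Python: first keep only the chunks whose metadata matches, then rank what remains by cosine similarity. This is a toy in-memory illustration, not any vector database's API; `filtered_search`, the two-dimensional embeddings, and the chunk IDs are all made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec, chunks, k, **filters):
    """Keep chunks whose metadata matches every filter, then return the top-k by similarity."""
    candidates = [
        c for c in chunks
        if all(c["metadata"].get(field) == value for field, value in filters.items())
    ]
    ranked = sorted(
        candidates, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True
    )
    return ranked[:k]


# Toy chunks with made-up embeddings and a location tag.
chunks = [
    {"id": "us-1", "embedding": [1.0, 0.0], "metadata": {"location": "US"}},
    {"id": "fr-1", "embedding": [0.9, 0.1], "metadata": {"location": "France"}},
    {"id": "fr-2", "embedding": [0.0, 1.0], "metadata": {"location": "France"}},
]

top = filtered_search([1.0, 0.0], chunks, k=1, location="France")
print([c["id"] for c in top])
```

Without the filter, the US chunk would win on distance alone; the filter shrinks the candidate set before the coarse distance search runs.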
Well, that is the end of our presentation. We'll do questions from now on, but these are some links for you all. You can go to some Colabs, as well as our GitHub for RAG. Also check out our Phoenix tool, and leave our GitHub a star if you feel inclined to do so. But yeah.
What questions do you have for us? Yeah, there were a bunch during the... First of all, thanks for that great session. What's the difference between original and re-rank? Yeah, so original is just: embed your query and get semantically similar content. It's the very traditional RAG approach that most people are using. And then re-rank uses Cohere's re-rank to re-rank the retrieved context, using LlamaIndex plus Cohere's version of re-rank.
Gotcha. And we have another question: in terms of the results, what would you do next? Let's throw those back up. Yeah. In terms of the results, just walking through this one: if I were going to choose the parameterization for my system right now, I would choose 300 or 500 in terms of chunk size, maybe even 300 in this case.
And probably K equals four or six; keep it on the small side. So I think what I would do is choose my parameterization, so this would help me in my original setup. I would probably use Phoenix to catch the rest of my problems. As I'm getting customers sending in more data, or I have test sets I want to run against, I'd use Phoenix to really map out: where is MRR failing? What does this group look like? Is it garbage chunks or something? So I would use this for my original parameterization, and then I would use tried-and-true troubleshooting: building good test datasets, running them, and using those tools to narrow down the general problems that you're probably going to hit constantly and consistently. And even once you get rid of them, you'll have more as you get to production.
So that's probably what I would do. And sorry, the next one: "I didn't see chunk overlap." Yeah, chunk overlap. Good question.
Great question. Chunk overlap is another parameter. We'll be adding it to the script. It's not in the script now, but it's another common one that people use to improve performance. The idea is to get some overlap between the chunks, versus having them completely disparate.
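For readers who have not used it, chunk overlap means that consecutive chunks share some tokens at their boundary. A minimal sketch, assuming whitespace-split words as a stand-in for real tokens; `chunk_tokens` is a hypothetical helper, not part of the scripts mentioned here.

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split tokens into chunks of chunk_size, each sharing `overlap` tokens with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


words = "one two three four five six seven eight nine ten".split()
print(chunk_tokens(words, chunk_size=4, overlap=0))
print(chunk_tokens(words, chunk_size=4, overlap=2))
```

Overlap increases the number of chunks for the same text, so it also affects index size and sweep time.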
Good catch. It's not a parameter in the current script. And I don't know about you guys, but on the Milvus and Zilliz team we definitely get a ton of questions about chunking strategy. You guys have a lot of experience, right? You've built this amazing tool set.
Where do you feel most people go wrong, what do you see most commonly, and how would you recommend people avoid those first mistakes? Yeah. I think the core way people go wrong is not having measurements and going after the new shiny cool thing. My push for everyone is to put some measurements in place that help you understand how this strategy does versus that strategy. And the other place I think people go wrong is that the data is kind of important. So put together some test questions or some test sets in the beginning. Sometimes you don't have them, and you can generate them: we did a session with LlamaIndex earlier this week where they have some great tools for generating questions from chunks. Synthetic data can help.
It doesn't replace the data that you have. Jerry's note was that the test questions can be garbage if your chunks are garbage, so be wary. So I think the big one is data: come up with data, come up with tests. And then I think the results seem to show: don't go too big on chunk size and don't go too small. There are some sizes right now that are kind of nice given the context window and strengths of these models. I also think considering what kinds of queries you expect, and what your system is designed to do, can be helpful in determining your chunk size. So just another helpful hint there. So you mentioned that you were testing on your own product documentation.
Does testing on something that you have domain expertise in help you interpret the results of these evals? Like that kind of human checkpoint, or how does that factor in? Yeah, I can give my answer and then I'll let Jason elaborate on it a little bit. I think when you're looking at your own LLM, and you know what data it has and what the task at hand is, it does make it a little easier to evaluate the results you're getting back. For our documentation, for example, we know that we don't have any pricing docs in our publicly available documentation. So if we're looking at evals and we have a question on pricing, we know it's not going to have the context to answer that question. Right.
So then maybe it's not a retrieval problem if that's what we're seeing in our evals. Maybe it's more about our knowledge base, or about prompt engineering, making sure that it's not answering questions that it doesn't know the answer to. So I definitely think, maybe not domain expertise, but understanding what your knowledge base is, what documents it has available to it, and what you're actually expecting this LLM to do is really helpful in evaluating these results. Jason, do you have anything to add on that one? I think it was good.
I think it was a good capture of the question. Thanks. Perfect. And we have another question in the chat: there are also RAG techniques that apply an LLM in the retrieval phase.
For example, we can retrieve more than enough chunks and use an LLM to filter out the irrelevant ones. This way, since all chunks are made to be relevant, NDCG, DCG, and MRR become less useful, I assume. Yeah, great question. By the way, we did use a fancier technique, HyDE.
It's an example in here where we generate a more complex, better-centered question. But I think some of those techniques actually have a lot of promise. Given the results here, the one you just mentioned, getting stuff that's not relevant out of your context window prior to the generation step, I think actually has a lot of promise. Again, I come back to: let's look at measurement, let's look at results. I would also come back to this: even with those fancier techniques, I still think there are a lot of things that could go wrong.
So I don't think you get rid of all these metrics; I think you're going to want them as ways of finding problems in your system. But that was a good point. And another question from an attendee: what are some other ways to re-rank without using Cohere or a basic cross-encoder? Yeah, I mean, if you're going to do it without Cohere...
I think you're probably either using an LLM, or an internal LLM if you're on the fancier side. And if you can do local models, I would say there are some approaches to look at, like Hugging Face-type things. But I think the cross-encoder and an LLM are probably the approaches I see more of. The challenge with those that we've seen right now is just the time it takes. For those of us dealing with real-time systems, that is a challenge: these double calls back are taking a bit of time. So I think that's the one everyone's fighting with right now: to do the re-rank, which you could probably get a lot of value out of if you drop stuff out of the window, and to do it fast, you probably do need those local models in some way.
Great. Thank you. I have a question, just from personal curiosity. Anything you guys want to share that people should keep an eye out for from the Arize and Phoenix team? Oh, yeah, so one of the key points with these evaluation sessions is that we're trying to do more of these.
We're going to add a session to the end of the evals set here on synthetic data generation. We feel that's a key part of getting started with your evals and getting some data together. It's not a replacement for your current dataset. So expect another session on that. And then Phoenix has a lot coming.
There are some great tools in there for visualizing your retrieval and embedding space. We're tying it a little better to the spans and traces work that's coming. Expect a lot; that team has a lot in the works that we haven't rolled out. Cool. I think that's it on questions for the moment.
We'll give everyone just a last minute. But in thinking about what we've talked about today, is there anything you want to share at the last minute, any takeaways? Feel free to join our community; we'd love that. We talk about this stuff all the time. The Arize community is pretty large, and this is on our mind.
Feel free to drop notes and follow up there too if you have any questions. We're all technical folks who love the space, trying to add measurement and data, a little bit of science, to it. Cool. Sally Ann, would you mind sharing the QR code one more time? Perfect. This is your last chance to grab those codes.
Make sure you connect with the Arize team and join their community. They've got a lot of really cool stuff going on. We will have the recording up probably later today for all of you who are asking. Thank you so much for joining the session. Sally Ann and Jason, this has been really wonderful.
We really appreciate you stopping by, and we look forward to seeing everything that you guys build and create. Thank you for having us. Appreciate it. Yeah, thanks for having us. Bye.