Webinar
Evaluating Retrieval-Augmented Generation

What will you learn?
Retrieval-Augmented Generation (RAG) has emerged as a cornerstone technology powering the latest wave of Generative AI applications, from sophisticated question-answering systems to advanced semantic search engines. As RAG's popularity has grown, we've witnessed a proliferation of methods promising to enhance the traditional RAG pipeline. These innovations include query rewriting, intelligent routing, and result reranking—but how do we measure their real impact on application performance?
Join us for an informational webinar where we'll explore robust evaluation frameworks, including LLM-as-a-Judge methodologies, industry-standard benchmarking datasets, and innovative synthetic data generation techniques. By the end of this session, you'll master practical approaches to evaluate and optimize RAG systems, equipped with the knowledge to implement these tools effectively in your own applications.
Topics covered:
- LLM-as-a-Judge
- MT-Bench
- LM Eval Harness
- Synthetic data generation
Today I'm pleased to introduce today's session, Evaluating Retrieval-Augmented Generation, and our guest speaker, Stefan Webb. Stefan is a Developer Advocate at Zilliz, where he advocates for the open-source vector database Milvus. Prior to this, he spent three years in industry as an Applied ML Researcher at Twitter and Meta, collaborating with product teams to tackle their most complex challenges. Stefan holds a PhD from the University of Oxford and has published papers at leading machine learning conferences such as NeurIPS, ICLR, and ICML. He is passionate about generative AI and is eager to leverage his deep technical expertise to contribute to the open-source community.
Welcome, Stefan, and get started with your session. Yeah, thanks so much, Achi, for the introduction. And I wanted to say thanks to the attendees for making it here. I know there are so many competing webinars and in-person events in generative AI, so thanks for choosing this one.
Firstly, before we get started, I've got my contact details on this slide. My role as a developer advocate is to bridge the gap between the developers and the users of Milvus, our open-source vector database, which means I really love connecting with anyone, essentially. So if you want to have a chat, or you've got questions about vector databases or RAG, we'd love to hear about that, and also about what you're building. I assume a lot of you are in startups building some really cool new applications, so feel free to send me an email at that address.
Follow me on LinkedIn or send me a LinkedIn message; that's all very welcome. But let's get started. The topic for today: I originally called it Evaluating RAG, but then I thought a better title would be one that implies the motivation behind evaluating RAG, which is building a principled RAG pipeline. That means building all of the components in your RAG pipeline so that you know the specific design choices you make are actually increasing the performance of your application.
Here's a brief outline. I'll start off with a brief introduction to the problem and why it's important. I'll then give an introduction to some evaluation methods for large language models and RAG in particular. Then we'll talk about some of the challenges and limitations, especially as they relate to evaluating RAG using large language models rather than a predefined benchmark dataset. And then we'll briefly discuss some open-source evaluation frameworks.
With regards to that fourth part, I originally had a larger scope in mind for this webinar, but after I got some requests to really cover the basics, because it's a new topic for many people, I decided to break it up into at least two webinars. This one will focus on the motivation and the basic ideas behind evaluating LLMs and RAG, and then in a second webinar we'll dive deep into evaluation frameworks like Ragas and others. So let's introduce the problem and discuss why it's important. (Sorry, my cat's just making a noise over here.) A motivating example I like to use is semantic search.
I've got a picture here of a search result from Perplexity AI, this new generative-AI-based search engine. Maybe people could write in the chat: have you used a product like perplexity.ai before? And if so, what have your impressions been relative to, say, a more traditional Google search? I would say this is one of the killer applications of generative AI. We've got Ken Show here saying they love it for higher-quality and more accurate search results than GPT.
I can definitely relate to that. Behind the scenes, this is built on some sort of RAG pipeline. And for that reason, when we're building a system like this, a question arises: how are we going to optimize it? What changes are we going to make to the architecture behind the scenes so that we're actually serving people better results over time? I think that's why it's such an important question. Semantic search is a killer application of generative AI, but only if we can continuously improve it, measure those improvements, and know that we're actually moving in the right direction.
Why do I think this is a killer application of generative AI? Because semantic search is a really effective way to search over what's called unstructured data. Unstructured data includes documents, images, audio, geolocation data, all the types of data that don't have a readily available feature representation, unlike, say, tabular data. By 2025, one estimate reports, 90% of newly generated data will be unstructured, so semantic search like Perplexity AI is going to become even more important over time. We're going to be talking about how to evaluate RAG and how to improve RAG systems, but first, a quick refresher.
Here's how a very basic RAG system works. We start off with a knowledge base of documents, images, et cetera, that we want to be able to search over and pass as relevant chunks to our chatbot. Beforehand, or offline, we take that knowledge base and run it through an embedding deep neural network. That turns chunks of text, or whole images, into vectors known as embeddings, which are just sequences of real numbers. Those numbers in some sense capture the meaning, or the semantics, of the input.
Once we've converted these unstructured data items into these more machine-learning-friendly embeddings, we put them in a vector database like Milvus, which is especially designed to store large amounts of embeddings and perform efficient search over them. You may have a hundred million embeddings; at Perplexity AI's scale, I imagine it could be hundreds of billions or trillions. When you get to that web scale, you want to be able to run a search and find the nearest matches to your query. So that's the first step: you take the data you want to search over, you embed it, and you put it in the vector database. Then the user comes along with a query.
The user asks a question to the chatbot. Rather than the large language model answering the question directly, the system takes the query, puts it through a deep neural network to get a query embedding, and then looks up the vector database to find the closest matches to the query among the items stored from our knowledge base. It retrieves those and adds them as context to the prompt that is fed into the LLM chatbot. This is a way we can add external knowledge to a chatbot, and I think it's really useful because knowledge changes constantly, much more quickly than you would want to retrain your LLM.
So you can keep your LLM fixed and just vary your vector database to adapt to a change in knowledge. Or, alternatively, you can have your vector database capture your own proprietary data: then you don't have to train the LLM, you just put your proprietary data into the vector database and search over it. Okay, so that's a quick refresher on RAG, and what I showed there was a very basic RAG. RAG these days has a number of different components; it's become a modular system of steps that perform different types of functions and can be put together in a near-infinite number of ways.
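To make that basic loop concrete, here is a rough sketch, assuming the `pymilvus` MilvusClient with Milvus Lite and a `sentence-transformers` embedding model; the final LLM call is left as a hypothetical `some_llm()`:

```python
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dimensional embeddings
client = MilvusClient("rag_demo.db")                 # Milvus Lite: a local file-backed instance
client.create_collection(collection_name="docs", dimension=384)

# Offline: embed the knowledge base and store the vectors.
docs = [
    "Milvus is an open-source vector database.",
    "RAG retrieves relevant chunks and adds them to the prompt.",
]
client.insert(
    collection_name="docs",
    data=[{"id": i, "vector": embedder.encode(d).tolist(), "text": d}
          for i, d in enumerate(docs)],
)

# Online: embed the query, retrieve the nearest chunks, build the augmented prompt.
query = "What does a vector database store?"
hits = client.search(
    collection_name="docs",
    data=[embedder.encode(query).tolist()],
    limit=2,
    output_fields=["text"],
)
context = "\n".join(hit["entity"]["text"] for hit in hits[0])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# answer = some_llm(prompt)   # hypothetical LLM call; any chat model works here
```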
So here are some of the main design choices you have to make when you're designing a more advanced, performant RAG system. Firstly, the question is input into the system, as in the previous example. And speaking of questions, feel free to put a question in the chat at any time; I'll try to stop as we go and answer them.
The question from Austin is: how are we comparing vectors produced by embedding different input types? For example, if we embed a website and a user query, these are very different entities; what approaches are there to make these items more comparable? That's an excellent question. Suppose we have images and text documents in our knowledge base, or suppose our knowledge base is images and the query is text. To be able to compare inputs from different modalities, as we call them, we need a model that's specially trained to produce embeddings that align to the same space. These are called multimodal embedding models.
It's a topic I've given whole talks on in the past, but in a nutshell, how it works is that we have a training dataset of matching image-text pairs, and we train so that the corresponding embeddings for those matching pairs are close in the embedding space. So the answer to the question is: we need special embedding models that are trained to work on multiple modalities, like image and text.
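The key point from that answer, that both sides have to be embedded into the same space before similarity is meaningful, in a minimal sketch assuming `sentence-transformers` (for text plus images you would swap in a jointly trained model such as a CLIP checkpoint):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # the same encoder for queries and documents

query = "How do I reset my password?"
passages = [
    "To reset your password, open Settings and choose Security.",
    "Our offices are closed on public holidays.",
]

# Because both sides go through the same encoder, cosine similarity is meaningful.
q_emb = model.encode(query, convert_to_tensor=True)
p_emb = model.encode(passages, convert_to_tensor=True)
print(util.cos_sim(q_emb, p_emb))   # shape (1, 2); the first passage should score higher
```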
Okay, so back to all the different design choices you potentially have to make when you're designing a RAG system. The first one that comes along is: what sort of processing should you do to the query? If you look at the red box here, query translation, there are all these different techniques to translate or transform your query to make the retrieval more effective. Let's take a look at this one here, pseudo-documents: a question goes as input into a brain icon with HyDE written underneath. With this particular technique, we take our query and ask a large language model to synthesize a hypothetical document chunk that would answer that query.
Then, instead of searching the vector database with the query, we search with the embedding of the hypothetical document. It's been shown that this can often produce much better results, because the hypothetical document is much more similar to the document chunks you embedded in the offline stage. And there are other types of query translation you can do as well.
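A sketch of that HyDE-style query translation, where `generate` is a hypothetical stand-in for whatever LLM you call and `client` is a MilvusClient like the one above; the only change from plain retrieval is that the hypothetical answer, not the raw question, is what gets embedded and searched:

```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def hyde_search(question, client, generate, top_k=5):
    """HyDE: embed a hypothetical answer instead of the raw question."""
    # Ask the LLM to write a passage that *would* answer the question.
    hypothetical_doc = generate(
        f"Write a short passage that answers this question:\n{question}"
    )
    # Search the vector store with the embedding of that passage, not the question.
    return client.search(
        collection_name="docs",
        data=[embedder.encode(hypothetical_doc).tolist()],
        limit=top_k,
        output_fields=["text"],
    )
```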
After that, the next decision might be routing: where are we going to look up the data? It doesn't always have to be a vector database; it could even be multiple vector databases. We have to decide how to route the query to a specific data source, whether that's a graph database, a relational database, a vector store, or one of several vector stores, and we can make that choice depending on what the query is, in an intelligent way, if we design the system like that. Then, in the next step, query construction: how will we actually construct the query? If we're searching a relational database, how do we go from our text query to a SQL query against that database? With regards to the indexing step, there are many different ways we can break up our knowledge base, embed it, and index it in the vector store. Take the example of our documents: do we want to break them up into paragraphs? Into fixed-length chunks? Do we want to use some sort of semantic information to decide where to make those breaks? That's just one very basic decision you have to make with the indexing.
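Two of the simplest chunking choices just mentioned, fixed-length windows versus paragraph splits, as a sketch:

```python
def fixed_length_chunks(text, size=500, overlap=50):
    """Fixed-size character windows with a small overlap between neighbours."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def paragraph_chunks(text):
    """Split on blank lines so each chunk is one paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

# Which chunker retrieves better is exactly the kind of question the rest of
# the talk is about: change the chunker, hold everything else fixed, and measure.
```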
Then, after you've retrieved your relevant documents from your graph database, relational database, or vector database, we can post-process the results to try to improve the relevancy. Often people apply a re-ranking procedure: we take our documents and put them through some sort of model that can rank them in terms of relevancy and maybe filter out the ones that are less relevant. Or, if we've got results from multiple sources, we'll need some method to combine the rankings from those different sources. And then, in the generation stage, do we just want to generate the answer and be done, or do we want some sort of active retrieval, where the system reflects on generation quality and that feeds back into question rewriting and perhaps re-retrieval of documents?
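A sketch of that re-ranking step using a cross-encoder; here the `sentence-transformers` CrossEncoder with a public MS MARCO checkpoint is assumed, but any reranker model slots in the same way:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, chunks, keep=3):
    """Score each (query, chunk) pair with the cross-encoder and keep the top chunks."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:keep]]
```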
These are the main design choices you have to consider when you're building a more modern RAG system, and the question is: how should you do it? I've taken this table from a survey paper on RAG. It's a summary of different RAG methods, and it's very incomplete; you can see how many different techniques and research papers there are on all the design choices you can make in your RAG pipeline. And on the right-hand side I've got some very concerned, anxious designers who are thinking: how do I cut through this complexity to actually build a system that performs well? The answer, of course, is that we shouldn't be designing these systems at random. We shouldn't just add a component because we've heard it gets high performance on someone else's benchmark. We need to do it in an informed way, and that informed way is experimentation.
We need to run well-designed, rational experiments to make these design choices. How that works is: we first consider the alternatives, and ideally we'll only be changing one factor at a time. Keeping everything else held constant, we vary one factor of the design. For example, we might compare the very naive, basic RAG system to one where everything else is the same but we've added in this hypothetical-document method. We've held everything else constant and varied just one thing.
Once we've got our alternatives (there may be a single alternative or several that we're considering at a time), we then have to be able to measure the effect of that change. We need some measure of goodness, of performance, to be able to say: was that a beneficial change, or was it beneficial in some respects and perhaps not others? After that, if we've figured out how to do those two steps effectively, we can rinse and repeat this process, making additional design choices and guiding the design path in a principled way. I think that part is pretty simple; the challenging thing is how we measure the effect of making a change to our RAG pipeline. What are we measuring? How should we measure it? Does it even make sense? That's what we're going to talk about for the rest of this webinar. Evaluating RAG, and evaluating foundation models in general, is one of the most important topics to grasp in applied generative AI, because you just can't design a state-of-the-art system without a principled methodology like this.
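That experimentation loop, reduced to a sketch; `build_pipeline` and `score_answer` are hypothetical stand-ins for your own pipeline constructor and whatever metric you settle on:

```python
def compare_variants(questions, variants, build_pipeline, score_answer):
    """Evaluate each pipeline variant on the same questions with the same metric."""
    results = {}
    for name, config in variants.items():
        pipeline = build_pipeline(**config)                    # e.g. toggle HyDE on or off
        scores = [score_answer(q, pipeline(q)) for q in questions]
        results[name] = sum(scores) / len(scores)              # mean score per variant
    return results

# variants = {"baseline": {"use_hyde": False}, "with_hyde": {"use_hyde": True}}
# print(compare_variants(eval_questions, variants, build_pipeline, score_answer))
```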
Okay, so just a few notes on the scope of what we're going to talk about for the rest of the webinar. Evaluating RAG is a very large topic, so there are some things I'd like to cover today and other things I'd like to leave for future webinars.
The main focus of this webinar will be the fundamentals of the technique called LLM-as-a-Judge, where you use a judge large language model to evaluate the performance of another large language model, whether that's inside a RAG system or by itself. We'll be looking at evaluating the output of large language models and RAG primarily in an offline setting, although these ideas also carry across to online evaluation. We'll also talk about one framework, LM Evaluation Harness, that's really easy to use: with a few command lines you'll be able to start evaluating your pipelines, or your large language models, on these common benchmarks. We won't be covering some other really common frameworks, such as Ragas, because it's a bit of a larger topic and I wanted to keep the focus on the introductory ideas in this webinar.
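As a rough sketch of what "a few command lines" with LM Evaluation Harness can look like in Python, assuming the `lm-eval` package (0.4.x) and its `simple_evaluate` entry point; exact arguments may differ between versions:

```python
import lm_eval

# Run a small Hugging Face model on one standard benchmark task.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    num_fewshot=0,
)
print(results["results"]["hellaswag"])
```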
A follow-up webinar will primarily focus on Ragas, as well as some of the alternatives to LLM-as-a-Judge. So we won't have a super deep discussion of LLM-as-a-Judge today; there are other methods that have built on the research from the LLM-as-a-Judge paper and try to address some of its shortcomings, but again, we'll leave those for a future webinar. We also won't be talking about evaluating agents specifically, although many of these techniques are applicable to agents and agent workflows.
We won't be talking about evaluating multimodal models, or evaluation in an online setting, but again, many of these ideas are relevant to those settings. And when we say performance, performance can mean several things. I'd say the two primary ones are performance in terms of generation output versus performance in terms of latency and throughput, the more time-based metrics. We are not going to be talking about latency at all.
It's a completely different topic; we'll really be focusing on how you evaluate the quality of the answers. One interesting topic I came across recently is the idea of adversarial attacks. Researchers have found that you can construct inputs and outputs in a certain way such that they fool either the LLM you're using as a judge model or the benchmark metrics. In that way you can construct an adversarial attack on the evaluation of your model, which I think could be particularly relevant if you're evaluating in an online setting,
where you're evaluating these real-time inputs and outputs from users. So again, a very interesting topic and something we'll potentially discuss in a future webinar. Okay, let's talk about some of the fundamentals, the basic ideas. I think the big question is: what do we actually mean by performance, or goodness, of the model or the system? To answer that, we have to make a number of distinctions.
So let me just go through some distinctions here. There's a distinction between the performance of the system on a task versus the performance of the large language model in and of itself, independent of a task. What I mean is: we can talk about the performance of a system on a specific task where we have predefined input-output pairs plus a ground-truth response. Then we can say: we know what the ground truth is; does the answer agree with the ground truth in some way? But then, quite separately, we can evaluate the model in and of itself.
What we can do there is take the input and the output and see whether the output aligns with the input with respect to some aspect of human preferences. These are things like: we have a preference, or perhaps it's just common sense, that the output should be relevant to the input. That's something that is independent of what the actual answer should be; it's about the consistency between input and output. Or it might be groundedness: does the output only reference facts that were contained in the context, rather than making things up? Again, this is independent of the ground truth.
It's a quality of the model, or of the system, in and of itself. So there are two very distinct types of evaluation there. Just to reiterate: task evaluation is a very specific, task-relevant metric, whereas evaluating these models in and of themselves is typically about whether the model aligns with human preferences in the ways we would expect. Closely related to that: when we do the evaluation, are we comparing the answer to a known ground truth, or are we comparing the output to the input or the context? With respect to RAG, there are also different parts of the pipeline we can evaluate. We can evaluate the retrieval part, which means evaluating what's returned from the call to the vector database, the relational database, or the graph database.
And then there's evaluating the actual output of the LLM itself. So there are two different parts of the pipeline we can evaluate. Then there's another, not completely mutually exclusive, dimension: do we have the ground truth or not? If we don't have the ground truth, we can treat human evaluation as the gold standard, but of course that doesn't scale well; we can't just fill in all of our labels with human evaluation.
It's just too expensive. So one of the key takeaways I want you to take from this presentation is that, perhaps surprisingly, strong large language models like GPT-4 or the latest version of Claude can actually evaluate, or judge, LLM outputs when there isn't a ground truth. I think this is kind of surprising, because it seems like cheating, or a hack: why should a model be able to evaluate itself, in a sense? But there's empirical evidence that these LLM judges agree with human judgments as much as, or more than, human judges agree with each other. And I think this is particularly relevant when you're using a strong LLM that's already really well aligned with human preferences, as opposed to a much smaller seven- or eight-billion-parameter model.
In terms of the task-based evaluations, we can break those down into several different types. There are ones that are knowledge-based: does the model give the correct answer to a question? There are ones that are instruction-following-based: does the model follow simple instructions in the desired way? And then there are conversational ones; for example, is the model able to correctly answer basic reading-comprehension questions about a dialogue? I've got some of the most common benchmarks, typically used in papers and leaderboards, up here. If you're interested in getting a sense of what's actually in these benchmarks, what the questions and answers are, et cetera, a good way is to find a version of that dataset on the Hugging Face website.
So I'm just going to go over here. For example, HellaSwag, a knowledge-based task benchmark: we can find it on the Hugging Face Datasets page, and there will most likely be many different variants, people adapting the dataset to their own purposes. In this one, we see that we're given an activity label, then a primary context, then two further contexts, A and B, and then these endings, and we're evaluating how the model completes the sentence.
And we have a label, the ground truth of which ending completes it. Or, for another example, let's look at CoQA, however you pronounce it. This one lists the source, which isn't really so relevant, but it also has a story, and the model has to take that story as context and then, for a number of questions, refer to the specific spans of text in the story that answer each question. In this particular example, it's talking about the Vatican Library, and the first question is: when was the Vatican Library formally opened? And it has a ground truth of where the answer starts in that text string, where it ends, and what it is.
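To poke at these benchmarks programmatically rather than on the website, a quick sketch with the Hugging Face `datasets` library; dataset names and field layouts vary between mirrors on the Hub, so treat these as assumptions:

```python
from datasets import load_dataset

hellaswag = load_dataset("hellaswag", split="validation")
print(hellaswag[0]["activity_label"], hellaswag[0]["ctx"])
print(hellaswag[0]["endings"], hellaswag[0]["label"])

coqa = load_dataset("stanfordnlp/coqa", split="validation")
print(coqa[0]["story"][:200])
print(coqa[0]["questions"][0], coqa[0]["answers"]["input_text"][0])
```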
So we've got this ground truth of what the answer should be, as a subset of the story, and this is testing comprehension, in a sense. These are just some very common task-based evaluations. One thing about task-based evaluations is that there are aspects of human preference alignment they don't capture, and that's why we would want to consider, separately, some complementary evaluation metrics that are more introspective.
And here are some examples of those. There are ones that evaluate the generation part of your RAG system or your LLM. One aspect of that, as I mentioned before, is the concept of faithfulness, or groundedness: is the output factually consistent with the information given in the context? Does it make additional claims that can't be deduced from the context? If so, that shows it's not aligned with this human preference. Next, answer relevancy.
Similarly: is the response actually relevant? Not "is it grounded in the context", but is the answer actually relevant to the question, or does it go off on a tangent? We can also construct similar metrics for the retrieval-based part of our RAG pipeline: are the retrieved chunks from our vector database, graph database, et cetera actually relevant to the question being asked? Separately, if we have a ground truth, we can use retrieval-based search metrics to score the search results from our vector database against the known ground truth. But for the first three of these, we can use a separate large language model as a judge to score the inputs and outputs, to score these metrics. How it works is that you have a separate judge LLM, and you construct a prompt that gives instructions on how to calculate the metric.
The prompt might be, for example: give a score from zero to one for whether this answer contains any facts that are not present in this context. So this is where we don't have a ground truth, but we can use a large language model to calculate these introspective metrics, and in doing so capture a different aspect of human preference alignment that isn't captured by the more task-based evaluations.
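A sketch of that kind of judge call, assuming the OpenAI Python client and a hypothetical zero-to-one faithfulness rubric; any sufficiently strong model, ideally one fine-tuned as a judge, could play this role:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_faithfulness(context, answer, model="gpt-4o"):
    """Ask a judge LLM whether the answer only states facts found in the context."""
    prompt = (
        "You are grading an answer for faithfulness to its context.\n"
        f"Context:\n{context}\n\nAnswer:\n{answer}\n\n"
        "Give a score from 0 to 1, where 1 means every claim in the answer "
        "is supported by the context. Reply with only the number."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return float(response.choices[0].message.content.strip())
```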
Okay, so very briefly, some challenges and limitations. When I first saw this, I thought: this can't possibly work. How can a large language model judge another large language model? It seems like a very circular argument. But then I saw the empirical results in the LLM-as-a-Judge paper and thought, okay, this does seem to work, but of course there are some caveats, and there are some particular limitations that the researchers originally found. I'll mention a few of these briefly and then talk about a simple way you can ameliorate some of them. The first is what's called position bias: the result from our LLM judge can sometimes, or perhaps often, be different depending on the exact order of the information you present.
In this particular question, the judge is being asked which answer it prefers: do I prefer assistant A or assistant B as the better answer to a question about business etiquette norms in Japan? And GPT-4 gives a different answer depending on whether you put assistant A's information first in the context or assistant B's information first. This obviously shouldn't happen, because the order should be irrelevant: the question is the same; only the order of the information in the context has changed. So, in some sense, even these judges of human preferences have quirks of their own. There's another one called verbosity bias.
We can take a context and duplicate some of the information, and for GPT-3.5 and Claude that changes the response, but not for GPT-4 in this case. There are also some questions where, if you ask the LLM judge to answer the question itself, it can answer correctly, for example a mathematics question.
But if you ask it to judge two answers, or to judge an answer from another model, it can be misled by the incorrect reasoning in that answer. This doesn't seem right: if the judge can answer the question, it should be able to judge other models' answers. And then, very briefly, there's another failure mode where, if chain-of-thought-style reasoning is included, the judge's results can be influenced toward giving the incorrect answer rather than thinking independently, in a sense.
The authors suggest some methods to try to improve the quality of the judge and fix some of these biases. One thing they allude to in the paper is that it might be worth looking at fine-tuned judge models: judge models that have been fine-tuned specifically to be judges and to avoid these biases and other mistakes a judge could make. Since that paper was published, companies and researchers have been releasing judge models, both closed-source and open-source, and I've listed some recent open-source ones here, so you can check those out.
I think the lesson is that you should always use an LLM judge that has been fine-tuned for that purpose, or maybe separate ones for specific purposes: perhaps one fine-tuned for groundedness, one fine-tuned for relevance, and so on. I just want to finish on one key message: a really important part of evaluation is the data quality of your evaluation. How that relates to task-based evaluation is: how closely does the specific data in that benchmark match the task you're trying to do in your application? If it doesn't match well, you may need to construct your own custom dataset to capture the aspects of performance that matter for your specific task. And then, secondly, there's data quality for training these judge models.
Can we construct the right dataset so that we can fine-tune our judge models and have them align with human preferences about how a judge model should operate? It looks like we're out of time, so we're going to move to questions. I was going to talk very briefly about some open-source frameworks, but I'll leave that for a future webinar, where it will be the primary focus. We'll talk about LM Evaluation Harness and about Ragas, the most widely used framework for evaluating RAG specifically, briefly discuss some alternatives and compare them to Ragas, and we'll use the Milvus open-source vector database as the vector database for the RAG system we demonstrate it on. In that future webinar I'll actually build a simple RAG system and show you, end to end, how you can make a design choice, evaluate it, and then move your system in a direction where you've made a verifiable improvement.
Okay, so for the sake of time, I think we should move on to questions. Yes, we have a few questions in the chat. Oh, they're in the Q&A tool, so I'll just go ahead and read the first one out here. In more traditional ML, we would use a grid search to find the optimal set of hyperparameters to train a model.
With so many different configuration settings for RAG, such as the model, query construction, and query translation, are there any frameworks that can do grid search? Yeah, great question. Not that I'm aware of. I think you could use existing frameworks that aren't specifically designed for hyperparameter search over RAG systems but can be adapted to them. I could definitely see how you could combine some pre-existing libraries, whether that's grid search or Bayesian optimization, and then just plug in the RAG-specific metrics and have it search that space. One thing to mention is that it's quite expensive to evaluate some of these metrics.
That might restrict the breadth of the space you can search. My advice would be to consider each part of the pipeline independently, essentially making the assumption that the parts don't interact with each other, whether or not that's strictly true. And then, instead of doing a traditional grid search over something like the learning rate, consider a very small number of alternatives to get a broader sense of what is important and what is not. Okay, thank you.
And then the other question we received is: when I start building a RAG pipeline, can I use an existing open-source codebase as a starting point, and can you tell us about good codebases to start with? Maybe ones that already have scripts for pre-processing data and evaluating models. Yeah, sure. What I would say here is: use a framework, an open-source library like LangChain, that sits at a higher level than building the components individually yourself, so you can work from a higher level of abstraction. They have example code showing how to build different types of RAG pipelines with the library.
And then, separately, Ragas has good integration with LangChain, so they have examples specific to using Ragas with LangChain. So I would start there: use a library like LangChain and the resources on their website as your starting point for a RAG system, and then use the resources on the Ragas website as your starting point for evaluation. Thank you. And next: for a software engineer without ML or data science experience, what are some ways to be useful to a team trying to build a RAG pipeline? Yeah, sure.
I think the key point here is that to build a good pipeline you don't really need ML knowledge, but what you do need is an experimentation mindset, a research mindset. That's a mindset which says we need to measure the effect of each change we make; we have to do it in a principled way. We don't necessarily have to understand all the mechanics of what's going on behind these steps.
But provided we understand what the choices we're making actually are, that they're very explicit, and that we can measure them, that's what you need. So focus less on the machine learning side and perhaps more on the experimentation and research-mindset side. And I think many software engineers hopefully have good experience from A/B testing, if you've been developing a social media app or some other online app; it's a very similar principle. Okay.
And I've put some resources in the chat about Ragas, for those interested in getting started with it, and a little bit about the Ragas and Milvus integration. And we have one more question: to avoid this, would doing the evaluation as a new conversation or prompt remove the bias of the order of the prompt? Oh, sorry, I didn't quite catch the first part. Could you read that again? They said: to avoid this, would doing the evaluation as a new conversation or prompt remove the bias of order of prompt? And this is in regard to position bias, from one of your slides.
Mm-hmm. Okay. So, to avoid position bias, would doing the evaluation as a new conversation or prompt remove the bias? Well, this position bias isn't so much about previous chat history; it's about the context you have to include in that single-step dialogue used to calculate the metric. One method to remove the position bias is to actually just try both options.
We could say: I'm going to ask the judge model giving context A first, and then separately I'm going to ask the same model putting context B first. If those two agree, then we can say there's no position bias in this case and we have confidence in the judgment. Otherwise, there are different strategies you could use when they give conflicting answers; for example, you might delegate to a separate model. But I think the most effective method is using these judges that are fine-tuned specifically to be judges.
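That order-swapping check, as a sketch; `ask_judge` is a hypothetical function that returns "A" or "B" given a question and two candidate answers in the order shown:

```python
def consistent_judgment(question, answer_a, answer_b, ask_judge):
    """Query the judge twice with the candidates swapped to detect position bias."""
    first = ask_judge(question, answer_a, answer_b)      # A is shown first
    second = ask_judge(question, answer_b, answer_a)     # B is shown first
    swapped_back = {"A": "B", "B": "A"}[second]          # map the second verdict back
    if first == swapped_back:
        return first       # both orderings agree: accept the verdict
    return None            # disagreement: treat as a tie, or escalate to another judge
```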
Cool. Thank you, Stefan. We have time for one more question. If anyone has a last-minute, burning question about what he talked about, please put it into the Q&A tool or the chat; otherwise, I'll give it one minute in case you're typing, and that would be pretty much the end of Stefan's presentation.
Oh, okay, so there is one question: maybe I missed that information, but is there a way to measure context recall and context relevancy? Yeah, sure. It depends on whether you have the ground truth. For recall, yes, you want to have the ground truth. For relevancy,
you can either have the ground truth, maybe from human annotators, or, if you don't, that's where you use the LLM as a judge to score relevancy. So you would say... oh, hang on, sorry, I'm just reading the question again: is there a way to measure context recall and context relevancy?
Oh, okay. So I think this question is asking about the relevancy of the information retrieved from the database, not the relevancy of the output from the LLM. Again, this is something you can use the LLM as a judge to score: you could construct a prompt that says, on a scale from one to ten, how relevant is this context, these retrieved chunks from the vector database, to the question being asked?
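A sketch of both halves of that answer: recall@k when you do have ground-truth labels for which chunks are relevant, and an LLM-judged relevancy score when you don't (`judge_relevance` is a hypothetical judge call like the one sketched earlier, returning a score between zero and one):

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the ground-truth relevant chunks that appear in the top-k results."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0

def context_relevancy(question, retrieved_chunks, judge_relevance):
    """Average LLM-judged relevance (0 to 1) of the retrieved chunks to the question."""
    scores = [judge_relevance(question, chunk) for chunk in retrieved_chunks]
    return sum(scores) / len(scores) if scores else 0.0
```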
Someone also asked: what does the company Zilliz primarily do? I gave a brief answer in the chat: we are an open-source vector database company. We have Milvus as our open-source vector database product, and we also have Zilliz Cloud, which is our enterprise vector database, built mostly for scalability. If you have more and more data to put into your vector database, Zilliz Cloud helps with that.
So if you have any further questions, feel free to reach out to us about what we can do for your specific use case. Yeah, absolutely. Thanks, Achi. One more thing to add is that we've donated Milvus to the Linux Foundation (LF AI & Data),
although we're still the main contributors. So it's fully open source with a commercially usable license, but if you don't want to handle the hosting yourself, that's where our commercial offering, Zilliz Cloud, comes in; it's built on the same technology as Milvus. Okay, we're at time, but thank you so much, everyone, for joining today. The slides and the recording will both be shared with you in an email shortly after the webinar, and we'll have the recording up on YouTube for everyone to view as well.
I hope you enjoyed it, and we'll see you at the next webinar. Thank you all. Bye. Yeah, thanks, everyone, for attending. Hope to see you next time.
Okay, bye.
Meet the Speaker
Join the session for live Q&A with the speaker
Stefan Webb
Developer Advocate, Zilliz
Stefan Webb is a Developer Advocate at Zilliz, where he advocates for the open-source vector database, Milvus. Prior to this, he spent three years in industry as an Applied ML Researcher at Twitter and Meta, collaborating with product teams to tackle their most complex challenges. Stefan holds a PhD from the University of Oxford and has published papers at prestigious machine learning conferences such as NeurIPS, ICLR, and ICML. He is passionate about generative AI and is eager to leverage his deep technical expertise to contribute to the open-source community.