Evaluation Builds Better Retrieval Augmented Generation Applications
Today I'm pleased to introduce today's session, Evaluation Builds Better Retrieval Augmented Generation Applications, and our guest speaker, Josh Reini. Josh is a core contributor to open source TruLens and the founding developer relations data scientist at TruEra, where he's responsible for education initiatives and nurturing a thriving community of practitioners excited about AI quality. Josh has delivered tech talks and workshops to more than a thousand developers at events including the Global AI Conference 2023, NYC Dev Day 2023, LLMs and the Generative AI Revolution 2023, AI developer meetups, and the AI Quality Workshop, both in live format and on demand through Udemy. Prior to TruEra, Josh delivered end-to-end data and machine learning solutions to clients including the Department of State and the Walter Reed National Military Medical Center.
Welcome, Josh Reini. Awesome. Thank you, Eugene. Thanks for having me. Really excited to be here.
Yes. Sweet. So as Eugene said, today I'm gonna be talking about how evals build better RAGs. I thought it'd be kind of fun to start off with a little bit of research, to make sure I was really up to speed on everything that Zilliz is up to.
So I asked my assistant, Claude. I said, who founded Zilliz? Oh, am I on mute? No, no. But can you share your screen? Oh, yeah, yeah, yeah. My bad. Thanks.
Cool. You can see it now? Yes. Good deal. All right. So I asked my assistant, Claude, and I said, who founded Zilliz? And I got not really a response.
So it didn't know who founded Zilliz, but it did say that Zilliz is a blockchain company. Eugene, can you fact check this one for me? Yeah. So Zilliz is not a blockchain company. We are a vector database company. So yeah, I have no idea how it even came up with this, actually. Yeah, LLMs are really good at coming up with an answer that seems plausible.
And you'll see this a little bit later too. So after I asked who founded Zilliz, I also wanted to ask, what does Zilliz do? So I asked it to give me some bullets: when it was founded, where it's headquartered, how much it's raised, and some key product offerings and customers of Zilliz. So Eugene, I'll ask you to fact check this one again too. Yeah.
Let's see. So, I guess "develops solutions for AI infrastructure and machine learning operations" isn't technically wrong. That is not how we officially position ourselves, but we do develop a solution for AI infrastructure. We were founded in 2017 and we are headquartered in Silicon Valley, so that's right. The main project is Milvus.
And Milvus is an open source vector database. We think that database systems are slightly different from similarity search engines. We've raised more than 43 million now, but I think 5Y at least is correct. I think Hillhouse is also correct.
Let's see what else is on here. We don't have an MLOps hub. We don't have an ML hub for MLOps. "Zilliz managed Milvus" for fully managed Milvus is a very interesting way of describing Zilliz Cloud, but that's kind of what it is. Pretty wordy, I would say.
Yeah. So this is somewhat right, somewhat wrong. Okay. So now, if you remember the first one, it said that maybe with some more context about Zilliz it would know who the founders are. So let's try that again.
So I asked the same question, and here it gives me five names. I'm gonna let you, Eugene, tell me if these are actually the founders. Yeah. So our founder is Charles Xie.
And it's none of these people, none of these names. I actually just looked this up right before. None of these people even work at our company. None of these names exist in our company directory, so I have no idea how it came up with these names. Yeah, so large language models are really good at coming up with plausible-sounding answers.
So it knows Zilliz is a Chinese company, so it came up with some Chinese-sounding names, and maybe some descriptions of people that commonly found software companies like this. But as Eugene said, it's a total hallucination. So that gives me my hot take, which may be slightly less hot now that we've seen this example: we should consider large language models to be hallucinatory until we prove otherwise. And why is that? Research has optimized models for generalization.
And at the same time, it's penalized them for memorization. So this leaves us in a really tricky spot, where we're asking these models to come up with facts but also be pretty creative. There's a really tiny overlap there, and it can be really murky to find that golden zone. So the solution for this is to focus these models on general tasks. These general tasks are gonna be things like summarization, text generation, embeddings, inference, and planning, and then we can leave the memorization to something else.
So that something else is gonna be a knowledge source. These knowledge sources can be vector databases, like Zilliz and Milvus, and they can also be tools, such as APIs to a search engine, or maybe a lookup against a proprietary database like Yelp or Yahoo Finance. So how do we string this all together? This is a typical architecture for a RAG. Retrieval augmented generation is the general approach for fitting our vector databases and our knowledge sources into our QA applications. This basically works by starting with the user question, from which we'll generate a question embedding.
Then we'll take that embedding over to our vector database and search for chunks that are nearest to that query embedding. Then we retrieve those chunks, return them to our LLM, and use that LLM, or completion engine, to return our final response. But there's a lot that can go wrong. Sometimes this retrieval can retrieve irrelevant context. In other cases, we don't get enough context.
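Editor's sketch: the retrieval flow described above (embed the question, find the nearest chunks, hand them to the LLM) can be illustrated with a toy example. Everything here is an assumption for illustration: `embed` is a bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and the final LLM call is left out, so the sketch stops at building the prompt.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, top_k=2):
    """Return the top_k chunks most similar to the question embedding."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, context_chunks):
    """Assemble the prompt that would be sent to the completion engine."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


chunks = [
    "Milvus is an open source vector database.",
    "Honolulu is the capital of Hawaii.",
    "A vector database stores embeddings for similarity search.",
]
question = "What is a vector database?"
top = retrieve(question, chunks, top_k=2)
prompt = build_prompt(question, top)
print(prompt)
```

If `top_k` were raised to 3 here, the unrelated Honolulu chunk would also reach the prompt, which is the kind of irrelevant context the talk warns about.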
And all of these can lead to hallucination. So here's an example where we built a RAG on our website and company docs. We asked it, who is Shayak, who is one of our founders. These first few sentences are correct. It knows Shayak is a computer scientist, got his PhD from CMU, et cetera, et cetera.
But then it also throws in this detail about how he's a member of the Bank of England's AI Public-Private Forum. And this is because this information is about another member of our team, named Shameek. So Shameek and Shayak were located similarly in the vector space, probably because of the way they were tokenized. So we retrieved both of these, fed them to our LLM, and it combined them into a plausible-sounding answer. Josh, I wanna ask a question here.
Yeah. What is the data that you guys gave the vector database to build this QA thing? Yeah. We indexed our company website and then the TruLens docs. Wow. Okay.
Cool. Mm-hmm. So this is the core problem we built TruLens for. TruLens is an open source project for tracking and evaluating LLM experiments. How this generally works is, once you've built your application, you can connect it to TruLens to start logging the records.
We'll log the inputs, the outputs, and all of the intermediate steps, sometimes called spans, that you might wanna evaluate and inspect later. We'll add these things called feedback functions to evaluate the quality of our application each time it runs. And then once we've done so, we can explore the records, understand the evaluation results, and iterate and select the best application version that we wanna roll with. So where does this fit in the LLMOps stack? We sit in this observability layer. TruLens is gonna focus on the testing and debugging piece in the development cycle.
And then when you need to monitor in production and scale, you can turn to our more managed platform with TruEra. I'm gonna ask another question here that I think might result in a hot take. Yeah. Can you go back to that slide? Yeah. This one? Yeah.
Yeah. Okay. So you have feature store crossed out here. Mm-hmm. Why? Oh, so if you think about the transition from old MLOps to LLMOps, this is the changeover.
Oh, okay. So are you seeing this pattern in the industry, or is this kind of a prediction that you have about the way that MLOps or LLMOps is evolving? Yeah. This is the pattern we've seen so far in the industry. We don't really do a lot of feature engineering anymore with these LLM apps. And in the compute layer, most of the time we're not even training our own models. We're just using pre-trained models or even calling APIs.
Wow. Okay. Very interesting. I hadn't seen that. That's very cool.
Nice. Yeah. Awesome. Yeah, it's nice to think about where we've come from and how it compares to the old stack. So kind of useful there.
Cool. So how do we think about testing these RAGs for hallucinations? We like to use this thing we call the hallucination triad, or RAG triad. This is gonna follow the path of our application each time it runs, from the query, to the retrieved context, to the final LLM response. So the first thing we wanna check is context relevance. Here we wanna know, is this context relevant to the query? Are we retrieving the right chunks that we're gonna use to answer our question? The second thing I want to check is, is the response fully supported and backed by evidence from the context that we're retrieving? Or is it hallucinating on top of that evidence? And then the last thing, once I've done those two, I also wanna know, is this final answer actually helpful for answering the question? In some cases, I might ask a question, get the right context chunks, and form them into a grounded response.
But it still might not answer the question at the end of the day. So checking all three allows us to verifiably check for hallucination. So what are some common failure modes that we can identify here? The first one I wanna show is retrieval failure. Here we have a RAG built on some Wikipedia data, and I ask, what's the best national park near Honolulu? And it's not able to answer the question given the context. This response is actually pretty good, because it tells us that it doesn't know, but we still wanna know the root cause of that inability to answer.
And that's just because we didn't retrieve the right information. Here you can see a screenshot from TruLens at the bottom, where each of the three chunks we retrieved has an evaluation score, from zero to one, on how relevant the chunk was, and also some reasoning on why it got a low or a high score. If you look at that third one, we see it got a score of two out of 10, because the statement provided information about the population of Honolulu but didn't mention any national parks. So the second failure mode we often see is a lack of groundedness.
This occurs when the LLM is hallucinating on top of the evidence we retrieve. Here you can see an example. This is a summarization application, where we're feeding it this input from a dialogue with a hotel help desk. And you see that we have three claims that are being made in the summarization. The first one is that they called room service an hour ago.
The second one is that they're not happy about the wait. And the third one is that they have no other option. We're able to verify, with evidence from the retrieved context, that two of these are supported. But for the second one, there's not a clear, direct statement in the context that says they're not happy about the wait. So we can flag that as an unsupported statement and knock our groundedness score down a peg.
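Editor's sketch: the claim-by-claim check described above can be illustrated as follows. This is not the TruLens implementation. The dialogue and summary strings are made up to mirror the hotel example, and `is_supported` is a crude word-overlap placeholder for the LLM or NLI judge that would do this step in practice.

```python
import re


def split_claims(summary):
    """Split a summary into one claim per sentence."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]


def is_supported(claim, context, min_overlap=0.6):
    """Placeholder judge: share of the claim's words that appear in the context."""
    claim_words = set(re.findall(r"[a-z0-9]+", claim.lower()))
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    if not claim_words:
        return False
    return len(claim_words & context_words) / len(claim_words) >= min_overlap


def groundedness(summary, context):
    """Return (fraction of supported claims, per-claim verdicts)."""
    claims = split_claims(summary)
    verdicts = {claim: is_supported(claim, context) for claim in claims}
    score = sum(verdicts.values()) / len(claims) if claims else 0.0
    return score, verdicts


# Hypothetical inputs, written for this sketch only.
context = (
    "Guest: I called room service an hour ago and my food is still not here. "
    "Desk: I am sorry, we have no other option tonight."
)
summary = (
    "The guest called room service an hour ago. "
    "The guest is not happy about the wait. "
    "They have no other option."
)
score, verdicts = groundedness(summary, context)
for claim, supported in verdicts.items():
    print(f"{supported!s:5}  {claim}")
print(f"groundedness = {score:.2f}")
```

Because the placeholder judge only matches literal words, it marks the second claim as unsupported even though a person would infer it, which is the trade-off raised in the question that follows.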
I also have a question about this. Yeah. While it's not explicitly written that they're not happy about the wait, wouldn't you, as a person, make that assumption? Yeah, I think that's probably true. I think this is maybe a bit debatable. Mm-hmm.
I guess what I'm asking here is really about the measure of groundedness. Is that based entirely on the explicit text provided? Or does it take into account something like, I don't even know how you'd describe this, extrapolation from context, like people do? Yeah. So it's kind of a mixture of both. This evaluation can be done with two classes of models. It could be done with a large language model, in which case we're doing some careful prompting to split our statement into different claims and then search for evidence for each one.
And because that's an LLM, it could have the freedom to understand that being not happy about the wait is kind of implied. Yeah. And you see that a little bit with the third one, right? It says they have no other option, and there's this kind of, well, we have no choice. And then the other option for this evaluation is a natural language inference model. This is the medium language model evaluation style.
And we follow a similar strategy here, breaking the statement into claims. That one probably has a little less wiggle room on the extrapolation. Cool. Sweet. So the third failure mode I wanna call out here is answering the wrong question. Here I ask, which year was the Hawaii state song written? We get a good retrieval.
You can see we have the correct answer located in the retrieved context. We have the name of the song, we have the year, but then our response is not the year. It's just the name of the Hawaii state song. So our context relevance was good, our groundedness was good, but our answer relevance was not sufficient.
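Editor's sketch: the three failure modes walked through above map onto the three scores of the RAG triad. A minimal way to turn scores into a diagnosis might look like the following. The `TriadScores` class, the `diagnose` function, and the 0.5 threshold are all hypothetical names and values chosen for this sketch, not part of TruLens.

```python
from dataclasses import dataclass


@dataclass
class TriadScores:
    """Scores in [0, 1] for one record of a RAG application."""
    context_relevance: float
    groundedness: float
    answer_relevance: float


def diagnose(scores, threshold=0.5):
    """Walk the triad in pipeline order and name the first failing stage."""
    if scores.context_relevance < threshold:
        return "retrieval failure"
    if scores.groundedness < threshold:
        return "lack of groundedness"
    if scores.answer_relevance < threshold:
        return "answered the wrong question"
    return "no failure detected"


# Illustrative scores only, loosely following the three examples in the talk.
honolulu_parks = TriadScores(0.2, 0.9, 0.8)
hotel_summary = TriadScores(0.9, 0.3, 0.8)
hawaii_song = TriadScores(0.9, 0.9, 0.2)

for name, record in [
    ("national parks", honolulu_parks),
    ("hotel summary", hotel_summary),
    ("state song", hawaii_song),
]:
    print(f"{name}: {diagnose(record)}")
```

Checking in pipeline order matters: a low answer relevance score is only attributed to the completion step once retrieval and groundedness have both passed.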
So how do we think about choosing the right evaluations for our application? The way I like to think about this is on two axes. The first one is, how meaningful is the evaluation? How much will it allow us to get a good view of the performance of the application? Maybe the most meaningful evaluations we can come up with are ground truth evals. This is when we've done a human evaluation and we understand exactly how we would grade a given response. But the problem is these don't scale.
It's really expensive and really time consuming to come up with these human evals. Additionally, they're not gonna be dynamic. As the queries into your application change, they're not gonna reflect that. It's just gonna be this static ground truth eval. And this is a lot of the problem you see with grading models on the Hugging Face leaderboard, for example, right? That's using a static dataset, and it's not reflective of your domain.
On the other side, we have traditional NLP evals. These are things like BLEU and ROUGE. And the problem with these is, while they're really scalable and cheap to run, they rely a lot on syntactic similarity, and they don't really understand the semantic comparison. So they often fail to recognize that two statements are similar when they use different words.
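Editor's sketch: the weakness described above is easy to see with a simplified word-overlap score. The function below is a unigram F1 in the spirit of ROUGE-1, not a full BLEU or ROUGE implementation, and the sentences are made up for illustration.

```python
import re
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 over overlapping word counts between a candidate and a reference."""
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "The guest is unhappy about the delay."
paraphrase = "The customer is upset that dinner is late."  # same meaning, new words
opposite = "The guest is happy about the delay."  # opposite meaning, same words

paraphrase_score = unigram_f1(paraphrase, reference)
opposite_score = unigram_f1(opposite, reference)
print(f"paraphrase: {paraphrase_score:.2f}")
print(f"opposite:   {opposite_score:.2f}")
```

The sentence with the opposite meaning scores far higher than the faithful paraphrase, because the metric only sees shared words.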
And that brings us to our medium language model and large language model evals. For simple, more defined, commonly used tasks, medium language models can be really useful. We talked about groundedness before. We can also do things like identifying the language of text and using that to match across the prompt and the response. We can do things like checking for sentiment.
These are all really useful things to do with medium language models. And you see I have them marked as a little bit more scalable than the LLMs, just because they're cheaper to run. And then lastly, the LLM evals. These are gonna be the most flexible. They're very meaningful because we can tailor them to exactly how we want to evaluate our application. Oftentimes it's useful to mix a bunch of these evaluations when you set up your application. In the early prototyping phase, maybe you wanna start with some ground truth evals, but then as you get past your first versions, you wanna add in some of these other evaluations.
Oh, okay. I also wanna pause here. The audience has a question. The audience has asked, how is the relevance score calculated? Yeah. So the relevance score is produced through some careful prompting of LLMs.
Basically we'll just insert the prompt and the response into an f-string and then ask, how relevant is the response to this question? And then we can also give some few-shot examples and say, you should give a lower score of zero to three for responses that are not relevant at all, and maybe four to six if they are relevant to part of the question. And then we should reserve those perfect scores for answers that fully and completely answer, and are relevant to, the question. So you can tune the relevance based on what you need. Yeah, yeah, for sure.
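Editor's sketch: the prompting approach described above can be illustrated with a small judge function. The rubric wording, the function names, and the `llm` callable are assumptions made for this sketch. This is not the actual TruLens prompt, and a real setup would pass in a function that calls a model provider.

```python
import re

# Hypothetical rubric, paraphrasing the score bands mentioned in the talk.
RUBRIC = (
    "You are grading how relevant a RESPONSE is to a QUESTION.\n"
    "Score 0 to 3 if the response is not relevant at all.\n"
    "Score 4 to 6 if the response is relevant to part of the question.\n"
    "Reserve 10 for responses that fully and completely answer the question.\n"
    "Reply with a single integer from 0 to 10."
)


def build_relevance_prompt(question, response):
    """Insert the question and response into the grading prompt with an f-string."""
    return f"{RUBRIC}\n\nQUESTION: {question}\nRESPONSE: {response}\nSCORE:"


def parse_score(raw):
    """Pull the first integer from 0 to 10 out of the judge's reply, scaled to [0, 1]."""
    match = re.search(r"\b(10|[0-9])\b", raw)
    if match is None:
        raise ValueError(f"no score found in judge output: {raw!r}")
    return int(match.group(1)) / 10


def relevance(question, response, llm):
    """Score relevance with any callable that maps a prompt string to reply text."""
    return parse_score(llm(build_relevance_prompt(question, response)))


def fake_llm(prompt):
    """Stand-in for a real model call so the sketch runs offline."""
    return "Score: 2"


score = relevance(
    "What's the best national park near Honolulu?",
    "Honolulu has a population of about 350,000.",
    fake_llm,
)
print(score)
```

Dividing by 10 matches the zero-to-one scale shown in the TruLens screenshot earlier in the talk, where a two out of 10 appears as a low chunk score.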
Yeah. So if the style of responses that you're getting in your application is tough to grade with straightforward prompting, it can be useful to do more of a few-shot prompting style, tailored to the style of responses that you want to grade for your app. Cool. Wow. Nice.
Mm-hmm. Great. Yeah, that was a good question. Okay. Any other? Yes.
So Revant has asked a follow-up question to this, which is, what determines the actual relevance score, and what metric says text A is more relevant than text B? Do you need a prior dataset? I think you partially answered this when you were saying that you can tune it and give it some examples, right? Is there another metric that you would call it, in terms of the measurement? Yeah, so the metric is just, I mean, at the end of the day, it's the LLM filling in the next most likely token, that being some score from zero to 10. And then we can push or influence the LLM to understand what score corresponds to what relevance through that few-shot prompting. But you don't need a prior dataset. So this is dynamic to that run. Very cool.
Sweet. So we've talked a bit about the different evaluations you wanna run and what the RAG setup looks like. And there's a huge configuration space when you set up your RAGs. When you go about creating your vector database, there's a bunch of things you need to decide, like, what embedding model do I want to use? How do I select the data? We've seen a ton of research showing that both the quality and the diversity of data are even more impactful on your application than the model size and things like that, which are a lot more talked about. Oh, yeah.
And then distance metric and index type are additional things you need to choose that can be pretty impactful downstream. And then when we get to the retrieval step, we also have a bunch of really impactful decisions to make. The first one is, how many chunks do we want to retrieve? What retrieval method do we want to use? Do we want to use a naive RAG, where we just retrieve the top three or top K most relevant chunks, or do we wanna do some kind of sentence-window retrieval or auto-merging? The LlamaIndex guys are coming up with new, fancier retrieval methods seemingly every day. So how do we want to think about incorporating some of those options? Yeah. And then we can also do more dynamic retrieval.
So, using things like re-rankers. We do our first retrieval, and then once we've retrieved a set of K chunks, we can do a re-ranking to decide which ones are the most relevant, and then we can even filter out the least relevant chunks. Oh, I also wanna comment on something else here. This how-many-chunks, top K thing is actually something that we've just added a different version of, which we're calling range search. Okay.
And what it essentially does is, instead of you telling it how many top K you want, you tell it the distance that you're willing to accept, and then it retrieves everything within that distance. Oh, that's super useful. Yeah, that'd be really great, especially for some of the problems we saw earlier, right? Where we see some good chunks, but then also, because our top K is too high, we can get some irrelevant chunks. So yeah. Super.
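Editor's sketch: the difference between top K and range search, as described above, can be shown on a handful of precomputed distances. This is plain Python for illustration and does not use the Milvus API. The chunk names and distance values are made up.

```python
def top_k_search(distances, k):
    """Return the k chunk ids with the smallest distance to the query."""
    return sorted(distances, key=distances.get)[:k]


def range_search(distances, max_distance):
    """Return every chunk id within max_distance of the query, nearest first."""
    ranked = sorted(distances, key=distances.get)
    return [chunk_id for chunk_id in ranked if distances[chunk_id] <= max_distance]


# Hypothetical query-to-chunk distances (smaller means more similar).
distances = {
    "parks": 0.12,
    "beaches": 0.18,
    "population": 0.71,
    "history": 0.83,
}

print(top_k_search(distances, k=3))              # ['parks', 'beaches', 'population']
print(range_search(distances, max_distance=0.3))  # ['parks', 'beaches']
```

With `k=3` the distant "population" chunk is returned simply because three results were requested, while the distance cutoff returns only the two close chunks.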
Nice. Very cool. And then lastly, when we get to the completion step, there's another huge set of things we need to decide. What model are we gonna use? How big is the model? Are we gonna use some API, or are we gonna have some local model? Then model parameters: what's our temperature? What's our frequency penalty? What's our logit bias? Can we nail down our problem to use some kind of function calling, like you can do with OpenAI? There's a bunch of things to decide here. So across all of these decisions, how do we decide which configuration we wanna use for our application? Which one's the most performant? Which one is gonna give our users the highest quality responses? That's where evaluation comes in.
As we talked about before, the feedback function is the abstraction we use in TruLens for evaluating LLMs. This includes the RAG triad we talked about before, so answer relevance, context relevance, and groundedness, but also a whole bunch of other things that can be useful for evaluating your app. This could be things like summarization quality or prompt sentiment. It's often pretty useful to look at embedding distances as a feedback function, detecting PII, et cetera. And we can run these feedback functions with any model.
On the left side, we can use open source models, maybe Mistral or Meta's Llama. We can use any model from Hugging Face. We have a pretty cool integration with LiteLLM, which is the train emoji that you're probably not super familiar with. What they've done is build this common, standard connector for a hundred-plus LLMs that always returns the response in the OpenAI style. So it's always gonna be in the same place every time, and it's a nice way to connect to a whole bunch of LLMs.
And then, yeah. Oh, quick question before you move on from this slide. And you can finish this slide first. I actually wanted to just get your opinion on these models: which open source ones do you like, and which proprietary ones do you like? Yeah, so I've found Azure OpenAI and Bedrock, on the enterprisey side, to be very consistent, and the same thing for OpenAI. I generally find GPT-3.5 or GPT-4 to be better at providing evals than the other models. Interesting. Okay.
And I've done a little bit of benchmarking that maybe I can show later, in our docs. I can send out a link later too. Yeah, that'd be awesome. Mm-hmm. And then, as I just alluded to, on the more enterprisey side, you can run feedback functions with Bedrock, with Azure OpenAI, with models from Cohere, et cetera.
And we're pretty agnostic to the LLM app stack. We have pretty tight integrations with LangChain and LlamaIndex. And through those, we can easily wrap any LLM app for evaluation. So from here, I'll jump into a notebook in a second. Seems like I went slightly too far.
So here's the QR code for the notebook I'm gonna show from here. Feel free to open it up. I'll give you maybe a minute to pull up this notebook, and then we can jump into the demo and walk through it. And this is probably a good chance to take any questions too. Okay.
So then I will take a couple questions that seem like they're for me. Let's see. I'll answer this one live. Frank has asked, isn't the distance related to the top K discussion dependent on the embedding algorithm? If so, that would have to be parameterized per LLM. Right.
Okay. So there's a lot to unpack here. First of all, embeddings are not generated from algorithms. Embeddings are generated from neural networks, and the distance is definitely related to the network that you generate them from. And, as Josh alluded to earlier, it's actually most related to the quality and the contents of the data that that neural network was trained on.
So the answer is yes, it is dependent on the embedding. And does that have to be parameterized per LLM? It does not. I mean, I'm not entirely sure what you mean by parameterize, but you don't have to use a different LLM just because your embedding is different. You can use different LLMs with the same embeddings, as long as you're using the same embedding model to generate the embeddings. And the embedding model and the LLM are not necessarily the same. They can be, but they do not have to be. I hope that was clear. There's a lot to unpack in this question, so if you have follow-up questions, feel free to submit them.
Cool. I see one other question about Milvus, just asking if it has built-in re-rank functionality. Oh, yes, I see this. It does not. But I guess, adding on, you can easily use re-ranking with Milvus by using LlamaIndex. LlamaIndex is a nice framework, so that's where in the app the re-ranking would be.
Yeah, yeah, yeah, yeah. Milvus is very much focused on being a vector database system. Our founder is from a database systems background and is very focused on building a very high performance system. So we try not to touch the outside portions too much. But yeah, LlamaIndex is a great choice for that.
Do vector databases index every real-time addition of a document embedding and calculate the top K? I guess I'm a bit confused as to how a query can offer a real-time list of the top K when it has to do several similarity calculations. This is also a very deep question, and I could explain this to you with a Milvus diagram pulled up, but I think that this is a much longer discussion. The short answer here is basically, not all vector databases do this, but you can do this in Milvus, because Milvus has a real-time stream and Milvus is a distributed database system.
And so it does it by using timestamps. It timestamps your data as it comes in, and then you set a consistency level, and the consistency level will dictate from which timestamp you're able to get your data back, basically. Awesome. And then, oh, there's one for you, Josh. It looks like the tooling and overall setup for RAG with LLMs is pretty robust at this point.
Is there anything else close to this for image generation at TruEra or elsewhere? Yeah, so I actually started to do some explorations on this last week, using multimodal models like LLaVA to do an evaluation. Basically, you're prompting it with the image, and then the evaluation criteria is your text. Using the combination of those, we can do a similar scoring to what we did with straight text. I don't wanna promise any dates, but that's definitely something we're thinking about and working on. We're getting a sneak preview.
Awesome. Yeah. All right. I think that's all the questions for now. We can, great.
Let's hit the notebook. Yeah. What is your IDE? This is Cursor, so it's like a VS Code extension. And it's got this nice AI assistant, so there's an AI thing where you can chat with your codebase. Oh, cool. Okay.
Yeah, it looks like VS Code, but then I saw the icon looked different, so I was like, huh. Yeah. Yeah, I'm a fan. I've liked it so far. I've been using it for, I don't know, like a month maybe.
Yeah, would recommend. There's my shout-out to Cursor. Oh, and Rock asked the same question you did. Sweet.
Okay. So getting into the example. Here I've already installed Milvus, so we can skip over that section. The libraries I'm gonna use in this example are gonna be LlamaIndex. I'm gonna pull some embeddings from LangChain embeddings.
I'm gonna use tenacity to get some exponential backoff and retry, so it can deal with the rate limits with OpenAI. And then I'll pull in some things from TruLens to do our evaluation. Oh. Cool. Okay.
So the first thing I want to do here is load the documents. I'm gonna pull in some city Wikipedia pages, so cities like LA, Houston, Chicago, and then I'll just load each of those using the Wikipedia reader from LlamaIndex. And then while I have the cities in mind, I'll set up a set of test prompts about these cities, and we'll use these to help measure how our application is doing. And then I can build my first version of the RAG. The first thing I need to do is set up my Milvus vector store.
So I'll set the kind of index type. I want to choose the dimension that matches the dimension of the embedding model I'm gonna use. I can set a search parameter, and I'm gonna use this vector store and set it into the storage context for my LlamaIndex app. And then the second thing I need to do after that is set my service context. So this is gonna be my model and my embeddings that I want to use.
And then I can just create my index using those two contexts, and then create my query engine, set my top K, and then put it in this retry decorator from the Tenacity library. So if I fail and hit some OpenAI HTTP error or something like that, I can just wait X amount of minutes and increase that as I go, to try and avoid those rate limit errors. And then once I do that, I can run it through each of the test prompts that we set up earlier. So I've created this prototype RAG, but I don't really know how it's doing.
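The retry-with-exponential-backoff idea mentioned above can be sketched with the standard library alone. This is not the Tenacity API and not the notebook's code; `RateLimitError`, `call_with_backoff`, and `make_flaky_query` are hypothetical names, and the delays (in seconds) are assumed values chosen for illustration.

```python
import time


class RateLimitError(Exception):
    """Stand-in for a provider's rate-limit error (hypothetical name)."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError and doubling the wait each time."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait base_delay, 2x, 4x, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))


def make_flaky_query(failures):
    """Build a fake query function that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def query():
        if state["left"] > 0:
            state["left"] -= 1
            raise RateLimitError("rate limited")
        return "ok"

    return query


# Record the waits instead of actually sleeping, so the sketch runs instantly.
waits = []
result = call_with_backoff(make_flaky_query(3), sleep=waits.append)
print(result, waits)
```

Passing `sleep=waits.append` is only a convenience for trying the sketch without waiting; in real use the default `time.sleep` would apply.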
So sure, I can look at each response and manually say, is this the right answer? Maybe I have to go to Wikipedia and see, is this actually right or is this kind of a hallucination? But it's really hard to know off the bat what the performance is for this application. So from here I wanna set up my evaluation with TruLens. For TruLens, we're gonna set up a lot of these evaluations using OpenAI GPT-3.5. And I've found that GPT-3.5 is plenty good, especially when you give it this careful prompting that we talked about earlier. There's not really a big difference in the evaluation quality between 3.5 and 4 with that prompting.
Interesting.
And that is a bit of a hot take, I think. With less careful prompting, there is a bigger difference, so you may see other opinions elsewhere.
That is a hot take.
Yes, yes. I see a lot of people saying that GPT-4 is a lot better. I use GPT-3.5 a lot as well, so this is cool. This is interesting.
Yeah, it's cheap, it's fast. So the first feedback function I wanna set here is groundedness. I'm not gonna walk through this whole thing, but I'll call out a couple things. I'll set my groundedness provider, so that's the model that we're gonna use.
And then it's also worth pointing out that I'm pointing to the context. Each time your application runs, we're gonna serialize the record as JSON, and that's kind of the observability component of TruLens. And then here we're gonna point to where in that JSON the context is located. This gives us a ton of flexibility to point to different parts of our app. So if you're building an agent app, you might wanna point to the tool decision, or maybe to the tool input. For RAGs, in addition to the context, you may want to point to a summarization step or other intermediate steps.
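To make the "point to where in the JSON the context is located" idea concrete, here is a minimal sketch. The record layout, the dotted-path syntax, and the `select` helper are all hypothetical and invented for illustration; TruLens has its own record schema and selector objects, which are not reproduced here.

```python
import json


def select(record, path):
    """Walk a dotted path through nested dicts and lists; numeric parts index lists."""
    node = record
    for part in path.split("."):
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


# A made-up serialized record for one application run.
record_json = json.dumps({
    "input": "What pro sports teams are in LA?",
    "calls": [
        {"name": "retrieve", "output": ["The Lakers and Clippers play in LA."]},
        {"name": "llm", "output": "The Lakers and the Clippers."},
    ],
})

record = json.loads(record_json)
context = select(record, "calls.0.output")  # what the retriever returned
answer = select(record, "calls.1.output")   # what the LLM produced
print(context, answer)
```

The same helper could point at any other step in the record, such as a tool input or a summarization step, simply by changing the path.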
So you have a lot of flexibility on what to evaluate here. The other evaluations we're gonna do are answer relevance and context relevance. So here we're pointing to the same context. And then also notably, all these evaluations I'm doing with chain-of-thought reasons. I'm pointing that out because it does two things that are pretty nice for us.
So the first thing it's gonna do is just improve the quality of the evaluation. There's been a lot of research to show that LLMs are a lot better at reasoning when they're prompted to think step by step. So that gives us one benefit. And then the second benefit is that we as humans actually get a reason why a particular record got a low score. This helps us understand and do debugging and figure out what went wrong.
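As a rough sketch of an evaluation that returns both a score and a reason, the snippet below asks an evaluator model to reason step by step and then parses its reply. The prompt wording, the reply format, and the names `PROMPT_TEMPLATE`, `parse_evaluation`, `evaluate`, and `fake_llm` are assumptions made up for illustration; they are not the TruLens prompts or API, and the evaluator is stubbed so the example runs offline.

```python
import re

# Hypothetical prompt: ask for reasoning first, then a numeric score.
PROMPT_TEMPLATE = (
    "Rate how relevant the CONTEXT is to the QUESTION on a scale of 0 to 10.\n"
    "Think step by step, then answer in exactly this format:\n"
    "Reason: <your reasoning>\n"
    "Score: <0-10>\n\n"
    "QUESTION: {question}\nCONTEXT: {context}\n"
)


def parse_evaluation(text):
    """Extract (score normalized to 0..1, reason) from the evaluator's reply."""
    score_match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", text)
    reason_match = re.search(r"Reason:\s*(.+?)(?=\nScore:|\Z)", text, re.S)
    if score_match is None:
        raise ValueError("no score found in evaluator reply")
    score = min(max(float(score_match.group(1)) / 10.0, 0.0), 1.0)
    reason = reason_match.group(1).strip() if reason_match else ""
    return score, reason


def evaluate(question, context, llm):
    """Run one evaluation; `llm` is any callable mapping a prompt string to a reply."""
    return parse_evaluation(llm(PROMPT_TEMPLATE.format(question=question, context=context)))


def fake_llm(prompt):
    """Stub evaluator standing in for a real model call."""
    return "Reason: The context lists city parks, not festivals.\nScore: 2"


score, reason = evaluate(
    "What are some famous festivals in LA?",
    "Griffith Park is a large municipal park.",
    fake_llm,
)
print(score, reason)
```

Keeping the reason next to the score is what makes a low-scoring record debuggable by a human afterwards.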
The last feedback function I wanna show here is embedding distance. This can be pretty useful when comparing different embedding models and trying to figure out which one we want to use. In addition to the downstream performance, we can even see just these distances here for each record. Cool. Okay.
So now I've set up my evaluations. The next thing I wanna do is set up my configuration space. I think I actually took those out. So here we're gonna evaluate two different embedding models, MiniLM and Ada. We'll set two top Ks and two chunk sizes, and then we'll just iterate through them all.
We try each application, run 10 records through each app, and then see which one performs the best. Kind of a fun experiment. And the last thing I wanna call out here is that when we set up TruLens, we're just gonna wrap our query engine with this thing called TruLlama. This is our LlamaIndex integration that does the observability and instruments your LlamaIndex application. And we'll just give it our list of feedback functions that we wanna evaluate each time it runs.
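The configuration sweep described here (two embedding models, two top-K values, two chunk sizes, eight app versions in total) can be sketched as a simple grid. The chunk sizes 200 and 500 and the top K of 1 come from the talk; the second top-K value and the `app-N` naming are assumptions for illustration, and building and wrapping each query engine is left out.

```python
from itertools import product

embedding_models = ["MiniLM", "ada"]
top_ks = [1, 3]          # 1 is mentioned in the talk; 3 is an assumed second value
chunk_sizes = [200, 500]

# One dict per application version, numbered in the order the grid is walked.
configs = [
    {"app_id": f"app-{i}", "embedding": emb, "top_k": k, "chunk_size": size}
    for i, (emb, k, size) in enumerate(
        product(embedding_models, top_ks, chunk_sizes), start=1
    )
]

for cfg in configs:
    # In a real run, this is where each app would be built, wrapped for
    # evaluation, and run against the test prompts.
    print(cfg)
```

With this ordering the first four versions use MiniLM and the last four use Ada, which lines up with how the apps are described on the leaderboard.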
So as you can imagine, the whole thing is gonna take a bit of time because we're doing eight different application versions. So I've pre-run this for you here, and we can just switch over and show you what this looks like in the TruLens leaderboard. Cool. Okay.
So here I've already run 10 records through each of the eight different application versions. And you can see we have the latency, the cost, total tokens, then our different evaluation metrics for each one. If we hover over this information button, we can see what model parameters we set. This kind of gives us a bit of an experiment tracking view. And these first four are gonna be using MiniLM, and they're gonna increase in top K and chunk size as we go down.
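A leaderboard view like this boils down to aggregating per-record results by app version. The sketch below shows one way to do that; the records are made-up placeholder numbers (not the results shown in the webinar), and `leaderboard` is a hypothetical helper, not a TruLens function.

```python
from collections import defaultdict
from statistics import mean

# Made-up per-record results, purely for illustration.
records = [
    {"app_id": "app-1", "groundedness": 0.25, "context_relevance": 0.25, "cost": 0.001},
    {"app_id": "app-1", "groundedness": 0.75, "context_relevance": 0.25, "cost": 0.001},
    {"app_id": "app-6", "groundedness": 1.0, "context_relevance": 0.75, "cost": 0.002},
    {"app_id": "app-6", "groundedness": 0.5, "context_relevance": 0.75, "cost": 0.002},
]


def leaderboard(records, metrics, sort_by):
    """Average each metric per app, total the cost, and sort best-first."""
    by_app = defaultdict(list)
    for rec in records:
        by_app[rec["app_id"]].append(rec)
    rows = []
    for app_id, recs in by_app.items():
        row = {
            "app_id": app_id,
            "records": len(recs),
            "total_cost": sum(r["cost"] for r in recs),
        }
        for metric in metrics:
            row[metric] = mean([r[metric] for r in recs])
        rows.append(row)
    return sorted(rows, key=lambda row: row[sort_by], reverse=True)


board = leaderboard(records, ["groundedness", "context_relevance"], sort_by="groundedness")
for row in board:
    print(row)
```

Sorting by a single metric is a simplification; in practice cost and latency would be weighed alongside the quality scores, as the discussion of doubled cost suggests.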
So you can see we're starting off performing pretty poorly, especially on context relevance and groundedness. And then as we give our LLM more context, whether that's by increasing the chunk size or increasing the top K, we start to get better scores on both of those metrics. So I'll scroll down. You see we increase the chunk size from 200 to 500, and then we take it back down to 200 and increase the top K and try that out. And then in app four we get a better groundedness score.
Hmm. This one has a much better groundedness score.
Yeah, yeah. Big improvement.
When we give it enough data, it actually does okay.
Wow.
But it is, I guess, notably much more expensive, obviously, 'cause we're feeding way more tokens.
Well, it's only 2 cents, but double.
It is, it is double.
That is true. You don't get double the groundedness for double the cost. So that is interesting to keep in mind, you know?
Yeah, yeah. So as we get into apps five through eight, we switch our embedding model.
So here we're gonna start using Ada, and immediately, even with top K equals one and chunk size 200, we get even better groundedness and much better context relevance.
Wow.
And I think it's also pretty interesting to point out, and this is kind of obvious, but if you think about the Ada vector space, the distance for the context chunks we're retrieving is gonna be much lower on average when we're using Ada for the retrieval than when we're using some other model. That's obvious, but also notable to point out. So yeah, we really improve on a lot of these metrics with Ada, especially as we increase the context relevance. But then for app seven and app eight, we actually get too much context.
So here we're starting to feed it irrelevant chunks.
Oh, very cool.
So our sweet spot is gonna be app five and six, probably app six, where we're getting the right chunks. They're pretty accurate. So maybe let's dig into it and take a look at some of these records.
So here you can see each of the 10 questions we asked our application, and you can also see the different scores we got. Let's maybe look at a failure mode and then we can look at a success. So here we asked, what are some famous festivals in LA? We got a response that includes some festivals in LA, but if you look at the context we're retrieving, these festivals are not in here. So this is a pretty classic hallucination where, even though the answer is maybe right, it came from the pre-training and not the RAG.
So that's not what we're looking for. And we see that shown in the chain-of-thought reasoning too: our LLM is able to tell us that we didn't have the right information in the context. Alternatively, if you look at one that went really well, here we ask, what pro sports teams are in LA? We get a good answer. And if we look at the context relevance, you see we have all these teams listed in the context. And then we also have this nice grounded answer, where we have the statement sentence and then the supporting evidence from the context.
So for the most part we're pretty happy with this app six version. There's probably some tuning we can do even further to improve those last few records. But in general, we've made a lot of strides just by trying a few different things. The last thing I want to call out on this page, and then we'll go more into the Q&A, is we have this instrumentation view of all the different components that happen in your application. So here you can see where we do the retrieval.
And then from there you can see where that retrieved context is passed to the LLM. So this can also help you when you're debugging to understand what the different components of your application are and where something might be going wrong. Cool. Sweet. So I think we can go to Q&A.
I'll just flash up the QR code for TruLens. We're fully open source. You can find it on GitHub at the QR code there. We love stars. It really gets the people going.
And yeah, let's get into Q&A.
Yeah, we also love stars on Milvus. So, star on both.
Yes, star button.
For use cases with complex schema for your metadata, would you recommend using a traditional database in conjunction with Milvus only for vector search? If so, how would that work?
Uh, I would say only if you have really complex stuff. If you have something that would regularly cause you to do multiple joins, then you probably wanna do that. Otherwise, Milvus stores metadata, so you can store things like the publication date, the author, the title, where in the text it's from, all this stuff in Milvus, and then you can just use a filter. And Milvus's filter is really well designed. It's a bitmask, so it's only linearly adding, I mean, I think it's adding constant time to your search.
So yeah, I wouldn't do it unless you have really complex metadata.
Any thoughts on vector similarity distances such as Euclidean distance, cosine similarity, or dot product similarity?
So Euclidean distance is x squared plus y squared. So there's about five calculations going on there. Euclidean distance is really good for, oh, five, that's actually three. What am I talking about? Anyway, Euclidean distance is really good for non-normalized vectors.
Cosine similarity is, well, let me go to dot product first. Dot product is vector A times vector B, right? The inner product there. And so there's basically some n number of calculations going on there. This is really good for when you want magnitude and orientation. And then cosine similarity is, you know, the normalized dot product, essentially.
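For reference, the three metrics under discussion can be written out in a few lines of plain Python. This is a minimal sketch with hypothetical helper names, not how a vector database computes them internally; it uses the standard definitions, where Euclidean distance is the square root of the summed squared differences and cosine similarity is the dot product divided by the product of the vector lengths.

```python
import math


def dot(a, b):
    """Inner product: sensitive to both magnitude and orientation."""
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Dot product of the length-normalized vectors: orientation only."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

print(euclidean(a, b), dot(a, b), cosine(a, b))
```

Because `b` points the same way as `a`, the cosine similarity is at its maximum even though the Euclidean distance between them is not zero, which is the magnitude-versus-orientation distinction being described.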
So cosine similarity is really good if you already have normalized vectors and you are only looking for orientation. Josh, did you wanna comment on this?
Yeah, so I think, especially in the very simple example we showed here with a few Wikipedia pages in our vector database, it doesn't make a difference which distance metric you use, but as you get to scale, you definitely want to keep that stuff in mind for sure.
Yeah. Are there any other language APIs? I think this one's for you, Josh.
Yeah, so I think maybe there's a couple answers to this question.
So, as Eugene said earlier, you can use a different language model than the model that's being used for your embedding model. They can be the same, they can be different. TruLens can wrap any application that uses any model. So a really popular use case for us is, this new model came out, and you want to see how it performs compared to the model you're using now. Or maybe when I'm prototyping, how do I choose between GPT-4 and GPT-3.5, or should I use something open source? And then also, as we talked about, we use language APIs for evaluation. So this can be a huge variety of language models that we've built in.
So you can run feedback functions with any model from any of these providers I'm showing on the screen, kind of out of the box. And we're adding more of these all the time. I think Replicate is the next one that I'm working on.
Cool. How well does GPT-3.5 work when evaluating function calling using TruLens?
Yeah, so just in terms of the evaluation, I think it works pretty well. Maybe I can quickly pull up our docs. So we have these smoke tests that I've run, and I actually use TruLens to run these.
Oh, cool. Yeah, I see these, yeah.
Yeah.
Oh, so GPT-3.5 is scoring better than GPT-4 on some of these?
Mm-hmm. Yeah.
That's insane.
Mm-Hmm. Wow. Okay. Very interesting. Yeah.
And so these are on relevance, and we're working on expanding these smoke tests across all the rest of our feedback functions. But yeah, GPT-3.5 gets the okay from me on evaluation. And then I think the next question: so to understand, we use MiniLM and Ada as our embedding models within Milvus. Uh, slight correction there. Actually no, that's, yeah, that's good.
And then use GPT-3.5 to do our groundedness and relevance scores. Yeah, that's right. And then we also use GPT-3.5 for the completion engine.
Okay. I wanna comment here briefly.
The embedding models don't live in Milvus, but the embeddings live in Milvus. Just wanted to clarify that.
Yeah. Cool, cool. If we are getting some questions that do not have an answer in our source documents, it would be counted as negative in the TruLens metrics.
Is there a way to account for that?
Yeah, so I think that's a good thing, right? That you're able to capture that in the TruLens metrics. You wanna be able to know where the gaps are in your vector database and where you're missing data. One thing that you can do is set up an additional feedback function that asks, did the application give an answer when it doesn't know, or is it correctly saying that it doesn't know when it doesn't have the context? So that's also something that's useful to evaluate on.
Is there pricing associated with Tru?
So I think, like a lot of companies in the space, when you get to scale, you're gonna wanna hook up to our managed platform, and we can help you with that for sure. I'm happy to chat about that if you wanna contact me after. But everything I showed today is fully open source, so you're able to use that. Just download it from PyPI, check it out on GitHub, and contribute, star, et cetera.
Last, do you have a recommendation on the size of the groundedness eval set?
This is kind of up to you in your domain. If there's a really big variety in the sorts of questions you expect your users to ask your application, and if there's a big variety in the sorts of contexts that you're running your RAG on, then you probably do want a larger eval set. But if it's a tighter domain, then a smaller one is probably fine.
Cool.
Maybe not a very satisfactory answer, but...
A lot of times, people ask me questions and the answer is always just, it depends on your data.
Okay. I think this is good timing. It's 57, so we're pretty much right at the end of the hour. Thanks a lot for giving this presentation, Josh, this is really cool. I mean, GPT-3.5 being better than GPT-4 at eval is quite a hot take. This is not something that I've seen, so this is very cool. Thank you very much, Josh. Thank you very much, everybody. We'll see.
Yeah, thanks for having me. Thanks everyone for coming. This was a ton of fun, really great questions. So yeah, look forward to talking to you later. See you guys.