Webinar
Effective RAG: Generate and Evaluate High-Quality Content for Your LLMs
About this Session
Join Atindriyo Sanyal of Galileo and Yujian Tang of Zilliz for a deep dive into RAG and LLM management. This webinar will equip you with actionable insights and methods to enhance your LLM pipelines and output quality.
This promises to be an informative session for any data scientists and machine learning engineers looking for frameworks and tools to optimize RAG and LLM performance.
You’ll learn:
- When should you do fine-tuning for RAG?
- When should you use a vector database for RAG?
- The architecture of Zilliz's vector database and its role in RAG
- Mechanisms for fine-tuning language models to generate precise outputs
- A metric-driven framework for evaluating RAG context sparsity and relevance
- Techniques to ‘fix’ content sparsity and improve LLM outputs based on quantitative evaluation
Today I'm pleased to introduce today's session, Effective RAG at Scale: The Role of Vector Embeddings, and our guest speaker, Atindriyo Sanyal. Welcome, Atin. I'm going to kick off this presentation, and Atin will give some commentary along the way. Then, maybe halfway or two thirds in, I'm going to hand it over to Atin to talk about evaluation for retrieval augmented generation. First, I'll start off with a little bit about me.
My name is Yujian Tang. I am a developer advocate here at Zilliz. There's a QR code on the side; if you scan it, it will take you to my LinkedIn, where we can connect and you can ask any questions you'd like. My background is mostly in machine learning and software engineering.
For the past few months, basically all I've been doing is building RAG apps. Atin, would you like to introduce yourself?
Yes. Thank you for having me here at the webinar today, and thanks everyone for joining. My name is Atin, and I'm one of the founders of a machine learning company called Galileo. We're essentially in the business of evaluating models better and providing better frameworks for evaluating models and use cases, particularly those involving unstructured data.
RAG is certainly a very big part of that use case. A bit about me: I've spent about 12 years building machine learning systems, mostly for larger companies like Apple and Uber. I spent the earlier part of the last decade at Apple, mostly doing old-school NLP. A lot of the ML tech we built went into Siri and eventually into other parts of Apple's AI ecosystem.
Funny enough, I was part of the first effort we had at Apple toward building APIs for NLP. We called it SiriKit, which was the first Siri API. It's funny to think that it started in 2014, almost 10 years ago at this point, and there are so many parallels I can draw to new-age LLM workflows like RAG workflows, although the issue back then was that models weren't as powerful as they are today. I also spent many years at Uber, where I was the architect of the first feature store, built by Uber's Michelangelo team, and did a lot of work around data quality for all of Uber's machine learning.
A lot of those learnings went into founding Galileo, on the principle that evaluating models is certainly a big challenge and that there's a lack of tooling for evaluating models better. Thank you for having me, and I hope you find this webinar interesting.
Yes. And for those of you who haven't scanned the QR code there, that is Atin's LinkedIn, if you would like to find him there.
Today we're going to cover essentially five topics, or really four topics and a demo. We're going to talk about why you would use retrieval augmented generation, then about how you can build your RAG apps, and then about the role of vector embeddings: what they are, how you get them, and what they do. After that comes evaluating your RAG outputs using vector embeddings.
Then at the end, we're going to go into a demo. Atin is going to show us how you can use Galileo to evaluate your RAG outputs. So we'll get started with why to use RAG. The real thing we look to tackle with retrieval augmented generation is that large language models don't have access to your data. Not only that, a lot of them have a tendency to hallucinate.
Here I've picked a couple of images to demonstrate this. A lawyer used ChatGPT in court sometime within the last few months, cited fake cases, and got in trouble for it. You really don't want to be that lawyer. In addition, there was a lot of buzz in academia and in schools around people using ChatGPT and making up fake articles to cite and talk about.
Atin, I believe you also have something cool to say about hallucinations.
Yeah, I've been deep into hallucinations and figuring them out for modern models for the last many months here at Galileo. First off, of course, RAG is probably the first antidote to hallucination, simply because LLMs are frozen in time. How do you solve that problem? How do you bring them to modern-day reality? Through RAG. At a higher level, though, there are many dimensions to hallucinations. RAG is designed to solve one key aspect: every LLM response you generate needs to be grounded in some very specific context, and that's the context that vector DBs and RAG workflows provide.
But it gets interesting when you talk about the other dimensions of hallucinations. There are intrinsic hallucinations, which are more closed domain, and there are extrinsic hallucinations, such as factual errors, which these kinds of models make. There are a lot of very interesting techniques for detecting those, which I'll share later. But embeddings and the role of a vector DB are extremely important in fetching the right context, and that's probably the zero-to-70% mitigation strategy for hallucinations.
Yes. And that's why we promote using Milvus to solve this problem. So let's dive into why LLMs hallucinate, and then we'll get into how we're going to solve this issue. I'm going to take you all the way back to maybe the 1970s, when neural networks were coming out. We had these perceptrons come out in the late fifties and early sixties, and essentially what they were doing was learning from some sort of data.
What we found was that if you put a bunch of these perceptrons, or neurons, together, you can get something that models a more complex relationship, a more complex pattern in your data. We started with just one of these neurons: it takes an input, there's some sort of bias, it does some math, and then it gives you an output. As we started putting these together, we came up with the neural network structure, where we arrange these perceptrons, these neurons, in layers.
This image here is a very basic neural network: a three-layer deep neural network. The reason it's called deep is that it has a hidden layer, and this hidden layer is going to come into play later as an important piece of working with vector embeddings. We'll cover that in a later section, but do remember that the hidden layer is an important piece of the vector embedding workflow. From here, we moved into recurrent neural networks for natural language processing. What we actually found is that certain neural network architectures are better for certain kinds of data.
The basic neural network architecture that I showed before is good for basic kinds of data: maybe we have a bunch of numbers and we're trying to find some sort of classification. Two kinds of neural networks became popular for working with data that wasn't naturally easy to feed in: convolutional neural networks and recurrent neural networks. For natural language processing, we found that recurrent neural networks were a great architecture because they allow you to take sequence context into consideration.
You'll see this diagram here; don't worry too much about understanding all of it. What you should pay attention to is this h_{t-1}, x_{t-1}, h_t, x_t, h_{t+1}, x_{t+1} part. These three little blocks are showing you that in a recurrent neural network, where we feed the output of some of the neurons back into themselves, we get the ability to keep track of these tokens, this context, these sequences over time. For example, if we had a sentence such as "the cat in the hat" and we're at the word "cat", then x_{t-1} is "the", x_t is "cat", and x_{t+1} is "in". As we move along, we'll have different words at x_t, but everything else around it follows the same pattern. One of the problems with recurrent neural networks is that over time they lose context. They lose context for many reasons, but perhaps the most well-known is the vanishing gradient.
The vanishing gradient problem is that as you do more of these operations, your gradients shrink toward zero, and then you no longer have that information. The way we got around this was to build something called the transformer architecture. What I have on the screen here is a super simplified example of what a transformer architecture looks like. At the base, transformer architectures use an encoder to take your input and transform it into a hidden state. The hidden state is really just a bunch of vectors, a matrix, and it encodes the context of your sentence or sequence over time, as well as the positional embedding of where your token is.
Then we feed this set of vectors, this matrix, into a decoder. The decoder may or may not apply something called self-attention, which I have marked here as additional input; you've probably heard the term self-attention before. In most architectures right now, it does apply self-attention. So you take the hidden state, which is a matrix, and self-attention, which is also a matrix, and you feed them into the decoder, and the decoder gives you what the next token should be, or something like that. That's exactly what GPT does: GPT is a decoder-only architecture.
When you feed GPT words, it turns them into tokens and positional embeddings, which give the decoder the hidden state, the information it needs to put out the next token. In this example, when we give GPT "the chicken walked", it should produce something like "across the road", because this is probably what it sees most commonly in its data. What it's really doing is predicting the next token. When it sees "the chicken walked", it predicts "across", because "across" is probabilistically the most likely token to come next, so that's what it puts out. So the reason ChatGPT hallucinates is that it's set up to predict a series of words or tokens.
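To make the next-token idea concrete, here is a minimal sketch. It is not how GPT is implemented; the candidate tokens and their scores are made-up values chosen for illustration, and the only real technique shown is turning scores into probabilities with a softmax and picking the most likely token.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical scores a decoder might assign after "the chicken walked".
toy_logits = {"across": 3.2, "home": 1.1, "slowly": 0.7, "banana": -2.0}

def predict_next_token(logits):
    """Return the most probable token together with the full distribution."""
    probs = softmax(logits)
    return max(probs, key=probs.get), probs

next_token, probabilities = predict_next_token(toy_logits)
print(next_token)
```

The model always returns some most-likely token, whether or not it has seen the relevant facts, which is one way to think about why hallucinations happen.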
Now let's go into how you can build your RAG app. Essentially, RAG is injecting your custom data on top of an LLM and using similarity search to find the right data. This works really well for one of the types of hallucinations that Atin mentioned earlier. Atin, did you want to expand a little on the other types of hallucinations as well?
Yeah, for sure. It's a good premise for what we're getting into here.
One of the key purposes of RAG is to bake context into the LLM responses. Just at a super high level, we've done very interesting experiments internally on measuring and quantifying hallucinations in modern state-of-the-art LLMs, and we consider anything from GPT-3.5 onward to be state of the art. A lot of our evaluation philosophy and metrics don't really cater to the older GPT architectures, because the modern ones simply don't make those simple mistakes anymore.
Some of the revelations we've had from our hallucination experiments are very interesting. One of them, in the context of RAG, is this idea of how much the output of the LLM is grounded in the context you've provided, which is of course obvious. But the next level of challenge in these kinds of workflows becomes: how do you specifically point to the areas the model used to make that assertion or arrive at that answer? A lot of the errors are more abstract and more reasoning based, so it's hard to pinpoint specific tokens. That's where the next level of challenge is: how do you bake more explainability into the whys, beyond whether the answer is grounded in the context or not? How do you take it a step further? That's in the context of RAG. There are also more open-domain use cases, where the model answers from what it was trained on, without using the context.
In those cases, I think the hallucinations typically tend to be more around logical or reasoning-based errors or factual mistakes. So those are essentially the two or three dimensions of hallucination. In my slides, I'll talk a little more about the key inputs and outputs of an LLM system, particularly a RAG system, and then we'll go deeper into what specific kinds of errors occur in the input, the context, and the output that lead to hallucinations. But I'll let you continue here.
Okay, cool. I'm excited to hear about that in the later sections. So let's start off with what a basic RAG architecture looks like. Essentially, when you're building a RAG architecture, you're thinking: how do I inject my data on top of this LLM? You can actually drop this first LLM step, but most of the time I see people use an LLM to make their query more semantically searchable in the vector database. Typically what happens is that you, as a user, come in with a query. You say, "I want to know what animal is in a hat." It goes to the LLM, and the LLM says, "Find me 'animal hat'."
Then it goes to Milvus, and Milvus says, "the cat in the hat." That gets sent back to the LLM, and the LLM takes it and says, "Okay, this is semantically the most similar response we have in the vector database. We're going to make that human readable for you and tell you that the animal in the hat is the cat." So that's the basic RAG architecture: take my query, break it down into something that makes sense, and ask my vector database, like Milvus, what the most similar thing it has is. Then we get that back, match it to our original query, and answer the question.
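The query flow just described can be sketched end to end with stand-ins. Everything here is an assumption for illustration: `rewrite_query` and `answer` stand in for the two LLM calls, `retrieve` stands in for the vector database, and `embed` is a toy bag-of-words counter rather than a real embedding model.

```python
from collections import Counter
import math

DOCUMENTS = [
    "the cat in the hat",
    "the chicken walked across the road",
    "green eggs and ham",
]

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rewrite_query(question):
    """Stand-in for the first LLM call: keep only the searchable keywords."""
    stopwords = {"what", "is", "in", "a", "the", "which"}
    return " ".join(w for w in question.lower().strip("?").split() if w not in stopwords)

def retrieve(query, k=1):
    """Stand-in for the vector database: rank documents by similarity."""
    q = embed(query)
    ranked = sorted(DOCUMENTS, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def answer(question):
    """Stand-in for the final LLM call that makes the context readable."""
    context = retrieve(rewrite_query(question))
    return f"Most similar context found: {context[0]}"

print(answer("What animal is in a hat?"))
```

In a real app, each stand-in would be replaced by a call to an LLM or to the vector database, but the order of the steps stays the same.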
So that's the basic one. Oh, I didn't have that in here. Before the basic tech stack for RAG, I'm going to pause, because there are a couple of things I wanted to add. You can make this architecture better by inserting a caching layer. Of course, this is under the assumption that your RAG app is going to get a lot of use.
This is something we saw internally when we built OSS Chat: a lot of people were asking very similar questions. OSS Chat is not in the slide deck, but osschat.io is a site you can go to to ask about open source software. We built this and found that a lot of people were asking very similar questions.
Maybe we can cache the questions and the responses. Not only will the user get a faster, better experience, but we're also going to save money by not calling OpenAI's LLM API endpoint all the time. That sits right here: before the query goes to the LLM or Milvus or anything, we send it to the cache and see if we have an answer. So the basic RAG tech stack looks like this. You have ChatGPT or any other LLM, such as Llama or Falcon. Then you have a vector database, and you have some sort of prompt-as-code framework.
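As a rough sketch of the caching layer mentioned above, the wrapper below checks a cache before calling the rest of the pipeline. This is not how OSS Chat is built; `CachedRagApp` and `fake_pipeline` are hypothetical names, and the cache only matches questions that are identical after simple normalization.

```python
class CachedRagApp:
    """Check a cache before sending a question to the LLM and vector database."""

    def __init__(self, pipeline):
        self.pipeline = pipeline      # any callable: question -> answer
        self.cache = {}
        self.pipeline_calls = 0       # counts the expensive calls we actually make

    @staticmethod
    def _key(question):
        # Normalize case, trailing punctuation, and whitespace so that
        # near-identical phrasings share one cache entry.
        return " ".join(question.lower().strip("?!. ").split())

    def ask(self, question):
        key = self._key(question)
        if key in self.cache:
            return self.cache[key]
        self.pipeline_calls += 1
        result = self.pipeline(question)
        self.cache[key] = result
        return result

def fake_pipeline(question):
    """Hypothetical stand-in for the full LLM plus vector database round trip."""
    return f"answer to: {question}"

app = CachedRagApp(fake_pipeline)
first = app.ask("What is Milvus?")
second = app.ask("what is  milvus")
print(app.pipeline_calls)
```

To catch questions that are similar rather than identical, the cache lookup could itself be a similarity search over embedded questions, at the cost of choosing a similarity threshold.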
The way you can think of this is like a computer. In your computer, you have a processor, a CPU, something that allows the computer to perform a bunch of operations. You also have your ROM, your hard drive, which is what allows the computer to store its memory, and that's where Milvus is. Then you have something that lets all of these components interact with each other and lets you interact with the computer, and that's where Haystack or LangChain or some other prompt-as-code orchestration framework comes in. The reason this architecture looks this way is that one of the core driving factors of retrieval augmented generation is being able to do semantic similarity search. This is essentially the functionality the vector database provides.
What happens in similarity search is that you come with your data. It can be any sort of unstructured data: images, video, audio, PDF, CSV, whatever, as long as you have the right vector embedding model. That's the step from one to two here: with the right embedding model, you can embed any kind of data into a vector, and then you store it in a vector database. Then comes query time, when it's time to ask questions or do product recommendations or something like that. In terms of RAG, it's time to ask questions. You once again take whatever data it is that you want to ask questions about.
Maybe you have an image that you want to turn into a sentence, or maybe you just have a sentence and want to ask a question. You use the same model that you used before to get the vector embedding. It's very important that you use the same model, or at least one with the same dimensions, depending on what you're measuring, because vector similarity only makes sense on vectors of the same dimension. So this is an important step: make sure you're using the same model.
Once you have that vector, you feed it into the vector database, and the vector database performs an approximate nearest neighbor search. This is essentially the similarity search, and we'll cover what that looks like in the next section. What I want you to get out of this slide is the steps, the workflow that's happening here. Once you have the vectors that are your approximate nearest neighbors, you get those results, and the LLM once again turns them into something human readable and sends it back to you. The takeaway is that a RAG app can be built like a computer: use an LLM for your compute, which is your CPU or GPU, and a vector database for storage.
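The search step can be illustrated with a small sketch. Note the swap: this is an exact, brute-force nearest neighbor search, not the approximate index a vector database would use, and the stored vectors are invented three-dimensional values. It also shows why the query and stored vectors must share a dimension.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity; only defined for vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, stored, k=2):
    """Exact search: score every stored vector and keep the top k names."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in stored.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Hypothetical embeddings, as if produced earlier by the same model.
stored_vectors = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.0],
    "doc_c": [0.8, 0.2, 0.1],
}
results = nearest_neighbors([1.0, 0.0, 0.0], stored_vectors)
print(results)
```

Scoring every vector is fine for a handful of documents; approximate indexes exist because this linear scan becomes too slow at scale.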
That's your hard drive, your solid state drive. Your prompt-as-code framework is the interface: how do things talk to each other? So this is the main takeaway: LLM, vector database, and some sort of prompting. So what is the role of vector embeddings when it comes to RAG? This is going to be a short section, and the last one I cover before I pass it on to Atin to cover evaluation.
Actually, Yujian...
Yes?
If you can just go back to the slide before this, I want to highlight something interesting for our audience.
Yes.
The left-hand side of this diagram is very interesting and very important, because it takes me back to some of the more complex embedding generation techniques I've worked on in my career, particularly for recommendation systems. There, instead of just using sentence transformers to create embeddings and storing them in a vector store, there are more complex model architectures, like the two-tower architecture, where you can take user embeddings and item embeddings, say for a shopping website.
In my own experience, we built the entire vector search ecosystem for Uber Eats. There we would create embeddings from user activity on the Uber app. In fact, there's a workshop paper I wrote with folks at Stanford; if you Google "feature stores for embeddings", hopefully it should pop up. It specifically talks about how certain sets of embeddings are dynamic in nature and change over time as you interact with a system.
Say in the case of a shopping app, you're using the app, generating clicks, and dynamically evolving embeddings. So the left-hand side of this diagram highlights this more complex workflow, where you have an embedding management system that is constantly curating newer embeddings for users, which eventually get stored in a vector store. There are some very interesting new RAG use cases coming up in my discussions with many of our customers, and even beyond, where you want to use these kinds of embeddings, which are not necessarily text embeddings. You tie them in with a multimodal model to generate textual representations of user activity, which eventually get stored in a different index in the vector store. That serves as input to an LLM, which leads to a more complex workflow.
But that can solve some very interesting problems, where you're essentially turning user activity into text. I know a few friends of mine who are building Zipline, which is Airbnb's feature store, and they're exploring these kinds of ideas: you create textual representations of user activity and then use a vector store to augment the context of an LLM. Then you can generate very interesting analytics and summaries of user activity. It's a very interesting new use case that's coming up. I see that the sky is the limit for RAG workflows, and this is just one example.
Oh, okay. That's really cool; I hadn't heard about that. So I'm going to ask you to dig a little more into it right now. For the transform side, the left side, you're saying that people are working on, or maybe there already exist, methods where you're taking something live, what's going on in the app, some sort of features of what's going on, and transforming that into vectors to store? And then at query time, are they able to do that again for the same user? What's going on at query time?
Yeah, absolutely.
Essentially, there are established architectures like the two-tower architecture, which many in the audience might know, and which you should certainly read about. You take features from users, from items, from any entity that's in your data warehouse, and you create vector representations from those features. The vector itself is the output of a two-tower deep learning model. That has proven to be a very efficient way of representing more abstract things like user activity. You can cluster people who interact with your system in the same way, because their embeddings map to the same region of the space.
But leveraging those in a language model is a new avenue of RAG workflows that is emerging now.
Wow, okay. That's really cool. I haven't read this paper, so I'm going to go check it out.
Yeah, just Google "feature stores for embeddings" and it should come up. It's an old paper, but we wrote it seeing this upcoming wave of vector databases becoming a lot more important because of the importance of embeddings and some of the problems we were able to solve with them a few years ago. It's very interesting how this has manifested with the advent of GPT and LLMs.
Okay. So I'm going to talk briefly about the role of vector embeddings and how you can get them other than by using the two-tower method, and then I'm going to pass it on to Atin to talk about eval. Traditionally, in most cases, vectors come from some sort of deep learning model. You take your knowledge base, which can be, as I was saying earlier, images, video, text, audio, whatever you need, and you feed it into a deep learning model.
In the case of deep learning models like ResNet or sentence transformers like MiniLM, what you do to get the embedding is cut off the last layer and take the outputs of what remains. That is the vector embedding. These outputs are the model's internal semantic, or numerical, representation of the input. Once you have those vectors, you store them in a vector database such as Zilliz or Milvus. Before we get into how you evaluate this kind of stuff, I want to leave you with a quick tutorial-style summary of how semantic similarity works with vectors. The basic idea I want you to get from this slide is that you can do math on words or images or anything else that isn't originally numbers by using vector embeddings.
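The idea of cutting off the last layer can be shown on a tiny network. This is only a sketch: the weights are arbitrary made-up numbers, the network has one hidden layer of two units, and real models like ResNet or MiniLM are far larger. The point is that `embed` stops before the final layer and returns the hidden activations.

```python
import math

# Hypothetical fixed weights for a tiny 3-input, 2-hidden, 2-output network.
W_HIDDEN = [[0.5, -0.2, 0.1],
            [0.3, 0.8, -0.5]]
B_HIDDEN = [0.0, 0.1]
W_OUTPUT = [[1.0, -1.0],
            [-1.0, 1.0]]
B_OUTPUT = [0.0, 0.0]

def dense(weights, biases, inputs):
    """One fully connected layer: weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def hidden_layer(inputs):
    """Hidden layer with a tanh activation."""
    return [math.tanh(v) for v in dense(W_HIDDEN, B_HIDDEN, inputs)]

def classify(inputs):
    """Full forward pass: hidden layer followed by the final output layer."""
    return dense(W_OUTPUT, B_OUTPUT, hidden_layer(inputs))

def embed(inputs):
    """Skip the final layer and return the hidden activations as the embedding."""
    return hidden_layer(inputs)

vector = embed([1.0, 0.0, 2.0])
print(len(vector))
```

The embedding dimension equals the width of the layer you stop at, which is why two different models generally produce vectors that cannot be compared with each other.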
I also want to point out that you'll never see two-dimensional vectors in real life, and you'll never see anybody using Manhattan distance as their vector metric in real life. This is a toy example to get the concept across. So let's get into it. What I wanted to show you here is that the word "queen", (0.3, 0.9), minus the word "woman", (0.3, 0.4), gives you (0, 0.5), and then if you add the word "man", (0.5, 0.2), you arrive at the point (0.5, 0.7), which corresponds to "king". A couple of other things I want to note here. First, "queen" and "woman" have the same value on the first dimension.
What that tells us is that these words mean the same thing along that first dimension, but it doesn't tell us anything about what the dimension means. It could be that both of these words have five letters, and that's why they have the same value along this dimension. It just means they have the same value along that dimension.
The other thing I want to point out is that "queen" and "king", and "woman" and "man", differ by the same values along the x and y axes. What that means is that these pairs of words have the same relationship: whatever transformation applies from "queen" to "king" is the same as from "woman" to "man". That's basically what I wanted to point out. I just want you to get the idea that you can do math on things that aren't originally numbers via vector embeddings. And now, Atin, I'm going to turn it over to you. I'm going to stop sharing my screen, and you can share yours.
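The toy arithmetic from the slide can be checked in a few lines. The coordinates are the ones quoted in the talk; the helper names are made up for this sketch, and Manhattan distance is used only because the example does, not because it is a recommended metric.

```python
# Two-dimensional toy vectors quoted in the talk.
WORDS = {
    "queen": (0.3, 0.9),
    "woman": (0.3, 0.4),
    "man": (0.5, 0.2),
    "king": (0.5, 0.7),
}

def subtract(a, b):
    return tuple(x - y for x, y in zip(a, b))

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def manhattan(a, b):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def closest_word(point):
    """Return the vocabulary word whose vector is nearest to the point."""
    return min(WORDS, key=lambda w: manhattan(WORDS[w], point))

# queen - woman + man should land on (or very near) king.
result = add(subtract(WORDS["queen"], WORDS["woman"]), WORDS["man"])
print(closest_word(result))

# The offset from queen to king matches the offset from woman to man.
queen_to_king = subtract(WORDS["king"], WORDS["queen"])
woman_to_man = subtract(WORDS["man"], WORDS["woman"])
```

Looking up the closest word rather than testing for exact equality avoids depending on floating-point rounding.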
Sounds good. Alright. In this section I'll try to cover something that we as a company consider very, very important as a practice in any machine learning workflow, which is evaluation. The whole basis of founding Galileo was that we need better evaluation metrics. The traditional metrics that we use, for accuracy measurement et cetera, are necessary, but they're not sufficient.
They treat all kinds of errors the same way, and they're unable to distinguish the big errors from the bad. There's also the fact that a lot of the errors essentially boil down to the data. We are basically in the business of creating new kinds of metrics for modern ML workflows, and the LLM workflow, the RAG workflow, the vector DB workflow, is this new workflow that has emerged.
So I'm going to talk about how you evaluate the general health of your RAG workflow and which specific parts you should look at. I'll keep it super high level, just to bake in an understanding of why this is important and what to truly evaluate. This is a 10,000-foot view of a RAG workflow. On the left-hand side, you have the prompt or the query; on the right-hand side, there's the output of the LLM. The gap that vector DBs fill is essentially baking in context, very relevant and similar context to what you need. And of course, you can achieve that at scale using technology like Zilliz and Milvus.
But a lot of questions crop up in this workflow, and these are essentially unknowns to the practitioner. Typically, the questions asked on the prompt side are whether your prompt is pertinent for getting your desired LLM outcome. On the documents or context side, which is the data fetched from your vector store: are the documents relevant? Are they fetched from a sparse region of the embedding space? Are they the most ideal documents for getting the desired outcome? In the end, whether you're an application developer or a model fine-tuner, you're trying to get the best output from your model. There are all these knobs you can tune to adjust the output of the LLM, so it becomes very necessary to quantify some of the issues you're seeing.
On the output side, of course, there's the big issue of hallucinations, which in the context of retrieval-augmented workflows essentially boils down to how much of the context your model used to answer the query, and which parts of the documents the answer was grounded in. So what we've done is calibrate and quantify some of these issues into metrics. I'm going to talk about some of these, and I'll also demo a few of them in a very simple RAG workflow. Talking through them: on the prompting side, there's prompt relevance, which is a quantification of the pertinence of the prompt, or the query, to the output of the LLM.
On the document side, I talked about sparsity, so let me dive a little deeper into why that can be an issue. Your vector store essentially stores the embeddings of the entire set of documents, which you can consider the syllabus of your LLM. These vectors are embeddings, which are derived representations of the original documents. But as Yujian mentioned, the cool thing about embeddings is that you can do math with them. They cluster together, and they give you insights into the raw data that you wouldn't otherwise have gotten.
A typical operation with embeddings is, of course, dimensionality reduction and clustering, and they map into this n-dimensional space. But the space can be dense or it can be sparse, depending on the overall set of documents that you put into your vector DB, and the sparseness or density of the regions from which you fetch your context and your documents can make a big difference in the output of the LLM. So sparsity is essentially just a measure of how dense or sparse the region of the embedding space is that you fetched your data from. Beyond that, there's the relevance of the document. I was talking about attribution and actionability earlier: it's one thing to quantify something as good or bad, but how do you take the next step? In the case of RAG workflows, if you detect a hallucination or an error in your output, the first step is to quantify the error at a high level: how grounded was the output in your documents?
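One simple way to put a number on the sparsity idea described above is the mean distance from a retrieved embedding to its nearest neighbours in the store. The sketch below is only an illustration under that assumption; it is not Galileo's metric, and the function names, the choice of Euclidean distance, and the toy 2-D points are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_sparsity(index, corpus, k=3):
    """Mean distance from corpus[index] to its k nearest neighbours.

    A larger value means the point sits in a sparser region.
    This is an assumed proxy, not a published definition.
    """
    point = corpus[index]
    distances = sorted(
        euclidean(point, other)
        for i, other in enumerate(corpus)
        if i != index
    )
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


# Made-up 2-D "embeddings": four points in a tight cluster, one far away.
corpus = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0),
]

dense_score = knn_sparsity(0, corpus)   # point inside the cluster
sparse_score = knn_sparsity(4, corpus)  # isolated point
print(dense_score, sparse_score)
```

A chunk fetched from the isolated region gets a much larger score than one fetched from the cluster, which is the kind of signal that could flag context pulled from a thinly covered part of the corpus.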
But then, more specifically, which areas of the documents did the model consider towards the answer? Doc relevance is a metric that quantifies that. And then finally, the main one is the output of the LLM, whether it hallucinated or not, which we capture through a metric we call groundedness for now. It's a fairly simple metric, and Galileo's groundedness is the technique we use to measure it. We use a very nuanced chain of thought to reason about why something might be grounded in the documents or why not, and we've found that to be a very effective technique.
In fact, a quick highlight of the hallucination work that we've done over the last many months: we've curated over 15 datasets across many different kinds of LLM tasks and had experts from the world over go through thousands and thousands of data points and label them as hallucinated or not. These were domain experts. In our experiments, determining groundedness through the chain-of-thought method I was talking about showed an AUC of over 0.85 when matched against these human expert annotations. I'll touch on that a little in the demo as well, but I just wanted to highlight it.
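Since AUC and accuracy are easy to conflate, here is a small sketch of how the two are computed from a metric's scores and human labels. The labels and scores are made up for illustration and have nothing to do with the datasets mentioned above; the function names are hypothetical.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Labels are 1 (positive) or 0 (negative).
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of items whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)


# Made-up data: 1 = annotator marked the answer as hallucinated.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.45, 0.6, 0.3, 0.1]

auc = roc_auc(labels, scores)
acc = accuracy(labels, scores)
print(auc, acc)
```

AUC measures how well the scores rank positives above negatives across all thresholds, while accuracy depends on one chosen threshold, so the two numbers generally differ on the same data.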
So, tying it back to the original workflow that you saw, I think the missing piece is essentially this algorithmic evaluation and quantification. And it's a loop, right? You figure something out, then you go back to your querying, you adjust the query, you adjust the documents, and you keep doing that until you tune your way to the ideal output of the model. So that's just adding a few components to the simple RAG workflow that we saw. All right, enough talk. I'd love to get into a demo. And just keeping a check on time here.
We're good. It's 9:40, so you should totally be good.
Awesome, great. So in the demo, just to give some high-level context, it's a fairly simple demo meant to convey the idea, but feel free to use your imagination to apply the same principles to more complex RAG workflows.
In this demo, imagine that you are a data scientist building a Q&A system, and your goal is to build a question-answering machine that answers purely based on the documents provided as part of your input and does not rely on any extrinsic information. So you have an LLM in the loop, which is a pre-trained model or some API backed by a very powerful GPT-3.5 or GPT-4 level model, and you're trying to construct a RAG workflow that restricts the output of the model purely to the documents that you provide. It could be for various reasons: you work for a fintech company or a healthcare company where you're not allowed to divulge general information, and things of that sort.
So I am a data scientist or a prompt engineer who's building this LLM app, and I'm going to use Milvus. The first thing I'll do is install a bunch of libraries, including pymilvus here. Once I do that, I set my environment variables. Getting to the main components, you can see that one of my goals is to generate embeddings for the original data, the documents that I have.
In this case, I'm using a sentence transformer, which is a simple BERT-based embedding model that you can use through one line of Hugging Face code. I use that transformer model to generate embeddings for my documents here. Here I'm initializing my Milvus store. This particular piece of code takes the query, which is the input to the RAG system, as its input, and then we use the similarity search method that Milvus offers.
Then we take those embeddings, and here in the demo you can see that we are essentially concatenating them into a simple data frame. The data frame contains the input questions, like "Explain vector embeddings", "What is the best phone to buy?", and "What is machine learning?". It's a Q&A system, so this is the original input. Then there's the textual context, which is the actual document chunk; the embedding for that chunk, which was fetched from the vector store; and the embedding of the query, which was generated by the sentence transformer. I club all of that into a simple data frame.
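The retrieval step and the resulting table described above can be sketched without any external services. This is a stand-in, not the demo notebook: `toy_embed` replaces the sentence transformer with word counts over a tiny made-up vocabulary, an in-memory list replaces the Milvus collection, and the document chunks are invented for illustration.

```python
import math

# Hypothetical tiny vocabulary; a real pipeline would use a sentence-transformer model.
VOCAB = ["vector", "embeddings", "phone", "machine", "learning", "model"]


def toy_embed(text):
    """Stand-in embedding: counts of each vocabulary word in the text."""
    words = text.lower().replace("?", " ").replace(".", " ").split()
    return [float(words.count(term)) for term in VOCAB]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_rows(questions, chunks):
    """For each question, fetch the most similar chunk and record both embeddings."""
    store = [(chunk, toy_embed(chunk)) for chunk in chunks]  # stand-in for the vector store
    rows = []
    for question in questions:
        query_vec = toy_embed(question)
        best_chunk, best_vec = max(store, key=lambda item: cosine(query_vec, item[1]))
        rows.append({
            "question": question,
            "context": best_chunk,
            "context_embedding": best_vec,
            "query_embedding": query_vec,
            "similarity": cosine(query_vec, best_vec),
        })
    return rows


# Made-up document chunks.
chunks = [
    "Vector embeddings map text to points in a vector space.",
    "Machine learning fits a model to data.",
    "The advantages of MobileNet: the model is small and fast.",
]
questions = [
    "Explain vector embeddings",
    "What is the best phone to buy?",
    "What is machine learning?",
]

rows = build_rows(questions, chunks)
for row in rows:
    print(row["question"], "->", row["context"], round(row["similarity"], 3))
```

The phone question matches nothing in the store, yet top-1 retrieval still returns a chunk (with similarity 0.0), which is exactly the kind of irrelevant context the evaluation metrics are meant to catch. A list of dictionaries like `rows` can be passed to `pandas.DataFrame` if a data frame is needed.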
Now, because I'm doing prompting, I'm going to define my prompt. This is one example of a prompt. Typically, in my experimentation phase, I'll be trying out many different tweaks and turns to the prompt, mostly based on the evaluation metrics that I'll show you. But here's an example of one prompt. You essentially tell the model that it's an expert Q&A system, that it should answer only based on the context below, and that it shouldn't directly reference anything outside of the context, but should also avoid statements like "based on the context".
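A prompt of the kind described above might look like the following sketch. The wording is a paraphrase rather than the exact prompt used in the demo, and the `{context}` and `{question}` placeholder names and the `render_prompt` helper are assumptions made for illustration.

```python
# Paraphrased template; the demo's exact wording is not shown in the transcript.
TEMPLATE = (
    "You are an expert Q&A system.\n"
    "Answer the question using only the context below.\n"
    "Do not reference anything outside the context, and avoid phrases "
    "such as 'based on the context'.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def render_prompt(question, context):
    """Fill the template with one question and its retrieved context."""
    return TEMPLATE.format(question=question, context=context)


prompt = render_prompt(
    question="What is machine learning?",
    context="Machine learning fits a model to data.",
)
print(prompt)
```

Keeping the template as a single string with named placeholders makes it easy to swap in variants during experimentation while the retrieval code stays unchanged.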
So you're steering the model here through the prompt, and essentially you're providing this context. In Galileo, we use "context" as a keyword to detect the presence of RAG and calibrate certain RAG-specific metrics for you. I'll show you that in a second. But you can also use the Python library to override the specific metrics that you are interested in.
So say you're only interested in groundedness and context relevance: you provide those. Of course, there's a whole host of other metrics, which should show here, but the notebook is not connected, which is why it's not showing. There's a whole host of metrics that you can see in our scorers library. In this case, I'm overriding the general metric calibrations to mainly give me these two. For the RAG workflow, there are certain metrics that are automatically included.
If you have ground truth, if you have baselines, we give you the usual suspects too. Uncertainty is another interesting metric that we have in our system. It basically measures the confusion of the LLM in its own output. The way we do that is we track entropy in the tokens that it's spitting out, and that gives you a good sense of where the model wasn't sure and where it was certainly sure.
In our experiments, one interesting thing we've seen with these modern machines is that when they start spewing out the initial set of tokens, they're often very high on entropy, almost like they're warming up. Then, as they spew more tokens based on the attention they've already accumulated, the entropy gradually goes down. It's very interesting to see the internal behaviors of these models. That's another score that we give you, and it gives you a good sense of the confusion in the output.
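The entropy tracking described above can be illustrated with Shannon entropy over each step's next-token distribution. This is a minimal sketch, not Galileo's implementation; the probability distributions are made up to mimic the "warming up" pattern, and the function names are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def uncertainty_profile(distributions):
    """Entropy at each generation step, in order."""
    return [token_entropy(d) for d in distributions]


# Made-up distributions over four candidate tokens at three successive steps.
steps = [
    [0.25, 0.25, 0.25, 0.25],  # first token: model is unsure
    [0.70, 0.10, 0.10, 0.10],  # more committed
    [0.97, 0.01, 0.01, 0.01],  # nearly certain
]

profile = uncertainty_profile(steps)
mean_uncertainty = sum(profile) / len(profile)
print(profile, mean_uncertainty)
```

A uniform distribution gives the maximum entropy for its size, and the value shrinks as probability mass concentrates on one token, so a per-token profile shows where in the output the model was least sure.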
All right, once you define the metrics and the template, you use pq.login. PQ stands for prompt quality, which is our Python library. So you just run pip install promptquality, import promptquality as pq, and then you can log into your hosted Galileo environment with one line of code, and use one more line of code to run your RAG workflow. In our case, we are specifying the project; the template, which I just curated; the data frame, which has the context from the vector store; the set of metrics that I'm interested in; and the model that I want to query. So this is basically your experiment line of code, the line that defines your experiment.
And this is just a single experiment. We of course allow you to do what we call sweeps, which let you pass arrays of models and arrays of hyperparameters. We do some interesting optimizations behind the scenes and run a batch of runs. Once you run this, you'll get this output: it'll execute the entire workflow and give you this link to go to, which I'll open in a second. But you can see we also give you the data, the object itself, the prompt metrics Python object, which has your hallucination number, that is, the quantification of hallucination, and other things like latency and cost.
So you can literally build an entire observability system using just Python, because you get all these metrics out of the box. Before I go into the console itself, I want to highlight an interesting extension to this experimentation framework that we are working on. Typically, in experiments, I was talking about using different templates and different kinds of inputs, and tweaking the prompt to get better output.
But in the RAG workflow, you can also leverage different embeddings. Vector DBs allow you to store different indexes where you can store different sets of embeddings. So what we want to do is give you a layer of experimentation on top of the vector store, where you can essentially fetch multiple embeddings. Say you generate one with a sentence transformer and another with a two-tower model, and you want to see which one works better. We are working on a new version of our API that allows you to override these transformation methods, so you can specify one or more ways of generating the embeddings. Then you stitch them together by providing the transforms, like these functions here.
Then, behind the scenes, we will do the A/B experimentation and tell you which one hallucinated more, which one led to less hallucination, and so on. So I just wanted to bake in the idea that through this Python code you can run these very advanced experiments using a vector DB. Once you run this, you click this link, which leads you here. This is the Galileo UI. At the high level, this is the one run you did. If I go back to my data frame, the single run is testing out three queries, right? There's "Explain vector embeddings", "What is the best phone to buy?", and so on. Those are the things that I'm testing with.
If I click into this, it'll show me the three queries that I made, and it'll show me the stitched-together version of the whole thing. Already off the bat, when you start your evaluation process, you look at these top-level metrics, and one thing you see is that the aggregate uncertainty is a bit in the red, a bit on the higher side. 0.36 might sound low, but we use standard deviations.
So red means bad. I also see that context relevance, which is a RAG metric that I was interested in, is on the lower side, so there's something off. Ideally I would expect it to be between 0.9 and 1. And here, in the last column, you can see that one of the requests clearly had some grounding issues.
There's a culprit here, which has slightly lower relevance, so I can click into it. I have a host of metrics to evaluate out of the box. The main one I want to highlight is the groundedness metric, which, as I said, uses chain of thought to figure out whether the model itself can tell whether its answer was grounded or not. One of the realizations we've had from our experiments is that these models are often not as good at avoiding the error in the first place as they are at realizing that they've made an error.
So there are tricks you can do with prompting to make the model realize its own mistake. This is a very interesting example. Let's look at the input here: "You're an expert Q&A system..." Oh, I already went through this. The question that I'm asking here is, "What is the best phone to buy?" And the context information that was fetched from the vector store said, "The advantages of Moto: MobileNet."
"The model is great for a few reasons", and then some random text that was fetched from the vector DB. In this case you can clearly see that the model would struggle to answer this kind of question, "What is the best phone to buy?" And here you see the output of the LLM was "MobileNet", which, to the human reader, is not really the answer if you think about it. It's just saying the advantage of Moto is MobileNet. That's not really the best phone to buy. With groundedness, we automated that reasoning. Here it says zero, which is negative, and if you hover over it, it gives you the reason: the response to "the best phone to buy" was "MobileNet".
However, there's no information provided in the context that directly supports this claim. It only mentions the advantages and disadvantages of Moto. It does not compare any specific phones, and the response is not supported by the documents. So this was our RAG metric, groundedness, at work here, and we automated that.
Doing this at scale, imagine how much more leverage you can get in your evaluations. Beyond that, I just want to highlight certain other things. You get LLM uncertainty out of the box. In this case you can see, as I told you before, that the model started off being a bit uncertain, just like they usually do. But the funny thing was that when it gave the answer itself, "MobileNet", it was very confident.
So it gave an overconfidently wrong answer, but we got it to admit its own mistake, if you will, and were able to figure out that the answer was wrong. So this is essentially just evaluating a single workflow. There's of course a lot more you can do if you have, say, multiple runs. If you had multiple runs, you could hit "compare runs" and it would load them into this columnar view, where you can enable these things.
You can look at these responses, and it's a very easy way to do A/B evaluations. At the end, once you decide on the winning run (in this case there's just one, so Galileo marked this one as the winner), you choose it, you click here, and you can get the entire code for that workflow, copy it, paste it into your Python environment, and run it programmatically. So that was a super quick blitzkrieg of a demo. There's of course a lot more here.
But this is how you can do rapid experimentation and very precise evaluation of modern LLMs, modern prompt workflows, and modern RAG workflows with Galileo. I'll hand it back to you, Yujian.
Awesome, dude. Yeah, thanks for going into that. That was really cool to hear about.
It seems that we have a lot of questions. This is great. We only have a few minutes, so I don't think we have time to address all of them. What I'm going to do is try to address some of the questions that are here in the Q&A.
I know that a lot of you guys have also sent questions in the chat. If you have any questions that you want answered, we'll put the video up. It'll go on YouTube, so you can ask on YouTube. It'll be on LinkedIn, so you can message us, or make a post and tag us, whatever. But we won't be able to get to everything.
So I'm going to shut up and try to get to the questions now. Number one: "Isn't a focus on hallucination misguided? Isn't the more important issue the need to include private, confidential information in the NLP application?" I think this one's for me. Is the focus misguided? I don't know. Could be. But yes.
The importance of RAG is that it allows you to provide this private, confidential information. There are some nuances here, but you can also use it to stop the LLM from making up answers, and that's the hallucination part, right? So there's a little bit of both here.
Next: "If the local knowledge base has all the information of interest, what is the advantage of including it in the query and sending it out to a foundation model?" Typically we want to combine multiple answers, multiple things that you get back, and we also want to make sure that the result is human readable. So the point of getting the answers from the vector database and then using the LLM to work on them is so that the output actually answers your question.
"How do you cache when questions can be worded differently?" As long as the vector embedding is within a certain distance, and that distance is up to you as the implementer to decide, you can say, hey, this is close enough, we're going to return this answer.
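The caching answer above can be sketched as a small semantic cache that returns a stored answer when a new query's embedding falls within a chosen distance of a cached one. This is an illustration only: the `SemanticCache` class, the use of cosine distance, the 0.05 threshold, and the 2-D vectors are all assumptions, and a real system would keep the entries in a vector database rather than a Python list.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


class SemanticCache:
    """Hypothetical cache keyed by query embedding instead of query text."""

    def __init__(self, max_distance):
        self.max_distance = max_distance  # threshold chosen by the implementer
        self.entries = []                 # list of (embedding, answer) pairs

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))

    def get(self, embedding):
        """Return the closest cached answer if it is near enough, else None."""
        best_answer, best_distance = None, None
        for cached_embedding, answer in self.entries:
            distance = cosine_distance(embedding, cached_embedding)
            if best_distance is None or distance < best_distance:
                best_answer, best_distance = answer, distance
        if best_distance is not None and best_distance <= self.max_distance:
            return best_answer
        return None


cache = SemanticCache(max_distance=0.05)
cache.put([1.0, 0.0], "cached answer about vector embeddings")

hit = cache.get([0.9, 0.1])   # reworded question, nearby embedding
miss = cache.get([0.0, 1.0])  # unrelated question
print(hit, miss)
```

A tighter threshold reduces the risk of returning an answer to a different question at the cost of more cache misses, which is the trade-off left to the implementer.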
"Is NLP using RNNs now considered obsolete and replaced by LLMs?" I mean, maybe not entirely, but pretty much, from what I've seen, yeah.
Yeah, it's certainly a trend that we are observing in the industry, where people are trying to build more practical systems with these modern LLMs. There's no question that these architectures and these newer LLMs are way better, to the point where metrics and papers published even as recently as 2021 no longer apply. These models are just incapable of making those basic mistakes. That said, what still has to be figured out is how you build practical systems with these LLMs at scale. But the industry is certainly moving in that direction.
Yes, same from here. "Slide 25: does the RAG box include the LLM?" Yes, it does. I'm just going to mark this.
"Can you discuss more on what it means to have the right embedding algorithm? How can you evaluate that you use the right embedding algorithm?" This is going to be a pretty complex question. I think this is something that you should ask either in a comment on the recording or on LinkedIn. "Is there a reference architecture for setting up this kind of system, a test system to examine and consider duplicating?" Atin?
Can you quickly repeat that?
Is there a reference architecture for setting up this kind of system? I think the kind of system the question is referring to is the RAG plus the eval stuff. So he was asking whether there is a reference architecture for it, and a test system to examine and consider duplicating.
Yeah, it's not commoditized yet. It's all work in progress for now. But if you hit me up offline, I'm happy to share some new white papers and blogs that have been written in the industry, including from us, as a reference.
Okay, yes.
Thank you, Atin. Thank you, everyone, for coming. We are at time, and I'm a very time-conscious person, so I'm going to cut this off. If you have any questions, please do not hesitate to reach out.
We will be posting this online. You can find us on LinkedIn. Please ask any questions that you may have. Thank you guys all for coming, and I'll see you guys next time. Thank you.
Meet the Speaker
Join the session for live Q&A with the speaker
Atindriyo Sanyal
Co-founder and CTO at Galileo
Prior to Galileo, Atindriyo was an Engineering Leader at Uber AI, responsible for various Machine Learning initiatives at the company. He was one of the architects of the world's first feature store (Michelangelo) and an early engineer on Siri at Apple, building the foundational technology and infrastructure that democratized ML at Apple.