Designing Retrieval Augmentation for Generative Pipelines with Haystack
Webinar
About this Session
In this talk, Tuana Çelik from deepset will share examples of retrieval augmentation in Haystack and how you might design your own pipelines for specific NLP use cases. We'll see how you can combine a vector search engine like Milvus with a RAG pipeline designed for your use case to get the best out of powerful LLMs.
Today I am pleased to introduce today's session, Designing Retrieval Augmentation for Generative Pipelines with Haystack, and our guest speaker, Tuana Çelik. Tuana is a developer advocate at deepset, where she focuses on the open source NLP framework for building LLM applications, Haystack. With a degree in computer science from the University of Bristol, she started her career as a software engineer, but then moved to the world of machine learning as a developer advocate, and now dedicates her time to helping the open source NLP community. Tuana is also joined by my colleague Yujian Tang, whose name I say a thousand times a week, and he is a developer advocate here at Zilliz.
Welcome, Tuana and Yujian. Okay. So today we are gonna be talking about building retrieval augmented pipelines at scale. My name is Yujian, and I'm joined by Tuana from deepset.
And I will actually let you introduce yourself first. All right. Hello everybody. Thank you for the introduction; Emily said it all. I'm one of the developer advocates at deepset, and I focus on our open source NLP framework for building large language model applications, called Haystack.
So today I'm mostly going to be talking about Haystack, and this is all of my social media information, so if you want to follow me, go ahead. And I think we can give it a start. Cool. The QR code there, by the way, links to Tuana's LinkedIn.
If you have your phone, you can scan it and go follow her, or connect. I'll give you 10 or 15 seconds to scan this, and then I'll move on to my slide. Okay. So this is me. My name is Yujian Tang.
I'm the developer advocate at Zilliz, and hopefully I will be joined by more developer advocates soon. That QR code links to my LinkedIn, so you can scan it and follow me or connect and send me messages. These are some of my links if you want to find me. My background is in machine learning, and I have been building a lot of these retrieval augmented generation pipelines recently.
So today's outline is gonna look like this. We're gonna start by addressing why you want to build a retrieval augmented generation pipeline. Then we're gonna talk about how you can use a vector database for your RAG app. Then we're gonna look at how you can build a RAG pipeline in Haystack. And then we'll cover some frequently asked questions.
These are just some questions that come up from time to time in our presentations. And then, at the end, you'll have your chance to ask questions as well. So, Tuana, take it away. All right. So we're going to do a bit of a back and forth, but I'll start with a quick recap as to why we talk about retrieval augmented generation, and to sort of depict the issue we try to solve with retrieval augmentation.
I have an example, and I'm sure a lot of you have already seen this UI before. This is actually a question I asked ChatGPT: what time is the talk about retrieval augmentation? And as you can see here, I was using GPT-3, and the reply is that it simply doesn't know. If you're lucky and you're using a model that can do this, you will get a reply that says "I don't know." But in a worst case scenario, you might also get a reply that is simply a hallucination. We'll be trying to solve this particular problem with retrieval augmented generation. And to give you a bit of insight as to why I asked that question, here's actually a screenshot of a meetup that myself and Yujian are doing this evening.
And you can see that at 6:25 there's a talk about retrieval augmentation by myself. So what we're going to try to solve is the following. Large language models do not know the answer to everything, and the idea behind retrieval augmentation is that we can help them by giving them the relevant context. Once that context is retrieved, and this is how I imagine the name came about, we augment the prompt, or the instruction.
We send these large language models the relevant context. So for my previous question, the relevant context might be this particular event's data: what is this event about, what time is each talk, et cetera. So what a retrieval augmented pipeline looks like, at the end of the day, is us sending a prompt like the following. We have an instruction, or a prompt, and we've said: given the context, please answer the question. If the answer is not contained within the context below, say "I don't know."
Now, this is just an example of a retrieval augmented prompt. What I've done here is I've just gone ahead and copy-pasted the event schedule for this evening, and then I have asked: what time is the talk about retrieval augmentation? And when I do this, I get a reasonable answer, and it's actually a correct answer: the talk about retrieval augmentation is scheduled for 6:25 to 6:55 PM. So what we're going to be talking about with Yujian today is how you can actually build a pipeline that will do this for you, that will achieve this end result for your interactions with large language models. And the way we're going to be filling in that context is going to be the retrieval step.
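To make that concrete, here's a minimal sketch of the kind of prompt string being described. The schedule text and question below are hypothetical stand-ins for whatever your retrieval step actually returns, not the exact slide content:

```python
# A minimal sketch of a retrieval-augmented prompt. The schedule lines are
# made-up placeholders for whatever context the retriever returns.
retrieved_context = (
    "6:00 PM - Welcome and introductions\n"
    "6:25 PM - Designing Retrieval Augmentation for Generative Pipelines\n"
    "6:55 PM - Vector databases at scale"
)
question = "What time is the talk about retrieval augmentation?"

prompt = (
    "Given the context, please answer the question. If the answer is not "
    "contained within the context below, say 'I don't know'.\n\n"
    f"Context: {retrieved_context}\n\n"
    f"Question: {question}\n\n"
    "Answer:"
)
print(prompt)  # this assembled string is what actually gets sent to the LLM
```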
I'm going to come back to talk about how you build these pipelines with Haystack, but before that, I'll hand over to Yujian to start talking about how vector databases play into this pipeline as well. Okay. So we're gonna talk a little bit about how RAG works, and then I'm gonna dive into the vector database details later on. The basic idea is that you have some sort of knowledge base that is not yet in ChatGPT or Llama or Vicuna or any of those big large language models, and you would take your knowledge base and put it into an embeddings model. An embeddings model is a deep learning model that is trained on the same type of data as the data you're trying to work with.
That embeddings model will produce a set of vectors as the output of the second-to-last layer. So in this little image here, you see there's this one node at the end. What actually happens when the embeddings are generated is that we cut off the node at the end and just take the output from the second-to-last layer, and that's what a vector is. And those vectors get stored in some sort of vector database like Milvus or Zilliz. And this is just the first step to RAG.
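Here's a rough sketch of that ingestion step, assuming a local Milvus instance and an off-the-shelf sentence-transformers model; the collection name, field names, and model choice are assumptions for illustration, not from the talk:

```python
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection
from sentence_transformers import SentenceTransformer

# Embed a tiny knowledge base and store the vectors in Milvus.
model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dimensional embeddings
docs = [
    "Milvus is an open source vector database.",
    "Haystack is an open source NLP framework for building LLM applications.",
]
embeddings = model.encode(docs)

connections.connect(host="localhost", port="19530")  # assumes Milvus running locally
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=1024),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384),
]
collection = Collection("knowledge_base", CollectionSchema(fields))
collection.insert([docs, embeddings.tolist()])  # columns in schema order (auto_id skipped)
collection.flush()
```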
This is just the step where you're putting your data into the application so that you can actually have a custom retrieval augmented generation app. I've been building a lot of these for the past few months, and typically what we see the stack looking like for retrieval augmented generation is something that we call CVP. The C in CVP stands for ChatGPT, or a ChatGPT-like LLM, or any other LLM, and this is typically the computational power behind these retrieval augmented generation apps.
This is the processor, if you were to think about it in terms of a computer. Then you have a vector database such as Milvus, and this can be interpreted as the storage block; this will be your memory, your ROM. And then there's prompt as code, which essentially augments the prompts to give the correct prompt to the LLM in order to get the response that you need. An example of something you can use for this is Haystack, which is mainly used for the orchestration of these pipelines. And this is an example architecture of an application that we built that uses the CVP stack.
The application that we built is called OSSChat. It is an open source software chatbot. It's at osschat.io; you can go check it out and ask it questions about open source software.
The way it works is, as we showed earlier: step one is you take your knowledge base, you chunk it up, you embed it, and you put it into the vector database. Then when the user comes and asks a question, the first thing we do is actually hit the vector database and say: is the answer in here? And if it is, then we can return that. If it isn't, then we may have to go ask the LLM for an answer, if it has one.
There are some caveats here, right? You wanna make sure that the questions you're asking the LLM make sense and that you're not getting some sort of hallucinated response back. And something that came out of OSSChat is this thing called GPTCache. We made it because we found that, as we were creating OSSChat, a lot of the queries we were giving the LLM came out to be the same, or semantically similar enough that the answer would be the same. So we decided to cache those responses, and that saved a lot of money, basically.
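GPTCache itself has its own API; the toy sketch below just illustrates the underlying idea of a semantic cache, with the similarity threshold and helper names being assumptions for illustration:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy semantic cache: if a new query is close enough to one we've already
# answered, reuse the cached response instead of calling the LLM again.
model = SentenceTransformer("all-MiniLM-L6-v2")
cache = []  # list of (query_embedding, response) pairs

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(query, call_llm, threshold=0.9):
    q_emb = model.encode(query)
    for emb, response in cache:
        if cosine(q_emb, emb) >= threshold:
            return response           # cache hit: no LLM call, no extra cost
    response = call_llm(query)        # cache miss: pay for one LLM call
    cache.append((q_emb, response))
    return response
```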
So how do you use a vector database for retrieval augmented generation? This is basically what I've been building for the past few months. Step one: why would you use a vector database? Basically, you use a vector database because you want to inject your own data on top of an LLM, into an LLM application, and you want to get relevant responses to what you're querying and asking about. And the LLM is not necessarily trained on the kind of data that you are using, or on your data. In fact, most of the time it would be weird if it were trained on your data, especially if you have private data.
And finally, the last reason, or maybe not the last reason, but a third reason why you would use a vector database for retrieval augmented generation is because the economics work out well. The other option you really have, if you want to inject your data into some sort of LLM, is to do some fine-tuning, and that is typically much more expensive than using a vector database and injecting your data that way. Okay, so what is a vector database? Essentially, a vector database is the database that you would use to store vector embeddings and do semantic similarity search.
So how does this work? You start with your unstructured data. Most of your data is typically unstructured: it could be images, videos, text documents, PDFs, audio, or something else like that. And you feed it into the right embeddings model. It's actually very important that you make sure you feed the right type of data into the right type of model.
You don't wanna feed image data into something like sentence transformers, and you don't want to feed sentences into something like ResNet-50. Once you have these vector embeddings, you store them in a vector database such as Milvus or Zilliz. And that's all you need to do to put your data into the vector database; that's all you need to do to put your data into the application. From there, what you're gonna do is the query step, right? For querying, you take the same kind of data that you're looking for.
So for example, if you're storing a bunch of image data that was vectorized using ResNet-50, you want to use the same model to get the vector embeddings from the input data you're looking to query with. Now, there are cases where you might want to use a different model, and it took me a while to figure out why I would ever use a different model for querying. It turns out there's this thing called graceful degradation, where you want to check knowns against unknowns, and that's when you would use a different model to create the vector embeddings for your query data and check against the vector database. So I'm gonna cover a very quick example of semantic similarity. This is gonna be a very, very simple example; there is no case in real life where you would ever use a two-dimensional vector.
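Continuing the earlier pymilvus sketch, the query step might look roughly like this: embed the question with the same model used at indexing time, then ask Milvus for the nearest stored vectors (the index parameters here are illustrative defaults, not recommendations from the talk):

```python
# Querying: embed the question with the same model used at indexing time,
# then search Milvus for the nearest stored vectors.
collection.create_index(  # a collection needs an index before it can be loaded
    field_name="embedding",
    index_params={"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}},
)
collection.load()

query_embedding = model.encode(["What is Milvus?"]).tolist()
results = collection.search(
    data=query_embedding,
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=3,
    output_fields=["text"],
)
for hit in results[0]:
    print(hit.distance, hit.entity.get("text"))
```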
And there are very few cases that we see, at least in terms of retrieval augmented generation, where you would use Euclidean distance. But this is a pretty famous paper about semantic similarity, and it illustrates the idea of semantic similarity with vectors very well. What I want you to get out of the next few slides is that vectors allow you to do math on things that aren't numbers. And in this example, it's gonna be math on words.
The example I'm gonna show here is queen minus woman plus man equals king. So let's take a look at these vectors: queen is (0.3, 0.9), king is (0.5, 0.7), woman is (0.3, 0.4), and man is (0.5, 0.2).
You can see that queen and woman have some semantic similarity because they have the same x value. And you can see that queen and king have a similar relationship to woman and man, because the y values differ by the same amount. You can also see that king and man have the same semantic similarity on the x axis, because they both have 0.5.
Okay? So what happens here? We take the word queen and subtract the word woman: (0.3, 0.9) minus (0.3, 0.4) gives us (0, 0.5). While this doesn't necessarily represent anything on our graph, we can guess that it probably represents something like the concept of royalty, because what is the difference between a queen and a regular person? The royal status, right? And then from there, what we want to do is add the word man. So (0, 0.5) plus (0.5, 0.2) gives you (0.5, 0.7), which maps directly to king. So that's the example: a very simple toy example with two-dimensional vectors, queen minus woman plus man equals king.
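The same toy arithmetic, written out as a quick sketch:

```python
import numpy as np

# The toy word-vector arithmetic from the slides.
queen = np.array([0.3, 0.9])
king  = np.array([0.5, 0.7])
woman = np.array([0.3, 0.4])
man   = np.array([0.5, 0.2])

royalty = queen - woman           # -> [0.0, 0.5], roughly "the concept of royalty"
result = royalty + man            # -> [0.5, 0.7]
print(np.allclose(result, king))  # True: queen - woman + man == king
```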
So what does a vector database look like? A vector database is basically a purpose-built database for doing vector search. And the things that are very specific to vector search are: number one, you're gonna be doing a lot of computations. Vector similarity requires you to do computations in order to compare the vectors, right? And you need a purpose-built vector database for this, because a regular SQL database doesn't do a lot of computations. It is very rare, I don't think this ever happens actually, that you go into a SQL database and say: give me 768 calculations right now.
It's not gonna happen, right? Typically you're gonna be doing maybe a bitmask, maybe some joins, maybe some comparisons, but that's about it. So that's why you need something like a vector database. In vector search, and in search in general, there are three separations of concerns. One is indexing, which is how you create the way that you search your database. One is data, which is data ingestion: how do you get the data into your database? And the last one is querying: how do I actually ask the database for my results? And so Milvus has a separation of concerns with the query, data, and index nodes.
And this is pretty unique to us, as far as we know. The index node takes care of the index. There are many different types of index that Milvus offers; we have a bunch of resources on that, and we can drop some links below. Some of the indexes that Milvus offers include Hierarchical Navigable Small Worlds, or HNSW, and the inverted file index, also known as IVF.
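For reference, creating one of those indexes on the collection from the earlier sketch might look roughly like this (you would pick one index type per vector field; the parameter values are illustrative, not recommendations from the talk):

```python
# Creating an HNSW index on the vector field; parameter values are illustrative.
collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "HNSW",
        "metric_type": "L2",
        "params": {"M": 16, "efConstruction": 200},
    },
)
# An IVF-style alternative would look like:
# {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}}
```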
The data node is used to ingest the data. It reads the messages from the write-ahead log, takes the data, compacts it, calls the index node when it's time to index, and puts it into a sort of permanent object storage. And then the query node is used at query time, so when you need to query your data, you use the query node.
Typically, the query node needs to be able to access both the permanent storage and the write-ahead log, so it knows what's going on right now and what's already been stored. And the reason for this is that Milvus has this really interesting architecture in the way the data is stored. The data is stored in 512 megabyte chunks: every time one of these 512 megabytes comes along, we cut off a segment, 512 megabytes at a time, and we create the index over that 512 megabyte segment. That index doesn't change, and that segment doesn't get changed unless you force a re-index. And this is very, very good for scaling, because let's just imagine for a second: would you rather query a hundred gigabytes, or run 200 parallel queries on 512 megabytes each? What's gonna give you a faster response? It's pretty obvious that running the parallel queries will give you a faster response.
The other nice thing about these nodes being separated is that you can scale them up and down as you need. You don't need to scale up your nodes just because, hey, we're querying a lot but we never ingest any data. It's nice, and also economical, to be able to say: we're just gonna scale the data node when we're ingesting data, and we're just gonna scale out the query node when we're doing a lot of queries. So that's what a vector database looks like; I'm using Milvus as the architecture here.
So yes, they're purpose-built to be able to query a lot of vectors and do a lot of calculations. So now we're gonna talk about building a RAG pipeline with Haystack, and I'm gonna hand it back over to Tuana. All right. Okay, so before I start going through the pipeline itself, I do have to talk about Haystack. Haystack is a fully open source framework built in Python, designed to build large language model applications meant for production.
But not only that; Haystack also covers core NLP tasks, some of which Yujian also mentioned. In a lot of cases, if you do want to build an application that leverages large language models, it goes beyond just what you build with the large language model. There's a lot of data preparation to do, and you have to index into your database of choice. And then you obviously have the option of what kind of application you want to build. So here on the right-hand side, I have one of the classic examples of two pipelines you might build with Haystack.
One is called the indexing pipeline, which Yujian already mentioned. This is where you prepare your files and pre-process them, getting them into the shape you need to be able to use them in your LLM pipelines. And the second is the query pipeline. So I'm gonna start with the document store. In Haystack, you have the concept of document stores, and you have many options for different databases, Milvus being one of them.
The second concept is the indexing pipeline, which we've already sort of talked about, but here you have a bunch of options for different file converters and pre-processing. This is important because, depending on what kind of language model you want to use, you might want to chunk your data into different lengths, depending on what the context limit of that large language model is, and so on. And the final pipeline is probably the most interesting one, and the one we'll talk about particularly today: an example of a retrieval augmented generative pipeline. The idea behind this is that the pipeline consists of a retriever, followed by what in Haystack we call a prompt node, and the retriever and prompt node act together to, at the end of the day, produce an instruction that we can send to a large language model, one it can work with and give us the correct answer to.
And here you get a lot of options. For both the retriever and the prompt node, you can make your choice as a developer about what model you want to use to do your embedding search, or what model you want to use to do your answer generation. If you scan the QR code here, it should take you to the GitHub page for Haystack, so if you're interested, please go and have a look. And let's start talking about what components help us build a RAG pipeline in Haystack.
The two most important components that you will modify and customize for your own use case in a RAG pipeline are the prompt template and the prompt node. So I wanna start by talking about the prompt template. The prompt template is basically a blueprint for how you interact with a large language model. At the end of the day, prompt templates generate prompts, but they can be modified, and they can be modified at query time. The prompt node, on the other hand: these are the nodes that interact with the large language model itself.
They're basically the interface between your application and the large language model. They use the prompt templates to understand how to interact with those large language models, but they are essentially the node that sends out the query and receives the response from any model you choose to use. Again, you have the choice of what models you want to use: OpenAI models are one example, but you can also use open source ones such as Falcon and MPT, and Llama is a new addition as well. So at the end of the day, I just want to come back to this slide. What we want to get to is, instead of our query to the large language model being this, a scenario where we might be able to build something like this.
The difficult part is: okay, but how do we fill this context? If we have millions and millions of documents, how do we find the right context? That is where the retrieval augmentation bit comes into play. So let's start by looking at prompt templating. This is where we instruct the LLM on what to do with the provided information. One example is that you can build your own custom prompt template. Here's a code snippet.
There will be some code snippets following this; we're going to actually end up building a RAG pipeline in Haystack. Here I've built a prompt template called rag-question-answering. You can see the prompt is: given the context, please answer the question. If the answer is not contained within the context below, say "I don't know."
And then we have context and question, but the part I want to highlight here is the things you see in curly brackets; I've highlighted them in blue. Here we have the context with {join(documents)} in curly brackets, and then the question with {query} in curly brackets. This means that the parameters documents and query can get filled in at query time. And the neat part is that the documents parameter is by default set to receive anything from a retriever that might be preceding it.
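A rough sketch of that template in Haystack 1.x style follows; the exact constructor arguments vary a bit between Haystack versions, and the variable name is just how I'm referring to the template described in the talk:

```python
from haystack.nodes import PromptTemplate

# Sketch of the prompt template described above (Haystack 1.x style).
rag_question_answering = PromptTemplate(
    prompt="""Given the context please answer the question. If the answer is
not contained within the context below, say 'I don't know'.

Context: {join(documents)}

Question: {query}

Answer:""",
)
```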
The next thing you can do is use the prompt node and define the prompt template you just created as its default prompt template. So again, this is going to be our blueprint; it's going to be the prompt node's blueprint as to how it should interact with an LLM. Here you can see that I've made the choice to use GPT-3, and actually I made that choice because it was the same model I was using in the ChatGPT screenshots I showed you before.
Obviously you're going to need your API key, and I set my default prompt template to rag-question-answering, which was the prompt template I just created. The other option, in terms of prompts, is to use some of the prompts that we've made publicly available on our new Prompt Hub. If you scan this QR code, it'll take you there. This is quite a neat way of making use of already available prompts without having to design one yourself. And you can set the default prompt template to one of the names of the prompts we have available on Prompt Hub.
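A sketch of the prompt node, assuming text-davinci-003 as the GPT-3 model (the talk only says "GPT-3", so the exact model name is an assumption):

```python
from haystack.nodes import PromptNode

# Sketch of a prompt node using the template above with GPT-3.
prompt_node = PromptNode(
    model_name_or_path="text-davinci-003",   # assumed GPT-3 model name
    api_key="YOUR_OPENAI_API_KEY",
    default_prompt_template=rag_question_answering,
)
# Alternatively, point default_prompt_template at a Prompt Hub name,
# e.g. the "deepset/question-answering" prompt mentioned below.
```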
One example, and I use this one pretty frequently, is called deepset/question-answering. This prompt is already designed to do retrieval augmented question answering. Alright, so we've got our prompt template and we've got our prompt node: how do we actually build a RAG pipeline? Here is where we start talking about retrieval, and this is where we also start caring about what kind of database we use. In this example, I'm using the Milvus document store. The Milvus document store is one of our integrated document stores.
You can have a look at the Haystack integrations page to see all the rest as well, but it's super easy to get going. You just say: document store is MilvusDocumentStore. This sets it up with all of the default parameters if you have Milvus running locally, and it means whatever documents I have in the local Milvus database, I now have access to. I'm not going through the indexing pipeline here, but you could have also used that to prepare your data and put it into the Milvus document store.
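A minimal sketch of that setup; note that depending on your Haystack version, MilvusDocumentStore may be imported from haystack.document_stores or from the separate milvus-haystack integration package:

```python
# Sketch: connect to a local Milvus instance as a Haystack document store.
# The import path depends on your Haystack version / integration package.
from milvus_haystack import MilvusDocumentStore

document_store = MilvusDocumentStore()  # default parameters assume Milvus running locally
```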
Next is the retriever. Again, here you have a bunch of options. A retriever in Haystack, and I'm using an embedding retriever here, is essentially the model you have decided to use to query your vector database. And I also saw a question come in about this, one of the first questions in the Q&A.
And yes: what this retriever node will do is, once it has a query from the user, create a vector representation of that query and then compare it to all of our documents in the Milvus document store. That's why I also have to tell the embedding retriever what document store to work on. And then the final node is the prompt node. Here I'm just going to use one of the publicly available prompts from deepset, and again, I'm using GPT-3.
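A sketch of those two components; the embedding model name is an assumption (any sentence-transformers model that matches how your documents were embedded would do), and "deepset/question-answering" is the Prompt Hub prompt mentioned earlier:

```python
from haystack.nodes import EmbeddingRetriever, PromptNode

# Sketch of the retriever pointed at the Milvus document store.
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",  # assumed model
)

# Prompt node using a publicly available deepset prompt, with GPT-3.
prompt_node = PromptNode(
    model_name_or_path="text-davinci-003",
    api_key="YOUR_OPENAI_API_KEY",
    default_prompt_template="deepset/question-answering",
)
```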
So I've got all of my components ready, and now I want to actually build out this pipeline. What I want is that when a user asks a query, the first component in my pipeline is that embedding retriever I just created. That embedding retriever will then query the documents I have in the Milvus document store, and then I will embed the retrieved documents into the prompt for the prompt node to query the large language model. How does that look? I think there's a bit of a delay on Zoom, but I'll start talking about it already.
Next, you build a RAG pipeline. Could someone tell us in the chat whether you see the "build a RAG pipeline" slide? Because we see it on our side, but we don't know if you see it on your side. We do. Oh, okay. Great.
So next we're building a RAG pipeline. Here, what we do is first of all initialize the pipeline. Then, like I said, the first node I want to add is the retriever, and I set the input to Query. In Haystack, by default, you have two input options that are kind of special: one is Query, and the other is File. File is used when you want to build an indexing pipeline.
But because I want to build a querying pipeline, the input I'm using is Query. Then I simply add a second node, called prompt node, and tell the prompt node that its input is the retriever. So any documents I get out of the retriever are now going to be embedded into the prompt that the prompt node is using. And then I can simply run this pipeline with the same query to achieve the retrieval augmented prompt we saw initially. The nice part is the following: as you can see, not only do I have query in this run function, I also have something called params, and these parameters can be provided for each query separately.
And here I've decided to give the top_k parameter to my retriever node. This is important because, depending on what model I've decided to use and how lengthy each of my documents in my vector database is, I risk hitting the context limit of the model I've chosen. This is just an example where I've said: okay, my retriever, I want to retrieve only the top five most relevant documents from my database to give to the large language model. If I were using a model with less capability, I might decide to reduce that; if I'm using a model with higher capability, I might decide to increase it. Each has advantages and disadvantages.
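Putting the pieces together, the pipeline assembly and run call being described look roughly like this (the node names and question are illustrative):

```python
from haystack import Pipeline

# Wire the retriever and prompt node into a querying pipeline.
pipe = Pipeline()
pipe.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipe.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])

result = pipe.run(
    query="What time is the talk about retrieval augmentation?",
    params={"Retriever": {"top_k": 5}},  # hand only the 5 most relevant documents to the LLM
)
print(result["results"])
```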
The advantage of a higher top_k is that maybe you're giving more relevant context to the large language model to produce an answer with, but the disadvantage is that you are then hitting the limit of what the model can actually consume. I want to end with an example application where you can see this actually working. If you scan that QR code, it will take you to a Hugging Face Space. It's a public Space, and the demo we've created is a retrieval augmentation demo; I think this is using GPT-3, and we're just showcasing what the answers are when you do retrieval augmentation versus when you don't. So you'll see two answers show up: an answer with plain GPT and an answer with retrieval augmented GPT.
Not only that, but you can also try out retrieval augmentation with a static news dataset. So what we have is a database that we connect to with preloaded documents in it, and we are not changing it. But some cool things you can do: instead of using a database, you might simply want to do web search, so you can also see what retrieval augmented generation looks like with web search as well. And in this demo, we uploaded all of the news articles about SVB, back when that happened.
And you can use either the dataset we have at hand in this Space, or do web search. And that is it: this is how you build a retrieval augmented pipeline with Haystack. There's really not much to it, but there's a lot of creativity you can add, especially when you're creating your prompt templates. That's where you can even decide not to make it question answering, but summarization or anything else. It's a very creative process to come up with your own prompts.
All right, cool. Thank you for that; that was a great explanation. Also, this is a very funny choice. Okay, I'm gonna cover some FAQs that we get, and then we will jump into the questions from the audience.
So some of the FAQs that I get asked are: when do you not want to use a vector database, or when do you not want to do retrieval augmented generation? My response is typically: if you have data that is just key-value stores and you don't need to compare similarity, you don't need a vector database, and that's not what you should be using in your retrieval augmented generation app. If you need to compare how semantically similar your input data or the data you're storing is, then you would use a vector database for your retrieval augmented generation. Another question I get a lot is about vector embeddings for CSV files and PDFs. One of the most important things for retrieval augmented generation is that you actually want to retrieve relevant data, and the most important factor in whether you're gonna get data with the right relevance is your embeddings model.
So when you are working with CSVs and PDFs, you need to really ensure you're using the right embeddings models for them. I actually came across a use case recently where someone had put CSV data into their vector database, but they did not use a CSV-appropriate embeddings model to embed it, so they weren't getting good results, because they used sentence transformers. Another thing we get questions about is hybrid search, which is basically: we wanna search structured and unstructured data, and can you do that with a vector database? The answer is yes; Milvus and Zilliz store metadata. Maybe you can't do this with everything, and you probably can't do this with a vector search library, but you can do this with a vector database, and that can also help you enhance your retrieval augmented generation app.
Okay, so let's get into Q&A. "I have a question regarding vector similarity search." Oh, by the way, this QR code is for a vector database benchmarking tool. "I have a question regarding vector similarity search. My understanding is that the first step in retrieval augmentation involves embedding the query into the same vector space as the document vectors in the database."
"Essentially, this implies that the embeddings of the query should be semantically close to the embeddings of the potential answer. Is this assumption consistently correct?" So, yes and no. Okay, this is actually kind of complicated. Typically, the idea is that your questions, your queries, are going to be semantically close to your answers, because you're gonna be asking about something, and the vector database will search for something that is semantically similar to the question you ask.
Now, one way we've seen this enhanced, and this is actually kind of what I was talking about with GPTCache in OSSChat, is that you can enhance this by actually querying against the question space. What you can do is take your documents and go to ChatGPT or some other LLM and say: hey, what are some questions that these documents can answer? Then you have a question space, and you can also query against that. So the answer is yes, your queries will probably be semantically close to your answers, and also no, because you can probably get better results by making a question space. I hope that makes sense. Okay, I think I can take this one.
Okay. This question is about how to overcome RAG vector DB limitations, if there are limitations: are any changes or updates to large language models required, or is re-indexing everything in the vector DB needed? Do you need the exact same LLM for querying? Can we change the large language model, or is it not allowed because it's defined within the vector DB? So actually, for the RAG pipeline, we're talking about two separate models most of the time. The large language model at the end, the one we use to generate answers, can be completely unrelated to the model you use to index and query your data in a vector database. So if you've started using ChatGPT or GPT-3, but you don't like the performance and you want to switch to Llama, that is totally fine. This is not something you have to worry about in terms of your data in a vector database.
Now, when it comes to indexing into your vector database: yes, you pick a model, and it has a certain number of dimensions it's going to create for each vector. Then for your retrieval step, it's a good idea to use the same model, because you want to create a vector representation of your query that has the same number of dimensions. And that you can set up just for your embedding retriever step. There are some scenarios, and some retrievers, that do fancy things, where they use one embedding model to embed your documents, but a separate, similar model for your query. Again, though, they end up with the same number of dimensions.
So this is the only step where you should pay attention to what model was used to index my documents into a vector DB versus to query. There's also a third question hidden in this one, about what happens when you need to update your data. This does not mean you have to retrain or do anything to the large language model you use for querying, but this is why we have in Haystack a separate pipeline that I haven't gone through, called an indexing pipeline. Whenever you have a new file, you can just run that file through your indexing pipeline, and it will embed it into your vector database the same way all of your other data is embedded. I also wanna touch really quickly on this as well.
Any changes or updates to the, I'm gonna assume what you mean by the LLM here is the embeddings model: do any changes or updates require re-indexing everything in the vector database? That is actually gonna be up to you and the way you implement it. It may, and it may not. You can actually take a look at what your performance looks like. This is what I mentioned earlier: you may want to check that through this process called graceful degradation, where you're checking knowns versus unknowns.
Okay. Do vector databases have any limitations? For example, are there limits on how much data a database can index at a single time? Well, do vector databases have limitations? The answer is yes: they are software, and all software has limitations. Milvus pretty much only indexes 512 megabytes at a time, because it chunks data into 512 megabyte segments. The larger the data you need to index, the longer it's gonna take.
And so we typically say: hey, we're only gonna index this size of data, because it's more efficient, both in terms of performance and in terms of financial value. So yeah, if there are other questions in there about limitations that I may not have gotten to, feel free to follow up. There's a question about: can you update a particular document in Milvus without re-indexing the whole database, meaning deleting the embeddings of a document? I can answer this in Haystack terms. Sure, yeah.
And I can talk about the Milvus side, so you can go first. In Haystack terms, yes, you can do this with all of the document stores we have. Again, this refers to the indexing pipeline that I didn't really go into detail about, but the indexing pipeline, similar to the query pipeline you saw, is a pipeline you create, and it's always there. And if there's new data you want to add to your document store, then you simply run the indexing pipeline with that new file.
And you don't have to delete anything; it will just write the new file to your already existing Milvus database. There are also options, and this then depends on the database you use for the document store, to update existing documents. So let's say, and we see this a lot with documentation sites: they're not new pages, the pages change. In this case, you can flip a flag.
That means if the page is essentially a duplicate, just overwrite it rather than deleting it and reintroducing it. Yeah, Milvus has a similar kind of functionality. There's this concept called upserting, which is basically deleting and replacing a document, and you don't need to re-index your entire database for it; that would be ridiculous. Actually, in Milvus, one of the things I think is really cool is that you almost never need to re-index, because the indexes are only on very small segments.
Unless you're deleting an entire set of data, you really have almost no reason to re-index. "Also, Unstructured IO seems interesting to prepare the documents before being ingested into Haystack and Milvus; what are your opinions on that? Thanks." Do you want to talk about that? We've just started looking into this, so it's very new to me as well, so I'm not gonna give you any solid answer, but yeah, we're all looking into it. I've got something for you.
You wanna pay attention to what's going on on my LinkedIn in the next couple of weeks, because I'm gonna be releasing something about how you can use Unstructured IO with Milvus. So enjoy, and be on the lookout. "Does Milvus support hybrid search, meaning combining vector search with something like BM25? That is a real help when working with domain-specific texts." So we do not natively support BM25 hybrid search. However, if you store your text chunks as metadata inside of Milvus, you can do keyword search on your metadata, and that is essentially the same thing. So the answer is yes: you just store the chunks of your text in the metadata and filter through that.
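As a rough sketch of that pattern, a vector search can be combined with a boolean filter expression over scalar fields; the field names below (text, source) and the exact expression syntax supported depend on your collection schema and Milvus version, so treat this as illustrative:

```python
# Sketch: vector search constrained by a scalar/metadata filter expression.
# Assumes VARCHAR fields named "text" and "source" exist on the collection.
results = collection.search(
    data=query_embedding,
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=5,
    expr='source == "docs"',          # scalar filter applied alongside the vector search
    output_fields=["text", "source"],
)
```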
And the filtering is really cool too. The filtering we do uses a bitmask, so it's once again computationally inexpensive. Cool. Let's see what else is in the chat. The QR code links to the source code. Okay.
"One challenge I found was that for enterprises, search costs can become prohibitive if prompts are sent to OpenAI. What alternatives? Transformers maybe?" Yes, you can use open source models.
This is also why we made GPTCache. So I would suggest checking out GPTCache; basically, we had the same problem and we were like: hey, this is ridiculous, we're paying a lot of money for this. That's why we made GPTCache. It's a semantic cache that says: hey, if we've seen this query before, maybe just return the cached response. "My customers say no to OpenAI for fear of security violations and confidentiality."
"So how can search or RAG be done without an LLM API such as OpenAI?" I think we are getting more and more open source options out there. Yeah, so that is definitely an option. All of this that I showed, the pipeline that I showed, is also stuff we've tried with Falcon 40B and Llama, for example.
So those are definitely options. And to the first point, when you're talking about costs, that's where a lot of the pre-processing steps can come in handy. For example, you might want to experiment with how you chunk your data. So when you do retrieve the relevant documents, is it the whole 500-line file you are putting in there, or the most relevant five sentences? And how many of them are you putting in there? Okay, so back to the Q&A.
"So you store the text two times?" I think this is the follow-up to the BM25 question. The answer is no: you store the text once, and you store the vector embedding once. I'm not actually sure what you're asking here, but you don't need to store the text twice; you just need to store it once. If you want, you can also store the text in a SQL database.
I'm not gonna say you can't do that, and you can also then reference that and query it if you would like; that is up to you. I suggest you just store it in the metadata and use bitmask filtering, because it's fast and it's easy. "What happens when I need to update my embeddings because the source data was updated?"
I think there's actually some nuance to this question. And I think it comes back to the indexing pipeline side of things. Yeah. When you have an update to your source data, this is where you look into reusing the same indexing pipeline you had originally.
And again, I think Milvus, as you also said, has this capability, so it would work alongside Haystack. There are options not just to write documents, but to update documents, so that if it's the same file with slight changes, you are not duplicating it, but simply updating it. And that's a simple function call. Yep.
My thought on this is: it depends on how much your source data was updated and what you really need to do with it. If you're changing your entire database, then yeah, you probably wanna re-embed things. But if it's small updates, I think you should just upsert, and that's fine. "What are some options for image, CSV, PDF, and video embeddings?" For images, my suggestion is always ResNet-50. CSVs, PDFs, videos: this is where someone actually mentioned the thing about Unstructured IO; they're doing something with PDFs, and that's actually what I'm working with them on. Videos are a little bit more tough.
CSV is a little more tough; no one really works with CSV data that much. What I suggest for people working with CSV data who want to do retrieval augmented generation is: turn your CSV data into complete sentences, and that should always be possible with whatever kind of data you're working with. Then you take those complete sentences, embed them using a model like sentence transformers, and then you can query against that.
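A small sketch of that CSV-to-sentences idea; the file name and column names are hypothetical:

```python
import csv
from sentence_transformers import SentenceTransformer

# Turn CSV rows into complete sentences before embedding, as suggested above.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = []
with open("products.csv", newline="") as f:          # hypothetical file
    for row in csv.DictReader(f):
        sentences.append(
            f"The product {row['name']} costs {row['price']} dollars "
            f"and belongs to the {row['category']} category."
        )

embeddings = model.encode(sentences)  # these vectors then go into the vector database
```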
Cool. So let's see. Yes, we'll keep this open for a few more questions. If you have another question, try to type it in the next 30 seconds to a minute, and we'll try to get to it. Oh, so this comment: open source LLMs are still not as good as... oh, right.
Open source LLMs are not as capable as GPT. Yeah, I agree; right now, from my experience, this has been the case. But I've noticed that this is the case for, like you said, agents or very complex applications with large language models. In a lot of cases where you simply want to answer questions accurately, the open source models are also quite good. Another question: any general tips for best RAG performance when indexing and parsing source data? Yeah, there's this thing called context-aware chunking, or smart chunking, that people are talking about a lot right now, which is: how do you chunk your data into sizable chunks that make sense, that retain enough context that you can understand what the data is about, and also retain enough context that you can link the chunks together? And this is not a solved problem; nobody has solved it yet.
And it is tough, but typically what I see for chunking is people doing 512-character chunks with 25-character overlaps. Do you want to say anything about this, Tuana? General tips for best RAG performance when indexing and parsing source data. This is an interesting one in terms of Haystack, because a lot of the time an indexing pipeline is not the type of pipeline you are continuously thinking about. In a lot of the cases we see, the indexing pipeline runs once, you have your data, and then you have your RAG pipeline on top of it. When it comes to best practices for performance, my experience, especially when you're talking about the pre-processing step, is to really think out what length you want your chunks to be.
And it all comes back to the more economic side of things: is it going to provide enough context for the large language model to understand what we're talking about, and is it going to be short enough that we're still within the limits of the context size, or the amount of money we want to spend if we're using a closed source model? Cool. I haven't tried Llama 2 yet either. I have. Oh, by the way, I don't know if you guys have used Claude, but Claude is interesting in terms of writing capabilities. I've talked to some people about this, and it seems that Claude does not edit you as much as GPT will, so you should check that out and see what you think about it.
I think we're out of questions, and we are also about to run out of time, so I'm gonna pass this back to Emily. Thank you so much, Yujian, and thank you, Tuana, for joining us on this webinar. It was really great. And thank you to the audience for all of the really wonderful questions.
We hope to see everyone at a future event. And for those of you in the San Francisco area, Tuana's gonna be at the Unstructured Data Meetup San Francisco tonight, so you can come to the Zilliz website and check out more details about that, or reach out to us on our social channels. So thank you so much, everyone, and we will see you next time.
Meet the Speaker
Tuana Çelik
Lead Developer Advocate, Deepset
Tuana is a developer advocate at deepset, where she focuses on the open source NLP framework to build LLM applications: Haystack. With a degree in Computer Science from the University of Bristol, she started her career as a software engineer but then returned to the world of machine learning as a developer advocate and now dedicates her time to helping the open source NLP community.