Webinar
Unlocking Advanced RAG: Citations and Attributions
About this Session
Large language models (LLMs) have allowed us to easily “chat” with an AI. One popular and performant approach is retrieval augmented generation (RAG). RAG apps work by injecting your data on top of an LLM, using a vector database like Milvus or Zilliz Cloud.
On top of simple RAG, attributions and citations are commonly requested enhancements. Attributions and citations explain where in the corpus your answers come from. In this session, we will look at how to add attributions and citations to a RAG app using Milvus and LlamaIndex.
Topics Covered:
- Why do retrieval augmented generation with citations?
- How do you build RAG with citations?
- What does this mean for your LLM apps?
Let me introduce today's session, Unlocking Advanced RAG: Citations and Attributions, and our guest speaker, my colleague Yujian Tang. Yujian is a developer advocate here at Zilliz. He has a background as a software engineer working on AutoML at Amazon. He studied computer science, statistics, and neuroscience, with research papers published to conferences including IEEE Big Data. He enjoys bubble tea, spending time with his family, and being near water. Thanks for joining us today, Yujian. Welcome.
Hi guys. Thanks for that introduction, Emily. I'm really excited to talk with everyone today about citations and attributions with RAG. This is something I've been seeing a lot of chatter about in the last few months, so building something around it has been really exciting.
My name is Yujian Tang. I'm a developer advocate at Zilliz, as Emily said. I've put a QR code up here; if you scan it, it takes you to my LinkedIn and you can connect with me there. My background is in machine learning. I worked with computer vision and natural language processing before coming to Zilliz.
Most of what I do here is focus on building retrieval augmented generation applications. So in addition to citations and attributions, if you have questions about LLM apps in general, please feel free to drop them in the Q&A. Today we're gonna cover just a couple of things about citations. We're gonna start with why citations matter, to motivate this use case of citations and attributions.
Then we're gonna go into how you can build a citation engine. In that section I'm gonna cover the process of a citation engine, and the section on what goes into a citation engine covers the pieces of the citation engine. Then I have some FAQs at the end that are mostly about vector databases. That's something we can choose to go into or not, because after section three I also have a code example that we're going to walk through, so we can understand what the code looks like for a proof of concept of a citation engine. So, section one: why do citations matter? This is the first thing you need to know.
Basically, citations are important because ChatGPT, or any other large language model, has hallucination problems. If you have been paying attention to the news, you've probably seen some interesting stories about the use of ChatGPT inside or outside of industry, in academia or even in court. Earlier this year, I believe in July or August, a lawyer used ChatGPT in court and cited a fake case. This headline says a judge is considering sanctions, but I'm pretty sure I saw recently that they are already in trouble for it. People have also been using ChatGPT a lot in academia.
A lot of people, particularly undergrads I guess, have been using ChatGPT to write their articles and cite articles. It just so happens that if you go to ChatGPT and tell it, "Hey, I need 10 articles about, let's say, ancient Rome," it'll give you some articles, and some of them will be real and some of them will be made up. So this is a problem. The reason large language models have this hallucination problem, where they make things up, is that they are neural networks, and neural networks are really just advanced statistical methods. So in order to understand what's going on behind the scenes, we're gonna stop here before we go into citations and take a dive into neural networks, so we can understand what is going on and why they have this hallucination problem.
So let's start by introducing a basic neural network. This has been around since the 1960s or 1970s or something like that. What I want you to get out of this is that you're gonna take some set of inputs. In this case, we have three inputs, and these are usually numbers.
So the input is like a three-dimensional vector, okay? Then it's gonna get mapped somehow. There are some transformations, some functions happening, and so on. And then you get an output. Your output is usually some sort of classification. It is not always a classification.
It could also be another kind of prediction, but typically neural networks do classification. So basically all you need to know about the basic neural network is that it takes some sort of numbers as input, does some sort of processing, and then gives you something as an output. As we moved further into the study of neural networks, what we found is that particular types of neural networks work really well for text, and other types work really well for images, for graphs, for tabular data, and things like that. So different types of neural networks work better for different types of data.
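The basic network described here (three numbers in, some transformations, a classification out) can be sketched in a few lines of plain Python. This is a minimal illustration, not anything shown in the talk: the layer sizes, the sigmoid activation, and all weight values are made up for the example, whereas a real network learns its weights from data.

```python
import math

def sigmoid(x: float) -> float:
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum per output unit, then a sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Made-up weights for illustration only.
HIDDEN_W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]  # 3 inputs -> 2 hidden units
HIDDEN_B = [0.0, 0.1]
OUTPUT_W = [[1.0, -1.0], [-1.0, 1.0]]            # 2 hidden units -> 2 classes
OUTPUT_B = [0.0, 0.0]

def classify(inputs):
    """Map a 3-dimensional input vector to (predicted class index, class scores)."""
    hidden = dense(inputs, HIDDEN_W, HIDDEN_B)
    scores = dense(hidden, OUTPUT_W, OUTPUT_B)
    return scores.index(max(scores)), scores

label, scores = classify([1.0, 2.0, 3.0])
```

The point to notice is that the output is always computed from the numbers flowing through the layers; nothing in the network looks anything up.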
For text data specifically, we have recurrent neural networks. The reason they work well for text data is that they allow you to keep track of input over time, so they allow you to keep track of some sort of context. This image here is the typical architectural diagram of a recurrent neural network. What I want you to get out of this image is that each input-output combination is looped back on itself.
That is how it keeps track of sequence, and that's how it keeps track of words or tokens or inputs over time. In this diagram we see x(t-1), x(t), and x(t+1). What this is trying to show you is that at input time t, you still have some information from your most recent input. So essentially this allows you to create some sort of sliding window of context along your input. However, this comes with multiple problems, and there are two that we primarily look to solve with the next architecture, the transformer architecture. The first is that with this kind of architecture you're gonna lose context over time, right? There's a sliding window.
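To make the "looped back on itself" idea and the loss of context concrete, here is a deliberately tiny recurrent step. It is a sketch under strong assumptions: the state is a single number rather than a vector, and the weights `w_x` and `w_h` are arbitrary values chosen for illustration.

```python
import math

def rnn_step(x_t, h_prev, w_x=0.5, w_h=0.8):
    """One recurrent step: the new state mixes the current input with the previous state."""
    return math.tanh(w_x * x_t + w_h * h_prev)

def run_rnn(inputs):
    """Feed inputs one at a time, carrying the state forward between steps."""
    h = 0.0
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h)
        states.append(h)
    return states

# A single non-zero input followed by zeros: its trace in the state fades step by step.
states = run_rnn([1.0, 0.0, 0.0, 0.0, 0.0])
```

Each later state still carries some information about the first input, but less of it every step, which is the fading context the speaker describes.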
In addition, you have to have a sequence-to-sequence transformation where the sequences are the same length. Transformers allow you to hit pause on that sequence-to-sequence translation, as well as hit pause on your context and save that context into something that we call a hidden state. So transformers are composed of an encoder, a decoder, a hidden state, and an additional input. This additional input is typically what we call self-attention. This is a matrix, and the hidden state is also a matrix, right? So it's just a bunch of vectors, and the additional input, self-attention, is also just a bunch of vectors.
Just think about it that way. So essentially what the encoder does is say, "Hey, here's this input," you know, the text input, "and we're gonna run it through some sort of calculations and produce a hidden vector that tells you about the current context of the input, the current token of the input, where it is, and the current state of the input." So basically, it keeps track of the entire state of what's going on with the input.
It does this over time, and it allows you to essentially do both global and feed-forward context. Then there are decoders. What decoders do is take this hidden state that you have, which is the context of your input. That includes your tokens, where you are, and things like that. It also takes your additional input, which is the self-attention matrix, puts these together, and then produces an output.
Because it takes into context where your token is, this output is typically the answer to "what is the next token?" And that is exactly what GPT does. GPT is actually a decoder-only architecture. There is self-attention, there is the full feed-forward neural network, like this thing, but what it actually is, is a bunch of decoder blocks. So essentially what it's doing is taking the state and just transforming the state until it produces the next word.
Take this example, "the chicken walked across." What actually happens is GPT will say, okay, "the chicken walked across the road" is a sentence. What I want you to take away from this is that GPT takes advantage of knowing both where the next token is and the current state, the current context of the input, and it uses that to create the next token or the next output. So the reason ChatGPT hallucinates is that it's set up to predict a series of words or tokens. It is not a direct match.
It is not a fuzzy match, it is not a search engine. It is an advanced statistical method that is set up to predict a series of words. Okay? So now that we kind of understand what's going on with ChatGPT, how do we get out of this? How do we deal with this hallucination? There are a few pieces to this. One of those pieces is that you need a way to verify or validate your knowledge. Where is this data coming from? That's where citations come in, right? Citations allow you to say, "Hey, this is where I found this data."
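The earlier point that a language model predicts the next token rather than looking anything up can be illustrated with a toy predictor. This is not how GPT works internally; it is a minimal counting model, with a three-sentence corpus invented for the example, that only shows the "predict the most likely continuation" behavior.

```python
from collections import Counter, defaultdict

# Tiny made-up training corpus.
CORPUS = [
    "the chicken walked across the road",
    "the dog walked across the road",
    "the chicken walked across the yard",
]

def train(corpus, context_size=2):
    """Count which token follows each pair of preceding tokens."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for i in range(context_size, len(tokens)):
            context = tuple(tokens[i - context_size:i])
            counts[context][tokens[i]] += 1
    return counts

def generate(counts, prompt, max_new_tokens=5, context_size=2):
    """Repeatedly append the most frequent next token for the current context."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        context = tuple(tokens[-context_size:])
        if context not in counts:
            break  # nothing learned for this context, so stop generating
        next_token, _count = counts[context].most_common(1)[0]
        tokens.append(next_token)
    return " ".join(tokens)

model = train(CORPUS)
completion = generate(model, "the cat walked across")
```

The model has never seen a cat, yet it completes the sentence fluently because "the road" is the statistically likely continuation. A fluent answer is not evidence that the answer was found anywhere, which is why the sources need to be tracked separately.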
The first step in this is basically to inject your data on top of your LLM, right? This is RAG, your basic retrieval augmented generation: you have data on top of your LLM. The basic process is that you take your knowledge base, usually text, images, video, whatever, and you put it into a neural network, your deep learning model. It has to be a different model for each type of data; don't use text models for image data. You put it into your network, you cut off the last layer of the network, and you get the outputs from the second-to-last layer, and that is your vector embedding. So that's where you get your vectors, and you store them in a vector database.
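The "cut off the last layer and take the second-to-last layer's output" step can be sketched as follows. This is an illustration only: the two-layer network and its weights are hypothetical, and the input is already a list of numbers, so tokenization and everything else a real embedding model does is left out.

```python
def layer(inputs, weights):
    """Fully connected layer with a ReLU activation (no bias, to keep the sketch short)."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs))) for row in weights]

# Hypothetical weights; a trained model would have learned these.
LAYERS = [
    # 3 inputs -> 4 hidden units
    [[0.2, -0.1, 0.4], [0.7, 0.3, -0.2], [-0.5, 0.6, 0.1], [0.1, 0.1, 0.1]],
    # 4 hidden units -> 2 outputs (the classification head)
    [[0.3, -0.3, 0.2, 0.5], [-0.4, 0.6, 0.1, -0.2]],
]

def predict(inputs):
    """Run every layer, including the final classification head."""
    out = inputs
    for weights in LAYERS:
        out = layer(out, weights)
    return out

def embed(inputs):
    """Run every layer except the last; the second-to-last output is the embedding."""
    out = inputs
    for weights in LAYERS[:-1]:
        out = layer(out, weights)
    return out

embedding = embed([1.0, 0.5, 2.0])
prediction = predict([1.0, 0.5, 2.0])
```

Here `embedding` has four numbers because the second-to-last layer has four units; the size of the embedding is a property of the model, not of the input.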
And that's basically how you inject data on top of your LLMs. With citations and attributions, we're gonna add an extra step in here. We're not just gonna store the vectors; we're gonna make sure that we store the chunk of text as well. But we'll talk about that in a moment. What I wanna talk about first, leading into that, is this: what is semantic similarity? Why are we storing vectors? What's the point? How is storing vectors going to help us cite articles and find the right information in the right place? So let's cover this thing called semantic similarity, so that we understand how vectors are able to find things that mean similar things.
In this example, what I want you to look at is that we vectorize these words, queen, king, woman, and man, and we're gonna do some math on them. With vectorization and semantic similarity, what we're essentially doing is taking any sort of unstructured data, like your text, and doing some sort of math on it. My image is covering this, but what I'm gonna show is that queen minus woman plus man is equal to king.
So we'll start with queen minus woman. This is (0.3, 0.9) minus (0.3, 0.4), and this is (0, 0.5), right? We can see here that queen and woman have the same value in the first dimension. What this essentially tells you is that these two words mean pretty much the same thing along that dimension. Another thing you'll notice is that queen and king, and woman and man, differ by the same amount in the second dimension. What you can infer from that is that these words have a similar meaning, or a similar difference in their meaning.
So when you do this kind of math, queen minus woman, you're gonna end up with (0, 0.5). What this vector represents is essentially the difference between the words queen and woman. If we take that difference and add man to it, which is (0.5, 0.2), we end up at (0.5, 0.7), which just so happens to also be king. So the difference between queen and woman, added onto man, is king. Essentially what I'm showing you here is math on words: words that are similar can be added and subtracted to create other words with a related meaning. That's the takeaway: vectors allow you to do math on words.
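The arithmetic from the slide can be checked directly. The four vectors below are the two-dimensional values quoted in the talk; the helper functions and the nearest-word lookup are additions for illustration.

```python
import math

# Two-dimensional word vectors as quoted in the talk.
VECTORS = {
    "queen": (0.3, 0.9),
    "woman": (0.3, 0.4),
    "man": (0.5, 0.2),
    "king": (0.5, 0.7),
}

def subtract(a, b):
    return tuple(x - y for x, y in zip(a, b))

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def nearest(vector, vocabulary):
    """Return the word whose vector has the smallest Euclidean distance to `vector`."""
    return min(vocabulary, key=lambda word: math.dist(vector, vocabulary[word]))

difference = subtract(VECTORS["queen"], VECTORS["woman"])  # approximately (0.0, 0.5)
result = add(difference, VECTORS["man"])                   # approximately (0.5, 0.7)
closest = nearest(result, VECTORS)
```

Real embeddings have hundreds or thousands of dimensions and the match is rarely exact, which is why the lookup is phrased as "nearest word" rather than equality.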
Okay? So what does similarity search look like? So far, we've talked all the way through steps one and two, right? We've gotten this unstructured data, we've turned it into vector embeddings, we've said that there's some text that needs to be added to that so we can keep track of the citations, and we've stored it in a vector database. What happens next? How does this help? Once we have all of this, the user comes and performs a query and says, "Hey, I want to know about, I don't know, the fall of Rome. I wanna know about the fall of the Roman Empire. Tell me about it."
Okay? So they say that query, it gets vectorized, and it gets sent off to the large language model, which says, "Hey, we need to find this." Then it goes to the vector database and says, okay, we're gonna find the closest things to the fall of Rome in our text, in our documents. So it goes in and says, okay, what I'm gonna do is find the nearest neighbors of "fall of Rome." Oh, hey, look: assassination of Julius Caesar, civil war, et cetera.
Okay, now this is what we're gonna return, and here are the cited documents. Here's where we found this document: we found it in this textbook, or whatever. Maybe you found it on Wikipedia. I know most schools say don't cite Wikipedia, but I think it's actually not a bad source. But basically it'll tell you where you found it.
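The retrieval step described here, finding the nearest neighbors of the query vector and returning their text together with where it came from, can be sketched like this. The three-dimensional embeddings and source names are hand-written for the example, and the search is an exact brute-force scan, whereas a vector database such as Milvus would use an approximate nearest neighbor index.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical entries: each stores the vector, the chunk of text, and its source.
ENTRIES = [
    {"embedding": [0.9, 0.1, 0.0], "text": "Assassination of Julius Caesar", "source": "textbook"},
    {"embedding": [0.8, 0.3, 0.1], "text": "Civil war", "source": "wikipedia"},
    {"embedding": [0.0, 0.2, 0.9], "text": "Bubble tea", "source": "food-blog"},
]

def search(query_embedding, entries, top_k=2):
    """Return the top_k most similar entries as text plus source, most similar first."""
    ranked = sorted(
        entries,
        key=lambda entry: cosine_similarity(query_embedding, entry["embedding"]),
        reverse=True,
    )
    return [{"text": e["text"], "source": e["source"]} for e in ranked[:top_k]]

# Pretend this vector is the embedding of "tell me about the fall of Rome".
hits = search([1.0, 0.2, 0.0], ENTRIES)
```

Because each hit carries its `source`, the application can show the user where every retrieved passage came from and let them judge whether to trust it.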
So you can establish either (a) this is a credible source and I trust it, or (b) this is not a credible source and I don't trust it. Maybe it finds information on The Onion and you're like, "Hey, why is that even in my database? I don't need that. Let's get that out of there." So citations will show you where the data comes from. The difference between citations and a typical similarity search is basically that we're gonna add these texts into the vector database, so that when we query it, we get the sources back. I also wanted to take a second to look at what the data looks like in a vector database.
I get a lot of questions about this: what does your vector database data look like? You'll have metadata in JSON format, like you would in any blob storage or NoSQL database. The difference here is really this embedding. The embedding is a kind of key that you can use: it's the approximate nearest neighbors key, and this is actually how the vector database is queried. And this thing that I've circled here is what I've called "paragraph" in this.
This is actually a screenshot of my data from a recent project that I've been working on. The paragraph that it has extracted is "we defined an anomaly as follows," and the screenshot doesn't show the rest of the paragraph. But what I want to point out with the citations piece is that it does include a paragraph of text there. So that's kind of what the data for a citation looks like.
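As a rough picture of such an entry, here is one record written as a Python dictionary and round-tripped through JSON. Only the `embedding` and `paragraph` field names and the paragraph text come from the talk; the `id`, the `metadata` keys, and every value are hypothetical placeholders.

```python
import json

# One entry as it might be stored: the vector used for search, the chunk of
# text kept for citations, and ordinary JSON-style metadata.
record = {
    "id": 1,                                    # hypothetical primary key
    "embedding": [0.12, -0.03, 0.57, 0.44],     # placeholder values
    "paragraph": "we defined an anomaly as follows",
    "metadata": {"source": "example-paper.pdf", "page": 3},  # hypothetical
}

def to_citation(entry):
    """Keep what a reader needs to check the answer; the embedding is for search, not display."""
    return {"text": entry["paragraph"], **entry["metadata"]}

serialized = json.dumps(record)
citation = to_citation(json.loads(serialized))
```

Without the `paragraph` field the search would still work, but there would be nothing human-readable to show as the citation.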
Typically, if you don't have to do citations or attributions, you do not have to save this text data. You only have to save the metadata that you need in your vector database. Okay? So what goes into a citation engine? I just talked about how this citation engine works: what's going on with the vector database inside, how it actually pulls relevant documents, and an example of that. So what goes into this? Citation engines are a piece of the RAG stack, a part of the LLM application framework. We define this framework as the CVP framework. CVP stands for ChatGPT (or any other large language model), a vector database such as Milvus, and prompt as code.
The way you can think about this is that the processor, the CPU, is ChatGPT. This does all the computational stuff. You can also think of it as the GPU, doing the computationally heavy work. Then you have storage. You have to have your data stored somewhere, and that's your vector database, right? So this is your hard drive, your ROM, et cetera. And prompt as code is essentially using the prompt to interface between you, the CPU, and the storage.
So where do citations sit in this stack? Citations sit with your vector database and your prompt as code. Depending on the way you have organized your project, your vector database may or may not be something that you automatically upload data to, or it may be something that you use a framework to upload data into. So it could sit as part of the prompt as code. If you have an AI agent, say, uploading data into your vector database, you can prompt it to say, "Hey, we also want to include the actual text and the source and things like that." So that's one way you can do that.
But you can also do this directly on your vector database, by just going into the vector database and saying, "Hey, one of the fields we wanna store is this text field." This also plays out right before you store your data, right? There's gonna be some pre-processing of your data that you're gonna need to do, and some of that includes things like chunking your data up into reasonable chunks. This is something I've been playing around with: how do you find a good-sized chunk? What is a good overlap for your chunks? This is actually a difficult problem that no one in industry, or even academia, has really found a strong solution to. But it's something you're gonna have to experiment with and figure out: how do you chunk your data correctly? And once you have your chunked data, how do you store it so that all your chunks map to the same documents, and things like that?
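One simple way to experiment with chunk size and overlap is a fixed-size character window that records which document and offset each chunk came from. This is a baseline sketch, not a recommended strategy; the function name, the default sizes, and the `doc_id` and `start` fields are assumptions made for illustration.

```python
def chunk_text(text, chunk_size=512, overlap=64, doc_id="doc-0"):
    """Split text into overlapping character windows that remember their origin."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({
            "doc_id": doc_id,   # lets every chunk map back to the same document
            "start": start,     # character offset of the chunk in the document
            "text": text[start:start + chunk_size],
        })
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# Small sizes so the overlap is easy to see.
chunks = chunk_text("abcdefghij" * 3, chunk_size=12, overlap=4, doc_id="alphabet")
```

With these settings each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.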
So this is where citations sit. The next thing we're gonna do is take a look at an example notebook where I've built a very, very simple citation engine. Okay? This is a QR code that you can scan to get to the notebook. I will also be dropping the link in the chat. But if you want to scan this on your phone so you have the link saved somewhere for future use, I'm gonna let you do that. I'm gonna give it 10 or 15 seconds, and then I'm gonna open up the notebook.
Okay. So this is the notebook. Let me go back into full screen. Very cool.
Okay. So this is the notebook that I'm using for citation engines. You'll see that at the end we get this cited response. What we're gonna do is scrape some data, put that data into a vector store, then put an LLM on top of the vector store, and create a citation query engine with that.
Okay? Here I have all of this stuff commented out because I had already done this. This is in the same folder that I had for my data for the multi-document query engine. If you were there for that, then you should also have this data. If not, I will also drop the link to that. Oh, right, I said I was gonna drop the link to this in the chat.
Let me get both links and put them in the chat. One sec. So this one is so you can follow along with this, and I'll drop the other one in a second. This is the other one. Both of these do this scraping.
Essentially what we do here is just get a list of cities, and you can pick any cities you want. I chose Toronto, Seattle, SF, Chicago, Boston, DC, Cambridge, and Houston. Then we just go to Wikipedia and ping Wikipedia. This is Wikipedia's API, nothing really fancy here. This is also not something that matters unless you're into web scraping.
Then we come down here and get our OpenAI key. If you're gonna use OpenAI, load up your .env and get your OpenAI key. Then we're going to import a bunch of things from LlamaIndex. This part matters. First we have to import a way to interact with OpenAI, and then we have to import the citation query engine. You can actually create your own citation query engine, but in this example we're gonna use the one from LlamaIndex. And we're also gonna need the vector store index, of course.
This is to hook up our vector store, Milvus. Then a simple directory reader, which reads the data from the data directory once you've scraped it. Storage context allows us to pass around a reference to Milvus. Load index from storage allows you to load an index from storage; I'm actually not sure we use this one here.
Then service context allows you to pass a large language model around. Then we're gonna need the LlamaIndex version of the Milvus vector store, so that LlamaIndex is able to access Milvus. And then we're gonna get the Milvus Lite default server and start it up. For this one I was using 2.2.11, but now we're on 2.3.0. Same interface, so no change in that interaction. Then I'm basically gonna open up a connection to the Milvus vector store in LlamaIndex.
I'm gonna name this connection "citations," and I'm gonna have it running on localhost.
Hey, Yujian, sorry to interrupt you. Is there a possibility that you might be able to zoom in on your editor?
Yes. I always forget that this is really small to see.
Is this better?
It is.
Okay, cool. So I'll just cover this again really quickly.
I don't know if you guys got to see this. So this is a way to interact with OpenAI. This is the citation query engine functionality provided by LlamaIndex. This is a way to create an index from a vector store. This is a way to read data from a directory.
This is a way to pass around the way that we store data, in this case the vector database. This is a way to load an index. This is a way to pass around an LLM. And this is the Milvus vector store specifically that goes with LlamaIndex. And this is the Milvus Lite default server.
We start it up, and then we create our Milvus vector store. I've called this collection "citations," this is localhost, and then we just call default server to get the listening port. We're gonna create a service context here, and basically we're gonna get GPT-3.5 Turbo. You know, pick whatever large language model works for you here.
It's just that OpenAI works easily by default with both LlamaIndex and LangChain. So for this one we're gonna use GPT-3.5, and then we're gonna get a storage context. Essentially what we wanna do here is have a storage context that loads up the vector store we created earlier. Then we're gonna use the directory reader to load the data.
This is the data that we scraped from Wikipedia up here. Then we're gonna create an index. We create the index from the documents that we scraped before, and we pass it the large language model and the vector database that we're gonna be using. Then we're gonna create a query engine. Essentially what we do here is create a citation query engine object. What this does is internally tell LlamaIndex, "When you retrieve information, I want to know where the source is."
So that's what this is. This is part of that prompt-as-code, kind of interfacing piece in the CVP stack. What we're gonna pass it is the index that we want to use and then a similarity top k. Essentially that is how many top responses we want from the vector database, and we're just gonna say three, for example. And this is how granular the citations are; in this example, we say 512 characters. Then we just ask, "Does Seattle or Houston have a bigger airport?" And it tells us, you know, blah, blah, blah.
And if we want the nodes, it'll print out much more information. You can see that where it gets this information about Houston is this list of things, this section about Houston. So yeah, that's pretty much what goes into a citation query engine, how you can build your own simple example with LlamaIndex, and why citations matter. So now I'm gonna open up the floor for questions.
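The behavior walked through in the notebook (retrieve a few nodes, cut them into citation-sized pieces, number them, and return them alongside the answer) can be approximated without any framework. The sketch below is not LlamaIndex's implementation and does not call its API: the function names, the prompt wording, the node fields, and the stand-in `fake_llm` are all assumptions, and the 512-character granularity follows the wording used in the talk.

```python
def build_cited_sources(nodes, citation_chunk_size=512):
    """Split retrieved nodes into numbered citation chunks that remember their origin."""
    sources = []
    for node in nodes:
        text = node["text"]
        for start in range(0, len(text), citation_chunk_size):
            sources.append({
                "number": len(sources) + 1,
                "text": text[start:start + citation_chunk_size],
                "origin": node["origin"],
            })
    return sources

def cited_query(question, nodes, llm, citation_chunk_size=512):
    """Build a prompt from numbered sources and return the answer with those sources."""
    sources = build_cited_sources(nodes, citation_chunk_size)
    context = "\n".join(f"Source {s['number']}: {s['text']}" for s in sources)
    prompt = (
        "Answer using only the numbered sources and cite them like [1].\n"
        f"{context}\nQuestion: {question}\nAnswer:"
    )
    return {"answer": llm(prompt), "sources": sources}

def fake_llm(prompt):
    """Stand-in for a real model call so the sketch runs offline."""
    return "(model answer citing [1])"

# Pretend these are the top nodes returned by the vector database.
retrieved = [
    {"text": "Houston airport text. " * 30, "origin": "Houston.txt"},
    {"text": "Seattle airport text.", "origin": "Seattle.txt"},
]
result = cited_query("Does Seattle or Houston have a bigger airport?", retrieved, fake_llm)
```

The sources in `result["sources"]` come from the stored data, not from the model, so even if the generated answer is wrong, the reader can check it against the numbered passages.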
Thanks, Yujian. While we wait for attendee questions to roll in, a reminder to everyone to use the Q&A tool at the bottom of your screen. I know you've been working on this for a little bit of time. I'm curious, what did you find to be the most challenging part of building this, or solving this problem?
Solving this problem? Yeah, like I was saying at the end of the presentation, the most difficult part of creating an effective citation engine is actually getting your chunking correct. So it's actually your data pre-processing.
Which is really annoying, because that's an area of AI/ML that is not very well defined and not very well solved. But it's very important for your citations, because you want to get the right context, and you want to get it in the right context window size. So that was one of the hard parts. In addition, I built a few other versions of this. This one is a version that runs on LlamaIndex.
I did another version where I built it with just Milvus, and in that case chunking was actually more annoying. This one at least automatically chunked the documents for me. But I think another difficult part is just choosing what to store and how to store it. For example, you don't just want to store the text. You also wanna store things like: where did I find this text? Who's the author? When was it published?
A lot of these things just require more pre-processing and thinking and design than actually building the indexing itself.
Jeremiah asks: how difficult is it to add new data? Do you need to regenerate the vectors every time you add a document, or can you add vectors incrementally?
I'm gonna need more information on this question. It seems to me, and you can tell me if I'm wrong in the Q&A, that you have this concept that one vector represents the entire data.
Adding new data would then require you to add to the vector; that's where I'm getting the "regenerate the vectors every time you add a document" from. Vectors are actually the same dimension every time, and they're generated by a neural network like this, right? The vector is actually this output layer from a neural network, for the input documents. So for example, if you give a sentence to this network, you'll get a four-dimensional vector. If you give it a paragraph, you'll get another four-dimensional vector.
So vectors don't change; you don't add to vectors. And adding new data is very easy. All you say is, "Hey, here's my new data, generate an embedding." You don't need to regenerate. You just need to generate it once, and then you can store that in the vector database.
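The answer above can be shown with a toy store: each new piece of text gets its own fixed-size vector, and inserting it leaves every existing vector untouched. The `embed` function here is a stand-in that just folds character codes into four numbers (matching the four-dimensional example in the answer); it is not a real embedding model, and `InMemoryStore` is a hypothetical class, not a vector database client.

```python
def embed(text, dim=4):
    """Toy stand-in for an embedding model: always returns `dim` numbers, whatever the input length."""
    vector = [0.0] * dim
    for i, ch in enumerate(text):
        vector[i % dim] += ord(ch) / 1000.0
    return vector

class InMemoryStore:
    """Minimal list-backed store; entries are only ever appended."""

    def __init__(self):
        self.entries = []

    def insert(self, text, source):
        # The embedding is generated once, at insert time, for this text only.
        self.entries.append({"embedding": embed(text), "text": text, "source": source})

store = InMemoryStore()
store.insert("A single sentence.", "doc-1")
first_vector = list(store.entries[0]["embedding"])

# Adding a much longer document later does not touch the first entry.
store.insert("A much longer paragraph. " * 10, "doc-2")
```

A sentence and a paragraph both end up as four numbers, and the first entry's vector is the same before and after the second insert.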
Can you explain how you assign values to tokens again?
Yeah, I saw this question, and I don't actually understand it, because you do not assign values to tokens. So to the person who asked that question, if you can add a little bit more clarity, that would be great, and we'll give it another shot.
Is this citation approach essentially just prompting the LLM to quote the parts it is reading from?
Just prompting the LLM to quote the parts?
No. What this citation approach does is say, "Hey, find me your sources." As we know, LLMs have this hallucination problem, so what this actually does is say, "Hey, we're gonna store the sources in a database, and when we retrieve those sources, we're gonna include where we got them from and what the source is." So it's not just prompting the LLM to quote where it's reading from, because the LLM doesn't read. The LLM just generates words. There's no reading done by the LLM.
For source one, what exactly are the sources?
In this example that I provided, you can tell from the source that it's from Houston, because it says Houston. But I did not ask it to provide the link.
It is unclear what you built. You're using a citation query engine; somehow the message got lost as to what you built.
Uh, so a citation query engine is a piece of this. Query engines are a piece of your large language model application. Essentially what they do is they drive the way that your application sends queries to its database. And so the citation query engine is a piece of a larger large language model application, which is what we built here. This is an application that takes advantage of a large language model.
We've injected a database, we've injected our own personal data, and in this case, our own personal data is the Wikipedia files. And, yeah, I mean, that's basically it: a citation query engine is a piece of a large language model application. That's a better clarification.
There's no consideration of other content when generating a vector, correct? That is correct.
Is this a matter of using a citation query engine from somewhere else, or should we build one? Okay.
You know what, it's becoming clear to me that I should not have used this example, and I should have just used the raw example. You can build your own citation query engine. This citation query engine is just something I pulled from LlamaIndex because I like the framework and I like that they've made it easy to implement this citation stuff. But you don't need to use an externally built citation query engine. Essentially what you need to do is, (a) remember to store your source data, and (b) declare that you want to see your source data when you do your query.
So there's really two pieces of the citation query engine, and I mean, this one also handles the chunking. So you also need to handle chunking your data. But there's only a few pieces of it, and you can build it on your own. You do not need to use another one. I simply used this one from LlamaIndex because I like their framework, and I think it's cool.
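The pieces described above (chunk the data, store the source alongside each chunk, and return the source with each query result) can be sketched in plain Python. This is a minimal build-your-own illustration, not the LlamaIndex citation query engine: the `Chunk` class, the helper functions, the word-count similarity, and the sample texts are all assumptions made for the example, and a real application would use an embedding model and a vector database instead.

```python
from collections import Counter
from dataclasses import dataclass
import math


@dataclass
class Chunk:
    text: str
    source: str    # where the chunk came from (file name, URL, ...)
    position: int  # index of the chunk within that source


def chunk_text(text: str, source: str, size: int = 8) -> list[Chunk]:
    """Split text into chunks of `size` words, keeping the source with each."""
    words = text.split()
    return [
        Chunk(" ".join(words[i:i + size]), source, i // size)
        for i in range(0, len(words), size)
    ]


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(w.strip(".,?!").lower() for w in text.split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[Chunk], k: int = 1) -> list[Chunk]:
    """Return the k most similar chunks, each still carrying its source."""
    q = bag_of_words(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c.text)), reverse=True)
    return ranked[:k]


# Sample texts written for this example only.
corpus = chunk_text(
    "Houston has two major airports serving the city and region.", "houston.txt"
) + chunk_text(
    "Seattle is served by one large international airport near Tacoma.", "seattle.txt"
)

hits = retrieve("Which airports serve Houston?", corpus)
for n, hit in enumerate(hits, start=1):
    print(f"[{n}] {hit.source} (chunk {hit.position}): {hit.text}")
```

The retrieved chunks, together with their numbered sources, would then be passed to the LLM so that the generated answer can refer back to them.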
Uh, do you have any chunking strategy recommendations for a tree-based document structure (think DOM tree, or parent, child, grandchild) to enable more specific citations and context?
Uh, yes, actually. I mean, this isn't necessarily chunking related, but in terms of tree-based document structures, I've been working on something with knowledge graphs, putting a knowledge graph on top of a vector database. And actually we have a customer that does that: OriginTrail put a knowledge graph on top of Milvus. The strategy that I would recommend here is really to keep the metadata from your nodes in the entries into your vector database. So for example, your nodes probably have some sort of metadata such as parent, or left child, or right child, or, you know, grandchild. I don't know if you keep track of that relationship.
You don't need to, but you could, I guess. So your nodes will have some of this in the metadata. And essentially what you want to do is ensure that you put these into your entries, so that when you get your query results back, you have information that, you know, this is the text, and this is a child node of this parent node. So what you can do then is say, "Okay, we have part of this text," and go back and query the parent node as well if you need more context for the information, right? So yeah, what you would want to do with something like that is have a way to link your chunks as well as cite them.
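One way to picture this is the sketch below, where each stored entry keeps a `parent_id` in its metadata so that a hit can be expanded with its ancestors when more context is needed. The entry layout, the field names, and the `context_chain` helper are hypothetical and chosen only for illustration; they are not a Milvus schema.

```python
# Hypothetical entries, shaped like results that carry their tree metadata.
entries = {
    "doc": {"text": "Airports guide", "parent_id": None},
    "sec-1": {"text": "Overview of Houston airports.", "parent_id": "doc"},
    "sec-1-a": {"text": "Runway details for the largest airport.", "parent_id": "sec-1"},
}


def context_chain(entry_id: str, entries: dict, max_hops: int = 2) -> list[str]:
    """Return the hit's id followed by up to `max_hops` ancestor ids."""
    chain = []
    current = entry_id
    while current is not None and len(chain) <= max_hops:
        chain.append(current)
        current = entries[current]["parent_id"]
    return chain


# Suppose the similarity search returned the leaf chunk "sec-1-a".
chain = context_chain("sec-1-a", entries)
for entry_id in chain:
    print(f"{entry_id}: {entries[entry_id]['text']}")
```

The same ids serve both purposes mentioned above: they link a chunk to its surrounding context, and they give the citation something precise to point at.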
Uh, the class you set up has the main parts, the LLM and the source document. The last gets vectorized. The LLM is a black box. So how do you connect a document as a source to the output of a query? Are you doing a similarity search in all the document vectors with the prompt response?
Um, so the way that this kind of application works is that you send a query. So you ask it a question in natural language, and then what actually happens behind the scenes is that this question, such as "Does Seattle or Houston have a bigger airport?", gets turned into a query, or a set of queries, that basically say "Seattle airport," "Houston airport," and then "size." So it's actually taking the natural language question.
The role of the LLM is to take your natural language question and understand that question, or break the question down into queries which you can then run against the vector store. And that's how you do similarity search on the document vectors. I think that's kind of what you're getting at with the prompt response part. But the way that you connect the document as the source is that you vectorize the documents, right? You have the vectorized documents, you put them into a vector database, and then you query the vector database with the question, which is where the role of the LLM comes in. And then at the end, you turn that into a response like this.
And then once you have your response, you say, "Where did we get this information?" Okay, let's get this information. And so that's kind of how that works.
We have another question in the chat. How does the program assign values to tokens? Does it assign the same token values the language model has for each word/token?
Oh...
Your...
Oh, there's your audio. Sorry, your audio cut out for just a second.
Oh, uh, okay. Um, I'm having a really hard time understanding what this question means. At no point in our program do we assign any values to tokens. Tokens are, uh... oh, you know what? I have a visual for this. Here we go.
In this example here, tokens are these words. In reality, this example is very simplified. In reality, tokens are words or pieces of words that your network has defined as a single building block of language. So for example, "the" is probably a single token, but "chicken" is probably two tokens. It probably splits it into something like "chick" and "en", I don't know. And then "walked" is probably two tokens.
It's probably "walk" and "ed," and it says this "ed" token means past tense, and this "walk" is a verb. So then it pieces them together and says "walked" is past-tense walking, you know, whatever. "Across" might be one token. "The" is, once again, probably one token. "Road" is probably one token.
So tokenization is actually a pretty complex process. The way that it essentially works is you are running a ton of data into your language model, and it says, "Hey, it looks like from all this data, this is a single thing, a single building block that I can identify as the smallest building block of these separate words." So spaces, periods, and punctuation are often also tokens. And just to finish this question up: does it assign the same? No, not every language model has the same tokenization. Okay.
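To make the idea of subword tokens concrete, here is a toy sketch that splits a sentence by greedy longest match against a small vocabulary. The vocabulary and the matching rule are assumptions made for this example; real tokenizers learn their vocabulary from large amounts of data, and each model's tokenizer will split text differently.

```python
# Hypothetical toy vocabulary; real tokenizers learn theirs from data.
TOY_VOCAB = {"the", "chick", "en", "walk", "ed", "across", "road", " ", "."}


def tokenize(text: str, vocab: set[str] = TOY_VOCAB) -> list[str]:
    """Greedy longest-match split of `text` into vocabulary pieces.

    Characters that match nothing in the vocabulary become their own token.
    """
    text = text.lower()
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens


tokens = tokenize("The chicken walked across the road.")
print(tokens)
```

With this vocabulary, "chicken" and "walked" each come out as two tokens, while the spaces and the period are tokens of their own. Swapping in a different vocabulary changes the split, which is why token boundaries are not shared across language models.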
I'm working on a similar use case as you: knowledge graph vertices embedded into Milvus, to be used in a RAG app whose response must contain citations. Ha. Yes. Thanks for the chunking strategy advice. Any other advice on how to use Milvus more efficiently while scaling? Any recommendations on which indexing algorithm to use for tree-based embeddings, which contain child references and metadata?
So I wouldn't even embed the tree thing, by the way.
The tree thing should live entirely in your metadata; just embed your text or your images, whatever. And to use Milvus more efficiently while scaling, well, Milvus is actually built to scale really easily. This is gonna be the part where I just, you know, advertise Milvus to you. But Milvus has this separation of three concerns: it separates your indexing, your data ingestion, and your querying. So because it already has this infrastructure there, what you really need to worry about is the Kubernetes pods, connecting them, and saying, "When do I spin up a second pod?" And that's mostly monitoring, and knowing your use case and how your app is being used, which is not really related to vector databases.
And so that's gonna be more like load balancing. Another thing you could do, if you get to the point where you're just like, "I don't wanna work with Kubernetes anymore, I don't wanna do DevOps, I don't want to make this scale," is reach out to our sales team; they'll be more than happy to talk to you.
May you share a row view in Milvus for stored citation data? That is what this is, I think. Unless you mean like... well, these are in JSON form. I guess you can think of this as tabular data.
But a row in tabular data would probably just be a bunch of the same IDs over and over again, right? But this is essentially what your data looks like, what one entry looks like.
How many hidden states does USA have? Oh, kidding. Okay.
As users' natural language inputs can be quite wild, ranging from some keywords to a very descriptive instruction or question, any recommendations on taming them to something that will pull back useful information? Whoa. I want additional information on this question.
I want to know who your users are that have such a wide variety of things, and why that might not result in pulling back useful information. I think that if you're using this internally, then you're probably gonna wanna do some sort of prompt engineering. Just have some rules, you know, like, "Hey, every time you wanna look for something, include this sentence, or this set of words, or this link," or something like that. If you're having external users, then, I mean, you're up a creek without a paddle.
Any recommendations on which indexes to use? Oh yes, for your knowledge graph use case. The differences in the indexes are mainly just trade-offs. So it depends on what you need in terms of accuracy, speed, and memory.
So for example, if you need really high accuracy, you don't care about speed, and you're agnostic about memory, just use a flat index. If you need really high accuracy and you care about speed and you don't care about memory, then you would use HNSW, which is the most popular one. If you care a lot about speed and you don't care as much about accuracy and you care about memory, then use some sort of quantization, such as scalar quantization or product quantization with IVF. Product quantization means you care more about memory than scalar quantization does. Yeah, so it all depends on the trade-offs you need.
We have another question for you in the chat, Yujian.
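The index trade-offs from the answer above can be written down as a small decision helper. This is only a sketch of the rule of thumb from the talk: the `choose_index` function and its `memory` labels are hypothetical, the parameter values are illustrative rather than tuned, and the returned dictionaries follow the index-parameter shape that Milvus clients commonly accept, so check the Milvus documentation before relying on them.

```python
def choose_index(accuracy_first: bool, speed_matters: bool, memory: str = "dont_care") -> dict:
    """Map the accuracy/speed/memory trade-off to a Milvus-style index config.

    `memory` is one of the hypothetical labels "dont_care", "tight",
    or "very_tight". Parameter values are illustrative, not tuned.
    """
    if accuracy_first and not speed_matters:
        # Exact (brute-force) search: highest accuracy, slowest.
        index_type, params = "FLAT", {}
    elif accuracy_first and memory == "dont_care":
        # Graph-based index: fast and accurate, but memory hungry.
        index_type, params = "HNSW", {"M": 16, "efConstruction": 200}
    elif memory == "very_tight":
        # Product quantization: smallest footprint, lowest accuracy.
        # Note: the vector dimension must be divisible by m.
        index_type, params = "IVF_PQ", {"nlist": 1024, "m": 8, "nbits": 8}
    else:
        # Scalar quantization: a middle ground on memory and accuracy.
        index_type, params = "IVF_SQ8", {"nlist": 1024}
    return {"index_type": index_type, "metric_type": "L2", "params": params}


print(choose_index(accuracy_first=True, speed_matters=False))
print(choose_index(accuracy_first=True, speed_matters=True))
print(choose_index(accuracy_first=False, speed_matters=True, memory="tight"))
print(choose_index(accuracy_first=False, speed_matters=True, memory="very_tight"))
```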
How simple is it to swap LLMs or vector databases? For example, using Falcon LLM or Pinecone DB.
Um, I don't know about the other vector databases. I will say that it is not that difficult to swap LLMs, as long as you can host it locally. And as long as you understand the interface, you can use an LLM and just drop them in and out. All you need to do is change the way that you're interfacing with the LLM, but typically all you're doing with the LLM is putting it into a framework, and so it's typically very easy to drop in and out. I do know that not all the vector databases have the same interface, and so you probably want to learn what the interface differences are if you want to be able to switch between them.
Um, so yeah, that's basically it. It's not that hard. You just have to learn all the interfaces.
While we wait for more questions from the audience, I know you prepared some FAQs. Do you wanna run through those?
Yeah. So what...
Back into slide mode.
So, oh, well, we'll answer this last question here. Are the sources perfectly accurate, in the sense that, is there a guarantee that the prompt response really is sourced from the document presented as a source? Uh, you know, that's a really good question.
I wanna say yes, but once again, large language models are neural networks, so they are advanced statistical methods. And just because you give it a source does not mean that it is not possible, in some cases, maybe 0.1% of cases, that it is going to give you a wrong answer back. But I would say this is very unlikely. And if you prompt your LLM correctly, you should be able to get rid of that almost entirely.
But there is no such thing as perfection in software.
I think that also brings up another interesting point: knowing the source versus the accuracy of the source itself. You know, The Onion versus the AP; there's a big scope in there. So I think we need to continue to remember that citations and attributions can help against some of this hallucinatory misinformation, but we need to remember that sometimes misinformation exists within the original source itself.
Yes. Not all sources have the same level of trustworthiness. The FAQs that I have are mostly pertinent to vector databases, 'cause this is mostly what I talk about, and so this is what I get a lot of questions about.
I will say the "when not to use" for citations and attributions also applies. But the real not-to-use case is like, "Hey, you know what? I don't care where this data comes from. I just want to be able to use a chatbot." Then, you know, don't use it; you don't need to. That's extra work for no reason.
When you do not wanna use a vector database is if you have only key-value data. If you're working with a bunch of key-value data, you don't need similarities, so it's overkill. Okay. And then I see a lot of things around CSV files and PDFs, especially around vectorization.
It seems that there are not a lot of choices for models around CSVs and PDFs. And the challenge with vectorization is that you need to have a model that is trained on the same type of data that you're trying to embed or vectorize. There's not a lot of models that are trained on CSV or PDF data, and so vectorization with these is difficult. What I suggest is taking your CSVs and trying to convert each row into a full sentence, and taking your PDFs and converting them into text, and then chunking that up and storing both of these as text.
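The CSV suggestion above can be sketched with the standard library alone: read each row and turn it into a plain sentence that a text embedding model can handle. The sample data, the column names, and the sentence template in `row_to_sentence` are assumptions made for this example; a real dataset would likely want a template written for its own columns.

```python
import csv
import io

# Small sample table used only for this example.
SAMPLE_CSV = """city,airport,code
Houston,George Bush Intercontinental,IAH
Seattle,Seattle-Tacoma International,SEA
"""


def row_to_sentence(row: dict[str, str]) -> str:
    """Turn one CSV row into a sentence of 'column is value' clauses."""
    parts = [f"{column} is {value}" for column, value in row.items()]
    return "The " + ", the ".join(parts) + "."


sentences = [row_to_sentence(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for sentence in sentences:
    print(sentence)
```

Each resulting sentence can then be chunked, embedded, and stored as text, the same way as text extracted from a PDF.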
Finally, something that I get asked about every so often is hybrid search. Hybrid search is "How do I search structured and unstructured data together?" And Milvus lets you do this through this thing called filters, where you can say, "Hey, search for something where maybe the publication date was after August 1st, 2023," or something like that. So yeah, these are the FAQs that I usually get. And there's a couple more chats here. Oh, okay.
No, there's not. Okay. Cool. So that's pretty much all I've got for this section.
This is your last call to get in any additional questions for Yujian.
Otherwise we will wrap up the session. We'll just give you another minute. But in the meantime, thank you, Yujian, for this really great session and for walking us through it. I think being able to spend some really thoughtful time with the audience questions was a really powerful part of this presentation, and we're really grateful for everyone who spent some time with us today. Looks like we've got one other, on the hybrid search.
Is there a way to add metadata to filter on?
Yes. In fact, uh, I'm going to move the QR code real quick. On the hybrid search, it is done on the metadata. In fact, filtering on the metadata is really the only thing you can do the hybrid search on.
And since we're on this topic, I'll just give you an interesting tidbit here. Milvus performs this hybrid search by applying a bitmask. And so there's no "we're gonna search for this and then we're gonna filter," or "we're gonna filter and then we're gonna search." By converting your filter into a bitmask, you can actually search and filter at the same time. It's really cool. It basically cuts down your query time by a lot.
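The bitmask idea can be illustrated with a toy example: the metadata filter is evaluated once into a bitmask, and the similarity scan consults that mask as it goes, so filtering and searching happen in a single pass. This is a simplified sketch of the concept and not Milvus's actual implementation; the entries, the `published` field, and both helper functions are made up for the example.

```python
from datetime import date
import math

# Made-up entries: a vector plus a metadata field to filter on.
entries = [
    {"id": 0, "vector": [0.1, 0.9], "published": date(2023, 7, 15)},
    {"id": 1, "vector": [0.2, 0.8], "published": date(2023, 9, 1)},
    {"id": 2, "vector": [0.9, 0.1], "published": date(2023, 10, 5)},
]


def build_bitmask(entries: list[dict], predicate) -> int:
    """Set bit `pos` when the entry at that position passes the filter."""
    mask = 0
    for pos, entry in enumerate(entries):
        if predicate(entry):
            mask |= 1 << pos
    return mask


def filtered_search(query: list[float], entries: list[dict], mask: int, k: int = 1) -> list[int]:
    """Scan once, skipping masked-out entries while computing distances."""
    scored = []
    for pos, entry in enumerate(entries):
        if not (mask >> pos) & 1:
            continue  # filtered out: no distance is computed
        scored.append((math.dist(query, entry["vector"]), entry["id"]))
    return [entry_id for _, entry_id in sorted(scored)[:k]]


# Filter: publication date after August 1st, 2023.
mask = build_bitmask(entries, lambda e: e["published"] > date(2023, 8, 1))
result = filtered_search([0.1, 0.9], entries, mask)
print(bin(mask), result)
```

Entry 0 is the closest vector to the query but fails the date filter, so the search returns entry 1 instead.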
Yes, it is fast. It's basically 'cause it's linear time, right? So it's really cool. I was like, wow. When I learned about that, I was just like, "Oh, this is really interesting."
All right.
I think that is our last question for today's session. As we mentioned, today's session was recorded, so you'll get a link to the replay. Yujian, thank you so much. This was great. Great.
Yeah. Thank you.
Meet the Speaker
Yujian Tang
Developer Advocate at Zilliz
Yujian Tang is a Developer Advocate at Zilliz. He has a background as a software engineer working on AutoML at Amazon. Yujian studied Computer Science, Statistics, and Neuroscience with research papers published to conferences including IEEE Big Data. He enjoys drinking bubble tea, spending time with family, and being near water.