Tutorial: Working with LLMs at Scale


Name: Tutorial: Working with LLMs at Scale
Start: 2023-06-16T00:00:00.000Z
End: 2023-06-16T01:00:00.000Z

Zilliz Webinar - Zoom

What will you learn

In this hands-on tutorial, we’ll introduce LLMs and the two main problems they face in production: high cost and lack of domain knowledge. We then introduce vector databases as a solution to these problems, and cover how a vector database can facilitate data injection and caching through the use of vector embeddings.

Then we’ll use this knowledge to build an LLM application using LlamaIndex and Milvus, the world’s most popular vector database.

What you’ll need:

Python 3.9 or above
A basic understanding of vectors and databases

Topics covered

What is a vector database
Why do LLMs face data issues?
How to deal with data issues in an LLM


Transcript

Today I am pleased to introduce today's session, Working with LLMs at Scale, and our guest speaker, Yujian Tang. Yujian is a developer advocate here at Zilliz. He has a background as a software engineer working on AutoML at Amazon. Yujian studied computer science, statistics, and neuroscience, with research papers published at conferences including IEEE Big Data. He enjoys drinking bubble tea, spending time with his family, and being near water. Welcome, Yujian.

Yeah, thanks for that introduction, Emily. Hello, everybody. My name is Yujian. Today I'm gonna be talking to you about scaling LLM applications, or working with LLMs at scale. And I've just given this slide the title "The Key to Scaling LLM Applications."

So I'll cover a little bit of housekeeping first. We're gonna go over these slides. It's going to be a short walkthrough, maybe 20-ish minutes or so. And then we're gonna dive into some code, where we're gonna work on how you can do multi-document Q&A with LlamaIndex, LangChain, and Milvus. So thanks for coming.

I'm gonna get started. Okay, so this is me. My name is Yujian Tang. I'm a developer advocate at Zilliz. If you have your phones, you can scan the QR code. That will take you to my LinkedIn.

You can also email me at my zilliz.com address or find me on Twitter, but I'm most active on LinkedIn. Oh, what's going on here? Okay, this is a little bit about Zilliz. We maintain Milvus; it's the world's most popular open source vector database.

These are some links where you can find us. You can go to Zilliz Universe on Twitter. On LinkedIn, you can look up Zilliz. You can find the Milvus Slack here, or you can go to GitHub, to milvus-io/milvus, and check out the vector database. Okay, so for this talk today, we're gonna start with covering the basics of large language models. We're just gonna cover what neural networks are like and how they've evolved into large language models.

Then we're gonna talk about some of the challenges that are involved with LLMs and using LLMs in production. Then I'm gonna introduce this thing called the CVP framework, which is ChatGPT, Vector databases, and Prompt-as-code. We're gonna walk through what that looks like and how it helps solve challenges with LLMs. Then we're gonna talk quickly about what a vector database is, and I'm gonna use Milvus as an example architecture for understanding vector databases. Then at the end, we're gonna go through a quick demo, which is a code walkthrough.

So first up, we're going to think about large language models. If you guys have been around for a while, you've probably heard of ChatGPT, and you probably know about OpenAI, and Claude from Anthropic, and Bard from Google. These are the most advanced language models that are out there. These are the large language models that are causing a lot of hype right now.

So let's take a step back and talk about how we got here and how we got to this idea of language models. Let's step all the way back to 2012. We're gonna talk about convolutions, which were actually around before 2012, but they were really popularized by AlexNet. In this example, we can see that the way these convolutions work is that they give you context into the tokens next to yours. So take the sentence "Milvus is the world's most popular open source vector database."

You can see that the way this would be set up in this neural network is that you have context into the two tokens right next to yours. From there, we moved on to this thing called self-attention, which gives you global context. As you can see, all these arrows are pointing to all of the neurons, all of the words, or all of the tokens, in the sentence "Milvus is the world's most popular vector database." I didn't go through all the words here, but basically every single word in the sentence would have context into every other word in the sentence. So "Milvus" would know "world," and so on.

But as we know, in English that's not actually necessary. Sentences are built like this. So what we actually need is something like this, which is causal attention, which gives these neurons, like the word "is," access to the words before them. You can see that all the words point down, which means that all of the tokens further down the sentence have access to the information given by the tokens at the beginning of the sentence. And this takes us to how LLMs, RNNs, CNNs, and these transformer models work.
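The difference between self-attention and causal attention described above can be sketched as a visibility mask. This is only an illustrative toy in plain Python, not how any particular model implements it; the function name `attention_mask` and the four-token example are assumptions made for the sketch.

```python
def attention_mask(num_tokens, causal):
    """Build a 0/1 matrix where mask[i][j] == 1 means token i can see token j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(num_tokens)]
        for i in range(num_tokens)
    ]

# Hypothetical four-token prefix of the sentence used in the talk.
tokens = ["Milvus", "is", "the", "world's"]

# Self-attention: every token sees every other token (global context).
full_mask = attention_mask(len(tokens), causal=False)

# Causal attention: each token sees only itself and the tokens before it.
causal_mask = attention_mask(len(tokens), causal=True)

for token, row in zip(tokens, causal_mask):
    print(f"{token:>8} sees {row}")
```

In the causal mask, each row has ones only up to its own position, which matches the idea that later tokens have access to earlier ones but not the other way around.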

There are these different attention mechanisms that are imposed on the neurons based on the architecture. And then it's essentially statistics from there, right? LLMs predict your future tokens. In these examples, you would be predicting the next token. For this one, we'd be looking for the word "database." That's modeled here as "Milvus is the world's most popular vector [blank]."

And if an LLM is producing the sentence, it would say "database," because it would see that it has the highest probability, at 0.86 in this example. Because LLMs are stochastic models that are just predicting the most likely next token, one of the biggest downsides is this thing called hallucination, which is a plausible-sounding but factually incorrect response.
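As a rough illustration of picking the most likely next token, here is a minimal sketch. Only the 0.86 probability for "database" comes from the talk; the other candidate tokens and their probabilities are made up for the example, and `pick_next_token` is a hypothetical helper, not part of any library.

```python
def pick_next_token(probabilities):
    """Greedy decoding: return the token with the highest probability."""
    return max(probabilities, key=probabilities.get)

# Hypothetical distribution over the next token; only 0.86 is from the talk.
next_token_probs = {"database": 0.86, "search": 0.08, "space": 0.04, "index": 0.02}

prompt = "Milvus is the world's most popular vector"
print(prompt, pick_next_token(next_token_probs))
```

The model has no notion of whether the chosen token is true, only that it is likely, which is why a fluent but wrong continuation can come out.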

So this is where the LLM makes up some sort of data based on the data that it's seen, and it says that this is the right next word, basically. So let's walk through a little bit of math. I'm not gonna spend too much time on this part. This is mostly for those of you that are really into the math behind the machine learning stuff. This is basic statistics, right? We're given some tokens, t0 all the way to tn, and tn+1.

And then we want to model the outputs based on a probability distribution. So this is a representation of Bayesian statistics, right? What is the probability of "database" given all of the words before it in the sentence: "Milvus is the world's most," blah, blah, blah. And these are the indicators, the way that it's written in the math behind the scenes. Okay? Now let's look at some of the challenges involved with LLMs. We're gonna look at an example of hallucination, just from ChatGPT.

The example query is "How do I perform a query using Milvus?" It'll return something like this, which on the surface looks pretty accurate. You can look at this and say, okay, this looks like Python code. I can see that there's a list here, some dictionaries here, and so on. But if you know the actual code and you go into the docs, you'll see that this is a hallucination and this example is incorrect. The problem here is basically that this is not actually how you connect to a Milvus server.

Later on we're going to talk about how you do the connection, but I just wanted to show this as an example of what a hallucination in an LLM could look like. So, the solutions to hallucinations, right? There are really two big solutions being proposed right now. One is fine-tuning your models, which can be helpful for keeping your models up to date, or for training them on your specific information. But this includes challenges involving how much data you have, because these LLMs see billions of data points before you get to touch them, before you get to fine-tune them. So our proposed solution is that you inject domain knowledge into these large language models using the CVP framework, which is what we're gonna talk about next.

As I was saying earlier, the CVP framework is ChatGPT, which actually can be any LLM, we just use it 'cause CVP sounds nice; vector database, which is the V; and P, which is prompt-as-code. The key idea behind the CVP framework is that we view LLM apps as a general-purpose computer with a processor, like a CPU; persistent storage, the vector database; and code, which in this case is the prompt. Later on when we go through the code, we're gonna see how LlamaIndex and LangChain operate as the processor and the code, right? LangChain helps you build the prompting with your LLMs, and LlamaIndex kind of wraps around your data. So that's the thought on what this framework for building a scalable LLM app will look like. And these are the pieces for it, right? ChatGPT, or any other LLM, is the processor. A vector database like Milvus is your storage, your ROM, your memory, whatever. And prompt-as-code is your interface.

This is just an example of a project that uses CVP. It's called OSS Chat. You may have seen this around. This is a chatbot for you to interface with open source software. What it does is take the documents that are available for that open source software and create this layer on top of ChatGPT that uses Zilliz Cloud as the vector database. Zilliz Cloud is the cloud version of Milvus.

It uses Zilliz Cloud as the vector database, and then caches all the data that gets asked about as more and more users ask questions, using GPTCache. Once you ask a question, it goes and looks into the vector database and says, hey, is the answer to this question stored here? If it is, then we're gonna return that, and then we're gonna go to ChatGPT and give you a response. If not, then we don't. Okay, so let's go back to that hallucination example from earlier, "How do I perform a query using Milvus?" When we use that with OSS Chat, we actually get this code back, which shows you the actual way to do a connection. This is mainly just an example to show that if you inject the data on top of the LLM, then you can get the correct information back by looking through the data, instead of directly asking the LLM, which may not have the correct or up-to-date information.
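The caching idea mentioned above can be sketched roughly as follows. This is one reading of the flow, not the OSS Chat or GPTCache implementation: the `SemanticCache` class, the similarity threshold, the toy two-dimensional embeddings, and the `fake_llm` stub are all assumptions introduced for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    """Hypothetical cache keyed by question embeddings instead of exact text."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def lookup(self, embedding):
        best_answer, best_score = None, self.threshold
        for stored_embedding, answer in self.entries:
            score = cosine_similarity(embedding, stored_embedding)
            if score >= best_score:
                best_answer, best_score = answer, score
        return best_answer

    def store(self, embedding, answer):
        self.entries.append((embedding, answer))

def answer_question(embedding, cache, ask_llm):
    """Return (answer, was_cached); only call the LLM on a cache miss."""
    cached = cache.lookup(embedding)
    if cached is not None:
        return cached, True
    answer = ask_llm()
    cache.store(embedding, answer)
    return answer, False

llm_calls = []

def fake_llm():
    # Stand-in for a real (and costly) LLM call.
    llm_calls.append(1)
    return "stub answer"

cache = SemanticCache()
first = answer_question([1.0, 0.0], cache, fake_llm)     # miss: calls the stub
second = answer_question([0.99, 0.05], cache, fake_llm)  # near-duplicate: cache hit
third = answer_question([0.0, 1.0], cache, fake_llm)     # unrelated: miss again
```

Here the second question is close enough to the first that the stub LLM is not called again, which is where the cost saving would come from.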

Okay, so let's talk a little bit about vector embeddings. We're gonna talk about why these vectors can help solve this hallucination issue by talking about what vectors are and how you create them. This framework helps solve hallucination by injecting domain knowledge, giving the LLM access to domain knowledge, and then doing semantic search on the domain knowledge via your vector embeddings. That's what the vector database takes care of. So this is a very early, very seminal paper about vectors and vector embeddings. It's just showing you that you can do math with words, right? The whole idea behind a vector embedding is we're gonna use some numbers to represent some sort of object, and then we're gonna do math on these objects using these numbers.

So we're doing math on words, math on images, math on videos, whatever. In this example, it's just showing you that if you start with the word "queen," subtract the word "woman," and add the word "man," you get "king," right? Math on words. Most vector embeddings you're gonna see nowadays are not two-dimensional. They're usually 700-something, 300-something, 1,500-something, 1,000-something dimensions. They're much, much larger now. So in practice, the way that we implement this is we start with our knowledge base, our documents. We transform our documents, we chunk them, whatever, and we throw them into a deep learning model.
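The "math on words" example above can be reproduced with hand-made two-dimensional vectors. The coordinates below are invented so that the arithmetic works out exactly (first axis roughly "royalty," second axis roughly "gender"); real embeddings are learned and have hundreds of dimensions.

```python
def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def subtract(a, b):
    return tuple(x - y for x, y in zip(a, b))

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_word(vector, vocabulary):
    """Return the word whose embedding is closest to the given vector."""
    return min(vocabulary, key=lambda word: squared_distance(vocabulary[word], vector))

# Hypothetical toy embeddings, chosen by hand for illustration.
embeddings = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
}

# queen - woman + man
result = add(subtract(embeddings["queen"], embeddings["woman"]), embeddings["man"])
print(result, nearest_word(result, embeddings))
```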

And then we actually take the second-to-last layer here. Excuse me. Traditionally in deep learning models, you get some sort of output, which is usually some sort of classification layer. The vector embedding is actually the second-to-last layer. So you take your knowledge base, you run it through a deep learning model, and you just strip the last layer and take the outputs of the second-to-last layer.

That's your vector embedding, your representation of your object, of your knowledge base, and then you put that into a vector database like Milvus. Okay? So what is a vector database, right? We've been talking about how to use it, but let's take a look at what they are. The way we like to define a vector database here at Zilliz, or Milvus, is a database purpose-built to store, index, and query large quantities of vector embeddings. And the advantage of Milvus specifically is that it works really, really well at scale.

And the reason why it works really well at scale is because... okay, wait, it's not the next slide. Okay, well, I'll cover that in a second. But let's talk about why you want a purpose-built vector database. If you want to work with vectors, you don't actually need a vector database. You could just use a vector search library like FAISS, or HNSW, or Annoy, or something like that, some sort of vector indexing strategy. Vector databases really provide a lot of different things that you would need on top of vector search.

If you were to use this in production, if you were to actually use this in practice, you would need things like filtering, filtered searches, chained filters, hybrid searches, backups, replication, scalability, horizontal and vertical scalability, sharding for your streaming data, aggregated search, lifecycle management, upserts, deletes, inserts, stuff like that. You would also have to be able to handle high query loads or high insertion and deletion rates, depending on your use case. You need to be able to tune your recall and your indexing strategy for that. And then, for example, we have Nvidia GPU support, so it operates faster on those GPUs, and we have really billion-scale storage, which is what makes Milvus such a unique vector database. So let's cover some vector indexing strategies, some of these vector search libraries, basically. I'm gonna cover a couple of these, and then I'm gonna dive into what the Milvus architecture is like and why it's so powerful.

One of the common vector indexing strategies is called Approximate Nearest Neighbors Oh Yeah, or Annoy. It's from Spotify. Basically what you do is you pick any two points in your vector space and you split the space in half. There are some limitations here: you don't want to be picking two points that split the space into two halves where you have only one vector on one half; that's obviously an issue. But for most cases, you just take two random vectors, you split the space in half, and then you do it again and again and again until you don't have any more halves to split, or until you have, like, five vectors in each section or something like that.
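The repeated splitting described above can be sketched as a small recursive function. This is a simplified illustration of the idea, not Annoy's actual code: assigning each point to whichever of the two sampled points it is closer to is used here as a stand-in for splitting by the hyperplane between them, and `build_tree`, `leaf_size`, and `max_tries` are names chosen for the sketch.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_tree(points, leaf_size=5, rng=None, max_tries=10):
    """Recursively split points in two until each leaf holds at most leaf_size."""
    rng = rng or random.Random(0)
    if len(points) <= leaf_size:
        return {"leaf": points}
    for _ in range(max_tries):
        a, b = rng.sample(points, 2)  # two random points define the split
        left = [p for p in points if squared_distance(p, a) <= squared_distance(p, b)]
        right = [p for p in points if squared_distance(p, a) > squared_distance(p, b)]
        if left and right:  # retry if one side came out empty (e.g. duplicates)
            return {
                "left": build_tree(left, leaf_size, rng, max_tries),
                "right": build_tree(right, leaf_size, rng, max_tries),
            }
    return {"leaf": points}  # give up splitting; keep the points together

def leaves(tree):
    """Collect the point lists stored at the leaves of the tree."""
    if "leaf" in tree:
        return [tree["leaf"]]
    return leaves(tree["left"]) + leaves(tree["right"])

data_rng = random.Random(42)
points = [(data_rng.random(), data_rng.random()) for _ in range(50)]
tree = build_tree(points)
print([len(leaf) for leaf in leaves(tree)])
```

A search would walk down from the root, following the side the query falls on, and then compare against only the handful of points in the leaf it reaches.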

So this builds a binary tree, basically, and that is how you do that search. Then there's the inverted file index, which looks like a Voronoi diagram. Basically you do a centroid kind of thing using k-means, excuse me, to cluster all of your points around some number of centroids that you want. Then the way you would search it is you would start with the n closest centroids and then search within each of those centroids' clusters for your closest points. So you can see how this will reduce your search space by allowing you to search smaller portions, similar to Annoy.
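The inverted file search described above can be sketched like this. To keep the example short, the centroids are fixed by hand rather than computed with k-means, and `build_ivf`, `ivf_search`, and `nprobe` (the number of closest centroids to search) are names assumed for the sketch.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(points, centroids):
    """Assign every point to the list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for point in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(point, centroids[i]))
        lists[nearest].append(point)
    return lists

def ivf_search(query, centroids, lists, nprobe=1, top_k=1):
    """Search only the nprobe lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda i: squared_distance(query, centroids[i]))[:nprobe]
    candidates = [point for i in probe for point in lists[i]]
    return sorted(candidates, key=lambda p: squared_distance(query, p))[:top_k]

# Hypothetical centroids (a real index would learn these with k-means).
centroids = [(0, 0), (10, 10)]
points = [(0, 1), (1, 0), (9, 10), (10, 9), (11, 11)]
lists = build_ivf(points, centroids)

print(ivf_search((9, 9), centroids, lists, nprobe=1, top_k=1))
```

With `nprobe=1`, the query is compared against three points instead of five; raising `nprobe` trades speed for recall.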

Then there's HNSW, or Hierarchical Navigable Small Worlds. This is a really, really popular graph-based vector indexing strategy. Basically, as you insert your data points, you create a graph where each one points to the closest data point. And as you insert each data point, you generate a uniform random variable, somewhere between zero and one, that gets assigned to it. Then you have cutoffs that tell you what layer it's going to.

So, for example, if your cutoff is 0.9, then anything below 0.9 is only in layer zero. If you have somewhere between 0.9 and 0.99, you get to be in layer zero and layer one. If you have between 0.99 and 0.999, then you get to be in layer zero, layer one, and layer two. So basically the layers are assigned by random variables, and it's a graph-based index that points to the nearest neighbor. The way it's searched is you start at the top layer and find the nearest one, and you search all the way down. As you can imagine, this would cut down your search time, because (a) the index is already built, and (b) you get to search a very small space from the start, and then a little bit larger space as you go down. The only drawback for this is really that the index is larger than your actual data.
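The layer assignment described above can be sketched with the cutoffs from the talk (0.9, 0.99, 0.999). This follows the speaker's simplified description; HNSW implementations typically derive the layer from an exponentially decaying distribution instead of fixed cutoffs. The helper names `top_layer` and `layers_for` are hypothetical, and treating draws of 0.999 and above as reaching a fourth layer is an extrapolation of the pattern.

```python
import random

def top_layer(u, cutoffs=(0.9, 0.99, 0.999)):
    """Highest layer a point reaches, given its uniform draw u in [0, 1)."""
    return sum(1 for cutoff in cutoffs if u >= cutoff)

def layers_for(u, cutoffs=(0.9, 0.99, 0.999)):
    """A point appears in every layer from 0 up to its top layer."""
    return list(range(top_layer(u, cutoffs) + 1))

rng = random.Random(0)
counts = {}
for _ in range(10_000):
    layer = top_layer(rng.random())
    counts[layer] = counts.get(layer, 0) + 1

# Most points stay in layer 0; each higher layer is roughly 10x sparser.
print(dict(sorted(counts.items())))
```

Because the upper layers hold so few points, the search can take big jumps there before refining in the denser layers below.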

Okay, so let's look at what Milvus's architecture is like. Why is Milvus fast? Why is Milvus useful at scale? With a couple hundred vectors, it's really all the same. But when you get to scale, you have to think about things like, how big is the space that I'm performing all these vector comparisons on, right? When you do retrieval, you have to do some sort of comparison against the existing vectors, and that requires some sort of matrix multiplication or something like that. So the way Milvus works is that there are these three different nodes that spin up every time you want to do something. The index node is a special node: when you want to do your indexing, which is usually when you add data in, you use the index node, but this is not something that's used very commonly. Then there are the other two nodes: the data node, which holds the data that you need in memory, and the query node, which performs the query and returns the data that you need. Outside of these nodes that do the actual work, the actual processing, in Milvus we have object storage, which is basically MinIO, S3, Azure Blob, whatever.

This is where we store the data in the vector database, and you probably don't really need to worry about any of this stuff. One of the reasons why Milvus is really fast at scale is because it does this thing called segments, where it puts all your data into these 512-megabyte segments, and it seals these segments every time they hit 512 megabytes. It's able to search these segments in parallel, so when you query, let's say, something that's a hundred gigabytes, you are able to do basically 512-megabyte queries. Then at the end you coalesce them and compare the results across the segments, but you don't need to search the entire hundred-gigabyte space, which is what you would do if you didn't have these segmented blocks.
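The segment idea above can be illustrated with a little arithmetic and a toy parallel search. This is a sketch of the general pattern (search each segment independently, then merge the partial results), not Milvus's implementation; the one-dimensional "vectors", the thread pool, and the function names are assumptions made for the example.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor

SEGMENT_MB = 512

def segment_count(total_gb):
    """How many sealed 512 MB segments a dataset of total_gb would fill."""
    return math.ceil(total_gb * 1024 / SEGMENT_MB)

def search_segment(segment, query, top_k):
    """Brute-force search inside one segment; returns (distance, id) pairs."""
    hits = [(abs(value - query), vec_id) for vec_id, value in segment]
    return heapq.nsmallest(top_k, hits)

def search_segments(segments, query, top_k):
    """Search every segment in parallel, then merge the partial top-k lists."""
    with ThreadPoolExecutor() as pool:
        partial = pool.map(lambda seg: search_segment(seg, query, top_k), segments)
        merged = [hit for hits in partial for hit in hits]
    return heapq.nsmallest(top_k, merged)

# 100 GB of data split into 512 MB segments.
print(segment_count(100))

# Toy segments of (id, value) pairs standing in for stored vectors.
segments = [
    [("a", 1.0), ("b", 5.0)],
    [("c", 2.5), ("d", 9.0)],
    [("e", 3.1)],
]
best = search_segments(segments, query=3.0, top_k=2)
print([vec_id for _, vec_id in best])
```

Each worker only ever looks at one segment's worth of data, and the final merge step touches just `top_k` candidates per segment.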

So if you think about that, you can get it done with one two-hundredth of the operations needed, plus a little bit extra. So maybe more like one one-hundredth, but as you scale your data size, this becomes more and more noticeable. Okay, so that was, oh yeah, right about on time, about 20 minutes. So we're gonna look at a quick demo.

I'm gonna pull up some code, and then we're gonna do a code walkthrough. Okay, we're gonna build a multi-document Q&A system using LlamaIndex, LangChain, and Milvus, and the data that we're gonna pull from is from Wikipedia. And Emily, can you drop the link to the Colab notebook in the chat? Yes. Okay, there it is.

There's the Colab notebook. This is on Colab, so you can go in there and follow along. I'm gonna make my screen a little bit bigger, and I hope this fits correctly. For this, all you really need right now to follow along is Python 3.10 or 3.9. I'm using 3.10. I would suggest you go ahead and make some new directory, and then set up some sort of virtual environment, blah, blah, blah, and get your virtual environment started.

And I'll give you a little bit of time to do that. Oh, actually, I don't have my virtual environment started on here. So what I'm actually gonna do... okay, so now I have my virtual environment there. I thought I had that installed already. And then activate it.

And then what we're gonna do is install the libraries that we need for this. So we're just gonna pip install llama-index, langchain, milvus, pymilvus, and I think we need requests. python-dotenv, that's what I use to handle my environment variables. You don't necessarily need python-dotenv if you want to just throw your OpenAI API key into your notebook, but I have mine in my .env file, so I'm handling it with python-dotenv. And then we probably need openai and ipykernel, which is automatically installed when you're running this through... oh, you know what? I actually missed one.

We need NLTK as well. NLTK's stopwords package is what, I believe, LlamaIndex is using to do the query decomposition. Query decomposition is this thing where you break down the query into smaller parts, and that's how you do the multi-document Q&A. So let's also install that. Let's just install it here.

Okay, cool. So now we have NLTK. And this part, I'm just gonna copy and paste from the notebook. We need to install the stopwords package from NLTK. With this one, you'll sometimes get these SSL errors, which is why I am doing it programmatically and not by just starting a new Python interpreter and doing something there.

Okay, let's see. We have questions. "Please reshare the Colab link." You lost the Colab link. Okay, I guess you can reshare it. Yeah.

And feel free to stop me if you have any questions as I'm doing this. This is meant to be an interactive kind of workshop. Okay? So now we're gonna do our imports. From LlamaIndex, we're gonna import a bunch of things. First we're gonna import GPTVectorStoreIndex.

Then what are we gonna import? Oh, the simple keyword table index, GPTSimpleKeywordTableIndex. I don't know why this is not highlighting for me. I would've expected this to... oh, do I do llama_index? Maybe this will highlight for me. Okay.

I don't know why it's not giving me code completion, but it's not. Oh, there's a question in the chat: is there an associated GitHub repo? There is nothing on GitHub right now. I can put this on GitHub. Actually, this should be going on GitHub in bootcamp; Milvus has a bootcamp repo, and this should be going on there. So I'll put that on there afterwards, and you can check it out there as well.

"Why do we need LlamaIndex if we have pymilvus?" LlamaIndex is what we're gonna be using to do the query decomposition. So what we're actually gonna build here... oh, yes, I should probably cover the architecture of what we're gonna build. That would've been a good idea. What we're actually gonna build here is we're gonna take these documents and load them up into these vector stores, right? And then we're gonna use a keyword index, which is gonna be generated by LlamaIndex, that's gonna route you to the correct set of vectors. So that's why we need LlamaIndex, even though we're using Milvus. John, I'm gonna answer your question later.
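The routing idea described above, where a keyword index decides which per-document vector store a question should go to, can be sketched in plain Python. This is a toy stand-in, not LlamaIndex's keyword table index: `route_query` and the hand-written keyword sets are hypothetical.

```python
import re

def route_query(query, index_keywords):
    """Return the titles whose keyword sets overlap with the words in the query."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    return [title for title, keywords in index_keywords.items() if words & keywords]

# Hypothetical keyword table: one entry per document-level vector index.
index_keywords = {
    "Seattle": {"seattle"},
    "Toronto": {"toronto"},
    "Houston": {"houston"},
}

print(route_query("Compare the airports in Seattle and Toronto", index_keywords))
```

A question that mentions two cities is routed to both indices, which is the situation query decomposition is meant to handle: each sub-question is answered against its own set of vectors before the answers are combined.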

Please use the Q&A for that. That is a much more general question, and it's not about the code. But if you have questions about the code, I will answer them as I'm doing this. Okay? Now we're gonna import something from LangChain. This is the only LangChain import that we're gonna make.

And you actually can do something else for this, but we're just gonna do this because I like it. Import OpenAI... yeah, OpenAIChat. That's right. Okay.

Now we're basically gonna set up our vector database and stuff. So we're going to import os, and from dotenv we're gonna import load_dotenv. I really wish I had my code completion here. So this loads up the .env file, and then getenv gets my OpenAI API key. That sets up our access to OpenAI. And then we're gonna set up Milvus, so from llama_index.vector_stores, MilvusVectorStore.

And then we're also gonna need the actual Milvus, so from milvus import default_server. Then we're gonna start the default server, and create the vector store, MilvusVectorStore, with host equals localhost, 127.0.0.1, and the port is gonna equal default_server.listen_port. Okay? So this should take somewhere in the... oh, well, that was actually a little bit faster than I thought it was gonna be. I thought it was gonna be like 10 seconds. Okay. But this basically spins up a local instance of Milvus, or Milvus Lite, which was actually created by a third-party maintainer, an open source guy who doesn't work at Zilliz.

So shout out to matrixji. If you know him on GitHub, give him a shout out. "I'm having a version conflict." See which versions... see if you can get the right grpcio version on there. If I had more time, I would be able to stop and work this out with you live.

But unfortunately this is a large session. So what we're gonna do now is get the wiki stuff. This is unimportant code; it's not technically related to building your LLM app. This is just the data-gathering section. So I'm just gonna copy and paste this code.

We're getting places. We're getting info on Toronto, Seattle, San Francisco, Chicago, Boston, Washington, DC, Cambridge, Massachusetts, and Houston. And if anybody wants to type in a city to put on here in the next 10 seconds, I'll wait as I paste it in. Try to pick some... oh, Dubai. Okay, Dallas. Seattle's already on there, so we'll put Dubai on here, and we'll put Dallas on here. Is Dallas already on here? Seattle is on here.

Okay, cool. So there's a lot of cities being shouted out, but these are the ones that we're gonna go with. Okay? So now this part is literally the scraping code section.

So I'm just gonna copy and paste this. I'll walk through it, but you don't need to understand what's going on here. You just need to copy and paste this code and put it in here, okay? Basically what we're doing is sending a GET request to each of the Wikipedia pages that are listed here in this set. Then we are putting them together using this, this is a yield function, I'm sorry, a generator function, and we're putting all of that text together, and then we're gonna save it to a file locally. Okay? This should only take a few seconds, and we can see what this looks like, right? Boston looks like this.

Cambridge, Massachusetts; Chicago; Dallas; Dubai; Houston; San Francisco; Seattle; Toronto; Washington, DC. Okay. So I'm in Seattle. It's my favorite place. Actually, I guess I'm currently closer to San Francisco, but I live in Seattle. Okay.

So once we have all the documents, now we're gonna start getting into building the actual index. Okay? So let's get the docs. We're gonna create an empty dictionary for the documents, and then we're gonna walk through all of the titles. So for title in titles, city_docs[title], and we're gonna set this equal to... we're gonna use SimpleDirectoryReader here. And we're gonna read in some input files. So the input files are gonna be /data/... oh, this should be an f-string... title.

Geez. Oh, this actually needs to be in a list, 'cause input_files expects a list. And then we're gonna load that data. Unhashable type: list. What did I do? What did I do? Slash... maybe I have to do this.

Oh, this is different than what I had before. Okay, what did I do wrong? Did someone see this? input_files, f-string, data slash wiki title. Okay. Okay. Well, I have the right code for this here.

So now let's create the service context, the storage context, and the actual chatbot. For this, we're gonna use an LLM predictor, llm_predictor_chatgpt. This is basically the chatbot; this is the thing that links you to the LLM. And for the LLM we're gonna use here, llm, we're gonna set this equal to... this should be OpenAIChat, I believe.

Yes, OpenAIChat, which is the LangChain piece that we imported earlier. Temperature: we're gonna set temperature equals zero. And I think we need the model name. Yes, the model name, and that's gonna equal GPT-3.

Okay? And then we need the service context, a ServiceContext, which uses the LLM predictor, llm_predictor_chatgpt. And we need a storage context, which tells LlamaIndex basically how we're storing our files, or how we're storing our vectors, and vector_store equals vector_store. Okay? And, okay, cool.

Nothing going crazy here. So now we're gonna build the index, or the indices. What's going on in chat? Oh, okay. Oh, I had city. Oh, I had wiki titles.

Okay, cool. Thank you, Herve. Sorry. Okay, let's build this city index.

So city indices, and wiki summaries, or index summaries, is what I called them before. So the index summaries are the keyword things that we use to basically route our queries. Oh no, sorry: for wiki title in wiki titles. We're gonna loop through all of these, and we're going to create a GPTVectorStoreIndex from them. So let's see, city indices at wiki title. Okay.

So this time I did wiki title, no s, so hopefully it's not gonna be mad at me. GPTVectorStoreIndex, from documents. I believe we give it the city docs here: city docs at wiki title, service context equals service context, storage context equals storage context. I believe that's all we need for that part. And then we need to set the index summary.

So index summary at wiki title is equal to, I believe we should use an f-string. Yeah: "Wikipedia article about" wiki title. So this is how we get started creating our index. Well, I'm really glad to see that you guys are helping each other figure out these errors. Okay, so now what we need to do is we need to get the ComposableGraph object.

So what we do here is we just import the ComposableGraph from LlamaIndex, and this is the part where we create the graph that composes the indices, basically. So graph equals ComposableGraph, and we'll start with GPTSimpleKeywordTableIndex, which basically tells it that we're gonna be using this keyword table index for that. And then I believe this would be city indices. Oh, my nose. Okay.

And then we also need the summaries, and then we'll just say max keywords per chunk is equal to 50. Is that not the right one? Max keywords per chunk? Oh wait, that's because I did this wrong. Okay, there we go. Okay. Alex Ashkin, does anybody have the Milvus error? Yeah, if you have that, I would just try starting it up again. If you've started Milvus before, and you're running locally and you didn't do a stop at the end of that file, you're gonna run into some errors that some of the threads are going to go.

Rod, Sam: anyone gone past setting up the OpenAI key? Okay, I don't know what to say about that. Jerry Liu, not Jerry Liu the creator of LlamaIndex. Jerry Liu. Okay.

What is the index for underscore, blah, blah, blah? Okay, so underscores are basically used to represent unused variables in Python. So actually, what's going on here is, we could actually probably just call city indices values. City indices items returns two things, basically: it returns a key and a value, and we only need the value and we don't need the key. So yeah, that's why. So Daniel Crosby says: playing catch up here.
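The underscore point from a moment ago can be shown in a few lines of plain Python. The dictionary contents are placeholders; only the name `city_indices` comes from the session.

```python
# Hypothetical stand-in for the notebook's dictionary of per-city indexes.
city_indices = {"Seattle": "seattle-index", "Dubai": "dubai-index"}

# .items() yields (key, value) pairs; "_" marks the key as unused.
from_items = [index for _, index in city_indices.items()]

# .values() yields only the values, so no throwaway variable is needed.
from_values = list(city_indices.values())

print(from_items == from_values)
```

Both loops visit the same values in the same order, so which one to use is a readability choice.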

Did you run into a UnicodeEncodeError? Maybe. I don't know if you're using the same Wikipedia articles I'm using, but that seems like a Unicode character, and you might want to see if you can mess around with this to get it to encode into UTF-8. Okay, so let's get through the rest of this. Let's try to finish this up. We're gonna also import the query transform, which is gonna be the function that does the actual query transformation. What's going on here? Okay, so the DecomposeQueryTransform basically sets a, oh wow, that's not a dictionary, sorry.

That sets the LLM to decompose your query, to split your query into multiple parts. ChatGPT, and we'll do verbose equals True. You don't actually need to do verbose equals True.

This just shows you how the queries are being split. And then the last thing we need to import is the transform engine. And in this case, what we're actually gonna start with is some custom query engines, and we're gonna set that to an empty dictionary. Okay, so let's loop through the indexes. For index in city indices values, we'll set a query engine first.

So, a query engine. We're gonna take the decomposed queries and use a query engine on them in this section here. So index as query engine, and then we'll give it the service context: service context equals service context. And oh, there's a bunch of questions in chat.

Will the recording be shared? Yes. Is there a valid OpenAI key? So Raj, yes, the OpenAI key: you have to go to OpenAI to get an OpenAI API key in order to build something like this. Oh, perfect. There you go, Daniel.

And Isaac, yes, the recording will be shared. Okay. And now we're gonna also give some extra info. So extra info, and that's gonna equal, we're gonna give that index summary, or summary, I think, and that's gonna equal index dot index struct. So this actually just passes an object onto it that basically gives it the index struct.

This, the summary of the index. Transformed query engine is equal to the TransformQueryEngine, and we're gonna give that the query engine and the decompose transform. There it is. And we're also gonna pass in the extra info: transform extra info equals extra info. And finally, the custom query engine, oh no, oops, custom query engines at index dot index ID.

Index ID, so this actually will just return a string that is, like, an index ID, basically. And that is the transformed query engine. Okay? So I'll let that go and then I'll answer some more questions. Quick question: is OpenAI-provided API access to the GPT-4 model paid? Yes, it is paid. Is the query engine going to be able to add weights to the wiki pages to filter better? I think you can do that, but we're not gonna do that in this.

We're just gonna get it to work at a very basic level first, to make it so that it can decompose the queries. If you want to try to add things, you can go look into the LlamaIndex docs and see how you can do that. Okay? Custom query engines, and we're gonna go to graph. So this is the graph that we created earlier, right? This is the graph that starts with a keyword index. Index ID, so that's equal to graph root.

So is it root ID or root index? So we're getting root index, as query engine, and retriever mode. Is that all right? Retriever mode, it's not happy with me about this. Well, response mode, tree summarize, service context equals service context, and storage context here as well. Yeah, storage context.

Okay, so I'm actually gonna go up here. We're gonna change this. I think this also should be given the storage context. Okay, so now we have this custom query engine, and then all we need to do is we need to make the decomposition engine. So just a couple more lines of code.

And then I will answer your questions. Let me just get these couple lines of code on here so you can kind of see how this works. Graph dot as query engine, and we'll just need to pass in the custom query engines. So custom query engines equals custom query engines, and that should work. Okay, so now we should get a response, and that should equal query engine dot query.

And I'm actually gonna let you guys type in a response. Ask about, you know, any of these cities here. Maybe we'll ask it to compare and contrast the weather in Seattle and Dubai, or something like that. This is a query engine. Where does GPTCache fit into this? We're not using GPTCache in this one. GPTCache is an open source cache for your GPT apps that uses, you know, Milvus.

But in this one, we actually aren't gonna be caching anything, because we're not doing any examples where we're pinging the LLM enough to have, like, an FAQ. Can Milvus support adding weights to docs, say, on author or date? I don't think so. Okay, so let's compare and contrast the weather in Seattle. Compare, oops, compare and contrast the weather in Seattle and Dubai. And now we wait, and it's gonna show us what the breakdown of the question is, and then it's gonna give us an answer.

Oh, I actually have to print out the response in a second, but let's see. Response, we'll just load that up and let it go. Okay? And then there it goes. It tells you Seattle receives precipitation on 150 days within the year, often has light drizzle, exact annual rainfall in Seattle is not provided. Dubai has a high average temperature, et cetera, et cetera, blah, blah, blah.

Okay? So, you know, therefore the weather in Seattle is characterized by frequent precipitation, while Dubai has high temperatures throughout the year. Yeah, so it's raining in Seattle and it's hot in Dubai. Do you have a solution for governed data on a vector database? I'm concerned about PII data leaking. Do I have a solution for governed data? I mean, you can host your own data. Like, Milvus can be used to access your local data if you want to do that. Otherwise you can just use Zilliz Cloud.

But if you're concerned about PII stuff, maybe you just want to host your own data, and just host it locally and use something like, maybe not Milvus Lite. Milvus Lite is mainly used for building this kind of proof of concept in a notebook. You probably wanna use Milvus Standalone, and you can do that with Helm or Docker Compose. Emily, if you have those links, it'd be great to be able to drop those in the chat. So yeah, that was all of this basic tutorial on how to create your own multi-document Q&A. And for the rest of this time, if you have some debugging help that you need, I can kind of take a quick look at anything that might be simple, or if you have any questions about vector databases, feel free to ask.

Okay, lots of questions. Do you think this could also be used for a recommender system? You mean vector databases? Yes, because that is actually one of the primary uses of vector databases in production. For example, eBay. What would be the solution to cite the source documents? You could just have it return, sorry, my nose is itching. You could have it set up so that it returns the source document.
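One way to picture returning the source document with the answer is below. This is only a sketch under assumptions: `Chunk`, `answer_with_sources`, and the example.com links are hypothetical names and placeholder data, not part of LlamaIndex or Milvus. The idea is simply that each stored chunk keeps the link it came from, and the links of the chunks used are returned next to the answer.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source_url: str  # saved alongside the text at ingestion time


def answer_with_sources(answer, chunks):
    """Bundle an answer with the de-duplicated sources of the chunks used."""
    sources = []
    for chunk in chunks:
        if chunk.source_url not in sources:
            sources.append(chunk.source_url)
    return {"answer": answer, "sources": sources}


# Placeholder retrieval results; the texts and URLs are made up.
retrieved = [
    Chunk("Seattle often has light drizzle.", "https://example.com/seattle"),
    Chunk("Seattle has many rainy days.", "https://example.com/seattle"),
    Chunk("Dubai is hot year-round.", "https://example.com/dubai"),
]
result = answer_with_sources("Seattle is rainy; Dubai is hot.", retrieved)
print(result["sources"])
```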

I guess you could save the link if you wanted to and then just have that return with your answer. Thanks, Ian. Thanks, Herve. Herve: this is amazing what you showed in this example. Are we fine-tuning the GPT-3.5 Turbo model with our wiki datasets, and only after that GPT-3.5 Turbo is able to answer? Okay, so, casing, we are not fine-tuning anything. We are doing data injection. So what's happening here is, this is your LLM, you know, it's like this little big box. And on top of that we have a vector database, and on top of that we have the querying and the indexes.

So what happens is we take some data and we put it into the vector database, and then you come in, you hit this query box, and the query goes into the vector database and it says, hey, do I have an answer? And it says, okay, I have an answer. Oh no, so actually it's a little bit more complicated than that. The query goes into the index, and then the index calls GPT and says, how do I break this query down into multiple queries? So for example, compare and contrast the weather in Seattle. Then GPT responds and it says, okay, what is the average rainfall in Seattle? Or, you know, what is the average temperature in Dubai? And then it uses that, and it pings the vector database with these questions, and then it compiles these questions together, and then uses ChatGPT to get the answers from those questions and formulate a real response, and sends it back to you. So there's no fine-tuning going on, it's just data injection.

About PII: my take is, if your system doesn't have public endpoints, slipping of NPI would be low, and you can use DLP cloud solutions. I have no comments on this.
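Going back to the data injection flow described above, here is a toy sketch of the three steps: decompose the question, look up each sub-question, then formulate one response. Everything in it is a stand-in chosen for illustration. `decompose`, `retrieve`, and `synthesize` are hypothetical functions, a dictionary plays the vector database, and no LLM is called, so it shows the control flow only, not the notebook's implementation.

```python
def decompose(question):
    """Stand-in for the LLM call that splits a question into sub-questions."""
    cities = [c for c in ("Seattle", "Dubai") if c in question]
    return [f"What is the weather in {city}?" for city in cities]


# Stand-in for the vector database: topic -> stored text (made-up content).
STORE = {
    "Seattle": "Seattle has frequent precipitation.",
    "Dubai": "Dubai has high temperatures throughout the year.",
}


def retrieve(sub_question):
    """Stand-in for a vector search; a real one compares embeddings."""
    return [text for city, text in STORE.items() if city in sub_question]


def synthesize(question, contexts):
    """Stand-in for the LLM call that writes the final answer from context."""
    return " ".join(contexts)


def answer(question):
    sub_questions = decompose(question)        # step 1: break the query down
    contexts = []
    for sub_question in sub_questions:         # step 2: one lookup per sub-question
        contexts.extend(retrieve(sub_question))
    return synthesize(question, contexts)      # step 3: formulate the response


final = answer("Compare and contrast the weather in Seattle and Dubai")
print(final)
```

Note that the model weights never change anywhere in this loop, which is the sense in which this is data injection rather than fine-tuning.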

I don't know what this is. I know you have covered it, but can you please help me understand what is LlamaIndex and what's the role here? Okay, so LlamaIndex is a framework for interacting with your data and an LLM. So in the CVP framework slide that I showed earlier, LlamaIndex is like the C and the P. So the role here is that LlamaIndex is what we use to create these indices, like this keyword index, this vector store index, and what we use to route those queries through to the right vector stores. How many tokens are involved in this scenario? Yeah, I actually don't know, because I didn't put a token counter here. Now, I'm sure there is actually a way to put a token counter here, because LangChain allows you to count tokens, but I dunno, 'cause I didn't do it.

Okay, E Brushy, answer live. Is there a way to know if the reply was generated using data from the previously existing knowledge of the model, and what is fetched from the vector database? So in this case, it is using data just from the vector database. What you could do is you could put into your code somewhere some sort of clause that says, if you don't find this in the vector database, ping the LLM. But what we're actually doing is we're just using the LLM to break down the question and the query into the decomposed queries, I guess, into the simpler queries. And then we're using it to formulate the answers.

Soya Pavon: can you query directly without a vector database in the middle? What is the benefit we are getting on top of GPT? Is indexing necessary here? Can I query, like, can I query, I'm gonna add this to this live.

One sec. Can I query GPT directly? I mean, technically, yes. Do you need a vector database? Is indexing necessary? So the benefit that you get on top of GPT is that you get to inject up-to-date knowledge and domain-specific knowledge. So, you know, I will yield here that perhaps cities are not the most domain-specific things, but it's not like I can go and grab, you know, very specific knowledge to show as a public example. And indexing, I mean, indexing is something you would have to do anytime you have vectors and anytime you want to compare things semantically.
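To make "compare things semantically" a bit more concrete, here is a small sketch of cosine similarity over vectors. The three-dimensional vectors are made up for illustration (real embeddings come from an embedding model and have far more dimensions), and `most_similar` is a hypothetical brute-force search; a vector index exists so that a database does not have to scan every vector like this.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for three stored pieces of text.
vectors = {
    "Seattle weather": [0.9, 0.1, 0.0],
    "Dubai weather": [0.1, 0.9, 0.0],
    "Seattle airport": [0.7, 0.0, 0.7],
}


def most_similar(query_vector):
    """Brute-force nearest neighbor: score every stored vector, keep the best."""
    return max(vectors, key=lambda name: cosine_similarity(vectors[name], query_vector))


best = most_similar([0.8, 0.2, 0.1])
print(best)
```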

So for example, if you asked GPT this question, it might be able to get it. But maybe we can ask something like, compare and contrast the airports in Seattle and Dubai. And in this case, you know, I don't actually know, maybe there's been a new airport built since 2021, in which case we would need that updated knowledge, and querying GPT alone would not be enough. In that case, you would definitely need to have that vector indexing. Does this data injection prevent or minimize undesirable hallucinations? Yes, that is the whole point.

Oh, Emily. Okay. Well, anyway, that is the whole point. The idea is that you basically have all of your data ready, and you query your existing data, and then, you know, you never go to ChatGPT and ask, hey, blah, blah, blah, can you do this? Can you answer this question? ChatGPT just helps you break down the query and formulate a response. Can you use this method to validate GPT output? I guess you could.

Yeah, if you have all the right answers and you ask GPT and you want to, you know, make sure that GPT has the right answers, you could use this kind of method. Do we still need to add additional guardrails on how accurate this is? Additional guardrails for, I mean, maybe. Like, for example, GPT is still biased. You know, it's still trained on data on the internet that is still biased. So in that sense, guardrails, yes. In another sense, we're limiting the prompts that get sent to it. So it's not like we're asking it to take over the world. So no need for guardrails on that.

What are the biggest mistakes companies make while building ingestion pipelines for the data from multiple sources to Milvus? You know, Julian, that is a great question, because that is something that I've also personally been thinking about. 'Cause I'm like, hey, these apps are really cool, but how do we keep them updated? How do we keep the data updated? And I actually don't know the answer to that question. I wish I could tell you, but I dunno the answer. That is something I'm also curious to find out myself. Can we restrict bad questions? Can we filter wrong queries? I mean, yeah, it's your system. You can always put in whatever things on there that say, like, hey, don't answer questions about Dallas.
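A filter like the one joked about above could be as small as a check that runs before the query engine is ever called. This is a sketch under assumptions: `BLOCKED_TOPICS`, `is_allowed`, `guarded_query`, and `fake_engine` are hypothetical names, and simple word matching stands in for whatever policy a real system would need.

```python
BLOCKED_TOPICS = {"dallas"}  # hypothetical deny-list, following the Dallas joke


def is_allowed(query):
    """Return False when the query mentions any blocked topic."""
    words = {word.strip(".,?!").lower() for word in query.split()}
    return words.isdisjoint(BLOCKED_TOPICS)


def guarded_query(query, run_query):
    """Only forward the query to the real engine when it passes the filter."""
    if not is_allowed(query):
        return "Sorry, I can't answer questions about that."
    return run_query(query)


def fake_engine(query):
    """Stand-in for a real query engine call."""
    return f"(answer to: {query})"


print(guarded_query("What is the weather in Dallas?", fake_engine))
print(guarded_query("What is the weather in Seattle?", fake_engine))
```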

Don't like that area, whatever. Will tools still hallucinate if you ask prompts outside of the injected data? From E Brushy. No. Hamza: at what point does fine-tuning make more sense than context or data injection? If you have a lot of money and a lot of data and you're willing to fine-tune your model often. Ron: some guardrails come from ingestion from corporate sources that have already been approved from past marketing and sales language. This isn't a question. I don't know what to say about this.

Okay, so it looks like that's all the questions that we have, and we're also approaching time. So this was a good session. I hope this was a good session for you guys and that you were able to, you know, build a cool app. And if you have any questions, feel free to reach out.

Thank you everyone for joining us, and thank you, Yujian, for this great session. We hope to see you all next time. If you wanna see our upcoming calendar of events, check out zilliz.com/event, and we hope to see you on a future training. Thanks so much, everyone.

Meet the Speaker

Yujian Tang
Developer Advocate at Zilliz
Yujian Tang is a Developer Advocate at Zilliz. He has a background as a software engineer working on AutoML at Amazon. Yujian studied Computer Science, Statistics, and Neuroscience with research papers published to conferences including IEEE Big Data. He enjoys drinking bubble tea, spending time with family, and being near water.