Webinar
Memory for LLM applications: Different retrieval techniques for getting the most relevant context
Resources
LangChain Auto-Evaluator | GPTCache | Generative AI Agents Research
Thank you so much for joining us today for our session, Memory for LLM Applications: Different retrieval techniques for getting the most relevant context. I'm Emily Zi and I'm a member of the team here at Zilliz. I'll cover a few housekeeping items and then we'll get right into the session. First, this webinar is being recorded, so if you have to drop off at any point, you will get access to the on-demand version within a couple of days.
If you have any questions, feel free to paste them into the Q&A tool at the bottom of the screen. That just helps keep things organized for our speakers and makes it easier to moderate your questions. In a few minutes, I will drop some resources into the chat panel. I encourage you to check out our upcoming events; we've got a great session coming up next Thursday.
We've got free resources, and of course you're always welcome to join our community Slack. Today I'm pleased to introduce the session, Memory for LLM Applications: Different retrieval techniques for getting the most relevant context, and our guest speaker, Harrison Chase. Harrison is the co-founder and CEO of LangChain, a company formed around the open-source Python and TypeScript packages that aim to make it easy to develop language model applications.
Prior to starting LangChain, he led the ML team at Robust Intelligence, an MLOps company focused on testing and validating machine learning models, and led the entity linking team at Kensho, a fintech startup. He studied stats and CS at Harvard. In addition to Harrison, we're also joined by my colleague Philip, a software engineer here at Zilliz. Thank you so much for joining us, and welcome Harrison and Philip. Thanks for having me.
Excited to be here. So I guess we'll just jump right into it, Philip. Yeah, excited to be here, excited to chat, excited to have you with me. I think it should be interesting to get dual perspectives. You and Zilliz, with Milvus and then Zilliz Cloud, have worked with a lot of people on the problem of retrieval, I'm assuming more from the perspective of the database.
Yeah. And a lot of my exposure has been from the application-building side, and I think retrieval is super important. So really just intending to have a fun conversation. We've got some slides, and we'll have some back and forth as well. There's the Q&A tool too, so if you have any questions that you want answered, just drop them in there, and we should have a fun 50 minutes or however long it is.
So today we'll be talking about retrieval in general: a basic overview of what retrieval is and why it's important. Semantic search is maybe the most commonly used type of retrieval, but there are some edge cases, so we'll cover the basics and we'll also cover some edge cases and some solutions or workarounds for them. And then I want to talk a bit about generative agents at the end, which is a really cool, more recent paper that combines a lot of these more advanced ideas. So at a high level, and this is a very bare slide, so Philip, this might be a good one for you to add some color to: what is retrieval and why is it important? The idea of retrieval is bringing external knowledge to a language model.
Language models only really know what they're trained on. I think ChatGPT has a cutoff date sometime in 2021. They're also not trained on everything; they're not trained on your corpus of data, and they're often not trained on smaller or less common corpora.
Another reason for retrieval is what some people call grounded generation: really grounding the generation in the actual documents. Even if the information is in the language model, maybe it's in there in such a small way that it doesn't really recognize it, so you bring it front and center and put it forward through retrieval. Do you have anything else to add here? I'm assuming most people have some concept of retrieval, but I'm curious, from your perspective, what you use to motivate it.
I think for me, a big one is also giving evidence for the answer. The issue with large language models is they can sometimes say whatever they want, and with these retrieval systems you can go and see, okay, where was that evidence pulled from? You can list your factual proof, sort of like a search engine, so you're pretty much using it as a search engine to give proof of what your LLM is saying. I think that, combined with the external knowledge, giving it some info so it knows what it's talking about, are the two most important reasons, in my opinion.
I think the point about citing sources is really important. What have you seen people doing there? The obvious one is you ask it to cite its sources and it responds with a list of URLs, or whatever the sources happen to be for the documents you're passing in. Do a lot of people display those? Do they do internal checking for consistency with those sources? How do you see those used? We've mainly seen it used as giving your top-k results, a snippet of the text, and a link to the actual document. There hasn't been too much analysis to check, on every answer, whether it's correct; it's more about giving it backup and showing where it's pulling from.
I know with the retrieval plugin from OpenAI, they were doing that style of listing things out. I forget the exact method it used to decide when it doesn't know; I think internally they might have had some kind of trigger to say, okay, it's maybe time to search over the vector index. But it was mainly about showing the results. Because also, when you're pulling things out of retrieval, a lot of the use cases are to summarize, and that summarization can also go wrong, because it can start pulling the wrong things into the summary.
That's why, whenever you're using the vector index, it's worth also showing the search results, just in case, so you're pretty sure what's going on and you can go back to something. In large-scale usage, it's usually up to the user to look over that factual evidence to see if things line up. I think there is work going on to compare the two automatically; I'm not exactly sure what, but I think it's happening. Yeah.
Awesome. So hopefully that's some good motivation for retrieval and why it's important, and I'm assuming most listeners here are familiar with that as well. So now, a quick overview of semantic search. This is, I would say, by far the most common type of retrieval.
Well, maybe the real most popular one is just using search engines; if you think about calling out to a search engine, that's a type of retrieval as well. But in terms of retrieval over your internal documents, semantic search has to be the most popular one. The basic overview is: this happens at query time.
You take the user query, you create an embedding for it, and you query it against your vector store. This vector store has already been pre-populated with a bunch of your documents: they've been ingested, split into chunks, embeddings have been created for each chunk, and then they've been put in a vector store like Zilliz or Milvus. Then at runtime you pass the search query in, you get back some documents, you put them into the context of the language model, and it generates an answer.
And you can see the prompt doesn't have to be super complicated: "answer the question using the additional context" or something like that. There are a bunch of variants of this where you say, make sure you only use the context provided, return any source citations that you used, and so on. So this box here is where a lot of the prompt engineering in LangChain happens.
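To make that query-time loop concrete, here is a minimal sketch assuming 2023-era LangChain with Milvus as the vector store and OpenAI for embeddings and generation. The document contents, connection details, and k value are placeholders, not recommendations.

```python
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document
from langchain.vectorstores import Milvus

# Ingestion (done ahead of time): chunked documents get embedded and stored.
chunks = [Document(page_content="Our vacation policy is 20 days per year.",
                   metadata={"source": "hr-handbook.md"})]
vector_store = Milvus.from_documents(
    chunks,
    OpenAIEmbeddings(),
    connection_args={"host": "localhost", "port": "19530"},  # or a Zilliz Cloud endpoint
)

# Query time: embed the question, pull the top-k similar chunks, and stuff them
# into the prompt ("answer the question using the additional context").
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0),
    retriever=vector_store.as_retriever(search_kwargs={"k": 4}),
)
print(qa.run("What is the vacation policy?"))
```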
This is the high-level picture, before getting into where it fails, which I think is an interesting discussion. Maybe we can talk a bit about the vector store, and Milvus and Zilliz in particular. I think a lot of people are wondering: what is a vector store, why do I need one, and how do I choose one? From your perspective, as you think about building the best vector store offering, what are the axes you compare across? Yeah, I think currently the biggest one we're looking at is just ease of use. A lot of users, depending on where they're at, don't want to go run six Docker containers, set up the networking, connect to the containers, and do all that stuff. They just want something where they can pip install, import it, and have it work. So that's been a big one.
Then also ease of use of the client. None of these vector databases have really settled on a language or API style, the way SQL became the standard everyone converges on. There's the graph DB style, but there is no single language for doing this, no single API. So we're trying to figure out what the simplest one is: how can we make things simpler without making the user set up a bunch of arguments for the vector database? Everyone can get it working pretty fast, but at some point you need to make it easy to use. We try to offer as many switches as possible, which is great for big businesses that have a team to handle this stuff, but for a smaller user who's pretty new to this, do they really need all those switches? Does the extra 10 milliseconds really matter for someone who's just messing around with it? It's about finding that balance and creating options for everyone.
That's been the main thing. But then again, on the big-user side, we do want to be fast, we do want to be stable, and we do want to handle all the new features that are coming out. It's a balance: performance versus ease of use. That's the main thing we look at. One thing I was interested in, though: what's your opinion on these new models coming out that support, I don't know what the latest one was, but abnormally large context sizes, where at some point you can probably fit your entire corpus of data into the prompt?
I mean, I think the short answer is I don't really know. I don't think anyone knows. I've played around with the Anthropic one, the 100K one. I know there are some that are now supposed to be bigger than that; I haven't played around with any of those.
I played around with the Anthropic 100K one, and I'll tell you what it works really well for. I passed it our entire code base, and if I want to find a single fact in there, it works really well for that; it just finds that fact and extracts it. Where it didn't work amazingly well is where I asked it to combine multiple things together. To give an example, we have the concept of agents in LangChain, and then we have different LLMs. A lot of the examples in LangChain that use agents use OpenAI's LLM, just because that's the most popular one, so it's in all the documentation.
So what I asked is: can you create an agent using the Anthropic wrapper for me? All that information is present in the code base. How do I import the Anthropic wrapper? It knew how to do that. How do I create an agent? It knew how to do that. How do I create an agent with an Anthropic model? It kind of struggled with that.
And I'm not sure if that was because the context is really long, because I've seen it perform better on some bigger questions. But for those cases I gave it, it wasn't working amazingly well. So, and this is a bit rambly, I don't think anyone really knows. I think it's really interesting to play around with, and I've played around with it a bunch.
I wish I had time to play around with it more. There are also different use cases: some use cases need really fast answers, and the downside of 100K is it takes really long. But if you have the time to wait, and these models get good enough that they can address some of those issues and just reason over everything...
Yeah, absolutely. So I think there's probably a time and place for both. I don't know, what's your opinion? Some of the hot takes might say, oh, vector stores are no longer needed because of these long context windows. How are you thinking about that? Our general thought is that a lot of the data will get lost within the giant context, because the model won't assign enough weight to it. I haven't really played around with it much myself; we've mostly been reading papers.
But you lose some of the key information. Say you have a question that hinges on one sentence that occurs once, or something that sits outside the main thread of the text. I think Anthropic released some tests where they put in an entire book and had it find one line, and from what I remember that line was a relatively easy one to find, though I'm not a hundred percent sure. Will it be able to remember things within that text? Will it be able to assign the correct weight to what it's trying to pull out? That's where we think it might struggle: there's just too much information for it to pull out the right piece. And then also speed: if you're trying to answer a question about your entire repo, every time you use it you're going to have to upload the whole repo.
Unless they do some caching method in the layers somewhere, where you can cache and get the response faster based on the first text you put in, I don't really see it being too useful. Also pricing; last I checked it was pretty expensive. So not too many worries at the moment, but we'll see, this area is moving so fast. And then I have two Q&A questions that pertain to this that we can probably answer right now.
Yeah, so one of them is from Sunil, and I'll try to summarize it. Given the model is trained on a set of data already, what is the model actually doing with the additional docs, i.e. the external knowledge? Is it simply summarizing, or is it somehow taking the external knowledge into account in generating answers? Yeah, it's taking the external knowledge into account in generating answers. You can feed it a document with a bunch of facts, or say a code snippet from the LangChain repo, and then ask it: what's happening on this line? How would you change this line to another line? All language models are basically doing conditional generation of the next token, and when you pass in additional context, they're now conditioning the generation on the data that you passed in.
What that generation is, whether it's answering questions or doing summarization, depends on what you ask it to do. If you ask it to answer a question, it's absolutely taking in that external context and using it to answer. Awesome. And then one more, from — sorry if I pronounce your name wrong — and this pertains to the docs and getting them into the vector store: how important is it to properly generate chunks here? What's the best size for a chunk? What are the best techniques for chunk generation? I know there are some defaults within LangChain for chunk sizes, but have you looked into how different chunkings work, and what's the best route? That's a great question.
I was going to ask you what you're seeing from customers as well, but I'm happy to give my take here. Okay, so backing up a step: what is chunking and why is it important? It's basically taking these long documents and creating smaller chunks of data from them.
This is important because you only want to retrieve the most relevant chunks. At one extreme, if you chunked everything into single words, that would be pretty useless because you lose all the context. If you chunked everything into one document, you wouldn't have actually done any chunking. That shows the trade-off: the smaller the chunks, the less context you have; but the bigger they are, the embedding might not fully capture them, because there might be multiple concepts within one chunk that aren't fully captured.
And then two, you can't retrieve as many of them to put in the context window, holding the context window steady. I've generally had better luck with smaller chunks, honestly. I think the defaults we have in LangChain are a little bit large, so I generally move it down to maybe a chunk size of around 500 tokens or something like that. A few things here: this is purely a heuristic.
I don't know if there's a right answer. It's often useful to have overlap between chunks, to carry through some of the context. And there's a great tool by Lance Martin called Auto-Evaluator that lets you experiment with this whole question-answering pipeline, including different chunk sizes and chunk overlaps, and then run an evaluation end to end to see what comes out best for your use case. The other thing I'll add, and I think this is something LangChain has done pretty well, is that it's really important how you create these chunks.
By default you could just split on tokens, but the idea behind chunking is that, as much as possible, you want things that are semantically related to end up in the same chunk. A lot of the text splitters we've added don't split on a token level — you can do that, there's a default in there — but they split on characters that we think are semantically meaningful. For example, in markdown, a single hash or pound symbol is a header one, two of them is a header two, and three of them is a header three.
If you start splitting on those recursively, you get chunks of text that sit within the same sections, and those are generally semantically meaningful chunks that should probably stay together. So we split according to that. Another thing, and this is something we're working on and I hope to have a version of out soon —
Lance is actually working on it; he's doing a lot of amazing stuff around retrieval, you should have him on as a guest, actually. One of the things he's looking at is that when you split, you sometimes lose contextual information about where you are in the document. In the example I just gave, if you split on header ones, then header twos, then header threes, when you split on the header three you lose a little bit of information: which header two was this in, and which header one? There's this concept in LangChain of metadata associated with each document.
So what we're working on is propagating some of those section headers into the metadata, basically enriching the metadata as you go along. The idea is that you can do chunking without losing the context of where a chunk sits in the document. That was my long, rambling answer on chunking. I think it's really interesting; we've invested a bunch in text splitters in LangChain, and I don't know if I've seen many other places that have. But I'm also curious for your take on this, Philip: what do you see people doing as they create things to embed and put in vector stores?
At the moment we're seeing the simple route — definitely not doing this kind of analysis of what type of document is being looked at and then splitting based on the headers.
Yeah, for us it's mainly been the simple chunk-by-X-tokens. It tends to work pretty well for most users, and for the results they're looking for, I don't think there's a lot to be improved upon. That's where we point people to LangChain to get a better pipeline, but it's usually just the basics. Or they're usually not storing these giant documents.
If they're going the PDF route, or using their own data, they'll usually go through LangChain and then give us what comes out of that. If someone is using us directly without LangChain or any of these libraries, their data usually isn't giant documents; they're already pulling data out, like Q&A data or something like that, where your chunk can pretty much be the entire document. It really depends, but normally they just say: okay, we have these embeddings, how can we make our search better and faster? That makes sense. Yeah.
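Since chunking keeps coming up, here is a rough sketch of the two approaches just described, assuming LangChain's text splitters: a generic recursive splitter with a smallish chunk size and some overlap, and a markdown-aware splitter that carries the section headers into each chunk's metadata. The sizes and header names are arbitrary examples, and the markdown header splitter may not exist in older LangChain versions.

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter

markdown_text = """# Handbook
## Vacation
Employees get 20 days of vacation per year.
## Expenses
Submit expenses within 30 days.
"""

# Generic splitting: token-sized chunks with overlap so context carries across boundaries.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(markdown_text)

# Markdown-aware splitting: split on # and ## so a chunk stays inside one section,
# and record the header hierarchy it came from in the chunk's metadata.
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "header_1"), ("##", "header_2")]
)
md_chunks = md_splitter.split_text(markdown_text)
print(md_chunks[0].metadata)  # e.g. {"header_1": "Handbook", "header_2": "Vacation"}
```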
I think there are a few more questions here, but one of them we'll be going over later in terms of conversational memory and prompts, so I think we can continue. Cool. All right. I have a few more questions I want to ask you around the ingestion pipeline as well, but we'll move on to some further slides for now.
So, okay: what are some edge cases of semantic search? There are a bunch that we've noticed. And I really liked what you said about making vector stores easy to use, because I do think semantic search goes a long way. It gets you, I'm not going to put an exact percentage on it, but probably somewhere between 80 and 90, 95% of the way there, depending on your use case. So making that really easy to do is really enabling for a lot of developers.
But there are some edge cases that start to arise as you progress, and a lot of these are handled with heuristics on top of or around semantic search. Here's an overview of them; there are probably a bunch I'm missing, so I'm curious for your take as well, Philip, but I'll run through a few relatively quickly. First: repeated information. This is when you have documents that are exact copies of one another, or just have really similar content. Why does this matter? One, when you do retrieval you get a lot of copies of the same thing, and that's not ideal for providing the language model with the best possible set of information. And, very correlated to that, they just take up more context.
To get diversity, you'd have to really increase the number of documents you retrieve, and that starts to eat up context. So some of the things we've seen and implemented in LangChain: one is the idea of filtering out similar documents after a big retrieval step. Normally, by default in LangChain, you do a retrieval, fetch back four or five documents, and pass those into the prompt — we generally choose four or five because of context window lengths, and that's a reasonable default.
But one thing you can do is fetch, say, 20 or 30 documents and then filter out the ones that are similar, either through embeddings or by passing them to a language model — embeddings are probably the faster option here. We've implemented that in this idea of contextual compression: you retrieve documents and then do something to compress the information in them relative to the query that came in. Another thing you can do, which we've implemented on most vector stores I believe, is a different type of search.
Rather than doing pure semantic search, where you select the documents most similar to the query, there's another type of search called maximal marginal relevance (MMR). When it's selecting vectors to retrieve, it optimizes not only for similarity to the query but also for diversity from the other vectors it has already retrieved. So if you have two vectors that are nearly identical, as you start retrieving, it won't add the second one if it doesn't add much, and there are parameters you can tune there. Another thing I'll add, which I forgot to put here, is that you can simply de-duplicate documents before they come in, with some deduplication process. One downside of that is: if you have the same document in location A and location B and you de-duplicate them, what do you record as the location? Location A or location B? Will that actually matter for your application, for example if you're citing sources? So it's a little bit more risky there.
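Here is a minimal sketch of both ideas in LangChain, reusing the vector_store from the earlier sketch. The exact module paths vary a bit across versions, and whether MMR search is available depends on the vector store, so treat this as illustrative.

```python
from langchain.document_transformers import EmbeddingsRedundantFilter
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import DocumentCompressorPipeline

query = "What is the vacation policy?"

# Option 1: MMR search -- pull fetch_k candidates, then pick k that are both
# relevant to the query and diverse from each other.
diverse_docs = vector_store.max_marginal_relevance_search(query, k=4, fetch_k=20)

# Option 2: over-fetch, then drop near-duplicate chunks by embedding similarity
# before they ever reach the prompt (one form of contextual compression).
redundant_filter = EmbeddingsRedundantFilter(embeddings=OpenAIEmbeddings())
retriever = ContextualCompressionRetriever(
    base_compressor=DocumentCompressorPipeline(transformers=[redundant_filter]),
    base_retriever=vector_store.as_retriever(search_kwargs={"k": 20}),
)
deduped_docs = retriever.get_relevant_documents(query)
```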
What are your thoughts on this, Philip? Have you seen this pop up? Have you seen any other techniques, and which of these three do you think people do most? I would say at the moment it's a tough area, because for document deduplication, unless you're doing exact pattern matching, if you're going just by the embedding, you could do a search for the 10 closest results and set a threshold for what counts as a duplicate, then remove them. The only issue is you don't really know what similarity score should mean "duplicate," because one element of the vector could be completely different, meaning the content could be quite different. These embeddings are so large that deduplication is hard. One thing I think we played around with for compression was, as you said, taking your top 10 results, summarizing across all of them, and then using that in your prompt. So even if they don't hit as accurately, you can still summarize across them into a single document and use that in your next question, which I'm guessing is also what you're doing in contextual compression.
Yeah. But in terms of getting rid of repeated information, it's difficult unless you know a document is going to be nearly identical. You can do an upsert based on the primary key of the document, but if you really want to get rid of contextually repeated information, I don't think we have a trick for that yet. Yeah. Alright, jumping into the next thing I've often seen pop up: conflicting information. This is when you have answers to the same question from multiple sources. A way of thinking about this: in your Notion database, say the question is, what is the vacation policy for your company? You might have answers for that in different places.
Maybe the answer changed over time. So there's the master HR document and the answer's there, but then there are some random meeting notes from a year ago where they discussed the vacation policy, and the answer's there too. When you do a retrieval, you pick up information from one place and from the other, pull them both into context, and you're now asking the language model: generate the answer to "what is the vacation policy?" based on the following information — and the following information has two conflicting pieces. So what do you do there? Some things we've seen: when you're retrieving sources, you can have some notion of importance built in. In the example above, you'd probably weight the HR document higher than the random meeting notes.
If those are the only two documents retrieved, this doesn't do much, but when there are a lot of documents you can add importance weighting and filter the right ones up. The other thing, which is maybe more robust in some ways, is to pass the source information, the metadata, to the generation step. So when you pass it to the generation step, you don't just say, here's a chunk of text that was found and seems relevant; you also say, this was found in the master HR document, and for the other one, this was found in meeting notes from 3/14/2022 or whenever it was.
And language models are pretty good at this kind of ambiguous reasoning. This gets back to prompt engineering and prompt construction: if you say, in the case of conflicting information, choose the source that seems more credible, that should help. That's what I've got for conflicting information. Have you seen this pop up at all, Philip? And these are all edge cases, to be clear; I don't think they're extremely common for the audience listening, but as you start to encounter them, I'm trying to provide some context on what might be going on behind the scenes.
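One simple way to hand that source information to the generation step is to format it into the context block yourself. This is a rough sketch with made-up documents and prompt wording; the instruction about preferring credible and recent sources is plain prompt engineering, not a special feature.

```python
from langchain.prompts import PromptTemplate
from langchain.schema import Document

retrieved_docs = [  # stand-ins for whatever retrieval returned
    Document(page_content="Vacation is 20 days.", metadata={"source": "hr-handbook.md", "date": "2023-01-05"}),
    Document(page_content="Vacation is 15 days.", metadata={"source": "meeting-notes.md", "date": "2022-03-14"}),
]

prompt = PromptTemplate.from_template(
    """Answer the question using only the context below.
If sources conflict, prefer the more authoritative and more recent source, and say which one you used.

{context}

Question: {question}"""
)

def format_context(docs):
    # Prepend each chunk with where (and when) it came from so the model can weigh it.
    return "\n\n".join(
        f"[source: {d.metadata['source']}, date: {d.metadata['date']}]\n{d.page_content}"
        for d in docs
    )

print(prompt.format(context=format_context(retrieved_docs), question="What is the vacation policy?"))
```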
Yeah, I think for us a key one has been: if you can get dates, just do metadata filtering based on dates, where you assume the most recent knowledge is the best knowledge. I'm sure there are cases like the one you described, where it's HR saying something versus some random source. On our side, outside of metadata filtering, we don't really have that ability; it comes closer to your domain, the actual prompting around it. But if you can use the large language model, as you said, it can usually figure out, from analyzing the question, that an answer from HR is probably better.
From the raw database side, it's metadata filtering: trying to kick some results out so you have a higher chance of getting the result you want. If you know something comes from a random forum and you most likely don't want to use that as your result, filter out all that data. It really depends on how you can structure your data and what information you have ahead of time. Importance weighting is also difficult, because you have to have a say in how important each thing is.
If you don't want to use the large language model, you have to have that metadata at hand, and then with importance weighting you can filter things out as well; but for anything automatic you'll pretty much have to fall back to the large language model to decide. Yeah. Next one, related to the previous one and to some of what you just talked about: temporality. Information can update over time. The vacation policy could change from one date to another and both documents could still be in the database. Some solutions I've seen here — and two of them are very similar to before — are recency weighting in retrieval, or recency filtering as you said, which is a good way to filter old documents out completely.
You can also include the timestamp in the generation step, so you can tell the language model to trust more up-to-date information. The other one is an interesting one, and it doesn't apply in all situations, but the idea is to use reflection to update your concept of something over time. I mostly see this used in conversational settings where you have a chatbot: I'm chatting with a chatbot and talking about my friend Philip and what he's up to and what he's doing. One way of adding memory to that chatbot is just to retrieve relevant pieces of what I said about Philip the next time I talk about Philip.
But another way of adding memory is something like a knowledge graph, where it has a concept of Philip, who Philip is, what he does, what he likes, and so on. Then, as information comes in over time, that entity gets updated in the knowledge graph. There are a lot of downsides to this: the reflection step uses the language model, so it costs more money and it's kind of slow. So I think it's still very experimental, but the paper we'll talk about at the end, the generative agents paper, uses this technique.
So I'd say that one's a bit more researchy at the moment, but very cool, and the other two are a bit more practical. I don't know if you have much more to add here; this is pretty similar to the one we talked about before. Yeah, mainly just timestamp-based and date-based.
I feel like, if you have the funds to start pulling in LLMs, you could also use an LLM on the metadata stored with a vector and then decide, oh, do we want to combine this, and keep updating and upserting. But on our side of things, we try to do as much as we can just based off the raw data we get, trying to avoid as many LLM calls as possible, because that gets expensive. Yeah.
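For the recency-weighting idea specifically, LangChain has a time-weighted retriever that decays the score of memories that haven't been accessed recently. A minimal sketch follows; FAISS is used only because the retriever needs a backing vector store, and the decay rate, dimensionality, and documents are arbitrary examples.

```python
from datetime import datetime, timedelta

import faiss
from langchain.docstore import InMemoryDocstore
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import TimeWeightedVectorStoreRetriever
from langchain.schema import Document
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings()
index = faiss.IndexFlatL2(1536)  # dimensionality of the OpenAI embeddings
vs = FAISS(embeddings.embed_query, index, InMemoryDocstore({}), {})

retriever = TimeWeightedVectorStoreRetriever(vectorstore=vs, decay_rate=0.01, k=4)
retriever.add_documents([
    Document(page_content="Old vacation policy: 15 days",
             metadata={"last_accessed_at": datetime.now() - timedelta(days=365)}),
])
retriever.add_documents([Document(page_content="New vacation policy: 20 days")])

# Scores combine semantic similarity with an exponential recency decay,
# so the newer document tends to win for this query.
docs = retriever.get_relevant_documents("What is the vacation policy?")
```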
This next one is a cool one that we've added recently. Basically, sometimes questions are more about metadata than content. An example is a user question like: what's a movie about aliens in the year 1980? To back up a step, embeddings are really good at capturing the semantic meaning of things; they create a dense representation of text. The semantic concept the user is looking for is movies related to aliens, but there's another part, which is basically a metadata filter as we call it: you want the year to be 1980. That's a very exact-match kind of thing.
It totally depends on the schema, obviously, but this could very well be a field in the schema that you want to query, and you basically want to filter results on it. So we added a self-query mechanism: before it does the semantic search retrieval, it splits the question into two things — a metadata filter, and the query itself. In this case you'd get a metadata filter, an exact match on year equals 1980, and then a query, which would be something like "aliens." Then there's the question of how you apply the metadata filter. A lot of vector stores have ways to pass in metadata filters, so you can do it directly in the query.
If that doesn't work, you can probably also filter things out after they come back from retrieval, but pretty much all of them have ways to pass metadata filters in. So you pass in this metadata filter along with the generated "aliens" query, and it combines the best of both worlds: semantic search plus an exact-match filter you can apply.
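A minimal sketch of that self-query mechanism in LangChain is below, reusing the vector_store from the earlier sketch. The field descriptions are made up, the feature needs the lark package installed, and whether your particular vector store is supported for the filter translation depends on the LangChain version.

```python
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.chat_models import ChatOpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever

metadata_field_info = [
    AttributeInfo(name="year", description="The year the movie was released", type="integer"),
    AttributeInfo(name="genre", description="The genre of the movie", type="string"),
]

retriever = SelfQueryRetriever.from_llm(
    ChatOpenAI(temperature=0),
    vector_store,
    "Brief plot summaries of movies",  # description of what the documents contain
    metadata_field_info,
)

# The LLM splits this into a semantic query ("aliens") plus a structured
# filter (year == 1980) that is pushed down to the vector store's metadata filtering.
docs = retriever.get_relevant_documents("a movie about aliens from 1980")
```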
Have you tried doing that on the other side as well, with self-inserting? You could take a batch of your insertions and generate a schema for what you want to insert, so you know the filters — maybe extract what you want to filter on as columns — and then later do the same for the querying, and it can decide based on what it sees in the schema. It would probably be expensive, but it would be interesting: it could decide what metadata it wants to pull out of the data. Yeah, we haven't done that, but we've thought about it, though not framed exactly as you just said — what metadata should we attach from a document? Let me make a note of that. I'm going to Slack Lance about it right now.
Extract metadata during ingestion. Alright, we'll add something there; that's a really good idea. Because this goes back to the ingestion part: even if you're not doing this self-query, it's still really useful to have this enriched-metadata kind of thing. Someone had brought this up before — I think we were chatting with someone from Zapier and they brought up a similar thing as well.
So now that's two people, which means we definitely have to do it. The only issue I can see is that then you're making an LLM call for the metadata, plus an embedding call, and it's getting expensive, unfortunately. Oh yeah. But I think you probably have to trust that LLM costs are going to go down and latencies are going to go down over time, given the speed at which everything is moving.
And another thing I'll say: you don't need to do this for every use case, right? That is true. Yeah, you can do it as desired. That's the thing with this space right now: I think there's no one right way of doing everything. Oh yeah.
A lot of what we view LangChain as is these components, like tools in your tool belt that you can use for different use cases. So if you really care about this, and it's really important that you have this identifiable metadata, it's an option, but it doesn't mean it's the right thing for everyone. Again, semantic search gets you 80% of the way there, or it gets you all the way there for 80% of the use cases. Awesome.
Multi-hop questions. This is another one, where you could be asking multiple questions in one. There are a lot of examples of questions like this; I first started thinking about it when the ReAct paper came out — synergizing reasoning and acting, an early version of an agent. You've basically got questions like: what's the population of the capital of Kansas? That's really two questions: what is the capital of Kansas, and then what is the population of that city? The issue with semantic search here is that if you just look up the original question, it might pull in one part or the other — the population of capitals but not Kansas, or the capital of Kansas but no population information. It's basically a multi-hop thing you need to do.
Agents are a way of saying: break the question down into multiple steps. Use the language model as a reasoning engine. Think about what piece of information you need to look up first — okay, what's the capital of Kansas — then what you need to look up second — okay, what's the population of that city — and do those as separate queries. And by the way, some of these queries might not be to the same database: one could be to a vector store, another to a SQL table.
This starts to get into the idea of a generalizable agent that you can use for retrieval. And again, as discussed, it racks up a lot of LLM calls and it can go off the rails a little more easily. So it's again just another tool that you can have in your tool belt. Awesome. Yeah.
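Here is a rough sketch of that kind of ReAct-style agent in LangChain. The tool contents are placeholders, reusing the vector_store from the earlier sketch; in practice one tool might wrap your vector store and another a SQL chain or a search engine, and whether the agent's decomposition actually finds the answer depends on what the tools contain.

```python
from langchain.agents import AgentType, Tool, initialize_agent
from langchain.chat_models import ChatOpenAI

# Wrap each data source as a tool the agent can call.
tools = [
    Tool(
        name="company-docs",
        func=lambda q: "\n".join(d.page_content for d in vector_store.similarity_search(q, k=4)),
        description="Looks up facts in the internal document store.",
    ),
]

agent = initialize_agent(
    tools,
    ChatOpenAI(temperature=0),
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,  # ReAct loop: reason, act, observe, repeat
    verbose=True,
)

# The agent decomposes the question itself: first look up the capital of Kansas,
# then look up that city's population, each as a separate tool call.
agent.run("What is the population of the capital of Kansas?")
```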
Speaking of racking up the LLM calls, I know there's been work on caching LLM calls. I think you include that in LangChain now too — GPTCache. Yeah, GPTCache.
Great project. Hopefully that helps with these use cases where, who knows, you might be sending the same sub-question multiple times. If you're asking multiple questions about the capital of Kansas split up, maybe it will hit the cache a few times and you might save some money. Who knows? Absolutely. Awesome.
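For reference, the simplest form of this in LangChain is just setting a global LLM cache so identical prompts aren't re-sent to the API; a GPTCache-backed version of the same hook also exists in langchain.cache (with semantic, similarity-based caching), though its exact initialization differs by version, so only the in-memory variant is sketched here.

```python
import langchain
from langchain.cache import InMemoryCache
from langchain.llms import OpenAI

# In-process cache: repeated identical prompts are answered without another API call.
langchain.llm_cache = InMemoryCache()

llm = OpenAI(temperature=0)
llm("What is the capital of Kansas?")  # first call hits the API
llm("What is the capital of Kansas?")  # second call is served from the cache
```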
Cool. Okay, the last thing I want to talk about really fast, and then we can probably go to question answering for the remaining bit, is a paper about generative agents. This came out of Stanford about a month or two ago. The basic idea is they set up a simulation of different agents interacting with each other, kind of like in The Sims. They had 25 different agents; they could all take their own actions, and they all had their own memory, basically.
How they did the memory is really interesting, because memory is really retrieval of relevant information at the right time, in some sense, so how they did it is pretty instructive to look at. Basically, they had all these observations, and they would fetch relevant observations and put them into context as agents were deciding what to do next. So what exactly was going on in this retriever? There were a few different components. There was the relevancy component, which is where the semantic search bit comes into play.
They'd calculate the relevancy of the current situation to previous situations and bring in information about those previous situations. Importance also came into play, so they would retrieve more important information — and to your point about doing this in an automated way, they used the language model to assign importance to memories, so there was that extra cost incurred there. And then recency was another bit: they gave more weight to more recent things, not for the sake of deduplicating conflicting information or anything like that, just because more recent memories are probably more relevant.
So they pulled that in. And then, on top of that, the other thing they added, which is similar to some of the stuff we talked about earlier, was a reflection step. At each time period they would reflect on past observations and then add that reflection as a memory. And they treated reflections just like every other observation, so in terms of the retriever, it didn't change anything at all.
The retriever was still a combination of relevancy, importance, and recency. But now the things being retrieved weren't just individual observations; some of them were also reflections on those observations. So it's a way of pulling in higher-order reasoning about them. Again, this was a research paper.
So this probably isn't production-ready, but I think it's a really interesting look at how more complex types of retrieval can power really interesting and dynamic simulations. Someone had a question in the chat: do you have a link to that paper, or what is it called? Yes, it's called "Generative Agents: Interactive Simulacra of Human Behavior." Let me find a link and drop it in the chat.
Oh, someone beat me to it. Yeah. Awesome. We have a few Q&A questions if we want to go over them; I'm not sure if you had more slides. That's all I had, unless there are other things you think we should talk about.
I think that pretty much got the gist of memory in terms of LLMs, so I'll go through these. We'll start with a big one: there's a belief that the future of LLMs is about separating the reasoning ability of the model from the knowledge the model holds. This leads to models that are experts at reasoning about tasks but maybe don't have all the knowledge about everything, and the questioner sees this as how people will use LLMs in the future. What are your thoughts on this? It might be nice if you can discuss a few examples of the scenario. Yeah, I think there's definitely a lot of validity to it. That's a lot of the underpinning of this idea of retrieval-augmented generation.
You're pulling in the relevant context and then asking the language model to reason over that context and the question and provide an answer, rather than just use information that's in its weights. So I think there absolutely are a lot of very valid reasons to think this way. Obviously there are also a lot of applications where you don't have to do this; language models are good at having conversations, and maybe here's one distinction I'd draw.
Are you interacting with the language model for the sake of its reasoning ability, or for entertainment or something like that? If you look at Character AI, a lot of their characters are for entertainment purposes, and I don't think they do a lot of this retrieval-augmented generation. And then people use ChatGPT for a lot of information, and it's pretty good at a lot of it; it gets a lot of it factually correct. There's also work being done to improve the underlying factuality of the LLM, so it doesn't just make stuff up. But even if you have a language model that doesn't hallucinate, and when it doesn't know the answer it just says so, as opposed to making stuff up, which is the current situation, this idea of combining the language model with other data is still extremely relevant, because there's always going to be data the model wasn't trained on. And then the option is to train the model on that data.
Fine-tuning right now is much more expensive than a lot of this RAG, retrieval-augmented generation style stuff. So that's a long way of saying: the question is about the future of LLMs, and I'm really hesitant to make predictions about the future because I don't really know anything, but I'm guessing there's absolutely a place for this in some form. Let me pull up the next one. Someone had a question about conversational memory.
How are we storing the previous conversations, and are they more structured, and handled better, than freeform conversation? So, conversational memory: what's the route for storing it and retrieving from it? The basic thing that everyone's doing with conversational memory right now is keeping a buffer of the previous messages in the conversation, so it's entirely recency-weighted.
I'm pretty sure that's what ChatGPT does, and I'm pretty sure that's what Character AI does — I could be wrong. But I think they're just keeping around this buffer of the most recent messages, and as the context windows get longer, you can keep that buffer around for a longer period of time, et cetera. When you talk about extending that, all the same stuff applies. As you pointed out, conversations are less structured than structured documents.
This is actually a really good question. I haven't thought too much about the differences in terms of how you'd want to do retrieval. I'm guessing you'd probably want to treat the conversation as a whole document, as opposed to each message as a document, and you'd probably want a decent amount of overlap in the conversation chunks, maybe more so than for a structured document. But that's a really good question; I haven't thought about it in too much detail.
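For reference, the buffer approach described here is a one-liner in LangChain; a rough sketch, where the window size is arbitrary and other memory classes (summary, entity, vector-store-backed) can be swapped in to extend beyond the context window.

```python
from langchain.chains import ConversationChain
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferWindowMemory

# Keep only the last k exchanges in the prompt -- the purely recency-weighted buffer.
conversation = ConversationChain(
    llm=ChatOpenAI(temperature=0),
    memory=ConversationBufferWindowMemory(k=5),
)
conversation.predict(input="My friend Philip works on vector databases.")
conversation.predict(input="What does my friend work on?")  # answered from the buffered history
```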
Another one about chunks: for long documents, is there a way to select the best N chunks to use for summarization, leveraging the vector stores? What's the best way to think about this — is there a way to filter out what you don't need? Yeah, to some extent it depends on the questions you want to ask of it. If you're asking it to summarize the document, you probably want to use the whole document. If you're trying to answer questions about it,
then I think a vector store provides a very natural way to do that: you split the document into chunks and only retrieve the chunks that matter, so you don't need to pull in a huge number of them. So it depends a bit on the questions you ask of it.
And then, how do you determine the chunks? This gets back to the original point: right now it's all heuristics-based, and there's kind of no single right answer with any of this. Let me drop a link to Auto-Evaluator, which Lance Martin made — it's really good, and you can pretty easily experiment with different versions of all of this.
I just put that in the chat. There's no one universal answer; you should try a bunch of different things and see what works for you, and Auto-Evaluator is a good way to do that. And then we have another one: any impressions on how effective compression strategies are in practice? So I assume this is about contextual compression.
Yeah. I think it's still pretty early, so "I don't know" is the honest answer. There are cases where it definitely works; I don't think there's any debate that it works. I think the question is about ROI — is it effective relative to the cost it incurs? I think the jury's still out on that, and it depends a bit on your application.
And how does the weighting of information work — do you implement it at the metadata level or in the prompt? Yeah. So, taking a concrete example: the generative agents paper did a retrieval step based on semantic similarity and then re-ranked things by incorporating importance into that, along with some lambda in front of the importance score, some lambda in front of the similarity score, and also some lambda in front of the recency score, I believe. So that would be one way of doing it: combine the importance score with the similarity score to re-rank things, and then take the combined top three or four documents after that re-ranking.
You could also do it in the prompt. The generative agents paper is a very concrete application of the first approach; for putting it in the prompt, you just need to do some prompt engineering to tell the language model to pay more attention to it. I haven't seen that one done as much in practice; it's mostly been re-ranking and reordering after fetching documents.
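As a concrete illustration of that re-ranking approach, here is a plain-Python sketch of the weighted score the paper describes; the weights, decay rate, and metadata fields are arbitrary examples, and the candidates are assumed to be (document, similarity) pairs coming back from a vector store. LangChain's TimeWeightedVectorStoreRetriever covers the similarity-plus-recency part of this.

```python
import time

def rerank(candidates, now=None, w_sim=1.0, w_imp=1.0, w_rec=1.0, decay=0.995, top_n=4):
    """Re-rank retrieved memories with a weighted sum of semantic similarity,
    an LLM-assigned importance score, and an exponentially decayed recency score."""
    now = now or time.time()
    scored = []
    for doc, similarity in candidates:  # (document, similarity) pairs from the vector store
        hours_since_access = (now - doc.metadata["last_accessed_at"]) / 3600  # unix timestamps
        recency = decay ** hours_since_access          # 1.0 if just accessed, decays over time
        importance = doc.metadata["importance"] / 10   # e.g. LLM-rated 1-10, normalized
        score = w_sim * similarity + w_imp * importance + w_rec * recency
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]
```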
Let's see. Regarding the components described in the generative agents paper: how well do such concepts map to general, higher-order abstractions that can be composed into useful applications? And is cost inherently problematic for the foreseeable future? Around cost, I think honestly the biggest blocker for most things is still getting product-market fit. Getting stuff that works and is useful is a bigger blocker than cost. After that, yeah, cost probably becomes problematic. But even putting cost aside, I haven't seen a lot of projects that are really diving deep into this type of memory or this type of retrieval and really optimizing there. What was the first part of the question?
How well do these concepts map to general, higher-order abstractions that can be composed into useful applications? I think you answered both parts. And then we have one more: what is your perspective on the evolving trend toward intimate and private engagements with AI — an increasing number of conversations being entrusted to AI systems, conversations which typically would have been exclusively human? How can we design and implement trust-based mechanisms to ensure these interactions remain secure and privacy-centric? Yeah, that's a good question. I'll preface this by saying I don't really know, so you should take everything I say after this with a grain of salt. But off the top of my head, there are two really big components I can think of.
One is making sure that, for these more private and intimate conversations, the language model really does not suggest anything harmful. If you're having these types of conversations, it should not be saying anything harmful or anything that could lead to violence. It's much different from question answering over a random document; the stakes are a lot higher. So that's one part, around safety, and it's mostly about the safety of the underlying models. Then there's the privacy of the whole system: making sure your data is as secure as possible.
Does that mean a locally deployed LLM? It could. It could also mean better privacy policies from some of the cloud providers. It probably depends on your risk tolerance. I don't have a super strong take here; obviously local LLMs would be ideal, but I don't a hundred percent know that the alternative is a blocker.
If there are really good security measures in place — we entrust a lot of our personal information to the cloud as well. It's a good question, though. We have time for one more.
Okay: when using LLMs for summarization and reasoning based on semantic search, are there any open-source models you'd suggest for that purpose? The only one I've had any luck with is Mosaic's MPT-7B model. I've tried a few others, though I haven't experimented too much. I think Vicuna is not terrible.
So I take it back, MPT-7B isn't the only one I've had success with: Vicuna and MPT-7B are two that I've had a little bit of success with. But it's still super early in the open-source days, so there will probably be a lot more. Excellent.
I think we're at time. There are probably four more questions we can get answers to later, maybe, just so we don't go too far over. I'm not sure if Emily wants to — yeah, it's up to you. Harrison, if you've got a couple extra minutes, we're happy to take a few more questions, but if you've got to jump, then we will probably do a follow-up blog with answers to those questions. I do have to jump.
I appreciate you having me on here, but I've got a conflict I have to jump off for. Of course, we understand. Philip, Harrison, thank you so much; this was a really great session. And thank you for all of the wonderful questions from our audience.
We hope to see you at a future Zilliz webinar, and keep an eye out for that blog post; we will do our best to get the rest of those questions answered, and you'll find it on the Zilliz blog. Thanks, everyone. Thank you. Thanks, bye.