Boost your LLM with Private Data Using LlamaIndex


Webinar


Transcript

All right. Good morning, good afternoon, and good evening, everyone. Thank you so much for joining us for today's session, Boost Your LLM with Private Data Using LlamaIndex. I'm Emily, and I am a member of the team here at Zilliz.

I'll cover a few housekeeping items, and then we'll get right into the session. First, this webinar is being recorded, so if you have to drop off at any point, you will get access to the on-demand version within a couple of days. If you have any questions, feel free to paste them into the Q&A tool at the bottom of your screen or into the chat window on the right-hand side of your screen. Be sure to check out zilliz.com for upcoming events, join us on the Milvus Slack workspace, and check out some of our free resources. I will drop some links to those in the chat in just a few moments.

And today I'm pleased to introduce today's session, Boost Your LLM with Private Data Using LlamaIndex, and our guest speaker, Jerry Liu. For those who don't already know, Jerry is the co-founder and CEO of LlamaIndex, an open source tool that provides the central data management and query interface for your LLM application.

Before this, he spent his career at the intersection of ML research and startups. He led the ML monitoring team at Robust Intelligence, did self-driving AI research at Uber ATG, and worked on recommendation systems at Quora. He graduated from Princeton in 2017 with a degree in CS. In addition to Jerry, a little later in the session we'll also be joined by my colleague Frank Liu, our ML architect and director of ops here at Zilliz.

With that, Jerry, I'll let you take it away.

Awesome. Thanks so much, Emily. As Emily said, I'll be doing a short presentation, and then afterwards I'll have a nice chat with Frank. Sweet.

So LlamaIndex is a central interface between large language models and your external data. It exists as an open source project on GitHub. We have an ecosystem of different projects within the LlamaIndex organization, but today we will mostly talk about the core repo and the toolkits that it offers. The context here is that large language models are a phenomenal piece of technology for knowledge generation and reasoning.

They're pre-trained on large amounts of publicly available data, and they can be used for a variety of different use cases: for instance, answering questions, writing arbitrary amounts of text, summarizing text, and planning different types of actions. I think anybody building LLM applications, or trying to build applications on top of this amazing technology, asks themselves: how do we best augment LLMs with our own private data? Whether you're a single person with a bunch of files lying around on your hard drive, or an enterprise user with a ton of workplace applications like Notion, Slack, and Salesforce, or you're using an enterprise data lake and have a variety of databases that store different types of data, how do you augment language models with the data that's stored in these different sources?

There's a few paradigms for adding knowledge into a language model. One of the first paradigms is this idea of fine-tuning, which is more along the lines of classical machine learning, where you add new knowledge by retraining the network to incorporate it. There's a variety of training algorithms, techniques, and processes that you can use, but fundamentally it boils down to some optimization process over the weights of the network, so that you train this network on some new private data.

There's a few downsides, today at least, and fine-tuning has a lot of potential to get very good very soon. But today there's a high amount of data preparation effort needed. There's a certain lack of transparency: by training on top of this data, you kind of trust that the network will internalize this knowledge within the weights, the numbers basically. And then it doesn't actually work well for a variety of cases, and it's pretty expensive. The other approach these days is this idea of in-context learning, where you actually put context into the prompt.

Many of you might already be familiar with this paradigm, but the idea is that you take a pre-trained model, for instance ChatGPT or GPT-4, and then you take a corpus of external knowledge, for instance a set of essays or a set of texts or anything that you want really, and then you pair the language model with some sort of retrieval model to give you back the results that you want. So given a corpus of knowledge, you would perform retrieval first in order to inject the relevant context into the input prompt itself. The input prompt would look something like: "Here's the context: [insert context]. Given the context, answer the question. Here's the question: ..." And then you send the entire thing to the language model.
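As a rough illustration of the prompt layout described above, here is a minimal Python sketch. The template wording and the names QA_TEMPLATE and build_prompt are assumptions made for this example; they are not taken from LlamaIndex.

```python
# Hypothetical template that mirrors the spoken description:
# context first, then the instruction, then the question.
QA_TEMPLATE = (
    "Here is the context:\n"
    "{context}\n"
    "Given the context, answer the question.\n"
    "Here is the question: {question}\n"
)


def build_prompt(context_chunks, question):
    """Join the retrieved chunks and insert them into the prompt template."""
    context = "\n\n".join(context_chunks)
    return QA_TEMPLATE.format(context=context, question=question)


prompt = build_prompt(
    ["The author grew up writing short stories."],
    "What did the author do growing up?",
)
print(prompt)  # this full string is what would be sent to the language model
```

The retrieval step decides which chunks go into context_chunks; the language model itself is left unchanged.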

So there's some general challenges: how do you do in-context learning well? How do you combine retrieval and generation in a way that makes sense and gives good results? Another term for this these days is retrieval-augmented generation. How do you do retrieval? How do you actually retrieve the right context for the prompt? How do you deal with long context? How do you deal with source data that's potentially very large? LlamaIndex is a toolkit that aims to solve that. Its core mission is to solve that interface between your private data and your language model. Our goal is to make this interface fast, cheap, efficient, and performant. And I would say we've made some strides along all these dimensions, but it's definitely an area of continued improvement and growth. There are three main components within the core project.

The first is this idea of data connectors, offered through our community-driven site called LlamaHub. Here you can connect your existing data sources and data formats, for instance APIs, PDFs, documents, SQL, et cetera. And you can basically ingest all this data in a format that you can use with the language model. LlamaHub right now contains over 90 different data connectors. We'll talk about it in just a bit, but it's a pretty easy-to-use tool for loading in a bunch of data.

The next part, which really gets to the core of the repo, is data indices. Data indices are essentially data structures that structure your data for different types of use cases. So if you imagine that your raw data is stored somewhere, for instance in object storage or in a vector database, indices are essentially lightweight views on top of this data that allow you to define, for instance, keyword lookup or embedding-based lookup. The idea is that every new index you define will induce a different mode of retrieval and synthesis, and every index will be optimized for different use cases. And then finally, the last part is this idea of a query interface: once you've ingested and structured your data, you can wrap it within an overall query interface where you feed in some input prompt and get back a knowledge-augmented output.

So another way of looking at LlamaIndex is as a kind of black box: you can basically see it as a data interface for LLM application development. The input would be some rich query description of the task that you want done, and the output is a rich response with references, actions, et cetera. Under the hood, LlamaIndex would manage the interactions between your language model and your private data to give you back the results that you want. So let's talk a little bit about some of these components more in depth.

So data connectors are powered by LlamaHub, which, as we mentioned, is this community-driven site of different types of data loaders. This basically allows you to ingest any kind of data from anywhere into unified document containers. There's a lot of different data connectors within this hub, and the number on the slide is actually a little bit out of date. We're at like 90 different loaders and counting now.

So we have, for instance, a ton of PDF parsers, webpage readers, different types of scrapers, and loaders for different APIs, et cetera. And we also have growing support for multimodal documents, for instance with images. Next, let's talk a little bit about the data indices and the query interface. The data indices help to abstract away common boilerplate and pain points for in-context learning. For instance, they allow you to store context in an easy-to-access format for prompt insertion.

They allow you to deal with different types of prompt limitations, like 4,000 tokens for Davinci, when the context is too big. And they also help you deal with text splitting. Again, the idea is that you can almost see the index itself as metadata on top of your raw data. Each index will have a different view of the data and create a different mode of retrieval. And we'll walk through a few examples of different indices just to show you what we're thinking about.

And the key idea is to give users the tools to perform retrieval and synthesis over their data and manage those interactions in a way that is very powerful. Finally, the next part is the query interface on top of these indices, which can take in this input and give you back the output that you would want. As an example, if you look at the code image over here, the fundamental interface that's exposed to the end user is that you take a query engine, either from an index or something that you define yourself, then you ask a question, and then you get back the response that you want. So the first index example is this idea of a vector store index. This overall mode of retrieval and synthesis is something that is becoming more and more popular these days.

I just wanna show you how this basically works, and this is something that offers a very nice integration point with Milvus and Zilliz as well. The vector store index is basically this idea of pairing a vector store with the language model. The way this works is that, in the beginning, you would have a set of source documents, for instance Notion documents or PDFs, and then you would perform data ingestion. You would take in these source documents and split them up into text chunks using some sort of text splitting technique, which gives you nodes. A node is basically a text chunk.

And so each node would be stored in the vector store with an embedding attached to it. This is becoming more and more popular these days. The vector store will essentially store a set of documents, each document with an embedding, and then during query time you would have this query. You would take this query, you would take an embedding model, and then you would embed this query. You would use this query embedding to perform top-k lookup from the vector store to retrieve the most similar nodes.

And there's different query interfaces that this can expose. You could do [unclear] search, you can do hybrid search, you can add metadata filters, et cetera. The idea is you retrieve a set of nodes from the vector store, and then you take the query and feed it into the response synthesis module along with the set of retrieved nodes. We'll talk about response synthesis in just a little bit, but again, the high-level idea is that you have retrieval and then you have synthesis. Retrieval comes by looking up the relevant nodes from the vector store, and then synthesis comes by taking in the set of nodes, pairing it with a query, and generating a response.
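The retrieval half of that flow can be sketched in plain Python. This is only a toy under stated assumptions: embed is a bag-of-words stand-in for a real embedding model, and ToyVectorStore is a hypothetical in-memory class, not the Milvus or LlamaIndex API.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store: each node is a text chunk plus its embedding."""

    def __init__(self):
        self.nodes = []

    def add(self, text):
        self.nodes.append((text, embed(text)))

    def top_k(self, query, k=2):
        """Embed the query and return the k most similar node texts."""
        query_embedding = embed(query)
        ranked = sorted(
            self.nodes,
            key=lambda node: cosine(query_embedding, node[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
for chunk in [
    "the author wrote short stories in college",
    "milvus stores vectors for similarity search",
    "the airport in seattle is large",
]:
    store.add(chunk)

retrieved = store.top_k("what did the author do in college", k=1)
print(retrieved)
```

The retrieved node texts would then be passed, together with the query, to the response synthesis step.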

Another basic example of an index structure that we have, and we have more, but this is just a very basic example, is this idea of a list index, where you take in a set of documents, you chunk them up, and you can choose to store any set of nodes as a flat list. So instead of necessarily indexing it with embeddings for top-k lookup, you can just store it as a linear list of nodes with previous/next relationships. So, for instance, node two comes before node three, and node two comes after node one. This is a very simple data structure, but during query time you can take in the entire set of nodes from the list, optionally add filters if you want, and then combine it with a query and put it into the response synthesis module. It's interesting because this idea of a list index basically allows you to perform summarization queries.
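A minimal sketch of that flat list structure, assuming hypothetical names (Node, build_list_index, retrieve_all) chosen for this example rather than taken from the library:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """A text chunk that knows the positions of its neighbors in the list."""
    text: str
    prev: Optional[int] = None  # index of the previous node, if any
    next: Optional[int] = None  # index of the next node, if any


def build_list_index(chunks):
    """Store chunks as a linear list with previous/next relationships."""
    last = len(chunks) - 1
    return [
        Node(
            text=text,
            prev=i - 1 if i > 0 else None,
            next=i + 1 if i < last else None,
        )
        for i, text in enumerate(chunks)
    ]


def retrieve_all(nodes, keep=None):
    """Return every node in order; `keep` is an optional filter predicate."""
    return [node for node in nodes if keep is None or keep(node)]


nodes = build_list_index(["intro", "body", "conclusion"])
print([n.text for n in retrieve_all(nodes)])
print([n.text for n in retrieve_all(nodes, keep=lambda n: "o" in n.text)])
```

Because retrieval returns everything by default, the heavy lifting moves to response synthesis, which has to cope with more text than fits in one prompt.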

With the default vector-store-based lookup, you'd fetch the top-k most similar nodes from your knowledge corpus. The list index, by contrast, allows you to feed in all the context from any document, or any large subset of documents, into the response synthesis module, so you can, for instance, summarize large chunks of text. We can also talk a little bit about how response synthesis works once you actually have a retrieval model. There's a few strategies for taking in a bunch of different texts and then being able to create a response even if the set of texts is greater than the context length of the language model. One strategy here is create and refine.

So we would start with the first node, take in the query, and generate an initial response using a very similar prompt to what I showed in the beginning: here's some context, and then you put in the context from the node, and then, given this context, answer the question. The difference is that if you have a set of nodes, you can take in the previous response from the previous node. So you take in this intermediate response, you take in the new context from node two, and you take in the query again, and then you pass it back into the language model and ask it: hey, we have an initial response from the previous node, we have this new context, we have this existing question; can you refine the existing answer to give me back a potentially better answer? And then you would iterate through every node within this list until you get back a final response.
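The create-and-refine loop described above can be sketched as follows. The prompt wording is an assumption, and FakeLLM is a hypothetical stand-in that just numbers its answers so the control flow is visible; a real application would call an actual language model there.

```python
def create_and_refine(node_texts, query, llm):
    """Answer from the first node, then refine that answer with each later node."""
    answer = None
    for text in node_texts:
        if answer is None:
            # First node: create the initial answer.
            prompt = (
                f"Here is some context:\n{text}\n"
                f"Given this context, answer the question: {query}\n"
            )
        else:
            # Later nodes: carry the previous answer forward and refine it.
            prompt = (
                f"The question is: {query}\n"
                f"We have an existing answer: {answer}\n"
                f"Here is new context:\n{text}\n"
                "Refine the existing answer if the new context helps; "
                "otherwise repeat it.\n"
            )
        answer = llm(prompt)
    return answer


class FakeLLM:
    """Hypothetical stand-in for a language model; records every prompt."""

    def __init__(self):
        self.prompts = []

    def __call__(self, prompt):
        self.prompts.append(prompt)
        return f"answer v{len(self.prompts)}"


llm = FakeLLM()
final = create_and_refine(["node one", "node two", "node three"], "What happened?", llm)
print(final)             # answer produced after the last node
print(len(llm.prompts))  # one model call per node
```

Each call depends on the previous answer, so the calls run strictly one after another.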

Another approach: the previous approach is sequential, and this approach does things a bit more in parallel. You just take in each node, and then for each node you get an initial response from it: given this node, given this query, give me back an initial answer. And then once you have an answer for each node, you can hierarchically combine the answers into a set of parent answers and continue doing that until you get a final response.

We call this tree summarization. The idea is that you hierarchically build a tree of answers until you get to one root node, and then that root node is your final answer. This tends to be a bit faster, because you can parallelize it via async. The sequential approach tends to have a bit more detail, since you actually iterate through each node one by one. But in the end, this is up to empirical experimentation, just playing around with different techniques.
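Here is a minimal sketch of that hierarchical combination. The fanout parameter, the prompt wording, and the counting fake_llm are assumptions for illustration; the sketch runs the calls in a plain loop, although the calls within one level are independent and could be issued concurrently.

```python
def tree_summarize(node_texts, query, llm, fanout=2):
    """Answer per node, then repeatedly merge groups of answers up to one root."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2 so each level shrinks")

    # Leaf level: one independent answer per node.
    answers = [llm(f"Context:\n{text}\nQuestion: {query}\n") for text in node_texts]

    # Upper levels: combine groups of child answers into parent answers.
    while len(answers) > 1:
        parents = []
        for start in range(0, len(answers), fanout):
            group = "\n".join(answers[start:start + fanout])
            parents.append(
                llm(f"Combine these partial answers to '{query}':\n{group}\n")
            )
        answers = parents

    return answers[0] if answers else None


calls = []


def fake_llm(prompt):
    """Hypothetical stand-in for a language model; counts its calls."""
    calls.append(prompt)
    return f"answer {len(calls)}"


final = tree_summarize(["n1", "n2", "n3", "n4"], "What happened?", fake_llm)
print(final, len(calls))
```

With four nodes and a fanout of two, this makes four leaf calls, two merge calls, and one root call.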

So there's other types of indexes too, and there's other types of integrations that we have. One key integration that we wanna highlight is the integration with Milvus. You can actually use Milvus as a backend store for both your texts as well as embeddings. And the way you get set up is actually pretty simple.

You just define a Milvus vector store with all the parameters that you would wanna have, wrap it in a storage context, and then put it into the vector store index. Then when you actually want to query this index, you can just say query_engine = index.as_query_engine(), and then response = query_engine.query(...). This will query the index that's backed by Milvus and allow you to answer any types of questions that you would have. So we can walk through a few examples of how you actually run LlamaIndex. This is just a demo walkthrough that allows you to ingest data from LlamaHub, build an index over it, and then query it as well.

I'm gonna leave this for now. I'll share the slides, and if we have a little bit of time towards the end of this presentation, I'll quickly walk through the examples. I do, however, want to discuss some of the main use cases of LlamaIndex. From everything that we've described, the entire goal of LlamaIndex is really to orient ourselves towards being a really good interface that allows you to answer basically any type of query over your data with a language model. The idea here is: imagine talking to ChatGPT. You ask a question, you get back a response. How can you maintain that exact same experience, except now ChatGPT, or whatever language model you're using, has visibility into your overall data?

The set of queries that you might ask over your data, whether it's just a simple question or an actual task that you want to give the language model, can vary. So we'll walk through a few examples of different use cases, of queries that you can run over your data using LlamaIndex. The first use case we've already basically been over: it's this idea of semantic search. You can define a vector store index over your data. So you import GPTVectorStoreIndex, you could wrap it with a Milvus vector store, and then you load in a set of documents. Then you can define a query engine on top of this index.

And then you can ask a question like, "Could you give me a summary of this article?" Oh, interesting, I think this question needs to be updated. Basically this question should be, for instance, "What did the author do during his time in college?" So if you just pretend this question represents something that is more about specific facts in your knowledge corpus, that's the case where semantic search does well. This answer is supposed to answer the question, "What did the author do growing up?" And the answer would be: the author grew up writing short stories and programming on an IBM 1401. The idea here is that you would take in this question, and the question would reference specific facts that can actually be retrieved from your knowledge corpus. And then you would retrieve the documents from your knowledge corpus and use them to generate an answer.

And that's the case where semantic search does well, because it allows you to do relevance search or top-k lookup to really retrieve the relevant chunk of text. There are other use cases for LlamaIndex. Another use case is this idea of summarization. How do you not just retrieve the relevant pieces of text from a knowledge corpus, but summarize the entire article? For instance, if you use the list index, which, just as a refresher, stores an entire list of nodes, then during query time, by default, you would retrieve all the nodes and dump them into response synthesis. You can ask something like, "Could you give me a summary of this article in newline-separated bullet points?" And this would take in all the nodes corresponding to this article, however long it is, and add them to the response synthesis module, which will abstract away the complexity of dealing with prompt limitations and give you back a final answer like: the author began writing and programming before college, studied philosophy, and so on, giving you an entire biography of the author.

Another use case, and so far we've been talking primarily about unstructured data, is that we also offer text-to-SQL support over structured data. So we have, for instance, an index defined over unstructured data, and we also have an index defined over structured data as well. This is within our SQL index, and it really consists of two main components. One is conversion from unstructured data into structured data points.

So on the data ingestion side, you can actually ingest unstructured documents and load them into a database. The second part is that once you actually have structured data within a database, we offer a pretty good text-to-SQL interface over this data. You can do default text-to-SQL, which will just take in the table schema, and we use the LLM to infer SQL statements from the natural language query. We also offer additions on top.
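The default text-to-SQL flow, table schema plus natural language question in, SQL statement out, can be sketched with Python's built-in sqlite3 module. Everything here is an assumption for illustration: the items table, its made-up rows, the prompt wording, and fake_llm, which returns a hard-coded statement where a real model would infer one.

```python
import sqlite3


def table_schema(conn, table):
    """Read the CREATE TABLE statement that SQLite stored for this table."""
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    return row[0]


def text_to_sql(conn, table, question, llm):
    """Show the schema to the model, get SQL back, and run it."""
    prompt = (
        f"Table schema:\n{table_schema(conn, table)}\n"
        f"Write one SQL query that answers: {question}\n"
        "SQL:"
    )
    sql = llm(prompt)
    # In practice, model-generated SQL should be validated or run read-only.
    return sql, conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("pen", 2), ("notebook", 5), ("backpack", 30)],  # made-up rows
)


def fake_llm(prompt):
    """Hypothetical stand-in: a real model would infer this from the prompt."""
    return "SELECT name FROM items ORDER BY price DESC LIMIT 1"


sql, rows = text_to_sql(conn, "items", "Which item is the most expensive?", fake_llm)
print(sql)
print(rows)
```

Only the schema is placed in the prompt, not the table contents, which keeps the prompt small regardless of how many rows the table holds.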

Beyond the default, for instance, you could add text annotations or context on top of the tables. You can actually store the table schemas themselves in an index to deal with large numbers of tables, in case you're worried about the table schemas not fitting in the prompt. Another example here is this idea of synthesizing over heterogeneous data. This gets into some of the graph structures that you can define with LlamaIndex. For instance, you can define a vector index over your Notion documents, and you can define a vector index over your Slack documents.

Then you can define a graph structure over these documents by having a list index over your Notion and Slack indexes. The way this graph structure works is that you query the top-level graph with something like "give me a summary of these two articles," or "tell me about the risk factors" if these are financial statements, or, actually a better question, "give me a summary of customer A" for your customer account. This will take the query, route it through the list, and feed it to each index. It would get an answer first from each index, then route those answers to the top-level list index, and then you can generate a final answer. So let's actually walk through an example.

For instance, let's say the question is: "Tell me the airports in Seattle, Houston, and Toronto. If only one city is provided, just give me the airport information; otherwise, tell me a little bit about the airports for all cities." Let's assume that you have a separate index for Toronto, Seattle, and Houston. I know this says Boston here, and that part can be fixed. The idea here is that you would take in this top-level question, it would get routed to each individual index, and you would generate an initial answer from each individual index. Then you would take the response from each index and combine them at the top level through the list index.

And then at the very bottom, you see here, "The airports in Seattle, Houston, and Toronto are...", and you get back a final response that's able to synthesize over all the different data sources that you have. Going back really quick, at a very high level, the idea of defining a graph structure is to define a slightly more complex view over your data so that you can perform slightly more complicated forms of retrieval to answer certain types of questions. Another example is being able to compare and contrast more explicitly different types of documents, or anything that you want really. An example here is, let's say you wanna compare and contrast the sports environment of Houston and Boston. You take in this question, and let's say you compose a graph consisting of a vector index for Houston and a vector index for Boston.

Those are separate indexes. And then you combine them with a list index at the top level. This query would get routed to each index on its own. And we can add something that we call a query decomposition module, where we take in this more complicated question and convert it into a simpler one, like "What sports teams are based in Houston?" and "What sports teams are based in Boston?", and then ask that question over each individual vector index. Similar to before, we get back a response: the answer for the sports teams, for both Houston and Boston. Both answers would then get combined through the list index, and then you get back a final answer.
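The routing, decomposition, and top-level combination described above can be sketched like this. All names are hypothetical: each sub-index is represented by a plain callable, fake_llm is a rule-based stand-in for the language model, and the placeholder answers only echo the question rather than stating real facts.

```python
def decompose(query, topic, llm):
    """Ask the model for a simpler question that targets one sub-index."""
    return llm(
        f"Rewrite the question so it only concerns {topic}.\n"
        f"Question: {query}\n"
        "Simpler question:"
    )


def query_graph(query, sub_indexes, llm):
    """Route a decomposed question to each sub-index, then combine the answers."""
    partial_answers = []
    for topic, answer_fn in sub_indexes.items():
        sub_question = decompose(query, topic, llm)
        partial_answers.append(f"{topic}: {answer_fn(sub_question)}")
    joined = "\n".join(partial_answers)
    # Top level (the role played by the list index): synthesize one answer.
    return llm(
        f"Question: {query}\n"
        f"Answers from each source:\n{joined}\n"
        "Combined answer:"
    )


def fake_llm(prompt):
    """Hypothetical rule-based stand-in for a language model."""
    if prompt.startswith("Rewrite"):
        topic = prompt.split("only concerns ")[1].split(".")[0]
        return f"What sports teams are based in {topic}?"
    return "COMBINED: " + prompt


# Each callable stands in for querying one per-city vector index.
sub_indexes = {
    "Houston": lambda q: f"[Houston index answer to: {q}]",
    "Boston": lambda q: f"[Boston index answer to: {q}]",
}

result = query_graph(
    "Compare and contrast the sports environment of Houston and Boston.",
    sub_indexes,
    fake_llm,
)
print(result)
```

The same shape covers the airport example: swap the topics and sub-indexes and skip the decomposition step if the original question can be sent to each index unchanged.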

So again, defining this graph structure, plus additional stuff like these query decomposition modules, allows you to run more complex queries, for instance being able to compare and contrast across different types of documents. It's a very powerful tool that allows you to ask more complex analytics queries beyond, for instance, simple semantic search. Another use case that I wanna highlight is this idea of multi-step queries, which evokes this idea of chain-of-thought prompting. If you're familiar with that, the idea is that you can break a complex query into multiple simpler ones over a data source. An example here is a question like "Who was in the first batch of the accelerator program the author started?" That's a slightly more complex question because it has multiple parts to it.

Let's say you have access to a data source where you can answer questions about this given author. You could take this question and use the query decomposition module, which is powered by a language model, to infer a simpler question: "What accelerator program did the author start?" And then you could use that to generate an initial response: the author started this accelerator program called YC. Then you feed it back through the query decomposition module and ask, "Who was in the first batch of YC's accelerator program?" Feed it back in through the data source index. Get back an answer.

The first batch of YC's accelerator program started in 2005 and included a bunch of different startups. And then you keep on going until you feel like you've answered all the questions that you can from this data source, given the original question. So then you get back this final answer. The key idea here is that if you have a more complex question, you can choose to break it down into simpler ones until you're actually able to get back a satisfactory answer. Finally, the last bit that I wanna talk about is that there's also very interesting questions that have a temporal nature to them. One example question is, "What did the author do after his time at YC?" Given such a question, if you just do basic semantic search, you're gonna hit a node that only describes the author's time at YC, as opposed to really looking at what came before or after.

So we have a set of abstractions that allow you to continue feeding in additional context that could be relevant to the question even after the basic retrieval process. The basic retrieval, for instance if you do semantic search, is just gonna take this embedding and probably match it with the author's time during YC. And a lot of times it'll be helpful for you to feed in additional context in a temporal manner, for instance by looking at future nodes or at previous nodes. In this case, you wanna look at future nodes.
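One way to picture this is as a post-retrieval step that widens a hit to include its neighbors in document order. The function name, the mode values, and the sample node texts below are assumptions for this sketch, not the library's own abstraction.

```python
def expand_with_neighbors(nodes, hit, mode="next", count=1):
    """Return the retrieved node plus up to `count` neighbors in the chosen direction.

    `nodes` is the ordered list of chunks; `hit` is the index that retrieval matched.
    """
    if mode not in ("previous", "next", "both"):
        raise ValueError("mode must be 'previous', 'next', or 'both'")
    start = hit - count if mode in ("previous", "both") else hit
    end = hit + count if mode in ("next", "both") else hit
    start = max(start, 0)
    end = min(end, len(nodes) - 1)
    return nodes[start:end + 1]


# Placeholder chunk texts in document order.
nodes = ["before YC", "during YC", "after YC", "much later"]

# Semantic search matched the "during YC" chunk (index 1); a question about
# what happened afterwards also needs the chunk that follows it.
context = expand_with_neighbors(nodes, hit=1, mode="next", count=1)
print(context)
```

Choosing between previous and next depends on the wording of the question, for example "before" versus "after".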

Another use case here is this idea of recency filtering, or filtering out outdated nodes. This is actually a feature that's been widely requested by users. Imagine you have, for instance, three timestamped versions of the same data. Some of this data is outdated, and obviously the most recent version is gonna be the most up to date. So when you ask a question, you don't want to confuse the language model with a bunch of outdated context. And so we have capabilities for you to do different types of recency filtering, whether it's time weighting through some sort of mathematical formula, or explicitly sorting by date, and then you can filter out older nodes.
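Both flavors of recency handling can be sketched in a few lines. The talk does not say which formula is used for time weighting, so the exponential decay here is an assumption, as are the function names and the sample records.

```python
from datetime import date


def keep_most_recent(nodes, top_n=1):
    """Explicitly sort by date and keep only the newest `top_n` nodes."""
    ranked = sorted(nodes, key=lambda node: node["date"], reverse=True)
    return ranked[:top_n]


def time_weighted_score(similarity, age_days, decay_rate=0.01):
    """Assumed formula: discount similarity exponentially with node age."""
    return similarity * (1.0 - decay_rate) ** age_days


# Three timestamped versions of the same data (placeholder records).
versions = [
    {"text": "policy v1", "date": date(2021, 1, 1)},
    {"text": "policy v3", "date": date(2023, 1, 1)},
    {"text": "policy v2", "date": date(2022, 1, 1)},
]

latest = keep_most_recent(versions)
print(latest[0]["text"])

# With equal similarity, the fresher node ends up with the higher score.
print(time_weighted_score(0.9, age_days=10) > time_weighted_score(0.9, age_days=400))
```

Sorting and truncating drops old versions outright, while time weighting keeps them as candidates but makes them less likely to be selected.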

This allows you to give back a response that prioritizes more recent nodes first. Sweet. And so that's basically it for the talk. I will share these slides in the chat as well. There's other slides here too that show you different types of tutorials.

For instance, here you could integrate with a downstream application. You could build a chatbot with LlamaIndex plus LangChain as the outer agent abstraction. Here's tutorials on how you can build a Streamlit app. There's a bunch of demo walkthroughs, especially with the new release that we did with 0.6, that allow you to customize retrieval and query engines on top of your data, and also simple router abstractions that we added to help you build this unified query interface. Cool.

And so I think that's it for the presentation, and next up I'll be talking to Frank while answering some questions.

So yeah, thank you for that presentation, Jerry. Great talk, by the way. We have a couple of questions, both from the audience as well as from myself, and I figured we could have more of a conversation, sort of a conversation and Q&A hybrid. We're gonna just talk more about LlamaIndex, talk more about the origin story, and where you guys are gonna be going in the future.

What are some of the exciting features that you guys have planned coming up as well? And the first one is, I think you sort of see two camps of people when it comes to LLMs, right? I think the first is folks who feel like there's going to be continuously evolving models, models that are designed for very specific purposes, for example BloombergGPT on financial data. And then you have another camp who feel like the future is going to be a lot of these very, very general-purpose LLMs, very large models designed to be all-purpose, to essentially do anything that you really can or want to do. I imagine LlamaIndex is going to be useful for both of these paradigms. If we do go for that first route, models that are smaller but more targeted, do you see any differences in how you might potentially use LlamaIndex with these?

Yeah, it's a good question. It kind of gets into the very first part of the presentation, which is about this idea of fine-tuning and distillation.

For instance, you could imagine a world where everybody just does some machine learning process on top of new data to basically train and distill all these specialized models that can do different types of tasks. And that really gets into more classical machine learning, which is why a lot of current models these days really are trained for specific tasks. I think there will probably be a world where we start having more specialized models. In fact, I think that world is probably good, just in that there's less of a monopoly from a single model provider. And I think it also makes a lot of sense from a systems perspective.

And the reason for that is that these large models are amazing, but they are by nature very big. And I'm sure, you know, cost and scale and all these things will come down, but there's probably some sort of fundamental information capacity of these networks, such that there's going to be some minimum size that a model needs to be to be as powerful as it is, right? And so I think, purely for cost and specialization purposes, especially if you want models to be able to run on device or on-prem, you're going to start seeing more distilled and specialized models. And so I'm very excited personally about that type of ecosystem. And then the next part here is that I could see LlamaIndex being used in both of these worlds, because even for these specialized models there are still a lot of these trade-offs I mentioned in the beginning about making sure that you're incorporating the right knowledge from your data, right? Like, you could choose, for instance, to fine-tune on every bit of new information that comes in so that the model is able to incorporate the knowledge, or you could fix the model itself and pair it with a retrieval model. And a lot of times that's way easier to think about: you don't have to do a bunch of machine learning, you can just wire this in as part of an overall data pipeline or system, and then it'll still work for you out of the box.

And then the last part I'll say, going with this idea of fine-tuning and distillation: one thing I'm very interested in is this idea of being able to fine-tune a very good retrieval model. So for instance, make this model way smaller: take something like GPT-4 and just strip out most of the knowledge capabilities. It doesn't need to know about Wikipedia, it doesn't need to know about other stuff, but you keep the fact that it's very good at reasoning about new information that you feed it. And that part I'm very interested in, because I'm wondering how small that model can really get and still have those amazing reasoning capabilities, because that would actually help a lot with what we're building with LlamaIndex. Oh yeah, 100%.

And I think a big problem with a lot of these existing large language models, or autoregressive language models, is the fact that you have it trained on this very, very large corpus of data, and because it's doing a lot of next-token prediction, if it doesn't get that next token correct, you end up having hallucination or you end up having a wrong answer, right? If you can in some way, shape, or form really constrain the model to look at data very specifically from the prompt, I think that would definitely go a long way in terms of both what you're working on with LlamaIndex and a variety of other applications as well. So, great answer. I really appreciate that. We've got some questions from the attendees as well. And the first is probably more general: is LlamaIndex a retrieval model that is paired with an LLM? Yeah, so that's a good question.

It is that, but it is a bit more than that too. So there are different layers to LlamaIndex. The very top-level view of LlamaIndex is that LlamaIndex is just a black box around your data and an LLM, and so you can query LlamaIndex the same way that you would typically query a language model. And then, similar to something like ChatGPT or GPT-4, you would get back a response, but because we manage the interactions between the language model and your data, you'd get back a response that actually has context over your data. Now, under the hood, right underneath that black box, there are a lot of things going on.

There's retrieval over your data, and then there is synthesis, being able to combine stuff into an answer. And this could be a one-step process where you do retrieval, then synthesis, then you're done. Or it could be a multi-step retrieval and synthesis process, for instance if you define a graph over your data or if you use some of the chain-of-thought prompting stuff that I just mentioned. And you could choose to use each of these modules independently too.
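The one-step "retrieval, then synthesis" flow described here can be sketched in a few lines. This is a minimal illustration and not LlamaIndex's API: `embed`, `retrieve`, `synthesize`, `answer`, `stub_llm`, and the sample `CHUNKS` are all hypothetical names, the bag-of-words "embedding" stands in for a real embedding model, and the stub stands in for a real LLM call.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; a real system would call an embedding model here.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, top_k=1):
    # Step 1: rank stored chunks by similarity to the query, keep the best ones.
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]

def synthesize(query, context, llm):
    # Step 2: put the retrieved context into the prompt and ask the model.
    prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query
    return llm(prompt)

def answer(query, chunks, llm, top_k=1):
    return synthesize(query, retrieve(query, chunks, top_k), llm)

def stub_llm(prompt):
    # Placeholder for a real LLM call; it just echoes what it was given.
    return "stub answer given: " + prompt

CHUNKS = [
    "Our refund policy allows returns within 30 days.",
    "The office is closed on public holidays.",
    "Shipping takes five business days.",
]

print(answer("refund policy", CHUNKS, stub_llm))
```

Because `retrieve` and `synthesize` are separate functions, either one can be used on its own, which mirrors the point about using the pieces independently.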

And so you could totally choose to use LlamaIndex as a retrieval model by itself if you also wanted to pair it with your own application logic. Yeah, yeah. Great response there as well. A great follow-up to that, from the audience as well, is: could you have the index be dynamically built for each query, or do you have to have it be manually predefined for every query? That's actually a great question. So I think this is something that we've been thinking about a lot.

Right now the index is user-specified. So the user has to define the set of indexes that they want. To frame this a little bit better: you're the user, you develop this LLM application, you define the set of indexes, which means, I guess, that a priori you have some sense of the types of questions that this application might receive, so you want to almost prepare for that with the set of indexes that you think would make sense. Now, one thing that we think is very powerful, though, is that different indexes are optimized for different use cases.

If you have a vector index, it's better for semantic search. If you have a list index, it's better for summarization. If you have one of those fancy graphs that I just showed, it's better for comparing and contrasting stuff. One thing that we've just recently introduced is the idea of a better router abstraction, where you can define some set of tools. Each tool is better for certain types of queries, and you wrap all of those under some sort of router.

And it's similar to this agent-tool paradigm, where basically what it does is unify all these tools under a single query interface. So you could take in a query, and it would hit the router, and then the router can pick the right tool for the job in an automatic fashion, instead of you as a user needing to anticipate, hey, this index should try to solve all these different types of queries. That's awesome. Right, so you can focus on developing different types of indexes and query configurations for different use cases, and then wrap that all under some sort of router abstraction.
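As a rough illustration of the router idea, the sketch below wraps several tools behind one `route` function. It is not LlamaIndex's router API: `Tool`, `keyword_selector`, `route`, and the tool names are hypothetical, and the selector uses simple word overlap with each tool's description where a real router would typically ask an LLM to choose.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Tool:
    name: str
    description: str            # what kinds of queries this tool is good for
    run: Callable[[str], str]   # stand-in for a query engine over one index

def keyword_selector(query: str, tools: List[Tool]) -> int:
    # Assumption: word overlap replaces the LLM-based choice a real router makes.
    query_words = set(query.lower().split())
    scores = [len(query_words & set(t.description.lower().split())) for t in tools]
    return scores.index(max(scores))

def route(query: str, tools: List[Tool], selector=keyword_selector) -> Tuple[str, str]:
    # Single query interface: pick one tool, run it, report which one was used.
    tool = tools[selector(query, tools)]
    return tool.name, tool.run(query)

TOOLS = [
    Tool("vector_index", "look up a specific fact or detail",
         lambda q: "semantic search result for: " + q),
    Tool("list_index", "summarize a whole document",
         lambda q: "summary for: " + q),
    Tool("graph", "compare and contrast several documents",
         lambda q: "comparison for: " + q),
]

print(route("summarize this document", TOOLS))
print(route("compare these two reports", TOOLS))
```

Each tool can be developed and tuned separately; only the descriptions and the selector decide which one a given query reaches.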

So that's something that we're super excited by. And I think this goes into being able to anticipate the questions that you can ask in a more automatic fashion. And then the next part that I actually haven't talked about is: how do you automate the indexing process too, right? This is more about how you automate the query engine process to route to the right index, but in a production system, if you have a ton of data coming in, how do you automatically figure out the best indexes over this data, given both your use case and your system requirements? That part is still a work in progress. Yep, fair enough. When can we expect this router abstraction to be in LlamaIndex? So the router abstraction is already in LlamaIndex, actually. We've had that for a while, and we just recently revamped it, like, yesterday basically.

And so, if you want, here's a free plug for the new blog post that we just put out. And then, as for some of the more automatic indexing: I do think the routing abstraction is just one step. I think there are a lot more steps that you can take to build a better automated interface to really handle different types of queries and execute them over your data. And so, for instance, auto-indexing during build time, being able to optimize stuff like token usage, all of these things we're still continuously improving.

Good stuff, good stuff. A follow-up to that is: what is the best LLM for running locally and using with LlamaIndex? I'll let you answer that first, but I can probably guess what your answer's going to be. Yeah, you know, well, actually, I haven't done extensive testing on this, so I don't even know if I'm the best person to answer this, and so we'd love to hear your thoughts. I have played around a little bit with the Alpaca stuff as well as StableLM, and then a little bit of Hugging Face.

I think they're all decent, I guess. I still think OpenAI, like ChatGPT, is just better. But I think for a lot of basic tasks they can work. But honestly, I'm probably not the best person to answer, so I would actually love to get additional feedback and insight from the community. Yeah. Yeah.

Well, I think a lot of different models all have their own things that they're good at, I would say. And I don't think there's a very easy way to answer that, unfortunately. But yeah, as you mentioned, OpenAI's is probably the gold standard for now, you know, GPT-4. But what I really, really love about LlamaIndex is having the capability of potentially a full-stack open source solution, from your indexing all the way to whatever LLM you're using, right. Yep.

While there are a lot of open source LLMs out there today, each of them, I would say, is trained a little bit differently. They have things that they're good at and things that they're not so good at. And definitely a bit of experimentation is required before moving forward, but I would say any model that is based off of Facebook's LLaMA model is probably a good place to start. Yeah, probably.

Yeah, I've heard it's pretty good. I feel like these days every day I see some new open source version of some proprietary model coming out too, and so I wouldn't be surprised if there is some sort of convergence there. Yeah, for sure. And then, you probably get this question a lot, which is one of the reasons I'm leaving it a little bit more towards the end, but could you compare LlamaIndex and LangChain? Yeah, for sure.

I mean, yeah, it's a very popular question. Yeah. It's a good one too. So I think the way I would describe it is that, if you just saw the presentation, we're very focused on retrieval and synthesis over your data. I think that pretty much is the main goal that we're focusing on, and we're focused on both that as a set of building blocks, and then also packaging that into an overall system that addresses all the dimensions that we talked about, for instance ease of use, performance, latency, cost, all these different things.

And so we are just really thinking deeply about all of the abstractions that we want to take. And the goal is to just make it really easy for users to interact with and query their data. There are some overlaps with LangChain, but I would say LangChain has somewhat lighter abstractions around this area. And, you know, we also make it really easy to integrate with your LangChain app. What LangChain has is a bunch of prompts, evaluation, agent abstractions, some retrieval stuff, and then also being able to build chatbots and a lot of different things.

And so we see ourselves as a really, really good data plugin, right, that you can just use as part of some outer client abstraction, whether it's an agent from LangChain or a ChatGPT interface or anything else, really, even stuff like Auto-GPT. So LlamaIndex as a core module, that's really where we see ourselves going, and if you do use it, we ideally would love to offer all the functionality that you would expect over your data, so that you can easily plug it into your downstream application. Yeah, and I also want to add that in Jerry's slides there's also a great one about a demo that's actually using LlamaIndex along with LangChain. So I think, if you dive a little bit deeper into it, they're two very, very complementary projects. Yeah.

But great question nonetheless. Piggybacking on our previous conversation about which open source LLM to use: is there any LLM, in your eyes, that performs close to GPT-3.5? And also, I guess as a follow-up, is there any one that is comparable to GPT-4? Oh, yeah, I think the main one is probably Anthropic. I think that seems to be a pretty common sentiment too. I have played around with other models, and they seem decent, I guess, but I think the main thing is that for a lot of these models there are a few things you look for, right? One is, how often does it hallucinate? Two is, how often does it make the wrong decision? If you put it in some sort of repeated reasoning chain, it actually just picks the wrong response.

And three is, you know, what's the quality of that output? And so I've stress-tested a few of these models through the retrieval and synthesis stuff; there are a few indexes within LlamaIndex that allow you to do repeated reasoning and synthesis calls. And I think a lot of these models do get tripped up at some point in the middle. They're not able to fully reason over this information. They can hallucinate wrong information. And I think GPT-3, and especially GPT-4, are just very good at being able to output good-quality answers without too much prompt tuning.

And I think that's something that's very powerful. And I think that's also why, for instance, GPT-4 is starting to be used in all this Auto-GPT stuff, because you can trust it to do some basic reasoning over this data and do it in a way that doesn't propagate a ton of errors through repeated calls. And I think that's something that you can't really do with a lot of these LLMs so far. Yeah, and just to add to that very briefly as well: when it comes to GPT-3.5 and open source, I'm not going to say any LLaMA-based model is going to be comparable to it, but I think you can get pretty close.

And really I think the quality of OpenAI's training data is a little bit better than that of a lot of open source models. When it comes to GPT-4, you probably won't get any open source models that are close to it right now, unfortunately. They're probably about a generation behind, but I think a lot of these open source models will continue to catch up. And there's a lot of open work that's being done right now on that front as well. GPT-4, if I had to guess, probably has something like around a trillion parameters. I don't know if you have any thoughts about that either, Jerry.

And also I think the reasoning capabilities come from, you know, not just a longer context length; it's also a deeper model as well, right. Mm-hmm. And I think the depth is probably what really contributes to the improved reasoning capabilities of GPT-4, which is not something that's available in the open source world, unfortunately. Yeah, I mean, I am curious about whether you can get reasoning capabilities similar to GPT-4, but in a much smaller package. Right.

And I think that's still an open question. Oh yeah. Yeah. I've always been curious whether you could add some recurrence on top of a smaller model, and whether that would give you at least some semblance of improved reasoning capabilities.

But anyway, that's a story for another time. The next question: why, when indexing PowerPoints or PDFs, does the LLM understand the information a little bit differently? Oh yeah. So, on indexing different types of data: the way LlamaHub works, which is our data ingestion piece for different types of data sources like PDFs, PowerPoints, APIs, et cetera, is that we just convert whatever format it is into text. So if it's text in the PowerPoint, it's text, right? If it's an image in the PowerPoint, we'll convert it to text through some image captioning model or something. If it's PDFs, we'll run OCR over the PDF, or we'll do whatever it takes, depending on the parser, to convert it into text.

So there is a piece that I haven't really discussed, which is the quality of the parsing from whatever format it is into text. For instance, if your text parser is just really bad, then it's going to be harder for the language model to index and understand stuff. And so I think when you're building this overall system, the text parsing does matter; you do have to translate it into some text format. But I do think it's a lot easier than before, because GPT has the capability of understanding raw text without you having to do too much processing over it. As long as you clean it up in some minimal way, it tends to work pretty well already. And that's why you have something like LlamaHub, where you can ingest stuff from any type of data source into a format that you can use.

Great stuff. Yeah, I think, just double-checking, there's a question in the chat too, which I think is different from the Q&A, but I just want to make sure that we can talk about some of these. Totally. I know we probably only have time for maybe one or two more questions, coming up on the final five minutes of this conversation.

But if there are any, you know, a couple that you'd like to answer in particular, we can go for those. Oh, for sure. I just want to double-check on this, because I think Archer asked: is custom synthesis over heterogeneous data scalable? And have you looked into scalability numbers with LlamaIndex? Right. And I think that's a good question, because that's something I do want to talk about, which is that scalability is definitely a challenge. Anytime you have chained LLM calls, it's just going to be a bit slower, because you're repeatedly calling the language model with more data, and it's going to cost you more money.

And so I think that fundamentally is a challenge that we've been thinking about a decent amount. That said, there are ways you can try to make this kind of synthesis over heterogeneous data a bit easier to reason about. One is that we offer an async API, so you can parallelize all these calls across the different nodes. So you don't have to wait sequentially for one node of your graph to complete before moving on to the next node. You can just parallelize the calls there.
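The benefit of issuing the per-node calls concurrently can be shown with Python's standard `asyncio` alone. This is a generic sketch, not LlamaIndex's async API: `query_node`, `query_sequential`, and `query_parallel` are hypothetical names, and `asyncio.sleep` stands in for the latency of one LLM call against one sub-index.

```python
import asyncio

async def query_node(name, question, delay=0.2):
    # Stand-in for one LLM call against one node (sub-index) of the graph.
    await asyncio.sleep(delay)
    return name + ": partial answer to " + question

async def query_sequential(nodes, question):
    # Waits for each node in turn, so latency grows with the number of nodes.
    return [await query_node(node, question) for node in nodes]

async def query_parallel(nodes, question):
    # Starts every call at once; total latency is roughly the slowest single call.
    return await asyncio.gather(*(query_node(node, question) for node in nodes))

nodes = ["index_a", "index_b", "index_c"]
print(asyncio.run(query_parallel(nodes, "what changed this quarter?")))
```

`asyncio.gather` returns results in the order the calls were passed in, so a later synthesis step can still line answers up with their source nodes. Concurrency reduces wall-clock time only; the number of LLM calls, and so the cost, stays the same.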

The next part here is that the default we showed was defining a list index over different vector indexes. That's really more for demo purposes. A list index, by the way, is not really scalable, because, just by virtue of being a list index, you're dumping all the data into the language model, right? You're asking it to process all the data. Another way you can think about this is, imagine you have some sort of vector index on top of other vector indexes.

Then you could first retrieve a subset of documents that are relevant to your top-level query, and then within those documents look for a specific piece of information. So there are different things you can do with the composable graph structure that can make some of these lookups and synthesis and all these different things a little bit more scalable. Yeah.
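The two-level lookup described here (pick relevant documents first, then search only inside them) can be sketched as follows. This is an illustration under stated assumptions and not LlamaIndex's composable graph API: `overlap`, `two_level_retrieve`, and the sample `DOCUMENTS` are hypothetical, and plain word overlap stands in for vector similarity at both levels.

```python
def overlap(query, text):
    # Toy relevance score: number of shared words (stand-in for vector similarity).
    return len(set(query.lower().split()) & set(text.lower().split()))

DOCUMENTS = {
    "billing": {
        "summary": "billing invoices and refunds",
        "chunks": ["refunds are issued within 30 days", "invoices are sent monthly"],
    },
    "shipping": {
        "summary": "shipping and delivery times",
        "chunks": ["delivery takes five days", "shipping is free over 50 dollars"],
    },
}

def two_level_retrieve(query, documents, top_docs=1, top_chunks=1):
    # Level 1: rank documents by their summaries and keep only the best ones.
    ranked_docs = sorted(
        documents,
        key=lambda name: overlap(query, documents[name]["summary"]),
        reverse=True,
    )[:top_docs]
    # Level 2: search chunks only inside the selected documents.
    candidates = [chunk for name in ranked_docs for chunk in documents[name]["chunks"]]
    return sorted(candidates, key=lambda c: overlap(query, c), reverse=True)[:top_chunks]

print(two_level_retrieve("when are refunds issued", DOCUMENTS))
```

Only the chunks of the selected documents are scored and passed on, which is what keeps the amount of text sent to the language model bounded as the number of documents grows.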

Good stuff. We probably only have time for one more question. I know we actually have several more, so what we'll try to do is see if we can get to the remainder of these questions offline. Sure. But the final one, again from the audience as well, is about the query itself, right? Oftentimes when you do a prompt or a query, the query itself has a lot of knowledge embedded in it as well, right? So how do you find important aspects of queries, or query frequency, and are there better ways to utilize the query in conjunction with whatever index or indices you have in LlamaIndex to give better or more polished responses? That's a good question.

I think a lot of users have actually asked for this feature. We don't have explicit tutorials or demos for this yet. But it's something that I think is going to be really important, because the idea is that you have a bunch of queries, and you want to look up similar queries in addition to just looking up stuff from your knowledge corpus. One way to think about this is as a memory, but a memory for your queries, really, as opposed to a general chatbot conversational memory. I think a very basic example is that you could just create a separate background index for the queries, right? Or for the static queries and responses.

And so then when you do a lookup, you could look up from your knowledge corpus, which, let's say, has an index or graph structure defined over it, and then you have an index defined over your query information. You could look up stuff in both sources and then figure out how to combine them. Now, for deeper interactions between this query memory and your knowledge corpus index, I think that's something that we'll probably investigate ourselves. Yeah. So I think that's a very interesting problem, though.

And I think we do want to think a bit deeper about the right architecture for this. Yeah, definitely. I would say that's probably a bit more of an open question as well, and I would love to see, as there's more research coming out in this area, what some of the interesting applications are and what some of the interesting ways of doing this are. So, yeah, I think we're at the top of the hour here.

And, you know, thank you for going through this Q&A. Thank you for that awesome presentation as well, Jerry. I'm going to throw it back to Emily. Thanks, guys. Yeah, thank you, Jerry, for the great presentation.

That was really wonderful. And Frank, thank you so much for hosting such a great Q&A session. We do have a lot more questions, so we're going to do our best to get some answers from Jerry offline, and we'll probably pull that into a blog roundup. We'll make sure to send that out to everybody, because we do want to get to those, and we do need to let Jerry get on to the rest of his meetings today. So we're going to let him out a little bit early.

Thank you all for joining us. We really hope you enjoyed the session. You'll receive a link to the recording and some follow-up materials from us, so keep an eye out for that. And then we hope to catch you at the next Zilliz webinar. You can check those out at zilliz.com/event.

Thanks everyone. Awesome. Thank you. Thank you.