Webinar
Advanced Retrieval Augmented Generation Apps with LlamaIndex
Okay, I'm pleased to introduce today's session, Advanced Retrieval-Augmented Generation Applications with LlamaIndex. Our guest speaker is Laurie Voss. Laurie is VP of Developer Relations at LlamaIndex. He's been a web developer for 27 years and was a co-founder of npm, Inc.
And he cares passionately about making technology accessible to everyone by demystifying complex technology topics. Welcome, Laurie. Hi. Thanks for having me. Shall I go ahead and share my screen? Yes, please.
Excellent. Thank you, Christy, for the kind introduction. Hello, everyone. So what are we going to be talking about today? We're going to recap retrieval-augmented generation very briefly: what is it, why do we do it, what are the challenges of retrieval-augmented generation, and how do we do it? And then we're going to go into LlamaIndex and how it helps you get retrieval-augmented generation done.
We're going to dive into seven techniques that can level up your RAG game, and then we'll wrap up by talking about how to get your app into production. Let's start with retrieval-augmented generation, or RAG. Fundamentally, RAG is a response to a limitation of LLMs.
They're trained on enormous mountains of data, but they're not trained on your data. Your data sits behind your firewall where OpenAI can't see it. It's sitting on hard drives or in databases or behind an API. Your data is the most interesting data from your perspective, and if you want an LLM to process and answer questions about that data, you have to give it to the model.
But you can't simply give it all of your data at once, because models can't handle it. If you give ChatGPT all of the documents at your company at once, it will barf. Context windows are in the hundreds of thousands of tokens at the moment, but your company probably has hundreds of millions of tokens' worth of data.
Even in a universe where an LLM had an infinite context window, it would still not be practical to give it every piece of information your company has ever had every time you want to ask a question. You have to be selective, and that is where the retrieval part of RAG comes in. You have to give the model the most relevant context, and that is a very complex problem.
But there's more to it than just context size. There are a lot of unique challenges to how you do RAG. What do we get out of doing RAG? We get four things. First, we get accuracy. Obviously, we want the right answer, but by retrieving the most relevant data and giving it to the LLM at query time, we make sure that the answer is not just correct, but as complete as it can possibly be.
Incomplete answers can be just as misleading as answers that are entirely wrong. Closely related to accuracy is what we call faithfulness. Terms differ for this across the industry, but essentially it means that there are no hallucinations. LLMs sometimes try too hard to be helpful; giving them the data, and only the real data, helps prevent them from inventing things.
Nobody wants hallucinations. They're one of the most frustrating parts of building AI applications. Another big challenge is recency. While you can, to some degree, improve accuracy and faithfulness by fine-tuning a foundation model on your data, doing so is very time-consuming and expensive, and your data changes all the time. Sometimes it changes from minute to minute.
So the only practical way to keep up with data that changes in real time is to use retrieval-augmented generation. And finally, there is provenance, by which we mean the LLM can say where it got the answer. Being able to cite your sources is good in an academic context, but when you're talking about search, it's often essential. Finding the document that the answer came from is often the point of a search; it's not just a nice-to-have feature.
So that is why we do RAG. Now let's get to how we do RAG. The core of RAG is retrieval, and there are a bunch of ways that you can do that. At the highest level, you can, for instance, just do keyword search using the same algorithms that search engines like Google have been using for years. They are pretty good at finding information.
It turns out we've been working on that problem for a long time. Then there are structured queries. If you have a huge relational database full of data, it doesn't make sense to dump it all to text and then give that to an LLM, because you have all of this valuable relational context. Instead, LLMs are getting pretty good at writing SQL queries, and you can get the LLM to query your database and retrieve important context that way. And then there's vector search, which is the one that you hear about most often, partly because it's the new kid on the block, but also because it's really uniquely powerful. Here's a brief recap of how vector search works.
The first thing you do is convert your data into numbers, specifically gigantic arrays of numbers called vectors. The models that do this work are closely related to LLMs, and they encode your data as meanings rather than as keywords. The total set of available vectors is enormous, and that is known as the vector space. Converting text into numbers is therefore known as embedding your text into vector space, and the numbers are called embeddings for short, which is confusing.
I think there are a couple of degrees of separation between what it is and the name for it, but embeddings is what we call them. The effect of embedding all of your data this way is that you can then take your query and embed it into vector space the same way. Because of the way vectors encode meaning, the answer to your query is likely to end up mathematically nearby in vector space and can be located with relatively simple math. So this allows you to do retrieval-augmented generation: you embed all of your data, then you embed your query, you retrieve context that's nearby to your query, and then you feed that context and your query to an LLM, and you get a reasonable answer to your query. It's not just regurgitating the context; it contextualizes and explains it as well.
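To make that flow concrete, here is a minimal conceptual sketch of embedding-based retrieval. It is not LlamaIndex's internal implementation: it assumes the llama-index-embeddings-openai package and an OpenAI API key, uses toy documents, and does by hand what LlamaIndex's retrievers do for you.

```python
# Conceptual sketch: embed documents and a query, then rank by similarity.
import numpy as np
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding()

documents = [
    "Our 2021 filing reported $12M in revenue.",   # toy data
    "The office dog is named Biscuit.",
]
doc_vectors = [np.array(embed_model.get_text_embedding(d)) for d in documents]

query = "How much revenue did we report in 2021?"
query_vector = np.array(embed_model.get_query_embedding(query))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank documents by similarity to the query; the top hits become the LLM's context.
ranked = sorted(zip(documents, doc_vectors),
                key=lambda pair: -cosine_similarity(query_vector, pair[1]))
print(ranked[0][0])
```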
So you can use this amazing vector search, or you can use keyword search or structured queries, but of course you can do even better and use all three of those at the same time. Keyword search is going to work really well if what you're looking for happens to be the right keyword, and vector search will work even if you use a different keyword that has the same meaning as what you're searching for. So in practice, you want to do all of them to some degree. That brings us to LlamaIndex. LlamaIndex is an open-source framework that connects your data to LLMs, and it's available in Python and TypeScript.
And connecting your data to LLMs is one of those problems that's easy to say but very tricky to do in practice. Let's take a look at a typical RAG pipeline. First, you load your data from your sources: you ingest it and you prepare it to be processed. There's a lot of complexity in that process. Then comes indexing.
This is where you embed your data, preparing it for vector search, which, like I said, is not the only way but a great way. Then you put that data in a vector store, and there are lots of choices to be made about how to put it into a vector store and, obviously, which vector store to use. When you're ready to query that data, you first have to retrieve the most relevant context. This is another task with a ton of hidden complexity, often involving post-processing of the data you have retrieved.
Then you have to combine that context with a prompt, and everybody knows that prompting is its own whole art. Combining a prompt with context can also be a task with hidden complexities. And finally, you get your result, which, depending on what you're doing, might require further processing, like turning it into valid JSON or something like that. So there's a lot going on here. RAG is magical technology, but magic is tricky to get right.
So let's talk about how LlamaIndex gets involved at every stage of this process. Before you get started, you have to pick your LLM. We support dozens of LLMs, and more every day, via APIs like Ollama and hosted options like BentoML. Let's start with ingestion. The very simplest form of ingestion is to just load a bunch of files on disk into memory directly, and here's how you do that.
In two lines of code in LlamaIndex, you can do the actual ingestion part in one line of code; the second line is your indexing. The SimpleDirectoryReader that I'm showing here can handle a huge variety of file formats, including CSVs, PDFs, and Word files, and also images, audio, and video for when you want to do multimodal things.
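The slide code isn't reproduced in this transcript, but the two lines being described look roughly like this, assuming a recent llama-index release and a local ./data directory standing in for your files:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()  # line one: ingestion
index = VectorStoreIndex.from_documents(documents)       # line two: indexing
```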
What if your data isn't in static files? Like I said, your data lives everywhere. What if it's in Google Drive or Notion or a database of some kind? LlamaIndex also provides a registry called LlamaHub, available at llamahub.ai, and the hub provides a huge library of software to help make building your RAG applications easier. That includes connectors to your favorite data sources, tools for building agents, datasets for evaluating your applications, and a set of things we call Llama Packs, which are essentially arbitrary code packages that turn various complicated tasks into one-liners. I'll mention all of these again as they become relevant, but the one that's relevant right now is the loaders. So if you need to load from some other source, you can check out LlamaHub, but for this example we're going to stick with SimpleDirectoryReader. If you're happy with our defaults, then you're done and you can move on to line two, which is where we index the data and send it to an embedding model.
But you might not be satisfied with our defaults, in which case you want to build an ingestion pipeline. This lets you configure a series of transformations: you can specify how you want to split up your documents for embedding, what metadata you want, and what embedding model to use. The result is a set of nodes, which you can then use to start indexing. And of course, in production you'll probably end up running the same pipeline over and over as you experiment with these parameters.
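The slide's pipeline isn't in the transcript; here's a hedged sketch of the kind of pipeline being described, reusing the documents loaded above and assuming the SentenceSplitter and TitleExtractor transformations plus an OpenAI embedding model (swap in your own choices):

```python
from llama_index.core import VectorStoreIndex
from llama_index.core.extractors import TitleExtractor
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=20),  # how documents are split
        TitleExtractor(),                                    # metadata to attach
        OpenAIEmbedding(),                                   # embedding model
    ]
)

nodes = pipeline.run(documents=documents)  # the resulting nodes...
index = VectorStoreIndex(nodes)            # ...feed straight into indexing

# the pipeline's cache can be persisted and reloaded between experiments
pipeline.persist("./pipeline_cache")
```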
So we allow you to cache your ingestion pipeline and reload it instantly if nothing has changed, only rerunning the parts that have been modified. That brings us to the second stage, which is indexing. As I mentioned, this is the stage where you take your chunks of data and embed them into vector space. We support a huge set of embedding models, just like we support a huge set of everything.
These all have different performance in terms of both speed and quality of retrieval. Your workhorse here is the VectorStoreIndex, which takes care of getting all of that embedding done, whether it's via a remote API or a local model. You'll see us call out to VectorStoreIndex over and over in the examples coming up. Depending on your use case, you may also want to check out our knowledge graph index, which can take unstructured text, split it into entities and relationships, and then perform entity-based queries on your data. But now you have all your embeddings, and it's time to move to the third step, which is to store them. We support dozens of vector stores.
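As a hedged example of the storage step, here's what wiring VectorStoreIndex to one concrete store (Chroma) can look like. It assumes the chromadb and llama-index-vector-stores-chroma packages; any of the other supported stores follows the same StorageContext pattern.

```python
import chromadb
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

# a local, persistent Chroma collection acting as the vector store
db = chromadb.PersistentClient(path="./chroma_db")
collection = db.get_or_create_collection("webinar_demo")
vector_store = ChromaVectorStore(chroma_collection=collection)

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
```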
Vector stores are all doing roughly the same thing: they take giant piles of numbers and allow you to search against them using vector math. However, there is a lot of differentiation in the space, as we're going to see later, in which ones allow metadata filtering and hybrid search. I'm going to explain what metadata filtering and hybrid search are, which brings us to the next stage, which is querying, and to the thing that we're not going to be talking about very much, which is prompting. Other RAG frameworks talk a lot about prompts and prompt engineering, but we take a different view.
LlamaIndex is a batteries-included framework, which means that we have come up with excellent prompts already. They're tried and tested and finely tuned, and we've built them into the framework so you don't have to figure it all out yourself. Of course, if you want to, you can modify the prompts. That starts with the most basic thing, which is that you can inspect the prompts to see what defaults we've provided, and you can pass in your own, either in advance or at query time. The syntax for prompt customization will be very familiar to you if you've used any other library to do this stuff before.
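A sketch of that inspect-and-customize flow, assuming the index built in the earlier snippets; the prompt-dictionary key shown is the one used for the default text QA template in recent llama-index releases, and the pirate instruction is just an illustration:

```python
from llama_index.core import PromptTemplate

query_engine = index.as_query_engine()

# inspect the default prompts LlamaIndex ships with
for name, prompt in query_engine.get_prompts().items():
    print(name)

# swap in your own prompt for the default text QA template
custom_qa_prompt = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Answer the query in the voice of a pirate.\n"
    "Query: {query_str}\n"
    "Answer: "
)
query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": custom_qa_prompt}
)
```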
But essentially, prompting is giving arbitrary text to an LLM that instructs it on how to answer the question. So now we're at querying. Let's start with the most basic form of querying: we just get our index to give us a query engine, we accept all the defaults that way, and we just run the query. This is great.
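In code, that default path is about as short as it sounds; a minimal sketch, with a placeholder question:

```python
query_engine = index.as_query_engine()
response = query_engine.query("What did the company report in its 2021 filing?")
print(response)
```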
This is part of why LlamaIndex is so easy to get started with: you can go end to end, from loading through ingestion, embedding, storing, and querying to your result, in just five lines of code if you really want to. But of course, like everything in LlamaIndex, your query engine can be customized. What we're doing by default is retrieving the top two most relevant pieces of context from your data and sending them to the LLM for querying. But you can configure your own retriever, as shown here. I'm setting similarity_top_k to five, which means that it will select the five most relevant pieces of context as my context.
You can also configure the synthesizer, which is how the engine puts the query together before it sends it to the LLM. There are a bunch of available strategies here that we don't have time to go through in this talk, but effectively what they all do is group your chunks of context together and query the LLM with them. Finally, you can create your query engine by combining your retriever and your synthesizer and querying the LLM. So that was exactly the same thing that we did in two lines of code, now in ten lines of code, so that we could show all of the stages involved in basic querying.
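The transcript doesn't include the slide's ten lines, but the expanded version being described looks roughly like this, with similarity_top_k set to five as mentioned:

```python
from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever

# configure the retriever: fetch the five most relevant chunks
retriever = VectorIndexRetriever(index=index, similarity_top_k=5)

# configure the synthesizer: how retrieved chunks are combined into the LLM call
response_synthesizer = get_response_synthesizer()

# combine them into a query engine and run the query
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)
response = query_engine.query("What did the company report in its 2021 filing?")
```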
I said that we would look at more advanced strategies, so let's look at the first of those, which we conveniently package up for you as the sub-question query engine. The problem that the sub-question query engine solves is complicated questions. If you have a lot of unrelated data sources, answering a single question may require getting data from more than one of them.
And that's impossible to do in a single query from a single source. So what the sub-question query engine does is break your query up into a series of simpler questions using the LLM; then, given an array of data sources, it routes each sub-question to the appropriate data source and combines the answers from each data source into a single answer. In this toy example, we're creating just one data source. We've created a simple query engine such as we had on previous slides, and we're assigning metadata to it, which specifies what the tool should be called and roughly what it does. The LLM uses this description to decide which data source is going to be able to answer which kind of question.
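A sketch of that setup, assuming vector_query_engine is the simple query engine from earlier; the tool name, description, and question are illustrative:

```python
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

query_engine_tools = [
    QueryEngineTool(
        query_engine=vector_query_engine,
        metadata=ToolMetadata(
            name="annual_filings",                                    # what the tool is called
            description="Annual financial filings for the company",  # roughly what it does
        ),
    ),
    # ...add one tool per data source
]

sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
)
response = sub_question_engine.query(
    "How did revenue and headcount change between 2020 and 2021?"
)
```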
You can have as many of these sources as you want. The next problem that we want to tackle is precision. Basic RAG can struggle with making sure the context is precisely related to the query, and one tactic for handling that is known as small-to-big retrieval. In small-to-big retrieval, we break a large document up into single sentences and perform retrieval on those very specific sentences for maximum precision. But then at the synthesis stage, before we hand the query and the context to the LLM, we go back and retrieve five sentences' worth of context before and after the sentence we just retrieved. That's why it's called small to big.
This gives the LLM more context to work with while maintaining the precision that we want. Getting this done in LlamaIndex is disarmingly simple because, again, we've done all of the work and packaged it up nicely for you. When creating our query engine, we can supply a list of node post-processors, which work after retrieval but before synthesis, and use the casually named MetadataReplacementPostProcessor with a target metadata key of "window". It automatically searches through the retrieved nodes, which all carry links to their previous and next sentences, and stitches that surrounding context together before sending the query to the LLM.
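Here's a sketch of the sentence-window setup just described: the SentenceWindowNodeParser stores the surrounding sentences at ingestion time, and the post-processor swaps them in at query time. window_size=5 mirrors the five sentences of surrounding context mentioned above.

```python
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# split into single sentences, stashing a window of surrounding sentences as metadata
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=5,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
nodes = node_parser.get_nodes_from_documents(documents)
sentence_index = VectorStoreIndex(nodes)

# at query time, replace each retrieved sentence with its wider window
query_engine = sentence_index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window"),
    ],
)
```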
Small-to-big gives you precision by post-processing nodes, but you can also get better precision by pre-processing. One way that you can help the LLM out is to use existing metadata that you have about your data and filter on that before retrieval. In addition to embedding retrieval, some vector databases natively support attaching arbitrary metadata to each of your embeddings. An excellent example of this technique: if you are dealing with documents representing a company's annual legal filings, you could attach the year as metadata and pre-filter your data on that, as shown here with the exact match filter. One of the key lessons here is that if you have metadata about your data, you can help the LLM by giving it less stuff to work with. You can say: I know this answer is going to be in the documents from 2021, so I don't need to give it context from 2020 and 2019.
I can just give it this particular year and get it to do its magic within the documents that I know are relevant. Using the VectorIndexAutoRetriever, we can describe each of the metadata fields and get the LLM itself to decide how to apply the metadata filters, allowing you to continue to query in natural language without any manual refinement on your side. As you can see from this list, most of the vector databases we support also support metadata filtering; only a handful don't. So that's good news for you.
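Two sketches of what was just described: a manual exact-match filter on a hypothetical "year" metadata field, and the auto-retriever variant where the LLM writes the filters from your description of the metadata.

```python
from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores import (
    ExactMatchFilter,
    MetadataFilters,
    MetadataInfo,
    VectorStoreInfo,
)

# manual pre-filtering: only consider nodes whose metadata says year == "2021"
filters = MetadataFilters(filters=[ExactMatchFilter(key="year", value="2021")])
filtered_engine = index.as_query_engine(filters=filters)

# automated filtering: describe the metadata and let the LLM decide the filters
vector_store_info = VectorStoreInfo(
    content_info="Annual legal filings for the company",
    metadata_info=[
        MetadataInfo(name="year", type="str", description="Year the filing was made"),
    ],
)
auto_retriever = VectorIndexAutoRetriever(index, vector_store_info=vector_store_info)
```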
Our next example is hybrid search. Embedding-based vector retrieval is truly magical, but like I said, as an industry we have poured decades of effort into building search engines, and at this point they're really pretty good. So some of the most interesting vector databases are actually search engines executing a pivot. They allow you to do not just top-k vector retrieval but also use existing search algorithms, one of the most prominent of which is called BM25.
Once you've got a database that supports hybrid search, using it is extremely easy: you just pass vector_store_query_mode="hybrid" and then pass a value called alpha. It runs the same query against both vector and traditional search, and alpha determines the degree to which search relies on results from traditional search or vector similarity. Zero means that your results come entirely from traditional keyword search, one means they come entirely from vector search, and you can tune this number to get the best results for you.
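With a vector store that supports hybrid search backing the index, the call being described is roughly:

```python
query_engine = index.as_query_engine(
    vector_store_query_mode="hybrid",
    alpha=0.5,           # 0.0 = pure keyword/BM25, 1.0 = pure vector similarity
    similarity_top_k=5,
)
response = query_engine.query("What did the company report in its 2021 filing?")
```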
As you can see, the list of vector databases that also support hybrid search is a lot shorter. Not on this list is Vespa, because we haven't built a specific integration for it yet, but it does support hybrid search and works well with LlamaIndex. The reason I mention it is because Vespa was invented by Yahoo, and as an ex-Yahoo myself, I'll always have a soft spot for Vespa. Another use case for advanced retrieval is complex documents. Not every document is just a pile of text; they often contain complex tables. So let's consider how we can break down a complex document into a set of simpler queries.
But first, I need to introduce you to our ability to query tables in the first place. We have a fantastic built-in called the PandasQueryEngine. You create a pandas DataFrame in Python, such as by reading in a PDF or a CSV file, and using the LLM, it can generate correct pandas calls that query those tables to get the right values.
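A sketch of the PandasQueryEngine on its own; note that its import path has moved between releases (in recent versions it lives in the llama-index-experimental package), and the CSV here is a hypothetical stand-in for a table you've extracted:

```python
import pandas as pd
from llama_index.experimental.query_engine import PandasQueryEngine

df = pd.read_csv("extracted_table.csv")   # hypothetical extracted table
table_engine = PandasQueryEngine(df=df, verbose=True)
response = table_engine.query("What is the total revenue across all rows?")
```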
We can then provide a description of each table, either written by ourselves or derived by getting the LLM to look at the table and describe what's in it, and create a series of what are called index nodes. These are just like the regular nodes that you saw me pass to the vector store index during ingestion and caching earlier, but the retriever I'm about to introduce knows to treat them specially. Each index node contains a key into a dictionary that we're creating here; the dictionary consists of instances of PandasQueryEngine, one for each table in the document. Now, for completeness, we'll show parsing the rest of the document, from which you can assume we have already extracted the tables. We create a vector retriever that operates over all of our nodes, both the regular nodes and the index nodes.
And now we set up a query engine just like we did earlier: a retriever and a synthesizer combined into a query engine. But this time, instead of a regular retriever, we use the recursive retriever. The recursive retriever expects a list of retrievers to use, so we give it the vector retriever we just created. It also knows that if it retrieves an index node, it should look up the key it finds in that node in a dictionary of sub-query engines and perform a query there. So we pass it the map of IDs to query engines that we created on the previous slide, and we create a query engine with that.
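A sketch of the composition just described, assuming vector_retriever covers all of the nodes and df_id_query_engine_mapping is the dictionary from index-node IDs to PandasQueryEngine instances built earlier:

```python
from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import RecursiveRetriever

recursive_retriever = RecursiveRetriever(
    "vector",                                        # id of the root retriever
    retriever_dict={"vector": vector_retriever},
    query_engine_dict=df_id_query_engine_mapping,    # index-node id -> table engine
)

query_engine = RetrieverQueryEngine.from_args(
    recursive_retriever,
    response_synthesizer=get_response_synthesizer(),
)
```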
So now a query to the top-level engine will retrieve all of the relevant nodes, and if one of those nodes is a table, it will perform a pandas operation on that table to retrieve the exact results from it, rather than just trying to read the table as if it were text, and it will send the synthesized prompt to the LLM, which will return the full answer in context. A very common advanced retrieval use case is wanting to query a SQL database directly. There's no end of really great data in SQL databases, so the use case here is obvious; let's break this down and show how it's done. Under the hood, LlamaIndex is using SQLAlchemy, which, if you've ever done data work in Python, you'll already be familiar with. Here we just connect to a database and initialize LlamaIndex's own SQLDatabase class.
The SQLDatabase class is handled by the NLSQLTableQueryEngine, again one of those really catchy names, which is another built-in from LlamaIndex. What this does under the hood is pass the schema of the table as part of the query to the LLM and get it to generate SQL. This is obviously tricky stuff, so it works best on GPT-4 and other advanced LLMs like Mistral Large, but if you know the name of your table in advance and you're sure the schema will fit into the prompt, this is an incredibly simple way to get it done. You can see how this would fit well into the recursive retriever that we were just looking at, too, because you could make SQL tables just another node that you can query in your retriever.
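A sketch of that simple case, with a SQLite file and a "filings" table as placeholders:

```python
from sqlalchemy import create_engine
from llama_index.core import SQLDatabase
from llama_index.core.query_engine import NLSQLTableQueryEngine

engine = create_engine("sqlite:///company.db")              # hypothetical database
sql_database = SQLDatabase(engine, include_tables=["filings"])

sql_query_engine = NLSQLTableQueryEngine(
    sql_database=sql_database,
    tables=["filings"],   # works when you already know which table to query
)
response = sql_query_engine.query("How many filings were made in 2021?")
```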
But what do you do if that isn't true? What if your table schemas don't fit into the prompt, or you don't know which table to query in advance? In that case, you can create a SQL table node mapping, as we're doing here. You create an index of each of the tables and turn that into a retriever, which you then pass to the SQLTableRetrieverQueryEngine, which is yet another amazing built-in. This will search the index for the most relevant table and then pass that to the NLSQLTableQueryEngine and perform the query as it did before. So it doesn't matter how many tables you have in your database: you can get the LLM to select which table is going to be the most relevant one and run the query for you. How does it know which table to query? It uses the LLM and looks at table and column names, but if you want to be more helpful, you can manually describe the tables, as I'm showing here, to give it more context about which table is likely to be helpful.
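And a sketch of that many-tables variant, reusing sql_database from above; the table names and context_str descriptions are the optional manual descriptions mentioned, and the import path for SQLTableRetrieverQueryEngine has varied between releases:

```python
from llama_index.core import VectorStoreIndex
from llama_index.core.indices.struct_store.sql_query import SQLTableRetrieverQueryEngine
from llama_index.core.objects import ObjectIndex, SQLTableNodeMapping, SQLTableSchema

table_node_mapping = SQLTableNodeMapping(sql_database)
table_schema_objs = [
    SQLTableSchema(table_name="filings", context_str="Annual legal filings, one row per filing"),
    SQLTableSchema(table_name="employees", context_str="Current employees and their roles"),
    # ...one entry per table in the database
]

# index the table schemas so the most relevant table can be retrieved per query
obj_index = ObjectIndex.from_objects(table_schema_objs, table_node_mapping, VectorStoreIndex)

query_engine = SQLTableRetrieverQueryEngine(
    sql_database,
    obj_index.as_retriever(similarity_top_k=1),
)
```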
The final use case that we're going to cover today will also be the most complicated, because it's not just a retrieval strategy; it's an example of an agent in LlamaIndex. Our favorite example of this in the wild is our demo application, SEC Insights (secinsights.ai). SEC Insights was initially intended as a demonstration of LlamaIndex, but it is now a fully open-source demo application that you can adapt and customize. What it does is: you give it financial filings from one or more companies, and it can compare, contrast, and summarize them to tell you what's in the filings. In the latest Y Combinator batch, there is a company where this is their whole business, but for us it's just the demo. Our first step here is to create a query engine for each of our data sources.
Here they are very similar sources, just in different directories, but they could be entirely different query engines using any of the strategies that we've already talked about today. So one of them could be a SQL query engine, one of them could be looking at a pandas table, and one of them could be a sub-question query engine.
You can compose these engines together, but because this is an agent, we define tools to give it that it can select from. As we did earlier in our SQL example, we give the tools metadata so the LLM can decide which tool is going to be best able to answer a question. Now we define our agent, which is, again, surprisingly simple. We give it a nice capable LLM like GPT-4 that is capable of tool use, and we give it the set of tools that we created. The agent, when given a query, will enter a loop where it tries to select the best tool, runs queries over that tool, and continues until it gets an answer.
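A sketch of that agent wiring: the per-filing query engines, tool names, and descriptions are placeholders, it assumes the llama-index-llms-openai package, and a ReAct-style agent is shown here as one way to get the tool-use loop described:

```python
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.llms.openai import OpenAI

tools = [
    QueryEngineTool(
        query_engine=march_engine,   # placeholder: engine over the March filing
        metadata=ToolMetadata(name="march_filing", description="Q1 financial filing"),
    ),
    QueryEngineTool(
        query_engine=june_engine,    # placeholder: engine over the June filing
        metadata=ToolMetadata(name="june_filing", description="Q2 financial filing"),
    ),
]

llm = OpenAI(model="gpt-4")
agent = ReActAgent.from_tools(tools, llm=llm, verbose=True)

# the agent loops: pick a tool, query it, and keep going until it has an answer
response = agent.chat("Compare revenue between the March and June filings.")
```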
In this contrived example, our query is very simple and our tools are very simple. But remember, you could have made any of the examples that we talked about above into just one of the tools available to the agent. So you could have one doing sub-question queries, another doing text-to-SQL, and a third performing hybrid search. You can get up to some really amazing things by composing your sources this way. So we've seen how to build some amazing stuff, which brings us to the final step, which is getting it into production.
I hope we have some listeners to the Latent Space podcast here. It is an excellent podcast about the rise of AI engineers and the future of AI generally, and recently the host, Sean, described 2024 as the year of LlamaIndex in production. We completely agree. One great way to get your AI creations out of a notebook and into production is create-llama.
It is a command-line tool loosely based on create-react-app. It's an application generator that creates a ready-to-ship RAG application with a working front-end interface and a back end in your choice of serverless TypeScript, Node.js, or Python. So you just run this one command and you've got a working app that you can customize and ship. It has dozens of options and templates attached, and it's a really great way to accelerate your development and get to production quickly. Here's a short list of companies that already have LlamaIndex in production.
It's shorter than we'd like, because the truly giant companies that we know are using it are often a little hesitant to say so publicly. But rest assured, there are household names with LlamaIndex in production today. As a brief example, take Gunderson Dettmer, a major law firm dealing with startups. They have an enormous body of internal data informing their current agreements and the state of the tech world. ChatGD is an internal tool built on LlamaIndex that lets their lawyers quickly query that corpus for relevant information and answer client questions more quickly. So let's cover again what we learned today.
We talked about what LlamaIndex is: it's an orchestration framework, but it's also a hub of tools and connectors, and finally a set of tools for getting into production. We covered the stages of retrieval-augmented generation, ingesting, indexing, storing, and querying, in brief, and we talked about LlamaIndex's support for pipelines and pipeline caching, how indexing works, the set of vector stores available to you, and what features they support. We briefly covered prompting, how you can customize your prompts, and how to create a query engine.
Then we went into those seven examples. We started with naive top-k retrieval. Then we went into the sub-question query engine for more complex questions. We talked about post-processing for precision with small-to-big retrieval. We talked about using metadata filtering to increase the precision of queries, and how that filtering can be automated by further use of the LLM. We covered hybrid search, combining the best of vector retrieval with the best of traditional search engines, and we talked about the recursive retriever for indexing really complicated documents.
Then we briefly dipped into the world of text-to-SQL and multi-document agents, and how all of those different techniques can be combined. I hope that this was a useful look at the start of your journey into AI. We are still really early and everybody is learning, so if this was a little overwhelming at points, don't sweat it. I also hope that it gave you a sense of all the amazing things that you can do with LlamaIndex, and I look forward to seeing the amazing things that you build. Thank you all for your time and your attention.
Ah, thank you very much, Laurie. Hey, kudos on being open-source software. I have a question: is there a commercial version of LlamaIndex available? Yes. In addition to the open-source Python and TypeScript versions, we have in beta a service called LlamaCloud, which gives you a managed ingestion service that does all of this stuff for you, using a GUI and just a few clicks.
It's still in beta, but if you go to our website, you can get into the beta by contacting us. Alright, so you heard it here, hot take. Okay, so we have a few questions; I see a couple in the Q&A window.
So Daniel asks: do LlamaIndex's built-in prompts solve the problem of asking off-topic questions, like "When was the universe made?" when the website is about dog food, so your documents would probably be about dogs? That's an excellent question. No, our prompts don't specifically address that; our prompts assume that you're going to be asking relevant questions. However, there are pre-processing and post-processing strategies that you can use to deal with that.
Okay, and a question from Girish: hello, thanks, would you be sharing the slides? Yes, absolutely. If you follow me on Twitter, my username is seldo, S-E-L-D-O.
And I will be putting the slides there right after this talk. Okay. Someone who does not want to be named asks: one of the biggest issues I face is the speed of search. Users are used to incredibly fast keyword search, but hybrid ends up being a bit slower. What's the best initial way to manage results and speed, with full-text search on search engines being the use case, just the recommended bare bones? That is an excellent question. There are not a lot of ways to get around the fact that LLMs are going to have higher latency than traditional keyword search.
One of the features of LLMs, however, is that they are essentially very complicated autocomplete, so they begin to answer the question immediately; it just takes them a little while to finish. So one of the things that you can do is stream your results. Users are in practice a lot more forgiving of waiting for an answer if it starts coming immediately; this is what ChatGPT does. They will wait up to 15, 20, 30 seconds, which is an unheard-of amount of time for somebody to wait for an answer, if the answer has started coming and they can see that it is going to be relevant to them.
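In LlamaIndex terms, that streaming behavior is a one-flag change on the query engine; a minimal sketch, with a placeholder question:

```python
query_engine = index.as_query_engine(streaming=True)
streaming_response = query_engine.query("Summarize the 2021 filing.")
streaming_response.print_response_stream()   # tokens print as they arrive
```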
Okay. So Mike asks: once I have completed my RAG-based app, but it's not performing as well as I would like, what parts of the pipeline and tools do you recommend to improve the accuracy, latency, and cost? Wow, that's another really great question.
People are really listening to this webinar. So accuracy, latency, and cost are three very different questions. If cost is your primary concern, then you're probably going to want to slide down your expectations in terms of accuracy and latency. One of the ways that you can get your cost down is by using a local model instead of a hosted model. There are lots of very capable local models, especially if you fine-tune them on your data.
A local model can, depending on how you set it up, also improve your latency, because local models tend to be a lot smaller and therefore their answers are a lot faster. Accuracy and latency: latency is a tricky one. Like I said, streaming can sort of give you a smoke-and-mirrors response to latency.
But there's no better fix for latency than just throwing a lot of hardware at the problem. Accuracy is the most complicated one. Accuracy is what you get out of using a really subtle query strategy: a really finely tuned prompt, or a query engine that splits your data up into multiple independent data sources and queries them independently using an agent. Those things can really improve your accuracy.
You can also improve your accuracy, latency, and cost by adjusting the size of your retrieved context. When you are splitting up your data initially, you have to turn it into chunks. Those chunks can be very large or very small. The smaller they are, the faster your responses are going to be, but the less context you're going to get in them. So you have to make this trade-off between how much context you want and how fast you want to be able to retrieve it.
So, as you can tell from my answer, which is sort of all over the place, accuracy, latency, and cost are a complicated trio of sliders that you have to move around to find the best combination for your application. I hope that helps. Okay, thanks, Laurie. So another question in the Q&A window, from anonymous.
Thanks for this talk, Laurie, very informative. Question: what uses does LlamaIndex have for data cleanup? I want to not retrieve but actually do the opposite, insert data using my LLM, and then build out my retrieval pipeline. Any tools for this? Huh. That is an interesting use case.
I'm not particularly familiar with that use case, so I can't say that I have any particular tools to answer that one, I'm afraid. In terms of data cleanup, though, if you go to LlamaHub, there are lots of pre-processors available that will help you clean up your data when you are ingesting it in the first place. One of the best ones is a PII detector that will help you get rid of protected personal information, if that's a concern for you. But the more general question here is how do you get your data clean, and that's not really an LLM thing.
Getting your data clean is an old-school data science ETL question, and so all of the previous decades' worth of work on how to clean up ETL data is what applies. You have to do pre-processing, you have to do ingestion, you have to do pre-aggregation, and you have to do a lot of filtering. It's all a lot of nuts-and-bolts programming that doesn't really have much to do with LLMs, but it's very important to getting quality answers. Thank you.
The PII tool sounds really useful. Okay, so Terry asks: can you elaborate on how RAG will be useful in a world where LLMs have larger and larger context windows? Also, could you share your opinions on using open-source versus closed-source LLMs? Absolutely, both good questions. As I tried to mention at the beginning of the talk, we've all heard of Gemini 1.5 Pro, which is going to have a one-million-token context window.
We at LlamaIndex obviously get a lot of questions about what that means for retrieval-augmented generation. The answer is you still need a way to connect your data to this LLM. You still need to know where the data is, and you still need to know what the most relevant data is. Like I said, even if you had an infinite context window, you wouldn't want to give it a million tokens' worth of stuff every time you answer a question.
There's no need to throw a million tokens' worth of data at a very simple question if you know you can get the answer with just a hundred lines of context. It doesn't matter how big the context window is: if you have an infinite context window and you could get the answer with a hundred lines of context instead of a million, the hundred lines of context is always going to be faster. We were talking about accuracy, latency, and speed: your latency is going to be better if you manage to give it relevant context, and your cost is also going to be better if you're giving it relevant context instead of your entire corpus of data every time.
As for open-source versus closed-source LLMs: well, I used to run npm, so I'm a big fan of open source. I think open-source LLMs are doing amazing work, and I think having access to the models democratized is a really, really excellent trend. Great. So I see a few other questions in the chat window. I see Esh was asking: does the RAG support querying Microsoft Excel data? Yes, it does.
We have an adapter for querying Excel data. Okay, and now we have another... oh, sorry. You'll find that adapter on LlamaHub rather than in SimpleDirectoryReader, though. Okay.
Great info. And back in the Q&A window, another question from Terry: what kind of LLM projects would you recommend for AI engineers looking for a job in the field? I think he's thinking of a portfolio project, maybe. Yeah, that's an excellent question. The thing that's most hot right now is agents, excuse me.
Everybody wants to see agents that can demonstrate tool use and actually take actions rather than just answer questions. So if I were looking for a fun LLM project that would also demonstrate that I know what I'm doing, I would combine RAG with an agent. Imagine an agent that can read your calendar and then set up new appointments: that's using RAG. It's going to have to look at your calendar, figure out when you're busy and when you're free, and then take an action, like creating a new event on your calendar.
That kind of project demonstrates the full gamut of stuff that you can do with LLM applications. So something along those lines is a really good project. Thank you for that question. Great, thank you. Okay, now I'm going to go back to the chat window. Tosin asks: is it possible to build a framework to load various kinds of datasets into a vector database? Okay.
Any sample code or approach? Does he mean different sources? Yeah. Okay. I mean, that's what it's all about, right? Yeah, that's what LlamaIndex is for. LlamaHub has hundreds and hundreds of connectors for loading from all sorts of data sources.
And if you have lots of different data sources, you have some choices to make about whether you want to combine them all into a single vector store or maintain one vector store per type of data. Personally, I think storing them in separate vector stores is a good idea, because then you can use tool use, like I said: you can use the sub-question query engine or something like that to say, okay, I've got a question that is probably best answered by this data source versus a question that is best answered by this other data source. You can keep them separate and answer questions separately. Okay.
So we will have the recording out in a few days. And then back in the Q&A window, Daniel asks: what's a use case where LlamaIndex might not be an appropriate solution? LlamaIndex is very firmly focused on RAG, and on using agents to solve RAG, so if that's not what you're doing, you're going to find fewer tools from us for it. It's like, don't try to build Photoshop using LlamaIndex, because that's not what we do. But if what you're doing is retrieval-augmented generation, we're very firmly focused on that use case.
Alright, and then I'm going to go back to the chat window now. Stefan asks: what library is the generated Python server based on, Flask or FastAPI? I assume you mean the API that is generated by create-llama, and that is a Flask application. Okay.
I'm going to jump back to the Q&A window. Anonymous asks: I hear more and more about using pgvector as a store, but what data type would you actually put in Postgres to use it? Maybe I read that wrong. What kind of data type would you actually put in Postgres to use it as a vector store? Oh, he's talking about PostgreSQL, but they've got pgvector, time series vector... wait, tsvector seems basic. I'm not sure about tsvector. I'm not sure either. Okay.
Maybe rephrase that question and try again. Yeah. Anonymous, can you please rephrase your question and we'll try to answer it. Thanks. Okay.
Here's the... I think he's trying to type it. Nope. Okay, I'm going to dismiss that one. Okay.
Are there any plans to support code similarity search or code completion? Yes, or rather, we already do. There are models that are very good at code search, and we connect to those LLMs and to embedding models that are really good at code embedding, so you can do that today. Awesome. And Rajesh asks: we are working on building a GenAI platform for enterprises; is it possible to partner with LlamaIndex? Absolutely.
Please go to the contact-us form on our website and we'll happily partner with you. Alright, so yeah, it looks like that's a wrap on that. Any last words of advice? I think people, or at least myself, would like to know where we can find out more. Where can I get started? Will we get some links for that? Yeah, they'll be in my slides, but also you can just go to docs.llamaindex.ai;
that is a great place to get started. Okay. One more question coming in the Q&A window, from anonymous. Okay.
He's rewritten his question: to use Postgres as a vector store that can be retrieved by LlamaIndex, would you just embed your existing text data using an embedding model and store that as a vector data type? Yes, that is absolutely what you would do. Okay, perfect. All right, I think we've answered everyone's questions, and you gave us a lot of information.
It's amazing how much LlamaIndex does, even with the in-memory vector store part. So thank you, Laurie. I learned a lot too. Thanks, Christy. Alright, thanks.
See you next time. See you next time, RAG folks. Bye. Bye, everybody.