Building an Agentic RAG locally with Milvus, Ollama and Llama Agents
About the session
With the recent release of Llama Agents, we can now build agents that are async-first and run as their own services. During this webinar, Stephen will show you how to build an agentic RAG system using Llama Agents and Milvus.
Topics Covered:
- How to make agents leverage planning, memory, and tools to accomplish a variety of tasks.
- How to empower your RAG system by using a multi-agent framework and by being able to call custom user-defined functions.
- Common issues like hallucinations and how to add fallback and self-correction mechanisms so your agent can try to fix itself.
And today I'm pleased to introduce today's session, Multi-Agent Systems with Mistral AI, Milvus and Llama Agents, and our guest speaker Stephen Batifol. Stephen is a Developer Advocate at Zilliz, who previously worked as a Machine Learning Engineer at Wolt, where he created and worked on the ML platform, and before that as a Data Scientist at Brevo. Stephen studied computer science and artificial intelligence.
He is a founding member of the MLOps Community Berlin group, where he organizes meetups and hackathons. He also enjoys dancing and surfing. Welcome, Stephen. Thank you very much. Thank you for the introduction.
Thank you all for coming today. So we're going to talk about multi-agent systems with Mistral AI, Milvus and Llama Agents. There is also going to be a demo later on during this presentation, where I will actually show you how to build it. You will also see that not everything is deterministic, so sometimes you get answers that you like, and sometimes you don't.
So I will hopefully have the right answers during the demo. For the people that don't know me, I'm going to start by introducing myself. I'm Stephen, I'm a Developer Advocate at Zilliz and Milvus. If you have any questions, feel free to send me an email or message me on LinkedIn or Twitter; I'm basically available everywhere.
You can ask anything related to Milvus, or anything related to AI in general, generative AI, agents. Those are usually the things I work on. Before we continue, I'm going to introduce Milvus a bit, for the people that don't know what we're doing, and then we'll start and talk about the different agents. So Milvus is part of the LF AI & Data Foundation under the Linux Foundation, and Zilliz is the key maintainer of Milvus.
Milvus is a graduated project of the Linux Foundation. We have a lot of stars on GitHub, more than 28,000, and a lot of people learning it; I think more than 10,000 companies use us in production.
What is cool is that with Milvus you can start with an easy setup: a pip install and you can start coding directly in your notebook. But then you can also move very quickly to having Milvus deployed on Kubernetes, for example, or using Zilliz Cloud. We also have integrations with different partners, so OpenAI, LangChain, LlamaIndex and others, and Milvus is feature-rich as well.
We support dense embeddings, sparse embeddings, filtering, reranking, and a lot of other things. Actually, today we'll see how we can do metadata filtering using agents, where the agent creates the metadata filters by itself, as I said before. So we have integrations with different AI toolkits: LangChain and LlamaIndex are the two most famous ones, but also DSPy, Voyage AI, and a lot of other partners. You can check our documentation at milvus.io to see the integrations we have with our different partners.
I'm going to start by introducing the RAG concepts, because I'm not sure everyone is familiar with them. Then we'll quickly see the limitations of RAG, and then we'll go more into the agent part. RAG is Retrieval-Augmented Generation. The basic idea is to make the LLM work with your data, and you do that by injecting your data into a vector database like Milvus. The basic architecture is this one.
You have your data, then you extract some content and do what we call chunking, which means splitting the data into parts. Then you put everything through an embedding model and store everything in Milvus. Once you have a query from your user, you put it through the same embedding model you used before, and then you do what is called semantic search: you search for things that are semantically similar to the query.
Milvus returns the similar data, we put everything back into the context for the LLM, the LLM produces a response, and we give that response to your user. That's the basic RAG architecture, and it's one you can build with basically five lines of code.
For example, if you're using LlamaIndex, you load your data, put it in a vector store, build a query engine, and then you can run queries. That's basically it, and it's very quick; if you just have some documents you want to go through, it can be very useful.
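As a rough illustration of that "five lines of code" version, here is a minimal sketch with LlamaIndex. It assumes a default LLM and embedding model are already configured and that `./data/` (a hypothetical path) contains your documents:

```python
# Minimal naive RAG with LlamaIndex (sketch; assumes Settings.llm and
# Settings.embed_model are already configured and ./data/ holds your files).
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()  # load your files
index = VectorStoreIndex.from_documents(documents)       # embed and index them
query_engine = index.as_query_engine()                   # build a query engine
print(query_engine.query("What is Milvus?"))             # run a query
```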
But it's usually not enough, because naive RAG, while nice, is a bit limited; it has different failure modes. One of them is summarization. Because RAG is closely tied to semantic search, to summarize a document you basically have to fetch the whole document, and RAG is not designed for that, so it usually struggles with summarizing documents. It can also struggle with implicit information.
For example, if you ask a question like "which company has the highest revenue on the US stock market?", it will do semantic search on your data, but if that information isn't stated anywhere, it won't know it. Your documents have to say it explicitly; otherwise it's likely not going to find it. That's a problem you may run into, where you then have to use different tools.
It can also struggle with multiple questions: if you ask several questions at once, the RAG system might skip one or not answer it properly. That's also a weakness of naive RAG. So in a nutshell, RAG is very useful, don't get me wrong, but it's necessary and not sufficient: a lot of tasks are related to semantic search, but there are other tasks that require more than that.
From now on we're going to see how to go beyond the limits of basic RAG. But first, a reminder for everyone. I've been working in the ML field for years now, and we used to say "garbage in, garbage out". The reminder now is: good dishes come from good ingredients. You still have to make sure that your data collection is good, the data cleaning is good, and the parsing and chunking are good. You can't really expect your RAG system to perform if your data collection is bad.
If the cleaning is not good, the LLM is going to be confused by a lot of noise. Same for chunking: if your chunks are too small, or too big, your LLM will also get confused. Basically, the way to see it is that if you yourself can't understand the chunks and can't really recover the context, it's going to be very hard for your LLM to do it.
Now, if you have documents like PDFs, given the context capacity of LLMs today, one page can be a good chunk. But you might also chunk by paragraphs, or create super-chunks together with smaller chunks, or have a summary chunk plus smaller chunks. Those can be really useful for your LLM if you don't want to struggle with that. We also have what is called Zilliz Pipelines, available on Zilliz Cloud, which is a state-of-the-art ingestion pipeline, so you get good chunking out of the box and don't end up with a bad RAG system because of bad chunking.
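If you chunk yourself, a simple place to start is controlling chunk size and overlap at indexing time. A minimal sketch with LlamaIndex's sentence splitter (the sizes here are illustrative, not values recommended in the talk):

```python
# Sketch: control chunk size/overlap so chunks stay readable for the LLM.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("./data").load_data()
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)  # illustrative values
index = VectorStoreIndex.from_documents(documents, transformations=[splitter])
```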
But to come back to naive RAG: the naive RAG pipeline is good, but it's single-shot only, meaning every question is brand new; there's no memory of the previous question. It also has no query understanding and no planning, it doesn't use any tools, it has no error correction, and it has no memory.
So every time you make a query, you have to go through the whole RAG pipeline again, whereas you could use something like memory instead. When you want to improve that, one thing to remember is that you really want to measure how you improve it. That's a different talk, but keep in mind that if you want to work on something and prove you improved it, you have to measure it.
It can be with golden datasets, where you know the answers yourself and you check whether your RAG system is actually answering the questions properly. You can also use LLMs to evaluate LLMs; there's a lot of documentation on the internet about that, so you can have a look. But now let's talk about agentic RAG.
That's the new cool kid in town, basically. Agentic RAG is multi-turn: agents can understand the question and then plan multiple steps, which is very good. It also has a tool interface to the external environment.
So if you need to browse the internet, the agent can decide to use that tool to browse the internet, get the data back, and, for example, integrate it into the context of the LLM so you can give an answer. It also has reflection, and it has memory for personalization. So you can see it can be way better: it's not only single-shot. Your agent will use a tool that calls the RAG pipeline, the result goes back to the agent, and it can loop through that until you're happy with the answer.
There are different techniques. The first one is self-reflection. You have your query, you retrieve the top-K relevant chunks from your documents, and then you check them: some of them are correct, but some are ambiguous. For the ambiguous ones, your agent can search the internet to verify, and then compare the data it found on the internet with the data you have in your database.
Then it can say: okay, this actually corresponds to what I have in my database; I have five results from five different websites and they roughly say the same thing, so I'm pretty happy with it. Then it gives the answer back to the LLM, and the LLM gives the answer back to your user.
That's one. Then you also have what is called query routing. You have your query, you pass it through your agent, and the agent decides: do I need to check my database for this, or do I go somewhere else? For example, if your database contains financial data about different companies in the world and you ask your agent something about zoos around the world, it will know it doesn't have this information in the database. So instead of doing a vector search, it will skip that part entirely and, for example, go to the internet to check about zoos. That's query routing.
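A rough sketch of query routing in LlamaIndex: an LLM selector picks which engine should handle the question. The engine variables (`finance_engine`, `web_search_engine`) are hypothetical placeholders for whatever engines you have:

```python
# Sketch of query routing: an LLM selector decides which engine to use.
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

tools = [
    QueryEngineTool.from_defaults(
        query_engine=finance_engine,      # RAG over company financials in Milvus
        description="Financial data about companies stored in the vector database.",
    ),
    QueryEngineTool.from_defaults(
        query_engine=web_search_engine,   # e.g. a web-search backed engine
        description="General world knowledge, e.g. zoos around the world.",
    ),
]
router = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=tools,
)
response = router.query("How many zoos are there in Europe?")
```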
Then you have query routing with sub-queries, which is also very useful.
Instead of keeping one query, you divide it into different ones, and the search then runs for each of them. So if you have several questions in one, for example "What is Milvus and what is Zilliz?", those are really two different questions. The first sub-query would be "What is Milvus?" and the second one "What is Zilliz?". You do a vector search for both queries and get the top-K results for each.
Then you combine the two with the LLM and give the combined result back to your user as the answer. Those are different ways of getting better results and handling complex queries.
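A sketch of that sub-query decomposition with LlamaIndex's sub-question engine; the two underlying engines (`milvus_docs_engine`, `zilliz_docs_engine`) are hypothetical:

```python
# Sketch: one question is split into sub-questions, each answered separately,
# then the answers are combined by the LLM.
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

tools = [
    QueryEngineTool(
        query_engine=milvus_docs_engine,
        metadata=ToolMetadata(name="milvus_docs", description="Documentation about Milvus."),
    ),
    QueryEngineTool(
        query_engine=zilliz_docs_engine,
        metadata=ToolMetadata(name="zilliz_docs", description="Documentation about Zilliz."),
    ),
]
sub_question_engine = SubQuestionQueryEngine.from_defaults(query_engine_tools=tools)
response = sub_question_engine.query("What is Milvus and what is Zilliz?")
```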
Then you have conversation memory: you run your query through your RAG system and store the exchange in memory, so you pass the history back into the context. That's how chat works in ChatGPT, for example. The only thing you have to be careful about is that, in practice, as you put more of the history into the context, you can overflow the size of your context window. So you have to get clever and maybe condense the history a bit so it still fits. You can't just append and append, because at some point it becomes bigger than your context window, so you always have to keep that in mind.
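A minimal sketch of conversation memory with a token budget, assuming an `index` built as in the earlier sketch; the token limit is an illustrative value:

```python
# Sketch: chat with memory; "condense_plus_context" rewrites the new question
# using the chat history so old turns don't blow up the context window.
from llama_index.core.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=3000)  # illustrative budget
chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",
    memory=memory,
)
print(chat_engine.chat("And how does that compare to Lyft?"))
```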
And then we have what is called ReAct prompting, which stands for Reasoning and Acting. It's basically designed to combine the reasoning capabilities of LLMs with the ability to take action steps. If you make a query, and we'll see it live later during the demo, the model will say: okay, for this query I need to use this tool.
Then it makes sure it's actually answering the question, and once it has answered, it gives you the result back. By doing that, it's able to understand and process information, really evaluate situations, take appropriate actions, and communicate responses back to you.
It also keeps track of the situation: okay, my first question was "What is Milvus?", now I'm using a different tool that browses the internet, and I'm still tracking where I am and whether I have actually answered the user. In practice it looks like this. You can see the first thought: it's a question about the Apple Remote, and the first action is to search for "Apple Remote".
Then you have an observation, which is the result of the search: the Apple Remote is a remote control. Then you have a second thought, because the LLM is not really happy with the answer yet, so it does another search, this time about Front Row. Here it can't find Front Row, so it searches for something similar.
It finds "Front Row (software)", which is similar, and so on: another thought, then a search for Front Row (software), and finally the last thought, which is that Front Row is controlled by an Apple Remote or the keyboard function keys. Then you have the finishing action, and you can give that back to your user: the answer is the keyboard function keys. So that's basically how ReAct prompting works.
You'll see it in action again, because it's what I'm using for my agent later on. And as I've said multiple times, agentic RAG lets you use tools. For example, you can have auto-retrieval: you have a query, and the LLM generates the retrieval query itself, including the metadata filters and everything. But you could also have something that generates SQL from text: your query is text, you put it through an LLM, it gives you SQL that you can run, and then you get the answer back.
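A sketch of such a text-to-SQL tool with LlamaIndex; the database file and table name are made up for illustration:

```python
# Sketch: the LLM turns a natural-language question into SQL, runs it,
# and the resulting rows become the context for the answer.
from sqlalchemy import create_engine
from llama_index.core import SQLDatabase
from llama_index.core.query_engine import NLSQLTableQueryEngine

engine = create_engine("sqlite:///companies.db")            # hypothetical database
sql_database = SQLDatabase(engine, include_tables=["revenues"])
sql_query_engine = NLSQLTableQueryEngine(sql_database=sql_database)
print(sql_query_engine.query("Which company had the highest revenue in 2021?"))
```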
The agent can also use other tools. It can go through your calendar: if you connect it to Google Calendar, it can check whether someone is available tomorrow, call the Google Calendar API for that, get a result, and then give that answer back to the user. So basically, that's agentic RAG. Now I'm going to go a bit deeper into how I actually built everything.
We'll talk about the stack first, and then we'll go directly into the demo. For this one I'm using, first, LlamaIndex. LlamaIndex is a framework for building LLM applications, really focused on retrieving data and integrating with different LLMs. It has integrations with lots of AI tools, really lots. And recently they released something called Llama Agents.
Llama Agents is made by LlamaIndex and it's open source. It allows you to build stateful apps with LLMs and multi-agent workflows. You have cycles and branching, you can have a human in the loop, and it allows you to have persistence.
So let's say you built an agent system and you work in customer support. Before actually issuing, say, a new ticket or something to your customer, you can ask a human to validate it: okay, we stop here and ask the human whether this action should actually happen. If the human says yes, we continue the action. That's one thing it allows you to do, but it's really good at organizing multiple agents. You can see the components on the right side, and I'm going to zoom in on them.
On the components side, you have the user, and then you have the control plane, which ties everything together. The control plane is really the central gateway to the llama-agents system: it keeps track of the current tasks as well as the services that are registered to the system, and it also holds the orchestrator. The orchestrator is the module that handles incoming tasks and decides which service to send them to,
as well as how to handle the results coming back from services. An orchestrator can be agentic, meaning an LLM makes the decisions, or it can be explicit, where you define a flow and everything goes through something very specific. It can also be a mix of both if you want. Then you have the different services.
You can see the different agent services here; that's where the work actually happens. A service accepts an incoming task and context, processes it, and publishes the result. And a tool service is a special service used to offload the computation of agent tools, because those can be quite heavy. So that's Llama Agents, and that's what I'm going to use later on.
Those are the different components. And because the agents run as individual services, you can also scale the different agent services up and down, which is very handy in some cases. For the LLM and the embedding model, I'm going to use models from Mistral AI. For the people that don't know, Mistral AI is a French research lab, and they recently published open-source models.
They're really focused on open-source models; they actually released two in the last couple of weeks. The first one is Mistral NeMo, which is a 12-billion-parameter model with a 128,000-token context length. It's really strong at function calling and retrieval for its size, and you can run it locally, with Ollama for example.
So that's what I'm going to use. Then, for more complex tasks, I'm going to use Mistral Large 2, which has 123 billion parameters, again with a 128,000-token context length. This one has also been fine-tuned on function calling and retrieval skills, so both are really good for RAG and function calling. And for the embedding model, I'm just using the one from Mistral, because it has also been focused on retrieval, which is very useful for RAG. Unfortunately,
when I checked, it was English-only, so if you have data in a different language, you might need to use a different embedding model. I'm also going to use Milvus Lite: everything from Milvus will run on my laptop. What's good with Milvus Lite is that you just run a pip install of pymilvus and you basically have a lightweight Milvus on your laptop.
Then, if you want to use Milvus in the cloud, you just have to change the URI in the code. We'll see that it's a local file I'm pointing to, but if I had something running on Kubernetes I would put the URI of the cluster, and if I used Zilliz Cloud it would be the same. You basically only change one line of code and everything else keeps working. And now it's going to be demo time. Let's see if what I want to show actually works, because, disclaimer, with agents not everything is deterministic. Usually it works, but sometimes they decide they don't want to do what they're supposed to do. That's also a big problem with using agents in general: they can be very handy, because they can really make decisions on their own, but sometimes they can also be capricious.
So let's see what we have today. Let's go. I have this demo, which uses Llama Agents with Milvus and Mistral. Milvus, I talked about it.
Here I'm using Milvus Lite, and then I'm going to use Llama Agents, so my agents run as microservices, and I'm using the different Mistral AI models. Basically, we'll get some data from the internet, store it in Milvus, use LlamaIndex with Mistral for data queries, and then create an automatic data search and reading agent.
Then we'll have an agent that creates metadata filtering based on the user query, and everything is done automatically; you'll see, and then everything should work. For the sake of the demo I'm not going to install the dependencies, because they're already installed, but you can see it's a pip install of llama-agents and of pymilvus, and by doing that you install Milvus Lite.
Then here I'm installing the integration we have with LlamaIndex, plus the readers to be able to read files. I'm also using Ollama to run Mistral NeMo locally on my laptop, so I'm installing that as well, and then the integration with Mistral AI, and the same for the embedding models.
I'm going to run this cell because it's needed for the agents; basically, otherwise the asyncio event loops in notebooks are not happy. It's only needed when you run inside a notebook. Then I import my API keys from Mistral. If you don't have one, you can go on their website and get it. The way I do it is that I have a file locally, a .env file, and I just read the credentials from there.
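Roughly, that setup cell looks like this; the environment variable name is an assumption:

```python
# Sketch: allow nested event loops inside a notebook and load the Mistral API
# key from a local .env file.
import os
import nest_asyncio
from dotenv import load_dotenv

nest_asyncio.apply()   # only needed when running inside a notebook
load_dotenv()          # reads e.g. MISTRAL_API_KEY=... from ./.env
api_key = os.environ["MISTRAL_API_KEY"]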
Then I have this data, which is financial data from Uber and Lyft; I'll show you a glimpse. You can see it's from the Securities and Exchange Commission in the US. This one is about Uber Technologies and it's the annual report, and we have another one from Lyft in a very similar format, which is about Lyft. You can see a lot of content: one is 238 pages long and the other one is around 300 pages long.
So those are pretty long documents, but I will store everything in Milvus anyway. Given that I've already downloaded the data, I don't need to do it again, so from now on I'm going to prepare the embedding model. I'm using mistral-embed, as I said, because it's developed by Mistral, and given that I'm going to use Mistral models anyway, I figured let's use that one as well. So I'm defining the default embedding model that we're going to use for all of LlamaIndex.
I define it here and say: please use Mistral AI embeddings with the mistral-embed model. Then I define the default LLM I'm going to use as well. I've said it multiple times already, but I'm going to use Ollama, which lets you run models locally on your laptop and makes it very easy. And by default I'm saying: we're going to use Mistral NeMo, the 12-billion-parameter model.
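Those two defaults look roughly like this in LlamaIndex, assuming the `api_key` loaded above and an Ollama model tag for Mistral NeMo:

```python
# Sketch: mistral-embed as the default embedding model, and a local
# Mistral NeMo served by Ollama as the default LLM for all of LlamaIndex.
from llama_index.core import Settings
from llama_index.embeddings.mistralai import MistralAIEmbedding
from llama_index.llms.ollama import Ollama

Settings.embed_model = MistralAIEmbedding(model_name="mistral-embed", api_key=api_key)
Settings.llm = Ollama(model="mistral-nemo", request_timeout=120.0)
```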
So now I'm going to instantiate Milvus and load some data. Here are the two files I showed before. And here we create the Milvus side: we have the MilvusVectorStore, which is the direct integration we have with LlamaIndex.
You can see the URI here: it's a local file, and when you define the URI as a local file, Milvus knows it's going to use Milvus Lite. If you wanted to use something running on Kubernetes, you would change the URI to the URI of your cluster, for example one running on GCP in europe-west1, and then it knows it needs to go somewhere else. But I want to use Milvus Lite for this one, so I'll stick with the local file. I'm setting the dimension to 1024, because that's the dimension of the embedding model.
I'm also setting overwrite to true, because I like to live dangerously, so I'm actually going to overwrite everything I have in my database. And I have a collection name, "companies_docs"; that's just the name of the collection. Then I create a storage context with Milvus as the vector store, and then I load the data directly.
We load the documents into memory and then build the index. When we build the index, we're saying: build it with the documents I have in memory and use the storage context defined before, the one backed by Milvus. Then I define the query engine. I'm going to run it; it takes a tiny bit of time, but it shouldn't be too long.
For the query engine, I'm basically saying: please only return the top three results; we don't need ten results or anything like that. While the index is building, I'm just going to move on in the interest of time.
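Putting that Milvus setup together, a sketch of the indexing cell (file paths and the collection name are illustrative; swap the local URI for a cluster or Zilliz Cloud URI to go remote):

```python
# Sketch of the Milvus setup: a local file URI means Milvus Lite.
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.milvus import MilvusVectorStore

vector_store = MilvusVectorStore(
    uri="./milvus_demo.db",          # local file -> Milvus Lite; use a cluster URI for remote
    dim=1024,                        # mistral-embed produces 1024-dimensional vectors
    overwrite=True,                  # drop any existing collection with this name
    collection_name="companies_docs",
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
docs = SimpleDirectoryReader(
    input_files=["./data/uber_2021.pdf", "./data/lyft_2021.pdf"]  # hypothetical paths
).load_data()
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)
query_engine = index.as_query_engine(similarity_top_k=3)  # only return the top 3 chunks
```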
From here is where the cool part starts: I'm defining the different tools the LLM can have access to. This is where, if you want a tool that browses a specific database, or another one that connects to Google Calendar, you would define them. Here I'm defining a query engine tool that's only about Lyft, only about the Lyft document. I have to write the description so the agent knows which tool to use and when: it provides information about Lyft financials for the year 2021, and I'm saying "use a detailed plain-text question as input to the tool".
I'm also saying: please do not attempt to interpret or summarize the data; I just want the raw information. Then we have the same one for Uber: it's the Uber 10-K, the description is "provides information about Uber financials for the year 2021", and the rest of the instructions are the same. So I define those two tools, but then I need to tell the agent that it can actually use them.
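Roughly, those two tool definitions look like this; `lyft_engine` and `uber_engine` stand in for query engines built over each 10-K:

```python
# Sketch of the two per-company tools; the descriptions are what the agent
# reads to decide which tool to call for a given question.
from llama_index.core.tools import QueryEngineTool, ToolMetadata

query_engine_tools = [
    QueryEngineTool(
        query_engine=lyft_engine,   # query engine over the Lyft 10-K
        metadata=ToolMetadata(
            name="lyft_10k",
            description=(
                "Provides information about Lyft financials for year 2021. "
                "Use a detailed plain-text question as input to the tool. "
                "Do not attempt to interpret or summarize the data."
            ),
        ),
    ),
    QueryEngineTool(
        query_engine=uber_engine,   # query engine over the Uber 10-K
        metadata=ToolMetadata(
            name="uber_10k",
            description=(
                "Provides information about Uber financials for year 2021. "
                "Use a detailed plain-text question as input to the tool. "
                "Do not attempt to interpret or summarize the data."
            ),
        ),
    ),
]
```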
So I'm setting up the agent here. My LLM is again the same Mistral NeMo that I'm running locally, and the agent is going to be a ReAct agent. ReAct, if you remember, is what I showed you before, where the LLM does the thinking, makes the decisions, and splits the query to make sure it's actually answering the question.
Then I'm saying: okay, you have these tools available. query_engine_tools is what I defined above, the two tools I have here. I pass in the LLM, and I set verbose to true just so you can see better what's happening. And I have a question: "Could you please provide a comparison between Lyft and Uber total revenue in 2021?"
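A minimal sketch of that agent setup, reusing the tools from the previous sketch:

```python
# Sketch: a ReAct agent over the two tools, with verbose=True so the
# thought/action/observation loop is printed.
from llama_index.core.agent import ReActAgent
from llama_index.llms.ollama import Ollama

llm = Ollama(model="mistral-nemo", request_timeout=120.0)
agent = ReActAgent.from_tools(query_engine_tools, llm=llm, verbose=True)
response = agent.chat(
    "Could you please provide a comparison between Lyft and Uber total revenue in 2021?"
)
print(response)
```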
Then we get a response. You can see the input we gave; that's exactly what I wrote. Then we can see the thought process of the LLM: the user has asked for a comparison between Lyft's and Uber's total revenue in 2021, and I need to use tools to find this information. So it decides it needs to use the first tool, the Lyft 10-K, which is the one I defined above.
And it gives it the input "What was Lyft's total revenue in 2021?". I didn't write this myself; it's really the LLM that divided the question into two different questions. Then it goes through our documents.
The observation is here: Lyft's total revenue in 2021 was $3.6 billion. Then it says: okay, now I need to find out Uber's total revenue for 2021. So it takes a different action, the Uber 10-K tool, which is the other one.
It changes the question to "What was Uber's total revenue in 2021?" and finds the answer: Uber's total revenue in 2021 was $17 billion. Then it says: okay, I'm happy, I've gathered all the necessary information, the user's language is English, so I can now compare Lyft's and Uber's revenue for the year 2021, because I have both of them. And you can see the answer here: in 2021, Lyft's total revenue was $3.6 billion compared to Uber's, which was $17 billion; this shows that Uber had more than four times the revenue of Lyft that year.
That's basically the answer we give back to our user. That was the whole verbose mode; if you turned it off, you wouldn't see all of that, you would basically just get the answer: Lyft was $3.6 billion and Uber was $17 billion.
So you've seen how the LLM can divide the question into different queries when it makes sense, and sometimes really think. That's cool, but notice that I had to create two tools for two documents; that's what I have here. Now imagine I had a hundred companies, or a thousand companies; imagine I'm a financial analyst.
I don't want to have to create a different tool for every company. What I would like is one tool covering all the companies I have, and then I can use metadata filtering: I can filter by company name, for example, or by document name, to only search through the data I actually want to access. If I search for Uber, it should only search the Uber data; it shouldn't have to go through the whole collection.
So that's what I'm going to try to do now. That's basically what metadata filtering gives you: better precision, more efficiency, and more ways to customize. Here I'm showing an example of how to use metadata filtering when you already know what you want to filter on. We're going to filter with an exact-match filter on the file name, and I want the Uber 2021 document.
So we really only filter on that one file. Actually, should I do Lyft? No, let's do Uber. So we have Uber, and now the filter is created. Then I'm creating an agent again, and we're going to interact with it.
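A sketch of that hand-written filter and the filtered tool; the exact file name is an assumption, and `index`, `llm` and the tool classes come from the earlier sketches:

```python
# Sketch: only chunks whose file_name matches the Uber 2021 report are searched.
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.agent import ReActAgent

filters = MetadataFilters(
    filters=[ExactMatchFilter(key="file_name", value="uber_2021.pdf")]  # hypothetical file name
)
filtered_engine = index.as_query_engine(similarity_top_k=3, filters=filters)

company_docs_tool = QueryEngineTool(
    query_engine=filtered_engine,
    metadata=ToolMetadata(
        name="company_docs",
        description="Provides information about various companies' financials for year 2021.",
    ),
)
agent = ReActAgent.from_tools([company_docs_tool], llm=llm, verbose=True)
```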
With this one, I'm going to ask: what is the business overview of Uber? My metadata filter only covers the Uber documents, so I should find something. Let's see; it's going to take a bit of time. The other question I have is the same, but about Lyft, and I'm going to run both of them, because it's a bit slow.
Let's see whether my agent actually finds data with the metadata filtering I have. You can see I ask my question and the agent says: I need to use a tool. The tool it uses is company_docs, the new tool I defined, which is the broader one; it's not specific to one company. And we can see some observations now about Uber.
Then it continues from there. This one can take a long time depending on whether the agent is happy with the answer, and I'm not sure this one will be, because it seems to be looping on the action. That's a bit of a problem you sometimes have with agents: when they're not exactly happy with the answer, they loop and really try to find an answer until they reach the maximum number of iterations. It seems like that's what it's doing now, and I guess that's part of the demo effect. But otherwise you can see the observation is actually correct:
if I ask for a business overview, it says Uber is a global technology platform that connects different things. I'm going to leave that running in the meantime, because it should finish in a bit. But what I actually want is to use an agent to extract the metadata filters, because, if you look above, I defined the filters manually, and again, I don't want to have to do that every time. I want to just type a query and get the answers directly.
I think I will stop that cell soon; otherwise I'll show you this part and, if it's still not done, I'll stop it and we'll come back to it. So here I have a prompt. I have a prompt template that tells the LLM: given a question, extract some filters. I'm giving the question to my LLM and saying: given the following question, extract the relevant metadata filters; consider company names, years, and any other relevant attributes.
That's just to tell the LLM what I consider metadata filters and what I consider important. I'm also saying: please don't write any other text, I just want a metadata filter object. And then I explain how to format it, because it may not know about metadata filtering and this object in particular, so I say: format it by creating filters like the following.
The example I give is the one I used above: a MetadataFilters with the different filters. And if no specific filters are mentioned in the question, return an empty one. Then I pass in the question and return the result.
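A sketch of that filter-extraction step; the prompt wording is a paraphrase of what's described in the talk, and the evaluation of the model's answer into a Python object follows what the demo says it does:

```python
# Sketch: ask the LLM to emit a MetadataFilters object for a given question,
# then evaluate that string to build the actual Python object.
from llama_index.core.prompts import PromptTemplate
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters  # needed by eval()
from llama_index.llms.ollama import Ollama

FILTER_PROMPT = PromptTemplate(
    "Given the following question, extract the relevant metadata filters.\n"
    "Consider company names, years, and any other relevant attributes.\n"
    "Don't write any other text, just the MetadataFilters object.\n"
    "Format it by creating a MetadataFilters like this:\n"
    'MetadataFilters(filters=[ExactMatchFilter(key="file_name", value="uber_2021.pdf")])\n'
    "If no specific filters are mentioned, return an empty MetadataFilters().\n"
    "Question: {question}\n"
    "Metadata filters:\n"
)

llm = Ollama(model="mistral-nemo", request_timeout=120.0)
raw = llm.complete(FILTER_PROMPT.format(question="What is Uber's revenue?")).text
filters = eval(raw)  # the demo evaluates the answer to build the filter object
```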
Then I run it with Mistral NeMo again and we get an answer, and I basically evaluate that answer to build the actual Python object and use it. But I think the earlier cell is still running. Yes, unfortunately it's still running, so I'm just going to stop it quickly.
I guess that's part of the demo effect. I'm going to restart my notebook quickly, sorry about that, because otherwise it won't be happy. I'll go through this part quickly: these are the tools I define, and my agent again, and I'm just going to skip this one because it seems to be looping and I only have ten minutes left, so I don't want to wait that long.
The idea there was that you can't really find the answer for Lyft, because I filtered on Uber. I'm going to run this one afterwards. Once it's running, we can ask the question: what is Uber's revenue? For this one, my agent is not smart enough yet to infer everything, because my metadata keys are very specific: the key is really the file name, and the value is the name of the file.
So it really depends on the metadata you have. For this one I'm saying: this should be in the file name of the Uber 2021 PDF, and once we have that, we should get the metadata filtering working, if it works. But now I have a bit of demo effect, so I'm just checking where my agent is. Okay, so now it's comparing everything again, and you'll see that the answers are actually different from the ones we had before.
Before, the answer was much more verbose, and this one just says, for example, $3.7 billion. The output format seems to have changed, but let's see if it still works. That's what I was saying at the beginning: these agents can work amazingly well, but not everything is deterministic.
It really depends on the agent and the actions it takes. This one should find the answer very quickly, and then we can go back to the live demo. Sorry. Yes, we have it.
Okay, cool. You can see the format is different as well from what we had before. But let's go back to our live demo now, because that should be done. Yes. So that's what I was saying here.
I created the metadata filtering, and you can see the answer: a metadata filter with the key file_name and the value of the Uber 2021 PDF, and that was created by the agent itself. So then I can run the same question. Here I have my query engine tool, company_docs_filtering; I don't write anything about specific companies in the description, I just say it provides information about various companies' financials for the year 2021.
Then I put everything through my agent again and check the answer. And here it says: okay, the user has asked about Uber's revenue, which can be found in the provided PDF; I'll use the company docs filtering tool for that. It asks the question again and finds the answer in that document. And now it only searches in the Uber document; there was nothing about Lyft in the results. So now, for the bigger part, because I haven't used Llama Agents yet: Llama Agents is now coming into play.
That's the part where you can orchestrate everything. I've shown you different examples of smaller tasks, but if you really want an agent that works and does a lot of different things, this is how you would define it. You have llama-agents, and here you also have LlamaIndex with its agent, the function-calling agent worker, and Mistral AI. If you remember the graph from before, I have a message queue that's used to talk to the different agents.
Then I have my control plane. This is where I'm actually using Mistral Large, and the reason is that the control plane is really the head of everything, so it can be useful to have an LLM with bigger and better capabilities there. Then I define the tool service with the different query engine tools.
And here there are a few more tools, which are meta tools, basically. Finally, I define the agent, and I'm going to use Mistral Large again here, because I need a model with very good capabilities for dealing with agents, function calling, and reasoning. I also give it a description. Now that everything is defined, I'm just changing the log level so you can see everything.
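A sketch of that llama-agents wiring, following the pattern described in the talk. The llama-agents API was young at the time and may have changed, so treat the class names and arguments as an approximation rather than the author's exact code (this is a notebook cell, so the top-level `await` is fine):

```python
# Sketch: message queue + control plane (Mistral Large orchestrator) + tool
# service + one agent service, launched locally.
from llama_agents import (
    AgentOrchestrator, AgentService, ControlPlaneServer,
    LocalLauncher, MetaServiceTool, SimpleMessageQueue, ToolService,
)
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.llms.mistralai import MistralAI

message_queue = SimpleMessageQueue()
control_plane = ControlPlaneServer(
    message_queue=message_queue,
    orchestrator=AgentOrchestrator(llm=MistralAI(model="mistral-large-latest", api_key=api_key)),
)
tool_service = ToolService(
    message_queue=message_queue,
    tools=query_engine_tools,   # the tools defined earlier
    running=True,
    step_interval=0.5,
)
# Meta tools let the agent call tools that actually run inside the tool service.
meta_tools = [
    await MetaServiceTool.from_tool_service(
        t.metadata.name, message_queue=message_queue, tool_service=tool_service
    )
    for t in query_engine_tools
]
worker = FunctionCallingAgentWorker.from_tools(
    meta_tools, llm=MistralAI(model="mistral-large-latest", api_key=api_key)
)
agent_service = AgentService(
    agent=worker.as_agent(),
    message_queue=message_queue,
    description="Answers questions over the Uber and Lyft 10-K documents.",
    service_name="uber_lyft_10k_agent",
)
launcher = LocalLauncher([agent_service, tool_service], control_plane, message_queue)
print(launcher.launch_single("What are the risk factors for Uber?"))
```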
And now I'm going to launch it. So we have the agent service, the different tools, the control plane and the message queue, and I can ask, for example: what are the risk factors for Uber? You can see the agent service, then the control plane: we publish everything to the control plane, a new task is created, an action goes to an agent, results are published again, new tasks get created, and so on.
I'm not going to go through everything, but that's how it works. And here it says: okay, I think I found the answer. It sends that back to the control plane saying the task was completed, a few more actions happen, and soon it should be done and we can actually print it. Oh, it's not done yet. Okay, no, this one is done.
So this action is done, it's doing a few more calls, and soon enough this should finish, if it actually finishes at some point. But that's basically what's happening: it talks to the different agents and to the different services, and hopefully it will finish at some point. Otherwise, I'm going to come to the conclusion, because I have about ten minutes left. But that's basically the idea of how to use Llama Agents to create different agents, with Mistral orchestrating everything.
And you store everything in Milvus. Basically, if you liked this tutorial, even though there was a bit of a demo effect, you can give us a star on GitHub; we would really appreciate it, it's really helpful for us. And if you have any questions, you can also add me on LinkedIn and I'll be happy to answer them. And that's it.
Thank you. Okay, cool. Thank you, Stephen. You have a few questions in the chat, so I'll just go ahead and ask those. To run this RAG service using Llama Agents and the Mistral API, do we need a GPU-based server, or is CPU-based good enough? And could you share the server/VM, CPU and sizes used for this demo?
Okay, everything's broken now, but we can just use CPUs for this one. Everything is running on my CPU at the moment, and a small laptop works fine for the model sizes I'm using. Okay, thank you.
And then, for Milvus on-prem deployments, what are the recommended embedding models and LLMs suitable for Milvus? I mean, we support everything, so it really depends on your use case. For embedding models, it depends on the language, for example, it depends on what you're doing, and it also depends on how many resources you want to spend. Some models, for example, will be very good in German or French.
Some models are trained on English only. But once you embed everything, we can store it all as vectors, so you're good. Yeah, and I also shared a resource in the chat about embedding models for your GenAI apps, and there's also a blog on choosing the right embedding model for your data.
So you can check that out on our website; I shared the link. Another question: would it work to set the temperature to zero to make the agent's LLM as deterministic as possible, thus achieving more deterministic behavior? Yes, that would be very helpful in general. I think I had changed the temperature at one point,
and I was not too happy, but we will try with zero again. Okay, thank you. And then someone's asking: could this notebook be shared with the participants? So it's going to be in the Milvus bootcamp. Stephen, do you have a timeframe, maybe within the next day or so? Yeah, end of the week; it's in peer review at the moment, so that's why.
Yeah, and once it's published, this whole recording is going to be posted on YouTube and the notebook will be linked in the description, so you can check at the end of this week whether it's up there. Thank you for your patience on that. And then the next question is: can we implement some breaking logic for the agent, for example give up after N attempts and generate an answer? Yes, there are parameters like max iterations, for example, that you can set for your agent, so it stops after three or four iterations.
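For reference, a tiny sketch of that cap on the ReAct loop (the value 4 is just an example):

```python
# Sketch: give up after a few reasoning/tool-calling iterations.
from llama_index.core.agent import ReActAgent

agent = ReActAgent.from_tools(query_engine_tools, llm=llm, verbose=True, max_iterations=4)
```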
Also, depending on your prompt, you can tell your agent: if at some point you don't know, just stop and give an answer saying you don't know. That's usually how it works, except when you have a demo effect. Okay, cool. If anyone has any other questions, please leave them in the Q&A tool.
A question I have is: what are the considerations for choosing between different LLMs for various tasks within the agent system? I think it really depends on what the LLM has to do. If it's a small task where you don't need a lot of reasoning capability, then a small model might be enough, and you can run it locally. But if you need something with a lot of reasoning capability, that's going to go into different functions and call different agents, then it might be good to look at bigger models. For example, Mistral Large is 123 billion parameters, but you can also think of Llama 3.1 405B.
Those can be very, very useful as well, just more expensive. Okay, thank you. And what are the benefits of using a big LLM for high-level decision-making while delegating specific tasks to other agents? I'm kind of repeating myself here, but you're going to save money and time if you use a smaller LLM for the other tasks, because big LLMs can be very slow at processing things, and they're way more expensive as well.
So time and money, basically. Okay. Does anyone have any other questions? Please feel free to ask them. I see one more here: if you already host Mistral NeMo locally, what is the Mistral API key used for in the notebook? It's used for Mistral Large; I can't run the 123-billion-parameter model locally.
In the demo, that's what I have here; unfortunately it crashed, but that's what I had: it's really controlling everything. The control plane with the agent orchestrator uses Mistral Large, and that one is too big to run on my laptop, so that's why I have the API key. It's also used for the embedding model, because that one is only available through the API. Yeah, sometimes the demo gods are not in our favor.
Yeah. And this part had never crashed before; it was like, yeah, let's go. Of course it happens when we're recording. Yeah. Thank you so much, Stephen.
We will end this webinar a little early, so thank you all so much for joining. You'll receive the recording and the slides through email later today. And look out on YouTube and in our Milvus bootcamp for this notebook, to try this out yourselves. So thank you all, and thank you Stephen for hosting today.
Thank you. Bye Bye everyone.
Meet the Speaker
Join the session for live Q&A with the speaker
Stephen Batifol
Developer Advocate
Stephen Batifol is a Developer Advocate at Zilliz. He previously worked as a Machine Learning Engineer at Wolt, where he worked on the ML Platform, and as a Data Scientist at Brevo. Stephen studied Computer Science and Artificial Intelligence. He enjoys dancing and surfing.