Building an Agentic RAG locally with Milvus, Ollama and LangGraph
About the session
RAG systems get talked about a lot, but the discussion usually sticks to the basics. In this talk, Stephen will show you how to build an Agentic RAG system using LangChain and Milvus.
Topics Covered:
- How to make agents leverage planning, memory, and tools to accomplish a variety of tasks.
- How to empower an LLM to perform web searches and call custom user-defined functions.
- Common issues like hallucinations and how to add fallback and self-correction mechanisms so your agent can try to fix itself.
I'm pleased to introduce our session, Building an Agentic RAG locally with Milvus, Ollama and LangGraph, and our guest speaker, my colleague Stephen Batifol. Stephen is a Developer Advocate at Zilliz. He previously worked as a Machine Learning Engineer at Wolt, where he created and worked on the ML platform, and before that as a Data Scientist at Brevo. Stephen studied Computer Science and Artificial Intelligence. He is a founding member of the MLOps Community Berlin group, where he organizes meetups and hackathons.
He enjoys boxing and surfing. Welcome, Stephen. Thank you very much for the intro, and thank you everyone for joining. As was said, today we're going to talk about how to build an agentic RAG locally using Milvus, Ollama, Llama 3 and LangGraph. I'm Stephen Batifol.
I'm your speaker today. As I said, I'm a Developer Advocate at Zilliz, so if you have any questions related to the talk, you can reach out to me directly, either on LinkedIn, on Twitter, or by email as well. This QR code will take you to my LinkedIn. But let's get started. First I'm going to talk about Milvus. If you're not familiar, Milvus is an open-source vector database.
It's part of the LF AI & Data Foundation under the Linux Foundation. And the good part is that we have everything going from Milvus Lite, which lets you do a pip install directly in your notebook, all the way to scaling up to billions of vectors. We also integrate with different partners, and we have a lot of features like dense and sparse embeddings, filtering, reranking and beyond. Feel free to check us out on GitHub directly; stars are also very helpful.
We also integrate with the different AI toolkits, so you can think of LangChain, LlamaIndex, Haystack, Langfuse, Voyage AI, and others. Today I'm actually going to use LangChain, and LangGraph in particular. So let's go. First I'm just going to give a quick introduction to what RAG is, for the people who aren't familiar with it.
Then I'll also talk about agentic RAG and what it is, and then we'll go into a demo. You'll see we'll actually build the agent ourselves, and you'll be able to follow along. If you have questions, feel free to put them in the chat or in the Q&A tool. The code will also be shared later. So yes, let's get started. RAG means retrieval-augmented generation.
The basic idea is that you want to force your LLM to work with your data. So you put your data in a vector database like Milvus, and then you give this data to your LLM. When you ask a question, the LLM is then able to use your own data. The basic RAG architecture looks like this: you have the data part.
You extract the content, and then you do what we call chunking: you create parts of the content you extracted, let's say every 200 characters, so you create chunks of your data. Then you put them through an embedding model, and then you store that in Milvus.
Then when your user gives a query, you also put it through the same embedding model, and then you run what we call a semantic search. You just search for things that are semantically similar, return the similar data, and give it back as context to the LLM.
The LLM gives you a response. That's the basic architecture. This one works well. It has some limitations, but it works quite well if you just want to use your own data, or legal data or something like that. You have the famous five-line starter where you load your data in the first place.
Then you embed it and index it, then you create a query engine and you just run queries. That's the famous one; it's only five lines. The only problem is that it's very simple: it only has the data that you store in your vector database.
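(For reference, a minimal sketch of such a five-line starter. This uses the LlamaIndex quickstart as one common example of the pattern; the data directory and question are placeholders, and the rest of the talk uses LangChain instead.)

```python
# A "five-line" RAG starter, sketched with the LlamaIndex quickstart
# (one common example of the pattern; directory and question are placeholders).
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # 1. load your data
index = VectorStoreIndex.from_documents(documents)     # 2. embed and index it
query_engine = index.as_query_engine()                 # 3. turn it into a query engine
print(query_engine.query("What are the types of agent memory?"))  # 4. run queries
```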
So for this whole thing, the tech stack I have: I'm going to use LangChain. For the people not familiar with it, it's a framework for building LLM applications. It's mostly focused on retrieving data and integrating it with LLMs, and it has integrations with most AI tools you can think of. On top of that, I'm actually going to use LangGraph, by LangChain. It allows you to build stateful apps with LLMs, and it also allows you to build multi-agent workflows.
It also adds cycles and branching, and you have human-in-the-loop. So if you're working with customer support, for example, and you have your agent making decisions, before actually making a decision it can ask a human: hey, is it fine if I refund this customer? You can basically always have checks like that.
So it can be very handy. It also adds persistence. With plain RAG it usually just goes one way, so you don't usually have memory or anything; by using LangGraph you can add that. I'm also using Ollama. Ollama is a really cool tool that allows you to run your LLM basically anywhere.
I run it on my laptop, and I'm going to use Llama 3 with it. They also allow you to run embedding models, so they make it very easy for you to run LLMs. The good part as well is that they're fully compatible with the OpenAI API. So if you use Python, you can just import the OpenAI client and use it with Ollama.
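(A minimal sketch of that OpenAI compatibility: pointing the standard OpenAI Python client at a local Ollama server. The endpoint is Ollama's documented default; the model and prompt are placeholders.)

```python
# Sketch: calling a local Ollama server through its OpenAI-compatible endpoint.
# Assumes Ollama is running locally and `ollama pull llama3` has been done.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # required by the client, not checked by Ollama
)

resp = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "What are the types of agent memory?"}],
)
print(resp.choices[0].message.content)
```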
I'm going to use Milvus Lite. As I said before, with Milvus Lite you can build once and then deploy anywhere. By just changing the URI later, if you start on Milvus Lite, you can move to Milvus running on Kubernetes, or to Milvus in the cloud, and you don't have to change your code at all.
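(A sketch of that "build once, only the URI changes" idea with the pymilvus client; the file name and endpoints are placeholders.)

```python
# Sketch: the same client code works locally and against a cluster; only the URI changes.
from pymilvus import MilvusClient

client = MilvusClient(uri="./milvus_demo.db")  # Milvus Lite: a local file, nothing to deploy

# Later, point the same code at a self-hosted or managed deployment (placeholder URIs):
# client = MilvusClient(uri="http://my-milvus-on-k8s:19530")
# client = MilvusClient(uri="https://<your-cloud-endpoint>", token="<api-key>")

client.create_collection(collection_name="demo", dimension=384)
print(client.list_collections())
```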
Now let's talk about agentic RAG. What is it? As I said before, if you have a simple RAG, you only have the query, then the RAG, and then you give a response. That's it. With agentic RAG, what's good is that you can have multiple turns. If your RAG is not giving the correct answer, or if it's hallucinating, instead of just giving that answer, it can check again and generate another answer.
That can be really helpful. It can also give you a query or task-planning layer: if you have a very complex query, maybe you can divide it into different queries. That can also be very useful. You can also plan the different things you can do with your agents and your different tools.
It also acts as an interface for different tools. Let's say you want a tool that browses the internet: you can do that. You can have a tool that checks your Google Calendar: you can also do that.
It also adds reflection, making sure your agent is checking: what am I saying, is it correct or not, am I actually answering the user's question? And it adds memory for personalization, so it's not only going one way; you can actually store things in memory and give them back to your LLM. The general idea, for this specific example, is that we're going to have routing, which is what's called adaptive RAG.
We're going to route questions to different retrieval approaches, and we'll see that later. Basically my agent will decide: am I going to check in Milvus if the data is there? And if the data is not there, it's going to perform a web search for me. We also have fallback, so corrective RAG: it falls back to web search if the documents aren't relevant to your query. And then we're going to have self-correction as well, so self-RAG.
It's going to try to fix answers with hallucinations, or answers that don't address the question. If it doesn't address the question, it's going to generate another answer, or maybe re-process the question. The agent is going to make those decisions on its own; I don't have to tell it anything, except doing some prompting. So you'll see my prompts are going to be quite long, actually.
But that's a way of having an agent and having it do some magic, basically. So yeah, we're going to see this in action, and I'm going to use Milvus Lite. Let's go into the demo directly now. I have my notebook here, so I think everyone should be able to see it.
I'm installing Milvus Lite, so it's pip install pymilvus, and then I'm also installing all the different dependencies. As I said, I'm using LangChain and LangChain Hub, and also Tavily, which is a web search engine for LLMs. I'm not going to install everything again now, because otherwise it would take a bit of time.
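(The full install cell isn't reproduced here; a rough approximation, with the package list assumed from the tools mentioned in the talk, might be:)

```python
# Approximate dependencies for this demo (package list is an assumption based on the
# tools mentioned: Milvus Lite, LangChain, LangGraph, Hugging Face embeddings, Tavily).
%pip install -U pymilvus langchain langchain-community langchain-huggingface \
    langchain-milvus langgraph langchainhub tavily-python sentence-transformers
```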
But yes, it looks like that. Basically I have my RAG and I'm using Llama 3, and then I have Milvus as the vector database. And the idea is what I said before: we have routing, which is here. I have the question from my user, then there's a routing part, and it checks: is this actually related to my index or not? If it's not related, we do a web search. If it is related, we first retrieve the documents and then we grade the documents.
So it's actually checking whether the documents are relevant or not. If they are relevant, we're happy and we generate an answer. Then we check for hallucinations: if we have a hallucination, we generate another answer. If we don't have a hallucination, we check whether we actually answer the question. If we think we answer the question, we give the answer back to the user.
And if we don't, we do a web search. Basically you always fall back to doing a web search if something is going wrong. So we're adding reflection and self-correction, and we're going to add planning: you'll see we're going to build a graph that really helps us plan the different actions.
And then we're also adding tool use, so having some specific nodes in the control flow, like web search, and the agent will use those tools. I've already downloaded the model, but if you haven't, with Ollama you just pull the model and then you can use it. I'm just going to load my API key, because I'm going to use Tavily, and Tavily is not open source and not running on my machine, so I need an API key. Then I'm defining the LLM I'm going to use.
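(Roughly, that setup step; the shell command, key handling and variable names are assumptions.)

```python
# Sketch: the model is pulled once with Ollama (shell: `ollama pull llama3`),
# the Tavily key is set, and the local LLM is defined via LangChain.
import os
from langchain_community.chat_models import ChatOllama

os.environ["TAVILY_API_KEY"] = "tvly-..."  # placeholder; Tavily is a hosted API

local_llm = "llama3"
llm = ChatOllama(model=local_llm, temperature=0)                      # free-text generation
llm_json = ChatOllama(model=local_llm, format="json", temperature=0)  # yes/no graders
```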
So it's just Llama 3. And then here, this is where I'm starting to use LangChain, because I'm going to browse the internet as well. Let me show you the different articles I'm going to use. There's this one, which is about LLM-powered agents, so you can have a look.
We're not going to read it, but it tells you different things about agents: the system overview, so, as we said before, planning, memory, tool use, and things like that. Just keep in mind this one is about agents. I have another one, which is about prompt engineering. It's the same kind of thing; it's still related to LLMs and agents.
And then I have a third one, which is about adversarial attacks on LLMs. Again, it's still related to LLMs; you get the idea. So I have my URLs, and then I'm saying: for those ones, let's load them, and we're going to get all the documents. Then I'm going to split them and create chunks. Just for this demo, I'm going to go for a chunk size of 250.
There's no golden rule about that; it's not like you have to use 250. I just found that it works quite well for this case, but it mostly depends on your use case. If you work with legal data, for example, you might want to have longer chunks. You can also use semantic chunking, which creates chunks depending on the similarity within the document. So yeah, you can do that.
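(A sketch of the loading and chunking step. The three URLs are assumed to be Lilian Weng's posts on agents, prompt engineering and adversarial attacks, which match the articles described above.)

```python
# Sketch: load the three blog posts and split them into chunks of ~250.
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

urls = [  # assumed sources matching the three topics shown in the talk
    "https://lilianweng.github.io/posts/2023-06-23-agent/",
    "https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
    "https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/",
]

docs = [WebBaseLoader(url).load() for url in urls]
docs_list = [doc for sublist in docs for doc in sublist]  # flatten into one list

text_splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=0)
doc_splits = text_splitter.split_documents(docs_list)
```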
Then I'm just going to split everything and store everything in Milvus. This is where I define: with the documents I have, with my new splits, I create a collection name, which is rag_milvus, I use the HuggingFace embeddings, the default ones, and I store it in Milvus Lite, so we give it a local file name. So let's go, let's check if it works. This is just a warning, so it's fine. What it's doing at the moment is going to the articles I've shown you before, getting them, creating chunks, and then inserting everything into Milvus.
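(A sketch of that storage step via LangChain's Milvus integration; the collection name and local file path are assumptions.)

```python
# Sketch: embed the chunks with the default Hugging Face model and store them in Milvus Lite.
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_milvus import Milvus

embeddings = HuggingFaceEmbeddings()  # default sentence-transformers model

vectorstore = Milvus.from_documents(
    documents=doc_splits,                         # the chunks created above
    embedding=embeddings,
    collection_name="rag_milvus",                 # assumed collection name
    connection_args={"uri": "./milvus_demo.db"},  # Milvus Lite: a local file
)
retriever = vectorstore.as_retriever()
```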
So yes, this one worked, so we're happy. Now is where I actually create the agent and the different parts of the agent. This one is the retrieval grader. I'm using Ollama, as I said, and this is where I write all the prompts.
That's when you kind of become what we call a prompt engineer. The prompt is: you are a grader assessing the relevance of a retrieved document to a user question. Basically I'm saying: if it's related to the user question, grade it as relevant, and if not, filter it out. Then give a binary score of yes or no to indicate whether the document is relevant or not. The reason I'm doing that is that, from what we see in the research at the moment, LLMs are pretty good at this kind of binary classification problem.
But if you ask them to give a rating from zero to five, for example, they're not very good. So I'm just going to divide everything like that: everything will be a binary score that the LLM has to return. And that's what I'm going to do for my retrieval, and for checking if I'm hallucinating or not, for example. So here I'm giving the document and the user question, and here I have my chain.
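(A sketch of that retrieval-grader chain, reusing the JSON-mode llm_json and the retriever from the earlier sketches; the prompt wording is paraphrased from the talk.)

```python
# Sketch: a retrieval grader that returns a binary yes/no relevance score as JSON.
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import JsonOutputParser

grade_prompt = PromptTemplate(
    template="""You are a grader assessing the relevance of a retrieved document
to a user question. If the document contains keywords related to the question,
grade it as relevant. Give a binary score 'yes' or 'no' as a JSON with a single
key 'score' and no other text.

Document: {document}
Question: {question}""",
    input_variables=["document", "question"],
)

retrieval_grader = grade_prompt | llm_json | JsonOutputParser()

question = "agent memory"
docs = retriever.invoke(question)  # retriever from the Milvus step above
print(retrieval_grader.invoke({"document": docs[0].page_content, "question": question}))
# -> {'score': 'yes'} when the chunk is relevant to the question
```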
So it's the prompt, then the LLM, and then I'm just giving the output as JSON. My question is going to be a simple one, just "agent memory". Let's see if it thinks this one is relevant. It's going to go into Milvus and check whether it's relevant or not, and given that my question is related to agent memory, the score we get is yes. As we saw before, our articles were actually related to agents and agent memory.
So it makes sense that it's telling us yes. Then I have another one, which is just generating some text. This one says: given the context, if you don't know the answer, first say that you don't know, and otherwise generate an answer with three sentences maximum, and keep it concise. This is just like when you ask ChatGPT, for example, and it generates some text for you. That's what we're doing here.
It's basically the same. My question is still related to agent memory, I'm giving the context and the question, and let's see if it actually generates the text we want. It takes a bit of time, but that's fine. And then you can see it here: it generates this text.
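(A sketch of the generation chain just described, again reusing llm and retriever from the earlier sketches; the prompt is paraphrased.)

```python
# Sketch: the concise RAG answer-generation chain (three sentences max).
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

gen_prompt = PromptTemplate(
    template="""You are an assistant for question-answering tasks. Use the context
to answer the question. If you don't know the answer, just say that you don't know.
Use three sentences maximum and keep the answer concise.

Question: {question}
Context: {context}
Answer:""",
    input_variables=["question", "context"],
)

rag_chain = gen_prompt | llm | StrOutputParser()

question = "agent memory"
docs = retriever.invoke(question)
context = "\n\n".join(d.page_content for d in docs)
generation = rag_chain.invoke({"context": context, "question": question})
print(generation)
```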
So: based on the provided context, I can answer that an agent memory is a collection of observations and events directly provided by the agent, and so on. That's what it's doing. That's another action you're giving to your RAG agent. The next one is a hallucination grader.
For this one I'm still using Ollama, and I have a prompt saying: you are a grader assessing whether an answer is grounded in the facts or not. If it's grounded, give a binary score of yes, and otherwise give a binary score of no. Again, it's always like that: you always ask for yes or no. And it's the same question I'm asking; I'm still asking about agent memory.
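(A sketch of the hallucination grader: the same binary JSON pattern, reusing llm_json plus the context and generation from the previous sketch.)

```python
# Sketch: grade whether the generation is grounded in the retrieved documents.
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import JsonOutputParser

hallucination_prompt = PromptTemplate(
    template="""You are a grader assessing whether an answer is grounded in /
supported by a set of facts. Give a binary score 'yes' or 'no' as a JSON with a
single key 'score' and no other text.

Facts: {documents}
Answer: {generation}""",
    input_variables=["documents", "generation"],
)

hallucination_grader = hallucination_prompt | llm_json | JsonOutputParser()
print(hallucination_grader.invoke({"documents": context, "generation": generation}))
# -> {'score': 'yes'} when the answer is grounded in the documents
```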
So it should check the answer and give me a score, and it's again a score of yes: this is grounded. We actually have something about agent memory in Milvus, so for this one we're happy. And as I said before, we're also checking whether the quality of the answer is good or not.
It's still the same here: you give a binary score of yes or no depending on whether the answer is useful to resolve the question. That's what you're grading. And I still have the same question; it's still related to agent memory.
And then we're like, okay, we're happy: the answer is actually useful to the user's question. And here is where I have all my routing. This routing is basically where you tell the agent where to go, depending on what you give it. So I say: you are an expert at routing a user question; either you go to a vector store or you go to a web search. Then: please use the vector store for questions related to LLM agents, prompt engineering and adversarial attacks, and otherwise use the web search. So that's what we say, and then it just gives a choice.
It either says web search or vector store, based on the question, and then it just returns a JSON with a key "datasource". That's it. For this one I have two questions. The first question is: when will the Euro football take place? And the second one is: what are the types of agent memory? We expect the second one to go to the vector store, and we expect the first one to be a web search.
Let's see if my prompts were good, and it seems like it's fine. The first one, which is related to the Euro football, it says is a web search, and for the second one, which is related to agent memory, it says: please go to the vector store. And here is where I'm actually going to do the web search. I'm using Tavily because it's a search engine specialized for LLMs, so it's quite handy. Now, we've defined all those prompts, and defining the prompts is fine, but you don't have the control flow yet.
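(Pulling the routing prompt and the Tavily tool together, a sketch, again with llm_json from before; the prompt and expected outputs are paraphrased from the talk.)

```python
# Sketch: the question router (vectorstore vs. web search) plus the Tavily search tool.
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from langchain_community.tools.tavily_search import TavilySearchResults

router_prompt = PromptTemplate(
    template="""You are an expert at routing a user question to a vectorstore or
web search. Use the vectorstore for questions about LLM agents, prompt engineering
and adversarial attacks. Otherwise, use web search. Return a JSON with a single
key 'datasource' equal to 'vectorstore' or 'websearch' and no other text.

Question: {question}""",
    input_variables=["question"],
)

question_router = router_prompt | llm_json | JsonOutputParser()

print(question_router.invoke({"question": "When will the Euro football take place?"}))
# -> {'datasource': 'websearch'}
print(question_router.invoke({"question": "What are the types of agent memory?"}))
# -> {'datasource': 'vectorstore'}

web_search_tool = TavilySearchResults(max_results=3)  # reads TAVILY_API_KEY from the env
```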
You need to have the flow and the graph to build this. So let's have a look. I won't go over the details of everything because it's quite long, but here we define a graph state, which just represents the state of our graph: you have the question, the LLM generation, whether or not to add a web search, and the list of documents. Then I'm adding all the nodes that are needed for reflection, starting with the retrieve function.
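(That graph state might look roughly like the following sketch; the field names are assumptions based on the description, and the node functions come right after.)

```python
# Sketch: the shared state that flows through the LangGraph graph.
from typing import List
from typing_extensions import TypedDict

class GraphState(TypedDict):
    question: str         # the user question
    generation: str       # the LLM generation
    web_search: str       # "Yes"/"No": whether to add a web search
    documents: List[str]  # retrieved (or web-searched) documents
```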
This one retrieves everything from the vector store, prints what it's doing, and then invokes the question, so it actually does a vector search. Then you have another one, which is generate: as we've seen before, it generates an answer using RAG, based on the documents you retrieved. This one is really helpful if your user asks a question and you have the data, but you just want a nicer answer. Then you have another function for grading the documents.
It's basically checking, as I said before, whether the documents are relevant to the question or not. This one is quite long, but you check the different states, you get the question and the documents, and then you really grade: is this relevant? If the document is not relevant, we print that and we're going to do a web search. That's how we define it. So the LLM is actually grading itself, and if we're not happy, we decide that we're going to do a web search, and then we continue. Then I define my web search function.
Again, it's really based on the question of the user. You have it here: I have the web search tool, and I just invoke it with a query that is the user's question. I return everything, joining the results so it looks nice, and then I return the web results.
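(Taken together, the four node functions might look like this sketch, reusing retriever, rag_chain, retrieval_grader and web_search_tool from the earlier sketches.)

```python
# Sketch of the graph nodes; each takes the state dict and returns the updated keys.
from langchain.schema import Document

def retrieve(state):
    """Retrieve documents from the Milvus vectorstore."""
    question = state["question"]
    return {"documents": retriever.invoke(question), "question": question}

def generate(state):
    """Generate an answer with the RAG chain, based on the current documents."""
    question, documents = state["question"], state["documents"]
    context = "\n\n".join(d.page_content for d in documents)
    generation = rag_chain.invoke({"context": context, "question": question})
    return {"documents": documents, "question": question, "generation": generation}

def grade_documents(state):
    """Grade each document; flag a web search if any document is irrelevant."""
    question, documents = state["question"], state["documents"]
    filtered_docs, web_search = [], "No"
    for d in documents:
        score = retrieval_grader.invoke({"question": question, "document": d.page_content})
        if score["score"].lower() == "yes":
            filtered_docs.append(d)
        else:
            web_search = "Yes"  # at least one irrelevant document: fall back to the web
    return {"documents": filtered_docs, "question": question, "web_search": web_search}

def web_search(state):
    """Run a Tavily search and append the joined results as one extra document."""
    question = state["question"]
    documents = state.get("documents", [])
    results = web_search_tool.invoke({"query": question})
    web_doc = Document(page_content="\n".join(r["content"] for r in results))
    return {"documents": documents + [web_doc], "question": question}
```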
And then you have what is called a conditional edge: based on conditions, your graph can take different paths. This one is routing a question to either a web search or RAG, which is what we saw before. If the data source is web search, we return "websearch" and the agent will perform a web search. If the data source is vector store, we return "vectorstore" and the agent will search everything in Milvus.
Here as well, you define a function for your agent to determine whether or not to generate an answer: either you generate an answer, or you add a web search. You assess the graded documents, and if you can see that the documents are not relevant, then you run a web search instead of generating right away.
Otherwise we have relevant documents, so we can generate an answer. Then here is where you determine whether the generation is grounded in the documents and answers the question, and you return the decision for the next node: are we happy, are we going to do another web search, or are we going to generate a new answer? That's why you define these functions.
So you're basically checking for hallucination in the first place: you check whether the generation is grounded in the documents, then you check whether you're actually answering the question, and then whether the answer was useful or not. If the answer is not grounded, we say it's not supported, and based on that we either generate again or do a new web search.
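(A sketch of those decision functions, the conditional edges; answer_grader is assumed to be the yes/no usefulness grader described earlier, built the same way as the hallucination grader.)

```python
# Sketch of the conditional edges: each returns the name of the next step.
def route_question(state):
    """Entry decision: send the question to web search or to the vectorstore."""
    source = question_router.invoke({"question": state["question"]})
    return "websearch" if source["datasource"] == "websearch" else "vectorstore"

def decide_to_generate(state):
    """After grading: fall back to web search if any document was irrelevant."""
    return "websearch" if state["web_search"] == "Yes" else "generate"

def grade_generation_v_documents_and_question(state):
    """Check grounding first, then whether the answer addresses the question."""
    question, generation = state["question"], state["generation"]
    facts = "\n\n".join(d.page_content for d in state["documents"])
    grounded = hallucination_grader.invoke({"documents": facts, "generation": generation})
    if grounded["score"].lower() != "yes":
        return "not supported"  # hallucination: regenerate
    useful = answer_grader.invoke({"question": question, "generation": generation})
    return "useful" if useful["score"].lower() == "yes" else "not useful"
```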
Then here we just add all the different nodes that we defined before: the web search node, the retrieve node, grade documents, and the generate one. I'm just going to execute everything; that should be very quick. Let me check if it worked, yes.
That was executed, so now we define the graph. We have the graph running and now we're going to build it. Basically, I'm setting an entry point, and I'm going to route the question: either the question goes to web search or it goes to the vector store. If you had something else you wanted to add, then of course you would have to create a function as I did before, but then here you would also have to set a different entry point.
If, I don't know, you have Milvus but you also have another database for different data, then you could also check there. Then I add an edge to grade the documents once they're retrieved, then I grade them and decide whether to generate or not: either I do a web search or I generate the text. And then here you grade the generation and check whether you're actually answering the question.
If it returns "not supported", then the next step is to generate again. If it thinks the answer is not useful, then you do a web search, and if it thinks it's useful, that's where you end; that's the end of your graph. That's when you return an answer to your user. So let's build the graph. Now it's built, and we're going to actually start asking questions to our agent.
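(A sketch of that wiring, using the nodes and decision functions from the sketches above.)

```python
# Sketch: wire the nodes and edges into a LangGraph workflow and compile it.
from langgraph.graph import StateGraph, END

workflow = StateGraph(GraphState)

workflow.add_node("websearch", web_search)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade_documents", grade_documents)
workflow.add_node("generate", generate)

# Entry point: route the question either to web search or to the vectorstore.
workflow.set_conditional_entry_point(
    route_question,
    {"websearch": "websearch", "vectorstore": "retrieve"},
)

workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges(
    "grade_documents",
    decide_to_generate,
    {"websearch": "websearch", "generate": "generate"},
)
workflow.add_edge("websearch", "generate")
workflow.add_conditional_edges(
    "generate",
    grade_generation_v_documents_and_question,
    {
        "not supported": "generate",  # hallucination: try generating again
        "not useful": "websearch",    # answer doesn't help: fall back to web search
        "useful": END,                # good answer: finish and return it
    },
)

app = workflow.compile()
```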
So I compile the graph here: we just take the workflow and compile it, and that's it. And then we're going to ask the question. The first one is still related to agent memory: what are the types of agent memory? So we start, and then we can see "what are the types of agent memory", that's my query.
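(Kicking off a run looks roughly like this sketch; the printed node names correspond to the trace the speaker walks through next.)

```python
# Sketch: stream a run of the compiled graph and print the final answer.
from pprint import pprint

inputs = {"question": "What are the types of agent memory?"}
for output in app.stream(inputs):
    for node, state in output.items():
        pprint(f"Finished node: {node}")  # retrieve -> grade_documents -> generate ...
pprint(state["generation"])               # the final generated answer
```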
The data source is going to be the vector store, so we route the question to our RAG system. Then we retrieve the documents, and once we're done with the retrieving part, we check: are our documents relevant to the question or not? Here we're grading all the documents, really checking if they're good or not. Then we assess: okay, we graded all the documents, are we happy with that? If yes, we generate an answer. Here, apparently, the LLM was happy, so it decided to generate an answer, and then it starts generating and then checks for hallucinations.
Then the decision was: okay, the generation is actually grounded, so we're happy with the generation. And then it also checks whether it's actually answering the question, and it seems like the generation was addressing the question. So then here, we're done. That's it. And now we actually generate the text.
And the text we have is this one: based on the provided context, there are three types of agent memory. You have the first one, which is a retrieval model, then you have a reflection mechanism, and it seems like it's missing one, but that's fine, that's how LLMs work. Anyway, you can see it gave me an answer that's related to agent memory.
So we're happy here: we did a vector search and everything works. If I want to show you what the graph looks like, it's the one I showed you at the beginning. You have the start, then you go to your vector store, retrieve documents, and grade your documents. If you're not happy, you do a web search.
If you're happy, you generate, and if you think it's useful, then you stop. Otherwise you always default to a web search, or you can loop as well for the regeneration part: you can generate again to see if the new answer is actually good. Now, there's a big conference happening next week, actually, in Europe. So I can ask my agent: when is the WeAreDevelopers conference expected to happen next in Europe? Here it noticed that this is not related to LLM agents or anything, so the routing part decided: actually, let's do a web search here.
So we route the question to our web search. The web search is over, we generate some text, and we again check for hallucinations, checking whether the answer is grounded in our documents and whether it actually answers the question. It seems like we are addressing the question, so we're happy, and it gives you an answer: the WeAreDevelopers conference is expected to happen next in Europe from July 17th to July 19th in Berlin, Germany.
You can check that; it's actually true, it's happening next week. So here we're happy, and that was because the agent decided to do the web search: there was nothing in Milvus related to that. And even if I had a lot of different data in Milvus, you could still do a web search. Also, I live in Berlin, in Germany, so for people who are visiting, maybe you want to know what you can do in Berlin.
So I can ask this question, and again it's going to do a web search and then generate the text; you can see it live, generating the text at the moment. Then it's checking for hallucinations again, and you can see everything happening; basically the agent is checking everything. The decision was that it's grounded.
So we're checking if we're addressing the question, and apparently we are. Let's see what we can do. There are plenty of things to do: you could visit the Berlin Wall Memorial and Documentation Center, you can explore Museum Island, you can check out a shopping street, and if you're looking for something more unique, consider visiting the Topography of Terror, which is, yeah, very unique.
We can say that. Otherwise you can maybe take a stroll through Tiergarten, which is a very nice park in Berlin. So I would say that's actually right. And that's basically how you can build agents with different tools. I've only shown Milvus with a web search, but if you have different tools, you just have to add them as nodes with the different conditions, and then your agent will be able to do everything you want it to do.
You can, for example, even have it check your Google Calendar, or check different things like that. So that's kind of it for my demo. If you liked it, please give us a star on GitHub; it really helps the open source community. And if you have any questions, you can add me on LinkedIn; both links should redirect you to the correct place.
Thank you very much. I've seen a lot of activity in the chat, so there might be some questions. I'm going to take questions now. Thank you. We have a question in the chat: why didn't you use the pipe function? We didn't choose what, sorry? The pipe function.
Ah, from LangChain. Just because I don't like it, sort of; I just don't like it personally, so that's why. It's from LangChain, yes. Here's another one: could function calling fit in this use case, or would it only be if the tool is an API where you need more arguments? For example, an internal API versus web search, which uses freeform text. No, usually the agent will figure out the API parameters, so that's the good part, basically.
If you were to call Google Calendar, that's an API call, and it will infer the different parameters that it needs. Is this an example of a multi-agent system? For example, is each component an agent, or is the whole graph considered an agent? This one is considered a single agent. I have another demo that I'm building that is going to be multi-agent, where they can also work at the same time.
So that's what's planned for the future. How would asking a follow-up question work in the graph, for example if you want to follow up on the third kind of memory that was left out? Sorry, what would you be asking the follow-up question about? The third kind of memory that was left out. Then in your graph you would define another action, which could for example make it stop. It depends on how you want to ask the follow-up question: whether it's something you want to be automatic, or something manual where you yourself ask the follow-up question. Otherwise you can always add a condition which says: if you don't know, and you already did a web search and tried to generate an answer and you're still not happy, then take a break and ask the user for a follow-up question.
That would usually be how I would do it. And then, sorry, I'm just trying to go through the chat. If you can use the Q&A tool, it really helps me not miss your questions. Okay.
How close is it to the OpenAI Assistants API? I feel like the OpenAI Assistants API has more resources, I would say. Open source can be nice because people can build on it directly or collaborate. How close we are, I can't really answer that question. I think it's the same as with the different open-source language models: how close are we to GPT-4o or something? You have benchmarks, but it depends.
Is using LangGraph the only way to build multi-agent systems, or are there other approaches or alternatives? No, there are different tools available for that. LangGraph is one; then you have Llama Agents, which was released last week, and you have CrewAI as well, which is also quite well known in the open-source community for that. I'm sure there are closed-source ones too.
I tend to not use those, as I'm mostly working with open source. You could also build your own: you can write everything in Python yourself, which might take a lot of work, but it's something you could do. It's the same as when you build a basic RAG application: you can either use LlamaIndex or LangChain or different tools, or you can do it yourself if you want to control everything. And how do you define an agent, Stephen? Usually I would define an agent as something that takes actions on its own.
So that's what I would say. Even something that's just checking whether the answer is correct or not is usually something I would call an agent. Here's another one for you: frameworks like Ragas check things like answer relevancy and groundedness. Is the difference here that this is live evaluation within the nodes? Yeah, Ragas is really checking for... yes, basically yes.
I just read the question. So yeah, it's checking live, basically. What I'm doing here is basically the equivalent of when you talk to your LLM and it gives you an answer, and you already know the answer but it gives you the wrong one, and you ask it: are you sure? And then the LLM goes: oh, actually you're right, I was wrong, and so on. It's basically doing that.
So yeah, the difference is that here it's live. When should we use PEFT on these open-source LLMs? That's a very good question. It's the one from Hugging Face, I'm going to assume. So far I haven't really looked at it, so I'm not going to tell you anything or invent something; I'd have to search and follow up on it. You don't want to hallucinate an answer.
Exactly. How would you differentiate the usage of LangChain and LlamaIndex? Is this use-case specific? Yeah, it also comes down to your own preference. Some people might prefer LangChain or LlamaIndex depending on the abstraction level, or on the integrations they have. On my end, back then LlamaIndex didn't have Llama Agents, so LangChain was the first one to really have agents and to make it easier, which is also why I went for it.
But yeah, it can also be use-case specific, and it mostly comes down to your personal preference. Great. Sorry, I lost my question window for a moment. No worries.
Why use the same LLM for hallucinations? To check for hallucinations, I guess? Usually it's just a first check. It's not a mixture of experts or anything; I'm using one and only one, because they're usually capable of checking whether what they said was correct or not once you challenge them a bit. It's the example I gave before: if you ask "are you sure?", I'm sure most of you have seen ChatGPT go "oh no, actually you're right, I was doing something else", and then it either gives you the right answer or it gives you another wrong answer, which is very annoying.
But that's the problem. There is also the mixture-of-experts approach, having different LLMs running for your agents; that's also something that's possible. It also depends on your use case and on the cost. You could, for example, use a smaller LLM to check for something very specific, having LLMs that are more task-specific instead of one big one trying to do everything. A lot of people, especially the cloud providers like AWS or Google Cloud or OpenAI, do that.
They use those to have a pretty good overall system. I know that AWS is doing that with Q: they mostly have mixtures of experts, smaller LLMs that have been fine-tuned for very specific tasks. So yeah, it depends. And what is the reason you should be using an agent? Are there any pros and cons, and how do you think about that? What's cool is that, I mean, I find it cool, so that's the first pro: you can see your agent actually checking the internet or booking something in your calendar. That's the cool part.
The bad part is that it's doing things on its own, so you might have a surprise in your calendar tomorrow, like a fully booked calendar, as a surprise. Another con is that it's more expensive and it uses more resources, obviously, because it has to think and browse things. And then the pro is that if you don't have the data in your vector database, for example, instead of hallucinating or giving an answer that is not relevant, you're able to give a relevant answer.
So your users might be happier as well. Great. Any last questions from the audience? Otherwise we can wrap up today's session. We'll just give it a minute while we wait for any last questions. I just want to thank everyone for spending some time with us today, and Stephen for the great presentation. Last call on any questions? Apparently we covered everything.
Oh, here we go: can you give some insights on how much it would cost in production? It depends on the model, on what you use; there are so many factors that it's almost impossible to answer. I'm just going to say that compared to a basic RAG system it is more expensive, because you're going to have way more actions and generate way more tokens.
So yeah, it's two or three times as much, at least. But it's not too expensive, depending on the amount of users you have and everything. Usually, if cost is a problem, I would recommend having smaller LLMs that each do some of the actions instead of having one big one. I mean, I only have Llama 3, which is a small one, on my laptop, but maybe you have some LLMs that are just there to do one task; they're there to, I don't know, call APIs for your calendar or something.
Low-latency responses are becoming more desirable, and agentic RAG systems often take quite a while. What are some optimizations you might suggest for an agentic system? The first one I suggest is not related to LLMs: it's your user experience, your UI. Make sure your users understand that something is happening while they wait. That's one I usually suggest. The other is to trim down on actions; you can filter on a lot of data.
For example, if you do a vector search, try to filter through metadata, try to filter out the things you don't need to scan through. I would say those are usually the main ones. I'm just going to share my screen again, but that's usually what I suggest: make sure users know what's happening, so that even if they wait, they understand why they're waiting. Great.
I think that is it on the question front. Yes, and as you said at the beginning, I organize events in Berlin. I'm organizing the biggest hackathon in Germany in October, so feel free to follow me on LinkedIn and you will see; we'll take people from everywhere in the world. So even if you're not in Berlin, you're more than welcome to come. Great.
I'm excited for that event. Thank you, Stephen, and thank you everyone who joined us today. We will call it a day. Keep an eye on your inboxes for the recording, the link to the notebook, and today's slides for your review, and we will catch you at an upcoming event.
Take care. Thank you.
Meet the Speaker
Stephen Batifol
Developer Advocate
Stephen Batifol is a Developer Advocate at Zilliz. He previously worked as a Machine Learning Engineer at Wolt, where he was working on the ML Platform and as a Data Scientist at Brevo. Stephen studied Computer Science and Artificial Intelligence. He enjoys dancing and surfing.