Webinar
LLM Application Development with LangChain
Today I'm pleased to introduce today's session, Building LLM Applications with LangChain, and our guest speaker, Lance Martin. Lance is a software engineer at LangChain with a background in applied machine learning. Prior to LangChain, Lance spent over five years as a manager and perception lead working on perception for self-driving cars, trucks, and delivery bots. Before working on self-driving, he received a PhD from Stanford. Lance is also joined by my colleague Yujian Tang, a developer advocate here at Zilliz. Welcome, Lance and Yujian.
Alright, I guess I can just get started here. It's great to be here. Yujian and I have chatted for several months about different projects and ideas, so it's great to finally be here and speak to you all. Just as an intro, what is LangChain? It's an application development framework that makes it as easy as possible to develop LLM-powered applications. And you can think of it across roughly three layers.
There's an underlying platform, which is currently called LangSmith, which has tooling for debugging, testing, and monitoring applications, and we'll give a bunch of examples of that in this talk. Then there's an open source library of building blocks, which has many integrations, Milvus being one of them, different chains, different LLMs. We'll talk through all of that. And then on top are use cases, things you actually want to do with it, and we're going to talk through those as well.
Maybe as a bit of context, there are two general ways that pre-trained LLMs can learn things. One is by updating their weights. You can think about weight updates as cramming before a test. In the case of fine-tuning, you're packing knowledge in, given instructions, and there's a lot of good evidence that it's actually bad for factual recall. There are some links I provide below that you can browse when you see the slides; it can actually increase hallucinations. But it is good for tasks like extraction or text-to-SQL, things where you're modifying the form of an input-output pair.
On the other hand, you can of course feed information in via prompting. This is retrieval-augmented generation: retrieving documents and stuffing them into the prompt. That's very good for factual recall. So a good summary of this is "fine-tuning is for form, not facts," a nice phrasing provided by Anyscale in their good blog post on this topic.
That's just something to keep in mind when thinking about fine-tuning versus retrieval. A little bit more on retrieval: on one hand you have these pre-trained LLMs that effectively have a lot of information packed in their weights. On the other hand you have search engines, which just retrieve information. And then there's this middle ground of retrieval-augmented LLMs, or retrieval-augmented generation, in which you're actually retrieving documents from some source, Milvus being one, packing them into the working memory or context of an LLM, and having it do a task. We're going to talk about a lot of tools to enable this today, and we'll also talk about fine-tuning at the end and when you might choose one versus the other.
First, building blocks. Document loaders: this is the first step in really any retrieval application. You need to get documents, and the short story is we have a lot of integrations. You can load unstructured versus structured data, proprietary data such as your local files and local text, or public data such as arXiv papers, Twitter, Wikipedia, and so forth.
So this is how you access data and feed it into a system that can be used for retrieval. Now, we also have text splitters, and these are often required because LLMs have a limited context window. You can't just stuff an arbitrary number of tokens into the prompt of typical LLMs; it varies, and we'll talk about that later. But text splitting is commonly used: you take a bunch of text, split it, and then embed those splits and store them in vector stores.
That's a very common flow for retrieval, and we'll talk about it in the following slides. It's also worth noting that we have a lot of different splitters that have some tricks. One is context-aware splitting, where you take your source document and break it into chunks, but every chunk retains some tag about where it came from in the original. If you're working with code, every code chunk has an attribute saying which function or class it came from; in a markdown file, it retains the headers; in a PDF, it retains the section summaries or section parts.
And this is very useful in retrieval and certain applications. If you want to ask questions about certain parts of a document, there are ways you can specify that so only those parts of the document are retrieved, and we may talk about that a little later. But the key point here is that there are ways you can do document splitting that retain context from the original document.
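As a rough illustration of that kind of context-aware splitting, here is a minimal sketch using LangChain's markdown header splitter; the header names and the example text are just placeholders, and the class names reflect the library roughly as of this talk:

    from langchain.text_splitter import MarkdownHeaderTextSplitter

    # Split on headers; each resulting chunk keeps the headers it fell under as metadata.
    headers_to_split_on = [("#", "header_1"), ("##", "header_2")]
    splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

    md_text = "# Paper\n## Abstract\nSome abstract text...\n## Methods\nSome methods text..."
    docs = splitter.split_text(md_text)
    for d in docs:
        # e.g. {'header_1': 'Paper', 'header_2': 'Abstract'} plus the chunk text
        print(d.metadata, d.page_content[:40])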
Now this is where things get kind of interesting: we have embeddings and vector stores, and this is where Milvus comes in, of course. So there's loading and splitting, and then you take every split, embed it, which is basically mapping it to a representation that can be easily searched, and you store that. And as you see here, LangChain has integrations with many different embeddings as well as providers. For embeddings there are of course hosted options like OpenAI, and there are hosted vector stores. You can also go private: you can do on-device embeddings, and on the vector store side you can run Chroma on-device, for example. So there's flexibility in terms of how you want to operate.
Some people want private, on-device only, and we have support for that. Alternatively, hosted has a lot of benefits like scaling and speed, and there are of course integrations there too.
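For the embed-and-store step, a minimal hedged sketch with LangChain's OpenAI embeddings and the Milvus integration might look like this; the connection arguments assume a local Milvus instance on the default port:

    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import Milvus

    # Embed each split and index it in Milvus for similarity search.
    vectorstore = Milvus.from_documents(
        documents=docs,  # the splits from the earlier sketch
        embedding=OpenAIEmbeddings(),
        connection_args={"host": "localhost", "port": "19530"},
    )
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})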
Now, LLMs are really the knowledge center, the knowledge core, of this retrieval-augmented generation flow, and we have many integrations for LLMs. The current landscape looks like this: the current state of the art is GPT-4. The context window goes up to 32K tokens, and you can see the cost there. Notably, Anthropic's Claude 2 is quite close to GPT-4; it is cheaper per token, and I provide a link there you can have a look at, so that's interesting to note. It also has a larger context window of 100K tokens, which is quite large, around 75 pages of PDF.
In addition, there are some open source models that have become very popular recently, Llama 2 being one, and the cost is free, which is great. The 70-billion-parameter variant is about on par with GPT-3.5 for everything except coding, which I show below.
You can see, if you look at this final column, that for Llama 2 the coding score is 29.9, versus 48.1 for GPT-3.5. That has some interesting implications, maybe for agents and other things. But the other point is that on language, Llama 2 is actually about as good as GPT-3.5 for the 70-billion-parameter variant.
So that's really cool: it's free to use. That one's actually a bit harder to run on your laptop, but in principle, if you have enough memory, it is possible to have a free-to-use, open source, roughly GPT-3.5-level model that is yours and completely private.
So that's pretty cool. It's worth noting what's happened in the open source ecosystem over the last year. Only a year ago we had models like OPT, which you can see on the left here, which lagged the state of the art by a lot. And in one year we've gotten Llama 2, which, as we just said, is pretty close to GPT-3.5. One big difference you can see is the base model: training tokens have ramped from around 200 billion to 2 trillion. That's one very important point. The other is fine-tuning, and we'll talk about this later, but we've found there are a lot of tricks, which we're going to cover. Instruction-tuned base LLMs can get quite good and rival at least the visible performance of a very high-quality generalist model like ChatGPT or GPT-3.5, and we'll talk about that a little bit.
There are some nuances there in terms of whether it's actually as good or whether it just appears that way, because often these models have been fine-tuned (you can see the instruction counts), and they've often been fine-tuned on ChatGPT dialogues. So there are some questions about whether it truly is at that level of quality, but at least it mimics the chat behavior quite well. And there's a lot of activity now on fine-tuning Llama 2, of course. So that's the landscape of open source models. Now, this is a fun example.
This is Llama 2 running on my MacBook. I have an M2 Max with 32 GB, so it is a beefier MacBook, but you can see this will run at around 50 tokens per second. That's close to real time, which is pretty awesome, and it's completely private and free.
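If you want to try something like this locally, one hedged option is LangChain's llama.cpp integration; the model path below is a placeholder for whatever quantized Llama 2 weights you have on disk:

    from langchain.llms import LlamaCpp

    # Runs a quantized Llama 2 chat model fully on-device (no API calls).
    llm = LlamaCpp(
        model_path="./llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path to local weights
        n_gpu_layers=1,   # offload layers to the GPU / Apple silicon if available
        n_ctx=2048,
        temperature=0.0,
    )
    print(llm("Name three use cases for retrieval-augmented generation."))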
We have an integrations hub where you can browse all the integrations; I don't want to spend too much time on that so we can save time for questions at the end. But let's talk about use cases for a little bit.
So we talked about RAG, retrieval-augmented generation; this is one of the classic use cases. You start with documents, you split them, you embed and store them, and then you retrieve them. What's happening here is you take a question, you embed the question, and you have all your splits in your vector store that are also embedded. You can do a similarity search across embeddings, cosine similarity for example; there are lots of ways to do very fast embedding lookup, and you retrieve relevant splits related to your question, like we talked about before.
You pack those into the LLM prompt, and you get your answer. That's a nice flow that works quite well. We have a few different levels of abstraction, and you can choose more or less. This is something that's come up with LangChain quite a bit, that it's too abstracted, and I get those concerns. I think the key point is that there are different parts of the library you can use.
If you want more control, something like load_qa_chain lets you completely independently manage the other pieces. If you want an end-to-end pipeline, we also have something that will just do everything for you. So it depends on your level of control, your interest, and so forth. But the point is we support many layers of abstraction.
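To make that concrete, here is a minimal sketch of the higher-level, end-to-end option using RetrievalQA on top of the retriever from the earlier Milvus sketch; treat it as illustrative rather than the exact pipeline shown in the talk:

    from langchain.chat_models import ChatOpenAI
    from langchain.chains import RetrievalQA

    # Retrieve relevant splits and stuff them into the prompt for the LLM to answer from.
    qa = RetrievalQA.from_chain_type(
        llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
        chain_type="stuff",
        retriever=retriever,
    )
    print(qa.run("What does the document say about text splitting?"))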
Now this is where I'm going to introduce LangSmith a little bit. LangSmith is a tool for observability and debugging, like we said before. What you can see here on the left is that when you run one of these pipelines, LangSmith will automatically log and show you the trace of everything that actually happened under the hood. So this is nice if you want to see what is actually going on: you can look at the trace, and you can go down here to the chat OpenAI calls. This is running GPT-3.5, and you can see exactly what's going in.
This is kind of cool. So this is the prompt, and you can see all you're doing is "use the following pieces of context to answer the user's question; if you don't know the answer, just say you don't know." And then here are literally the chunks that we retrieved from our vector store, which could be Milvus, and you can see that run here in the trace.
I don't know if you can see my mouse, but I'm looking over at the trace. The retriever is run, the stuff-documents chain takes the retrieved docs and stuffs them into the prompt, and you can see here are the docs, here's the question, and the LLM produces the response based on the docs. So you can very easily audit what is actually getting retrieved and what's going on with a tool like LangSmith.
So that's what I wanted to highlight here. Now, chat is a variant of what we just talked about, with memory. You can also have retrieval plus chat; many retrieval-augmented generation applications also enable chat, which just keeps a kind of persistence of your conversation history. You can see it here.
Here's a chat history that's persisted; it's all passed in the prompt. The LLM has that whole history, so it can disambiguate references: it knows that "it" refers to the sentence "I love programming." That's the benefit of chat, and I think that's pretty intuitive.
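A minimal sketch of retrieval plus chat with persisted history, assuming the same retriever as before, could look like this:

    from langchain.chat_models import ChatOpenAI
    from langchain.chains import ConversationalRetrievalChain
    from langchain.memory import ConversationBufferMemory

    # Keeps the conversation history and passes it back into the prompt on each turn.
    memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
    chat = ConversationalRetrievalChain.from_llm(
        llm=ChatOpenAI(temperature=0),
        retriever=retriever,
        memory=memory,
    )
    chat({"question": "I love programming. Can you translate that sentence to French?"})
    chat({"question": "Now translate it to German."})  # "it" is resolved from chat history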
Now, summarization is another area we've done quite a bit of work in, and it's a common application. If you start with, for example, a large corpus of documents, you may want to summarize them, and there are a lot of ways to do that.
In short, if it fits in the context window, just stuff it in, and with 100K-context-window models that's about 75 pages of PDF, which is pretty good. But if it doesn't fit, there are a few tricks. One is you can take the documents, split them, embed them, cluster them so you're grouping similar embeddings together, and then sample those clusters and pass the sampled clusters into an LLM. That can reduce the size significantly while still preserving the essential chunks or clusters that you want to summarize, and you can summarize each cluster.
You can also do what we call map-reduce: you split the documents, you summarize each chunk, and then you summarize the summaries, basically. And I'll show a use case here. We get a lot of user questions on the LangChain docs, and they're captured by a chat service. So we took about 30,000 user questions and pushed them through the summarization pipeline using map-reduce as well as clustering, with a few different LLMs.
And you can see here what we saw with map-reduce. We asked, what are the main themes in the questions, what are people most confused about? We used map-reduce plus OpenAI, we used Anthropic with map-reduce, and we used OpenAI with clustering, and there's reasonable agreement, which is good. This gives us a very nice way to look at a massive corpus of data that you could never really grapple with by hand, using an LLM to synthesize what's going on. We also had the LLM produce example questions in each category; if you ask for the top five questions per theme, it'll do that. So summarization is very useful for these cases: summarizing large corpora of documents, massive PDFs, user questions, support tickets. It can be very helpful for those types of applications.
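The map-reduce variant described here is available as a prebuilt chain; a hedged sketch:

    from langchain.chat_models import ChatOpenAI
    from langchain.chains.summarize import load_summarize_chain

    # Summarize each chunk ("map"), then summarize the summaries ("reduce").
    chain = load_summarize_chain(ChatOpenAI(temperature=0), chain_type="map_reduce")
    summary = chain.run(docs)  # docs are the splits of the large corpus
    print(summary)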
Extraction is another major use case. If anyone has spent time trying to prompt an LLM to give you a JSON output, they know the pain of extraction; it is not easy. So function calling has emerged over the last few months as a really nice way to do this.
OpenAI supports function calling, Anthropic's Claude does, and the open source models are emerging as well. But here's a good example: given an input sentence and a schema, you can do this function call, which is an information extraction function that will extract information per the schema, and you get what you want out. Now let's look at LangSmith again, and you can see what's happening right here. This is the output from the function call, and here's the prompt.
It basically says: extract and save the relevant entities in the passage, use this function. And here's the passage. So this actually tells the LLM, which is enabled for function calling, to use that function, which is then supplied to the LLM as a separate argument. It's a nice trick.
You can specify these functions and pass them to an LLM that supports function calling; the model will then know to use that function if instructed appropriately. And you can get very consistent, high-quality output following a JSON format or a Pydantic data class. It's quite flexible. This is another application area people really like.
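For reference, a minimal function-calling extraction sketch with LangChain's extraction chain; the schema fields here are just an example:

    from langchain.chat_models import ChatOpenAI
    from langchain.chains import create_extraction_chain

    # JSON-schema-style description of what to pull out of the text.
    schema = {
        "properties": {
            "person_name": {"type": "string"},
            "person_height": {"type": "integer"},
        },
        "required": ["person_name"],
    }
    llm = ChatOpenAI(model="gpt-3.5-turbo-0613", temperature=0)  # a function-calling model
    chain = create_extraction_chain(schema, llm)
    print(chain.run("Alex is 5 feet tall. Claudia is one foot taller than Alex."))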
I'm not sure about you all; I've been a software engineer for a while, and I'm definitely not very good at SQL, and I don't like writing it. There are a lot of nice capabilities now where you can actually have LLMs write SQL for you from natural language.
So going from a question, or text, to SQL is a very common and popular application. We also have chains and agents that can automate running the query, as well as running many queries in the case of an agent. We'll talk about that a little later, but this is actually fun to see. There are some interesting papers on this topic that I link below, and folks have found that you can get quite good performance on text-to-SQL if, in the prompt, you give the LLM basically what the table contains plus some example select statements showing what the rows look like. If you give it that and then your question, it can convert that to SQL pretty effectively.
The paper that talks about this is pretty interesting, and this is exactly what we're doing. You basically pass the LLM your table, and what happens behind the scenes is that LangChain will do this extraction and pass that information into the prompt. So the LLM, when it's doing the SQL generation, has access to your table definition and a selection of a few rows, and from that it's able to produce a SQL query.
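A hedged sketch of a text-to-SQL chain in LangChain; the SQLite URI points at the sample Chinook database used in many docs examples, so substitute your own:

    from langchain.chat_models import ChatOpenAI
    from langchain.utilities import SQLDatabase
    from langchain.chains import create_sql_query_chain

    db = SQLDatabase.from_uri("sqlite:///Chinook.db")  # placeholder database
    # The chain injects the table definitions plus a few sample rows into the prompt.
    chain = create_sql_query_chain(ChatOpenAI(temperature=0), db)
    sql = chain.invoke({"question": "Which country's customers spent the most?"})
    print(sql)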
Now, agents. This will be the last big theme I'll talk about here, and it's a topic that's been of very high interest for a number of months. There's maybe a simple way to think about agents: a basic LLM does not have tools and does not have memory. You can give an LLM access to tools.
You can give an LLM access to, say, an API; we actually have a number of chains that do that. For example, you can give an LLM access to a search function. That's the tools piece.
You can also give an LLM memory, and that's like chat, right? We talked about that: if an LLM has some short-term memory, you have a chatbot. Memory plus tools gives you an agent; that's maybe a simple way to think about it, combining those two principles. And there are a few different pieces of this ecosystem. LangChain has lots of integrations to support agents. On the tool side there are lots of different tools and toolkits: Google Search, Gmail, pandas, many other things you might want to use. For memory there are of course the vector stores,
Milvus being one potential longer-term memory option. There are short-term memories used for chat, like buffers, and then there are a bunch of different types of agents. We'll talk about action agents today. There's a lot of interesting work on simulation; the generative agents paper is very interesting here, where you can actually have agents take on personalities and act as NPCs, effectively, in a game environment.
And there are autonomous agents with more long-term planning horizons, like AutoGPT and BabyAGI. We have many different LLM integrations to support the reasoning behavior necessary for agents. Let's talk about action agents briefly. ReAct was one of the first and most popular action agents to emerge. You can think about it this way, fairly simply: an LLM with standard prompting does not necessarily exhibit multi-step reasoning. With chain-of-thought prompting,
and there are a lot of interesting papers and blog posts on this, LLMs can do multi-step reasoning. You're conditioning the LLM to show its work; often this is the "please think step by step" kind of tag you can include in your prompt. That's chain-of-thought prompting.
It's just conditioning the model to show its work, and there's a lot of empirical evidence that it improves performance on certain tasks. Now, alternatively, there's been a lot of work on getting agents to learn through action and observation, through tool use. There's some interesting work out of Google, the SayCan paper, that basically equips an LLM with robotic tools so it can interact with the environment and get observations back, but it doesn't necessarily do multi-step reasoning. And so ReAct was the idea of bringing the two things together: give the agent access to tools and allow it to do multi-step reasoning.
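For orientation before the trace walkthrough, this is roughly how such a SQL ReAct agent gets wired up in LangChain, assuming the db object from the earlier text-to-SQL sketch:

    from langchain.chat_models import ChatOpenAI
    from langchain.agents.agent_toolkits import create_sql_agent, SQLDatabaseToolkit

    llm = ChatOpenAI(temperature=0)
    # The toolkit exposes tools like sql_db_list_tables, sql_db_schema, and sql_db_query.
    toolkit = SQLDatabaseToolkit(db=db, llm=llm)
    agent = create_sql_agent(llm=llm, toolkit=toolkit, verbose=True)
    agent.run("List total sales per country. Which country's customers spent the most?")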
And let's show a specific example, because this is kind of high level without one. This is actually the SQL ReAct agent that we showed previously, and if you look at the trace here on the left, you can see the things it's doing. It used a tool to get the SQL database tables: sql_db_list_tables gives it a list of the tables.
You can see that's passed into the prompt here, where the observation from that action is the list of all the tables in the SQL database. You can see up here, action, if you can see my marker; the tool or action arrow points to "here's the action I performed, here's the observation from that action." So the action is sql_db_list_tables, the observation is "here are my tables," and then the thought is the next step in the sequence.
The thought is "I should query the schema of the invoice and customer tables." It defines a new action, and the action input is "invoice, customer," which gets passed to sql_db_schema. The original question, which we can see in the input here, is "list total sales per country; which country's customers spent the most?" So it's working through this, and you can see it's basically using tools to perform actions.
Those tools return observations, and then it's thinking about what to do next. That's exactly how these agents work. So hopefully this gives you some hands-on appreciation for what's actually happening under the hood. I will make a note that, if anyone has worked with agents, they've probably experienced this as well: they're not the most reliable.
We actually had a use case for an agent: we wanted to do web research, where you can ask an agent a question and it will go off, scour the web, pull relevant documents, and summarize them for you in a nice report. There's some interesting open source work on this; GPT Researcher is one example. We took this on, and we actually found that a retriever was good enough for this particular use case.
It was much more reliable than an agent. So I think it's just a cautionary note that you don't always need an agent. In this case, the retriever is able to perform the search, basically read or scrape the HTML pages, transform them, store them in a vector store (it could be Milvus, for example), retrieve chunks relevant to your question, and summarize them; that's all that's going on. We actually have a hosted app that does this.
Here's an example, our Web Explorer app; it's all open source, and the link is down here. You can see the answer getting generated to your question from a web search, and it gives you all the pages that it scoured to get those sources.
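The app is built around a retriever rather than an agent; a hedged sketch of the core pieces (the Google search wrapper needs API credentials set as environment variables):

    from langchain.chat_models import ChatOpenAI
    from langchain.retrievers.web_research import WebResearchRetriever
    from langchain.utilities import GoogleSearchAPIWrapper
    from langchain.chains import RetrievalQAWithSourcesChain

    llm = ChatOpenAI(temperature=0)
    # Searches the web, scrapes and splits the pages, and indexes them in the vector store.
    retriever = WebResearchRetriever.from_llm(
        vectorstore=vectorstore,  # e.g. the Milvus store from earlier
        llm=llm,
        search=GoogleSearchAPIWrapper(),
    )
    qa = RetrievalQAWithSourcesChain.from_chain_type(llm, retriever=retriever)
    print(qa({"question": "How do plants communicate?"}))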
So the point is, we started with an agent and ended up with a retriever, and we found that in some cases retrievers tend to be more reliable; it's all you really need for this particular application. Now, the final section. I want to save a lot of time for questions, so I'll make it a little quick. We'll talk about tooling, and LangSmith in particular.
We've seen a lot of LangSmith traces, but I'll emphasize this point again about fine-tuning versus prompting or retrieval, and I'll show you how we used LangSmith to perform a case study on fine-tuning. As a note, we talked about how fine-tuning is really good for form, for tasks like extraction. It's less good for things like retrieval or factual recall, in which case retrieval-augmented generation is typically much better. Again, I link the sources below.
So we focused on an extraction task, and the task is kind of esoteric: it's extraction of knowledge graph triples. If you've ever built a knowledge graph, you typically build it from these triples of subject, predicate, object, and we wanted to see how well LLMs can do this extraction. We actually have a playground here, open source and free to use, where you can pass in any text and it'll perform this triple extraction; you can see the triples down here. So we collected a lot of examples of this. We released it a couple of weeks ago.
Lots of people played with it and put different text inputs into the app, and you can see this feedback button down here. Now, this is where LangSmith comes in: this is connected to a LangSmith project, and LangSmith collects all those outputs. Imagine that being all your user outputs, or all your LLM generations, in an app that users are interacting with.
Imagine users can give feedback, thumbs up or down. You can take that in LangSmith, filter on bad feedback, and say, okay, here are all the places where the LLM is doing badly; let me try to fine-tune an LLM to improve on that. So this is kind of an example case. You can collect those generations at relatively large scale within LangSmith, you can query for, say, all the bad generations, you can build datasets, and you can inspect and clean them, all within LangSmith.
And that's what I'm showing over here. Now, for the fine-tuning piece, you can effectively take your dataset and split it into a train set and a test set, the test set for evaluation, which LangSmith also enables. On the train set you can build more synthetic data if you want; we don't talk about that too much here, but it's another interesting option. And then you can fine-tune.
So this is the flow: LangSmith lets you, if you have an application running, easily collect all those generations, filter them according to various things you care about (it could be time, it could be user feedback), build datasets from them, and inspect and clean those datasets. And then those are set up for the different things you want to do, one being evaluation, another being fine-tuning.
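A hedged sketch of that collection step with the LangSmith client; the project and dataset names are placeholders:

    from langsmith import Client

    client = Client()
    # Pull logged generations from a project, e.g. to filter and build a dataset.
    runs = client.list_runs(project_name="triple-extraction-app", error=False)

    dataset = client.create_dataset("triples-train", description="curated extraction examples")
    for run in runs:
        client.create_example(inputs=run.inputs, outputs=run.outputs, dataset_id=dataset.id)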
So in our case, here's an example. We took this triple extraction challenge and tried a few different things. We first tried few-shot prompting GPT-4 and GPT-3.5. We tried base Llama 2 7B, and we tried fine-tuned Llama 2 and fine-tuned GPT-3.5, which actually just came out two or three days ago; the ability to fine-tune ChatGPT is very interesting.
We evaluated all of them with LangSmith, which gives us a metric from zero to 100 on performance. And what's cool is it gives us a very nice way to audit all of those evals. That's what you can see here: you can drill in, and for every generation you can see the score for that particular generation and dig in even further, which I'll show shortly. I'll talk about conclusions here, but this gives you a stack rank.
We actually found few-shot GPT-4 to be the best, fine-tuned GPT-3.5 second, fine-tuned Llama 2 third, few-shot GPT-3.5 fourth, and base Llama 2 fifth. It actually kind of matches what you might expect. We'll talk about this in a bit more detail, but first I want to show you what's happening in fine-tuning. This is kind of funny; I don't know if you can see it, but this is the base Llama 2 chat model.
The answers we want, you can see, are these very specific extracted triples; they're almost like JSON, basically, where you want subject, object, relation; subject, object, relation. The baseline model doesn't know that, and it answers in this informal way. The passage here is "by this point, Simpson had returned to his mansion in Brentwood and surrendered to police," which is probably about O.J. Simpson.
The model hallucinates Homer Simpson as a subject: it says the subject is Homer Simpson, the object is police, the relation is surrenders. So it does hallucinate, and you can see the output is in this chatty style, which is not what we want. With a trivial amount of fine-tuning, we can get much closer. It's still not perfect, and we can talk about that, but it's much closer to the output style we want.
And this is fine-tuning on only about a thousand instructions. On an A100 in Colab this takes only about five minutes; it's very quick, and you can see you get much closer to the intended output. So maybe just to summarize, and then we can move to questions.
LangSmith can help address some pain points in the fine-tuning workflow: data collection, evaluation, inspection of results. Like we said, we found that RAG or few-shot prompting should definitely always be considered first before you undertake fine-tuning; few-shot prompting of GPT-4 was actually the best performer. But fine-tuning smaller, open source models can outperform much larger generalists: fine-tuned Llama 2 chat was actually better than few-shot GPT-3.5.
So that kind of matches what's been reported elsewhere in the literature and in recent blog posts, and it maybe gives you a flavor for how LangSmith can work in a practical application like fine-tuning. And with that, I think I'll just open it up for questions. I think we're only at the half hour, so plenty of time.
Maybe I'll just stop talking and open the field up if there are any questions or topics we want to discuss more. Great, yeah, thanks for this really good overview of all the things that are going on with LLM app building in LangChain. I have quite a few questions. I'll let people ask questions in the chat first, but if we don't get any questions typed up in the next minute or so, I'm going to start asking away. Yep.
Okay, I don't see any pop up yet, and I think there are some parts of this presentation that I would have loved to dig deeper on. One of the things I wanted to ask about was: why is it that these agents are not super reliable? Can you talk a little more about why sometimes the agents will give you different steps, or perhaps not be able to run the steps, or sometimes they'll repeat their steps and it's like, okay, this is not very useful? Yeah, so one thing I'll note, empirically, and I'm trying to find the agent part, is that agents with, for example, the open source Llama models are notably worse than agents with some of the hosted, closed source, proprietary models.
One hypothesis I've heard for this, and I'll go back to where I show the data, is that they're quite a bit worse at coding, and there's some speculation that coding ability, or the kind of structured reasoning present in coding, is quite important for agent behavior. That's maybe one thread. The second thread, and frankly I am not an expert in agents and haven't done that much work with them,
so I would defer to folks that have deeper expertise in this area. But frankly, I've found in my work with agents that the difficulty of actually inspecting what they're doing under the hood is one of the main challenges, and that's actually one of the main motivations for LangSmith. Here you can actually see the trace and progression, to understand if it gets stuck and where it gets stuck. That doesn't necessarily answer the why question, but it does help you understand where precisely things are going haywire, which I found to be a missing piece when I've worked with agents. So I think one is that logical reasoning, in the form of coding, is certainly something we've seen empirically is very important for agents.
Two is the ability to actually trace it. It could be implementation problems, it could be the LLMs getting confused, and frankly I've found it very helpful to be able to inspect the prompts specifically going into the agents, using something like LangSmith. I've actually done this a decent amount with generative agents, the more recent simulation agents, and it's very helpful to see what's actually going into the prompt to help debug why and where it's getting confused. I think it's very nuanced as to where and why agents go off track; I think good observability is the starting point. That's maybe what I'll say for now.
Cool. Yeah, so now we have a few questions that have popped into the Q&A. Do you want to talk about automatic evaluation of prompts? Let's see, of prompts in particular? Well, maybe evals in general, because I actually wanted to ask about evals in general as well, so this would be a great chance to cover both. Evals are a good topic.
So I actually put out an open source eval app called Auto-Evaluator a couple of months ago, and the main intuition there was that you can have LLMs themselves grade responses. Given a generation and an output statement, an LLM can look at those and reason about whether or not it's accurate. Now, this is kind of a big topic; there are actually a few pretty good OpenAI cookbooks on this. We had a seminar on it a while back.
The OpenAI folks basically contended that LLMs can perform discrimination pretty effectively: given a generated sentence and a ground-truth sentence, an LLM can compare the two and say whether they're factually consistent. Okay, so that's the central idea behind a lot of the evals we're doing here: an LLM-based grader, typically GPT-4. So what's happening here, for example in these applications, remember this is outputting these triples, so you can see some examples here. Let's look at this explicitly.
So here's the LLM's output; it's basically this structured list with subject, object, relation; subject, object, relation. And we have our reference output. An LLM is going to look at these two outputs and make an assessment as to how consistent they are. There are a lot of questions there, and in fact you should check out the Colab we open sourced.
You can actually see the grading prompt that we use, and you can certainly tune that. I think there are a lot of questions about LLM-mediated evaluation and grading, but at least you can see exactly what we're doing. The point is we're using an LLM grader with a very specific grading prompt, and it is assessing the consistency between the LLM output and the reference. Now, prompting plays in there because of course you can tune your prompt to produce different generations, do grading across those generations, and determine the best-quality prompt; that's something we actually did quite a bit of.
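For flavor, one way to run that kind of LLM-graded comparison with LangChain's built-in QA eval chain, using GPT-4 as the grader; treat the keys and the default grading prompt as things you would likely tune:

    from langchain.chat_models import ChatOpenAI
    from langchain.evaluation.qa import QAEvalChain

    examples = [{"query": "Extract triples from: 'Simpson surrendered to police.'",
                 "answer": "(Simpson; surrendered to; police)"}]          # reference output
    predictions = [{"result": "(Homer Simpson; surrenders; police)"}]     # model generation

    grader = QAEvalChain.from_llm(ChatOpenAI(model="gpt-4", temperature=0))
    graded = grader.evaluate(examples, predictions)  # LLM-graded verdicts per example
    print(graded)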
So is that enough on evals, or do you want to go a little deeper? I think that's really good coverage of evals. I have heard some other people talk quite a bit about using LLMs to evaluate other LLM outputs, and it looks like the person who asked the question is also pretty satisfied with that response. So the next question is about the use of data labeling tools for scoring and judging outputs.
Any thoughts and comments? Okay, so that's a great topic. The hardest part, and folks who have done ML appreciate this, the hardest part of an ML project is the data acquisition; that's always the hardest part. And what's very interesting, actually, is that with the notebooks below, fine-tuning your LLM is pretty trivial. You can do it on a single A100, or even a T4 GPU, in about 15 minutes, depending on the number of instructions.
On an A100, I was able to fine-tune with about 1,500 instructions in less than five minutes for this task. Okay, so the fine-tuning part is trivial. It's the data and the evaluation that are annoying and hard. I mean, look, getting high-quality ground truth is always of paramount importance for this. We used public data, actually, in the end.
So there are open source triple datasets that we used, to keep things simple, in part because we found, if we go up here, that getting an LLM to produce triples is not good enough, basically. And this is where data labeling could have come in: if we really cared and this wasn't just a demo, we probably would have gotten some labels, and there's a lot of labeling tooling. I've done a lot of work with Scale in the past. We probably would have gotten human annotation to supplement synthetic data.
There are maybe two different threads here. One is that getting high-quality human annotations is certainly a great idea for any project. Synthetic labels are a very interesting topic area; we could go down that wormhole if you want. I've seen variable results; it can work very well. In this case we tried it,
and it was actually quite hard to get it to match the exact output format of our true labels. There are some nuances with synthetic data, but I think it is the case that you should always consider how you're going to get high-quality annotations, and human labeling is a great option. Scale, I know, obviously has a lot of services, and frankly, I've worked with them in the past and they're very good. It's always the hardest part, especially when it comes to building a high-quality eval set.
Yeah, yeah, I've worked in ML and I totally agree. People ask me about this and I'm like, yeah, the most important thing is you've got to have good-quality data, right? Yeah. So the next question is: someone's using LLMs for splitting. They say they're letting the LLM split the text and create semantic boundaries, I believe is what this means.
What are your thoughts on using an LLM to do your splitting or chunking? That's interesting; I haven't come across that, and I'd be curious to know why. I can see the intuition: maybe you want to group similar text together. Now, there are a lot of ways to do this, so first I'll make a note.
We actually have, and I'll share it, it's actually linked here, a splitter that does intelligent splitting for code, where it will keep classes or functions together in the splitting process itself. So that's one example that's getting at what you're saying. Another is the context-aware splitting that retains the metadata. Even if you split, for example, this PDF, imagine you have a long abstract and you split it into three chunks; those chunks still carry the abstract tag with them.
So you can always metadata-filter for them in retrieval, and you can recover them. So actually I think this approach probably makes more sense than having an LLM split, but I would be open to hearing more of the rationale for why you'd want an LLM to split. I think if you can get very high-quality metadata, basically attribution on your chunks, with one of these types of tools, then you can always recover them with metadata filtering. Yeah.
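For reference, the code-aware splitter mentioned above looks roughly like this; the chunk size and file path are arbitrary placeholders:

    from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

    # Splits along language-aware boundaries (classes, functions) before falling back to characters.
    splitter = RecursiveCharacterTextSplitter.from_language(
        language=Language.PYTHON, chunk_size=500, chunk_overlap=0
    )
    code_docs = splitter.create_documents([open("my_module.py").read()])  # placeholder file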
So, I agree. Mason, if you want to drop a little more of your comments on why you're doing this in the chat, that would be awesome. I personally agree with this as well, from both a cost optimization angle and an architecture and structure angle. Our next question is: could you share more about what task and amount of data was used for the fine-tuning of GPT-3.5 and Llama 2, versus few-shot? And which type of fine-tuning was it: prompt tuning, other parameter-efficient fine-tuning, or end-to-end? Yeah, exactly.
So this is actually where I'll share these slides, and the blog posts and the Colabs are all there. But in short, here's the task. The task is triple extraction: you're given a chunk of text and you're extracting knowledge graph triples from it. That's the task definition. Now, there are a few open source datasets for this, of variable quality.
We ended up using two, CaRB and BenchIE, and it was basically 1,500 examples of sentence and triple pairs in our train set and a hundred in our test set. Now, in terms of fine-tuning, we used the Llama 2 7B chat model. We did this in Colab on a single A100 GPU, using parameter-efficient fine-tuning with QLoRA. So that means the model weights are all four-bit, which is great.
You can fit it right in memory, especially on an A100; even on a T4 that'll work. And you're actually only saving the weights for a small fraction, roughly 1% I believe, of the parameters, basically the adapters that are injected into the model itself. Those are the only things you're actually saving; all the computation is done in FP16, and those LoRA weights are also saved in FP16. You rebuild the model at the end, save it all in FP16, and that's what you run inference and evaluation on. That's really all that's going on.
So again, you have a 7-billion-parameter Llama 2 chat model. You're using four-bit quantization to load it into memory, so the memory footprint is on the order of less than 10 GB, which fits easily in GPU VRAM. You're doing parameter-efficient fine-tuning, fine-tuning a small fraction of the parameters, maybe roughly 1%, saving those in FP16, rebuilding the full model in FP16, and saving that. That's all that's going on.
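A heavily condensed sketch of that setup with Hugging Face transformers and peft; the hyperparameters are illustrative, not the exact ones from the notebook:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    # Load the 7B chat model in 4-bit (QLoRA-style) so it fits in a single GPU's memory.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf", quantization_config=bnb_config, device_map="auto"
    )

    # Train only small LoRA adapters (roughly 1% of the parameters), kept in FP16.
    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()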
It's fast, it's not that expensive, and it definitely improved performance. If you want to do better, there are a ton of ideas to make it better; this was just a quick demo. But it is cool that this stuff works in Colab.
It's very accessible, and you don't need that many examples; I think it's really about example quality. Now, I did all that work, and a colleague of mine did the OpenAI fine-tuning, which just came out two days ago.
He used a smaller number of instructions for that particular one; I think it was only about 500 or so. I should go back and check; that literally got added yesterday at the very last minute, so I didn't go through that work very carefully, but we have a separate notebook for it as well. And it seems quite out-of-the-box.
So in that case, I don't think you need to worry about any of the parameter-efficient stuff. As far as I understand, and we can look at the notebook together right now if we want, you just send the data over to OpenAI; they do the fine-tuning on their end, and presumably you get a handle to a model checkpoint back, or something like that. I think that's even easier. So in short, the Llama fine-tuning Colab works really well.
With an A100 you can also do a 13-billion-parameter model. And the GPT-3.5 fine-tuning, I mean, my colleague turned that around in about an hour, so I think that's quite trivial. Wow. Yeah, so these are both really good options to consider.
But again, I would highlight our conclusions: always look at simpler methods first. Always look at RAG first, look at few-shot prompting; they'll often get you what you need. When you really do need to fine-tune, don't fine-tune for factual retrieval. Fine-tune for form: things like extraction of triples, summarization, or a particular style of speaking, kind of like style transfer.
Those are all good fine-tuning tasks. But yeah, happy to talk about them more; it's a great topic. Yeah. Thanks.
That was a very comprehensive answer. Okay, what about policy enforcement or guardrails on LLMs? Do you have any thoughts or comments about that? Yeah, that's a good topic area. I know the Guardrails library is popular, and I think it's interesting to consider.
I have not worked with it personally much at all, but it seems compelling. The one thing I'll note, on the topic of hallucinations: if you go back to the retrieval-augmented generation piece, when you structure these prompts you tell the LLM, "use the following pieces of context to answer the question; if you don't know the answer, just say you don't know, don't try to make something up." My experience is that this is pretty good at managing hallucinations, though I certainly would not argue that it's foolproof.
I think it may be wise to consider guardrails on top of this. But frankly, building out evaluations, using LangSmith or otherwise, and running them carefully on your application to understand where it goes out of bounds is probably the thing I would do first. And then I'd think about, okay, if that can't be addressed with prompting, then maybe I want some additional guardrails on top of my application. That's probably what I'd say about it. My sense is that with RAG, you actually already prompt the model to limit hallucinations quite a bit.
Evaluation may tell you where it's leaking out, and then you can address that accordingly, maybe with guardrails. Yeah, yeah, with RAG what we typically do is set the temperature to zero on the LLM and basically only use it to structure the results into a human-readable form. That's right. Yeah. Okay.
So someone asks: is there no-code functionality, or will I need to code everything? I bet this is asking whether there's a no-code way to use LangChain in general. I've actually seen some open source projects that do provide that, but I can't think of them specifically right now. I would hunt around; I'm not sure that's very mature, to be honest with you.
But the thing I will say is there's low-code. For example, if you want to do RAG, we do have some options; you can see the documentation. If you want to do this in two or three lines, you can use this. Frankly, LangChain gets critiqued a lot for being too abstracted, which I completely understand, and that's why I presented it this way: there are many different ways to do things in LangChain, and if you want to be as close to the metal as possible, so to speak, you can use something very lightweight like LLMChain only and then customize everything yourself.
You could only use LangChain for logging the generations in LangSmith, or something like that; look, I completely appreciate that, and frankly, in production I might advocate something like that. But we do have abstractions that would allow you to do certain things like RAG in two or three lines of code. So low-code, not quite no-code, but I would look around; we don't offer anything like that ourselves. The LangSmith side is pretty low- or no-code.
With LangSmith, you literally set an environment variable and everything gets logged to your LangSmith project, and you can browse it; it's all UI based. So LangSmith is actually pretty low-code, which is nice. All of this tracing and so forth you get for free: just set an environment variable in your project, it all gets logged to LangSmith, and you can open up any trace and play around. So it's quite nice.
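Concretely, the tracing setup is roughly just environment variables (names per the LangSmith docs; the API key and project name are placeholders):

    import os

    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
    os.environ["LANGCHAIN_API_KEY"] = "<your-api-key>"
    os.environ["LANGCHAIN_PROJECT"] = "my-project"  # traces get logged under this project
    # Any LangChain chain or agent run after this point is traced automatically.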
Okay. Yeah. So I think we've got time for maybe one more question, and it doesn't look like there are any questions in the Q&A, so...
Oh, someone wants to ask about your thoughts on using LangFlow or Flowise. Have you heard of these tools? Yeah, I think those may be some of the low- or no-code options. In short, I haven't played with them at all. Tell me what they are, maybe, and then I can respond; I don't know what they are.
I've never heard of these, so that's why this question is a little bit confusing to me. I've heard the names, but I've not personally played with them. If the questioner wants to provide more context, I'm happy to comment on it, but I'll look it up right now: LangFlow. Okay.
Yeah, it looks like it's quite popular, actually: LangFlow, a UI for LangChain. Yeah, that's what it says. It looks kind of cool: an effortless way to experiment with and prototype LangChain pipelines, with drag-and-drop functionality, and it looks like it has a lot of adoption on GitHub. It looks interesting for sure; I haven't played with it personally, but it seems like a nice option depending on your needs and aptitude for coding. It looks pretty cool, actually.
Yeah. Okay, then I think we're probably out of time for any more questions. Emily, do you want to take over and wrap things up? Sure. I just want to thank everyone who spent some time with us today for this session, and a special thanks to Lance from the LangChain team for sharing so much great information and walking us through all these different concepts.
Yujian, thank you as always for co-hosting, and we hope to see everyone at a future Zilliz webinar. Have a good one. All right, thanks. All right, bye.