Safeguarding Data Integrity: On-Prem RAG Deployment
Webinar
What will you learn?
Are you working with LLMs and care about data privacy and security? If so, you need to deploy retrieval-augmented generation (RAG) applications on-prem, or use something that splits your data and control planes.
In this webinar, we’re going to introduce both solutions, and dive into how you can deploy RAG applications on-prem using open source tools such as LLMWare and Milvus.
Topics covered:
- Understand the different concerns with deploying on cloud vs. on-prem
- Learn how you can use local embedding and generative models
- See how LLMWare can help you create a RAG application in your own framework with Milvus
Today I am pleased to introduce today's session, Safeguarding Data Integrity: On-Prem RAG Deployment, and our guest speaker, Darren Oberst. He is the CEO of AI Bloks, an innovative AI platform revolutionizing the landscape of LLM-based application development for generative AI in the financial services and legal industries. Prior to AI Bloks, Darren served as the CEO of Exadel and launched and grew HCL Software to over $1 billion in revenue in five years. Darren is currently focused on building enterprise LLM-based applications, which includes retrieval-augmented generation, open source LLM middleware (also called LLMWare), and fine-tuning specialized enterprise LLM models for open source. Darren is a graduate of UC Berkeley with degrees in physics and philosophy, and Harvard Law School with honors.
Welcome, Darren Oberst. Wonderful. Thank you, Eugene. Thank you so much, and thanks everybody for joining today. We've got a few slides that we're gonna go through, and then we're gonna spend hopefully at least half the session in some live demos. As Eugene mentioned, we want it to be really interactive, so post any questions that you have, and we'll certainly stop and make sure we spend more time on any areas that are of interest to everyone on the call.
I'm really excited about this. We've been working with Milvus, with the actual technology, for the last 18 months. We've had just a fantastic experience with it, and it's really central to the tech stack that we're putting together in terms of a private cloud RAG system. So we're gonna try to illustrate that, both bringing to life some of the overall capabilities, but then really, I think, highlighting to everyone the central role that Milvus plays as part of that architecture. So let's go ahead and dive in.
So really the goal of today is to get to the demo and look at this thing, a live running private cloud end-to-end RAG system. To motivate that, there's a handful of slides that we've put together. We're gonna give a quick intro to AI Bloks. We're gonna keep the commercial to a minimum, but we do wanna just help set some context, some motivation about who we are and some of the capabilities. All of these are in open source, so they're all things that you can go and check out for yourself.
We wanna set up some motivation around private cloud RAG. What does this mean? Why would somebody want to do that? Why is this an interesting direction? And then we really wanna spend a few minutes laying out a conceptual architecture of how you bring all of these pieces together and how you can actually do that in a single private cloud instance. And then the demo scenarios: I'm gonna try to get to four different demo scenarios. We're gonna run through first two scripts that are really just to show you a quick hello-world type of illustration of document parsing, text chunking, embedding, and some basic semantic querying, all using Milvus and wrapped within the LLMWare library.
So there are two examples of that. They're actually available as examples in our LLMWare GitHub repository, so you can go check 'em out and run them for yourself. But we're gonna run through those two examples fairly quickly. Then we're gonna get to a third example, which is actually a RAG scenario that builds off those first two scenarios, in which case we're gonna bring in a 1 billion parameter LLM that we've trained and fine-tuned for RAG.
And then finally, if we have some time, I'm actually gonna pop open sort of a popup web app, which is a derivation of the multi-tenant, multi-threaded SaaS site that we've been running. On that site we actually use Milvus; we've massively parallelized it in a Kubernetes cluster, and again, we've had some really great results in terms of scalability. So just to maybe help illustrate some of the retrieval capabilities, we wanted to show you a quick UI just so you can see it, because sometimes it's a little easier than looking at a console screen. So, a lot to get to.
So let's start with just a little bit of motivation. Why RAG in private cloud? In fact, in many ways this would seem to fly in the face of so much of the marketing and the attention that you hear in generative AI today, where just about everything is public cloud API services. So why would anybody be talking about doing this in private cloud? Well, the number one reason is data privacy of sensitive business documents. And this is a dialogue that we have all the time, almost every single day, and I'm sure many of you do as well. People will start with a pilot, a proof of concept, a validation: let me impress my boss with what I've got.
Typically starting with OpenAI, starting with Pinecone, starting with a public cloud deployment, because it's a very fast way to get started. But in many industries and in many enterprises, as soon as it gets real, as soon as the question becomes how we're actually gonna start building a system that connects an LLM to our private knowledge bases, issues around data privacy, data governance, and the sensitivity of moving those documents outside the security zone of the enterprise become paramount. So one story I'll just tell, and I think we all have a bunch of stories like this: we were on a panel a few weeks ago, and a very senior bank executive was really extolling all the virtues of what ChatGPT could do, what GPT-4 could do, the revolution that it was leading. And so we grabbed this executive after the session and said, you know what? Give us a sense, what's going on inside your institution? Are you guys rolling some of this stuff out? And the feedback was, well, inside our financial institution, nothing is leaving our security zone, period. So any kind of real rollout of an LLM-based technology has gotta be some form of a private cloud. And again, there are a lot of ways to skin that cat, so to speak, but it would have to be something that would be secure.
There's no way that this financial institution would be sending key business documents, contracts, regulatory information, training collateral, HR information, customer-related information, PII-related data. There's no way that that's ever gonna be going out over a public cloud. The second main reason that we hear, and I think this has been really one of the great stories of 2023, is that the open source innovation curve has been absolutely spectacular. It's really one of the most revolutionary things I can remember in my lifetime working in the software industry. When you look at where open source was a year ago versus today, it's just an unbelievable curve of the types of technologies, the types of foundational models that are being rolled out and deployed in open source practically on a monthly basis.
There is technology that continues to improve and improve and improve. And I think what we hear from more and more customers is that they want to be able to build off of that innovation curve. They don't want to get locked into proprietary technology. They also see the potential with open source that they can customize it, that they can make it their own. And as everybody starts to digest what LLMs mean, what generative AI means, we see more and more companies saying: we don't wanna fully outsource this.
In fact, we see this as something that will be integral to our future, that will be part of our differentiation, part of the capabilities that we wanna offer our customers and our employees. We want it to be part of our DNA, and what open source models give us the ability to do is to start to customize that. And so the AI becomes not somebody else's AI; it becomes an AI that is unique and proprietary to our business. The third big reason is cost. Now, there are a lot of spectacular use cases for LLMs and for generative AI in the enterprise that don't make sense when you're thinking, every single time I call this API, I need to pay for it.
It also doesn't make sense when you start thinking about a model that's a hundred billion plus parameters. The cost of the models we're gonna show today, which are in the 1 billion to 7 billion range, on an inference basis is probably a hundred times cheaper, in fact maybe several hundred times cheaper, to actually deploy. So a whole set of use cases starts to become practical, achievable today, once you start thinking about smaller models that can be deployed in a private cloud. Hey, I just wanna pause here for a second.
Do you have links to some of the research that you guys have done or seen around this? 'Cause that's actually really cool, that the cost is just so much lower. I would love to be able to share that. Totally. I will get to it in a couple of slides. I'll actually point you to it; you sort of teed me up perfectly.
I'll actually show you our GitHub repository. I'll show you our landing page on Hugging Face, which are really the places where we have the models, some benchmarks about the models, as well as a whole series of articles and videos that we've posted around these topics. Perfect. Okay. Next, and I think it's gonna be one of the themes that I really wanna emphasize as we look at the demo today.
RAG: retrieval-augmented generation. There's retrieval and there's generation. And when you really think about building a RAG system, we think there's so much focus on the generative side and not enough on the retrieval side. If you get the retrieval right and you have a good retrieval strategy, in terms of how you're indexing and text chunking, applying the right type of embedding model and putting it into a vector database, if you have the right retrieval mechanisms, that can carry a lot of the weight. And conversely, if you don't have the right retrieval, all the generative AI in the world isn't necessarily gonna give you a great fact-based RAG system.
So again, stories like that, I hear people saying, well, maybe GPT-3.5 isn't enough for RAG. When we're looking to ask questions of documents, maybe we need GPT-4, or GPT-4.5, or something that's gonna come in the future. I think in most cases, in our experience, it's not the generative AI
that's the problem. It's in getting both the retrieval strategy right and the packaging of that context information. If you get that part right, oftentimes you can get the same results, or results that are almost as good, with models that are much, much smaller. But again, we're gonna show you some of that in the examples today. And then finally, we really believe that bringing RAG into the enterprise shouldn't be about bringing the enterprise out to the AI.
It really needs to be the reverse. The AI needs to get infused into enterprise processes, into enterprise workflows, and into enterprise knowledge. When you start thinking about doing that in private cloud, the totality of this, it starts making it much easier to integrate AI into the way that businesses actually operate day to day. All right, what is Dragon? Now, for Dragon we have some cool videos out there with lots of pictures and stuff, so I'd encourage you to check those out. But what is Dragon? Dragon is a set of models that we rolled out.
We just launched them, I guess, about three weeks ago now. All of these are models that are posted in our repository on Hugging Face. Dragon actually stands for, in addition to being a cool name, "delivering RAG on." It's a model series that we've launched, and we've built seven models. The seven models are a RAG fine-tuning, and I'll explain what that means in a minute, of seven leading 6 and 7 billion parameter foundation models. Now, what we've actually done over the course of the last couple of years is build up a pretty large proprietary, self-curated, self-developed, bespoke dataset on contracts, regulatory documents, and complex financial news and information, with fact-based question answering and a set of very specific tasks and skills.
And so we've built up this dataset, and then we've gone through and fine-tuned all of these models. And then what we've provided is a set in our repository. And again, to the question that was asked, maybe we'll flip over to that screen and I can show you where the models are. We've provided really easy-to-get-started generation scripts.
We've done some really unique things in terms of LLMWare giving these models really first-tier support. One of the things that we've seen with a lot of open source models is that the problem sometimes isn't the model itself. It's that a lot of the surrounding wrapper code, the LangChain or LlamaIndex or LLMWare code that you'd be looking to use, doesn't have a lot of the built-in bells and whistles to really make that model shine. And so sometimes the amount of work that you have to do to get the model fully integrated into the workflow is just so much more than it is with an OpenAI or with an Anthropic. And then finally, we've benchmarked all of these, and we haven't benchmarked them with things like MMLU and ARC and HellaSwag.
Again, if you're familiar with the Hugging Face LLM leaderboard, there's a whole bunch of scientific metrics out there. But the key question we always get asked is, well, does this work? Is it going to give me a level of accuracy that I can count on for the specific workflow that I have? It doesn't have to be perfect. It doesn't have to be a hundred percent, but I need to have some baseline assessment: is it gonna be 90% accurate, is it gonna be 98% accurate? And where are the situations where it's likely to be successful? Where are the areas where it's not? So we built what we would call a common-sense RAG instruct benchmark, and then we've run these tests on every single one of these models and published all the results. So if you pick up one of the models, you've got a pretty good sense of the kinds of use cases and the kinds of accuracy that you're likely to get. But most importantly, if you're really thinking about building a production-grade system, if you're gonna go do something really meaningful in terms of RAG: a llama?
Man, I don't know if that's what you want. Red Pajama? I don't even know what that is. A falcon is sort of cool, but ultimately what you need is a dragon. Dragon is fire; dragon is the ultimate cheat code. Anybody that's seen Game of Thrones knows those dragons are ultimately the baddest thing out there.
And so we thought it was sort of a cool way to brand it. Our real aspiration is to build the best private cloud, 7-billion-and-under parameter models, putting them all out in open source so that people can build really high-quality, state-of-the-art RAG systems at lower cost, all in open source and all on private cloud. Okay, so this is a good segue into one of the questions that was asked. When you talk about private versus public cloud, can you clarify what the distinction is there? If someone is using Milvus on their private cloud, but OpenAI as their LLM, is that API request considered using a public cloud? And how so? So at any point, if information is going outside the security zone of the enterprise, then at least in the context of this conversation, we're considering that, quote, a public cloud use case.
And again, imagine, and this is one of the examples we're gonna take a look at, let's say you take some contracts and you parse and text-chunk and embed those contracts and key contract provisions; you're packaging that in a context and you're sending it out to an OpenAI. Now, in reality, in most use cases, it's going to be relatively safe. Mm-hmm. But that is a public cloud use case. And so the situation that we're gonna be talking about is everything running in a private cloud context, including the LLM model.
So what I'm actually gonna show you when we get to the demo is literally everything, from the LLM model to the embedding model to Milvus, all running on a single server, with the workflow running on that server, on a private cloud instance. Cool. Okay. Very cool. Thank you.
Any other questions? That was the only one that I felt was good to touch on at that moment. I think we'll let you continue. Cool. Cool. So I've mentioned Dragon; Dragon is a series of what we believe are production-grade models.
They run on a single GPU instance, and again, in Q&A and through the course of discussion, we'll talk a little bit more about that. But what we've also built, where we saw a huge gap, is CPU inference models. A very common scenario, going back to the question: when you're actually ready to roll out that application in production, you may be fine actually calling on a public cloud model, depending on the sensitivity of the data and on all sorts of review that may happen before that production application is rolled out. But for testing and rapid prototyping, I'm sure many of you that work with clients or in consulting-type roles know that clients have a lot of wild and crazy ideas.
Hey, does this work? Hey, for this type of scenario with this type of document, am I gonna be able to get results? Is this gonna be able to automate a process for me? Is this a good use case? So what we wanted to create with our BLING models, which are the "best little instruct no-GPU-required" models, are instruct-trained, RAG fine-tuned question-answering models that can run inference on a laptop. We're actually gonna show one of these in the demo; they're 1 billion to 3 billion parameters. And again, all of them have been benchmarked and scored. And one of the things we believe is that you can run some quick tests and quick prototyping with a BLING model, and then as you move into production, you don't necessarily have to go to that OpenAI model.
You could go to a private cloud Dragon model with the same workflow. All you'd be doing is simply changing the name of the model. And then finally, the last model family that we've rolled out, which we're gonna see in the demo, is Industry-BERT. These are just sentence transformer models that we fine-tuned. The example we're actually gonna be looking at is an Industry-BERT for contracts.
And we did exactly that. We fine-tuned the model with thousands of contracts, so it's a little bit more finely tuned to some of the terminology and language around contracts. Again, we're gonna show you an example to help bring that to life. So those are the model families.
They're all available on Hugging Face. To the question about where to get this, I might just pause here for a second and show you our repository, if I can. Oops, I'm sorry about that. Okay. So this is just one example of a model.
Hopefully everybody can see Hugging Face. If you want to go look at the LLMWare repository on Hugging Face, it's just right there. This is one of the model cards; this actually is one of the models that we've loaded. Deci is one of our partners.
We did another webinar with them, and we always like to promote our partners. So this is a 6 billion parameter model that we RAG fine-tuned. What you see in the model card is all the information about the benchmark tests and the accuracy, and what we include with the repository are both the test results, that's these answer sheets, as well as a sample script. So a great way to do a hello world to get started with the model is to run these generation scripts; it'll run the test for you.
So there's a lot more here if people have questions or are interested. But all the things that I just went through, you can find on our LLMWare organization page. These are some of our blogs that talk about some of the benefits of these smaller models and all the different considerations and trade-offs. And then you can see the collections for the three model families that I referenced. Okay, so coming back now to the presentation: what is LLMWare? Again, I don't wanna give you a big commercial about it. This is a library that we rolled out.
Again, you find us on GitHub, just at llmware. And what it is, to put it simply, is a LangChain- or LlamaIndex-like toolkit, completely in open source, but anchored around these things that we've just been talking about. It's really designed for enterprise workflows and massively scalable document ingestion, and again, we'll highlight that in the architecture. So we've built our own C-based document parsers from the ground up, full implementations of the specs for PDF, Word documents, PowerPoints, and Excel.
And the idea is, if you really wanna roll this out in an enterprise, you're looking at a problem of thousands, tens of thousands, hundreds of thousands of documents; you need to be able to parallelize that and distribute it across multiple workers. So the first thing that we've really focused on is massively scalable document ingestion. We focus on an end-to-end data model with persistent data stores. So Milvus comes in, in any use case around LLMWare, and we also use MongoDB, integrated into our parsers, as a text chunker and a text collection index.
And then our real priority is in building out a framework for building LLM-based applications with open source models, not as an afterthought, but really at the foundation. So we're constantly building out new features and capabilities to support a wide range of open source and Hugging Face models, and really leading with the models that we've just discussed. Super easy to get started. And again, there are a bunch of examples, some of which we're gonna highlight through the discussion today.
All right, now that we've sort of set up all the different pieces, I wanna bring it all together. So I wanna show a conceptual architecture of what a private cloud, an end-to-end true private cloud, where everything is in the box, might look like. And then we're gonna flip over and actually start running some demos on it. Okay? So where it starts conceptually is ingestion.
As I said, as part of LLMWare, we package our own parsers. They're really high speed; you're gonna see an example of that. They're also very, very high quality in terms of the richness of the metadata that they gather from the document and the consistency of that metadata, which we're able to use pretty extensively as you move down the data pipeline in terms of retrieval. And then ultimately, after you've done some sort of LLM call, you're able to pull back on that metadata to identify where the source was and specific information about that source. As I mentioned, the parser is fully integrated with a MongoDB text collection: we extract a chunk of text and we populate that database.
And again, you're gonna see that in the example. The Milvus vector database is really at the core of this conceptual architecture. In this demo scenario, we're gonna use small, specialized, open source models that we fine-tuned for this purpose, the Industry-BERT models, and then we're gonna use the Dragon series, and a BLING model as well, for the LLM. And so the flow of this: as we're running the embeddings, all of the information is text-chunked.
It runs through the Industry-BERT embedding model, and then all of those vectors are stored in the Milvus database. From a conceptual point of view, everything ultimately comes down to either a query or a prompt, which are the two main classes that you would interface with within LLMWare. You're either gonna be querying this knowledge infrastructure that's been created, or you're gonna be running some type of prompt, calling and invoking an LLM, and usually some type of interaction between the two. And then finally, what we're gonna be doing in this demo is a little bit unique: we've actually put this whole infrastructure, all of which you'll note is open source tooling, on an NVIDIA A10 chip, which has 24 gigabytes of GPU RAM.
In this case, we're gonna be using an AWS AMI; it's an EC2 instance, a g5.4xlarge, running standard Linux. It comes packaged right out of the box with the NVIDIA A10 chip, and the only things we've loaded onto it are the LLMWare library and some of the sample code. So that is the setup.
I'll just pause here for a second to see if there are any questions about the architecture before we flip over and start looking at the demo. Okay. Here's a question: what embedding model is used for LLMWare, and do you have any benchmarks on data ingestion speed? So you're gonna see it.
We're gonna use two embedding models. We're gonna use one out-of-the-box sentence transformer, and you'll see it; I think the performance is pretty fast. And then we're gonna use an Industry-BERT model, which is basically a standard BERT, 110 million parameters, that we fine-tuned on contracts. And again, you'll actually see it running some embeddings in real time. We have not done benchmarking of embedding performance.
A lot of that is gonna depend on the underlying hardware infrastructure. So if you throw enough hardware at it and you can parallelize embedding jobs, you can do them spectacularly fast. What we're gonna see here is it running on a single server, and you'll see the times and you can kind of eyeball how fast it is. But yeah, you'll actually see it. Cool.
I also have a question. Why use two embedding models? So I'm gonna do it here really just as an illustration, and this is maybe setting up the demo a little more. We're gonna look at two different kinds of source documents. One is a set of UN resolutions, United Nations resolutions.
It's about two years of those resolutions. Each resolution is maybe three or four pages, up to 15 or 20 pages, I guess depending on how much they had to discuss that week. And there are all sorts of things; it's whatever the United Nations resolved. It could be about an issue of war and peace.
It could be an issue about the environment or social justice. It could be any number of things that the United Nations gets involved in. They're 500 PDF documents. It's a fairly general-purpose scope, so we're gonna use a general-purpose, small, fast embedding model, 'cause we get good results with it.
The second one that we're gonna use is contracts. We're gonna pull down about 80 contracts, and for that one, we're gonna use a model that we fine-tuned for contracts and legal documents. We have found that fine-tuning to the industry domain, where the industry domain does have some unique dimensions to it, does yield benefit overall in terms of the accuracy and quality of the retrieval. And again, we'll try to highlight that in one of the examples.
So why use different embedding models? Because you're gonna be looking at, in effect, libraries or collections from different domains. And where you can fine-tune the embedding model for that domain, you can usually get better results. Okay. Very cool. Thank you.
All right, you guys ready? Ready. All right, so let's flip over to the demo. I'll tell you, I have this set up on two identical machines, so just in case anything goes wrong on the first one, I do have a fallback that we can take a look at. So this is the machine.
This is just the EC2 instance that I mentioned; this is an AWS Linux server. I spared you the launch time and the EBS getting warmed up and all that stuff. What I wanted to do is walk through these three Python files, and I'll quickly open one up, at least, so that everybody can get familiar with the recipe, I guess, of the code. The first example we're gonna look at is this UN resolutions one.
And again, just so everybody's clear about what we're gonna be showing, all of this is that conceptual architecture loaded on this single AWS instance. We have Milvus, we have Mongo, we have all the embedding models, and we have all the LLM models. What we're gonna do is a one-time pull. So there are some things that are gonna get pulled into this, but once the information is pulled in, actually nothing is gonna leave. So there's no API call happening behind the scenes.
Everything that you're gonna see is actually running on this machine. The second thing I wanted to say: this example is posted on our GitHub page, in our repository. As Eugene mentioned, we're participating in a kind of 25-day open source community hackathon with Milvus. These are a couple of sample scripts that you can get started with in a few minutes and do some really cool things with LLMWare and with Milvus. So just so everybody gets a little comfortable with at least the code logic, I'm gonna run through this quickly, but hopefully it'll at least help everybody understand what we're doing first.
As I said, in this case, the embedding model we're gonna use is a standard, off-the-shelf, open source sentence transformer model. It is a MiniLM. Sometimes people think you need bigger and bigger and bigger. We actually have found this MiniLM sentence transformer is a go-to model for us in a lot of use cases because it's small, it's fast, and it's pretty good. So for a general proof of concept, we find it's not a bad place to start.
And if you think about it also, and again, Milvus are the experts in this, but if you're gonna build a really, really large collection with a really large set of embeddings, having a slightly smaller model with slightly smaller dimensionality actually has a lot of really practical benefits in terms of the overall longer-term administration and management of that system. So anyway, we're gonna use MiniLM, and Milvus obviously is gonna be our database. The starting point is you just create a new library. This sets up the construct within LLMWare. We're then gonna pull down these sample files that we've put in a public bucket.
So if you run pip install llmware and you run this setup, it's actually gonna pull all these sample files into a local folder structure for you. Now, this next step is the parsing step. Once we've downloaded all of these files locally, the parsers actually work with a really simple interface: it's just adding files to a library. What that one line actually does is route the individual documents, based on their file extensions, to the underlying parser.
It parses it, text-chunks it, extracts all the metadata, the tables, and the images, and puts it into a text collection. So you're gonna see this step unfolding across these 500 documents. And then there's also a really simple interface to install new embeddings. We're gonna apply it to that model, and we're gonna put it in Milvus.
The Status Manager is really useful in very large applications; you can poll it to get some real-time visibility on the progress of a large embedding or parsing job. And then once we've done that, we're just gonna run a simple test query here. Obviously the real fun in all this is to go do the semantic search and start extracting all sorts of things and doing really cool retrieval. I'm just gonna do one simple example to illustrate it, which is a sample query on sustainability issues impacting women. We think, again, that's a good conceptual type of query, the kind where oftentimes an exact search fails or doesn't give you the right results, because there are a lot of concepts around things like sustainability issues.
And then we're gonna print it out to the screen. And that's it. So let me go ahead and run it; we've got three examples we wanna get through. Hopefully the server didn't go to sleep or anything like that. All right, here we go.
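For readers following along, here is a minimal sketch of the recipe just described, written against the llmware Python library. The model name, sample-folder name, and result field names below ("mini-lm-sbert", "UN-Resolutions-500", "file_source", etc.) are assumptions based on the public llmware examples and may differ slightly from the exact script shown in the demo.

```python
# Minimal sketch of the UN-resolutions hello-world flow described above.
# Assumes: pip install llmware, with Milvus and MongoDB reachable locally.
# Model and folder names are assumptions based on the public llmware examples.

from llmware.library import Library
from llmware.retrieval import Query
from llmware.setup import Setup

# 1. Create a new library - the organizing construct in LLMWare
library = Library().create_new_library("un_resolutions_demo")

# 2. Pull down the public sample files and point at the UN resolutions folder
sample_files_path = Setup().load_sample_files()
library.add_files(input_folder_path=f"{sample_files_path}/UN-Resolutions-500")

# 3. Build embeddings: text chunks -> sentence-transformer vectors -> Milvus
library.install_new_embedding(embedding_model_name="mini-lm-sbert",
                              vector_db="milvus",
                              batch_size=500)

# 4. Run a simple semantic query against the vector index
results = Query(library).semantic_query("sustainability issues impacting women",
                                        result_count=20)

for r in results:
    # Each result carries the source document, page number, distance, and text
    print(r["file_source"], r["page_num"], r["distance"], r["text"][:125])
```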
So we are parsing, and the parsing status manager is printing out this update every 10 documents. It's just pushing it to the console as well as to a database. And so you can see these 500 PDF documents, again several pages average length, up to probably 15 or 20 pages; we've parsed 200 of these documents, all in real time, with all of that information running locally on this server.
So we're gonna keep cranking through those. Once we're done with all the parsing, then we're gonna start the embedding. And what's happening in the parsing is that all 500 of the documents are being parsed and also packaged as little text chunks. Each individual text chunk is ultimately what's gonna get embedded.
So we're done: 50 seconds, about 10 documents per second, and now we're off running the embeddings, passing the embeddings 500 text chunks at a time. You can see there are about 12,000 total chunks, and we move through them really fast. Now, 12,000 is still a pretty small library, but hopefully this gives you a sense, all of this running locally on a single machine. Look how fast we just parsed 500 documents; we embedded them and we've just run our first query.
Wow. This sample code and this sample script are in our repository. You can bring it down, follow the instructions, and in a few minutes you can start to do some pretty interesting embedding work. We have a question from the audience: are you pulling the PDF files from S3? Yes, yes.
So the PDF files are actually just something we provide; it's just a public S3 bucket where we've put all those sample documents. Of course, in any type of custom use case, those could be in a private S3 bucket that a company could be pulling from. But in this case, it's just from a public samples file that we make available; that's what was pulled down. Very cool.
Okay, now, just a sample retrieval; again, we're gonna look at some more. The primary focus so far has probably been a little more technical in flavor, like how cool it is that we're parsing so fast, or text chunking, or embedding. But I did want to at least show some of the results of this, which again is a nice kind of conceptual query.
And then what we pulled up with just a single line, running a semantic query against the library: we have the document, we have the page number, and then we have the distance. One of the things we find really interesting is the geometry of the embedding space, and using some of these distances to start doing really interesting things, where you can start defining topic classification, or defining thresholds of what type of information, or how much information, to look at based on an embedding distance threshold. And then you can see, and we chop the text at just a short number of characters just so everybody can quickly read it.
But you can see it's done a pretty nice job with women, with issues that are about sustainability: globalization and interdependence, climate change. You can see many of the topics around women in development, vulnerable countries, greater gender equality in the distribution of economic resources. It's done a pretty nice job of getting at some of the semantic meaning to pull forward a lot of the interesting context around the query that we gave.
Now, I'm gonna show another example that's gonna be fairly similar. It's gonna be a fairly similar type of recipe, but the domain is gonna be different, so I'm not gonna show the code, but certainly if anybody wants to, we can take a step back and go through it. This is gonna do something fairly similar, but this time, instead of looking at UN resolutions, we're gonna do the exact same flow, except we're gonna pull down 80 contracts. And for this one, again, you can see it's gonna parse those 80 agreements really fast.
For this one, though, instead of that MiniLM model, here we're gonna use that contract fine-tuned embedding model, 'cause again, what we found is that having a little bit more of a flavor of the key concepts and terminology around contracts actually gives better retrieval results. So we parsed and text-chunked those 80 documents in 23 seconds; it was decomposed into 7,643 text chunks. And now we're powering our way through the embeddings of those. Hopefully you can see the screen.
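In code terms, only a couple of lines change from the sketch after the first example. The folder and model names below ("AgreementsLarge", "industry-bert-contracts") are assumptions based on the LLMWare sample examples.

```python
# Same recipe as the earlier sketch - only the source folder and embedding model change.
# Continues from the earlier sketch (same imports and sample_files_path).
library = Library().create_new_library("contracts_demo")
library.add_files(input_folder_path=f"{sample_files_path}/AgreementsLarge")
library.install_new_embedding(embedding_model_name="industry-bert-contracts",
                              vector_db="milvus",
                              batch_size=500)
```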
Yes. We have another question from the audience: can you briefly discuss how to ensemble the models together in the code? I'm gonna show that in the next example, so I'll highlight it there, if that's okay. Perfect.
Perfect. So this example, again, is almost exactly like the UN resolutions one, but it gives you a very similar recipe. It hopefully helps show that this is a repeatable pattern: you ingest some documents, they get parsed, you pick your embedding model, you create your embeddings, and then you can start running some retrievals.
Now, here, and again this is probably a sort of lightweight hello-world kind of example, we just picked something like incentive compensation. You can see it did a pretty nice job of picking up words that would be very related to that; even though the word compensation wasn't reflected here, it did pick up all of these key text chunks around incentive and incentive bonus. Now, one of the reasons I wanted to show this comes back, I think, to the question that was asked, which is the third example that I wanna show. And the third example starts chaining, or bringing together, these models in a RAG scenario.
And it sort of builds on this contract example. So in this example, it's contracts, so we're gonna use our contract embedding model, same recipe boilerplate. We're gonna create a library, and we're gonna pull down these documents (they're already pulled down, so it won't pull them again).
We're gonna parse those documents, we're gonna install our embeddings, and then we're gonna do something a lot more interesting. Once we've created that embedding space, we're gonna load a 1 billion parameter BLING model. So we loaded the Industry-BERT, that's the embedding model; that's churning out vectors. You feed text into it, and it gives you a vector output. That vector output is then used to, in effect, identify that chunk of text in a vector space.
That's what we're putting into Milvus; that's what's gonna help us with all the retrieval. Now, in the next step, we're gonna load a different type of model: an LLM, a generative model, that takes in text, takes in a context passage and a question, and answers that question based on that passage. It's gonna yield text.
It's a mini, mini, mini GPT-like model, if you can picture that. And it's gonna be running entirely locally. I already downloaded it from Hugging Face, so it is on the machine. It will take probably 10 or 15 seconds to load into memory and onto the GPU.
So we're gonna take this model. The interface that we use, the concept, the abstraction within LLMWare, is that you create a prompt, 'cause a prompt ultimately is what you want to do with a model, regardless of what the model is. So now here we're gonna load this BLING model, and we have just a really simple query.
Again, not the most exciting query in the world; we're just gonna say, what is the executive's base annual salary? Each of these are employment contracts; they're 10 or 15 pages. You can picture this as a real project: someone comes in and says, could we get a list of what we offered all of these employees, to see whether it's comparable? Are there some differences between them? Are there some outliers? Typically that would be a manual exercise, especially if it's not 10 contracts but a hundred or a thousand. Here, we're gonna ask an LLM to do all of that work using the embedding space that we just created.
So we load that model and we're gonna run a semantic query; this goes back and queries the vector space. And then we're gonna loop, contract by contract by contract. For each contract, we're basically gonna grab the top semantic retrievals for that contract and package them into our prompt as a source.
What this one line of adding a source does is package and aggregate all of those query results into the prompt. And then we call the prompt. So we loaded the model, we loaded that context, and then all we're gonna do when we actually prompt the model is ask it our question, and then we print out the result. Okay? And again, all of this, all the orchestration of these models and the embedding space, is running locally. The data is never gonna leave the server that this is running on.
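As a rough sketch of the ensemble flow just described, the loop maps to something like the following, assuming the llmware Prompt and Query classes and a BLING model name like "llmware/bling-1b-0.1"; the exact method and field names are assumptions based on the public library and may differ from the demo script.

```python
# Rough sketch of the contract RAG flow described above, continuing from the
# contracts library built earlier. Model name and some method/field names are
# assumptions and may differ from the actual demo script.

from llmware.prompts import Prompt
from llmware.retrieval import Query

question = "What is the executive's base annual salary?"

# Load the small local generative model into the Prompt abstraction
prompter = Prompt().load_model("llmware/bling-1b-0.1")

# Semantic retrieval against the contract embedding space in Milvus
query_results = Query(library).semantic_query(question, result_count=50)

# Loop contract by contract, packaging the top retrievals for that contract
# as the source context for the prompt
contracts = sorted({r["file_source"] for r in query_results})
for contract in contracts:
    doc_results = [r for r in query_results if r["file_source"] == contract][:4]

    prompter.add_source_query_results(doc_results)   # aggregates results + metadata into the prompt
    responses = prompter.prompt_with_source(question)  # runs local inference with the packaged context

    for resp in responses:
        print(contract, "->", resp["llm_response"])

    prompter.clear_source_materials()                 # reset the source before the next contract
```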
Okay, so now let's run this example. Here there are only 10 contracts. You can see it moved through that really fast; it's creating the embeddings super, super fast. This is a simple example of only 10 documents.
And now here we are at that key step. We're now pulling that LLM model, the little 1 billion parameter, small but mighty BLING model, into memory, and then it's gonna crank through the inferences really, really fast. And then we'll take a second or two and actually look at the results that we've gotten from that. As this is running, I am gonna give you another question from the audience.
I don't know if it's discussed, but at the end of the walkthrough, I would love to learn more about building your own dataset from scratch with GPT-4 (pineapple emoji). Okay, cool, cool. Alright, so this ran: all the inferences ran, all the queries ran, it iterated through all of the contracts, and let's show what it actually did. Okay? So again, we took 10 contracts, we parsed them, we embedded them. We ran a semantic retrieval with a basic question: what is the executive's salary? And then for each contract, we looped through and took the top retrievals that we found; basically, what's the best information that you found in the Rhea executive employment contract? And it was these four passages.
You can see this one was, you know, 0.62 away. This was 0.64, 0.81.
That one line of packaging up the source packaged up all of this text, with all of the metadata identifying which source ultimately provided which piece of information, and this was passed to the LLM, the small but mighty BLING model. And here was the model's answer. So the model read this passage, and in this case found the answer that was sitting right in the middle of that retrieval, and gave us the correct answer. So while we were sitting here, in just a couple of seconds, it went through and did this for all of these contracts.
Now, I wanted to highlight specifically a 1 billion parameter model. Not to suggest that a 1 billion parameter model can do everything that GPT-4 can. Of course it can't. It will make mistakes; it will be wrong. It is not intended to be a replacement for those kinds of models.
But what this illustrates is that when you've got a great retrieval strategy, even a small model can do meaningful work for you. Now, in this case, the small model is running on a GPU, so it's crazy, crazy fast, but this small model can run on your Mac laptop right out of the box. No quantization, no special treatment. It just runs, and it runs inference at a reasonable speed, giving you pretty good quality on a relatively simple extractive key-value task.
And it's getting the correct answer. So I wanted to show this to you as, I think, a really good example. This is just a simple RAG use case, but it's a recipe that we think you can replicate and scale to do much more complicated things. We've now walked through three examples where you can see how to parse, how to create that embedding space, and how to run a query against it.
And now we've integrated in that prompt and that LLM model, so you can start doing end-to-end RAG. We have a couple of other videos out on YouTube; I'd encourage you to check them out. They walk through this in more detail, walk through some of the fact-checking capabilities, and walk through some of the ways that all of this output is automatically produced as JSONL files, so you can push it into an upstream system, as well as a CSV file if you wanna push it to a person to say, hey, this is the output that I got.
Does this look accurate? And a person can then quickly review it and be the final pass on it. So we've gone through three examples, and they've all been on this ugly, ugly command line. So for those that would like to see something a little bit more visual, I have actually also run a web app. Sorry, I am gonna flip over to it right now.
This web app is just a simple web app that's running on top of that infrastructure. And please forgive anything that's wrong in the UI; my apologies upfront. I actually prepared some of this just for this demo. I wanted to give something that would bring to life what we had just done, beyond the console.
And I also wanted to use this as a moment to highlight that, in addition to all the cool libraries and consoles and experiments and all that other stuff, we've actually used this exact same infrastructure, with Milvus, on some really scalable applications, where you can start bringing all the pieces together: doing this high-powered retrieval with a local private cloud LLM, all of it running in the context of a private cloud. So here's what I'm gonna jump into, and I know we're a little short on time, and I do wanna get back to that question on datasets, so I just wanna show you one thing that I thought was kind of cool. This is just a very simple little retrieval screen that we've mocked up.
And we're gonna go back to that simple question. The most basic thing in an employment contract: what was the amount of the salary, right? First I'm gonna run this just against a text index, okay? This is not a semantic search; this is just a pure text index on what look like some pretty basic words, the salary and the amount. And I want you to take a look at what it retrieved and what it prioritized. Well, it makes perfect sense: it has the word salary and it has the word amount.
How could it be better than that? But when you're looking for a salary amount, what are you looking for? You're looking for a number; you're looking for the actual amount. So what the text index did is index on the words and give you "salary amount." But now I wanna show you, and this comes back to the question about why you would use a fine-tuned embedding model, why you would use different embedding models, or why you would even use semantic search versus text search. Here's a super, super simple little example of it. We've now run the same query, salary amount, but against our semantic index. Look at what it did.
It prioritized a different set of results. And what's fascinating about this is that it doesn't mention the word amount, but because it's been run through an embedding space that has churned on lots and lots of contracts, it understands that for "amount," in many cases, what matters more is getting the actual amount itself. And so what it prioritized, what was closer to that query in the embedding geometry, were the passages that actually had the amount in them, not the passage that had the word amount in it. So a very simple little example, but I thought I would use it to highlight the power of semantic search, and the power of having an embedding model that's been fine-tuned on a particular domain. Finally, I know we're tight on time, so I'm probably not gonna give this the justice that it deserves.
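In llmware terms, the comparison being drawn here maps to something like the following sketch; the method names are assumptions based on the library's Query class and may differ from the code behind the demo UI.

```python
# Sketch of the text-index vs. semantic-index comparison described above.
# Method names are assumptions based on the llmware Query class.
from llmware.retrieval import Query

q = Query(library)

# Keyword/text index lookup - matches passages containing the literal words
text_hits = q.text_query("salary amount", result_count=10)

# Semantic lookup - ranks by embedding distance, so passages containing the
# actual dollar figure can rank above passages that merely say "amount"
semantic_hits = q.semantic_query("salary amount", result_count=10)
```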
But we haven't even really gotten a chance to look at the Dragon models. Again, there's a whole bunch of videos and material that we have on this, but one thing I'll leave you with, and maybe this is something we can pick up in a follow-up session: the model that we had pulled in here was one of our Dragon Deci models. And just for this UI, we did two ways of doing this: either a text passage that you can directly ask a question against, or you can pull something directly out of one of these contracts. And I'm going to do just "vacation."
How many vacation days? And now this is running an LLM-based query, a RAG scenario, against that document, the employment agreement. And here's what it's given us. The AI model came back, and you can see sub-second response time, running locally on a GPU with a local model. It read through the entire contract, pulled things out of the embedding space, and gave us the applicable clause along with the sources and confirmation of where that information came from. I'm gonna pause here; again, there's a lot more here.
I think we're just scratching the surface, but hopefully this has given you a few different use cases of an end-to-end private cloud RAG system powered by Milvus and powered by LLMWare. I'm gonna pause the presentation here; I'm not gonna cover any more demo. But I will quickly take a breath, and I'll come back to that question about how to build your own dataset.
But maybe, Eugene, just to pause to see if there are any other questions you want me to take before we get to that. Maybe this one will be quicker: someone asked, I didn't catch why we're utilizing another 1 billion parameter model for inference. Does our Deci model teach the smaller model, or is there another concept happening here? It's a great question, and the answer is actually really simple. I just wanted to showcase a couple of different things.
The 1 billion parameter model is really useful if, let's say, you were running that script on your laptop. So you finish this session, you're kind of excited, you're like, hey, this is kind of cool, let me go check it out. Let me go to LLMWare, let me pull down the repo or pip install it. You're probably running it on a laptop.
A 1 billion parameter model is a great way to get started, as a hello world, to say, okay, yeah, I kind of get this, I understand how this works, and it's gonna give good, but not great, accuracy in most use cases. And then let's say you have more of a professional setup. Maybe you're running this on a GPU server, or you set up a pop-up GPU server (we have another video on how you can quickly set up a 6 or 7 billion parameter model as your own inference server). If you have that at your disposal, then nothing better: I would always go ahead and use a 6 or 7 billion parameter model on a GPU, because you are definitely gonna get better results.
So part of what we wanted to illustrate is that it's great to test locally on your laptop with a 1 billion parameter model, up to about a 3 billion parameter model. And then when you're ready to move it to more of a production environment, or put it on a GPU, there's nothing better than simply swapping out the name of that BLING model for a Dragon model for a higher-performing set of results. Cool. Very cool. We have just a couple minutes left here, so I think it would be great to just touch on the dataset question and then we can wrap up.
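In code, the swap described here is literally a one-string change. The Dragon model name below ("llmware/dragon-yi-6b-v0") is an assumption, used only as an example of the Dragon series on Hugging Face; check the llmware model collections for current names.

```python
# The laptop-to-GPU upgrade described above is just a model-name swap.
# Model names below are assumptions; see the llmware Hugging Face page for current ones.
from llmware.prompts import Prompt

prompter = Prompt().load_model("llmware/bling-1b-0.1")     # laptop prototyping, CPU-friendly
# ... later, on a GPU server, same workflow, different model:
prompter = Prompt().load_model("llmware/dragon-yi-6b-v0")  # 6B Dragon model for production-grade results
```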
Dataset is an awesome question. Datasets are the key to all of this. When we first started doing this, we looked at a lot of the public datasets that were available. They're very geared towards chat, and there's tons and tons of stuff if you want to improve the conversational fluency, the safety-and-helpfulness types of quotients.
There's awesome stuff out there. But our use case was really different. We wanted kind of a serious, no-nonsense LLM that could read all these dense contracts and regulatory information and resolutions and compliance and financial news, and could extract very technical kinds of things from them. There weren't as many datasets like that out there. So preparing that dataset, I'd say there are probably three key dimensions to it.
The first was assembling some of the raw source materials, which is work in and of itself. The second was really defining the key categories of tasks or instructions or skills that we want to train the model on. Some might be things like value type: we wanna make sure that anytime a model is asked a certain type of question about a "who," it's consistently answering and identifying a person, and if we're asking about the address of a company, it's consistently able to respond to that type of value type. Or it could be a question type; we spend a lot of time on Boolean questions.
Yes/no: the models actually work pretty well if you ask a yes/no question. Did this happen? Can you terminate, looking at the little text here, can you terminate an agreement for convenience? So yes/no questions as a skill, right? So first, gather your source materials. Second, identify that set of instructions. And then the third, the really hard part, is to spend a lot of time in the details to make sure that you're asking the right types of sharp questions, that you're building both enough consistency and enough diversity, and that you're providing the right fact-based kinds of answers to be derived from those contexts.
So I would say it was a bottom-up effort, with a significant amount of care in the design and then care in every single detail of how you design it. Now, I think your question was, can you bootstrap off of GPT-4? And the answer is absolutely. In fact, we provide some really cool tools in LLMWare where all these sorts of inferences that you just saw us run are actually captured in a prompt state. They're saved as individual JSONL files that can then be consolidated and packaged into an LLM fine-tuning dataset that's model-ready out of the box. The biggest caution I would give is less about the mechanical side of gathering that dataset and much more about the first couple of points: giving a lot of thought to what kind of training objective you have, what the right source materials are, and what the key skills and tasks are that you want the model to be trained on.
Cool. Thank you, everybody, for coming. Thank you, Darren, for giving this wonderful presentation. I think it was really cool to see what you guys are doing, and also really cool to see the live demos and just how fast they are.
I look forward to seeing more of what you guys do, and thanks again, everybody. This will be available in recorded form soon, so I'll see you all next time. Excellent. Thank you so much, everybody. Please follow up anytime with any questions.
Meet the Speaker
Darren Oberst
CEO of AI Bloks
Darren Oberst is the CEO of AI Bloks, an innovative AI platform revolutionizing the landscape of LLM-based application development for Generative AI in financial services and legal industries. Prior to AI Bloks, Darren served as the CEO of Exadel, and launched and grew HCL Software to over $1B revenue in 5 years. Darren is currently focused on building enterprise LLM-based applications, which includes retrieval augmented generation, open source LLM middleware (LLMWare), and fine-tuning specialized enterprise LLM models for open source. Darren is a graduate of UC Berkeley with degrees in Physics and Philosophy and Harvard Law School with Honors.