I see people are still logging in, so we'll give it a little bit longer. But I do want to introduce the session today, which is, like I said, multimodal pipelines, which is very cool. Our speaker today is Sam. Sam is a seasoned technology leader with over 18 years of experience. I've worked with Sam at Hortonworks, or Cloudera, whichever it was called at the time.
He is a real expert in a lot of technologies around data engineering, analytics, and AI. He is currently the CTO at Datavolo, where he's in charge of the field engineering team, which is pretty awesome. He comes from Google and Databricks and Hortonworks and a lot of cool companies. So if you're ready, I think the number of attendees has stabilized. Welcome, Sam. Kick it off.
Thank you, Tim. Great to be here, and very grateful for having us on the webinar today. Thank you for the kind introduction. Tim and I have worked together in the data engineering space at Hortonworks, and I'm very excited to be collaborating again.
Today I really wanted to talk about the day-two moment for adoption of AI applications, and what that implies for the data platform, the data engineering, and the overall AI stack that enterprises need to be successful. I'm going to start with some observations about the state of the AI market, with a lens on the data side of the challenges, and then I'm going to show you a bit of a demo. I have two different demos, if we get to both. We're going to show the Datavolo platform, and we're going to take on what has become the hello world of RAG: processing financial 10-K documents so that I can get the right data for a chat-with-my-docs, financial-analyst style of use case.
I'm looking forward to questions towards the end as well, and happy to set aside a good amount of time for Q&A. All right, let me share my screen, and I'm going to jump into some observations from working with a lot of enterprises over the past several years. As Tim mentioned, I spent about four years at Google Cloud, helping digital natives like Datadog and MongoDB and Collibra build data architectures and adopt language models like Gemini, and PaLM at the time.
One of the big challenges I saw was that folks were not spending enough time thinking about the disruption to the data tools that we need to make AI apps successful, and that there needed to be more recognition that with any AI strategy, with any AI application, we need to really focus on the data preparation side of things, and potentially adopt new tools and new frameworks in order to get the right foundations to make our app layer successful. All right, so a couple of observations related to that. When LLM APIs first came out, I think everyone just treated them as an API: it's enough to build the application. And what we've found over the past couple of years is that's not really the case.
You do need more of an AI stack. Part of that is the data pipeline solution I'll talk about today, but there are lots of other components, right? There's the vector database side, with solutions like Milvus. There are the guardrails we need to mitigate risks and do secure enforcement of access. There's a need to do evaluation and quality. So there are a lot of new tools, and especially for enterprises that were not part of the MLOps and search world prior to this AI wave, a lot of that was very new.
Another big challenge we saw is that initially we wanted to bring our own context, our own enterprise documents and data, into the picture, but that was hard without the right retrieval platform. So at first, folks looked at fine-tuning and other ways to get their context, their data, into the language model. But retrieval, and the pattern we'll talk about more today, has become the de facto way of bringing in that enterprise context. And the proprietary data is really where all the value lives; depending on the use case, the deeper value for enterprises really requires getting that proprietary context into the application layer. The last point here is that evaluation is extremely important in this space, and in MLOps and AI applications more broadly.
In particular, I really encourage teams to think with an evaluation-first mindset. As you think about developing a new application or use case, think about how you're going to evaluate its success right at the outset. How am I going to construct an evaluation dataset? How am I going to monitor retrieval efficacy over time, as the data changes and as the questions my users ask change? This is really important, and it's challenging in this space because we know that models are non-deterministic. We also know that what a good response looks like for some of these applications can be very subjective, so even getting consistently labeled data from different labelers can be a challenge.
So this is kind of a new world to engage in, and we'll talk about evaluation more in the second demo in particular. All right, so retrieval-augmented generation is the pattern we're going to talk about today. It's really a way to ground the LLMs in context that we have throughout the enterprise. The challenge is: how do we get that context, that grounded truth, to the application layer, to the model, in the right form at the right time, and how do we make it useful? The disruption we really see in the data integration market is around unstructured data. When Tim and I were at Hortonworks, we were talking about the value of unstructured data, and that was eight or nine years ago now, I think.
But we never really got where we needed to be in terms of unlocking the value of unstructured information, multimodal information, things like documents and audio and images. I think one of the most profound paradigm shifts we're seeing with AI today is that we're now able to compute on language, compute on images and documents, just as we were able to do computation on numbers in the past. Part of that, of course, is the embeddings themselves, which represent these things as numbers, in a sense, in some high-dimensional space. But this is something that traditional data tools really struggle to cope with, especially at scale. And the day-two moment, after we've found the value in day one and refined our use cases, that day-two moment is all about scale.
It's all about resiliency. It's building pipelines and deployment architectures that are going to be robust and resilient over time. And that's really what Datavolo was founded to do. We are powered by an open source platform that Tim mentioned, which is Apache NiFi, and we'll talk more about that. But let me situate Datavolo in the overall AI stack here.
Everything you see in this purple box; by the way, I like that the Zilliz and Datavolo branding and colors seem to be very nicely aligned these days, which is excellent. Everything in that purple box is really what Datavolo brings to the table in terms of capabilities. Think about those traditional ETL steps of acquiring data, transforming data, and loading it into vector databases like Milvus, as well as other retrieval platforms, which can include traditional search like Elasticsearch, and can include document stores. There are a lot of patterns which are going to use more hybrid search: you might be filtering on metadata, but you might also be fusing results from both a search platform and a dense retrieval platform like Milvus.
The other aspect of this is orchestration of these pipelines. How can we schedule them? How can we backfill data? How can we handle it when upstream systems might be offline and we need to retry and backfill some data, or when we have a high volume of changes to process and need to manage that effectively? That can include back pressure and other key pipeline orchestration capabilities. And the last piece of this is observability. We at Datavolo have the philosophy that observability should be built in.
It should be part and parcel of the platform itself, without my having to add another component to be able to see what's going on with my pipelines. And we think about evaluation and retrieval efficacy as just another lens on pipeline health, so observing and monitoring that over time is going to be important as well. Where Datavolo's job ends is really once we've loaded a vector store like Milvus. We focus on the data engineer persona; we focus on the data engineering required to get the data and make it useful for the AI application layer.
We don't focus on the application front-end side, so we're not in the path of the LLM or in the path of retrieval. Once the pipeline has done its job, that part of the problem is solved, and that's where Datavolo's job largely ends. I have seen a bit of an anti-pattern with some of the frameworks out there, where they seem to conflate some of the data challenges and the data loading aspects with more of that application UX and prompt engineering. And we firmly believe that to make retrieval work well, you really have to situate it in the data landscape.
You have to situate it as a problem to be solved by data engineers, and you really need this clean separation between the application side and the data side. I think we saw that with prior waves of ML, by the way. It was really the provenance of the data engineer as a skill set and as a persona. And that emerged because data scientists and ML engineers are very good at building predictive models and getting good results; they're focusing on the training, the evaluation, and strong prediction.
Some of the feature engineering and some of the data work that's required is more in the wheelhouse of the data engineering team. We think that same pattern repeats now in the LLM space, and that's why we're really focused on solving the data part of the problem. All right, so how do we enable the 10x data engineer? Well, as Tim mentioned, we are built on an open source platform called Apache NiFi. This is a proven platform at scale. It's deployed at over 8,000 enterprises worldwide, in many areas of high security posture and regulation like financial services, defense, healthcare, and telco. So meeting that enterprise readiness bar around security and scale is something NiFi has been doing for a decade or so.
We really promote optionality and that Unix mindset of composing tools together. We do not want to be a black box. We want to be pluggable, to have good implementations of different parts of the ETL stack for AI, but to allow folks to plug in their own models and their own transformations, and really promote flexibility and agnosticism throughout the ecosystem. We've created a ton of integrations into that ecosystem, and we'll see more of that in the demo. And then observability and security are really table stakes.
They're just built in; they just work, and they're easy to implement and configure. So what have we been doing at Datavolo? We got started late last year, and there have been three major elements we've focused on. The first is our cloud-native form factors. We deploy both as a SaaS and as a bring-your-own-cloud model. I think this is pretty similar to the Zilliz deployment form factors as well.
You can go with a fully serverless type of option, or you can deploy our platform inside your cloud VPC. We can also deploy natively in platforms like Snowflake, and we provide a Docker-based distribution for folks who might want to deploy Datavolo in their data center. Another big aspect is how we've made security easier. Security is often a challenge to get right.
The most secure system is the one that no one can access, in some sense. So you have to really balance the trade-offs, and we strive for having the best security posture, but also making it easy for users to implement and configure; a lot of the managed offering makes that far simpler and easier. And then, as I mentioned, we've been working hard on the needs of AI systems. This includes the right integrations to acquire data, and the right transformations to make that data useful, using machine learning models and different ways to transform data, like chunking and embedding it, so we can do good retrieval and search, and we'll see that in the demo.
And then there's the ability to load the right data into the right systems. We're very excited about the Milvus integrations that we've released. We have integrations into the broader ecosystem as well: whether that's evaluation, with frameworks like Arize and Galileo, or security, with frameworks like Daxa and Pebblo, we have partnered with a lot of the key players in the AI stack to have maintainable, easy-to-use, out-of-the-box integrations. And I'll just touch on that maintainability piece quickly.
It's very easy to get started with a lot of these AI applications, again, in that day-one setting. But when we go to actually implement these things in production, we're going to see a lot of change and a lot of churn in a lot of these frameworks and APIs. OpenAI, Anthropic, Google, LangChain, LlamaIndex: they're all moving extremely quickly, and there isn't a lot of stability in these libraries. So when we build an integration, we are also taking on the burden of maintaining it over time and providing the right version control and the right deployment strategies to adopt it easily. That's an important point when you think about why you would use Datavolo.
We are taking on a lot of the burden of maintaining these integrations and securing them, things your team doesn't have to do as far as undifferentiated code goes. All right, maybe the last slide here, and then we'll jump into the demo. I wanted to highlight as well that Datavolo is a streaming platform; it's a continuous and automated platform. One of the things we've noticed in retrieval is how important the metadata is.
Metadata is used for filtering by dimensions before I do semantic search against the embeddings. The metadata can also be used for ranking results; getting the results ranked in the right order is important before I send them to the LLM for generation. And I might also use the metadata for authorization and secure enforcement of retrieval. Sam and Tim might not be part of the same team within an organization, and we might not be entitled to get the same answers from a chatbot, because we have different privileges and different access to the underlying source documents. And so we talk about CDC, or change data capture, for unstructured data, because we're able to ingest changes in an event-driven way from different upstream systems, let's say Google Drive or SharePoint, and then propagate those changes downstream.
For access control, that's extremely important. If Tim's access on a document changes, we'd want to know that right away. We'd want to replicate that change down to the vector database so that the updated metadata can be used for enforcement. All right, with that, let's jump into the demo. We can talk more about the platform partnerships and, of course, our ecosystem integrations,
with Zilliz being a highlight for today. All right, I'm going to jump over to the Datavolo UI. As I described at the outset, this demo is really about acquiring financial 10-Ks from different companies like Nvidia and Apple, and parsing, chunking, embedding, and storing them so that we can build a chatbot, a chat-with-your-data experience. Imagine I'm a financial-analyst type of persona who wants to very quickly and easily get the right insights out of these multi-hundred-page documents. And as we'll see, these documents have complex elements in them.
They have tables, they have sections, and they don't have markup. So unlike something like an HTML document or a Markdown document, where we can split things by that markup, we don't have that in a PDF. We actually have to derive that markup, derive that metadata, to figure out how to chunk in the right ways. And of course, with RAG, you're always striking this balance of how best to split the data so I can get good retrieval efficacy and also good generation from the language model. All right, so how do we acquire the documents? We have over 350 integrations that ship out of the box.
I'm showing S3 today, but you can imagine this being a host of different systems. If we think about some of those Google Drive changes I mentioned, we can capture those in an event-driven way, or maybe we want to do that for SharePoint. There are really so many systems we could talk about here, but for any of the major enterprise platforms you can think of, we have a built-in integration. The platform is also extensible, so you can build your own.
In this case, all of our documents are in S3, and we're going to acquire them from there. It's very easy to configure an integration like this in Datavolo: I'm just specifying which S3 bucket to use, and I'm going to use a service to provide my authentication so I can access those documents. One thing I'll call out here is the listing strategy.
I mentioned that Datavolo is continuous and automated, so as new data arrives in that S3 bucket, we're going to pull it into the flow right away. If we have a process that is landing those documents, we'll get them in near real time. But maybe we don't want to reprocess older documents, so I'm using this timestamp strategy to only acquire things I haven't seen before, and we're going to store a little bit of state to do that.
This is pretty useful to make sure I'm only processing new things, and there are different strategies for it; I can also use hashing techniques and track this at the entity level. But this looks good. After I list the data, I'm going to fetch the data, and we do this in two different pieces because that makes it really scalable.
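To make that listing strategy concrete, here's a minimal sketch in plain Python with boto3; the state file and function are hypothetical stand-ins for the state NiFi manages internally, not how the actual processor is implemented:

```python
import json
import pathlib

import boto3

STATE_FILE = pathlib.Path("listing_state.json")  # hypothetical local state store

def list_new_objects(bucket: str, prefix: str = "") -> list[str]:
    """Return keys modified since the last run, then advance the stored timestamp."""
    s3 = boto3.client("s3")
    last_seen = json.loads(STATE_FILE.read_text())["max_ts"] if STATE_FILE.exists() else 0.0
    newest, new_keys = last_seen, []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            ts = obj["LastModified"].timestamp()
            if ts > last_seen:  # only keep objects we haven't processed before
                new_keys.append(obj["Key"])
                newest = max(newest, ts)
    STATE_FILE.write_text(json.dumps({"max_ts": newest}))  # state for the next run
    return new_keys
```

Each run only sees objects newer than the stored timestamp, which is what prevents reprocessing.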
We run on Kubernetes, so this environment we're seeing here is in EKS, in our SaaS environment, and we're going to scale horizontally with volume. We have a pattern here where the listing might run on just one node, and then it's going to split that metadata listing out across many nodes so they can all fetch data in parallel and not step on one another. We also capture provenance.
In this entire directed graph that we see within Datavolo, every state change between processors is going to be captured as a provenance event. In particular, when I fetched the raw data, I can look at the output here. We have this Nvidia PDF, a hundred-page 10-K with lots of complex information. If we scroll down into the document, we're going to have things like tables, and we need to be very careful about how we parse these tables.
We don't want to split a table in half and ruin its referential integrity. We're also going to have charts and images, and these are going to be important as well. We'll see how we parse this data in a moment. All right, so I'm going to clean up a few things and set some URLs and doc types.
The data engineer can set different metadata on this event, which is really useful; we're going to see metadata as a theme throughout this demo. Just to show you what this looks like in the attributes: some of the metadata was automatically generated, and other pieces were captured by the data engineer. We're going to bundle up all of that metadata before we write the data to Milvus, and that's going to enable some of those hybrid search and ranking use cases I described.
You can see some of this was grabbed from S3, and then we have things like the original source URL. When we're doing a RAG app, we really want to be able to do citations, and that's part of this grounding aspect, right? When I return a chunk, I might want to return it with that original source URL, and I can even get more specific than that with sections, which we'll see in a moment. All right, so after I have that raw PDF, I'm now going to do some computer vision to do what's called layout detection. Layout detection is going to give me bounding boxes and labels for every component of the document. And in our environment here, we have a model for that.
We have fine-tuned and customized it; it's based on some prior art, a model called YOLOX, and it's running on a GPU in a shared service. We use it to derive those bounding boxes and labels. We can also do OCR if we need to extract text that isn't in a text layer, but importantly, we're going to use those labels. Let me show you an example of what this looks like.
We have this PDF annotation viewer in the UI. This looks similar to what we were just looking at, except now you can see I have all these boxes and labels around all the different components of the document. If we go down to some of those more complex elements, we see things like tables. If I click one, you'll notice that on the left-hand side I've actually built this hierarchy. The hierarchy is really useful because when I'm traversing the document, I want to use it as a document graph.
Essentially, I want to be able to pull the siblings. For instance, this table is going to have narrative text above it that describes what's happening in the table; similarly with the image. So I want to be able to pull the preceding and succeeding chunks that surround an element like this. I'm also going to want to chunk this document by its sections. You'll notice that in the green text here, we've recognized this as a section with high confidence, and this hierarchy is collapsible on the left-hand side.
So all these different narrative-text elements are tied to the section. Sometimes I might want to match on some narrative text but pull a bigger chunk, like the full section, into the prompt that I send to the LLM to generate the response. With RAG, you're always playing this Goldilocks game: I want the right semantic precision in the embedding to get good matches against my queries, but I also want good richness for generation, giving as much context to the model as possible. Those two actually push and pull on one another, so you have to decide how to balance that trade-off.
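Here's a rough sketch of that match-small, return-big idea; the document graph, IDs, and search function are hypothetical stand-ins for what the parsed hierarchy provides:

```python
# Hypothetical parsed-document graph: each narrative-text chunk records its parent section.
sections = {
    "item7-mdna": "(full text of the Management's Discussion and Analysis section)",
}
chunks = [
    {"id": "c1", "parent": "item7-mdna", "text": "Data center revenue grew year over year..."},
]

def context_for_generation(query_vec, search_chunks):
    # search_chunks is a hypothetical dense search over the small chunk embeddings...
    best = search_chunks(query_vec, top_k=1)[0]
    # ...but we hand the LLM the whole enclosing section for richer generation.
    return sections[best["parent"]]
```

The embedding stays semantically precise while the prompt context stays rich, which is one way to ease that trade-off.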
Similarly for images, and this gets pretty interesting: you also want something you can embed and retrieve about the image. You could use multimodal embeddings to embed the image itself, and there's some really exciting work around vision language models. Tim and I have been talking about that from a Zilliz support standpoint, in terms of some of the multi-vector representations you need for these VLM models, these late-interaction retrieval models. But in addition to embedding the image with that kind of multimodal embedding, something like CLIP, we can also use language models to describe the image.
The language model is going to give us natural language back, and we can embed and retrieve that natural language. We can also use this graph, this hierarchy, to pull related chunks based on where they sit in the document: this table is related to this image, and the preceding description is related to it as well. So all of that is available to us in the parsed format that we're going to land in the Milvus database. Excellent.
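As a rough illustration of that describe-the-image step, here's what a call might look like with the OpenAI Python SDK and GPT-4o mini; the prompt and helper function are my own, and in Datavolo this is a configured processor rather than hand-written code:

```python
import base64

from openai import OpenAI  # assumes the openai Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_image(png_bytes: bytes) -> str:
    """Ask a vision-capable model for a short description we can then
    chunk, embed, and retrieve like any other text."""
    b64 = base64.b64encode(png_bytes).decode()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this figure from a financial 10-K in two sentences."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```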
All right. You'll notice that we use LLMs during the pre-processing stage; we might use one to describe an image or to summarize some text. So when I said we weren't in the path of the LLM, maybe I lied a little bit. We're not in the path of the LLM during the user interaction, at query or retrieval time, but we often do use LLMs in the ETL process, the data pipeline, to make the data useful before landing it in the vector database. All right, let's go back to our main flow.
One cool thing here is we can use those derived annotations about each component of the document to route it to different downstream parsers. I'm routing the images and the tables in particular to different downstream models. For the images, which come in here as we just talked about, we're going to use GPT-4o mini. You can be very dynamic, by the way, with model selection and model routing using Datavolo. This is becoming increasingly important, because some models are big and give high-quality responses.
Other models are smaller and give maybe slightly lower-quality, but faster and cheaper, responses. As a data engineer, I might want to balance fast and cheap against quality, and in some cases these smaller models do quite well. We're even seeing some true SLMs, small language models, with a billion parameters or under. Yes, that does sound large, but compared to multi-hundred-billion-parameter models, they are much more svelte.
We're seeing some SLMs used for evaluation in particular. I mentioned Galileo; they've created one that is very low latency, so you can use it in the path without creating a lot of delay. So we're using GPT-4o mini here, with a system prompt to help us describe this image, and we're going to get some natural language back that we can then embed and retrieve. On the table side, just like that layout detection model,
we've also developed our own proprietary table extraction model. It's actually a Table Transformer model; again, it's based on prior art, but we've customized and tuned it. There are big datasets out there, like DocLayNet, which have a lot of table examples in complex PDFs, and we've used those publicly available datasets to really make the extraction better.
The output of this model (and again, this is pointing to an ML service running under the hood on a GPU in our SaaS environment) is a CSV-style representation of the rows and columns, with the right data types extracted from each cell. We can load that into a variant table in a database, we can parse it and extract from it, or we can just describe what's in it, again using a language model.
You might not need to do this step; you might want to skip it, because it can mean more tokens that you're sending out to OpenAI, increasing the cost. But we have found it's nice to also describe the CSV output of the table image extractor and use that as something we can retrieve and search as well. Finally, at the end of this subflow, we're going to merge everything back together. Remember, we split things apart: we split apart the tables and the images and the text.
We're then going to merge it back together. Behind the scenes, if I go into the data provenance here, this is what it looks like. It looks nicer in that PDF viewer, but just for kicks, this is the underlying JSON representation of the parsed information. This is what we're going to use downstream to do chunking and embedding, and if you want to drop into this lower-level representation, you can do that as well. All right, so we've now merged everything back together, and we come back to our main flow.
One thing we sometimes want to do during pre-processing is enrichment. If you again have that evaluation-first mindset and you're thinking about how users are going to query this data, they often are going to ask questions that, say we're dealing with an NVIDIA 10-K, are about Nvidia as a company but might not be contained in the 10-K itself. They'll ask questions like: how many employees does NVIDIA have? Where is the headquarters? Or maybe I want to trend things over multiple documents. So you often need to do some entity recognition, and some enrichment with those recognized entities, to get even more context into your parsed document. In this case, we are extracting the ticker, like NVDA, and we're using that to do a lookup.
We have a Postgres table here with some more information about the companies we care about, and the lookup is going to pull in that additional company-level data. But you can think about this as enriching any entity: any knowledge base or data warehouse that contains information about the entity, you might want to bring into your retrieval system here, to answer more user questions than you could with the document alone. That's what the enrichment step is all about; there's a rough sketch of it below. Next I'm going to skip down to chunking, and if we have time, I do want to briefly cover evaluation as well.
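That sketch, assuming a reachable Postgres instance and a hypothetical companies table keyed by ticker, might look like this:

```python
import psycopg2  # assumes psycopg2 is installed and Postgres is reachable

def enrich_with_company_data(metadata: dict) -> dict:
    """Use the ticker extracted upstream (e.g. 'NVDA') to pull company-level
    context from a hypothetical reference table into the chunk metadata."""
    with psycopg2.connect("dbname=reference user=etl") as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT name, employees, headquarters FROM companies WHERE ticker = %s",
                (metadata["ticker"],),
            )
            row = cur.fetchone()
    if row:
        metadata.update(company_name=row[0], employees=row[1], headquarters=row[2])
    return metadata
```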
So I'll drill into this chunking subflow. We're going to chunk in a few different passes here. The first pass is going to use the chunk document processor, which expects that JSON representation, and we're going to chunk by section. There are a few different strategies I could choose here, but section is really nice because it's going to have that nice balance of semantic precision.
It's just talking about a few things, but it also has good richness. We've derived the section labels with that layout detection model, and we're going to use them as our chunking boundaries. And because we're using those as boundaries, we won't run into any challenges with chunking those tables or images; they're going to be split out separately. And we can tune certain things.
Maybe we do or don't want to include subsections; maybe we want to tune the chunk sizes or overlaps, which are useful. You essentially have a bunch of hyperparameters, and as a data engineer, I'm tweaking them to get the best retrieval results so that my AI engineer has the best dataset to work with. This ties into evaluation as well: I want to be able to capture which hyperparameters, which chunking and parsing strategies, lead to the best retrieval outcomes.
So you can capture each of these hyperparameters in line with your evaluation metrics and essentially do some data science to correlate those hyperparameter values with the best possible retrieval scores. We can definitely chat more about that. All right, so now we've chunked by section, but I'm going to do one final chunking pass. Once I have the section text, I can chunk that text further, and again, this might be useful because I want to play with that precision and richness trade-off.
The most naive thing I can do (it doesn't really work that well, but just to show you different examples) is to chunk by 400 characters at a time. This just takes 400 characters at a time, and it will run into problems, maybe splitting something that's semantically connected down the middle. It has no notion of what a sentence boundary or a semantic boundary is, so it generally doesn't work that well.
But I also have this chunk text processor, and the reason I'm splitting this all out is to do proper evaluation. I'm doing it in parallel, which is nice, and I'm storing different collections in Milvus so that I can point my evaluation dataset at those different collections and see which performs best. Another option I have is semantic chunking; we also have things like sentence-based and recursive delimiters. With semantic chunking, I can again tune things like the similarity threshold.
This computes cosine similarity between pairs of consecutive sentences. We look for the sentence break, but then we compare cosine similarity, and we only call it a split if we drop below the threshold. And we've tuned it. How did we arrive at 0.6? Well, like with many things in machine learning, we just empirically tried different values and decided what worked best based on evaluation.
It's very common in machine learning to end up with these kinds of magic numbers for hyperparameter values. We did have another option that we tested, and 0.4 worked less well. So that's just something you can, again, tune and evaluate.
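A minimal sketch of that semantic chunking pass, assuming an embed() function that returns a vector per sentence; 0.6 is the empirically tuned threshold just mentioned:

```python
import numpy as np

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.6) -> list[str]:
    """Split at sentence boundaries where cosine similarity between
    consecutive sentences drops below the threshold."""
    if not sentences:
        return []
    vecs = [np.asarray(embed(s), dtype=float) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sentence in zip(vecs, vecs[1:], sentences[1:]):
        cos = prev @ cur / (np.linalg.norm(prev) * np.linalg.norm(cur))
        if cos < threshold:  # semantic break: close this chunk, start a new one
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```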
All right, so after we have our final chunks, this is the granularity we're going to persist to our Milvus database. We're using Zilliz's serverless environment here. But before I write the chunks to the database, I first want to create my embeddings. Again, at any part of this flow, I can come in and see the output; in this case, we have the text and we also have the embedding weights. Very importantly, I want to capture all of the chunk metadata and write it in line with the embeddings themselves.
The way I've done that is by organizing all of my different metadata paths in one place. The cool thing to note here is that each of these metadata elements came from different parts of the pre-processing flow: some were set automatically by the system, some were derived by those machine learning models, and others were set by the data engineer. Things like the document type and the URL were set by the engineer as part of their flow design, and we also captured some things via that enrichment part of the process.
Things like the title hierarchy: we used the symbol to look up what the hierarchy was in that Postgres table, and we've captured that as a metadata element as well. All right, we're now ready to write this data into Milvus. As I mentioned, we are using the Zilliz serverless environment, so we have a connection service here. If I show you that briefly, we're coming into this serverless environment, which is very nice and easy to provision and performs very well.
Then I just have to configure my authentication. We store everything using secrets and integration with secrets managers, so we don't have to worry about our credentials being leaked. And then I'm just configuring things like which collection I'm writing to and the different paths for where I want to store things, such as the chunk content and the metadata. We can store the content if we want to do that pattern where we're maybe sending different text to the model during generation. All right, so now we're done with this flow.
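For flavor, here's a minimal pymilvus sketch of that write against a Zilliz serverless endpoint; the URI, token, collection, and field names are all hypothetical, and the collection schema would need to exist first:

```python
from pymilvus import MilvusClient  # assumes a recent pymilvus release

client = MilvusClient(
    uri="https://example.serverless.zillizcloud.com",  # hypothetical endpoint
    token="<api-key-from-your-secrets-manager>",
)

def write_chunk(embedding: list[float], text: str, metadata: dict) -> None:
    """Persist a chunk's embedding alongside its content and metadata,
    so hybrid filtering and citations work at retrieval time."""
    client.insert(
        collection_name="tenk_chunks",  # hypothetical collection
        data=[{
            "vector": embedding,   # dense embedding of the chunk
            "text": text,          # chunk content for generation
            "metadata": metadata,  # doc type, source URL, section, ACL groups
        }],
    )
```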
We've written our data to our vector database, and we're ready for the AI engineer to take over. Before we go to questions, I did want to talk a little about evaluation, so let me go to a different environment. With this evaluation-first mindset, the first thing we want to do is acquire or build an evaluation dataset.
For this example, I'm using a Wikipedia dataset, and there is a known set of questions and answers about these Wikipedia articles. I'm just pulling a few here: we pulled Wikipedia articles about some hockey games and hockey players, and some other articles about Ireland and community activism in Chicago. It's a very diverse set of Wikipedia articles, but the key thing is that we have questions and answers that came with our evaluation dataset. One thing we often get asked is: what if I don't have that dataset? Well, there are a couple of options.
The first is that you might want to use human experts, similar to how you might label data, to build that Q&A dataset. You might also have prior historical data that can be used. For instance, if I'm building an HR chatbot, I would ask myself: how did users get answers from HR in the past? They probably created a ticket or sent an email. Can I acquire the history of questions and answers that have been asked in the past and use it to build an evaluation dataset? A third option is that we can actually use language models to synthetically generate an evaluation dataset.
If we have the answers, we can pretend we're playing Jeopardy and ask the model for the question, and it will oftentimes do a very good job of synthetically giving us questions we can use to construct the dataset. You would probably use a mix of these techniques in real life, but the important point is that you're never going to know how well things are working if you don't acquire, build, or synthesize that dataset at the outset. That's very important. So we're going to do some pre-processing with these ground-truth questions, and then I'm going to do a bunch of different embeddings.
I'm going to embed the questions, and I'm going to embed my ground truth, which is the answer. I'm also going to generate an answer, the generated one, and embed that too. We're going to need to do some similarity comparisons across each of these different aspects. All right, so once I've embedded my questions, I can query the database. Just as we saw the ingest into Milvus in the previous flow, we also have the ability to query Milvus.
It's a very similar setup here. What we're going to do is take the question (you can imagine this coming from a user if it were deployed to the application layer, but in this case I'm pulling the question from the eval dataset), embed it, and match that embedded query against the chunk embeddings stored in the database. All right, so what do we actually mean by evaluation? I'm focused here on two classes of evaluation metrics. One is the traditional information retrieval metrics: things like context recall, context precision, and mean reciprocal rank. The other is the end-to-end metrics, which are based on the response of the language model.
Those are things like faithfulness, which we'll see in the next part of the flow. All right, so what I now want to do is evaluate how well retrieval has worked, and I'm going to use the LLM as a judge; that's what this pattern is called. Rather than asking a human labeler to assess, we're asking the LLM to assess. So we're going to send the question, and we're going to send the results.
When I ran this query against Milvus, if we look at the output, I got results. Take that first example: who were the three stars in the NHL game between the Buffalo Sabres and the Edmonton Oilers? We have an answer, and we have a ground truth from the Q&A dataset. What I want to do is take the question and see what comes back, based on the data I've loaded into Milvus (I skipped that step for brevity, but we loaded all this data into Milvus as well, after chunking it). I'm seeing the score here, which is the cosine similarity between the question and some retrieved chunks that I had stored in the vector database. We're doing K equals three here, just getting three results. Those are the results we get, and that's what I'm passing into this integration; in this case, we're using GPT-4o mini.
So that's the context, and then the ground truth is the answer. Then we're going to persist some retrieval results, so we can take a look at those. Now we have retrieval results: things like recall and precision, and then the F1 score, which is a blend of recall and precision, because they do push and pull on one another.
The F1 score is the harmonic mean of recall and precision. We can also compute the mean reciprocal rank. What the language model is telling us is whether given statements actually match the provided context. You can see that for a given chunk ID, this one matched, and it gives us a reason. This all came back from the language model.
The context explicitly lists those three stars; these were the three stars, by name, and we have a reason. So we're able to derive these attribution scores. This could have been a human labeler; we're just using an LLM to do it in a more automated way. And now, because we have those attributions, we can do evaluation properly. Those were all retrieval metrics.
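For reference, the arithmetic behind those metrics is simple once the judge has labeled each retrieved chunk as relevant or not; a sketch:

```python
def precision_recall_f1(retrieved: list[str], relevant: set[str]) -> tuple[float, float, float]:
    """Standard IR metrics over one query's retrieved chunk IDs."""
    hits = [c for c in retrieved if c in relevant]
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean, so it penalizes lopsided precision/recall pairs.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant result across queries."""
    rr = [
        next((1.0 / (i + 1) for i, c in enumerate(retrieved) if c in relevant), 0.0)
        for retrieved, relevant in runs
    ]
    return sum(rr) / len(rr)
```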
We were able to compute recall and precision, and I can use them to determine how well my parsing and chunking from that first demo worked. But the next thing we need to do is the end-to-end part. What does that mean? We now want to generate answers using the language model from the provided context; that's what retrieval-augmented generation is all about. So we're going to generate answers and compare them against the ground truth I had in my eval dataset.
All right, so after I've generated answers, embedded the ground truth, and embedded the generated answer, I can evaluate two other metrics. The first is answer correctness, and the last one is faithfulness. Answer correctness is going to be a function of the F1 score, and it relates to how well the ground truth compares to the generated answer. If we take a look at this, we'll get some more interesting output. And by the way, we can send these results to an experimentation framework: a lot of folks use MLflow or Weights & Biases as an MLOps experiment tracker, and we have integrations into those.
So now we have both the retrieval metrics and the actual answer-correctness evaluation. We have a generated answer; we computed the embedding for the generated answer as well as the ground truth, and then we were able to assess answer correctness. We got a very high score here, 96% on average across the top K results, but clearly the first answer was the best.
We can use this as feedback into how we're building the data pipeline and to determine which strategy is best, and it's nice to have that answer correctness in a single number. The last one is faithfulness. Believe it or not, these models can still hallucinate: even if you give them all the beautiful context in the right reading order, they can still decide to generate something else.
Faithfulness is a measure of that. Faithfulness extracts the claims, the statements of fact, from the generated answer and computes what percentage of them is actually supported by the retrieved context. If we take a final look at this step, we'll see that we evaluated faithfulness as well, and we got a good score here: the model is using the retrieved context to answer the question, and it didn't add any additional noise to the answer.
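A toy sketch of that LLM-as-a-judge faithfulness check, with a hypothetical judge prompt; real evaluation frameworks do this more carefully, including the claim extraction step itself:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "Context:\n{context}\n\nClaim: {claim}\n\n"
    "Reply strictly 'yes' if the claim is supported by the context, otherwise 'no'."
)

def faithfulness(claims: list[str], context: str) -> float:
    """Fraction of the generated answer's claims that the judge model
    finds supported by the retrieved context (1.0 = fully grounded)."""
    if not claims:
        return 1.0
    supported = 0
    for claim in claims:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user",
                       "content": JUDGE_PROMPT.format(context=context, claim=claim)}],
        )
        supported += resp.choices[0].message.content.strip().lower().startswith("yes")
    return supported / len(claims)
```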
Finally, we can record these metrics. MLflow is one of the integrations; we're using our Databricks environment here to integrate with MLflow, and then I can have different experiments that I track over time and tie back to the hyperparameter values. All right, hopefully that was fun and exciting. Let's pause here and go to some audience questions.
Okay, let me see what we've got here; checking the Q&A and the chat. Okay, a quick one from the Q&A down in the chat, and I'll read it for you: why is a new kind of ETL platform needed to make AI apps successful? Great question. Thank you, Tim.
This really goes to the heart of why Datavolo exists as a company, so I love the question. I think the first thing we saw is disruption in the ETL market relative to unstructured data. A lot of systems are really built to replicate relational data from a database to a data warehouse, and they have more of that ELT pattern, where I do the transformation after I land the data. That really is not the case with embeddings, and I think embeddings are going to be one of the important residues of this AI wave.
We're going to recognize them more as a first-class data citizen. One thing you'll find with embeddings is that you can't really transform them after the fact; you can't really do this ELT pattern. You have to do the transformation, including the metadata acquisition, in the pre-processing layer, and Datavolo is built for that.
First of all, the underlying technology, NiFi, is built to handle any form of data, really any byte stream (audio, video, images, text), and to do that in a very scalable way. The last part of this is that we also have to make the data useful before we land it, and that's where we've invested in these ML-powered models to better extract and parse the right information. If I'm dealing with complex PDFs, I need to be able to do that section metadata derivation so I can chunk the data correctly, and I need to recognize tables and extract the tabular data correctly.
Those are the sorts of investments you would want in an ETL platform designed for multimodal data. Makes sense. Let me check for another one. Yeah, I've done some pretty cool flows with NiFi and unstructured data; that's pretty awesome.
Okay, we've got another one: why are continuous and automated data pipelines important for AI apps? Excellent. I also see one from Vincent; I'll answer the one you just posed first, Tim, and then we can go to Vincent's question. So yes, continuous and automated is really important, and you have to think about this not just for the content of the data but also the metadata.
Metadata on documents is going to change over time: permissions can be added and deleted, and we might create new versions. And once you're building these retrieval platforms, they're going to become one of the de facto interfaces for your teams to access all of your data. Once that's happening, you need to keep these retrieval stores up to date as the state of the world changes.
You don't want stale data. We've all noticed with GPT models that, based on their training cutoff, they don't know what's happened over the past two or three months; that's another good use case for RAG, or search in concert with the language model. But you have to have these continuous and automated pipelines to keep both the content and the context up to date at all times, so that users are getting fresh answers from the AI application. Okay, very good. You might as well do Vincent's;
he's got an excellent one in there. Thank you, Vincent; great question. So let me talk about LangChain, and then I'll talk about LangGraph specifically. If we go back to that diagram I showed, where we have Datavolo situated in the AI stack,
I'll just pull that up briefly. I put LangChain more broadly on the right-hand side, in that application framework setting. LangChain is really good at the application UX, at prompt engineering, at having the right templates. But where we believe it's conflating some of the overall problem is the data loading and the data engineering; we think that should be situated more in a true data pipeline solution. We believe LangChain should focus on the application side, the front-end side, and use a real data pipeline solution for loading the data into the right platforms and doing the transformation.
This is especially important at scale, when it's not about doing this for one document but for all my documents, and doing it in a continuous way. Now, LangGraph specifically is very interesting. In my understanding, it's more about agents, orchestrating different agents, and building these cognitive architectures that Harrison likes to talk about. I think that's a very important area. We're actually using agents for our copilot service, which you'll see released soon.
This copilot is intended to help data engineers build the right pipelines. As Tim has probably experienced, Jolt transforms can be annoying, as can getting a connection string right or getting a regular expression right. Wouldn't it be nice to just ask a copilot how I should configure this processor and have it do it for me? Absolutely it would, and we've worked hard on building a nice alpha version of that, so you'll see it soon. But I would think of LangGraph as more about building that overall agentic architecture, which is situated in that front end.
Datavolo is more about solving the data pipeline side of things; it's not in the path of the user sending a task or a prompt to an agent. Okay, there was a question about the recording: it will be emailed to you if you're registered. If not, it will be on YouTube under Zilliz's channel, and I'm sure Datavolo will have a link to it.
You'll easily be able to find it, and we'll make sure the slides are out there as well. We've got another question: what are you seeing as some of the major blockers for enterprises adopting AI? We've all seen these. Yes, this is something we think hard about every day.
I think we all got very excited around the ChatGPT moment about the promise of this technology, and the promise is fully real. I think this is one of the biggest paradigm shifts I've seen in my career, along with cloud and big data and mobile. But enterprises right now are seeing a few challenges in adoption. One of the big ones is the risk associated with not doing things in the right way. I talked about having the proper guardrails; that can include role-based access and secure enforcement at retrieval time.
It can include having the right evaluation and testing. And one problem is that user expectations have to be set appropriately. One thing that really helps evaluation is publishing your application as soon as you possibly can, even when you're still embarrassed about how well it's working. The reason is that you're going to start this flywheel where users give you feedback on good and bad responses, right? Thumbs up, thumbs down. Every time you see a thumbs down, that's gold for improving your application and the overall stack. So I encourage folks to start acquiring that user feedback in an automated, scaled way, so you can use it to improve the application.
So: setting the right expectations and having the right guardrails. I think also that enterprises did not necessarily build out that full stack with best-of-breed solutions. A lot of folks are constrained by wanting to avoid certain cloud services; maybe they don't want to use OpenAI or Anthropic, so maybe they're using underpowered models. The open source models are super exciting, but with Llama 3.1,
maybe when you start to get into the 400-billion-parameter range it gets a little bit better. But Llama 3.1 versus something like Claude 3.5 or GPT-4o is really a big difference in performance; several orders of magnitude, I would say. So you have to pick the right technology and build the right stack.
And some folks have been constrained, whether it's about risk or trust or guardrails, from using some of the best models and the best architectures, and that's where I've also seen an adoption challenge. Okay, we've got time for one last question, and I see it down there from Praveen: how do you think about role-based access to embeddings in these AI models? I think I get the drift, Praveen. I think this is about secure enforcement, where we're talking about different users with different entitlements.
They should get different results when they go to ask a question and get retrieved context. We do that using metadata. We don't actually change the embedding itself; when you store the vector, you also store the metadata around which groups and users have access to the document. And then you have to implement secure retrieval in your retriever, in your application. Daxa, by the way, is a partner we've worked with that does this.
There's an open source framework under the hood called Pebblo. The secure retriever is going to acquire the user's groups and context based on the authentication event to the application, and it's going to match those group memberships and entitlements against the metadata we've stored. It's only going to return retrieved results that match the permissions of that user, and only at that time does it do semantic search against those embeddings. So it's a hybrid search model.
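A hypothetical sketch of that enforcement with the pymilvus client from the earlier write sketch: filter on an access-group metadata field so that only entitled chunks are candidates for the dense search (the field and group names are made up):

```python
import json

# `client` and `query_embedding` are assumed from the earlier sketches;
# the groups would come from the user's authentication context.
user_groups = ["finance-analysts", "all-employees"]

results = client.search(
    collection_name="tenk_chunks",   # hypothetical collection from earlier
    data=[query_embedding],          # the embedded user question
    # Scalar metadata filter applied with the vector search, so only
    # chunks this user is entitled to see can be returned.
    filter=f"access_group in {json.dumps(user_groups)}",
    limit=3,
    output_fields=["text", "source_url"],
)
```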
You trust the application to acquire the right authorization rules associated with the user's authentication context, and you match that against the metadata. You'll see metadata as a really important aspect throughout the application side; I tried to emphasize it a lot because it's really important for hybrid search, re-ranking, and secure access. Very cool. I think we're out of time and they're going to shut us down.
So thanks from your team there; say hi to Joe, and thank you to everyone attending. You will get the slides, and you will get this video; maybe we'll clean it up and make us look magically better. I'll have a better jacket. Thanks again, and we'll see you probably next week.
Thanks a lot, everyone. Thank you very much, Tim. Thanks for having us, and thank you to our audience. Thanks everyone. Bye.