Beyond RAG Partitions: Per-User, Per-Chunk Access Policy
Webinar
What will you learn?
Partitioning vector databases has proven to be a useful tool for privacy and per-tenant isolation. Recent releases of vector database software, including Milvus, have continued to improve partitioning capabilities, such as pushing the number of partitions into the millions and improving the selection of partitions per tenant.
Despite these advances, management overhead increases with the number of partitions. Relative to the capabilities enterprises require and have come to expect from their existing storage systems and databases, there is still a shortfall. New capabilities specific to how vector databases store data and how they are used in RAG applications are needed.
Topics covered:
- Origins of enterprise requirements for granular access control and policy in storage systems.
- Sensitive data identification: data classification versus access control.
- The problem that data duplication in enterprise datasets creates when copying permissions from documents to chunks.
- How enterprise access requirements can be met with per-user, per-chunk access control.
- Case study and example implementation.
Today I'm pleased to introduce today's session, Beyond RAG Partitions, and our guest speaker, Rob Quiros. Rob is a technologist and entrepreneur with more than 30 years building and leading enterprise-class solutions in networking and security. His first job was building a router to compete with pre-IPO Cisco. Later he contributed to the creation of the first hardware RAID system at UC Berkeley while working toward a PhD in quantum electronics. Since then, he has led the growth of $1 billion-per-year product lines at Cisco and at WAN optimization leader Riverbed, and helped pioneer the now multimillion-dollar secure access service edge market at Soha Systems.
And at Akamai, where he was an early team member and executive. His current mission is to bring new solutions to market to meet the challenges of generative-AI-native applications as CEO and co-founder of Caber Systems. So, welcome Rob, and I'll hand it over to you. Alright, Stefan, thanks very much for the kind introduction, and welcome everybody. I'm very happy to be able to talk with you today.
And, like Stefan said, I've been around for quite a bit of time in Silicon Valley, working on technology products in the networking, security, and data spaces. What's particularly interesting now is how the use of data has changed given generative AI. So I'd like to jump right in and start telling you a little bit about how I see the world evolving in terms of generative AI applications. And again, please feel free to ask questions as they come up, and I will try to answer them.
So I think we're all aware that there's a strong need for governance of data use in AI, and I use Glean as an example here. Glean is one of the companies that has done a fabulous job of penetrating enterprises with AI, obviously because of the AI capabilities, but also because of their ability to provide permissions and governance around the use of data. And this is critically important within enterprises because of the vast amounts of data they have, their compliance requirements, privacy, security, and other things such as performance and cost.
Being able to govern that data use is, of course, quite critical. Most fundamentally, why are we talking about doing partitioning or per-user access controls? It's really because users should only see the data that they're allowed to see, whatever the policy might be. Or, in the case of a SaaS provider, if they're from one customer or one company, you don't want another company to see anybody else's data. So we know that authorization in this case must be deterministic. But one of the challenges we face in the generative AI world, of course, is that AI itself is probabilistic.
It's non-deterministic. We can't predict from one prompt to the next exactly what we're going to get, and yet that's really the expectation that enterprises have, or hope for, I should say. Now, partitioning has a lot of utility for segregating data across different customers and different use cases. And it's quite effective when we use it for this kind of multi-tenant segregation in SaaS applications, because we know a priori, before we've ever accepted or moved or built any vectors in the database, what data belongs to which customer.
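The a priori tenant-to-partition mapping described here can be sketched as follows. All the names (`route_query`-style helpers, the `tenant_id` field) are illustrative, not taken from any particular vector database API; the point is that because the tenant is known before any vector is written, both writes and searches can be pinned deterministically to one partition.

```python
# Sketch: deterministic multi-tenant isolation for a vector store.
# The tenant is known *before* ingest, so routing is decided up front.

TENANT_FIELD = "tenant_id"

def partition_for(tenant_id: str) -> str:
    """Every tenant maps to exactly one partition, chosen deterministically."""
    return f"tenant_{tenant_id}"

def build_search_request(tenant_id: str, query_vector: list[float]) -> dict:
    """A search is always scoped to the caller's partition AND carries a
    metadata filter, so a routing bug alone cannot leak another tenant's data."""
    return {
        "partition": partition_for(tenant_id),
        "filter": f'{TENANT_FIELD} == "{tenant_id}"',  # defense in depth
        "vector": query_vector,
    }

req = build_search_request("acme", [0.1, 0.2])
```

The doubled check (partition scope plus metadata filter) is a common belt-and-suspenders pattern: either mechanism alone enforces isolation, together they tolerate a mistake in one.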
And so we can deterministically build our vector databases and our RAGs according to each customer's data. Things get more complicated when we move to partitioning for user access control, or user permissions, because now we're at the level of which chunks can be given to any given user. Even if we're looking at the users of a given company or customer, or at a generative AI application being built within the enterprise, the permissions on a per-user basis, and how they map to these chunks, are rather ambiguous. It becomes a very complex subject where partitioning may not be the best solution. So where we get into trouble is with permissions complexity: just the number of different ways that permissions show up.
We get into trouble with chunked data, because typically in the enterprise we're used to dealing with objects or database tables, and when we're dealing with chunks there are some specific considerations we need to take into account. And the other thing is service accounts: we have agents that are making requests on behalf of users. So if we start with permissions complexity, this gets back to the last company I was with, which was in the SASE space, or secure access service edge, where we were brokering access to applications within the enterprise.
One of the things we found was just the vast array of different types of authorization that enterprises want relative to their applications. Whether it's users, different policies, policies based on time of day, policies based upon whether you're an admin of the application, or a developer, or simply a user, there's a vast array of possibilities. And as we try to replicate that vast array of permissions, we end up with a lot of complexity, particularly with some of the up-and-coming authorization schemes out there, like relationship-based access control, or ReBAC, where we have dynamic privileges, and they need to map to what would end up being static permissions within the vector database. So do you have the right metadata keys to partition on? Do you have the right keys to be able to partition based upon what these access control rules need to be? It's very hard to determine that upfront when you're designing your application.
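The mismatch between dynamic ReBAC decisions and static metadata filters can be sketched like this. The relation tuples and the metadata keys (`owner`, `shared_with`) are invented for illustration; the point is that the dynamic check re-evaluates relationships on every request, while the filter string can only refer to metadata keys that were baked in at ingest time.

```python
# Sketch: dynamic ReBAC check vs. a static metadata filter.

def rebac_allows(user: str, relations: set[tuple[str, str, str]], doc: str) -> bool:
    """Dynamic check: walk (relation, user, object) tuples at request time."""
    return ("owner", user, doc) in relations or ("viewer", user, doc) in relations

def static_filter(user: str, groups: list[str]) -> str:
    """Static check: a boolean filter over metadata keys fixed at ingest.
    A relationship type that wasn't baked into the metadata can't be used."""
    group_list = ", ".join(f'"{g}"' for g in groups)
    return f'owner == "{user}" or shared_with in [{group_list}]'

relations = {("owner", "bob", "doc1")}
assert rebac_allows("bob", relations, "doc1")      # dynamic: allowed today
relations.discard(("owner", "bob", "doc1"))        # relationship revoked...
assert not rebac_allows("bob", relations, "doc1")  # ...dynamic check updates
# ...but the metadata the static filter runs over does not, until re-ingest.
print(static_filter("bob", ["finance", "eng"]))
```

This is the upfront-design problem Rob describes: the filter's vocabulary is frozen when the collection is built, while the authorization model keeps evolving.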
Another big problem is chunked data. When we look at the source data, the documents that the data we're putting into our RAG comes from, it's very well organized. Like I said before, we're dealing with files or objects or database tables, and each of those has a relatively well-defined set of permissions and characteristics that we can use to access it. As soon as we get into the realm of RAG, and we've taken the data out of all of those different documents and chunked it up, it looks vastly different than it did before. And this leads to some complexity in how we deal with permissions on chunks, and how we map those permissions from the objects, because it isn't just a one-to-one mapping anymore.
Chunks can show up in multiple documents. Think about copying the same document to multiple SaaS applications within an enterprise, or users cutting and pasting from a document they received in email and sending it to somebody else in a different form, and that's ingested into the RAG as well. Or maybe we're file sharing, and so on. There ends up being a very large amount of duplication of data at the chunk level.
We saw that at Riverbed, and other companies within the storage domain have seen it as well: they're able to get 50 to 90% reduction of data, either stored or sent on the network, just by deduplicating it as it's moving on the wire or as it's being stored. And that same effect is applicable to RAG, because we're doing the same level of chunking. So if you have, say, a document that gets moved to different SaaS applications, and you're ingesting the data from those SaaS applications to populate your RAG, well, what if one of those applications allows Bob access to the document and the other one doesn't? How do you resolve the permissions on a per-chunk basis when that chunk exists in both of those documents? Similarly, if you have a chunk that shows up in multiple documents, and we'll see this in the example I bring up later with the Apple 10-Q statements, we'll see a lot of data that's in common across multiple different 10-Q filings, and the permissions on those, if you inherit them directly from the documents, become ambiguous. We have to be able to resolve which permissions we accept: do we allow Bob access to these chunks, or do we deny access to each of them? But even when we get the permissions per chunk correct, we have to deal with another big issue, which is that Bob doesn't access the RAG directly. It's the agent, the chatbot, the front-end service, if you will, that's accessing the RAG on behalf of Bob, and the agent is using a service account. So how do you authorize the agent on a per-user basis to access the data in the RAG? Well, one way to do it is to spoof the user's identity, but then you end up with a can of worms in terms of the security issues that are going to come up.
Once you give a service or an agent the ability to impersonate a user, then anything that user can do is fair game for that agent. And if it's compromised by a malicious third party or a malicious insider, they can use that spoofing ability to get basically anywhere within your infrastructure. And that's a really bad thing. So these three problems end up being very significant to how we need to deal with security and governance of data access within our RAG systems. And that's what my company, Caber, is all about: providing governance of AI data, but at the level of the chunks that we can see in RAG or moving on the network.
By connecting each of these chunks back to the different objects they come from, and looking at how they're moving, we can overcome some of these issues and provide some significant benefits in terms of governance. I'll tell you more about the product later on, in terms of how it deploys and what it looks like, but the basic idea is that we're tracing the lineage of the byte sequences that comprise these chunks as they're moving in APIs and as they're being consumed out of storage. From that we can build a graph, essentially, of how this data moves in the application. By doing that, we can associate the metadata that we see at each point within the graph with the data, and then analyze that graph, both deterministically and with AI, to determine which set of permissions should go on each chunk, and also to attribute which user is actually going to receive them.
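The lineage idea can be sketched in a few lines: fingerprint the byte sequence of each chunk observed moving through the application, and record an edge from that fingerprint to the source object it was seen in. This is an editor's illustration of the concept, not Caber's implementation; the normalization step and field names are assumptions.

```python
import hashlib
from collections import defaultdict

def fingerprint(chunk: str) -> str:
    # Normalize whitespace so trivial re-serialization doesn't split lineage.
    normalized = " ".join(chunk.split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def build_lineage(observations: list[tuple[str, str]]) -> dict[str, set[str]]:
    """observations: (source_object, chunk_text) pairs seen moving in the app.
    Returns a graph mapping each chunk fingerprint to the objects it appears in."""
    graph: dict[str, set[str]] = defaultdict(set)
    for obj, chunk in observations:
        graph[fingerprint(chunk)].add(obj)
    return graph

obs = [
    ("10q_2023.pdf", "Forward-looking statements involve risks."),
    ("10q_2024.pdf", "Forward-looking  statements involve risks."),  # same boilerplate
    ("10q_2024.pdf", "Gross margin commentary unique to FY2024."),
]
graph = build_lineage(obs)
boilerplate = fingerprint("Forward-looking statements involve risks.")
# The shared chunk links back to BOTH filings; the unique one to only 2024.
assert graph[boilerplate] == {"10q_2023.pdf", "10q_2024.pdf"}
```

Once a chunk maps to multiple source objects, the metadata of all of those objects becomes candidate metadata for the chunk, which is exactly where the resolution problem discussed next comes from.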
So, as an example, let's look at those Apple 10-Q statements, and I'll show you some of the problems I've talked about. Let's imagine that the 10-Q statements from Apple from 2023 are public, so anybody can access them, and that the 2024 statements are hypothetically non-public, or material non-public information. If you're in a public company, you probably know about all of the SEC rules on insider trading. They take this very seriously.
Companies must provide strict access control to any non-public information, and anybody that has access to that non-public information must be registered with the SEC, with an audit trail of all of their accesses provided. That's all to prevent insider trading. But if you look at the data within these documents, particularly as you ingest them into a RAG system, you'll see that there's a lot of common boilerplate, a lot of common verbiage, that exists across each of them. And the differences in the data between these two sets of documents are unpredictable.
It's very hard to turn an AI onto each of these documents and have it classify what data is public and non-public; it just doesn't have that information. These two documents, if you tried to classify them, would classify exactly the same. They're both 10-Q statements, and in fact that's exactly what they are. But at the chunk level, how do you determine what a user should and shouldn't be able to access? So what we can do with Caber is map the lineage of all of these chunks back to the documents they come from.
We get a graph that looks something like this, where each of the big blue dots on the bottom is an object, the 10-Q statements, the PDFs themselves, that contain all of this data, and the red dots at the top are the common sets of chunks that exist across those documents. You can see not only that there is a huge number of chunks in common, but that they're distributed in an interesting way: which chunks are common varies across the documents, so it's not even predictable which chunks we're going to see in common. But by being able to dynamically map the lineage of each of these, as I mentioned before, we can attribute the authorization, or inherit the metadata, if you will, that belongs to each of these documents to each of those chunks, resolving these permissions either by precedence, who wrote the data first, which one came first, or by prevalence.
If a chunk shows up a hundred times in public documents, well, maybe we should assume that it's a publicly accessible chunk. And if it only shows up in one document that's not public, then we can make our decision based upon prevalence. Or perhaps, by policy, you want to say that least privilege wins, and we're always going to choose the most stringent access controls, regardless of prevalence or precedence. The other thing that we can do is look across the different types of policies that exist on documents. For example, if you pull down a permissions object for an S3 bucket, and you pull down a permissions policy for a document in Salesforce, these are going to be completely different.
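The three resolution strategies just named, precedence, prevalence, and least privilege, can be sketched as a single function. The three-level label ordering (public < internal < restricted) and the oldest-first convention for precedence are assumptions for illustration.

```python
# Sketch: resolving a chunk's effective permission when it is inherited
# from several source documents that disagree.

STRICTNESS = {"public": 0, "internal": 1, "restricted": 2}

def resolve(labels: list[str], strategy: str) -> str:
    """labels: one access label per source document the chunk appears in,
    ordered oldest-first."""
    if strategy == "precedence":
        return labels[0]                                   # first writer wins
    if strategy == "prevalence":
        return max(set(labels), key=labels.count)          # majority wins
    if strategy == "least_privilege":
        return max(labels, key=lambda l: STRICTNESS[l])    # strictest wins
    raise ValueError(f"unknown strategy: {strategy}")

# A chunk seen in three public 2023 filings and one restricted 2024 filing:
labels = ["public", "public", "public", "restricted"]
assert resolve(labels, "prevalence") == "public"           # mostly boilerplate
assert resolve(labels, "least_privilege") == "restricted"  # strictest policy wins
assert resolve(labels, "precedence") == "public"           # the 2023 copy came first
```

Note that the three strategies give different answers for the same chunk, which is why the choice among them has to be an explicit, auditable policy decision rather than an ingest-pipeline accident.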
And this is where AI comes in very handy, because we can look at these disparate sets of policies and reconcile them, normalize them, so that we can then apply these different precedence-based or policy-based deterministic resolutions. Now, the other problem, connecting the agent's API calls back to the user, is something we can do by just tracing the data that we see in these API calls, or rather in the requests and the API responses. A very simple example: if Bob enters a prompt, the data of that prompt goes to the agent, and the agent uses it to do a lookup in the RAG, in the vector database. So we can connect the two different API calls together because they carry the same data, and we can then determine that the data coming back from the RAG is going to be used for Bob's request.
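This content-based correlation can be sketched as follows: match the digest of the user's prompt against the digest of the service account's vector-store query, and attribute the query to the human whose prompt carried the same bytes. The call-record format here is invented for illustration.

```python
import hashlib

def digest(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:16]

def correlate(calls: list[dict]) -> dict[str, str]:
    """Link each vector-DB query (made by a service account) to the user
    whose prompt carried the same content."""
    prompt_owner: dict[str, str] = {}  # content digest -> human user
    attribution: dict[str, str] = {}   # query call id -> human user
    for call in calls:
        if call["kind"] == "user_prompt":
            prompt_owner[digest(call["content"])] = call["principal"]
        elif call["kind"] == "vector_query":
            # The agent reused the prompt text, so the digests match.
            user = prompt_owner.get(digest(call["content"]))
            if user:
                attribution[call["id"]] = user
    return attribution

calls = [
    {"id": "c1", "kind": "user_prompt", "principal": "bob",
     "content": "How has gross margin evolved?"},
    {"id": "c2", "kind": "vector_query", "principal": "svc-agent",
     "content": "How has gross margin evolved?"},
]
assert correlate(calls) == {"c2": "bob"}  # the query is Bob's, not the agent's
```

A real system would have to handle paraphrased or partially transformed content rather than exact matches, but the attribution principle, follow the data instead of spoofing the identity, is the same.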
Then we can look at that and determine whether we want to allow or redact some of those chunks, based upon the fact that the user is Bob and not the agent. So let me show you an example of how this all works within the system. Give me just a moment as I switch the screen to the demo here. I have to use private browsing since I'm accessing the same endpoint. Oops.
All right. Up on the top, you see we have user Amy logged in, and this is, again, a RAG built with these Apple 10-Q statements. Just as I mentioned before, the 2024s are material non-public information, and the 2023s are accessible by anybody. Now, user Amy, let's say she's the CFO of Apple. So of course she has access to all of this data, and she wants to ask a question about how things have changed from 2023 to 2024.
So she asks, "How has Apple's gross margin percentage evolved from 2023 to 2024?" and submits that to the chat agent. This is actually built upon a RAG demo that Zilliz has built. There's a little bit of latency here in accessing OpenAI, which we're using for the LLM. But as you see, what comes back is an answer based upon all of the data: Apple's gross margin has shown notable evolution from 2023 to 2024, a complete answer relative to her question. If we look over here on the left at what data was actually pulled out of the RAG, we see she's gotten a pretty complete set of information across these different documents. We see data here for 2024 and for 2023 as well. Now, let's look at user Bob.
He asks essentially the same question: "How has Apple's gross margin percentage evolved from fiscal 2023 to 2024?" Now, Bob doesn't have access to the 2024 statements, so when he asks this question, we'll see what happens. Again, remember, we're using this question to pull the data out of the RAG, and before it gets sent up to OpenAI, we inspect that data, redact at the chunk level, and determine whether Bob should have access to it. And so you see Bob's answer is as fluid as he would expect of any answer, but it does say, "the provided context does not include the specific gross margin percentage data from fiscal 2023 to 2024," and inference proceeds from there. And if we look at the lines retrieved from the RAG in this case, well, we're redacting quite a bit of information that's coming from the 2024 statements.
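The enforcement step shown in the demo, filtering retrieved chunks against the human user's entitlements between retrieval and the LLM call, can be sketched like this. The ACL structure and group names are assumptions for illustration.

```python
# Sketch: chunk-level redaction between vector-store retrieval and the LLM.
# Chunks come back under the agent's service account, then are filtered
# against the *human* user's entitlements before the prompt is assembled.

CHUNK_ACL = {
    "chunk-2023-a": {"public"},        # 2023 filing content: anyone
    "chunk-common": {"public"},        # boilerplate shared across years
    "chunk-2024-a": {"mnpi_cleared"},  # 2024 material non-public information
}

def redact(retrieved: list[str], user_entitlements: set[str]) -> list[str]:
    """Keep only chunks whose required groups intersect the user's groups."""
    return [c for c in retrieved if CHUNK_ACL.get(c, set()) & user_entitlements]

hits = ["chunk-2024-a", "chunk-common", "chunk-2023-a"]
amy = {"public", "mnpi_cleared"}  # CFO: registered for non-public information
bob = {"public"}                  # no 2024 access

assert redact(hits, amy) == hits                              # Amy sees everything
assert redact(hits, bob) == ["chunk-common", "chunk-2023-a"]  # 2024 chunk redacted
```

Because the filter runs after retrieval, the vector search itself needs no per-user partitioning; the deterministic decision is made on the retrieved set, which matches the behavior described in the demo.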
But we're allowing both the common chunks that exist across 2023 and 2024, as well as all of the chunks that exist in the 2023 statements. And so, by doing this at this very granular level, we can deterministically enforce permissions and make sure that user Bob only has access to the data that he should have access to. This is, of course, one of the key issues a lot of enterprises are facing right now, but it's not the only issue in generative AI that we can look at in terms of governance. Another issue is cleanliness of data. And we know that this is significant in terms of how we build our LLMs, how we train the LLMs with data.
If we don't have clean data that's free of duplicates, we can end up with some pretty biased behavior from the LLM. Now, if you build your RAG with chunks that contain duplicate data, and there's a significant amount of duplicate data in those chunks, that's going to be a significant issue as well, because every time somebody asks a question, you may just be retrieving a hundred chunks of information from the vector database that all contain similar information, and the result that you're going to get is going to be heavily biased because of the duplication, the repetition, that exists there. So being able to see how the data moves, see where it comes from, and understand the lineage of the data starts becoming important, not just for access control, or for ensuring Bob doesn't have access to data that he shouldn't, but also for understanding at the chunk level how the data is being moved into the vector database. We can do that, again, by tracing the lineage of these chunks. We can see how data moves within a given application, in this example from the time user Amy uploads a document, say through a front end or a chat application, to where it's stored in an S3 bucket, and how that data also moves through other services and other S3 buckets, to provide a complete view of what's going on in terms of data movement within the application.
That gives us the ability to understand where duplication of data exists, not just by looking at documents that are exact copies from one place to another, but down to the granularity of a paragraph of text that was copied from this document and pasted into a completely different document. We can show that and provide visibility into duplication of data at the sub-chunk level. Being able to do that across all of the flows within our AI applications gives us a powerful capability to observe how data is moving at this very granular level, and to analyze and debug how our chat agents are working. And this is particularly important when we start talking about agentic AI, where we are giving autonomy to the agents to perform multiple actions on behalf of the user, and it's not always clear what happens in all of those steps in the middle, between the time the user inputs their prompt and the time they receive a response.
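Sub-chunk duplication detection of the kind described here is often done with word shingles: a paragraph pasted into an unrelated document still produces overlapping shingles, so the shared content shows up even when neither document is a copy of the other. The shingle size of 8 words and the Jaccard similarity measure are standard but arbitrary choices for this sketch.

```python
# Sketch: detecting duplication *below* chunk granularity with word shingles.

def shingles(text: str, k: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def overlap(a: str, b: str) -> float:
    """Jaccard similarity over word shingles: 0.0 = disjoint, 1.0 = identical."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

doc_a = ("risk factors may cause results to differ materially from expectations "
         "as described in our annual report")
doc_b = ("pasted text: risk factors may cause results to differ materially from "
         "expectations as described in our annual report plus new commentary")

assert overlap(doc_a, doc_a) == 1.0
assert overlap(doc_a, "entirely unrelated text about something else here") == 0.0
assert overlap(doc_a, doc_b) > 0.3  # shared paragraph detected despite edits
```

Production systems typically use rolling or content-defined hashes instead of exhaustive shingle sets for efficiency, but the detection principle is the same.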
What are all the steps the agent took to come up with that response? The SQL queries it may have made, the accesses to storage systems, maybe even accesses to internet sites. Here we can get a complete trail of exactly what happened, for analysis, observability, and ultimately governance of that data. And this is ultimately what Caber is about: providing the ability to inspect that data, the flow of that data through these agentic AI applications, to determine whether we're meeting our privacy goals and our compliance goals relative to policy, and to be able to conversationally inspect that data such that we can detect issues in our applications and resolve them. So that's really where we see the power of this. And the next step is really to bring this to market.
So we're a very early stage company, and we are looking for people that want to try this out. If you're interested, please do get in touch with me; I'd love to talk with you some more. So now I'm going to open it up to questions and turn it back to Stefan. Yeah, thanks so much, Rob.
Really great presentation. I think this is such an important aspect of the journey of AI infrastructure to develop. I'm not so much a data engineer myself, but as an applied researcher at Meta, I saw the importance of things like access control and related concerns like tracking where your data comes from. You just can't run the business without it. So I think it's super important.
So we've got time for questions, quite a bit of time actually. Let's open it up to the audience. So, we've got a question. Oh, well, this one is for Zilliz: would zilliz.com/resources be the place to go for more information on partitioning Milvus vector databases? So, I think potentially yes; we need to develop some more content that shows the integration of Caber and Milvus, so that'll be coming in the near future. Have you got any thoughts on that, Rob? Yeah, actually, I should post the code for this demo. We can integrate into existing chat applications via a Python SDK at this point, and we can also integrate via proxy, so we're just inspecting the API traffic going in and out of the agents.
And in the future we'll have other integrations and other language support. Got it. So, anyone else, do you have any questions? Also, if you want to actually speak your question aloud, I can allow you to talk from the participants list. Oh, Flory... oh, sorry. Okay, great.
We heard about a really important topic from Rob. Yeah, so maybe a question for you, Rob, from myself. Sure. So, what do you see as the timeline for adoption of this kind of component in the infrastructure pipeline? Do you think that within, say, three years this is just going to become a standard part of the workflow? How has the traction been so far? Well, the big driver at this point is agentic AI, and I think we can look at some of the predictions of adoption of agentic AI within the enterprise. That's the course that we're charting as well.
That's where the complexity comes in, right? Today we have some basic problems that we can solve, and I've talked to a number of customers that are certainly concerned about the permissions, but it isn't necessarily a priority. But as we get to agentic AI, and we lose visibility into the types of requests and the types of data getting pulled into these agent responses, just from a privacy and compliance perspective it becomes incredibly important to be able to understand how that data is being used, whether it's being used in a responsible way, and whether it's being used for the user that is going to receive the benefit of the knowledge of that data. That's really what we're betting on. And relative to the timeframe: every time I try to predict how fast things are going to move in GenAI, I am wrong.
Yeah, I think we all are. Some good points there. So we've got a question from Arthur: what are the advantages of partitioning data instead of creating collections of that data? I don't know if this is the right question for me to answer.
I mean, in a sense, I look at the ability to partition and the ability to create collections as two sides of the same coin. Do you have a thought on that, Stefan? Yeah, I'm not sure I can say anything super insightful, but the two abstractions are very closely related. Maybe collections is just an abstraction that doesn't quite match up to the intended purpose of this access control. And is there also some aspect of partitioning that could imply some physical separation between the data as well? Yeah.
I mean, to the extent that you can determine what that partitioning should be, or what the bounds of a given collection should be, right? If it's customer A's data, or it's clearly finance data versus HR data, then you can be relatively effective in using those boundaries to define governance rules. To the extent that there starts to be overlap between those, where those lines start getting blurred, is where we start running into trouble.
Right. For example, a resume uploaded to HR might exist in your SharePoint as well. So is it HR data, or is it just data that somebody has sent within an email to an engineering manager, for example? And that's where we have to start being careful.
Got it, got it. So it looks like we've got a follow-up question. Well, actually, I'm just wondering: this question from Arthur, does it relate to the initial question or to Rowan's question? Let's see. What are the boundaries for them? What are the boundaries? Oh, okay. Just a comment.
Ah, okay. Well, let's take Rowan's question then. Okay. Yeah. Perfect.
So, this chunk-level access control can introduce additional application latency. Do we have a quantitative understanding of the performance cost of adding this additional layer to the inference stack? Yeah, when we look at the demo I just gave, the vast majority of the latency is in the requests to the LLM. Basically, what we do for enforcement is a lookup into a database, so the latency we add is essentially the same as you would get from the lookup into Zilliz. Yeah. It's a very small portion of the overall transaction from prompt to response.
Yeah, that's good to hear. And I think also, in many circumstances, you might legally have to do this in any case. But yeah, good to hear it's just a small part of the total latency. So we've got a question from Vincent.
Permissions to the data and chunks change over time. How does Caber support dynamic changes of access control, and how frequently does it synchronize with other access control systems? Yeah, that's a great question. In terms of the data we're looking at, we look at the source objects, and this is something a customer would configure, to say where we should be looking for updates. For example, each time permissions or an object is updated within Salesforce or an S3 bucket, we'll be ingesting the log data, the event data, to determine that there was a change and we should go re-scan. So it's really on demand that we do that. And obviously, if you want to set timeframes around it, that's something we could do as well.
So, for example, ingest the logs, but don't actually do anything until the end of the day, so we don't impact the performance of the other applications in the environment. Got it. And what can you tell us about whether Caber integrates with other access control systems? Yeah, in fact, that's an important part, right? We need to integrate with the IdPs from a user identity perspective, to pull in their privileges, the groups they belong to, what their permissions may be.
And we need to be able to integrate with all the different sources of data. Now, from the perspective of actually reading and gaining access to the data itself, that's relatively straightforward; there are quite a lot of ingest projects we're able to take advantage of there. The place where we're actually having to do work, and continuously expand our integrations, is relative to the permissions themselves.
I guess you could say this is another driver for use of the product: as we're using these ingest pipelines that can pull data from everywhere, we have to be cognizant that those pipelines were not developed to pull permissions. They were developed to ingest data. So being able to go and grab that metadata is something we're working on as we expand the number of integrations we support today.

Got it. Fantastic.
So, an additional question from Vincent: what are the pros and cons of using access control at the granularity of chunks versus files, especially if the access control changes frequently?

Yeah. In terms of our processing of changes to permissions, that doesn't happen in real time; there is a lag. So if the permissions on a file change, say somebody sets the ACLs to restrict access, we won't pick that up immediately. Those ACLs, of course, take effect immediately in the storage system itself.
So if you're looking at just pure access control on files, that would be your primary mechanism from an access control perspective. Access control on a per-chunk basis allows you to aggregate, analyze, and resolve the permissions that may exist across the multiple files, the multiple objects, where a single chunk may exist, and then determine whether you should be able to access that chunk or not. A very common example I can use is the company logo in a PDF: the image itself is going to be everywhere. Your file permissions are going to restrict access to the document that logo may be in, but anybody can access the logo itself; you probably have it posted on the internet for people to download from a PR perspective. So the advantage of doing it at the per-chunk level is that we can separate out the data that's unique to the file, which somebody should or shouldn't have access to, from the data that exists in so many places that you may not need to put permissions on it at all.
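To make that concrete, here is a minimal sketch of per-chunk permission resolution (a hypothetical illustration, not Caber's actual implementation; the function and data names are invented). Under a union policy, a chunk that appears in several documents is accessible to anyone who can read at least one of them, so a widely duplicated logo becomes effectively public while file-unique content stays restricted:

```python
from collections import defaultdict

# Hypothetical sketch: resolve per-chunk access as the union of the
# ACLs of every source document the chunk appears in.
def resolve_chunk_acls(doc_acls, doc_chunks):
    """doc_acls: {doc_id: set(users)}; doc_chunks: {doc_id: [chunk_id, ...]}."""
    chunk_acls = defaultdict(set)
    for doc_id, chunks in doc_chunks.items():
        for chunk in chunks:
            chunk_acls[chunk] |= doc_acls[doc_id]
    return dict(chunk_acls)

doc_acls = {"memo.pdf": {"alice"},
            "press_kit.pdf": {"alice", "bob", "carol"}}
doc_chunks = {"memo.pdf": ["logo", "salary_table"],
              "press_kit.pdf": ["logo", "press_release"]}

acls = resolve_chunk_acls(doc_acls, doc_chunks)
# The shared "logo" chunk resolves to everyone who can read any containing
# document; the memo-unique "salary_table" chunk stays restricted.
```

Other policies (for example, intersection, or deny-wins) are equally possible; which resolution rule applies is a design decision the speaker does not specify here.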
Got it. My thoughts run along very similar lines. The chunk is the fundamental unit of information that you're pulling from your vector database to put into your LLM for RAG, so it seems natural to put the access controls on the chunks rather than the files. And with longer context windows in embedding models, maybe chunks will end up the same length as documents, so your document and your chunk become the same thing, because you've got these million-token context windows.
Let me become a little philosophical here for a moment. If you want to think about the identity of data: we have a very well-defined perspective on the identity of users, from usernames and passwords to biometrics to MFA; we have a hundred different factors we can use to positively identify a person. When it comes to data, we have very few. We have the name of the file and the ACLs that exist on the file. But if you look at the ones and zeros, the bytes that comprise those files themselves, what is the identity of that? The advantage of doing it at a chunk level is being able to uniquely identify those chunks of data.
What is this chunk, who should be able to access it, and what should be its acceptable use? That's a new concept; people haven't thought about data this way before. And now that we have generative AI consuming data that way, it's forcing us to think about data in that respect, right?

Yeah, totally. So to clarify: are you saying that the embedding of the chunk is, in a sense, one of the best signals of what the data is and therefore who owns it? Or are you talking literally about the zeros and ones of that chunk?

Well, it comes down to uniqueness. If we think about DNA, for example, about 70% of human DNA is shared with a banana. So if we look at the DNA, what part of it is actually our identity versus a banana or an orangutan or something else? It's a very small portion, right? So if we wanted to boil down a file to its essence, to the chunks of data, the byte sequences, that uniquely identify that file or uniquely form its characteristics, we have to look at the chunk level, because the percentages we see with data deduplication are very similar to what we see with DNA.
Hmm, yeah. That's a really good point. So I've got a question from Rowan.
How dynamic is Caber? If chunking strategies change, will the chunk-level ACLs change too? Since the chunks will not be the same, they might overlap in some ways. And if you have two chunks with different ACLs, what happens in that case?

Yeah. So the chunking and analysis that Caber does to build our index is similar to what happens in building the vector database, and we can ingest data faster than it takes to do the embeddings and tokenization and build the vector database. So if you're changing your chunking strategy and rebuilding your vector database, you would rebuild the Caber index alongside it. It's very helpful for us to know what the chunking strategy is; we want to align with the chunking strategy you're using in the vector database.
But at the same time, we want to be more granular than what you have in the vector database, because then we can find commonality within different chunks that, in total, are not the same. Two different chunks of a thousand bytes may be completely different, but contained within them are, say, 256-byte sequences that uniquely come from a document somebody doesn't have access to. We want to be able to identify it at that level, to help with the issues I brought up before: how much data do you have in your RAG pipeline that is duplicated even at the sub-chunk level?

Yep, totally. So the cost is very low to build this ACL index; you can just do the whole thing again.
Well, I don't want to mislead anybody: the cost is there. In terms of compute, it is a compute-intensive task. It's just that, speaking in relative timeframes, we can easily build our index within the time you would be building the vector database.

Got it.
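The sub-chunk matching Rob described could be pictured roughly like this (a hypothetical sketch, not Caber's algorithm; the names and the small window size are invented for illustration, whereas his example used 256-byte sequences): fingerprint every sliding byte window of a restricted document, then flag any RAG chunk that contains a window with a matching fingerprint, even when the chunks as a whole differ.

```python
import hashlib

WINDOW = 16  # illustrative window size; the talk's example used 256 bytes

def fingerprints(data: bytes, window: int = WINDOW) -> set:
    """Hash every sliding window of `data`. Naive re-hashing per window;
    a rolling hash (e.g. Rabin-Karp) would make this linear in practice."""
    return {hashlib.blake2b(data[i:i + window], digest_size=8).digest()
            for i in range(len(data) - window + 1)}

def contains_restricted(chunk: bytes, restricted_fps: set) -> bool:
    """True if any window of `chunk` matches a restricted fingerprint."""
    return not fingerprints(chunk).isdisjoint(restricted_fps)

restricted_doc = b"SECRET: Q3 layoffs planned for engineering org"
restricted = fingerprints(restricted_doc)

chunk_a = b"The weather report says sunny skies all week long."
chunk_b = b"Note that Q3 layoffs planned for engineering org leaked."
# chunk_b embeds a byte sequence from the restricted document, so it is
# flagged; chunk_a shares no window with it and passes.
```

This also illustrates the compute point above: fingerprinting every window of every object is real work, comparable in spirit to the embedding pass that builds the vector database.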
So, have we got any more questions? Just before we wrap up, we've got one more question from Vincent: where is the access control info for the chunks kept in Caber? In the metadata of the chunks?

These are some really great questions. The metadata we keep is in Elasticsearch, if it really matters. We're keeping track of all of the chunks, and the metadata associated with the chunks, on a per-event basis. We're actually collecting a significant amount of data within the environment: every time we see a chunk accessed via an API, we keep the metadata for that API request and associate it with the chunk for that event.
Then we can do analysis on the fly, which keeps it very flexible in terms of how we apply these. Now, in doing enforcement, we want to handle things a little differently: we want to resolve permissions and then put those permissions, and their association with the chunks, into a very fast database so we can retrieve them in near real time. But that runs in parallel to the other analysis that we do.
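The enforcement tier he describes amounts to a fast lookup of pre-resolved permissions at retrieval time, filtering the chunks a RAG query returns before they reach the LLM. A minimal sketch (hypothetical; a plain dict stands in for the fast database, and all names are invented):

```python
# Hypothetical enforcement sketch: pre-resolved chunk permissions live in
# a fast key-value store (a dict here), consulted at retrieval time.
resolved_acls = {            # chunk_id -> set of allowed users
    "chunk-001": {"alice", "bob"},
    "chunk-002": {"alice"},
    "chunk-003": {"alice", "bob", "carol"},
}

def filter_retrieved(user: str, retrieved_chunk_ids: list) -> list:
    """Drop any retrieved chunk the user's resolved permissions don't cover.
    Unknown chunks are denied by default."""
    return [c for c in retrieved_chunk_ids
            if user in resolved_acls.get(c, set())]

# bob's query retrieves all three chunks; enforcement removes chunk-002.
allowed = filter_retrieved("bob", ["chunk-001", "chunk-002", "chunk-003"])
```

Keeping resolution offline and lookup online is what makes this compatible with per-query latency budgets; the Elasticsearch event store feeds the analysis, while only the resolved results sit on the hot path.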
Got it. It sounds like a very efficient implementation. So, before we finish up, I just wanted to ask you, Rob: listeners out there have heard about Caber, and perhaps they want to integrate it into their product. What is the best way to contact you folks? Should they send you a direct email to rob@caber.com? Should they go to the website? How should people best get in touch with you?

People are welcome to contact me directly at rob@caber.com. At this point, I'm not getting a high enough volume of email that I'm going to have to turn anybody away, but depending on how long this video is up on the web, that may change. In that case, I would say send it to info@caber.com.

Got it.
Fantastic. Well, thanks again, Rob. Really interesting talk. I think it's really cool technology, and it's just going to keep increasing in importance and in how widespread it is.
So, really cool to hear about it. Thanks again for coming on the webinar, and we look forward to hearing from you in the future.

Well, thank you very much. I appreciate you inviting me to do this, and I want to thank everybody on the webinar for the great questions. I hope we can connect in the future. Thanks.

Thanks, everyone, for attending.
Okay, take care. Bye.
Meet the Speaker
Join the session for live Q&A with the speaker
Rob Quiros
CEO & Co-Founder, Caber Systems, Inc.
Rob Quiros is a technologist and entrepreneur with more than 30 years building and leading enterprise-class solutions in networking and security. His first job was to build a router to compete with pre-IPO Cisco. Later, he contributed to the creation of the first hardware RAID controller at UC Berkeley while working toward a PhD in Quantum Electronics. He has led the growth of $1B/yr product lines at Cisco and at WAN optimization leader Riverbed, and helped pioneer the now multi-billion-dollar Secure Access Secure Edge (SASE) market at Soha Systems, where he was an early team member and executive, and at Akamai. His current mission is to bring new solutions to market to meet the challenges of GenAI-native applications as CEO & co-founder of Caber Systems, Inc.