Milvus in Action: The Vector Database Behind Troop's Shareholder Activism
I'm pleased to introduce today's session, Milvus in Action: The Vector Database Behind Troop's Shareholder Activism, and our guest speaker Zen Yui. So this is Yui. Do you mind introducing yourself to everyone here? Of course. Hi Steffi, thanks so much for having me. My name is Zen Yui, I'm the co-founder and CTO at Troop.
Um, in short, we are a shareholder activism and proxy advisory company. We use machine learning to do a whole bunch of stuff in this space that we'll talk about soon. So thanks so much for having me. Of course. So firstly, we just really want to get to know Troop.
Could you provide a brief overview of Troop and the service you offer? Yeah, of course. If you don't mind, I've got some slides about this, so I could just get right into it. Yep. Sounds great. Sure thing.
So thanks everyone for joining. Very pleased to be here. And before I tell you about Troop, I just want to say that Troop and our engineering team owe so much of our success with machine learning and gen AI to listening to cool talks from other folks who are hacking in this space and with these tools. And so I'm hoping that today, in listening to what we've built and how we've solved some of our problems, you might be able to cherry-pick part of this for your own work and maybe unblock yourself in something that you're hacking at. Also, at the end of this talk I share my contact info. More than happy to link up with everybody and talk about shareholder activism, gen AI, or whatever it is you're working on.
And so today this talk is gonna be about using large language models as part of structured data pipelines, creating labels out of unstructured data at scale, and unlocking a whole world of machine learning potential for your company. We're also gonna be talking about using gen AI to compress data down to something that you can train smaller models with downstream, and maybe even layer in some traditional supervised learning techniques. And of course, how all of this is made possible: storing our embeddings at scale with Milvus. Before I get into how we do all of that, just a quick background about myself. Before Troop I was a big data engineer, working with kind of batch and streaming systems.
I was working with data corpuses around a hundred terabytes. And I define big data in two ways. One, it's something that can't be wrangled on one machine, or not easily. And two, it's just a size of data that's generally uncomfortable to manage, and where processing takes a lot of time and planning. And I find that the most worthwhile data problems don't fit in memory.
And yet so many of the hello world examples of doing RAG and a lot of these in-context learning techniques are very much one-node examples, and I understand why, using in-memory databases. And so today I want to talk about applying a lot of these techniques to larger distributed data: how you do it, how you can use Milvus with a data set that spans multiple machines, and an amount of data that you wouldn't wanna process all at once. And so with that, let's quickly talk about Troop. Troop builds technology that lets everyday investors govern publicly traded companies. And the reason why is that we all own stocks of these companies.
We are shareholders. Those shares come with a right to vote. And so it means that corporate governance is our collective responsibility. It's something that not a lot of us think about when we buy stocks on Robinhood, but those stocks come with voting power, and these corporations make important decisions that impact our lives. They even impact policy in our cities, et cetera.
And so, you know, the origin of Troop dates back to the blip in the matrix. I don't know who remembers this, but about two years ago a bunch of Redditors got together on a subreddit called WallStreetBets and collectivized around GameStop, a stock. And we were super excited about this moment in history because it represented the first true example of multiplayer finance: total internet strangers coming together and trying to solve a problem with money together. Maybe the GameStop movement wasn't the most successful instance of collective action in the markets, but it was the first very important, widely distributed one. What a lot of people don't talk about with the GameStop movement is that six months later, that same group of internet strangers actually managed to elect somebody as chairman of the board of GameStop at the annual shareholder meeting.
And that's wild. There isn't a whole lot of precedent for that. And so we thought that was super cool. Right around the same time, a small activist hedge fund called Engine No. 1 got ExxonMobil, a huge oil and gas company, to give up three board seats in the name of renewable energy. And that might not seem intuitive, but they actually made a strong financial argument: with the United States transitioning to electric vehicles, an oil and gas company needs a transition plan to stay competitive, and Exxon's peer companies have a transition plan.
And Exxon said, fine, you know, it makes good business sense. And so we at the Troop team took a step back and said, whoa, these two things are so cool. Like, you know, multiplayer finance and activism is just cool. It's very interesting, and it seems to be effective. So what if we built a community that helped internet strangers come together, pool their brokerage accounts, and actually do this on a repeated basis? And so out of that, Troop was born. Fast forward to today, we have two products.
So Troop, on the right side: we were first a shareholder activism community app, and it's very much what I've been describing. It's a community platform. You can go to troop.com or download our native apps, meet a bunch of other strangers who own the same stocks as you, and spin up brand new activism. You can do due diligence together, and if you get enough traction, you'll actually get linked up with Troop, our research and legal teams, who help you file all of that with the SEC and make it real. Two years into this project, one thing that we realized is that most people have significant wealth invested through bundled, passively managed financial products.
Think ETFs, mutual funds, your retirement accounts. And those are relationships that should be high fidelity. We should be talking to our asset managers, we should be voting those shares. But there isn't really a good communication mechanism. And furthermore, those asset managers really don't know what it is we care about. So today they vote with a set of voting directives that they build in-house, but those tend to not be representative of what we want.
And so Troop aims to use gen AI and machine learning to bridge that gap: to both help asset managers at scale understand what it is we want, and in turn automatically vote all of our passively managed wealth in accordance with our values. To understand the scale of this problem, and perhaps why gen AI is a good fit: a lot of people don't understand this, but the top 10 money managers alone manage $37 trillion that is spread across hundreds of millions of Americans, who never speak to these asset managers. And these asset managers are voting those $37 trillion on behalf of hundreds of millions of Americans at well over 3,000 publicly traded companies. Each of these companies has at least one annual shareholder meeting where they talk about a whole bunch of things that need to be voted on. They make important decisions, and there isn't really a good way to wrangle this.
It's so much information. Many of you that have been hacking in this space, or using gen AI tools, or watching the latest talks, are probably familiar with some cool work that folks are doing making chatbots that talk to earnings reports. There was a famous one that blew up on Twitter where somebody made a chatbot talking to the Tesla earnings report, asking about key risks, where the company did well, where the company could improve. And so we wanted to build on that research, but at a much larger scale, where the interface isn't chat, where we just figure out what it is that you care about, what the directives of an asset manager are, and how we can blend those two things together to basically automatically vote, based on what you care about, at scale, across all of these meetings, for all of these people. And to do this, of course, one of the most elegant solutions is well-trained models in AI. And so in short, like so many of you out there, we're using RAG to bulk-process SEC filings into structured features, summaries, and labels that we can then use downstream in fine-tuned models and special-purpose models that do voting, personalization, recommendation engines, and all of the above, just to make this a sane problem to manage and actually handle with traditional data science and data engineering tools.
And so today's talk is very much about, you know, I know you all see a lot of talks about RAG; this is the precursor step there. How do you actually make sense of terabytes and terabytes of information? How do you embed it? How do you organize it? And how do you set yourself up for success with RAG at scale? So let's start with the humble origins, the V1 of Troop's solution. I'm sure this looks familiar to a lot of you. What we were doing was essentially using LangChain (huge fan of that tool), and specifically using the vector database FAISS, which comes out of the Facebook team. It has implemented a lot of badass research around vector search, specifically searching, like, a billion vectors without actually taking the distance to every one.
And so we're a big fan of both of these projects. We wired them together and used their prebuilt question-and-answer solution basically to automate extractions from SEC filings, which are in PDFs. And in short, it worked. It was super fast. We were able to pull structured data out, and with a little bit more finagling,
we were able to get JSON and structured data out that's machine-parsable. But we ran into two key issues. One is that we were constrained to operating with only one filing at a time. The way that FAISS works is that you're responsible for chunking the information up, embedding it, and loading it into this local vector store.
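A minimal sketch of that V1 shape: chunk, embed, load into a local index, query. Everything here is a toy stand-in; the hash-based embedder and the brute-force index just mimic what a real embedding model and a local FAISS store would do, so the example stays self-contained.

```python
import numpy as np

def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping character chunks (the caller's job with FAISS)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in embedder: a bag-of-words hash, just to keep the sketch runnable."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class ToyIndex:
    """In-memory nearest-neighbour store standing in for a local FAISS index."""
    def __init__(self):
        self.vectors, self.payloads = [], []

    def add(self, chunks: list[str]) -> None:
        for c in chunks:
            self.vectors.append(embed(c))
            self.payloads.append(c)

    def search(self, query: str, k: int = 1) -> list[str]:
        # Brute-force cosine scoring; FAISS replaces this with ANN search.
        sims = np.array(self.vectors) @ embed(query)
        return [self.payloads[i] for i in np.argsort(-sims)[:k]]
```

Note the limitation the talk describes: everything lives in the memory of one process, and persistence is on you.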
You're of course limited by how much you can fit in memory on that machine. And two, although FAISS does let you create embeddings and then persist them for use later, the whole loading and unloading of your vector database from long-term storage is on you to manage and write all of that code for. And we thought to ourselves, there must be a vendor out there that is actually solving this at scale, where we've decoupled the vectors that we have in memory from the larger corpus of vectors that we have at rest. And so that brought us to searching for a solution that met a few specific criteria. So one, like many high-end, large-volume data tools, we were looking for something that separates compute from storage.
We wanted to have a way larger corpus and way more embeddings at rest than we ever needed to query at once. Even when we're querying several companies or several filings at the same time, we never need to have the entire SEC's history loaded in memory. So we were looking for a solution that separated these two items. Two, we wanted the ability to scale the work out. As I'm sure a lot of you face this problem: if you're working with traditional relational single-node databases or document databases, you kind of hit a tipping point with your data where queries start to run really slow, and the only way to make them go faster is to buy a bigger node. And so we were searching for solutions that actually let you scale out, you know, double your compute power by just adding more nodes.
So that was a pattern we were looking for. And then three: our application runs in Kubernetes, we're in Google Cloud, so ideally this would be something that we could self-host and run in our own Kubernetes cluster. You know, Troop, our community app, is very much a daytime app, and we have a ton of latent compute power. So it makes sense that we could run a lot of our data pipelines, for example, at nights and weekends, and take advantage of the compute power that's otherwise not being used.
And so we wanted to self-host it. The other thing is, one priority of mine is that when I'm warehousing and building large data sets at scale, I prefer to own the storage. So we were looking for something that ideally writes to a data store that we own. Ideally it would be a bucket and cheap storage, and ideally it would be written in a file format that's open source and common. And so, enter Milvus. Milvus elegantly separates compute from storage.
We're actually able to write through the MinIO adapter straight into a bucket, which I love. At its core, Milvus is a distributed system. It runs on etcd, and you're able to add more nodes and they communicate and share work with each other. And then finally, it was dead simple to install Milvus into a Kubernetes cluster. They've actually got a Helm chart.
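As a rough sketch of that install path (the repo URL, chart values, and bucket name here are illustrative and can differ across chart versions, so check the chart's values file for the exact external-storage keys):

```shell
# Add the Milvus Helm repo and install in cluster mode.
helm repo add milvus https://zilliztech.github.io/milvus-helm/
helm repo update

# Disable the bundled MinIO and point segment storage at our own
# S3-compatible bucket instead (bucket name is a placeholder).
helm install milvus milvus/milvus \
  --set cluster.enabled=true \
  --set minio.enabled=false \
  --set externalS3.enabled=true \
  --set externalS3.host="storage.googleapis.com" \
  --set externalS3.bucketName="our-embeddings-bucket"
```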
We got it up very quickly, and with a few configurations we were running live. And so I was pleased to find a solution like Milvus, because it fits very well into what I consider to be a mature, scaled, distributed database. So let's talk about, I mentioned that we've got way more data at rest than we actually need to query at once, but let's talk a bit about what it means to actually ingest this much information. If we have any data engineers in the room, this is gonna look a lot like your work, and I just wanna help everybody fit the new text-to-embedding-to-vector-store paradigm into this distributed data paradigm. So how does it work? If you're familiar with LangChain, you're probably familiar with two concepts.
They introduce a document store and a vector database, and a lot of folks are wrapping these into one solution. Some are actually co-locating the text payloads with the vectors, which is interesting. And some folks are actually just shoving all of this into a solution like Postgres. And look, if that works for your corpus, that's simple and it's awesome. If you're like us and it doesn't work, though, you really need to split these components apart into their respective pieces and use the best tool for the job.
And so, as I mentioned, we opted to chunk our text, write that chunked text corpus into a bucket, and partition it by time. One detail of SEC filings is that they're released every day on the SEC database and website. You're able to see, for example for the latest Tesla filing, what its ID is and what day it was filed. And so we actually partition our raw text chunks by time. We then send them to an embedding model, take back those vectors, and we've created a collection in Milvus that is also partitioned by time.
And so what we do is we actually write to a partition that matches the time partition of the raw text chunk. If you're familiar with Milvus, what you do is just write a bit of metadata into that Milvus vector document, and we're able to then query for vectors, pull that metadata, and find the corresponding raw text chunk for use in a model. And more generally, the key to using Milvus at scale, in my opinion: I mean, you can get off the ground in five minutes with Milvus and it'll actually manage partitioning for you, but if you're gonna write something of real scale and size, I think a key is to spend a lot of time thinking about your partition scheme, whatever that is. It could be time, it could be by customer ID, it could be by user ID, but think about how your data's gonna grow. And a key test for whether you're partitioning correctly is: if I wanna randomly access the vectors or text payloads of a given document, can I directly target the partition where that data's written? So in our case, we know the date of the filing, and I'm able to go and find those vectors.
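To make the partition-routing idea concrete, here's a small sketch with pymilvus. The field names and the write helper are hypothetical, and the Milvus calls themselves aren't executed here since they need a live server; the point is that the partition name is a pure function of the filing date, so both writes and reads can target exactly one partition.

```python
from datetime import date

def partition_for(filed: date) -> str:
    """Map a filing date to a Milvus partition name (our own scheme)."""
    return f"filings_{filed.year}_{filed.month:02d}"

def chunk_metadata(accession_id: str, filed: date, chunk_no: int) -> dict:
    """Metadata stored alongside each vector: enough to find the raw text
    chunk back in the time-partitioned bucket (field names illustrative)."""
    return {
        "accession_id": accession_id,  # filing identifier
        "filed": filed.isoformat(),    # matches the bucket partition key
        "chunk_no": chunk_no,
    }

def write_vectors(collection, vectors, metadata, filed: date) -> None:
    """Sketch of the write path with a pymilvus Collection; the exact insert
    payload shape depends on your schema and pymilvus version."""
    name = partition_for(filed)
    if not collection.has_partition(name):
        collection.create_partition(name)
    collection.insert([vectors, metadata], partition_name=name)
```

On the read side, knowing the filing date means a search can pass `partition_names=[partition_for(filed)]` instead of touching the whole collection.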
If you don't set any of this, Milvus's default partitioning is sort of a round-robin scheme, and it will beautifully divide up your work into small chunks. The only downside of this is, as I mentioned, at scale you probably won't be able to find just the partition where the data you're looking for is. So here's a quick diagram that illustrates how RAG works at Troop on top of Milvus, plus some custom code that we've written to make this a little bit more performant. We start with a query; perhaps we're looking for the list of board members being voted on at an upcoming shareholder meeting. What we do is embed the query and then submit that to a custom Milvus server that we've written, which sits between our Milvus clients and the Milvus server.
What this does is it basically keeps track of the partitions that we're trying to read from disk when we bring them into memory. And if no other Milvus client has requested that partition in a while, it eventually evicts it from memory. If any of you are running Milvus at scale in production at your companies, you may have faced this problem where two different Milvus clients load the same partition into memory and work with it. If one of those clients then evicts it from memory while the other one is still working, you can run into contention. And all of this, I'll mention, is to avoid just loading the entire collection into memory, which for our problem we can't afford to do.
And so we've written this custom Milvus server that basically manages the loading into memory and the eviction when nobody's using a partition. We then, like traditional RAG, pull the metadata off of those vectors, fetch the raw text payloads from the buckets, submit them to the LLM as context, and then we retrieve a result. This custom Milvus server is written in Go; it's actually something we've been talking about contributing to the Milvus project. Right now it's very custom-tailored to Troop's use case.
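Troop's server is written in Go; purely to illustrate the idea, here's the load-and-evict bookkeeping sketched in Python. The `load` and `release` callbacks stand in for Milvus's partition load and release operations, and the TTL sweep is the "evict when nobody has touched it in a while" behavior described above.

```python
import time

class PartitionManager:
    """Track last access per partition, load on demand, and release
    partitions nobody has touched for `ttl` seconds, so the whole
    collection never has to sit in memory at once."""

    def __init__(self, load, release, ttl: float = 300.0, clock=time.monotonic):
        self.load, self.release = load, release  # e.g. partition load/release calls
        self.ttl, self.clock = ttl, clock
        self.last_access: dict[str, float] = {}

    def touch(self, partition: str) -> None:
        """Called on every query that reads `partition`."""
        if partition not in self.last_access:
            self.load(partition)  # bring it into memory on first use
        self.last_access[partition] = self.clock()

    def sweep(self) -> list[str]:
        """Evict partitions idle longer than ttl; run this periodically."""
        now = self.clock()
        idle = [p for p, t in self.last_access.items() if now - t > self.ttl]
        for p in idle:
            self.release(p)
            del self.last_access[p]
        return idle
```

A real implementation also has to serialize load/release against in-flight queries to avoid exactly the contention described above; this sketch only shows the bookkeeping.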
But if that's something you're interested in hearing about, hit me up after the call; I'm happy to walk you through how it works. And as I mentioned, we are talking internally about contributing this to Milvus next. So, as I mentioned before, a key goal of Troop's machine learning program is creating a reasonable data lake of clean information from our raw data. You know, these PDFs can span hundreds of pages; some of them can be hundreds of megabytes large. And a goal here is to basically summarize and compress the information down into something much smaller, predictable, and manageable, especially for use in fine-tuning and custom-training smaller models downstream.
To that end, one thing we need to do is essentially create summarization prompts and pipelines that take large amounts of information and summarize them into much smaller payloads. And if any of you are working with this sort of workflow on top of a RAG architecture, you've probably faced a common question, either internally or from your stakeholders: how do you know if the summary's right? As a larger pattern, the question they're asking is how you are evaluating the efficacy and accuracy of this black-box extraction that you're running. We found a sort of interesting way to do this that I wanna share, and I think this is the beginning of showing the power of embeddings and how we think about embedding for a lot more than just search. So in the case of a summarization or data-compression prompt, one thing that you can do is take the result of your model, perhaps a paragraph summary of 20 pages of data, embed the summary, maybe run three variants of your summarization prompt and embed them all, and then take the prompt or the model where the average embedding of your result is most similar to the embedding of your source material. In plain English, what that means is that the topics and concepts and entities talked about in your summary are the most similar to those of the source material.
In a way, you've lost the least information. And so we think about summarization as a compression algorithm, and evaluating it should just be: does the summary topically talk about the same things? So at scale we're able to run multiple variants of summarization models and then evaluate with a machine which one's the most effective, by using embeddings. We then actually take those embeddings of the results and write them back into Milvus for another use case that we'll talk about in a moment.
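A sketch of that evaluation step, assuming you already have the embeddings in hand from whatever embedding model you use: average each prompt variant's summary embeddings and keep the variant whose average is closest to the source embedding by cosine similarity.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_best_summary(source_vec: np.ndarray,
                      candidate_vecs: dict[str, list[np.ndarray]]) -> str:
    """For each prompt variant, average the embeddings of its summaries and
    keep the variant whose average is most similar to the source embedding,
    i.e. the compression that lost the least topical information."""
    def score(vecs: list[np.ndarray]) -> float:
        return cosine(np.mean(vecs, axis=0), source_vec)
    return max(candidate_vecs, key=lambda name: score(candidate_vecs[name]))
```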
Similarly, and I'm sure many of you are faced with this: human labeling is expensive. Human evaluation of labels is very expensive. One thing that's interesting, though, is that the larger gen AI and ML community regards GPT-4 as pretty accurate for most general use cases. It's also the slowest model. It's expensive.
And so how can we leverage a more expensive, fancier model like GPT-4 to actually evaluate and train cheaper models? That's core to Troop and our architecture, how we work. And I'm sure that's the same for many of you. So similarly, we're able to use an expensive foundation model on a subset of the data to produce summaries, labels, and categories, embed those outputs, and then actually compare the result to the embedded output of much cheaper, fine-tuned or custom-trained models. And finally: as a data engineer and one-time data analyst in my career, data democracy has been a hot topic for a decade now. Introducing data democracy using embeddings: how do you basically give your entire team access? You've built this wealth of information, this huge data lake, all these embeddings, and the results are also embedded. How do you give your team access to it? For our transactional data, things coming in from Google Analytics, our application metrics, all of that is being fed into a traditional data lake in BigQuery and exposed with Looker.
But for something unstructured like text payloads, you need to basically give your team access directly to the RAG model, in a sense. And so in addition to Looker for internal data discovery, Troop has also introduced a simple Streamlit application. What we've done with this app is just pointed it directly at Milvus and our document store, so that the team is able to run ad hoc queries against the text payloads and actually build their own custom RAG pipelines, taking information from our embedding store and creating their own prompts with one of the main foundation models today. And you know, once we had everything else stood up with Milvus and the data lake, this entire Streamlit application probably took an afternoon to put together. More than happy to talk to folks after the call about how we did this, but I suspect that most of you, if you have any sort of front-end team, would be able to hack this together pretty quickly. Actually, you just need to know Python to set up Streamlit; the front end comes for free.
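For flavor, a hypothetical skeleton of such a Streamlit app. `search_corpus` and `call_llm` are placeholders for your Milvus search and foundation-model calls; only the prompt assembly is concrete here. You would run it with `streamlit run app.py`.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble an ad-hoc RAG prompt from text chunks pulled out of the
    document store via a Milvus vector search."""
    context = "\n---\n".join(chunks)
    return (
        "Answer the question using only the filing excerpts below.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def run_app() -> None:
    """Minimal UI: one text box, one answer. Not invoked here; Streamlit
    executes this module itself when launched with `streamlit run`."""
    import streamlit as st  # imported lazily so the module imports without it

    st.title("Filing explorer")
    question = st.text_input("Ask the corpus a question")
    if question:
        chunks = search_corpus(question)          # placeholder: Milvus search
        st.write(call_llm(build_prompt(question, chunks)))  # placeholder: LLM call
```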
So I highly encourage doing this. It's sort of taken Milvus from a tool that's just for data pipelines to something that's actually exposed to our internal team, and I suspect it's something we'll eventually make user-facing, although from a product standpoint we haven't totally figured that out. Alright, and with that, I wanna say thank you, Steffi, for inviting us to talk about Troop and how we're using Milvus, and specifically how we're using embeddings to basically compress and make sense of our corpus of text, making machine labels and verified labels, and powering our downstream fine-tuning and machine learning program. So thanks, Steffi, and thanks everyone for watching.
Thanks, Zen. Really appreciate the presentation. Now we're open for Q&A. Feel free to paste your questions at the bottom of the Zoom bar. Let's give it a couple minutes to see what questions come in.
Okay. The first question: given the array of values in the US, are you worried about bias in the embeddings and the LLMs you use? So the question is, given how many different types of people there are, are we worried about the LLMs and embeddings having bias? It's something our team talks about a lot, actually. So much of the work that we put into this addresses it: we use techniques like grounding and, as I mentioned, bringing human verification into the fold, so that we can actually fine-tune and train these models to be able to answer for specific viewpoints. So as an example, one thing that we did early on in this project is make sure that our LLM and embeddings engines are actually able to represent the entire US political spectrum of values. And we consider success and accuracy in each of those personas as independent of one another.
And so we prioritize it. So in short, yes, we very much care about that, we are worried about bias, but we've been able to fine-tune the models to speak to a whole array of values. Cool. Let me see if there's any other question here. Would you ever make a version of your chat Streamlit app public or user-facing? Yeah.
So as I mentioned with the Streamlit app that we've built, I do think that chat can be a clunky interface for a lot of data tools. Like, sometimes you put a chat interface in front of a user and they don't even know what to ask. So we're starting with it as an internal tool, 'cause our internal team is very focused and they know what to ask. I can 100% imagine a future version of Troop where there's a chat interface for asset managers to collect information about their clients, and maybe even for everyday investors like us to interface with the asset manager at scale.
It's something we're talking about, and I think if the general public wants us to open up our Streamlit app, we'll seriously consider it, and happy to do so. Let's say anyone participating right now is thinking about building such things: are there any specific tips you wanna share? Uh, sorry Stephanie, could you repeat that? Oh, actually, I see another question here: how do you deal with lots of data across documents and matching from vector stores? Is there a limitation on tokens in the LLM? LLM summarization is limited; I guess that's what this person is saying.
Yeah, thanks Vincent for the question. Totally. It's a hard problem. I think the way that Troop has built a solution that maybe deviates from the hello world examples is that we break the problem down into multiple steps.
So step one: let's say you have a hundred-page PDF, and you know generally the sorts of things you're trying to extract from some unknown page within the PDF. One example of a solution is to actually just roll through every single page and pull out a mini-extraction of the topics on that page. And so you can think of it as a MapReduce problem: take every page, map it to a smaller amount of data (or every chunk to a smaller amount of data), run extractions on that, and then iteratively do this until you've boiled it down to a much smaller summary. And at each step, a key point is your evaluation technique.
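The page-by-page map-and-reduce loop described here can be sketched as follows; `summarize` stands in for whatever LLM call you use, and `fan_in` keeps each reduce call comfortably inside the model's context window.

```python
def map_reduce_summarize(pages: list[str], summarize, fan_in: int = 4) -> str:
    """Boil a long document down in rounds: summarize each page (map), then
    repeatedly summarize groups of summaries (reduce) until one remains,
    so no single call ever sees the whole document.

    `summarize` is a callable list[str] -> str, e.g. a prompted LLM call."""
    fan_in = max(fan_in, 2)  # a fan-in of 1 would never shrink the list
    level = [summarize([p]) for p in pages]  # map: one mini-extraction per page
    while len(level) > 1:
        level = [summarize(level[i:i + fan_in])  # reduce: merge small groups
                 for i in range(0, len(level), fan_in)]
    return level[0]
```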
So how are you evaluating the efficacy of your summary? 'Cause you're right, Vincent, summarization is limited and hard. One really cool way to do it, using embeddings in Milvus, is to just embed the results of your compressed data and then compare them to the embeddings of the source material. If the embeddings are very similar, it means you've probably captured most of the concepts from the source material in your extraction. And if you look at the larger community, you know, LlamaIndex just gave a talk about this: using embeddings to evaluate models is becoming an increasingly popular technique. I hope that answers Vincent's question.
Then the next question I have here is: how do you think personalization via AI will change finance and asset management in the coming years? I think that the whole way we interact with finance is about to turn upside down. It has very much been a consumer experience where the average person doesn't know a whole lot about what's going on in the finance world, and especially, for our business, what's going on in shareholder meetings. Almost nobody that we meet reads the voting packets and understands what's coming up at an upcoming shareholder meeting. And so it was very much on what's called an asset stewardship team to know what's going on and make the right choices on your behalf. I think with AI, the opportunity for personalization is amazing, because you can actually have, at scale, sort of automated conversations and data exchanges between the folks that manage your money and the underlying investor.
And so I think we're moving towards a world, both in the governance world where Troop is focused and in the larger asset management and finance world, where we're gonna start seeing a lot more products whose competitive angle, the way they sell themselves, is personalization for you. And I think that all of that is gonna be enabled with AI. Thanks for sharing your vision. Do we have more questions here? Let me check. Well, if there are no more questions, we'll conclude this session.
Thank you so much, Zen, for joining us, and thanks everyone for participating in this webinar. Thanks, Steffi, for having us, and thanks everyone for attending. My contact info is on the screen, so please hit me up if you wanna chat. Thank you. Thank you.
Looking forward to more webinars with us. Thank you. Bye-bye.