Foundation Models Are Going Multimodal
What you'll learn
The success of foundation models such as BERT, GPT-3, CLIP, and Stable Diffusion has generated increased interest in models that combine vision and language modalities. These hybrid vision-language models have demonstrated impressive capabilities in challenging tasks, including image captioning, image generation, and visual question answering. A new paradigm of video foundation models, which learn from video data using the principles of foundation models, has recently emerged. Join James Le, who leads developer experience at Twelve Labs, for a session that provides an overview of foundation models, large language and vision-language models, and video foundation models.
- The architecture of foundation models
- The scaling laws
- The rise of vision-language models
- The new paradigm of multimodal foundation models for video
Today I'm pleased to introduce today's session, Foundation Models Are Going Multimodal, and our guest speaker James Le. James currently leads developer experience at Twelve Labs, a startup building foundation models for video understanding. Previously he worked at ML infrastructure startups such as Superb AI and Snorkel AI, while contributing to the well-known Full Stack Deep Learning courses. He also hosts the Datacast podcast, which features conversations with founders, investors, and operators in the AI data and AI infrastructure space to unpack the narrative journeys of their careers. James is joined by my colleague Frank Liu, our ML architect here at Zilliz.
Welcome, James and Frank. Awesome. Yeah, thanks Emily for the introduction, and thanks to the Zilliz team for having me on the webinar today. I've known Frank and the team for a while now.
Actually, as Emily mentioned, I interviewed Frank for my podcast and learned a lot about the early story of Zilliz and the evolution of the vector database space. So it's definitely my pleasure to be here and talk about the work we're doing at Twelve Labs in relation to the evolution of the broader language model and vision model space, and the important work of Zilliz and the other vector database companies to propel this industry forward. So yeah, the title of my talk is Foundation Models Are Going Multimodal. Just a quick introduction: the success of a lot of these models you see on this slide, such as GPT-4, DALL-E, and GitHub Copilot, has generated a lot of interest in foundation models that can perform a wide range of tasks, from image captioning to code generation to visual reasoning.
So for today's presentation, I will start from the beginning and talk about the architecture of these foundation models, the training and fine-tuning paradigm, as well as the important scaling laws. Then I will discuss how vision-language models showed up, how they combine the power of computer vision and NLP, and how they can be used to solve a lot of different complex problems today. And finally, I will talk about the new paradigm of video foundation models, which are essentially a type of multimodal foundation model that combines different modalities, and how they are changing the way we do understanding and analysis of video data. So let's talk about what a foundation model is. According to a definition from Stanford from roughly two years ago, a foundation model is a type of machine learning model that learns from broad data using self-supervision at scale.
The idea here is to create a model that can be used for many different tasks by training on a lot of data. The model learns the general patterns in the data, so when it is used for a specific task, it can use that knowledge to quickly adapt to the new task. This concept of a foundation model leverages two well-known ideas in modern AI. The first is deep neural networks, which have been popular since 2012.
The second is self-supervised learning, which I think has been around for almost as long. Recent improvements in both of these areas have allowed for the creation of ever larger and more complex models, which are often trained on massive amounts of data, frequently without explicit labels. As a result, these models can learn a wide range of patterns and relationships, which leads to significant performance improvements in NLP, vision, audio and speech processing, and even multimodal AI. From a developer's or researcher's point of view, these models can save time and resources and speed up progress. So in order to understand the backbone of foundation models, it is important to be familiar with the concept of transfer learning.
Traditional machine learning models are trained from scratch and need a lot of data to perform well. However, if you only have a small amount of data, you can leverage the benefits of transfer learning. The idea here is to take the knowledge learned from one task and apply it to another task, so that you don't require as much labeled data as you would if you were to train from scratch. For a lot of architectures in the early days of deep neural networks, pre-training was the dominant approach to transfer learning.
You basically pre-train your model on one task and then fine-tune it on another downstream task of interest. In the field of computer vision, we've been doing this since around 2014. Usually you train the model on the well-known ImageNet dataset, keep the majority of the layers of that network, and reinitialize the top two or three layers with newly learned weights that can be fine-tuned for the downstream task. Alternatively, you can fine-tune the model end to end. Some of the most popular pre-trained models for computer vision tasks include AlexNet, ResNet, MobileNet, Inception, EfficientNet, and YOLO.
In the field of NLP, pre-training was initially limited to only the first step, which is learning word embeddings. As you're probably aware, the input to a language model is words, and one way to encode them as vectors is one-hot encoding. Given a large vocabulary, you can learn an embedding matrix that maps each word into a real-valued vector space, reducing the dimensionality by orders of magnitude.
Perhaps some of these dimensions correspond to a semantic notion associated with the word. The model called word2vec trained embeddings on a similar concept back in 2013: it looks at which words frequently co-occur, and the learning objective was to maximize the cosine similarity between the embeddings of such words. As a result, it can perform some pretty cool demos of vector math on these embeddings. For example, if you embed the words king, man, and woman, you can do a vector math operation to get a vector that is close to the word queen in this embedding space. After understanding this concept, a lot of people started realizing that it is quite useful to see a lot more context when embedding a word, because a word can play different roles in a sentence depending on the context. If you can do this effectively, you can improve accuracy on a lot of different downstream tasks.
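The king/man/woman demo can be sketched with toy vectors. This is a minimal illustration, not real word2vec: the 3-d vectors below are hand-picked so the analogy works, whereas real embeddings are learned from co-occurrence statistics and have hundreds of dimensions.

```python
import numpy as np

# Hand-picked toy 3-d "embeddings" chosen so that royalty, maleness, and
# femaleness each roughly occupy one axis. Purely illustrative.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def nearest(vec, vocab, exclude=()):
    """Return the word whose embedding has the highest cosine similarity to vec."""
    best, best_sim = None, -2.0
    for word, v in vocab.items():
        if word in exclude:
            continue
        sim = v @ vec / (np.linalg.norm(v) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# king - man + woman should land near queen in this toy space.
analogy = emb["king"] - emb["man"] + emb["woman"]
result = nearest(analogy, emb, exclude=("king", "man", "woman"))
```

With a real trained model (for example via gensim), the same arithmetic over learned vectors produces the famous queen result.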
So in 2018, several NLP models, including ELMo, ULMFiT, and the original GPT model, empirically demonstrated how language modeling can be used for pre-training. All three methods worked by training a language model, and they achieved state-of-the-art results on a variety of NLP tasks, including text classification, question answering, natural language inference, sequence labeling, and many others. That original GPT model was built upon the backbone of the now very famous transformer architecture. It is worth noting that, prior to the transformer, a lot of the state-of-the-art NLP methods were based on recurrent neural networks, such as long short-term memory networks and the widely used sequence-to-sequence architecture. These process the data sequentially, meaning they look at each word in the order that the words appear.
With the transformer architecture, language processing can be parallelized by allowing the tokens in a given body of text to be analyzed simultaneously rather than in sequence. Transformers rely on a mechanism known as attention to support this parallelization. In very simple terms, attention enables a model to consider the relationships between words, even if they are far apart in the text, and to determine which words and phrases in a passage are most important to pay attention to. With this process of prioritization, transformers turn out to be much more computationally efficient than the previous recurrent methods, allowing the architecture to be trained on massive datasets with many more parameters. A lot of the later architectures based on the transformer share this common characteristic of massive size, which I'll talk about in a few slides.
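The attention computation described above can be sketched in a few lines of NumPy. This is scaled dot-product self-attention only; a real transformer adds learned query/key/value projections, multiple heads, and feed-forward layers, and the sizes here are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: every token scores its relevance to
    every other token, then mixes the values by those weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq, seq) pairwise relevance
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.normal(size=(seq_len, d_model))
out, weights = attention(x, x, x)        # self-attention: Q = K = V = x
```

Because every row of `weights` covers the whole sequence at once, all token pairs are compared in parallel, which is exactly the property that lets transformers avoid the sequential bottleneck of RNNs.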
In the domain of computer vision, traditionally a lot of work has relied on the well-known convolutional network architecture, which has been the dominant architecture in the field for decades. However, given the success of the transformer in NLP, researchers started looking into different ways to adapt the architecture to visual data. In 2021, some of the folks over at Google released a work called "An Image is Worth 16x16 Words," which introduced the Vision Transformer. This architecture effectively applies the encoder block of the transformer architecture to the image classification problem. In short, they split the image into patches and provide the sequence of linear embeddings of these patches as input to a transformer.
Similar to the concept of a token in the NLP setting, the image patches are treated as input tokens. The Vision Transformer architecture includes a stem that patches the image, a body based on a multilayer transformer encoder, and a multilayer perceptron head with the objective of transforming the global representation into output labels. Empirically speaking, Vision Transformers either match or exceed state-of-the-art results on many image classification datasets while being relatively inexpensive to pre-train. Now, although Vision Transformers show a lot of potential, they do indeed have some technical problems.
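The patch-to-token step can be sketched as a pure reshape, assuming the standard 224x224 input and 16x16 patches from the ViT paper. A real ViT would follow this with a learned linear projection, a class token, and position embeddings, none of which are shown here.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an H x W x C image into non-overlapping patch x patch squares
    and flatten each square into one vector, as in
    'An Image is Worth 16x16 Words'."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    return (image.reshape(h // patch, patch, w // patch, patch, c)
                 .transpose(0, 2, 1, 3, 4)        # group the patch grid
                 .reshape(-1, patch * patch * c)) # one row per patch

img = np.zeros((224, 224, 3))
tokens = patchify(img)   # 14 x 14 grid of patches, each 16*16*3 = 768 numbers
```

A 224x224 RGB image becomes 196 patch tokens, which is why the transformer encoder can treat an image exactly like a 196-word sentence.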
One significant issue is that they have difficulty with higher-resolution images, because they require a lot of compute power, which increases rapidly with image size. Additionally, the tokens in the architecture have a fixed patch size, and as a result they are not very useful for tasks that involve visual elements of varying size, including video. A flurry of research followed the original Vision Transformer, and most of it made some sort of enhancement to the standard architecture to address the shortcomings I just mentioned. On this slide, I want to quickly talk about two of the more popular variants of the transformer.
The first one, coming from Microsoft, is called the Swin Transformer. It introduced two important concepts: hierarchical feature maps and shifted window attention. The model uses hierarchical feature maps to enable advanced techniques for dense prediction. It achieves linear computational complexity by computing the self-attention mechanism locally within non-overlapping windows that partition an image. As a result, the Swin Transformer can serve as a very good backbone for different types of computer vision tasks. The shifted windows, in turn, enhance modeling power by bridging the windows of the preceding layer of the architecture.
As a result, the strategy is quite efficient with respect to real-world latency concerns, which matters if you're building real-world engineering systems: all query patches within a window share the same key set, making the process of accessing memory in hardware much easier. The second variant I want to talk about is called the Perceiver, from a team at DeepMind. The Perceiver is an architecture that takes a lot of inspiration from biological systems. It can handle a combination of different modalities without relying on any specific assumptions about a particular modality.
This architecture introduces a small set of latent units that form an attention bottleneck, which eliminates the quadratic cost of full attention and allows for the creation of very large and deep models. It also attends to the most relevant inputs, informed by the previous step. Secondly, if you work in multimodal AI, it's very important to differentiate the input of one modality from another. With the Perceiver, the authors associate position-specific and modality-specific features with every input element, so the model can make the distinction between, say, an image, text, or a different type of modality. So I hope that has educated you a little bit about the evolution of foundation models, the transformer architecture, and how it has been incorporated into a variety of different modalities.
Let's quickly talk about some of the work that led to the large language models as we know them today. The original GPT came out in 2018, and GPT-2 quickly came out after that in 2019. The name stands for generative pre-trained transformer. These are decoder-only models, and they use a concept called masked self-attention, which just means that at any point in the output sequence, you can only attend to the input sequence vectors that came before that point in the sequence. This approach is at the core of today's most well-known large language models, like GPT, Bard, and various other models you've seen recently.
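The masked self-attention idea can be shown by adding one line to plain self-attention: block everything above the diagonal of the score matrix so position i never sees positions after i. This is a minimal sketch with arbitrary sizes, not a full GPT block.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x):
    """Masked (causal) self-attention: position i may only attend to
    positions 0..i, so the model can be trained to predict the next token."""
    seq, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)  # True above diagonal
    scores = np.where(mask, -np.inf, scores)              # block future tokens
    weights = softmax(scores, axis=-1)                    # -inf -> weight 0
    return weights @ x, weights

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8))
out, w = causal_self_attention(x)
```

The first row of `w` attends only to itself, and every row still sums to one, which is exactly the constraint that makes decoder-only next-word prediction well defined.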
GPT-2 was trained on 8 million webpages, and the largest model has about 1.5 billion parameters. The task that GPT-2 was trained on is to predict the next word in all of the text on the 8 million webpages in the training data, and the authors found that it works increasingly well with an increasing number of parameters. Then in 2018, a team at Google released a model called BERT, which stands for bidirectional encoder representations from transformers.
BERT has about 110 million parameters. It is encoder-only, and it's also designed for a predictive modeling task. It introduced an important concept called masked language modeling: during training, words in a sequence are randomly masked, and the goal is to predict what the masked word is. Then in 2020, T5 came out, which stands for text-to-text transfer transformer.
The input and output are both text strings, so you can specify in the text the task the model is supposed to be doing. Unlike the other two, T5 has both an encoder and a decoder inside the architecture, and it was trained on the well-known C4 dataset, which is roughly a hundred times bigger than Wikipedia. T5 has about 11 billion parameters, quite large compared to some of the earlier models. After a lot of this empirical work came out, it became important to think about a systematic process for training and building these models. Nowadays we know this as the scaling law equation.
In very simple terms, scaling laws predict a continued improvement in model quality as we continue to scale up the computational budget. The team at OpenAI initially investigated the scaling laws of transformer language models in 2020, and they showed that scaling laws are predictive of future performance. I put the relationship here on the slide: performance is a function of dataset size, parameter count, and compute. More specifically, the experiments in this work show that the test loss of the model follows a power law with respect to model size, dataset size, and the compute used for training. This suggests that the relationship between these three variables can be described by a simple equation, and the implication is that it can be very useful for optimizing different training configurations for language models.
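The power-law shape of these curves can be sketched in a couple of lines. The functional form below follows the spirit of the OpenAI scaling-laws paper, but treat the constants as illustrative placeholders rather than authoritative fits; the paper fits separate laws for parameters, data, and compute.

```python
# Illustrative power-law loss curve: loss falls smoothly and predictably
# as model size N grows. alpha and n_c are placeholder constants in the
# style of L(N) = (N_c / N) ** alpha from Kaplan et al. (2020).
def power_law_loss(n_params, alpha=0.076, n_c=8.8e13):
    return (n_c / n_params) ** alpha

small = power_law_loss(1e8)    # a 100M-parameter model
large = power_law_loss(1e11)   # a 100B-parameter model
# The curve never jumps: every extra order of magnitude of parameters
# buys a predictable multiplicative reduction in loss.
```

This predictability is the whole point: you can extrapolate the loss of a model you have not trained yet from smaller runs, which is how labs now budget training compute.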
Besides that, this work also ran some other experiments, and the authors found that other architectural details, like tweaking the width or depth of the network, actually have very minimal effects on the eventual results within a wide range. Based on the experiments and the equations in this paper, it can be concluded that larger models are significantly more sample-efficient. In other words, optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence. Since the publication of that scaling laws paper, there's been a lot of interest in continuing to scale up language models, and it's been about two to three years since that work.
GPT-3 was one of the first such models, released in 2020. It was about a hundred times larger than GPT and GPT-2, with 175 billion parameters. Thanks to that model size, GPT-3 exhibits never-before-seen capabilities in a variety of few-shot and zero-shot learning tasks: the more examples you feed the model, the better the performance, and the larger the model, the better the performance gets. The team over at Google did an empirical analysis of this in 2022, in a work called "Emergent Abilities of Large Language Models." The goal was to explore emergent abilities that are present in larger models but not in the smaller ones. I highly recommend giving it a read, but in short, it surveys different research that analyzes the influence of scale by comparing models of different sizes trained with varying computational resources.
They found that for many few-shot and zero-shot learning tasks, the behavior of the model jumps unpredictably from random performance to well above random at a very specific scale threshold. For instance, once you pass roughly 70 billion parameters in model size, certain capabilities show up unpredictably. Continuing with this empirical analysis, in 2022 DeepMind proposed the so-called Chinchilla scaling laws to create compute-optimal models. This is a somewhat more accurate scaling law than the original one proposed by OpenAI.
In the analysis the authors did in this work, they trained over 400 language models with parameter counts varying from 70 million to over 16 billion, on datasets varying from 5 billion to 500 billion tokens. By predicting the optimal amount of data given the number of model parameters, the authors derived formulas for model and training set size, and they found that most of the large language models at the time were under-trained, meaning that they hadn't seen enough data. To verify this, they trained another large model called Gopher. Gopher has about 280 billion parameters and was trained on 300 billion tokens. For Chinchilla, they reduced the number of parameters to 70 billion while increasing the data fourfold to 1.4 trillion tokens. Despite having fewer parameters, Chinchilla actually exceeded the performance of Gopher.
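The Gopher versus Chinchilla comparison is often summarized with a rough heuristic: scale training tokens in proportion to parameters, at roughly 20 tokens per parameter. The 20x ratio is a commonly quoted approximation of the Chinchilla result, not an exact law, so treat the sketch below as a back-of-the-envelope aid only.

```python
# Rough compute-optimal heuristic popularized by the Chinchilla paper:
# train on about 20 tokens per model parameter.
def compute_optimal_tokens(n_params, tokens_per_param=20):
    return n_params * tokens_per_param

gopher_params = 280e9       # Gopher: 280B parameters, ~300B training tokens
chinchilla_params = 70e9    # Chinchilla: 70B parameters, 1.4T training tokens

gopher_target = compute_optimal_tokens(gopher_params)          # 5.6T tokens
chinchilla_target = compute_optimal_tokens(chinchilla_params)  # 1.4T tokens
# Gopher saw ~300B tokens against a ~5.6T target, so under this heuristic
# it was heavily under-trained; Chinchilla hits its target exactly.
```

That gap is the intuition behind why the smaller but better-fed Chinchilla came out ahead.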
This suggests that both model size and training tokens are equally important, not just one of these variables. Since that formal empirical analysis of scaling laws, we've seen many more large language models released. The ones I put on the slide only cover models coming from academia and research labs; obviously a lot of commercial models came out lately as well that I haven't been able to cover here. Generally speaking, these models achieved various state-of-the-art results at the time of their release, simply by scaling the model size and training on larger datasets from more data sources. Examples include Megatron-LM, GLaM, LaMDA, Megatron-Turing NLG, and PaLM. So I've talked a lot about the scaling laws for NLP, but it turns out this concept also applies to computer vision.
In this work from a team at Google in 2022, they conducted experiments on a variety of vision transformer architectures. They ran the same type of experiments, varying the parameter count from 5 million to 2 billion, the training dataset from 1 million to 3 billion training images, and the compute budget from less than one TPU-core-day to more than 10,000 TPU-core-days. They show that simultaneously scaling total compute and model size is very effective, and the most optimal strategy is to increase the model size as additional compute becomes available. Finally, they found that vision transformers with sufficient training data roughly follow a power law in performance, and that larger models tend to perform better in a lot of few-shot learning experiments. So I've talked about the vision transformer architecture, and thanks to that concept, there's been a lot of increased interest in building architectures that combine the vision and language modalities in the same training and learning paradigm. Briefly speaking, these hybrid vision-language models can demonstrate very impressive capabilities in different tasks, like captioning an image, generating a new image from scratch, or even doing visual question answering.
Typically speaking, these vision-language models consist of three key elements: an image encoder, a text encoder, and a strategy to fuse the information from these two encoders. For the next few slides, I want to review the most prominent models in vision-language research over the past two years. In 2021, OpenAI introduced CLIP, which stands for contrastive language-image pre-training. The input to CLIP is 400 million image-text pairs crawled from the internet. It encodes the text using a vanilla transformer and encodes the image using a vision transformer.
It applies a learning paradigm called contrastive learning to train the model. In very simple terms, contrastive learning matches the correct image and text pairs using some sort of similarity score. It can be anything, but cosine similarity is probably the most relevant one. With this powerful pre-trained model, you can map images and text into embeddings, even on unseen data. There are two ways to do this.
The first way is to use a linear probe, by training a simple logistic regression classifier on top of the features that CLIP outputs when performing inference. Alternatively, you can use the zero-shot technique, which encodes all the text labels and compares them to the encoded image. Both approaches work well, although in the paper they found that training a linear classifier tends to perform slightly better. To clarify: CLIP does not directly go from image to text or vice versa; it uses embeddings to perform this matching.
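The zero-shot path can be sketched with stand-in vectors. The "encoders" below are just hand-picked arrays, not real CLIP models, and the prompt strings are illustrative; the point is only the mechanic of picking the label whose text embedding is most cosine-similar to the image embedding.

```python
import numpy as np

def zero_shot_classify(image_vec, label_vecs, labels):
    """CLIP-style zero-shot classification: score each encoded text label
    against the encoded image by cosine similarity, return the best label."""
    image_vec = image_vec / np.linalg.norm(image_vec)
    scores = {}
    for label, v in zip(labels, label_vecs):
        scores[label] = float(v @ image_vec / np.linalg.norm(v))
    return max(scores, key=scores.get), scores

labels = ["a photo of a dog", "a photo of a cat"]
label_vecs = [np.array([1.0, 0.0, 0.2]),   # pretend text-encoder outputs
              np.array([0.0, 1.0, 0.2])]
image_vec = np.array([0.9, 0.1, 0.3])      # pretend image-encoder output

pred, scores = zero_shot_classify(image_vec, label_vecs, labels)
```

With real CLIP encoders you would swap in model outputs for the stand-in vectors; no classifier training is needed, which is what makes the approach zero-shot.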
This embedding space, which you can kind of see here on the slide, is one where the semantic notions of image and text are matched in the same space. As a result, this embedding space is extremely useful for performing search across different modalities. I spent a lot of time talking about CLIP because it serves as the backbone idea for a lot of the vision-language models that came after it. Shortly after, Google introduced a work called CoCa, standing for Contrastive Captioner. This is another foundation model trained with both contrastive learning and generative captioning.
It is an encoder-decoder architecture that has been modified and trained with both a contrastive loss and a captioning loss. With that training paradigm, the model learns both the global representations from unimodal image and text embeddings, as well as the fine-grained region-level features from multimodal embeddings. In late 2022, DeepMind released a group of visual language models called Flamingo. These models can do many different things, even with very few samples of input and output data. The Flamingo models have two key components: the first is a vision model that can understand visual scenes, and the second is a language model that can help with reasoning.
The models bridge these two sources of knowledge to work together. It's important to note that Flamingo can take high-resolution images and video as input thanks to the Perceiver architecture you see in the slide, which I talked a little bit about earlier when discussing transformer variants. As a result of that architecture, it can analyze a large number of visual input features and produce a small number of visual tokens. Thanks to these architectural innovations, the model family can connect strong pre-trained models for vision and for language, and can handle sequences of mixed visual and text modalities. In this work, the biggest version is called Flamingo-80B, and it has 80 billion parameters. It set the record across different few-shot learning tasks that involve understanding language, images, and videos. And there's been a lot of innovation in vision-language research coming out of both academia and industry in the past few months.
On this slide, I want to talk quickly about two that have probably gathered a bit more interest from the public. The one on the left came from Microsoft. It's called Kosmos-1, and it's a multimodal large language model that can perceive different modalities, learn in context, and follow instructions given by the prompter. The model can generate text based on the previous context and handle text and other modalities using a transformer-based causal language model. Kosmos-1 was trained on various types of data and can perform well in different scenarios, including understanding and creating language, recognizing images, and answering questions based on images. The one on the right comes from Google and is called PaLM-E, an embodied multimodal language model that can handle various reasoning tasks based on observations from different sources, using different embodiments, across internet-scale language, vision, and visual-language domains.
In this work, they tried out different architecture variants as well. The biggest one is PaLM-E-562B, which has about 562 billion parameters. It can reason about things it was not explicitly trained on beforehand; it can even tell jokes based on an image and perform different robotic tasks like perceiving, talking, and planning. It's definitely very impressive, and I highly recommend that viewers check out this work.
Awesome. So for the last segment of my presentation, I will talk about this new paradigm of video foundation models. I think video understanding tasks have become increasingly important in our society today. There's a lot of video content on social media, and there's also the increasing use of cameras in public spaces. As a result, there is a growing need for sophisticated video understanding systems.
However, despite the importance of this problem, it has actually received little attention compared to the text and image understanding tasks I talked about in the previous slides. This is due to a couple of major technical challenges. The first is that video processing entails a very high compute burden. Videos are much larger in size compared to the text or image modalities, and they also require significantly more processing power to analyze.
This issue is even more pronounced with the transformer architecture, which has quadratic complexity with respect to token count. So that's the compute side, but there's also the actual modeling aspect. Inside a video, you're dealing with something like an image or text moving through time. To put it in technical terms, you have to perform temporal modeling: the temporal dimension must be taken into account when you perform analysis. As a result, when you work with video, you need specific, specialized techniques and models that are not commonly used with the other modalities. And finally, in addition to the visual information presented in the video clips, there are also synchronized audio cues that require additional processing.
For example, you might have people's conversations happening inside a video, and you want to take these audio cues into account when you perform your analysis. It's important to note that these audio cues are often just as important as the visual information present in the video, and you have to make sure you process them in conjunction with the visual elements. How to align the audio modality with the visual modality in the same embedding space is actually quite a unique and challenging research direction that has been going on for the past couple of years. So those are the three major challenges for video modeling that I wanted to bring up. Although these are important challenges, there's actually been quite a lot of progress being made in video understanding research.
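The compute challenge above can be made concrete with a quick token count. Assuming ViT-style 16x16 patches on 224x224 frames (196 tokens per frame), the numbers below illustrate why naive full attention over video is so expensive; the frame count and patch size are illustrative assumptions, not figures from any specific model.

```python
# Back-of-the-envelope cost of full self-attention over video frames,
# assuming 196 ViT patch tokens per 224x224 frame.
def num_tokens(frames, tokens_per_frame=196):
    return frames * tokens_per_frame

def attention_cost(tokens):
    """Full self-attention compares every token pair, so cost grows
    quadratically with sequence length."""
    return tokens ** 2

image_tokens = num_tokens(1)    # a single image: 196 tokens
clip_tokens = num_tokens(32)    # a short 32-frame clip: 6272 tokens

# 32x the tokens, but (32^2) = 1024x the attention cost.
ratio = attention_cost(clip_tokens) / attention_cost(image_tokens)
```

This quadratic blow-up is exactly why video models lean on tricks like sparse frame sampling, windowed attention, or Perceiver-style latent bottlenecks instead of attending over every patch of every frame.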
So for the next few slides, I want to quickly go over some of the most important video foundation model work designed to tackle these challenges. The first one, and actually the earliest one, from 2019, is VideoBERT, and it came from Google. It applies self-supervision to video. It used three preexisting methods: automatic speech recognition, vector quantization of spatiotemporal visual features, and a BERT model over sequences of tokens. All three components work together to model the relationships between the visual and linguistic domains. In order to make the BERT architecture work with video, the authors turned the raw video data into "visual words" using that vector quantization paradigm.
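As a loose illustration of the "visual words" idea (my own sketch, not the VideoBERT code), here is a tiny nearest-centroid quantizer: clip-level feature vectors are mapped to discrete token IDs by finding their closest centroid, the way k-means centroids serve as a visual vocabulary that a BERT-style model can then consume.

```python
import numpy as np

def quantize(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map each feature vector to the ID of its nearest centroid (a 'visual word').

    features:  (n, d) clip-level feature vectors
    centroids: (k, d) cluster centers learned offline (e.g. by k-means)
    returns:   (n,) integer token IDs in [0, k)
    """
    # Squared Euclidean distance from every feature to every centroid.
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Toy 2-D example: two obvious clusters around (0, 0) and (10, 10).
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
feats = np.array([[0.5, -0.2], [9.8, 10.1], [0.1, 0.3]])
tokens = quantize(feats, centroids)
print(tokens.tolist())  # -> [0, 1, 0]
```

Once every clip is a token ID, the video becomes an ordinary discrete sequence, which is exactly what lets a text-style masked language model train on it.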
This helped the model focus on the important parts of the video and how those parts transform over time. Empirically, they evaluated the model on different video captioning tasks, and it actually outperformed a lot of existing hand-designed architectures at the time.

The next work here is called All-in-One, by a team from the National University of Singapore. It's a video-language model designed for pretraining that can capture video-language representations from raw visual and textual signals in a unified backbone architecture. Looking at the slide, it uses a temporal token rolling operation, this one right here, to capture temporal representations over sparsely sampled frames without adding extra parameters or increasing time complexity. That is how they capture the temporal dimension I mentioned in the previous two slides. And it performed pretty well on four separate downstream video-language tasks: video question answering, text-to-video retrieval, multiple-choice QA, and visual commonsense reasoning.
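Here is a minimal sketch of what token rolling can look like (my own simplification of the idea, not the All-in-One source): a slice of each frame's tokens is cyclically shifted along the temporal axis, so every frame's representation mixes in information from its neighbor at zero parameter cost.

```python
import numpy as np

def temporal_token_rolling(tokens: np.ndarray, roll_fraction: float = 0.25) -> np.ndarray:
    """Cyclically shift a fraction of each frame's tokens one step in time.

    tokens: (T, N, D) array - T frames, N tokens per frame, D channels.
    A leading block of tokens is rolled forward one frame so that frame t
    sees tokens that originally belonged to frame t-1. No new parameters.
    """
    rolled = tokens.copy()
    n_roll = int(tokens.shape[1] * roll_fraction)  # how many tokens to shift
    # np.roll with shift=1 along axis 0 moves frame t-1's tokens into frame t.
    rolled[:, :n_roll, :] = np.roll(tokens[:, :n_roll, :], shift=1, axis=0)
    return rolled

# Toy input: 3 frames, 4 tokens per frame, 2 channels; values encode the frame index.
toy = np.stack([np.full((4, 2), t, dtype=float) for t in range(3)])
out = temporal_token_rolling(toy)
print(out[1, 0, 0], out[1, 2, 0])  # rolled token came from frame 0; untouched token stays 1
```

The self-attention layers that follow then let the rolled tokens exchange information with the rest of the frame, which is how temporal context spreads without any dedicated temporal module.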
Microsoft also came onto the scene and introduced X-CLIP, which is a framework that adapts language-image models to general video recognition. It has two separate components: a cross-frame communication transformer and a multi-frame integration transformer. The former allows the frames to exchange information using message tokens, while the latter transfers frame-level representations up to the video level. X-CLIP also uses video content information to enhance text prompting, with a video-specific prompting scheme.
Across fully-supervised, zero-shot, and few-shot experiments, the X-CLIP framework performs pretty well even with limited labeled data.

The next work is InternVideo, and it's actually one of the best-performing video foundation models, and the most impressive one. It combines two popular self-supervised paradigms: masked video modeling and multimodal contrastive learning. It uses learnable interactions to derive new features from these two separate transformers, and it combines the benefits of both generative and contrastive learning tasks. The part I find most impressive in this paper is the evaluation scheme: they use a suite of video understanding benchmarks that includes tasks like action understanding, video-language alignment, and open-world video application tasks. And InternVideo performed reasonably well on most of these different tasks.
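For readers unfamiliar with the contrastive half of that recipe, here is a bare-bones InfoNCE-style loss (a generic sketch, not InternVideo's actual implementation): matched video/text embedding pairs are pulled together and mismatched pairs pushed apart via a softmax over cosine similarities. The temperature value is an assumed hyperparameter.

```python
import numpy as np

def info_nce_loss(video_emb: np.ndarray, text_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric contrastive loss over a batch of matched (video, text) pairs.

    video_emb, text_emb: (B, D) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products become cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(v))              # positives sit on the diagonal

    def xent(l):
        # Cross-entropy of each row against its diagonal (matched) entry.
        l = l - l.max(axis=1, keepdims=True)                       # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the video->text and text->video directions.
    return (xent(logits) + xent(logits.T)) / 2.0

# Perfectly aligned embeddings give a lower loss than deliberately shuffled ones.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned = info_nce_loss(emb, emb)
shuffled = info_nce_loss(emb, emb[::-1].copy())
print(aligned < shuffled)  # -> True
```

Pairing a loss like this with a masked-reconstruction objective is what gives a model both discriminative (retrieval-friendly) and generative signal during pretraining.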
Those tasks represent some of the core abilities of generic video perception. That's why it's pretty impressive: the evaluation paradigm is broader in scope, which moves us closer to generic video understanding, unlike some of the other works I mentioned, which only specialize in a very specific evaluation paradigm.

This next work, from the Allen Institute for AI and the University of Washington, is called MERLOT Reserve, and it's a model that learns multimodal neural script knowledge representations of videos by jointly performing reasoning over video frames, text, and audio. The model is designed to represent videos over time and across different modalities. It's trained on over 20 million YouTube videos through a new contrastive masked-span learning objective, to learn from both text and audio self-supervision. As a result, it can capture some of the semantic and temporal relationships between the different elements of a video, which allows it to learn a very rich representation of video content that can be used for zero-shot video understanding tasks.
The unique thing about this one is that it covers the audio component we talked about in some of the previous slides, the synchronized audio cues, so it's very relevant here.

VideoCoCa is an approach to video-text modeling that leverages the CoCa work I mentioned back in the section on vision-language models; CoCa stands for Contrastive Captioners. VideoCoCa essentially uses the contrastive captioning model to generate candidate sentences for video captioning, and those sentences are then scored by another transformer-based model based on their relevance to the target video.
And it performed pretty well on video captioning tasks. Vid2Seq is another video captioning model, and this one is single-stage. It's pretrained on narrated videos. It takes frames and transcribed speech from an untrimmed video that is several minutes long as the input.
It then outputs event captions together with their temporal localization in the video by predicting a single sequence of tokens. The architecture relies on a T5 language model augmented with special time tokens, allowing it to seamlessly predict event boundaries and text descriptions in the same output sequence. It was pretrained on the HowTo100M corpus of narrated videos: given a narrated video, they reformulate the sentence boundaries of the transcribed speech as pseudo event boundaries, and use the transcribed speech sentences as pseudo event captions. The diagram I put here shows how Vid2Seq works, from the video frames to the final captions. The unique thing is that it captures that temporal localization pretty well using this learning paradigm, which, as I just said, is quite distinctive compared to some of the previous architectures I mentioned.
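The "special time tokens" trick can be sketched like this (an illustrative simplification, not Vid2Seq's actual tokenizer): timestamps are quantized into a fixed number of bins, each bin gets its own token, and an event becomes start token, end token, then caption tokens, all in one flat sequence. The bin count of 100 and the token naming are assumed values.

```python
def time_token(t_seconds: float, duration: float, n_bins: int = 100) -> str:
    """Quantize a timestamp into one of n_bins discrete time tokens."""
    # Clamp to [0, duration], then map proportionally onto bin indices.
    frac = min(max(t_seconds / duration, 0.0), 1.0)
    bin_id = min(int(frac * n_bins), n_bins - 1)
    return f"<time_{bin_id}>"

def event_to_sequence(start: float, end: float, caption: str, duration: float) -> str:
    """One dense-captioning event as a flat token sequence: times first, then text."""
    return f"{time_token(start, duration)} {time_token(end, duration)} {caption}"

# A 200-second video with an event from 20s to 50s.
print(event_to_sequence(20.0, 50.0, "a person opens the door", 200.0))
# -> <time_10> <time_25> a person opens the door
```

Because boundaries and words share one vocabulary, a single autoregressive decoder can emit "when" and "what" together, which is what makes the approach single-stage.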
The final work we'll talk about in this presentation is called Track Anything, and it's designed for video object tracking and segmentation. Track Anything is built upon the Segment Anything Model, SAM for short. SAM came from Meta, back in April, I think. SAM is a promptable model for image segmentation focused on promptable segmentation tasks, meaning you can do prompt engineering: you prompt it with something like "hey, segment this region of the scene," and SAM will perform the segmentation for that scene.
Track Anything basically adapts that same prompting paradigm to video segmentation. During tracking, users can change the object they want to track, or correct the region of interest if there is any ambiguity they want to fix. As a result, Track Anything is quite suitable for video object tracking and segmentation with shot changes. For instance, a complex video can have a lot of zooming in and out, or switch from shooting from above to shooting from below.
Even with those shot changes, it's very useful for tracking and segmenting those video objects. It's also pretty useful for visual development and data annotation: say you want to label data to construct a training dataset for your own video modeling work. And finally, it's suitable for downstream video tasks that focus on objects, like video editing and video inpainting. As a video editor, you often have to manipulate the objects in a scene in some way, and by pinpointing those objects you can do that much more flexibly.

Awesome. So this is the conclusion slide.
We went through an introduction to foundation models. We touched on concepts like transfer learning and embeddings, and we talked about the original Transformer architecture and its variants across different modalities. Then I talked about large language models, from the original GPT family from OpenAI, to Google's T5, to BERT. We talked about the scaling laws and how they have become fundamental, as an empirical notion, in helping researchers and engineers optimize training configurations for large language models and their emergent abilities. We talked about the rise of large vision-language models that combine more than one modality, taking in both visual and text modalities in the same learning paradigm. And that is largely thanks to OpenAI's CLIP.
CLIP, released back in 2021, did contrastive language-image pretraining, and it was followed by different works like Google's CoCa and Flamingo, Microsoft's KOSMOS-1, and Google's PaLM-E. And then finally, we talked about the new paradigm of video foundation models. Specifically, we covered the unique challenges of video modeling, from the heavy compute burden, to temporal modeling, to aligning audio and visual signals together. Then I walked through a portfolio of video foundation models from the past two years, from VideoBERT to All-in-One to InternVideo, MERLOT Reserve, VideoCoCa, Vid2Seq, and Track Anything.

One quick note about Twelve Labs, where we come from and why we became so interested in this whole evolution: we are building multimodal foundation models for video understanding, and we leverage a lot of this fundamental work.
We leverage things like the Transformer, CLIP, and the pretraining datasets to construct our own foundation model for video understanding, called Marengo. It can perform different tasks, and it learns text, audio, and vision in the same learning paradigm. I actually wrote a full blog post on this topic on our website, and I've put it here on the screen. If you want to go deeper and zoom in on some of the points mentioned, feel free to check out the blog post in the slide.
I also run a Discord community called Multimodal Minds, which serves as a venue for interaction between tinkerers, researchers, and developers who are interested in multimodal research and applications. This space is very new; at least compared to generative AI or LLMs, it's still quite research-oriented, but I think it can become very important in the future. So if you're interested in talking about this topic and exchanging ideas with other folks, we'd love for you to join our Discord.
I've put the QR code on this slide. We also host weekly webinars to share discussions about multimodal research, and Frank was actually a speaker just two weeks ago, talking about the multimodal evolution of embeddings, which was pretty cool; it was good to see how Zilliz is thinking about incorporating multimodal models into the space. So if you're building something new, or just doing research in this space, you're welcome to join and perhaps present your work to our community. And with that, I'm going to stop the presentation here and welcome any questions from the audience.
Thank you. Thank you, James, for that awesome presentation. For the attendees of this webinar: if you paste your questions in the Q&A or in the chat, we'll get to some of those over the next 10 to 15 minutes. But before that happens, James, do you want to spend two or three minutes talking about what Twelve Labs provides, from an API perspective, or from a SaaS or other services perspective?

Yeah, absolutely. So Twelve Labs was incorporated about two years ago.
We started out as a group of researchers building models to understand video. So, as I said, the company is building foundation models for video understanding; that's the research point of view. From the product point of view, we offer, at the moment, two separate APIs. The first one is video search, and the second one is video classification. Video search means you can search for specific people, objects, or activities happening inside a video clip using a natural language query.
The classification API means you can classify videos into specific categories using a predefined set of class labels, or you can even perform zero-shot classification by supplying new labels when you run the classifier. So those are the two major APIs we offer. You might notice that both of these APIs focus on discriminative tasks, meaning they leverage embeddings to perform search or classification. We're also working hard on some generative tasks, which should be released in the upcoming months. We're looking at tasks like video captioning and video question answering: given a video as input, can you generate captions for that video, or can you interact with the video by asking it questions? A lot of the slides in this presentation covered some of the unique models for that.
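As a rough picture of how embedding-based zero-shot classification works in general (my own generic sketch, not Twelve Labs' implementation), you embed the video once, embed each candidate label as text with the same shared encoder, and pick the label whose embedding is most cosine-similar:

```python
import numpy as np

def zero_shot_classify(video_emb: np.ndarray, label_embs: dict) -> str:
    """Return the label whose text embedding is most cosine-similar to the video."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {label: cosine(video_emb, emb) for label, emb in label_embs.items()}
    return max(scores, key=scores.get)

# Toy embeddings: in reality these would come from a shared video-text encoder.
video = np.array([0.9, 0.1, 0.0])
labels = {
    "cooking": np.array([1.0, 0.0, 0.1]),   # nearly parallel to the video vector
    "skiing": np.array([0.0, 1.0, 0.0]),
    "news": np.array([0.0, 0.1, 1.0]),
}
print(zero_shot_classify(video, labels))  # -> cooking
```

Because adding a new class is just embedding one more string, no retraining is needed, which is exactly what makes the zero-shot setting attractive.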
We're trying to take inspiration from that work and incorporate it into the product. So those are the major APIs we have. One final thing I want to mention is that we are pretty industry-agnostic, meaning we try to build a single model under the hood and stay horizontal, so it can adapt to different downstream video domains. We work with sports video, security video, e-commerce video, and so on; we don't position ourselves in any one specific domain. I hope that answered the question.

Yeah, absolutely. That was great.
We do have one question here, and as some of the other ones come in, we can also have a discussion; I have a couple I'd like to ask as well. The first one is from Siddharth, and he's asking: "I'm working on a similarity search platform with Zilliz. Is there a proper way to vectorize the images and videos using the appropriate models from Twelve Labs?" He's currently using BERT and MiniLM for the text embedding creation.
So do you have any advice for him there, James?

Yeah, let's see: is there a proper way to vectorize pictures and videos with appropriate models? The first step is probably to look at the open-source solutions out there for generating embeddings from images and video. OpenAI's CLIP is the very popular one, and there are different versions of the model available on GitHub; I think OpenCLIP is one of them. So you can fork that, download it locally, and work with it.
Then, if you want to improve performance or change anything, you can make modifications to it, and also look for more open-source solutions. From a Twelve Labs perspective, we will soon be releasing our video embeddings API, meaning we'll make our models available via the API for a more technical audience to use. That's probably going to land around September or October rather than the end of the year. So you'll be able to use the API to call our model and then perform your own modeling on your respective pictures or videos.
So yeah, that's my answer to Siddharth.

Yeah, that's a great response. I have a follow-up question to that, James: for Twelve Labs' upcoming embedding API, is that going to be limited? Let's say I want to take a clip of a video and understand what's going on there. Is there going to be a limit on the number of frames in the video for the embedding, or can I give it a one-and-a-half-hour-long video and have it output multiple embeddings for me?

Yeah.
I think at this point we're still trying out different approaches. One of the limitations of our models right now is definitely that sort of long-context input. You're probably very familiar with this, given the recent work on scaling up the context length, the token size, for language models. In video, obviously, the longer the input video, the more complexity we have to deal with to process it. Right now, an hour is probably super long for us to handle; anything under 30 minutes is probably ideal for us to perform the indexing process and then let you perform search and classification afterwards. We're definitely looking at different ways to improve performance and accept longer videos as input. That involves different criteria: we're working hard on the compute aspect, on how to train the model to handle longer inputs more efficiently, but also on the modeling side. How can you, as I said before, perform alignment between visual and audio? Because the longer the video, the more intelligent the model has to be about paying attention to the most relevant parts. Attention here really means you're not going to look at every single frame; you have to detect when there is a shot change or a scene boundary, and then attend to that and extract it. And it's not just the length of the video, but also the type of video.
If it's an educational video with one person speaking, then even if it's long, it's not going to take a lot of compute. But if it's an action movie with a lot of shot changes, it's definitely going to take a lot more effort for us to process.
Yeah, absolutely. I have a follow-up question about that as well. We were talking about scaling laws, which you covered in one of your earlier slides.
In particular, I think this is very relevant for video, because when it comes to video, there is just so much data out there. At the same time, we've seen a lot of work that isn't necessarily public research: GPT-4, for example, I think is known to be an eight-way mixture of experts, with each expert at something like 220 billion parameters. Do you think we've reached the limits of those scaling laws, at least for language models? And do you think we're even close for video?

Yeah, good question. Happy to have a back-and-forth here as well.
Yeah. So regarding the limits of training data for language: I think there's a lot of conversation lately about things like synthetic data for text, and you probably have more of a purview on that side, given your work on vector search for text. But if I talk about it from a video point of view, I just think there's still a lot of untapped potential in video data that can be harnessed: content from TikTok, YouTube, and so on. That content not only already exists, it's being generated at an even greater magnitude than our ability to analyze it. So that's one point.
So my first point is that there's a lot of potential in existing data. And secondly, companies like Runway and Synthesia are actually doing video generation, meaning they create new videos from scratch, so there's even more video to process and to use as training data. I think there's still a lot of video out there for us to train models on and apply them to. For the time being, we're not super worried about running out of video data to train on; it's just a matter of how to find it and how to harness it efficiently.

Yeah, absolutely. For sure.
That makes a lot of sense. I know we're running up on time here. So folks, if there are any other questions you'd like James to answer, please add them to the Q&A or the chat; either is okay. And one last thing I want to add, James: when that video embedding endpoint is available, we'd love to have you post about it in our Milvus Slack channel, just to let the broader community know, "hey, now we do have a way for you to embed videos as well."
And I think with that, I'll kick it back to Emily. Thank you, James.

Yeah, just one quick note on Frank's point: when that embeddings API becomes available, one of our main goals, and this is relevant to Zilliz, is to integrate and partner with different vector database vendors. Users of, say, Milvus could extract embeddings from our model and then use Milvus to store them and perform retrieval, and so on. So I'm definitely excited to have those conversations moving forward, and hopefully users can benefit from that soon-to-be-released capability.
Yeah, we're excited about that as well. Thank you, James. Absolutely. Thank you so much, James.
What a great presentation; I know our audience learned a whole lot. Thank you so much to all of those who joined us. Keep an eye on the zilliz.com/event calendar for more upcoming sessions like this, and we hope to see you on a future webinar.
Meet the Speaker
Join the session for live Q&A with the speaker
Developer Experience, Twelve Labs
James Le currently leads Developer Experience at Twelve Labs, a startup building foundation models for video understanding. Previously, he worked at ML infrastructure startups such as Superb AI and Snorkel AI, while contributing to the well-known Full-Stack Deep Learning courses. He also hosts Datacast, a podcast that features conversations with founders, investors, and operators in the data and AI infrastructure space to unpack the narrative journey of their careers.