Agentic AI in Action: Real-Time Vision, Memory & Autonomy with Browser Use & Milvus
1 00:00:03.465 --> 00:00:05.685 So today I'm pleased to introduce today's session,
2 00:00:05.945 --> 00:00:10.205 Agentic AI in Action, and our guest speaker Stephen Batifol.
3 00:00:10.545 --> 00:00:13.685 So, Stephen Batifol is a developer advocate at Zilliz,
4 00:00:13.705 --> 00:00:16.725 and he previously worked as a machine learning engineer at
5 00:00:16.795 --> 00:00:19.165 Wolt, where he was working on the ML platform,
6 00:00:19.225 --> 00:00:21.925 and as a data scientist at Brevo.
7 00:00:22.435 --> 00:00:24.085 Stephen studied computer science
8 00:00:24.185 --> 00:00:27.245 and artificial intelligence, and some of his hobbies
9 00:00:27.345 --> 00:00:29.845 and things he enjoys are dancing and surfing.
10 00:00:30.275 --> 00:00:31.285 Welcome, Stephen.
11 00:00:35.185 --> 00:00:37.165 Uh, thank you very much for the intro.
12 00:00:37.165 --> 00:00:39.885 Sorry everyone for, um, the problem I had.
13 00:00:40.305 --> 00:00:41.805 Uh, and thank you for joining.
14 00:00:42.065 --> 00:00:46.845 Yes, as said today, we are here to talk about agentic AI
15 00:00:47.745 --> 00:00:51.845 and how we can use real-time vision, memory, um,
16 00:00:52.265 --> 00:00:54.845 and autonomy as well with Browser Use and Milvus.
17 00:00:56.145 --> 00:00:57.685 So I'm Stephen
18 00:00:57.685 --> 00:01:00.445 Batifol. I'm a developer advocate, as it was, um, said before.
19 00:01:01.465 --> 00:01:02.925 If you have any questions related
20 00:01:02.945 --> 00:01:05.085 to GenAI, vector databases, uh,
21 00:01:05.585 --> 00:01:08.205 or anything honestly in the AI world, uh,
22 00:01:08.235 --> 00:01:09.325 feel free to hit me up.
23 00:01:09.425 --> 00:01:10.765 Uh, you can scan this QR code,
24 00:01:10.765 --> 00:01:12.845 which will redirect you to my LinkedIn.
25 00:01:13.545 --> 00:01:16.205 Um, but yes, I also help a lot.
26 00:01:16.465 --> 00:01:19.565 Uh, if you have like Kubernetes things on Milvus
27 00:01:19.565 --> 00:01:22.005 and if you want to deploy Milvus somewhere, uh,
28 00:01:22.045 --> 00:01:23.365 I can also really, really help you.
29 00:01:24.665 --> 00:01:26.125 But let's get started.
30 00:01:26.265 --> 00:01:29.365 And I'm gonna start with a small introduction about,
31 00:01:29.505 --> 00:01:31.925 you know, what are agentic systems in the first place?
32 00:01:33.315 --> 00:01:35.285 Then we'll also go through, you know,
33 00:01:35.285 --> 00:01:37.165 like the different tech stack I have, uh,
34 00:01:37.225 --> 00:01:38.765 at the end we'll finish with a demo.
35 00:01:39.105 --> 00:01:40.245 So you'll see it live
36 00:01:40.425 --> 00:01:43.045 and you'll see, uh, actually my computer take, uh,
37 00:01:43.105 --> 00:01:45.725 the agent take control of my computer, um,
38 00:01:45.945 --> 00:01:48.005 to then perform different actions.
39 00:01:49.345 --> 00:01:51.825 So, agentic systems.
40 00:01:52.985 --> 00:01:55.465 I am using the definition from Anthropic here.
41 00:01:55.685 --> 00:01:57.865 Um, there are basically two different ways.
42 00:01:58.565 --> 00:02:01.785 Uh, so there is what we call workflows, which are, you know,
43 00:02:01.785 --> 00:02:04.665 systems where the LLMs and tools are orchestrated
44 00:02:04.665 --> 00:02:07.705 through some like code path that you define yourself.
45 00:02:08.525 --> 00:02:12.145 Um, and then there's agents, which is systems where,
46 00:02:12.245 --> 00:02:14.425 you know, LLMs are dynamically directing their
47 00:02:14.425 --> 00:02:15.585 own processes and tool usage.
48 00:02:16.365 --> 00:02:18.665 Uh, and they're the ones, you know, like making sure
49 00:02:18.665 --> 00:02:20.985 that the task is actually being accomplished.
50 00:02:21.405 --> 00:02:22.865 Uh, if there's a problem, you know,
51 00:02:22.865 --> 00:02:23.945 they're gonna have a look at it.
52 00:02:24.485 --> 00:02:26.425 Um, and then, you know, they're gonna try to,
53 00:02:26.565 --> 00:02:28.265 to fix this, uh, on their own.
54 00:02:30.685 --> 00:02:35.585 And it's good to know when to use and not to use agents.
55 00:02:35.925 --> 00:02:38.865 Um, so agentic systems can be very powerful,
56 00:02:39.405 --> 00:02:40.865 but they also trade latency
57 00:02:40.865 --> 00:02:43.305 and cost in order to have better performance.
58 00:02:43.305 --> 00:02:46.505 You know, so if you have, if you're working on something
59 00:02:46.505 --> 00:02:48.585 where you really need very low latency,
60 00:02:48.995 --> 00:02:51.345 maybe agentic systems might not be, uh, good for you.
61 00:02:53.335 --> 00:02:57.105 Also, the good part about workflows is that, you know,
62 00:02:57.105 --> 00:02:59.485 they offer predictability and consistency,
63 00:03:00.065 --> 00:03:02.285 but you need to have very well-defined tasks.
64 00:03:02.305 --> 00:03:04.205 You know, you, you need to like
65 00:03:04.905 --> 00:03:07.125 do it the way you would do before agents
66 00:03:07.125 --> 00:03:10.125 and stuff, which is like actually define the functions, uh, define
67 00:03:10.125 --> 00:03:11.205 what the function is gonna do.
68 00:03:11.625 --> 00:03:14.165 Um, and you can't really go YOLO, you know,
69 00:03:14.195 --> 00:03:16.045 like you would do sometimes with an agent.
70 00:03:17.345 --> 00:03:20.485 An agent can be a better option when you need flexibility
71 00:03:21.025 --> 00:03:23.365 and you need the model to actually make the decisions,
72 00:03:23.365 --> 00:03:26.565 you know, uh, so they'll be like, you ask a question, um,
73 00:03:26.865 --> 00:03:28.965 and you have a lot of users, they can ask many
74 00:03:28.965 --> 00:03:30.005 different types of questions.
75 00:03:30.915 --> 00:03:33.365 Then you can have the LLM decide, okay,
76 00:03:33.365 --> 00:03:35.085 this will be the action that I will do.
77 00:03:35.745 --> 00:03:38.485 Um, and this is when you then have an agent.
78 00:03:41.095 --> 00:03:42.715 And just a couple of examples here.
79 00:03:43.295 --> 00:03:46.875 Um, this is the base, basically what we work with.
80 00:03:47.455 --> 00:03:50.555 Uh, here you have the inputs that you see on the left, uh,
81 00:03:50.735 --> 00:03:52.515 and then the inputs, then you get to the LLM,
82 00:03:52.655 --> 00:03:56.275 and then the LLM is augmented, you know, so it's augmented
83 00:03:56.275 --> 00:03:57.555 with retrieval tools.
84 00:03:57.615 --> 00:04:00.195 So think of like a database here, for example,
85 00:04:00.855 --> 00:04:02.675 but then you can also add other tools.
86 00:04:02.895 --> 00:04:04.955 You can add, I don't know, web search.
87 00:04:05.095 --> 00:04:08.195 You can add integration to the different tools
88 00:04:08.195 --> 00:04:09.275 that you have in your company.
89 00:04:09.935 --> 00:04:12.475 Um, and then that way, you know, when you,
90 00:04:12.585 --> 00:04:15.435 when a user is asking a question maybe about, I don't know,
91 00:04:15.475 --> 00:04:18.435 a Notion page, you can have a tool that is calling Notion,
92 00:04:18.655 --> 00:04:20.195 and then it gives you back the results.
93 00:04:21.825 --> 00:04:23.965 You can also read and write memory.
94 00:04:24.505 --> 00:04:26.565 Uh, so that's usually also, uh,
95 00:04:26.565 --> 00:04:28.525 what you store in the vector database here.
96 00:04:29.145 --> 00:04:31.285 Um, and then once you have all of those,
97 00:04:31.425 --> 00:04:32.605 you're gonna have an output.
98 00:04:33.145 --> 00:04:35.245 And this is what we call an augmented LLM,
99 00:04:35.645 --> 00:04:38.805 and that's what we work with, uh, in most, most cases.
100 00:04:39.145 --> 00:04:41.165 Um, this is what I work with, uh,
101 00:04:41.165 --> 00:04:42.845 the customers in, yeah, in most cases.
102 00:04:45.105 --> 00:04:48.525 And to give you a better idea of what a workflow is, uh,
103 00:04:48.715 --> 00:04:50.925 this is a workflow that is called prompt chaining.
104 00:04:51.465 --> 00:04:55.965 So you have your input, this input will go through an LLM,
105 00:04:56.545 --> 00:04:58.605 and then you're gonna have an output of this LLM,
106 00:04:58.665 --> 00:05:00.405 and then you have some kind of gate, you know,
107 00:05:00.405 --> 00:05:02.605 that is like actually, um, programmed.
108 00:05:02.625 --> 00:05:03.885 So you're gonna write some code
109 00:05:04.265 --> 00:05:06.485 and you're gonna be like, okay, depending on the output
110 00:05:06.485 --> 00:05:09.205 that I have, um, I'm either gonna mark it
111 00:05:09.305 --> 00:05:11.045 as passing or failing.
112 00:05:11.555 --> 00:05:13.325 Then if you fail, then you know you're
113 00:05:13.325 --> 00:05:14.405 gonna exit this workflow.
114 00:05:14.825 --> 00:05:17.405 But if you pass, then you can, you know, call a second LLM
115 00:05:17.545 --> 00:05:19.565 and a third one and so on and,
116 00:05:19.585 --> 00:05:21.885 and so forth up until you're happy,
117 00:05:22.345 --> 00:05:24.325 and then you can give the results back to your user.
118 00:05:25.905 --> 00:05:27.245 So this is what we call prompt chaining.
119 00:05:27.305 --> 00:05:30.285 And then here, yeah, you, you will define the code yourself.
120 00:05:30.585 --> 00:05:33.245 Uh, it's not an agent that is making the decision.
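Editor's sketch: the prompt-chaining workflow described above can be outlined in a few lines of Python. This is a minimal illustration, not the speaker's code. `fake_llm`, `gate`, and `prompt_chain` are hypothetical names, and `fake_llm` only tags the text so the example runs without any model.

```python
from typing import Callable, Optional


def fake_llm(instruction: str, text: str) -> str:
    """Hypothetical stand-in for a real LLM call; it only tags the text."""
    return f"[{instruction}] {text}"


def gate(output: str) -> bool:
    """Hand-written check: the code, not the model, decides pass or fail."""
    return len(output.strip()) > 0 and "ERROR" not in output


def prompt_chain(
    user_input: str, llm: Callable[[str, str], str] = fake_llm
) -> Optional[str]:
    draft = llm("outline", user_input)   # first LLM call
    if not gate(draft):                  # programmed gate on the first output
        return None                      # fail: exit the workflow early
    expanded = llm("expand", draft)      # second LLM call
    return llm("polish", expanded)       # third LLM call, result goes to the user


print(prompt_chain("Describe a vector database"))
```

The order of the calls and the gate are fixed in code, which is what makes this a workflow rather than an agent.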
121 00:05:34.875 --> 00:05:37.005 Then you can also think of like having,
122 00:05:37.345 --> 00:05:38.525 you know, some kind of routing.
123 00:05:38.825 --> 00:05:42.485 Um, so here it's an agent that is like making the decision,
124 00:05:42.545 --> 00:05:44.485 but it's, you know, you have like defined,
125 00:05:44.485 --> 00:05:46.005 you're gonna specify the routes yourself.
126 00:05:46.665 --> 00:05:48.485 Um, so you have your input,
127 00:05:48.985 --> 00:05:51.285 and then again, you're gonna go through an LLM,
128 00:05:51.865 --> 00:05:55.205 and this LLM will then decide, okay, which route to take.
129 00:05:55.865 --> 00:05:59.605 And usually for that, for the LLM to understand that, you have
130 00:05:59.605 --> 00:06:02.805 to write quite a lengthy prompt, uh,
131 00:06:02.865 --> 00:06:06.085 and for this prompt to define, okay, if a user is asking,
132 00:06:06.405 --> 00:06:09.685 I don't know about how to get their money back, um,
133 00:06:09.755 --> 00:06:11.165 then we go into the LLM
134 00:06:11.165 --> 00:06:12.525 that is like in charge of, you know, support.
135 00:06:13.665 --> 00:06:17.165 If someone is asking, you know, how to get more credits,
136 00:06:17.515 --> 00:06:19.365 then maybe it's another LLM that is like, you know,
137 00:06:19.365 --> 00:06:20.685 adding credits or something.
138 00:06:21.305 --> 00:06:24.165 And then you're gonna have a usually quite a long prompt.
139 00:06:24.165 --> 00:06:27.725 And then you define different examples, um, on how it works.
140 00:06:28.265 --> 00:06:30.045 And then the routing system would understand it.
141 00:06:30.105 --> 00:06:33.365 And then when you have a query, it will then redirect it
142 00:06:33.365 --> 00:06:34.685 to the different LLMs you have.
143 00:06:36.195 --> 00:06:38.015 You may or may not combine the answers.
144 00:06:38.155 --> 00:06:40.575 Uh, it depends. In this example, you don't.
145 00:06:40.575 --> 00:06:43.495 You just, uh, give the result back from different calls.
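Editor's sketch: the routing workflow above, reduced to a runnable outline. The routes are specified by hand, and a classifier picks one per query. Here `classify` is a hypothetical keyword-based stand-in for the routing LLM (a real router would send the query plus a long prompt with examples to a model), and the three handlers are hypothetical placeholders for the downstream LLMs.

```python
from typing import Callable, Dict


def support_handler(query: str) -> str:
    return f"support: {query}"


def credits_handler(query: str) -> str:
    return f"credits: {query}"


def general_handler(query: str) -> str:
    return f"general: {query}"


# The routes are defined by you, not discovered by the model.
ROUTES: Dict[str, Callable[[str], str]] = {
    "support": support_handler,
    "credits": credits_handler,
    "general": general_handler,
}


def classify(query: str) -> str:
    """Stand-in for the routing LLM: returns the name of one route."""
    q = query.lower()
    if "refund" in q or "money back" in q:
        return "support"
    if "credit" in q:
        return "credits"
    return "general"


def route(query: str) -> str:
    # Unknown route names fall back to the general handler.
    handler = ROUTES.get(classify(query), general_handler)
    return handler(query)


print(route("How do I get my money back?"))
print(route("How can I get more credits?"))
```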
146 00:06:45.185 --> 00:06:48.885 And for agents, basically you always have, like,
147 00:06:49.065 --> 00:06:51.885 so you have the human here that is, you know, chatting, um,
148 00:06:52.275 --> 00:06:54.565 with the LLM, and then the LLM decides
149 00:06:54.565 --> 00:06:56.445 to take an action based on what is said.
150 00:06:56.905 --> 00:06:58.605 Uh, but it's not predefined usually.
151 00:06:58.705 --> 00:07:02.045 Or you can maybe define some actions that are, you know,
152 00:07:02.045 --> 00:07:05.125 that the LLM has access to, uh, and give it a description.
153 00:07:05.705 --> 00:07:07.285 Uh, but you don't really say, Hey,
154 00:07:07.785 --> 00:07:09.285 if you have that, then you do that.
155 00:07:10.145 --> 00:07:12.525 And then those actions are gonna, you know,
156 00:07:12.705 --> 00:07:15.165 evolve in an environment, whatever that is,
157 00:07:15.985 --> 00:07:17.205 and then you're gonna get feedback.
158 00:07:17.465 --> 00:07:20.045 Um, and then, yeah, the LLM is then, you know,
159 00:07:20.045 --> 00:07:22.405 like looking at this feedback from your environment,
160 00:07:22.465 --> 00:07:24.045 and then it's like, am I happy or not?
161 00:07:24.065 --> 00:07:25.365 Do I need to take more actions?
162 00:07:25.945 --> 00:07:27.925 Um, and yeah, so on and so forth.
163 00:07:28.595 --> 00:07:29.885 Once the LLM is happy
164 00:07:30.265 --> 00:07:33.245 and then the feedback is positive, then you're going,
165 00:07:33.555 --> 00:07:35.605 it's gonna be like, okay, I need to stop now.
166 00:07:35.745 --> 00:07:37.045 And then this one will stop,
167 00:07:37.045 --> 00:07:38.605 and then it gives you back the results.
168 00:07:39.385 --> 00:07:42.085 And this is what we call agents, basically,
169 00:07:42.345 --> 00:07:44.685 or at least by the definition from Anthropic, uh,
170 00:07:44.685 --> 00:07:45.765 which is the one I agree with.
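Editor's sketch: the agent loop described above (choose an action, act on an environment, read the feedback, decide whether to continue) in toy form. Everything here is hypothetical: `CounterEnv` is a made-up environment, `choose_action` stands in for the LLM's decision, and `max_steps` is a simple guardrail so the loop cannot run forever.

```python
from typing import Callable, Dict, List


class CounterEnv:
    """Toy environment: the agent must bring `value` to `target`."""

    def __init__(self, start: int, target: int) -> None:
        self.value = start
        self.target = target

    def feedback(self) -> int:
        # Signed distance still to cover; 0 means the task is done.
        return self.target - self.value


def increment(env: CounterEnv) -> None:
    env.value += 1


def decrement(env: CounterEnv) -> None:
    env.value -= 1


# Tools the "model" may call; no fixed order is prescribed.
TOOLS: Dict[str, Callable[[CounterEnv], None]] = {
    "increment": increment,
    "decrement": decrement,
}


def choose_action(feedback: int) -> str:
    """Stand-in for the LLM deciding the next action from the feedback."""
    if feedback == 0:
        return "stop"
    return "increment" if feedback > 0 else "decrement"


def run_agent(env: CounterEnv, max_steps: int = 10) -> List[str]:
    history: List[str] = []
    for _ in range(max_steps):          # guardrail: bounded number of steps
        action = choose_action(env.feedback())
        if action == "stop":            # the agent is satisfied and stops
            break
        TOOLS[action](env)              # act on the environment
        history.append(action)
    return history


env = CounterEnv(start=0, target=3)
print(run_agent(env), env.value)
```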
171 00:07:47.245 --> 00:07:50.345 And yeah, here you're gonna lose control.
172 00:07:50.765 --> 00:07:53.705 Um, you're gonna have to let your agents actually run.
173 00:07:54.205 --> 00:07:57.825 Um, so you ha like, it's very important to run it, you know,
174 00:07:57.825 --> 00:07:59.345 maybe in sandbox environments
175 00:07:59.685 --> 00:08:01.465 and to have some proper guardrails as well,
176 00:08:01.895 --> 00:08:05.025 because you don't want, you know, your agent maybe
177 00:08:05.045 --> 00:08:06.585 to delete your database
178 00:08:07.245 --> 00:08:09.425 or you don't want your agents, you know, to,
179 00:08:09.485 --> 00:08:10.745 to like, you know, make payments.
180 00:08:11.125 --> 00:08:13.065 And that's one thing we see.
181 00:08:13.185 --> 00:08:14.345 I don't know if you've, if you've heard,
182 00:08:14.365 --> 00:08:18.545 but, uh, MCP, so Model Context Protocol, uh, from Anthropic
183 00:08:18.645 --> 00:08:20.585 as well, um, is very popular
184 00:08:20.925 --> 00:08:24.025 and it allows, um, different, you know, LLMs
185 00:08:24.025 --> 00:08:27.785 to actually take actions in your environment.
186 00:08:28.445 --> 00:08:31.305 Um, and for Milvus, for example, we support it,
187 00:08:31.325 --> 00:08:35.265 but, uh, we didn't add the deletion of databases there
188 00:08:35.375 --> 00:08:36.665 because it would be too risky.
189 00:08:36.765 --> 00:08:39.705 And then it's likely, um, that you wanna do
190 00:08:39.705 --> 00:08:41.665 that yourself instead of letting an agent do it.
191 00:08:42.895 --> 00:08:44.595 So, like, those are the things you have
192 00:08:44.595 --> 00:08:45.795 to take into account.
193 00:08:47.495 --> 00:08:50.355 And yes, when to use agents.
194 00:08:50.655 --> 00:08:53.395 So use them for like open-ended problems
195 00:08:53.395 --> 00:08:55.765 where you don't really know what's gonna happen, you know,
196 00:08:55.765 --> 00:08:56.765 when it's like difficult
197 00:08:56.985 --> 00:08:58.845 to predict the required number of steps.
198 00:08:59.665 --> 00:09:01.805 And you also can't really hardcode a fixed path.
199 00:09:02.395 --> 00:09:05.295 And as I've said, use sandbox environments if possible,
200 00:09:05.555 --> 00:09:06.735 and some guardrails as well.
201 00:09:07.155 --> 00:09:08.295 Uh, that will allow you
202 00:09:08.295 --> 00:09:09.975 to stay in control of what's happening.
203 00:09:12.855 --> 00:09:14.795 And for the people that are not familiar, maybe
204 00:09:14.825 --> 00:09:17.315 with vector search and vector databases,
205 00:09:17.315 --> 00:09:20.835 because this is what I will show as well, uh, later on, is
206 00:09:20.835 --> 00:09:24.675 that Browser Use will, you know, browse the internet,
207 00:09:24.775 --> 00:09:26.715 but then it will actually store everything into Milvus,
208 00:09:26.815 --> 00:09:29.675 and then I will be able to then search through everything
209 00:09:29.675 --> 00:09:31.875 that has been, uh, going through Browser Use.
210 00:09:32.855 --> 00:09:36.155 So just quickly, so vectors unlock unstructured data.
211 00:09:36.355 --> 00:09:38.715 Unstructured data, for the people that are not familiar,
212 00:09:39.025 --> 00:09:41.515 it's, you know, everything going from images to video,
213 00:09:42.155 --> 00:09:46.675 documents, audio, um, if not more. You put them through
214 00:09:46.675 --> 00:09:47.835 what we call an embedding model.
215 00:09:48.895 --> 00:09:51.955 And then you're gonna generate some embeddings through that.
216 00:09:52.455 --> 00:09:54.635 You store those directly in your vector database.
217 00:09:55.295 --> 00:09:57.275 And then after that, once you have that,
218 00:09:57.545 --> 00:10:00.115 then you can perform search, you know, through, um,
219 00:10:00.215 --> 00:10:01.955 for example, retrieval-augmented generation,
220 00:10:01.975 --> 00:10:04.915 so RAG, or, you know, recommendation systems,
221 00:10:05.335 --> 00:10:07.955 or you can also search, um, text, image, audio.
222 00:10:08.865 --> 00:10:12.195 Also very useful for drug discovery and anomaly detection.
223 00:10:14.015 --> 00:10:16.955 And how it works is that you project all the vectors into
224 00:10:16.955 --> 00:10:18.075 what we call a vector space,
225 00:10:18.775 --> 00:10:21.355 and things that are semantically similar together, um,
226 00:10:21.375 --> 00:10:22.835 to each other will then be close.
227 00:10:23.535 --> 00:10:26.715 Um, you can see it here, for example, for the image
228 00:10:26.855 --> 00:10:27.875 of the banana
229 00:10:28.215 --> 00:10:30.395 and the text banana, you can see
230 00:10:30.395 --> 00:10:31.555 that they're very close together.
231 00:10:32.495 --> 00:10:35.715 And same, you know, for like the animals here,
232 00:10:35.815 --> 00:10:38.795 if you have the text dog and the text cat,
233 00:10:38.975 --> 00:10:40.275 and then the image of a cat,
234 00:10:40.275 --> 00:10:42.355 they're also gonna be fairly similar and fairly close.
235 00:10:42.855 --> 00:10:45.835 And you have to imagine that all those blue points here are
236 00:10:45.965 --> 00:10:47.515 other vectors, you know, they have,
237 00:10:47.545 --> 00:10:48.835 they have other meanings.
238 00:10:49.605 --> 00:10:51.025 So they are like, you know, spread around.
239 00:10:51.965 --> 00:10:54.905 Of course here it's an example in 3D, I mean,
240 00:10:54.935 --> 00:10:56.105 it's in three dimensions.
241 00:10:56.325 --> 00:10:58.305 But, uh, you have to imagine that, you know,
242 00:10:58.305 --> 00:11:02.745 in real life you're usually more like 512, 1024, um,
243 00:11:02.965 --> 00:11:06.905 or if not more, the latest of Gemini is quite, uh,
244 00:11:06.905 --> 00:11:08.385 4,000 if I remember correctly.
245 00:11:08.965 --> 00:11:12.105 Uh, but yes, that's how then you have an idea of
246 00:11:12.105 --> 00:11:16.455 what is similar to what, and then how does it work.
247 00:11:16.955 --> 00:11:18.775 Um, so you have your unstructured data,
248 00:11:19.535 --> 00:11:20.975 transform them into vectors, store
249 00:11:20.975 --> 00:11:22.455 them directly in your vector database.
250 00:11:22.455 --> 00:11:24.775 Then you have your query again, you put it
251 00:11:24.895 --> 00:11:25.935 through the same embedding model.
252 00:11:26.475 --> 00:11:28.535 So then you get one vector embedding,
253 00:11:29.075 --> 00:11:32.135 and then you're gonna perform, um, nearest neighbor, uh,
254 00:11:32.135 --> 00:11:34.655 similarity search, and then you get your results.
255 00:11:35.075 --> 00:11:38.455 And this is why we built Milvus, it's
256 00:11:38.455 --> 00:11:39.535 to basically help you do that.
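Editor's sketch: the store-then-search flow above, shown with hand-made 3-dimensional vectors and a brute-force nearest-neighbor search using cosine similarity. The vectors in `STORE` are invented for illustration and do not come from any embedding model. Real embeddings have hundreds or thousands of dimensions, and a vector database uses indexes rather than scanning every vector.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings": semantically similar items sit close together.
STORE: Dict[str, List[float]] = {
    "text: banana": [0.90, 0.10, 0.00],
    "image: banana": [0.85, 0.15, 0.05],
    "text: dog": [0.10, 0.90, 0.20],
    "text: cat": [0.15, 0.85, 0.30],
    "image: cat": [0.20, 0.80, 0.35],
}


def search(query_vector: List[float], top_k: int = 2) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the best top_k."""
    scored = [
        (label, cosine_similarity(query_vector, vector))
        for label, vector in STORE.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# A query vector near the banana cluster returns both banana entries.
for label, score in search([0.88, 0.12, 0.02]):
    print(label, round(score, 3))
```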
257 00:11:41.375 --> 00:11:43.755 So Milvus is an open source vector database.
258 00:11:44.135 --> 00:11:46.555 Um, we'd like to say that we're easy to start.
259 00:11:47.295 --> 00:11:49.155 Uh, so you can say that you can start
260 00:11:49.155 --> 00:11:51.675 with a pip install on your laptop directly, uh,
261 00:11:51.695 --> 00:11:53.885 and then you can run it, you know, directly in your code,
262 00:11:53.945 --> 00:11:55.325 or you can run it in a notebook.
263 00:11:56.305 --> 00:11:59.485 And then you can push to production also directly.
264 00:11:59.505 --> 00:12:03.845 You just have to change the URI of, um, the Milvus client.
265 00:12:04.145 --> 00:12:05.245 And then you can, like, you know,
266 00:12:05.345 --> 00:12:07.725 if you have it running somewhere on AWS
267 00:12:07.725 --> 00:12:10.165 or Google Cloud, uh, then it's possible to
268 00:12:10.875 --> 00:12:12.205 push to production directly.
269 00:12:13.945 --> 00:12:16.725 And we also support a lot of different features.
270 00:12:17.225 --> 00:12:18.925 So, you know, like there's vector search,
271 00:12:19.465 --> 00:12:21.645 but I'll talk a bit more, uh, about it later on.
272 00:12:21.645 --> 00:12:24.365 We don't only support vector search, uh,
273 00:12:24.465 --> 00:12:25.925 we also support full-text search.
274 00:12:27.025 --> 00:12:28.675 Then we also have, you know, some indexes
275 00:12:28.675 --> 00:12:32.955 that are like disk-based, disk-based indexes, sorry, uh,
276 00:12:32.955 --> 00:12:35.595 what else, dynamic schemas, uh, I can think of.
277 00:12:35.615 --> 00:12:38.395 So like float, binary, and sparse vectors.
278 00:12:40.575 --> 00:12:42.675 And yeah, we have like a lot of different features.
279 00:12:43.515 --> 00:12:45.935 So it's quite nice if you wanna have, uh,
280 00:12:46.205 --> 00:12:48.015 a very powerful vector database.
281 00:12:49.795 --> 00:12:51.775 And we support different features,
282 00:12:51.775 --> 00:12:52.975 but we're also ready to scale.
283 00:12:53.275 --> 00:12:54.535 So we have Milvus Lite,
284 00:12:54.535 --> 00:12:56.775 which is the one you run when you do pip install
285 00:12:56.915 --> 00:13:00.015 pymilvus. This one scales to about a million vectors.
286 00:13:00.565 --> 00:13:02.615 Then we have Milvus Standalone, uh,
287 00:13:02.625 --> 00:13:05.015 which is bundled directly in a single Docker image.
288 00:13:05.525 --> 00:13:07.855 This one supports primary and secondary,
289 00:13:07.855 --> 00:13:09.935 and you can scale up to about a hundred million vectors.
290 00:13:11.015 --> 00:13:14.865 Then if you have, uh, lots of vectors, we have Milvus Distributed.
291 00:13:15.095 --> 00:13:18.345 This one runs on Kubernetes, uh, you have a load balancer
292 00:13:18.445 --> 00:13:19.625 and multi-node management,
293 00:13:20.205 --> 00:13:22.465 and this one scales to about a hundred billion vectors.
294 00:13:24.645 --> 00:13:27.905 So yes, I might have said it already, uh, before,
295 00:13:27.965 --> 00:13:30.385 but Milvus is now more than just vectors.
296 00:13:31.085 --> 00:13:33.465 Um, because our vision is really more than that.
297 00:13:33.805 --> 00:13:35.545 Uh, we believe that, you know, the future
298 00:13:35.605 --> 00:13:38.625 of search is combining different search techniques.
299 00:13:39.285 --> 00:13:42.305 So first is semantic search through, you know, like using,
300 00:13:42.565 --> 00:13:44.465 um, embedding model and dense embeddings,
301 00:13:45.845 --> 00:13:49.065 but then keyword search, you know, which is kind
302 00:13:49.065 --> 00:13:51.185 of an old technique where you actually search
303 00:13:51.205 --> 00:13:53.345 for exact keywords and exact words,
304 00:13:54.205 --> 00:13:55.745 and then filtering on top of that.
305 00:13:55.885 --> 00:13:57.505 So you can also filter through data
306 00:13:58.045 --> 00:14:00.025 and in particular through metadata filtering.
307 00:14:00.325 --> 00:14:02.345 So you can remove a lot of data
308 00:14:02.345 --> 00:14:04.345 that you know you are really not interested in.
309 00:14:05.325 --> 00:14:09.065 And by combining those in one unified platform, then
310 00:14:09.325 --> 00:14:11.625 that's the best way basically to get the best results.
311 00:14:12.285 --> 00:14:13.465 And it's very, very important
312 00:14:13.465 --> 00:14:16.465 because agents, when they go on the internet, you know,
313 00:14:16.465 --> 00:14:17.825 they retrieve a lot of data.
314 00:14:18.425 --> 00:14:20.225 Whatever agents are doing, most
315 00:14:20.225 --> 00:14:21.825 of the time it's actually retrieving data.
316 00:14:22.885 --> 00:14:27.125 And the good retrieval is key to the success of the agent.
317 00:14:28.585 --> 00:14:31.005 You see it as well with, you know, OpenAI
318 00:14:31.185 --> 00:14:33.685 and different, uh, tools when they have like, you know,
319 00:14:33.685 --> 00:14:36.405 the deep search or deep research, um,
320 00:14:36.825 --> 00:14:38.925 the quality is gonna be in the retrieval of the data.
321 00:14:39.825 --> 00:14:41.605 So yes, for us, we believe that, you know,
322 00:14:41.605 --> 00:14:43.125 we can combine all of those.
323 00:14:44.225 --> 00:14:48.205 And before you always, you know, if you wanted
324 00:14:48.225 --> 00:14:49.885 to have keyword search
325 00:14:50.025 --> 00:14:52.685 and vector search, um, then you had
326 00:14:52.685 --> 00:14:55.525 to have two separate systems, one vector database
327 00:14:55.525 --> 00:14:58.765 for semantic search and something like Elasticsearch
328 00:14:58.785 --> 00:15:00.365 or similar for keyword search.
329 00:15:01.385 --> 00:15:03.485 And that was nice, but then, you know, it was working,
330 00:15:03.625 --> 00:15:05.965 it was beautiful, but then you also have some kind
331 00:15:05.965 --> 00:15:09.055 of complex architecture through it.
332 00:15:09.715 --> 00:15:12.975 Um, so now we just decided that we would do it ourselves.
333 00:15:13.075 --> 00:15:15.815 And then we provide full-text search,
334 00:15:15.815 --> 00:15:17.015 basically running through Milvus.
335 00:15:18.195 --> 00:15:19.655 And I mentioned it before,
336 00:15:20.155 --> 00:15:22.335 but it allows you to augment the search quality
337 00:15:22.955 --> 00:15:24.775 of embedding based, uh, semantic search,
338 00:15:25.555 --> 00:15:27.975 and then it's also gonna provide search
339 00:15:27.975 --> 00:15:30.375 with more emphasis on the keyword matching.
340 00:15:30.375 --> 00:15:34.255 So think of like, I don't know, you're working for Adidas
341 00:15:34.555 --> 00:15:37.535 and then you're working like at, maybe at Amazon,
342 00:15:37.715 --> 00:15:39.935 and then your users, you know, they're gonna look for
343 00:15:40.835 --> 00:15:42.975 Adidas shoes, and then you're gonna return at the beginning,
344 00:15:42.975 --> 00:15:44.455 you're gonna return, you know, Nike shoes
345 00:15:44.635 --> 00:15:45.815 or like other brands.
346 00:15:46.435 --> 00:15:50.495 And that's because if you only run a semantic search, semantically,
347 00:15:50.635 --> 00:15:52.375 you know, like those brands are very, very close
348 00:15:52.375 --> 00:15:54.415 to each other because they kind of sell the same things.
349 00:15:55.155 --> 00:15:57.095 So that's why then you have like,
350 00:15:57.095 --> 00:15:58.415 keyword search can be very useful
351 00:15:58.725 --> 00:16:01.215 because then it will really have like, you know,
352 00:16:01.275 --> 00:16:04.295 an emphasis on Adidas instead of like the other brands.
353 00:16:04.955 --> 00:16:06.495 So you will have better results
354 00:16:06.555 --> 00:16:07.895 and then the users will be happier.
355 00:16:10.515 --> 00:16:11.975 And our approach is
356 00:16:11.975 --> 00:16:14.255 that you can basically forget about vectors.
357 00:16:14.475 --> 00:16:16.375 Um, the way to do it,
358 00:16:16.435 --> 00:16:18.575 and that's why it's so nice actually, is
359 00:16:18.575 --> 00:16:19.935 that you're gonna have your text
360 00:16:20.395 --> 00:16:22.655 and you're gonna insert it directly into vis,
361 00:16:22.795 --> 00:16:25.135 and then you're gonna search using the same text.
362 00:16:25.675 --> 00:16:27.975 So you don't have to think about the embedding models,
363 00:16:28.515 --> 00:16:30.375 and you don't have to think, you know, about like, oh,
364 00:16:30.415 --> 00:16:31.895 I get vectors, what do I do with that?
365 00:16:32.235 --> 00:16:34.295 You are gonna deal with text, um,
366 00:16:34.525 --> 00:16:36.575 like when you insert it and when you search it.
367 00:16:37.735 --> 00:16:39.895 'cause we take care of, you know, tokenizing the text
368 00:16:40.595 --> 00:16:43.215 and then managing the distribution, uh,
369 00:16:43.215 --> 00:16:45.855 because actually it's not that simple, uh, to do that.
370 00:16:46.955 --> 00:16:48.855 And then, you know, like make sure
371 00:16:48.855 --> 00:16:49.975 that we can encode actually
372 00:16:49.975 --> 00:16:51.295 everything into vectors if needed.
373 00:16:52.155 --> 00:16:54.495 Um, and then we're gonna score everything on BM25.
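Editor's sketch: a textbook BM25 scorer in plain Python, to show why keyword scoring favors the exact brand name in the Adidas example. This is the standard Okapi BM25 formula with common default parameters (`k1=1.5`, `b=0.75`). It is not a description of how Milvus implements full-text search internally, and the whitespace tokenizer is a deliberate simplification.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Simplified tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for doc in tokenized:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates via k1 and is normalized by length via b.
            denom = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = ["adidas running shoes", "nike running shoes", "adidas football jersey"]
for doc, score in zip(docs, bm25_scores("adidas shoes", docs)):
    print(f"{score:.3f}  {doc}")
```

Only the first document contains both query terms, so it scores highest. A purely semantic search could rank the Nike shoes just as close.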
374 00:16:56.695 --> 00:16:57.795 And this is what it looks like.
375 00:16:58.255 --> 00:17:00.515 You have to think like, you know, it gets your text data
376 00:17:01.135 --> 00:17:04.395 and then Milvus is taking care of analyzing the text,
377 00:17:04.395 --> 00:17:06.395 including some functions if you want to,
378 00:17:06.815 --> 00:17:08.635 and then creating the different embeddings.
379 00:17:08.735 --> 00:17:11.595 Um, and so when you query the text, um,
380 00:17:11.665 --> 00:17:13.275 then you don't have to think of anything.
381 00:17:13.275 --> 00:17:14.395 We'll just do everything for you.
382 00:17:17.015 --> 00:17:19.435 But enough about, enough about those.
383 00:17:19.735 --> 00:17:21.195 Um, so what are we gonna build today?
384 00:17:22.055 --> 00:17:23.555 Um, I've mentioned it,
385 00:17:24.495 --> 00:17:27.395 and it actually, the idea comes from a problem that I have.
386 00:17:27.695 --> 00:17:30.235 Um, so for the people that are not familiar
387 00:17:30.235 --> 00:17:33.675 with it, Milvus is obviously, uh, the name of the vector database,
388 00:17:33.775 --> 00:17:35.235 but it's also the name of a bird.
389 00:17:35.895 --> 00:17:40.005 Um, and if you go on X, for example,
390 00:17:40.105 --> 00:17:42.285 and you search about Milvus, you can see some,
391 00:17:42.305 --> 00:17:45.125 you can see some things like, you know, a tweet
392 00:17:45.185 --> 00:17:48.325 by LangChain, which is, you know, about our graph agent
393 00:17:48.865 --> 00:17:51.605 and you know, what we did with Neo4j, which was, um,
394 00:17:51.905 --> 00:17:53.205 you know, a collaboration with them.
395 00:17:53.825 --> 00:17:55.565 And that's like, cool, okay, I'm happy
396 00:17:55.565 --> 00:17:58.885 because as a dev advocate, I wanna talk to our users.
397 00:17:59.645 --> 00:18:01.325 I wanna see what people are building with Milvus.
398 00:18:02.345 --> 00:18:04.925 But then you can see a very cool picture,
399 00:18:05.055 --> 00:18:06.605 don't get me wrong, uh,
400 00:18:06.625 --> 00:18:09.285 but then it's like, you know, it's a black kite,
401 00:18:09.285 --> 00:18:11.765 and then that's just like over, you know, a building.
402 00:18:12.585 --> 00:18:15.245 And if you search, you're gonna see a lot of those
403 00:18:15.245 --> 00:18:17.885 because Milvus, um, is very popular for birds,
404 00:18:18.465 --> 00:18:19.845 um, is a popular bird.
405 00:18:19.985 --> 00:18:22.405 And those ones, I don't care about them, and Twitter
406 00:18:22.505 --> 00:18:25.365 or X doesn't really provide a filter.
407 00:18:25.785 --> 00:18:28.085 So then if I get all the tweets, you know, then um,
408 00:18:28.235 --> 00:18:30.405 then I'm gonna have problems, then it's gonna be annoying.
409 00:18:31.665 --> 00:18:34.885 So the idea, um, is that I want a bot, I want
410 00:18:34.885 --> 00:18:39.365 to use agents basically to build an AI that is like smarter.
411 00:18:39.785 --> 00:18:41.525 So it can, can browse my socials
412 00:18:41.625 --> 00:18:44.365 and it can browse like different websites without me
413 00:18:44.365 --> 00:18:45.485 having to check it.
414 00:18:45.665 --> 00:18:49.005 Uh, and without me, you know, having to then clean the data
415 00:18:49.265 --> 00:18:53.085 and remove, you know, pictures of the birds, remove pictures
416 00:18:53.785 --> 00:18:55.885 of cameras as well, because also very popular.
417 00:18:56.625 --> 00:19:00.755 And so yes, so we're gonna combine visual understanding,
418 00:19:01.415 --> 00:19:02.835 uh, with context awareness,
419 00:19:03.535 --> 00:19:05.275 and then the agent, you know,
420 00:19:05.275 --> 00:19:07.635 the assistant actually will know the difference between, um,
421 00:19:07.775 --> 00:19:11.035 the birds and the new article about Milvus, you know.
422 00:19:11.575 --> 00:19:14.435 And then what I want as well in the future is that
423 00:19:15.305 --> 00:19:17.835 then this assistant will be able to tell me, uh,
424 00:19:17.835 --> 00:19:20.475 what our users are talking about, uh, if they're happy
425 00:19:20.475 --> 00:19:22.395 with Milvus, if they need help on something.
426 00:19:23.015 --> 00:19:26.435 And that way we'll be able to filter out all the pictures,
427 00:19:26.655 --> 00:19:27.715 all the things about birds.
428 00:19:28.415 --> 00:19:31.315 Um, but that's just the idea I have for the demo.
429 00:19:31.735 --> 00:19:33.555 But you'll see it in action later on.
430 00:19:33.895 --> 00:19:37.595 You can really go on any website, um, you can do a lot
431 00:19:37.595 --> 00:19:40.315 of things and you can let the agent, uh, handle
432 00:19:40.985 --> 00:19:42.475 your, uh, web browser.
433 00:19:42.525 --> 00:19:47.475 Sorry. So what I'm gonna use for the tech stack here,
434 00:19:48.975 --> 00:19:50.255 I mentioned it a couple of times already.
435 00:19:50.675 --> 00:19:53.535 Uh, I'm gonna use Browser Use, which is an open source, um,
436 00:19:54.155 --> 00:19:56.975 system, which enables AI agents to control your browser.
437 00:19:57.715 --> 00:20:01.255 Uh, you can also let it use a new browser if you,
438 00:20:01.315 --> 00:20:02.655 if you want, if you're not comfortable.
439 00:20:03.355 --> 00:20:06.535 Um, and it has some nice features like multiple tab
440 00:20:06.535 --> 00:20:08.935 management, so it'll know which tab to go where.
441 00:20:09.595 --> 00:20:11.535 Um, also what's very nice
442 00:20:12.075 --> 00:20:15.535 is it supports vision plus HTML extraction.
443 00:20:16.315 --> 00:20:18.255 So let's say you go on a website
444 00:20:18.915 --> 00:20:23.215 and you only have an image that is embedded in an HTML tag,
445 00:20:23.235 --> 00:20:26.575 you know, then if you were to extract HTML there,
446 00:20:27.085 --> 00:20:28.255 it's not very useful, you know,
447 00:20:28.255 --> 00:20:30.335 you're just gonna have the tag of an image.
448 00:20:30.715 --> 00:20:34.055 So then we're using a vision language model here to then,
449 00:20:34.055 --> 00:20:35.855 you know, understand what's happening on the image.
450 00:20:36.395 --> 00:20:38.735 Um, and this is what Browser Use is doing.
451 00:20:38.765 --> 00:20:41.575 Basically, it's gonna be able to understand what's happening
452 00:20:41.575 --> 00:20:44.815 through, um, vision language models and through HTML.
453 00:20:45.935 --> 00:20:49.405 You can also define some actions, uh, some custom actions
454 00:20:49.975 --> 00:20:51.725 where you have to then define the prompts
455 00:20:51.865 --> 00:20:53.965 and you define what you wanna have.
456 00:20:54.755 --> 00:20:56.085 It's also self-correcting.
457 00:20:56.185 --> 00:20:57.845 Uh, you'll see later on in my terminal,
458 00:20:58.065 --> 00:20:59.765 but you know, it'd be like, okay,
459 00:21:00.145 --> 00:21:01.485 the agent is doing different steps
460 00:21:02.025 --> 00:21:04.125 and then it's checking if the steps are successful.
461 00:21:04.345 --> 00:21:06.885 If they're not, then it's gonna try to fix that.
462 00:21:07.505 --> 00:21:09.525 Um, and then, yeah, do different actions.
463 00:21:10.065 --> 00:21:12.445 And it supports different LLMs, so you are not like,
464 00:21:12.465 --> 00:21:14.445 not stuck, uh, with anything.
465 00:21:14.465 --> 00:21:15.885 You can just use whatever you want.
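[Editor's sketch] The self-correcting behavior described above (run a step, evaluate whether it succeeded, try again if it did not) can be pictured as a small retry loop. This is a minimal illustration in plain Python, not Browser Use's actual implementation; `StepResult`, `run_with_retries`, and `flaky_navigation` are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StepResult:
    success: bool
    note: str


@dataclass
class AgentRun:
    # Human-readable log of every attempt, similar in spirit to the terminal output.
    history: List[str] = field(default_factory=list)


def run_with_retries(action: Callable[[int], StepResult], max_attempts: int = 3) -> AgentRun:
    """Run one step, evaluate it, and retry until it succeeds or attempts run out."""
    run = AgentRun()
    for attempt in range(1, max_attempts + 1):
        result = action(attempt)
        run.history.append(f"attempt {attempt}: {result.note}")
        if result.success:
            break
    return run


def flaky_navigation(attempt: int) -> StepResult:
    # Hypothetical step: pretend the page is still blank on the first try.
    if attempt == 1:
        return StepResult(False, "page is blank")
    return StepResult(True, "navigated successfully")


run = run_with_retries(flaky_navigation)
for line in run.history:
    print(line)
```

In a real agent the evaluation would come from the LLM looking at the page, not from a hard-coded rule.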
466 00:21:18.315 --> 00:21:20.765 Then I'm gonna use Gemini Flash 2.0.
467 00:21:21.305 --> 00:21:23.605 Um, it's natively multimodal.
468 00:21:23.835 --> 00:21:26.725 It has very strong performance, uh, on multimodal,
469 00:21:26.735 --> 00:21:27.765 multimodal tasks.
470 00:21:28.185 --> 00:21:30.325 And it's really good at following instructions,
471 00:21:30.615 --> 00:21:31.685 which is very important
472 00:21:32.235 --> 00:21:33.805 because the prompts, uh,
473 00:21:34.105 --> 00:21:36.445 are gonna be quite long and quite advanced.
474 00:21:36.585 --> 00:21:37.925 So you need something that is really,
475 00:21:37.925 --> 00:21:39.325 really good at following instructions.
476 00:21:40.155 --> 00:21:41.725 Also, what's nice, the input, um,
477 00:21:41.895 --> 00:21:44.925 token limit is 1 million, so it's actually quite long.
478 00:21:45.545 --> 00:21:47.885 And it supports text, images, audio and video.
479 00:21:48.225 --> 00:21:51.765 So this is also nice if you go on different websites, um,
480 00:21:51.945 --> 00:21:53.365 and you want to be able, you know,
481 00:21:53.365 --> 00:21:55.445 to actually understand everything that is happening there.
482 00:21:58.295 --> 00:22:00.155 I'm also combining it, combining it with,
483 00:22:00.415 --> 00:22:01.475 uh, structured outputs.
484 00:22:01.895 --> 00:22:06.235 Um, so when you use, uh, LLMs and you use function calling
485 00:22:06.255 --> 00:22:08.835 and you use all of those things, um,
486 00:22:09.015 --> 00:22:10.275 you're gonna have outputs
487 00:22:10.455 --> 00:22:12.315 and then you can define those outputs
488 00:22:12.315 --> 00:22:13.675 to be something specific, you know.
489 00:22:13.675 --> 00:22:16.675 So then instead of, you know, having text that you have
490 00:22:16.675 --> 00:22:19.395 to parse, uh, then you can generate, for example,
491 00:22:19.395 --> 00:22:21.875 either a JSON or a Pydantic object,
492 00:22:22.605 --> 00:22:24.315 which is gonna make your life way easier.
493 00:22:24.815 --> 00:22:27.355 Um, because then you can have, you know, type safety.
494 00:22:27.695 --> 00:22:29.235 So you don't need to like validate
495 00:22:29.535 --> 00:22:31.035 or retry, you know, something
496 00:22:31.035 --> 00:22:32.555 that is like not correctly formatted.
497 00:22:33.855 --> 00:22:36.075 Pydantic would do that for you automatically. For example, JSON,
498 00:22:36.255 --> 00:22:39.875 but with Pydantic, when you define the class, you can also say,
499 00:22:39.975 --> 00:22:43.155 Hey, I want this attribute to be of value
500 00:22:43.155 --> 00:22:44.875 between zero and 10, for example.
501 00:22:45.215 --> 00:22:46.595 And then if you have a value of 20,
502 00:22:47.015 --> 00:22:48.515 then it could raise an error for you.
503 00:22:48.515 --> 00:22:49.795 So you don't have to do that yourself.
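[Editor's sketch] To make the "value between zero and 10" idea concrete, here is the same check written by hand with a standard-library dataclass. It is a stand-in, not the speaker's code: in Pydantic the constraint is declared on the field (for example `Field(ge=0, le=10)`) and the library raises the error for you. `TweetRating`, `relevance`, and `try_build` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TweetRating:
    author: str
    relevance: int  # hypothetical attribute, expected to be between 0 and 10

    def __post_init__(self) -> None:
        # Hand-written range check; Pydantic performs this kind of check automatically.
        if not 0 <= self.relevance <= 10:
            raise ValueError(f"relevance must be between 0 and 10, got {self.relevance}")


def try_build(relevance: int) -> str:
    """Build a rating and report whether the value was accepted."""
    try:
        TweetRating(author="example_user", relevance=relevance)
        return "ok"
    except ValueError as exc:
        return f"rejected: {exc}"


print(try_build(7))
print(try_build(20))
```

A value of 7 is accepted, while 20 is rejected at construction time, so the bad value never reaches later steps.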
504 00:22:50.295 --> 00:22:51.395 And it is very, very handy
505 00:22:51.465 --> 00:22:55.305 because then you can really, you know, be in control also.
506 00:22:55.305 --> 00:22:56.985 Well, it's basically what I said,
507 00:22:56.985 --> 00:22:58.625 but like the explicit refusals.
508 00:22:59.365 --> 00:23:01.185 Um, so you know, they can like do
509 00:23:01.185 --> 00:23:04.825 that programmatically, uh, instead
510 00:23:04.825 --> 00:23:07.185 of like having to do it, you know, like in parsing things
511 00:23:07.245 --> 00:23:08.625 and being complicated.
512 00:23:09.965 --> 00:23:12.985 And you can also have prompting that is a bit easier.
513 00:23:13.845 --> 00:23:15.345 You don't have, you know, to define
514 00:23:15.345 --> 00:23:16.665 and to do like few-shot prompting.
515 00:23:16.885 --> 00:23:20.225 Um, when you're using an LLM, you're gonna be like,
516 00:23:20.225 --> 00:23:22.465 Hey, please return it in JSON, um,
517 00:23:22.965 --> 00:23:26.265 or please, uh, return it in this Pydantic object.
518 00:23:26.805 --> 00:23:29.745 Um, and then the LLM will try, um, to do that.
519 00:23:29.925 --> 00:23:32.065 So then like the prompt gets easier and easier.
520 00:23:32.365 --> 00:23:36.915 So that's nice. And well, this is an example with Mistral,
521 00:23:37.135 --> 00:23:40.515 uh, but the idea is the same where you get, uh,
522 00:23:40.615 --> 00:23:42.595 on the left you have the JSON. So this is
523 00:23:42.595 --> 00:23:44.195 how you define the structured output for JSON.
524 00:23:44.375 --> 00:23:47.325 So you, in this example, you call a Mistral,
525 00:23:47.685 --> 00:23:49.005 a Mistral model with a client.
526 00:23:50.665 --> 00:23:52.725 And then you say you're just asking a question, you know,
527 00:23:52.725 --> 00:23:54.325 what is the best French meal?
528 00:23:54.465 --> 00:23:55.645 And then you say, return the name
529 00:23:55.785 --> 00:23:57.965 and the ingredients in a short JSON object.
530 00:23:58.665 --> 00:24:01.245 So this is your prompt. And then we're gonna,
531 00:24:01.675 --> 00:24:03.725 when you're going, when you're gonna make the completion,
532 00:24:03.725 --> 00:24:06.725 sorry, then you define the response format here,
533 00:24:07.025 --> 00:24:09.525 and then you say, please, type, uh, JSON object.
534 00:24:10.315 --> 00:24:13.085 Then the model has been trained, so then the model knows how
535 00:24:13.085 --> 00:24:14.605 to do that and knows what JSON is.
536 00:24:15.105 --> 00:24:16.925 Um, so then, yeah, you don't have, you know,
537 00:24:16.945 --> 00:24:18.565 to explain what JSON is.
538 00:24:18.565 --> 00:24:19.925 You don't have to create like, you know,
539 00:24:19.925 --> 00:24:21.605 the dictionary and everything.
540 00:24:21.625 --> 00:24:25.965 The model will understand. On the right side, uh, this is for Pydantic.
541 00:24:26.625 --> 00:24:30.125 So we're declaring a class here, uh, which is the book class
542 00:24:30.255 --> 00:24:32.805 where you have a name, uh, that is a string,
543 00:24:32.805 --> 00:24:34.525 and then authors, which is a list of string,
544 00:24:35.505 --> 00:24:36.965 and it's a bit different,
545 00:24:36.985 --> 00:24:39.685 but the idea is the same, um, is that you have,
546 00:24:39.785 --> 00:24:43.205 you define your client and then you're gonna, you know,
547 00:24:43.205 --> 00:24:44.205 give the different messages.
548 00:24:44.225 --> 00:24:45.845 So the system, uh, prompt,
549 00:24:45.895 --> 00:24:47.765 which is just extract the books information.
550 00:24:49.275 --> 00:24:52.015 And then here we're passing one message from a user,
551 00:24:52.025 --> 00:24:53.215 which is, you know,
552 00:24:53.295 --> 00:24:56.535 I recently read To Kill a Mockingbird by Harper Lee.
553 00:24:56.995 --> 00:24:58.975 Uh, and then here you define the response
554 00:24:58.975 --> 00:25:00.095 format, actually that you want.
555 00:25:00.435 --> 00:25:04.375 So on the left we had the JSON object, which is defined here.
556 00:25:05.075 --> 00:25:09.175 And on the right we have Book, which is the Pydantic class, uh,
557 00:25:09.175 --> 00:25:10.175 that we've defined here.
558 00:25:10.875 --> 00:25:13.455 And this is then how you can have better outputs.
559 00:25:13.595 --> 00:25:15.215 Um, and you know, like in your code,
560 00:25:15.455 --> 00:25:16.455 actually things can make sense
561 00:25:17.155 --> 00:25:19.735 and it's very, very important for agents, uh, because
562 00:25:19.735 --> 00:25:22.895 otherwise if you have a mistake at the beginning, um,
563 00:25:23.155 --> 00:25:25.055 or like a format that is wrong at the beginning,
564 00:25:25.565 --> 00:25:27.455 then it's gonna get worse and worse and worse.
565 00:25:27.675 --> 00:25:29.735 So you really want to be still in control,
566 00:25:29.735 --> 00:25:31.295 even though they're gonna be independent.
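[Editor's sketch] The point about catching a wrong format at the beginning can be illustrated by turning the model's reply into a typed object right away and failing loudly if the shape is off. This uses only the standard library as a stand-in for the Pydantic `Book` class on the slide (a name that is a string, authors that are a list of strings). The reply text and the `parse_book` helper are hypothetical; no model is actually called.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Book:
    name: str
    authors: List[str]


def parse_book(raw: str) -> Book:
    """Parse a model reply into a Book, failing early if the shape is wrong."""
    data = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    name = data["name"]
    authors = data["authors"]
    if not isinstance(name, str):
        raise TypeError("name must be a string")
    if not isinstance(authors, list) or not all(isinstance(a, str) for a in authors):
        raise TypeError("authors must be a list of strings")
    return Book(name=name, authors=authors)


# Hypothetical reply text, shaped like the Book example described in the talk.
reply = '{"name": "To Kill a Mockingbird", "authors": ["Harper Lee"]}'
book = parse_book(reply)
print(book)
```

If the first step of an agent hands back something that does not parse, the error surfaces here and does not propagate through the later steps.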
567 00:25:33.975 --> 00:25:35.555 And I'm gonna use Milvus, of course.
568 00:25:35.935 --> 00:25:38.675 Um, so as I said, you know, you can do a pip install
569 00:25:38.675 --> 00:25:39.715 to install it on your laptop
570 00:25:40.215 --> 00:25:42.955 and then create collections, um, and play with it.
571 00:25:44.565 --> 00:25:46.265 And this is the architecture we'll have.
572 00:25:47.045 --> 00:25:49.385 So we'll have the user query, um,
573 00:25:49.575 --> 00:25:51.265 then we're gonna use browser use,
574 00:25:51.635 --> 00:25:53.305 which is then using Gemini as well.
575 00:25:54.045 --> 00:25:57.955 And we are then gonna have Browser Use
576 00:25:57.955 --> 00:25:59.915 that's doing autonomous web browsing, Gemini,
577 00:25:59.915 --> 00:26:01.795 that's gonna do the multimodal processing.
578 00:26:02.135 --> 00:26:03.995 And then we're gonna store everything into Milvus.
579 00:26:04.575 --> 00:26:06.635 And there, there we're gonna be able
580 00:26:06.635 --> 00:26:08.675 to do like vector plus full-text search,
581 00:26:08.855 --> 00:26:10.475 and then we're gonna build a whole RAG system.
582 00:26:11.295 --> 00:26:13.035 And yeah, we should hopefully
583 00:26:13.825 --> 00:26:15.155 have something that is quite nice.
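[Editor's sketch] The store-then-search step of this architecture can be pictured with a tiny in-memory stand-in. This is not Milvus and does not use its API: the bag-of-words `embed` function replaces a real embedding model, and `ToyVectorStore` is a hypothetical class that only shows the insert and similarity-search flow. The sample texts are made up.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy bag-of-words vector, normalized to unit length (stand-in for a real model)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    # Both vectors are unit length, so the dot product equals the cosine similarity.
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


class ToyVectorStore:
    def __init__(self) -> None:
        self._rows: List[Tuple[str, Dict[str, float]]] = []

    def insert(self, text: str) -> None:
        self._rows.append((text, embed(text)))

    def search(self, query: str, limit: int = 1) -> List[str]:
        q = embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [text for text, _ in ranked[:limit]]


store = ToyVectorStore()
store.insert("new vector database release with hybrid search")
store.insert("agent framework adds browser automation")
top = store.search("vector database search")
print(top)
```

In the pipeline described in the talk, the rows would be the filtered tweets and the retrieved texts would be passed to the LLM as context for the RAG answer.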
584 00:26:16.415 --> 00:26:20.595 Uh, let me share, uh, so I have
585 00:26:22.125 --> 00:26:25.825 my code that is running, um, here, I'm just checking.
586 00:26:26.085 --> 00:26:28.705 So it's gonna open the web UI you have here.
587 00:26:29.565 --> 00:26:31.905 So this is what you see, uh,
588 00:26:32.015 --> 00:26:34.345 when you're gonna use the web UI of browser use.
589 00:26:35.655 --> 00:26:36.715 You don't have to use this one,
590 00:26:36.735 --> 00:26:38.115 but that's just very, very handy.
591 00:26:39.135 --> 00:26:42.435 Um, and let me make sure it's free zoomed in.
592 00:26:44.585 --> 00:26:46.285 Um, so yeah, you can control everything.
593 00:26:46.505 --> 00:26:48.485 And then you're gonna define different
594 00:26:48.485 --> 00:26:49.805 settings for your agents.
595 00:26:50.025 --> 00:26:53.165 So you can define, you know, the default one or custom ones.
596 00:26:53.385 --> 00:26:54.645 I'm going for the custom one.
597 00:26:55.025 --> 00:26:57.085 Uh, I'll show you a bit later, uh, what's happening
598 00:26:57.085 --> 00:26:59.125 behind the scene, but you're gonna see all the prompts.
599 00:27:00.035 --> 00:27:02.245 Then you're gonna define different steps, you know,
600 00:27:02.425 --> 00:27:05.325 how many steps do you wanna have maximum, um,
601 00:27:05.465 --> 00:27:07.165 and then maximum actions per step.
602 00:27:07.905 --> 00:27:10.205 And here I'm enabling vision
603 00:27:10.235 --> 00:27:12.285 because I actually want to use vision language models.
604 00:27:15.005 --> 00:27:17.225 You can define different providers, as I said.
605 00:27:17.325 --> 00:27:18.545 So, you know, if you want to go
606 00:27:18.725 --> 00:27:20.625 for Claude models, it's possible.
607 00:27:21.605 --> 00:27:24.425 Um, but on my end, I'm gonna go for Google, uh,
608 00:27:24.605 --> 00:27:26.585 and I'm gonna go for Gemini 2.0.
609 00:27:27.405 --> 00:27:29.985 You then define the temperature as well.
610 00:27:30.885 --> 00:27:34.305 And here this is where, uh, you have a bit of magic.
611 00:27:34.885 --> 00:27:37.785 So I'm gonna say that I'm gonna use my own browser.
612 00:27:38.565 --> 00:27:40.625 So I'm using Arc at the moment,
613 00:27:40.685 --> 00:27:42.545 but then it's gonna open Chrome, uh,
614 00:27:42.545 --> 00:27:43.905 because this is the one I've defined.
615 00:27:44.605 --> 00:27:46.545 But then you also don't have to do that.
616 00:27:46.565 --> 00:27:48.785 So if I were to not do that,
617 00:27:48.785 --> 00:27:49.985 then it would open Chromium.
618 00:27:50.565 --> 00:27:52.425 And that way, you know, you, it's better.
619 00:27:52.425 --> 00:27:54.145 Like if you want to keep, uh, something
620 00:27:54.145 --> 00:27:55.665 that would be different, you know, like you want to,
621 00:27:55.665 --> 00:27:57.065 don't want to use your browser everywhere.
622 00:27:57.965 --> 00:28:01.385 But by doing that, what's really nice is that then the LLM,
623 00:28:01.445 --> 00:28:03.265 you know, is connected to your websites already,
624 00:28:03.725 --> 00:28:05.825 and it's connected to social networks,
625 00:28:05.825 --> 00:28:08.305 and it's connected to, you know, like everything, uh,
626 00:28:08.305 --> 00:28:11.705 you're already connected with your own account, so
627 00:28:11.705 --> 00:28:14.185 that way you also don't get really detected, uh,
628 00:28:14.365 --> 00:28:15.465 by security systems
629 00:28:15.465 --> 00:28:19.025 because if you had to try to do it, you know, uh,
630 00:28:19.135 --> 00:28:20.825 through the terminal, usually like,
631 00:28:20.895 --> 00:28:22.425 then they're gonna block you and everything.
632 00:28:22.425 --> 00:28:26.025 Whereas here, it's actually your, your Google Chrome.
633 00:28:26.725 --> 00:28:29.065 Um, so then it's actually gonna be way nicer,
634 00:28:30.915 --> 00:28:33.015 and this is where you run the agent.
635 00:28:34.875 --> 00:28:38.415 So actually, uh, I'll just show you what's happening.
636 00:28:38.475 --> 00:28:41.735 So I have a description of the task that you see here
637 00:28:42.715 --> 00:28:46.535 and here you're like, okay, go to x.com
638 00:28:46.535 --> 00:28:49.175 and then search for Milvus that we have here,
639 00:28:49.915 --> 00:28:51.895 and then we're gonna go through the recent search.
640 00:28:53.315 --> 00:28:54.775 And then I'm like, okay, find the tweets
641 00:28:54.775 --> 00:28:55.895 that talk about Milvus,
642 00:28:56.635 --> 00:28:58.935 and then I'm like, I only want to have tweets
643 00:28:58.935 --> 00:29:00.815 that talk about that, the vector database.
644 00:29:01.435 --> 00:29:03.535 And then I also really insist I'm like,
645 00:29:03.955 --> 00:29:07.055 do not include tweets about birds, photography, cameras
646 00:29:07.915 --> 00:29:08.935 or Canon the brand,
647 00:29:08.935 --> 00:29:10.855 because somehow it's also very, uh, popular,
648 00:29:11.635 --> 00:29:12.695 uh, for those tweets.
649 00:29:13.115 --> 00:29:15.615 And then I'm like, okay, please return it in a JSON format
650 00:29:16.145 --> 00:29:17.735 where we want to have the tweet text,
651 00:29:17.795 --> 00:29:19.415 the tweet and the tweet author.
652 00:29:21.175 --> 00:29:23.035 And now, well, it's gonna be the moment
653 00:29:23.035 --> 00:29:24.035 that's gonna be interesting.
654 00:29:24.095 --> 00:29:27.075 If I'm gonna run my agent, I have to likely
655 00:29:28.135 --> 00:29:30.905 it's gonna open Google Chrome, boop boop boop.
656 00:29:30.905 --> 00:29:31.905 Let me put it here.
657 00:29:32.485 --> 00:29:33.785 So I open Google Chrome,
658 00:29:36.285 --> 00:29:39.065 and then now it's like, you'll see my cursor will not move.
659 00:29:39.525 --> 00:29:42.165 Um, so it's browsing Twitter,
660 00:29:42.875 --> 00:29:44.565 then it's gonna check everything
661 00:29:44.565 --> 00:29:45.765 that is happening in real time.
662 00:29:46.745 --> 00:29:49.845 Um, it, yeah, I was gonna say, it may scroll actually.
663 00:29:50.425 --> 00:29:53.165 So it's also scrolling to find more tweets, more events.
664 00:29:54.145 --> 00:29:57.085 Um, and then if I have a look,
665 00:29:58.395 --> 00:29:59.565 okay, now it seems to be over.
666 00:30:00.265 --> 00:30:02.125 No, it's not over. Okay, it's continuing.
667 00:30:02.745 --> 00:30:04.245 Uh, if I have a look here,
668 00:30:04.705 --> 00:30:06.205 let me just show you what we have.
669 00:30:07.385 --> 00:30:08.565 Uh, okay.
670 00:30:09.115 --> 00:30:13.005 It's, um, so this is when I started, you know, the web UI,
671 00:30:14.065 --> 00:30:16.485 and then this is, uh, the task I gave.
672 00:30:17.445 --> 00:30:19.215 It's like, okay, only I want to have tweets,
673 00:30:19.515 --> 00:30:21.095 and then here we can see the different steps.
674 00:30:22.305 --> 00:30:25.645 So it's like, okay, um, the page is blank.
675 00:30:25.785 --> 00:30:28.485 So then, you know, the model is actually telling, um,
676 00:30:28.675 --> 00:30:30.805 telling back, the agent's already telling
677 00:30:30.805 --> 00:30:32.165 the LLM what's happening as well.
678 00:30:32.625 --> 00:30:34.605 So it's like, okay, we have a blank page,
679 00:30:35.145 --> 00:30:36.605 so we have no previous action.
680 00:30:37.555 --> 00:30:40.485 Then this task requires me to go to a specific URL.
681 00:30:41.305 --> 00:30:43.325 So then it, it provides some plans, you know,
682 00:30:43.325 --> 00:30:44.445 that it's gonna do in the future.
683 00:30:44.985 --> 00:30:47.245 So it could be like, okay, we're gonna go
684 00:30:47.245 --> 00:30:50.565 to this specified URL, then a second.
685 00:30:50.725 --> 00:30:52.645 Secondly, we're gonna extract the tweets from the page
686 00:30:53.185 --> 00:30:54.605 and filter out, you know,
687 00:30:54.605 --> 00:30:56.245 like the tweets that are not relevant.
688 00:30:57.155 --> 00:31:00.085 Then we're gonna format the tweet, uh, in a specific way.
689 00:31:00.905 --> 00:31:03.245 And yeah, so then we'll have the plans now.
690 00:31:03.625 --> 00:31:05.925 So now let's execute, you know, it's like, okay, I need
691 00:31:05.925 --> 00:31:06.925 to navigate to X.
692 00:31:07.705 --> 00:31:09.445 Um, and then this is the first action.
693 00:31:09.825 --> 00:31:13.405 We go to the URL here, and then we like,
694 00:31:13.665 --> 00:31:15.365 and then we go to the second action.
695 00:31:16.185 --> 00:31:19.845 So then we're gonna check, we're gonna make an evaluation.
696 00:31:20.075 --> 00:31:22.685 Okay, we actually navigated successfully there.
697 00:31:23.225 --> 00:31:24.845 Um, so we are happy, or at least,
698 00:31:24.905 --> 00:31:26.045 you know, the agent is happy.
699 00:31:26.665 --> 00:31:28.845 And then it's like, you know, adding new things
700 00:31:28.845 --> 00:31:30.285 to the memory as well of the agent.
701 00:31:30.745 --> 00:31:34.645 So it's like, okay, we can see we have results about Milvus,
702 00:31:35.295 --> 00:31:36.765 there are tweets also from Milvus
703 00:31:36.765 --> 00:31:38.925 and LangChain also, they are like, you know,
704 00:31:38.925 --> 00:31:41.725 different tweets and there's one from LangChain
705 00:31:41.725 --> 00:31:43.405 that's talking about deep research agents,
706 00:31:44.385 --> 00:31:45.605 and you can see you get the idea.
707 00:31:47.305 --> 00:31:51.845 Uh, and then here, this is where the LLM then decided,
708 00:31:52.545 --> 00:31:53.925 um, to scroll down
709 00:31:53.995 --> 00:31:56.565 because it was like, okay, I don't have
710 00:31:56.635 --> 00:31:57.725 that many tweets yet.
711 00:31:57.945 --> 00:31:59.165 Uh, so let me scroll down
712 00:31:59.165 --> 00:32:01.205 so I can find more, uh, more details.
713 00:32:01.865 --> 00:32:03.045 And then that's what is happening.
714 00:32:03.425 --> 00:32:04.925 I'm just gonna skip some parts.
715 00:32:05.065 --> 00:32:08.845 But then we see, uh, that we have, you know,
716 00:32:08.845 --> 00:32:11.845 like some tweets that are actually, uh, extracted.
717 00:32:12.385 --> 00:32:14.045 So you can see the tweet, uh,
718 00:32:14.265 --> 00:32:17.005 and then you see the author, uh, that is here, it's Milvus
719 00:32:17.005 --> 00:32:18.765 and then the at, um,
720 00:32:18.905 --> 00:32:20.685 and then it's, you know, exactly what we have.
721 00:32:20.785 --> 00:32:23.565 And then here is the same, uh, we have the author,
722 00:32:23.785 --> 00:32:26.165 we have the at, and then we have the tweet.
723 00:32:26.505 --> 00:32:31.245 And so this is, um, the way I wanted it, I wanted you to,
724 00:32:31.325 --> 00:32:33.645 I wanted to see it, um, presented that way,
725 00:32:35.345 --> 00:32:37.325 and then hopefully yes, okay, cool.
726 00:32:37.785 --> 00:32:39.805 And then here, uh, we can see
727 00:32:39.805 --> 00:32:41.685 that it completed the task successfully.
728 00:32:42.505 --> 00:32:45.725 So if I go back to Chrome, you can see
729 00:32:45.725 --> 00:32:47.485 that now all the boxes are gone.
730 00:32:48.185 --> 00:32:49.565 Um, so this is good.
731 00:32:49.565 --> 00:32:51.725 That's the way I know at least that it's like, you know,
732 00:32:51.755 --> 00:32:52.885 it's, uh, done.
733 00:32:54.145 --> 00:32:58.685 And then if I go back to my UI, now,
734 00:32:58.835 --> 00:33:00.885 what it's gonna do, uh, is
735 00:33:00.885 --> 00:33:04.525 that it inserted all those results directly into Milvus,
736 00:33:06.025 --> 00:33:08.205 uh, which is the, like the Milvus, the
737 00:33:08.205 --> 00:33:09.365 vector database this time.
738 00:33:09.865 --> 00:33:12.325 And now I can ask different questions, you know, like, uh,
739 00:33:12.505 --> 00:33:16.745 what's, uh, with Milvus,
740 00:33:18.295 --> 00:33:20.195 so we can ask, we're gonna search,
741 00:33:20.655 --> 00:33:24.155 and then if I go back, yeah, we can see, you know,
742 00:33:25.095 --> 00:33:28.675 we are running the search now directly into Milvus
743 00:33:29.335 --> 00:33:33.985 and then, cool, yes, now we can see,
744 00:33:34.285 --> 00:33:37.305 you know, all the tweets, uh, that we saw that, you know,
745 00:33:37.305 --> 00:33:39.025 like the model, uh, went through
746 00:33:39.025 --> 00:33:41.465 and then browser use went through, uh, we're like, okay,
747 00:33:41.465 --> 00:33:43.785 we have LangChain that is using Milvus.
748 00:33:44.135 --> 00:33:47.665 Then uh, we have an MCP implementation myself,
749 00:33:48.285 --> 00:33:49.345 uh, talking about it.
750 00:33:49.725 --> 00:33:52.785 Uh, and then, you know, we have like, um,
751 00:33:53.185 --> 00:33:55.185 Milvus also shared a talk, you know,
752 00:33:55.265 --> 00:33:56.825 about like decoding iCal search.
753 00:33:58.125 --> 00:33:59.705 And so we can see then we have the summary,
754 00:34:01.105 --> 00:34:03.805 and this is basically what you can do, you know, this is,
755 00:34:04.385 --> 00:34:06.285 uh, what is possible.
756 00:34:06.475 --> 00:34:09.885 Also, just to show you here, there are
757 00:34:10.635 --> 00:34:12.005 some tweets about pictures.
758 00:34:12.355 --> 00:34:15.685 Like you can see it here, like this one is, you know, Canon
759 00:34:15.685 --> 00:34:17.725 and blah, blah, blah, but this one actually has not been
760 00:34:17.725 --> 00:34:20.325 added to Milvus because then, uh, we filtered it.
761 00:34:20.745 --> 00:34:23.605 And the, so now when I asked about it, you know,
762 00:34:23.605 --> 00:34:25.205 it's only about the vector database
763 00:34:25.465 --> 00:34:28.445 and there's nothing about the birds or anything.
764 00:34:29.425 --> 00:34:32.485 And uh, yes, the code also is here,
765 00:34:32.485 --> 00:34:33.885 but I think it's gonna be shared.
766 00:34:35.145 --> 00:34:38.725 And I think that's it. Yes, that's it.
767 00:34:39.145 --> 00:34:41.245 I'm gonna take questions now. Thank you.
768 00:34:46.555 --> 00:34:47.655 Uh, yes.
769 00:34:50.445 --> 00:34:53.695 Let me, so one question,
770 00:34:54.605 --> 00:34:56.015 I'll take it directly.
771 00:34:57.435 --> 00:35:00.415 Can this be done without browser use with just API
772 00:35:00.435 --> 00:35:01.455 for doing the research?
773 00:35:01.915 --> 00:35:06.135 Uh, and wouldn't, yes. Oh, hey, ACHI, you're back.
774 00:35:07.065 --> 00:35:08.685 Hi, uh, sorry.
775 00:35:08.825 --> 00:35:10.445 You can go ahead and answer the first question.
776 00:35:10.505 --> 00:35:12.725 Go ahead. Yes. Um, so can
777 00:35:12.725 --> 00:35:13.925 this be done without browser use?
778 00:35:13.925 --> 00:35:16.365 Just with API doing the research? Uh, yeah, you could.
779 00:35:16.485 --> 00:35:19.525 I mean, the reason why I went for it is the API
780 00:35:19.525 --> 00:35:21.005 of Twitter is insanely expensive.
781 00:35:21.225 --> 00:35:23.405 Um, so I figured that I could just do that
782 00:35:23.405 --> 00:35:24.805 and then it would be, uh, cheaper.
783 00:35:25.825 --> 00:35:27.405 Um, and then
784 00:35:28.065 --> 00:35:30.885 you see me doing Browser Use through the UI.
785 00:35:31.665 --> 00:35:34.965 Uh, but the way it runs, like, I use the UI mostly
786 00:35:35.025 --> 00:35:37.380 for the demo, but the way it runs is like you define the
787 00:35:37.380 --> 00:35:40.125 task in your code, then it's gonna run
788 00:35:40.425 --> 00:35:41.525 and it's gonna run, you know,
789 00:35:41.675 --> 00:35:43.685 like gonna create your Google Chrome for example,
790 00:35:44.345 --> 00:35:45.965 and then you don't really see it, you know,
791 00:35:45.965 --> 00:35:47.485 it's just gonna give you back some data.
792 00:35:48.625 --> 00:35:51.745 So that would be like a good way and
793 00:35:51.745 --> 00:35:53.305 otherwise, yeah, you can of course use,
794 00:35:53.365 --> 00:35:54.705 uh, API based tooling.
795 00:35:55.405 --> 00:35:59.785 Um, but it's just like for, in this case, um, I know the API
796 00:35:59.785 --> 00:36:00.865 of Twitter is very expensive.
797 00:36:01.485 --> 00:36:03.585 Um, so that's why I went for this one.
798 00:36:05.355 --> 00:36:07.145 Okay. Um,
799 00:36:07.365 --> 00:36:10.505 and then I don't know if they, you, did you answer his, uh,
800 00:36:10.505 --> 00:36:12.385 their follow up question? Uh, if
801 00:36:12.545 --> 00:36:14.105 I, I just answered the follow up question. Okay,
802 00:36:14.105 --> 00:36:15.105 Great. Um,
803 00:36:15.105 --> 00:36:16.945 and then the next question is,
804 00:36:16.945 --> 00:36:19.265 what would the flow be if I'm collecting personal
805 00:36:19.265 --> 00:36:20.345 data about the end user?
806 00:36:21.005 --> 00:36:23.465 Uh, then they gave an example, user logs in
807 00:36:23.525 --> 00:36:25.545 and the system records data of his
808 00:36:25.545 --> 00:36:27.985 or her searches and da mm-hmm.
809 00:36:28.405 --> 00:36:30.305 For example, data is collected about shoes
810 00:36:30.325 --> 00:36:32.905 or clothing, data is retained for future references,
811 00:36:33.535 --> 00:36:36.865 shoe size clothes for male or female, et cetera.
812 00:36:38.155 --> 00:36:42.495 Mm-hmm. Uh, b if you're collecting personal data,
813 00:36:43.315 --> 00:36:46.815 so if the, like from,
814 00:36:49.955 --> 00:36:52.175 uh, can you maybe refine the question
815 00:36:52.175 --> 00:36:53.415 depending on which side you are on?
816 00:36:53.435 --> 00:36:55.175 Is it like the Browser Use side?
817 00:36:55.395 --> 00:36:58.455 So you go and you are like wondering how it works
818 00:36:58.835 --> 00:37:02.255 or if it's like you are a product or a company
819 00:37:02.275 --> 00:37:03.495 and then you are collecting data.
820 00:37:04.495 --> 00:37:05.295 'cause depending on this one,
821 00:37:05.295 --> 00:37:06.415 it's gonna be a different answer.
822 00:37:07.465 --> 00:37:11.195 Yeah. Um, I, you're anonymous, but please, uh, clarify.
823 00:37:11.855 --> 00:37:12.855 Oh, end user.
824 00:37:14.605 --> 00:37:16.145 I'm the end user login. Yeah.
825 00:37:16.145 --> 00:37:19.705 Then in that case, um, if they collect data about you,
826 00:37:19.895 --> 00:37:22.625 then if you're using your own browser with your own data,
827 00:37:23.405 --> 00:37:25.025 um, then it's gonna be the same, you know,
828 00:37:25.025 --> 00:37:26.505 they're just gonna see you like Twitter.
829 00:37:26.535 --> 00:37:29.505 When I go on Twitter, then they just see me
830 00:37:29.615 --> 00:37:31.065 with using my account, you know?
831 00:37:31.445 --> 00:37:32.945 Um, so that's the way,
832 00:37:32.945 --> 00:37:35.305 then they basically don't really know.
833 00:37:35.415 --> 00:37:38.425 Also, browser use is smart enough that it's like
834 00:37:39.525 --> 00:37:41.705 not too fast when it browses the internet, you know?
835 00:37:41.705 --> 00:37:44.585 So like sometimes that's the way you get, uh, spotted
836 00:37:44.585 --> 00:37:46.545 as robots is you're gonna be like
837 00:37:46.545 --> 00:37:47.625 in milliseconds to do actions.
838 00:37:47.855 --> 00:37:50.065 Whereas this one actually, because it's clicking around
839 00:37:50.165 --> 00:37:54.505 and scrolling and things, it's a way to, to like really
840 00:37:55.245 --> 00:37:58.965 not be detected, uh, smart.
841 00:38:01.505 --> 00:38:03.285 Uh, yeah.
842 00:38:04.915 --> 00:38:07.095 Um, so this person's asking
843 00:38:08.195 --> 00:38:12.615 is Browser Use basically a smart LLM-type Selenium scraping,
844 00:38:13.235 --> 00:38:16.375 but, and would they not need to use costly APIs?
845 00:38:16.375 --> 00:38:17.575 Like replace it with?
846 00:38:17.795 --> 00:38:20.695 Uh, yeah, it can be. It's uh, it's really a way of like,
847 00:38:21.515 --> 00:38:23.135 you know, sometimes they don't have APIs
848 00:38:23.595 --> 00:38:26.695 or, you know, you wanna do something that is very specific,
849 00:38:26.695 --> 00:38:28.215 that might be visual, that is not offered
850 00:38:28.215 --> 00:38:29.455 by the API, for example.
851 00:38:30.115 --> 00:38:31.575 Uh, they, there's like,
852 00:38:31.795 --> 00:38:33.575 it can be really useful with browser use.
853 00:38:35.135 --> 00:38:37.305 Okay. And then someone asked in the chat,
854 00:38:37.605 --> 00:38:41.985 how does the extraction work from X using a, using VLM?
855 00:38:42.145 --> 00:38:43.145 VLM? Yeah.
856 00:38:44.045 --> 00:38:47.265 Yes. So if you've seen, um,
857 00:38:48.845 --> 00:38:53.225 the when, uh, Browser Use opened my Twitter account,
858 00:38:53.225 --> 00:38:54.665 then you had all those colorful boxes.
859 00:38:55.485 --> 00:38:58.705 Uh, and this is basically, it's looking at the boxes,
860 00:38:58.845 --> 00:38:59.865 so the HTML content,
861 00:39:00.565 --> 00:39:02.745 but then it's also like the VLM as the moment.
862 00:39:02.745 --> 00:39:05.745 It's also like giving screenshots basically
863 00:39:05.765 --> 00:39:07.105 of what's happening on the webpage.
864 00:39:07.645 --> 00:39:09.945 Um, and then it's gonna extract the contents
865 00:39:09.945 --> 00:39:11.145 based on the screenshots.
866 00:39:12.195 --> 00:39:14.465 Sorry. Um, and so then that's the, um,
867 00:39:14.645 --> 00:39:15.745 that's the way you see it.
868 00:39:16.815 --> 00:39:19.395 That's the way the VLM sorry is, is being used here.
869 00:39:21.175 --> 00:39:23.395 And you can think of also if you have videos
870 00:39:23.575 --> 00:39:26.715 or something, then, um, it's also possible
871 00:39:26.715 --> 00:39:29.195 to then send those and then the VLM will also be able
872 00:39:29.195 --> 00:39:31.995 to process them like, uh, Gemma 3 for example,
873 00:39:32.115 --> 00:39:33.435 that was released yesterday by Google.
874 00:39:33.775 --> 00:39:35.715 Um, can process videos and stuff.
875 00:39:39.375 --> 00:39:41.545 They're saying Thank you for the helpful replies.
876 00:39:42.005 --> 00:39:46.185 Um, and uh, one other question, I'm not sure if this is,
877 00:39:46.345 --> 00:39:47.865 I think it's just a clarifying question.
878 00:39:48.585 --> 00:39:51.625 Personalization in one db, then pointers to the other DB
879 00:39:52.165 --> 00:39:53.305 vector DB? I'm not
880 00:39:53.305 --> 00:39:55.425 Sure, I'm not sure about this one.
881 00:39:55.425 --> 00:39:59.225 What you mean is like from, maybe let me check the,
882 00:39:59.285 --> 00:40:02.185 the answers before, but if you're collecting data
883 00:40:02.285 --> 00:40:05.505 and stuff, um, I guess that's related to this one, uh, yeah,
884 00:40:05.505 --> 00:40:07.505 then you would have, you know,
885 00:40:07.505 --> 00:40:09.625 data collected about shoes or personalization.
886 00:40:09.965 --> 00:40:13.295 Uh, you can have them directly in, um,
887 00:40:14.115 --> 00:40:16.135 in one database if you want to, or a collection
888 00:40:16.555 --> 00:40:18.615 and have, where's the other one?
889 00:40:19.035 --> 00:40:21.935 Uh, personalization in one database or one collection,
890 00:40:21.935 --> 00:40:25.055 and then the pointers to the vector database if you want to,
891 00:40:25.115 --> 00:40:27.575 but I'm not exactly sure of
892 00:40:27.575 --> 00:40:28.735 what you mean with this question.
893 00:40:29.525 --> 00:40:33.775 Yeah, feel free to put, um, a follow up question if, uh,
894 00:40:33.795 --> 00:40:36.335 if you have one on that, and then Stefan can answer.
895 00:40:36.435 --> 00:40:40.175 But, uh, the next question is how could we make a projection
896 00:40:40.175 --> 00:40:43.575 of the cost of the LLM/VLM API?
897 00:40:44.265 --> 00:40:46.915 Yeah, so it's gonna depend of course on your LLM,
898 00:40:46.915 --> 00:40:50.795 it's gonna, you have to check the amount of, uh,
899 00:40:51.135 --> 00:40:52.195 tokens you're gonna send.
900 00:40:52.855 --> 00:40:55.235 Uh, Browser Use is sending lots of tokens.
901 00:40:55.655 --> 00:40:57.635 Um, so that's like the problem, you know,
902 00:40:57.635 --> 00:41:00.555 like you pay per input token versus the output tokens.
903 00:41:01.095 --> 00:41:03.395 Uh, but input tokens are very, very cheap though.
904 00:41:03.455 --> 00:41:05.755 So, so you have to check, for example,
905 00:41:05.755 --> 00:41:07.755 what I'm gonna answer the other question, uh,
906 00:41:07.755 --> 00:41:10.795 that I was asked, which is which VLM I'm using, uh,
907 00:41:11.435 --> 00:41:15.395 I am using Gemini 2.0, um, Gemini 2.0 Flash, sorry.
908 00:41:16.135 --> 00:41:20.195 Um, and the price per million token is very, very low.
909 00:41:21.175 --> 00:41:23.315 So then you have to estimate, you know,
910 00:41:23.315 --> 00:41:25.835 how many tokens you would send depending on
911 00:41:25.835 --> 00:41:27.075 which website you would browse.
912 00:41:27.175 --> 00:41:30.475 And then you can have a rough idea, uh, of the price.
913 00:41:31.055 --> 00:41:34.685 Uh, but yeah, so you have those, yeah, you also,
914 00:41:35.905 --> 00:41:38.445 you can have caching as well if you wanna reduce the price,
915 00:41:38.745 --> 00:41:41.925 uh, but it's gonna depend on which websites you
916 00:41:41.925 --> 00:41:43.165 are navigating to.
917 00:41:43.785 --> 00:41:47.405 Um, so yeah, those, um, like tricky questions.
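[Editor's sketch] The estimate described above (tokens sent per step, times the number of browsing steps, times the per-million-token price) can be written down roughly as follows. This is only an illustration: the `Pricing` class, the `estimate_run_cost` function, and all prices and token counts are hypothetical placeholders, not actual Gemini 2.0 Flash rates or Browser Use measurements. The caching discount is likewise an assumed parameter.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """Placeholder prices in USD per one million tokens (not real rates)."""
    input_per_million: float
    output_per_million: float


def estimate_run_cost(
    steps: int,
    input_tokens_per_step: int,
    output_tokens_per_step: int,
    pricing: Pricing,
    cached_fraction: float = 0.0,   # share of input tokens served from a cache
    cached_discount: float = 0.0,   # 0.75 means cached tokens cost 25% of normal
) -> float:
    """Rough cost of one agent run: every step sends input and receives output."""
    input_tokens = steps * input_tokens_per_step
    output_tokens = steps * output_tokens_per_step

    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    # Cached tokens are billed at a reduced rate; uncached ones at the full rate.
    billable_input = uncached + cached * (1.0 - cached_discount)

    input_cost = billable_input * pricing.input_per_million / 1_000_000
    output_cost = output_tokens * pricing.output_per_million / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    placeholder = Pricing(input_per_million=0.10, output_per_million=0.40)
    cost = estimate_run_cost(20, 10_000, 500, placeholder)
    print(f"estimated cost per run: ${cost:.4f}")
```

Measuring the real token counts on a few representative websites and plugging them in is what turns this into a usable projection, since the page content drives the input size.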
918 00:41:49.165 --> 00:41:50.475 Thank you all for your questions.
919 00:41:50.625 --> 00:41:52.155 Keep asking in the Q&A tool.
920 00:41:52.415 --> 00:41:54.355 I'm gonna ask a quick question here. So, mm-hmm.
921 00:41:54.575 --> 00:41:57.755 Why are you using Gemini models instead of OpenAI? Yeah,
922 00:41:58.615 --> 00:42:00.275 The first one is, uh, it's very cheap.
923 00:42:00.735 --> 00:42:03.315 Uh, so they are like really, really good at making it cheap.
924 00:42:03.515 --> 00:42:04.955 I still don't know how Google can do that,
925 00:42:05.015 --> 00:42:07.955 but, uh, that's, um, a good way.
926 00:42:08.055 --> 00:42:11.115 And then they support, uh, different modalities.
927 00:42:11.295 --> 00:42:13.075 So they support images, they support,
928 00:42:14.065 --> 00:42:16.035 they just added support to video as well.
929 00:42:16.375 --> 00:42:18.315 Um, and those things.
930 00:42:18.315 --> 00:42:19.795 So they're just like very, very convenient.
931 00:42:20.375 --> 00:42:24.715 Uh, and the input, um, the context, the input context,
932 00:42:24.715 --> 00:42:26.635 sorry, is also very long, uh,
933 00:42:26.635 --> 00:42:29.675 which is very good when you process a lot of data then.
934 00:42:30.195 --> 00:42:31.415 So that's the the reason.
935 00:42:32.645 --> 00:42:35.915 Thank you. Um, thank you all for your questions.
936 00:42:36.115 --> 00:42:37.675 I see another one here. Mm-hmm.
937 00:42:37.755 --> 00:42:39.515 I noticed the feature multimodal.
938 00:42:39.855 --> 00:42:41.835 Uh, does that mean I have the ability
939 00:42:41.855 --> 00:42:44.715 to create multiple autonomous agents, one for video, another
940 00:42:44.715 --> 00:42:45.795 for texts, et cetera?
941 00:42:46.735 --> 00:42:48.515 Uh, yeah, you can if you want to.
942 00:42:48.975 --> 00:42:52.995 Uh, so that's one of the slides I've shown at the beginning.
943 00:42:54.455 --> 00:42:57.475 Let me show you. There was a lot of slides.
944 00:42:58.365 --> 00:43:03.115 Maybe I should have escaped. I'm almost there.
945 00:43:04.895 --> 00:43:05.945 This one.
946 00:43:07.445 --> 00:43:08.985 Uh, so this one, for example,
947 00:43:09.015 --> 00:43:10.465 that would be what you would do.
948 00:43:10.765 --> 00:43:11.905 You would have a routing system
949 00:43:12.005 --> 00:43:14.985 and then, you know, would define, Hey, go for the LLM
950 00:43:14.985 --> 00:43:17.105 that is specialized about videos, go for the one
951 00:43:17.105 --> 00:43:18.225 that is specialized about text,
952 00:43:18.805 --> 00:43:21.265 or you can just use multimodal models,
953 00:43:21.835 --> 00:43:23.945 which then can understand images,
954 00:43:24.245 --> 00:43:26.225 or they can also understand text directly.
955 00:43:26.525 --> 00:43:28.945 So you don't actually have to do the switching system.
956 00:43:29.285 --> 00:43:33.505 Um, like Pixtral from Mistral AI is doing it,
957 00:43:34.205 --> 00:43:35.465 Gemini is doing it.
958 00:43:36.285 --> 00:43:38.625 Uh, Claude can also understand images and stuff.
959 00:43:39.165 --> 00:43:41.145 So then if you have images
960 00:43:41.245 --> 00:43:43.065 or text, you can throw them at the same model.
961 00:43:44.285 --> 00:43:46.735 It's a bit trickier if you have other type of data,
962 00:43:47.035 --> 00:43:49.455 but imagine if you have audio, then yes,
963 00:43:49.455 --> 00:43:50.535 you would have a routing system.
964 00:43:51.115 --> 00:43:53.615 And in the router you say, Hey, for audio,
965 00:43:54.355 --> 00:43:55.575 go to this LLM, please.
966 00:43:55.915 --> 00:43:56.915 That's the way to do it.
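[Editor's sketch] A routing system of the kind described above can be as small as a lookup table from modality to model. In this illustration, text and images share one multimodal model while audio is routed elsewhere. The handler functions and the `ROUTES` table are hypothetical stand-ins, not real model clients or part of any library.

```python
from typing import Callable, Dict


def multimodal_model(payload: str) -> str:
    """Stand-in for a model that understands both text and images."""
    return f"[multimodal] {payload}"


def audio_model(payload: str) -> str:
    """Stand-in for a model specialized in audio."""
    return f"[audio] {payload}"


# Text and images go to the same model, so no switching is needed for them.
ROUTES: Dict[str, Callable[[str], str]] = {
    "text": multimodal_model,
    "image": multimodal_model,
    "audio": audio_model,
}


def route(modality: str, payload: str) -> str:
    """Send the payload to whichever model is registered for its modality."""
    try:
        handler = ROUTES[modality]
    except KeyError:
        raise ValueError(f"no model registered for modality {modality!r}")
    return handler(payload)


if __name__ == "__main__":
    print(route("text", "summarize this tweet"))
    print(route("audio", "clip.wav"))
```

Adding a new specialist is then a one-line change to the table, which is the main appeal of keeping the router separate from the models.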
967 00:43:58.615 --> 00:44:00.435 Thanks for the explanation, Stefan.
968 00:44:00.895 --> 00:44:05.475 Um, another question is, could you explain
969 00:44:05.605 --> 00:44:08.555 where does the feedback mechanism come in the
970 00:44:08.555 --> 00:44:09.875 agentic approach?
971 00:44:10.495 --> 00:44:13.715 Yes. So this one is actually, so it's part
972 00:44:13.715 --> 00:44:15.835 of Browser Use, but otherwise it's actually checking.
973 00:44:16.575 --> 00:44:18.635 So it's asking the LLM what it has done,
974 00:44:18.855 --> 00:44:19.915 and it's checking the memory
975 00:44:20.295 --> 00:44:22.915 and then it's like, okay, what is the action?
976 00:44:22.935 --> 00:44:25.475 You know, you're gonna keep a list, um, of things
977 00:44:25.475 --> 00:44:27.275 that have been happening since you started.
978 00:44:27.695 --> 00:44:28.755 So you start with the action.
979 00:44:28.855 --> 00:44:30.235 The first one, the action is blank.
980 00:44:30.735 --> 00:44:32.965 Um, so you know you, like nothing has been done,
981 00:44:33.515 --> 00:44:35.805 then you're gonna do the next task, which is
982 00:44:36.425 --> 00:44:38.485 go on Twitter in this example, but it can be anything.
983 00:44:38.945 --> 00:44:41.965 And then you go on Twitter and then you check actually
984 00:44:42.635 --> 00:44:44.725 once the page has been loaded, you know,
985 00:44:44.725 --> 00:44:46.325 you check if actually the page has been loaded,
986 00:44:46.835 --> 00:44:48.245 then you're gonna check that
987 00:44:48.265 --> 00:44:50.365 you're actually on the correct website, um,
988 00:44:50.945 --> 00:44:52.045 you know, not, you know, somewhere else.
989 00:44:53.425 --> 00:44:55.565 And then if you have those two conditions,
990 00:44:56.105 --> 00:44:58.565 you can firstly say, okay, I navigated
991 00:44:58.705 --> 00:44:59.965 to this website correctly.
992 00:45:00.915 --> 00:45:04.245 Then you do that, then you are like, you know, uh,
993 00:45:04.395 --> 00:45:06.445 it's been, it was shown directly in the logs,
994 00:45:06.445 --> 00:45:08.285 but it was like you have different steps,
995 00:45:08.465 --> 00:45:09.685 you know, then that are produced.
996 00:45:10.145 --> 00:45:11.685 And it's like, okay, the first step was
997 00:45:11.685 --> 00:45:13.045 navigating to this website.
998 00:45:13.465 --> 00:45:15.845 I'm on this website now. So I call that a success.
999 00:45:16.515 --> 00:45:19.125 Then the next step is extracting the tweets.
1000 00:45:19.825 --> 00:45:22.085 Uh, so then you're gonna extract the tweets,
1001 00:45:22.625 --> 00:45:25.725 and then if you have, um, if you're successful at that,
1002 00:45:25.875 --> 00:45:27.645 then you're gonna say that it's a success.
1003 00:45:28.265 --> 00:45:30.245 And then if you're not successful, if you have errors,
1004 00:45:30.245 --> 00:45:32.205 you know, maybe, I don't know, Twitter is down,
1005 00:45:32.255 --> 00:45:34.445 maybe you cannot extract the tweets
1006 00:45:34.445 --> 00:45:38.205 because in my prompt when I said extract, you know,
1007 00:45:38.205 --> 00:45:40.125 something, uh, then you just don't find it.
1008 00:45:40.425 --> 00:45:41.285 So maybe you're gonna crash,
1009 00:45:41.285 --> 00:45:42.205 maybe you're gonna have an error.
1010 00:45:42.665 --> 00:45:46.085 So then it's gonna be like, oh, I have an error here.
1011 00:45:46.705 --> 00:45:49.285 Uh, I need to go back to see what's happening
1012 00:45:49.285 --> 00:45:50.725 and maybe change a bit my approach.
1013 00:45:51.185 --> 00:45:53.845 And that's the way basically, um, you,
1014 00:45:54.385 --> 00:45:55.565 you have the feedback mechanism.
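[Editor's sketch] The feedback loop described above (start with an empty history, run a step, record success or failure, and retry with an adjusted approach after an error) might look roughly like this. It is a simplified illustration and not Browser Use's actual implementation. `StepResult`, `run_with_feedback`, and `fake_execute` are hypothetical names, and the executor is a stub that fails once on purpose.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    task: str
    success: bool
    note: str


Executor = Callable[[str, int, List[StepResult]], Tuple[bool, str]]


def run_with_feedback(tasks: List[str], execute: Executor,
                      max_retries: int = 1) -> List[StepResult]:
    """Run tasks in order, recording each outcome so later steps can see it."""
    history: List[StepResult] = []  # starts blank: nothing has been done yet
    for task in tasks:
        attempt = 0
        while True:
            # The executor sees the history, so it can adjust its approach.
            ok, note = execute(task, attempt, history)
            history.append(StepResult(task, ok, note))
            if ok or attempt >= max_retries:
                break
            attempt += 1
        if not ok:
            break  # give up on the remaining tasks after repeated failure
    return history


def fake_execute(task: str, attempt: int,
                 history: List[StepResult]) -> Tuple[bool, str]:
    """Stub: the extraction fails on the first try, then succeeds."""
    if task == "extract tweets" and attempt == 0:
        return False, "error: content not found, changing approach"
    return True, "success"


if __name__ == "__main__":
    for step in run_with_feedback(["go to website", "extract tweets"], fake_execute):
        print(step)
```

In a real agent the executor would be the LLM plus the browser, and the success check would be conditions such as "the page has loaded" and "this is the correct website".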
1015 00:45:58.815 --> 00:46:01.205 Great, thank you so much for your answer there, Stefan.
1016 00:46:01.785 --> 00:46:04.245 Um, it looks like questions have been cleared,
1017 00:46:05.065 --> 00:46:06.365 really rapid fire there.
1018 00:46:06.905 --> 00:46:10.445 Um, if you have any more, just wait a minute or so.
1019 00:46:10.495 --> 00:46:11.965 We'll wait a minute, minute
1020 00:46:11.985 --> 00:46:14.205 or so for you to finish typing your questions.
1021 00:46:14.665 --> 00:46:18.815 Mm-hmm. Um, but overall, here's this, here's the screen
1022 00:46:18.985 --> 00:46:21.015 where you can connect with Stefan, um,
1023 00:46:21.955 --> 00:46:26.455 on different platforms, and then we also have office hours.
1024 00:46:26.795 --> 00:46:28.415 Um, so I'm gonna share a link to that
1025 00:46:28.955 --> 00:46:32.255 in case you're interested in meeting one-on-one with Stefan,
1026 00:46:32.715 --> 00:46:36.335 or we have actually another Stefan developer advocate here.
1027 00:46:36.715 --> 00:46:39.215 Um, so, um, one of them will be able
1028 00:46:39.215 --> 00:46:40.935 to answer one-on-one your questions.
1029 00:46:41.835 --> 00:46:45.095 Um, but I'm going to put that link in the chat.
1030 00:46:47.965 --> 00:46:51.465 Yes. Yeah, the office hour is useful if you deploy Milvus,
1031 00:46:51.465 --> 00:46:55.545 for example, somewhere, uh, and, um,
1032 00:46:56.045 --> 00:46:57.585 and then you need help somehow to like,
1033 00:46:57.585 --> 00:46:58.625 you know, scale it up or something.
1034 00:46:59.925 --> 00:47:03.265 And yes, there's a question if you wanna take this one.
1035 00:47:03.775 --> 00:47:06.505 Sure, sure. And then if the last question here is if you
1036 00:47:06.505 --> 00:47:08.345 found an image in one of your X feeds,
1037 00:47:09.085 --> 00:47:10.825 how do you store it into Milvus?
1038 00:47:11.485 --> 00:47:14.425 So you're gonna put it through an embedding model
1039 00:47:14.425 --> 00:47:16.145 that is like capable of using images
1040 00:47:16.365 --> 00:47:17.505 or understanding images.
1041 00:47:18.165 --> 00:47:22.785 Um, and then for us, you know, like for example, CLIP,
1042 00:47:23.245 --> 00:47:25.985 uh, which starts to be old now from OpenAI, um,
1043 00:47:26.165 --> 00:47:28.985 has been trained on the pairs of images and text.
1044 00:47:29.625 --> 00:47:31.485 And so then what we're doing is
1045 00:47:31.485 --> 00:47:33.925 that we're gonna put this image through the embedding model,
1046 00:47:34.105 --> 00:47:35.925 but then I can search it with text, you know,
1047 00:47:35.925 --> 00:47:38.485 then it's like I'm gonna have the similarity search on the
1048 00:47:38.485 --> 00:47:39.565 image and on the text.
1049 00:47:40.225 --> 00:47:41.925 Um, so in Milvus it's just a vector.
1050 00:47:42.545 --> 00:47:46.445 Um, and then obviously I store the URL, uh, of the image
1051 00:47:46.445 --> 00:47:47.965 as well on top of later on.
1052 00:47:48.465 --> 00:47:49.485 Uh, but for us, otherwise,
1053 00:47:49.485 --> 00:47:51.445 if you wanna search it, it's only your vector.
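[Editor's sketch] The idea above is that an image becomes a vector stored next to its URL, and a text query embedded into the same space finds it by similarity. The sketch below illustrates this with a tiny in-memory store. It does not use the Milvus client or a real CLIP model: `TinyVectorStore` is a hypothetical class, and the three-dimensional vectors are made-up stand-ins for embeddings that a CLIP-style model would produce.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """In-memory stand-in for a vector database collection."""

    def __init__(self) -> None:
        self.rows: List[Dict] = []

    def insert(self, vector: Sequence[float], url: str) -> None:
        # Only the vector is searched; the URL is kept to retrieve the image later.
        self.rows.append({"vector": list(vector), "url": url})

    def search(self, query: Sequence[float], limit: int = 1) -> List[str]:
        ranked = sorted(self.rows,
                        key=lambda row: cosine(query, row["vector"]),
                        reverse=True)
        return [row["url"] for row in ranked[:limit]]


if __name__ == "__main__":
    store = TinyVectorStore()
    # Made-up "image embeddings"; a real pipeline would compute these with a model.
    store.insert([0.9, 0.1, 0.0], "https://example.com/cat.jpg")
    store.insert([0.0, 0.2, 0.9], "https://example.com/chart.png")
    # Made-up "text embedding" of a query, living in the same vector space.
    print(store.search([0.8, 0.2, 0.1]))
```

Because the image and text encoders of a model trained on image-text pairs share one embedding space, the same search call works whether the query vector came from text or from another image.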
1054 00:47:55.905 --> 00:47:59.605 Any other questions? We
1055 00:47:59.605 --> 00:48:01.125 appreciate all the questions here.
1056 00:48:01.905 --> 00:48:05.115 Mm-hmm. Means you all are actually
1057 00:48:05.115 --> 00:48:06.155 staying engaged in the session.
1058 00:48:13.065 --> 00:48:15.075 I'll just wait another minute or so,
1059 00:48:17.885 --> 00:48:19.545 And if you have more questions, also feel free
1060 00:48:19.545 --> 00:48:20.665 to hit me up on socials.
1061 00:48:21.525 --> 00:48:23.955 Um, I'll be happy to help.
1062 00:48:29.345 --> 00:48:31.715 I've also put some resources into the chat.
1063 00:48:31.735 --> 00:48:34.435 You'll find our upcoming events, our, uh,
1064 00:48:34.435 --> 00:48:38.915 overall resource page and our Milvus Discord.
1065 00:48:39.255 --> 00:48:42.595 If you have follow ups, uh, we also have a podcast,
1066 00:48:42.765 --> 00:48:44.075 which I've put in the chat.
1067 00:48:44.295 --> 00:48:45.355 And then, um,
1068 00:48:45.355 --> 00:48:49.315 finally our office hours in case you wanna meet one-on-one
1069 00:48:49.315 --> 00:48:50.475 with anyone on our team.
1070 00:48:52.955 --> 00:48:55.005 Okay, cool. We're gonna end it here.
1071 00:48:55.415 --> 00:48:56.885 Thank you all for joining today,
1072 00:48:57.105 --> 00:49:00.205 and we look forward to seeing you in a future webinar.
1073 00:49:01.745 --> 00:49:03.835 Cool. Thank you very much. Have.