Milvus Hybrid Search: Combining Keyword Precision with Semantic Power for Next-Gen Data Retrieval
1 00:00:03.545 --> 00:00:04.645 I'm pleased to introduce
2 00:00:04.645 --> 00:00:07.405 to this session, Milvus Hybrid Search: Combining Keyword Precision
3 00:00:07.405 --> 00:00:10.085 with Semantic Power for Next-Gen Data Retrieval,
4 00:00:10.085 --> 00:00:12.645 with our guest speaker today, Erli Kuka.
5 00:00:12.865 --> 00:00:17.165 He is a data and AI engineer at Datamax AI, specializing
6 00:00:17.185 --> 00:00:19.325 in ML and AI development and integration.
7 00:00:20.105 --> 00:00:23.165 He has gained deep insight into data, transformer
8 00:00:23.165 --> 00:00:24.685 architecture, attention mechanisms,
9 00:00:24.905 --> 00:00:26.725 and other critical areas of machine learning.
10 00:00:27.515 --> 00:00:29.045 He's passionate about machine learning
11 00:00:29.505 --> 00:00:32.125 and Erli is committed to advancing his career,
12 00:00:32.385 --> 00:00:33.685 and ultimately leading teams
13 00:00:33.745 --> 00:00:36.765 to create innovative solutions in cloud development and ai.
14 00:00:37.295 --> 00:00:39.125 Erli, the stage is yours.
15 00:00:40.325 --> 00:00:42.625 Uh, thank you Stefan. Hi everyone.
16 00:00:43.205 --> 00:00:45.745 So, yes, uh,
17 00:00:46.765 --> 00:00:49.565 the, this presentation.
18 00:00:49.825 --> 00:00:52.005 In this presentation we're gonna be talking about, uh,
19 00:00:52.795 --> 00:00:55.245 combining the Milvus full-text search
20 00:00:55.245 --> 00:00:58.045 and embedding search to achieve
21 00:00:58.045 --> 00:00:59.805 something interesting in my opinion.
22 00:01:00.465 --> 00:01:04.345 So the quest for this, uh, project started
23 00:01:04.345 --> 00:01:08.865 because I was in need of a certain tool that was able
24 00:01:08.865 --> 00:01:12.875 to crawl the web, uh, whatever websites they can find,
25 00:01:13.675 --> 00:01:17.925 generate, uh, some data such as a title for the thing,
26 00:01:18.045 --> 00:01:20.525 a summary basically to allow me
27 00:01:20.525 --> 00:01:22.565 to better find whatever pieces
28 00:01:22.625 --> 00:01:25.405 and chunks that I needed to help myself, uh,
29 00:01:25.475 --> 00:01:28.965 code a bit faster or understand, uh, like libraries better.
30 00:01:30.105 --> 00:01:34.125 So I will be sharing the presentation that I have made
31 00:01:34.225 --> 00:01:39.165 and then, uh, we will go deeper into a, the demo which, uh,
32 00:01:39.585 --> 00:01:41.685 in which we will showcase how
33 00:01:42.195 --> 00:01:45.005 this whole logic comes together so that, uh,
34 00:01:45.105 --> 00:01:46.445 you could use it in the future.
35 00:01:54.885 --> 00:01:59.465 So, as I said, uh, the
36 00:02:00.575 --> 00:02:03.435 key point of this presentation is about Milvus
37 00:02:03.575 --> 00:02:07.715 and, uh, its, uh, full-text search
38 00:02:07.815 --> 00:02:08.955 and, uh, vector search,
39 00:02:09.735 --> 00:02:13.135 uh, sorry.
40 00:02:13.835 --> 00:02:18.095 So combining them, why would you need to combine these two,
41 00:02:18.475 --> 00:02:20.775 uh, retrieval, uh, methods?
42 00:02:21.325 --> 00:02:22.695 Well, uh, simply
43 00:02:22.725 --> 00:02:25.695 because maybe some of them, for example,
44 00:02:25.795 --> 00:02:28.215 if you use a full text search, maybe you're going
45 00:02:28.215 --> 00:02:31.255 to miss a couple of details in, uh, whatever
46 00:02:32.115 --> 00:02:34.775 big document you are going through.
47 00:02:35.165 --> 00:02:36.935 Same goes with Vector. They have their
48 00:02:37.185 --> 00:02:38.775 advantages and disadvantages.
49 00:02:39.075 --> 00:02:40.615 The key thing here is to make sure
50 00:02:40.615 --> 00:02:42.575 that you get the best of both worlds.
51 00:02:42.955 --> 00:02:45.975 And, uh, it's good to use Milvus
52 00:02:45.975 --> 00:02:48.215 because as I will be showing you later,
53 00:02:48.475 --> 00:02:51.015 it is running locally on my Docker machine
54 00:02:51.515 --> 00:02:55.855 and, uh, it's absolutely very fast and very reliable.
55 00:02:56.155 --> 00:02:59.685 It, uh, spins up in a couple of seconds.
56 00:02:59.685 --> 00:03:01.885 So how do I even do this?
57 00:03:02.145 --> 00:03:06.845 Uh, crawling of the Python, npm, uh, libraries.
58 00:03:07.485 --> 00:03:09.565 I am currently using a GitHub library
59 00:03:09.565 --> 00:03:10.845 called Crawl4AI.
60 00:03:11.025 --> 00:03:14.765 I'm pretty sure that people who are in the AI, LLM, uh,
61 00:03:15.735 --> 00:03:18.365 scene are very familiar with Crawl4AI.
62 00:03:19.115 --> 00:03:22.685 What, um, it basically does is you can, uh,
63 00:03:24.345 --> 00:03:28.555 just crawl a specific website and it formats it
64 00:03:28.655 --> 00:03:31.915 or parses the most important things
65 00:03:32.725 --> 00:03:34.905 for the AI to better understand the context
66 00:03:34.905 --> 00:03:36.625 that it will get later.
67 00:03:37.405 --> 00:03:41.545 Uh, I have also developed some smart logic in which I will
68 00:03:41.545 --> 00:03:45.025 touch on a bit later, that, uh, recognizes code blocks
69 00:03:45.055 --> 00:03:48.825 that you might find regularly in, uh, documentation.
70 00:03:49.725 --> 00:03:50.945 In doc websites.
71 00:03:51.525 --> 00:03:55.245 It, uh, goes through all,
72 00:03:55.465 --> 00:03:58.885 so it uses the sitemap to go through all of the
73 00:03:59.795 --> 00:04:04.605 different websites that are inside a documentation website.
74 00:04:05.025 --> 00:04:09.045 And, uh, it makes everything ready for the AI to understand.
75 00:04:09.225 --> 00:04:13.245 And then after it crawls the websites, it creates embeddings
76 00:04:13.245 --> 00:04:16.165 and whatnot, puts them in the Milvus, uh, database
77 00:04:16.185 --> 00:04:17.925 and then we can access them later.
78 00:04:18.905 --> 00:04:21.725 So the Milvus full-text search mechanism,
79 00:04:22.155 --> 00:04:25.445 what it basically does is, uh, by the way, this is
80 00:04:25.465 --> 00:04:28.565 behind the scenes, so it's not like you get to see much of,
81 00:04:28.565 --> 00:04:29.565 uh, the Milvus part.
82 00:04:29.905 --> 00:04:34.125 It, uh, uses BM25 scoring.
83 00:04:35.585 --> 00:04:37.605 You give it some natural language
84 00:04:37.985 --> 00:04:39.605 and then it generates a query.
85 00:04:39.825 --> 00:04:41.805 And then based on that query, it goes
86 00:04:41.805 --> 00:04:44.725 and retrieves the most relevant, uh, documentation
87 00:04:44.865 --> 00:04:48.085 or whatever chunks that, uh, you might have.
88 00:04:48.145 --> 00:04:51.005 Chunks are very important here. I will show it a bit later.
89 00:04:52.025 --> 00:04:54.525 The same goes with, uh, vector.
90 00:04:55.145 --> 00:04:58.645 Now, as I said before, there are advantages
91 00:04:58.645 --> 00:05:00.205 and disadvantages to them both.
92 00:05:00.785 --> 00:05:04.405 Uh, they are very strong solutions, each
93 00:05:04.625 --> 00:05:05.925 for their own thing.
94 00:05:06.075 --> 00:05:08.765 However, they do have a very big problem,
95 00:05:08.815 --> 00:05:10.285 especially in RAG.
96 00:05:10.605 --> 00:05:13.685 'cause at the end of the day, the solution is RAG, uh,
97 00:05:14.005 --> 00:05:15.085 retrieval-augmented generation.
98 00:05:15.085 --> 00:05:19.575 Yeah, the problem is that if you give it too much data
99 00:05:19.755 --> 00:05:22.335 or you saturate the database a lot,
100 00:05:22.675 --> 00:05:24.055 it will get confused.
101 00:05:24.185 --> 00:05:25.455 It'll definitely get confused.
102 00:05:26.035 --> 00:05:27.135 So the workaround
103 00:05:27.475 --> 00:05:31.575 and why we are using both of them, uh, is
104 00:05:31.765 --> 00:05:34.575 that we are generating embeddings for each
105 00:05:34.755 --> 00:05:36.775 of the chunks.
106 00:05:37.535 --> 00:05:39.635 We are generating summaries for each of the chunks
107 00:05:39.735 --> 00:05:43.925 and a title, just to have them a bit more.
108 00:05:45.495 --> 00:05:47.195 The title was just for show actually,
109 00:05:47.295 --> 00:05:49.315 but still we are using the summarization
110 00:05:49.855 --> 00:05:51.915 and we're passing queries to the summarization
111 00:05:51.975 --> 00:05:54.315 and the full context of each of the chunks.
112 00:05:55.275 --> 00:05:58.365 Uh, the chunk, as I said earlier, it is a bit, uh,
113 00:05:58.865 --> 00:06:02.525 varied from documentation side to documentation side.
114 00:06:02.685 --> 00:06:04.085 'cause one might have coding blocks
115 00:06:04.185 --> 00:06:06.245 and it has some smart logic to either
116 00:06:06.835 --> 00:06:08.965 read the whole coding block or stop a bit
117 00:06:08.965 --> 00:06:11.845 before that, either way to improve accuracy
118 00:06:12.105 --> 00:06:14.605 to make it more redundant, better
119 00:06:15.245 --> 00:06:18.495 coverage, scalable of course.
120 00:06:18.895 --> 00:06:22.715 'cause uh, if you're gonna be passing a full documentation
121 00:06:22.715 --> 00:06:26.915 of some NPM library, whatever you might want to add on top
122 00:06:26.915 --> 00:06:29.795 of that, a couple of, uh, libraries and whatnot.
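A minimal sketch of the code-block-aware chunking described above, assuming the crawled pages arrive as Markdown text with fenced code blocks. The function name `chunk_markdown`, the `max_chars` limit, and the `chunk_number`/`content` keys are hypothetical and not taken from the speaker's script. This version implements only the "read the whole code block" behaviour: a chunk is never closed while a fence is open.

```python
FENCE = "`" * 3  # Markdown code-fence marker, built this way so the example stays fence-safe


def chunk_markdown(text: str, max_chars: int = 1000) -> list[dict]:
    """Split text into chunks of roughly max_chars without cutting inside a fenced code block."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    in_code = False
    for line in text.splitlines():
        # A fence line toggles between "inside code" and "outside code".
        if line.strip().startswith(FENCE):
            in_code = not in_code
        current.append(line)
        size += len(line) + 1
        # Close the chunk only at a line boundary that is outside a code block.
        if size >= max_chars and not in_code:
            chunks.append("\n".join(current))
            current, size = [], 0
    if current:
        chunks.append("\n".join(current))
    # Keep the position of each chunk within the page, as the talk describes.
    return [{"chunk_number": i, "content": c} for i, c in enumerate(chunks)]


page = "intro line\n" + FENCE + "python\nprint('a')\nprint('b')\n" + FENCE + "\noutro line"
for chunk in chunk_markdown(page, max_chars=20):
    print(chunk["chunk_number"], repr(chunk["content"]))
```

A chunk that contains a long code block can exceed `max_chars`; that is the trade-off of keeping code examples intact.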
123 00:06:31.255 --> 00:06:35.235 So why did we do this?
124 00:06:36.665 --> 00:06:40.645 As I said, the main problem is actually saturation.
125 00:06:41.425 --> 00:06:45.435 So for me at least, uh, one of the first projects
126 00:06:45.435 --> 00:06:48.195 that I started with AI, uh, was a RAG.
127 00:06:48.735 --> 00:06:50.995 And, uh, I noticed that the second that you passed
128 00:06:51.715 --> 00:06:53.075 a couple hundreds
129 00:06:53.175 --> 00:06:57.555 or tens of chunks, actually the model would get very,
130 00:06:57.575 --> 00:07:01.555 the embedding model would get very, uh, confused
131 00:07:01.975 --> 00:07:05.515 and, uh, the solution used to be to just use rerankers
132 00:07:05.735 --> 00:07:06.835 or things like that.
133 00:07:07.415 --> 00:07:08.915 The problem with rerankers is
134 00:07:08.915 --> 00:07:10.675 that they can also get confused.
135 00:07:11.455 --> 00:07:15.705 So yeah, the solution to
136 00:07:15.705 --> 00:07:17.505 that is an agentic RAG.
137 00:07:18.125 --> 00:07:22.785 So the summary that we generate for each of the chunks,
138 00:07:23.085 --> 00:07:24.385 we also, in that summary,
139 00:07:24.485 --> 00:07:27.345 we specify which chunk it is and everything.
140 00:07:27.605 --> 00:07:31.225 And then we pass the queries, both vector and uh, full text.
141 00:07:31.225 --> 00:07:34.905 We pass them to the summary of the summaries
142 00:07:35.245 --> 00:07:38.585 and then we pass the most relevant result from the queries,
143 00:07:38.695 --> 00:07:40.545 both of them, to ChatGPT
144 00:07:40.805 --> 00:07:43.945 or whatever LLM of your choice.
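One common way to merge the two result lists before handing them to the LLM is reciprocal rank fusion (RRF). The talk does not say which merging rule the speaker's tool uses, so this is an illustrative sketch of a swapped-in technique, not his method. The chunk IDs and the constant `k = 60` are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one list, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); chunks found by both searches add up.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


# Hypothetical top-3 results from the vector search and the full-text search.
vector_hits = ["c2", "c1", "c3"]
text_hits = ["c1", "c4", "c2"]
print(reciprocal_rank_fusion([vector_hits, text_hits]))
```

Chunks that appear in both lists rise to the top, which matches the goal stated in the talk of getting the best of both retrieval methods.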
145 00:07:45.065 --> 00:07:46.955 This, from my experience
146 00:07:46.955 --> 00:07:48.395 and from the experience of my colleagues,
147 00:07:48.395 --> 00:07:51.355 which have been using this as an internal tool, has
148 00:07:51.875 --> 00:07:56.655 resulted in very accurate, uh, results.
149 00:07:57.495 --> 00:07:59.895 I have not implemented, I have yet to implement some logic
150 00:07:59.925 --> 00:08:01.975 that would, uh, differentiate libraries
151 00:08:01.975 --> 00:08:03.455 better and things like that.
152 00:08:03.455 --> 00:08:07.335 However, for what we use it, it's perfectly fine.
153 00:08:09.065 --> 00:08:12.285 And now I will be showcasing the live demo.
154 00:08:12.585 --> 00:08:16.645 The live demo will go as this, uh, we will be crawling a
155 00:08:17.235 --> 00:08:18.445 website of my choice.
156 00:08:18.945 --> 00:08:21.125 Uh, in this example I've chosen Pydantic
157 00:08:21.125 --> 00:08:23.925 because I had a project with that.
158 00:08:24.025 --> 00:08:25.285 So I thought, why not?
159 00:08:25.945 --> 00:08:29.845 Uh, I will showcase the sitemap, uh, logic
160 00:08:30.345 --> 00:08:32.365 and how that actually looks.
161 00:08:33.645 --> 00:08:37.105 And then after it goes through all of the, of,
162 00:08:37.355 --> 00:08:39.345 after it crawls all of the
163 00:08:41.895 --> 00:08:43.455 documentation websites from Pydantic,
164 00:08:43.545 --> 00:08:46.495 which it finds from the sitemap, then we can query it
165 00:08:46.555 --> 00:08:48.575 as if it were a regular RAG.
166 00:08:48.645 --> 00:08:52.175 However, we're not going to be using just one or the other.
167 00:08:52.355 --> 00:08:53.975 So neither vector
168 00:08:54.195 --> 00:08:56.165 nor text; we're gonna be using both of them
169 00:08:56.265 --> 00:08:57.725 to have even better accuracy.
170 00:09:01.745 --> 00:09:04.555 Okay. So this is the script.
171 00:09:04.915 --> 00:09:06.835 I really won't go over too much.
172 00:09:07.175 --> 00:09:11.695 Um, yeah, I think the chunk size
173 00:09:11.795 --> 00:09:13.135 and the to
174 00:09:13.315 --> 00:09:15.775 and the logic to find the coding blocks, that's, uh,
175 00:09:16.165 --> 00:09:17.695 like interesting in my opinion.
176 00:09:17.795 --> 00:09:21.075 And I'm sure whoever wants to can have a look at it.
177 00:09:21.655 --> 00:09:24.195 So I will be opening up the sitemap just
178 00:09:24.195 --> 00:09:26.155 so you have an understanding of how
179 00:09:26.155 --> 00:09:27.755 that actually might look.
180 00:09:29.005 --> 00:09:33.455 This is it basically. It has multiple links
181 00:09:33.635 --> 00:09:35.895 inside this one sitemap.
182 00:09:36.115 --> 00:09:38.575 Uh, people usually, so companies usually use this
183 00:09:38.575 --> 00:09:40.455 for SEO, search engine optimization,
184 00:09:41.115 --> 00:09:42.695 but, uh, that's good for us
185 00:09:42.695 --> 00:09:47.135 because we get to utilize it for whatever we want.
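A minimal sketch of the sitemap step, assuming a standard `sitemap.xml` in the sitemaps.org format. The function name `extract_urls` and the example URLs are hypothetical; the XML is inlined so the example runs without network access.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


# Hypothetical sitemap content standing in for a real documentation site.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/</loc></url>
  <url><loc>https://docs.example.com/agents/</loc></url>
</urlset>"""

for url in extract_urls(EXAMPLE_SITEMAP):
    print(url)
```

Each returned URL is one page to hand to the crawler.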
186 00:09:47.855 --> 00:09:50.935 I should also state that you should always look at the
187 00:09:51.115 --> 00:09:54.975 robots.txt of every website just to be
188 00:09:55.555 --> 00:09:56.775 on the safer side.
189 00:09:57.375 --> 00:10:00.455 'cause maybe it is maybe illegal or I don't know.
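The robots.txt check can be automated with Python's standard library. This is a sketch under the assumption that the rules have already been downloaded as text; the user-agent string `my-docs-crawler` and the example rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: everything is allowed except the /private/ section.
EXAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""

print(is_allowed(EXAMPLE_ROBOTS, "my-docs-crawler", "https://docs.example.com/agents/"))
print(is_allowed(EXAMPLE_ROBOTS, "my-docs-crawler", "https://docs.example.com/private/page"))
```

Filtering the sitemap URLs through a check like this before crawling keeps the crawler within what the site owner permits. It does not replace reading the site's terms of use.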
190 00:10:01.115 --> 00:10:05.135 So I have, uh, already crawled the
191 00:10:07.235 --> 00:10:11.285 websites and, uh, it shows, for example, one chunk.
192 00:10:11.385 --> 00:10:16.045 The first chunk, just as a demo: it has a title, summary,
193 00:10:16.585 --> 00:10:21.005 uh, the whole context, chunk ID, chunk number, everything,
194 00:10:21.025 --> 00:10:23.495 and embeddings a bit, uh,
195 00:10:23.495 --> 00:10:25.335 further down timestamp and everything.
196 00:10:26.035 --> 00:10:30.095 So what we do is we run crawl Pydantic AI documentation.
197 00:10:30.195 --> 00:10:31.415 That's how I named it.
198 00:10:31.955 --> 00:10:35.695 And then, uh, after it crawls everything,
199 00:10:36.075 --> 00:10:39.215 we just do streamlit run Streamlit UI.
200 00:10:40.465 --> 00:10:43.965 So this Streamlit UI is very simple.
201 00:10:44.225 --> 00:10:48.455 It just has, uh, an input, like a
202 00:10:48.965 --> 00:10:51.375 text box. You can clear the chat, upload pictures.
203 00:10:51.595 --> 00:10:54.255 Uh, actually Stefan recommended this.
204 00:10:54.355 --> 00:10:58.815 So if you are at a meeting for example, or a presentation
205 00:10:58.815 --> 00:11:00.615 or something, you can just take a picture of
206 00:11:00.615 --> 00:11:01.615 what you're seeing on the screen
207 00:11:01.995 --> 00:11:05.455 and uh, if you have already crawled their documentation,
208 00:11:05.765 --> 00:11:09.935 then you can refer with your own, um, RAG about
209 00:11:10.495 --> 00:11:12.375 whatever it is you're seeing here.
210 00:11:12.635 --> 00:11:14.215 I'm passing a very simple query.
211 00:11:14.395 --> 00:11:15.895 I'm sure everyone is, uh, familiar
212 00:11:16.005 --> 00:11:18.935 with the Pydantic weather agent example.
213 00:11:19.555 --> 00:11:24.215 And uh, what we're looking to get is whatever is on
214 00:11:25.505 --> 00:11:27.165 the Pydantic website.
215 00:11:27.305 --> 00:11:31.245 And I will also copy this so we can run it.
216 00:11:37.015 --> 00:11:40.685 So this is the example code that Pydantic provides.
217 00:11:41.315 --> 00:11:44.405 This is something close to what we're supposed to see.
218 00:11:44.425 --> 00:11:46.885 And if the model does a very good job, it'll understand
219 00:11:46.885 --> 00:11:48.285 that you don't need all of these,
220 00:11:48.505 --> 00:11:50.565 you just need a part, this and that.
221 00:11:50.785 --> 00:11:52.045 You don't really need this.
222 00:11:53.965 --> 00:11:58.385 And it gives you the exact same thing with less, uh,
223 00:11:59.855 --> 00:12:03.535 of these inputs and things like that fully functioning.
224 00:12:03.995 --> 00:12:07.055 It, uh, shows you how to do it or what you need to install.
225 00:12:07.275 --> 00:12:10.115 And, uh, yeah, that's it.
226 00:12:10.495 --> 00:12:15.425 Uh, the way that you, so the way that you kind
227 00:12:15.425 --> 00:12:19.825 of orchestrate the whole operation
228 00:12:20.365 --> 00:12:24.585 is by using this other file in which you just pass a
229 00:12:24.585 --> 00:12:25.745 simple query.
230 00:12:26.225 --> 00:12:29.465 I have: "You're an expert at Pydantic AI,
231 00:12:29.545 --> 00:12:31.625 a Python AI agent framework, with access
232 00:12:31.625 --> 00:12:32.905 to extensive documentation."
233 00:12:33.045 --> 00:12:36.825 You give it some tools, uh, and it just works.
234 00:12:37.605 --> 00:12:40.425 By the way, fun fact, I have also used Pydantic
235 00:12:41.325 --> 00:12:45.105 for this project, so that's how the idea even came about.
236 00:12:45.625 --> 00:12:47.745 I wanted to know a bit more about Pydantic
237 00:12:48.365 --> 00:12:51.405 and uh, this is how the whole thing works.
238 00:12:51.945 --> 00:12:52.945 So yeah.
239 00:12:59.125 --> 00:13:02.725 Hello? Uh, that was quick.
240 00:13:05.335 --> 00:13:07.925 Maybe, uh, I have a couple of questions,
241 00:13:07.985 --> 00:13:10.485 but maybe you can explain them, like
242 00:13:10.485 --> 00:13:13.285 how you use it on a day-to-day basis on your end,
243 00:13:14.185 --> 00:13:15.185 Of course. Like what you did
244 00:13:15.185 --> 00:13:15.765 achieve.
245 00:13:17.065 --> 00:13:21.045 So, um, as I said, with the Pydantic AI, uh, example,
246 00:13:21.755 --> 00:13:25.765 what I like, what I was trying to do at first was
247 00:13:25.825 --> 00:13:28.405 to have a tool for myself
248 00:13:28.505 --> 00:13:32.485 and my colleagues maybe so that, uh, you could just scrape
249 00:13:33.085 --> 00:13:36.245 documentation from different NPM
250 00:13:36.245 --> 00:13:39.765 or Python libraries or things that you find on the web.
251 00:13:39.865 --> 00:13:43.085 Mm-hmm. And, uh, it would help you better understand it,
252 00:13:43.085 --> 00:13:44.325 so you would understand it faster.
253 00:13:44.555 --> 00:13:47.245 Instead of you having to go through all the documentation,
254 00:13:47.985 --> 00:13:52.125 you could literally just give, like skim it over
255 00:13:52.125 --> 00:13:54.125 so you understand where to look.
256 00:13:54.465 --> 00:13:56.245 And then you can query ChatGPT
257 00:13:56.245 --> 00:13:59.445 or another LLM to give you actual
258 00:14:00.475 --> 00:14:02.325 like correct responses.
259 00:14:02.325 --> 00:14:06.185 Mm-hmm. That's how the whole uh, thing came about.
260 00:14:07.915 --> 00:14:09.005 Okay. Okay. Cool.
261 00:14:09.305 --> 00:14:11.605 Uh, I don't know if we have questions in the chat.
262 00:14:12.105 --> 00:14:15.165 Uh, so for the people, you can ask directly in the q and a
263 00:14:16.025 --> 00:14:18.605 and otherwise I'm gonna go
264 00:14:18.605 --> 00:14:20.325 and ask some question myself in the meantime.
265 00:14:21.225 --> 00:14:24.005 Um, so you mentioned quickly BM25.
266 00:14:24.465 --> 00:14:26.725 Uh, so can you explain, you know, the role
267 00:14:26.725 --> 00:14:28.965 of BM25 in the scoring for like
268 00:14:29.475 --> 00:14:31.445 with Milvus full-text search that you have?
269 00:14:32.105 --> 00:14:32.755 Yeah, of course.
270 00:14:37.335 --> 00:14:40.125 Wait, you froze. Hello again.
271 00:14:48.145 --> 00:14:51.595 Getting someone in the chat. Okay, you're back. You froze.
272 00:14:51.855 --> 00:14:55.155 You were freezing. Hear me? Yes, now I can hear you.
273 00:14:55.295 --> 00:14:58.755 I'm very sorry. My internet. I don't know.
274 00:14:59.255 --> 00:15:01.875 Uh, the question was about, uh, BM25. Yes,
275 00:15:02.335 --> 00:15:03.335 Yes.
276 00:15:03.655 --> 00:15:07.915 Okay. So I will,
277 00:15:08.195 --> 00:15:12.625 I will explain how like, okay, so BM25,
278 00:15:12.625 --> 00:15:15.705 basically the reason why you would use,
279 00:15:16.845 --> 00:15:18.865 the reason why you would use BM25. Yes.
280 00:15:19.455 --> 00:15:22.305 Yeah. Yeah. Why you like, okay.
281 00:15:22.305 --> 00:15:23.865 What's the benefit you saw yourself?
282 00:15:24.005 --> 00:15:26.065 Was it, uh, when you use it,
283 00:15:26.205 --> 00:15:28.225 did you see like something better than only
284 00:15:28.225 --> 00:15:29.585 using, you know, vector search?
285 00:15:30.175 --> 00:15:31.425 Okay, yeah. Yeah.
286 00:15:31.765 --> 00:15:35.785 So basically since, uh, the idea of the project was
287 00:15:36.005 --> 00:15:40.185 to utilize Milvus in order to make something great,
288 00:15:40.515 --> 00:15:42.425 make something very interesting so
289 00:15:42.425 --> 00:15:45.745 that we could showcase to better understand Milvus
290 00:15:45.765 --> 00:15:48.425 and, uh, our capabilities as a team.
291 00:15:49.335 --> 00:15:51.385 What, uh, BM25 is:
292 00:15:51.485 --> 00:15:53.185 I'm pretty sure it's just Best Matching.
293 00:15:53.485 --> 00:15:56.385 Uh, it's a retrieval
294 00:15:57.095 --> 00:15:58.545 rank search results thing.
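For readers who want to see what BM25 ranking computes, here is a minimal pure-Python sketch of the Okapi BM25 formula. It is not Milvus's implementation. The whitespace tokenizer, the parameters `k1 = 1.5` and `b = 0.75`, and the example documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Very simple analyzer: lowercase, then split on whitespace."""
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length with b.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "pydantic ai weather agent example",
    "milvus vector database",
    "weather forecast today",
]
print(bm25_scores("weather agent", docs))
```

The first document matches both query terms and scores highest, the third matches only the more common term, and the second matches nothing and scores zero. This exact-term behaviour is why keyword search is useful for specific library names.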
295 00:15:59.085 --> 00:16:01.865 And, uh, what it does on the backend, at least
296 00:16:01.865 --> 00:16:05.745 for Milvus's side, why I've chosen also to use it is
297 00:16:05.745 --> 00:16:09.425 because since it allows you to utilize just, um,
298 00:16:10.055 --> 00:16:14.425 natural language, it could also have benefits when it comes
299 00:16:14.445 --> 00:16:17.465 to how it understands that natural language
300 00:16:17.525 --> 00:16:18.945 as like a query by itself.
301 00:16:19.605 --> 00:16:20.945 Uh, I have noticed kind
302 00:16:20.945 --> 00:16:24.785 of different results like from vector to BM25.
303 00:16:25.245 --> 00:16:29.025 At the end of the day, they both use the same kind of logic,
304 00:16:29.765 --> 00:16:34.095 but, um, it's a bit of Milvus magic, the way
305 00:16:34.095 --> 00:16:35.575 that it, uh, works.
306 00:16:35.575 --> 00:16:39.005 Like it does. I, from my experiments,
307 00:16:39.175 --> 00:16:43.085 after a couple of back and forth, uh, a couple of minutes
308 00:16:43.305 --> 00:16:44.925 or messages of going back
309 00:16:45.045 --> 00:16:49.815 and forth with the chat, the vector part gets, starts
310 00:16:49.815 --> 00:16:54.215 to get saturated and maybe a bit, uh, confused.
311 00:16:54.555 --> 00:16:59.135 Mm-hmm. I can't really put my finger on why.
312 00:16:59.305 --> 00:17:02.335 Maybe, maybe it's also partly
313 00:17:02.335 --> 00:17:05.215 the fault of, uh, my logic, like the way
314 00:17:05.215 --> 00:17:06.575 that I wrote the script.
315 00:17:07.035 --> 00:17:10.575 But, uh, the combination of both of them has proven to just
316 00:17:12.275 --> 00:17:16.625 bring the best, uh, results from each respective query.
317 00:17:17.195 --> 00:17:18.925 Yeah. In this application.
318 00:17:19.465 --> 00:17:22.995 Yeah, no, uh, also I can add, uh, on that, it's, um, it's
319 00:17:22.995 --> 00:17:24.795 what we see as well with the customers.
320 00:17:25.055 --> 00:17:28.195 And in particular it's very useful if you're looking
321 00:17:28.215 --> 00:17:30.635 for specific names or specific brands.
322 00:17:30.705 --> 00:17:33.675 Like you're looking for a specific library name, you know,
323 00:17:33.675 --> 00:17:35.995 maybe you have another one that is similar, uh,
324 00:17:36.065 --> 00:17:38.755 with keyword search, then you will actually find this one
325 00:17:38.755 --> 00:17:40.555 and not, you know, another one that would be similar.
326 00:17:40.615 --> 00:17:42.755 That's usually a good way.
327 00:17:43.095 --> 00:17:45.155 Uh, we have some questions as well,
328 00:17:45.615 --> 00:17:48.355 so I'm gonna ask them is like, first one is, uh,
329 00:17:48.375 --> 00:17:51.365 how do you manage switching, switching between embeddings
330 00:17:51.465 --> 00:17:53.845 to string or string to embeddings on request?
331 00:17:58.765 --> 00:18:02.665 So I'm assuming it's, uh, I'm gonna try to rephrase it
332 00:18:02.665 --> 00:18:04.145 and the person let me know if it's correct.
333 00:18:04.805 --> 00:18:07.665 But how do you go from like having the query, you know,
334 00:18:07.665 --> 00:18:09.465 that you search to then having the embeddings
335 00:18:10.695 --> 00:18:12.625 Does, does that, that's uh, when it comes
336 00:18:12.625 --> 00:18:14.465 to full-text search, at least Milvus does that
337 00:18:14.655 --> 00:18:18.745 because, uh, on my end at least you just import it,
338 00:18:18.805 --> 00:18:21.905 you utilize it and that's it on the backend.
339 00:18:22.365 --> 00:18:23.945 I'm pretty sure you can go a bit more in
340 00:18:23.945 --> 00:18:25.025 detail when it comes to that.
341 00:18:25.685 --> 00:18:29.625 Yes. So basically, yeah, we, we have our own, uh,
342 00:18:29.815 --> 00:18:30.865 analyzers, um,
343 00:18:31.005 --> 00:18:33.985 and different functions, uh, where you have your,
344 00:18:33.985 --> 00:18:35.665 you're gonna pass the input query, uh,
345 00:18:35.665 --> 00:18:36.905 then it's gonna be transformed.
346 00:18:36.925 --> 00:18:38.465 So removing, you know,
347 00:18:38.465 --> 00:18:40.145 maybe you're gonna put everything in in
348 00:18:40.145 --> 00:18:41.385 lower case or something.
349 00:18:41.385 --> 00:18:43.305 And then, uh, we have the tokenizer
350 00:18:43.325 --> 00:18:44.865 and analyzer, uh, that is running.
351 00:18:45.565 --> 00:18:48.785 So then you don't have to think about the embeddings.
352 00:18:48.785 --> 00:18:50.865 That's also why we released full text search
353 00:18:51.485 --> 00:18:54.865 and that way you write text as an input, you get text as an output,
354 00:18:55.085 --> 00:18:57.745 and in between, as you said, uh, it's Milvus
355 00:18:57.775 --> 00:18:59.105 that is gonna handle everything here.
356 00:19:00.365 --> 00:19:05.345 Uh, there is another question, um,
357 00:19:06.755 --> 00:19:10.585 which is, is it an app strategy for rag index strategy?
358 00:19:11.285 --> 00:19:14.695 So maybe do you have like, I'm gonna try
359 00:19:14.695 --> 00:19:16.335 to rephrase this one as well because I'm not sure.
360 00:19:17.035 --> 00:19:19.495 Uh, do you, did you try different, um,
361 00:19:19.785 --> 00:19:21.615 index strategy maybe for your RAG?
362 00:19:22.835 --> 00:19:25.555 Hmm. Um, in the past?
363 00:19:25.625 --> 00:19:28.875 Yeah, in the couple of, in the last couple of months.
364 00:19:29.015 --> 00:19:31.995 Not really because I found my bread and butter. Mm-hmm.
365 00:19:32.075 --> 00:19:35.195 Which was just using OpenAI's, um, embeddings
366 00:19:35.215 --> 00:19:38.075 and then, uh, using a reranker, which was either
367 00:19:39.245 --> 00:19:40.895 from Hugging Face or somewhere else.
368 00:19:41.115 --> 00:19:42.335 But this was my bread and butter
369 00:19:42.335 --> 00:19:45.615 because up until recently I had never had the need for
370 00:19:46.445 --> 00:19:50.335 such a complex solution to RAG
371 00:19:50.605 --> 00:19:54.135 because uh, I didn't have the need to go through
372 00:19:55.245 --> 00:19:57.685 hundreds of, uh, websites and query them at the same time.
373 00:19:57.685 --> 00:20:01.645 Mm-hmm. So I hope this was, uh, a good enough answer.
374 00:20:02.795 --> 00:20:05.605 Yeah. And I guess we kind of replied to it,
375 00:20:05.625 --> 00:20:07.005 but just gonna ask it again.
376 00:20:07.025 --> 00:20:08.765 So did you try to combine vector
377 00:20:08.765 --> 00:20:09.845 with sparse embedding models?
378 00:20:09.985 --> 00:20:14.325 So like BGE-M3 that we have in Milvus, um, so did you try
379 00:20:15.305 --> 00:20:16.305 sparse vectors?
380 00:20:16.605 --> 00:20:18.865 Um, just to, to understand, you know, the advantages
381 00:20:18.865 --> 00:20:21.585 of using full text instead of embeddings.
382 00:20:22.595 --> 00:20:26.415 So in, uh, when I did, when I started working
383 00:20:27.155 --> 00:20:29.415 on this tool, I have tried, uh,
384 00:20:30.165 --> 00:20:32.255 many different combinations of many different things.
385 00:20:32.795 --> 00:20:36.055 But uh, at the end, I don't know if it was just
386 00:20:37.425 --> 00:20:38.985 convenience plus performance,
387 00:20:39.205 --> 00:20:41.225 but I kept going back to this vector
388 00:20:41.285 --> 00:20:43.545 and the full-text, uh, solution.
389 00:20:43.965 --> 00:20:47.845 It just seemed to give the better answers all the time.
390 00:20:48.285 --> 00:20:51.125 'cause with the other like combinations,
391 00:20:52.045 --> 00:20:55.725 I would also just get some very good response
392 00:20:55.825 --> 00:20:59.525 and then a completely out of the ball out of the park,
393 00:20:59.715 --> 00:21:01.205 like query response.
394 00:21:01.995 --> 00:21:06.485 This was the perfect balance to whatever I am working on.
395 00:21:06.485 --> 00:21:10.285 Because at the end of the day, uh, what I envision as a user
396 00:21:10.345 --> 00:21:12.765 for this is someone who also has an understanding of
397 00:21:12.765 --> 00:21:15.325 what they're looking at and what they're going through.
398 00:21:16.145 --> 00:21:20.105 So you have to be a bit more specific maybe.
399 00:21:20.365 --> 00:21:24.295 Uh, and, but it usually has produced better
400 00:21:24.355 --> 00:21:26.895 and more reliable uh, results.
401 00:21:27.135 --> 00:21:30.495 I have been using this for a couple of, mm-hmm.
402 00:21:30.975 --> 00:21:32.135 A couple of days minimum.
403 00:21:33.115 --> 00:21:36.645 Cool. And also to add to to your response, so one
404 00:21:36.645 --> 00:21:38.485 of the advantages of using full text instead of sparse
405 00:21:38.485 --> 00:21:40.925 embeddings, um, when you use sparse
406 00:21:40.925 --> 00:21:43.165 embeddings, you have to compute them yourself usually.
407 00:21:43.785 --> 00:21:45.765 Uh, which is, which can be tricky.
408 00:21:45.905 --> 00:21:48.245 You know, you have to update the statistics yourself.
409 00:21:48.465 --> 00:21:51.445 Um, whereas with full text search, basically we take care of
410 00:21:51.445 --> 00:21:53.685 that for you so you don't have to have another pipeline.
411 00:21:53.685 --> 00:21:56.645 You know, that is, um, doing that as well.
412 00:21:57.625 --> 00:22:01.925 Uh, there is one which is quite long. Uh, yeah, I'm reading
413 00:22:02.235 --> 00:22:03.235 This Like Currently,
414 00:22:03.235 --> 00:22:05.925 Currently building something utilizing Crawl4AI
415 00:22:05.925 --> 00:22:08.285 and Zeis, basically monitoring financial news
416 00:22:08.285 --> 00:22:09.845 and adding RAG capabilities on top.
417 00:22:10.045 --> 00:22:14.045 I have two issues I'm facing. One, a lot
418 00:22:14.045 --> 00:22:15.365 of articles are very different in
419 00:22:15.365 --> 00:22:16.845 length, some are extremely long.
420 00:22:17.505 --> 00:22:19.285 So what should be my approach to chunking?
421 00:22:20.185 --> 00:22:22.885 And the other one is that they are in different languages.
422 00:22:23.225 --> 00:22:24.965 How would you embed multilingual documents?
423 00:22:24.985 --> 00:22:26.565 How would you embed it in the same space
424 00:22:26.865 --> 00:22:28.245 or separate for each language?
425 00:22:28.545 --> 00:22:30.245 How would you have a multilingual embedding models?
426 00:22:31.985 --> 00:22:33.205 Uh, let's go for those two
427 00:22:33.505 --> 00:22:35.045 and then we'll have the follow up questions after.
428 00:22:35.945 --> 00:22:40.075 Okay. So, um, first of all, pretty interesting project.
429 00:22:40.335 --> 00:22:43.755 I'm not even gonna lie, this has some very good, uh,
430 00:22:43.755 --> 00:22:46.635 especially if you, you're into stocks and things like that.
431 00:22:47.175 --> 00:22:49.515 So about uh, the chunking thing, this is
432 00:22:49.535 --> 00:22:51.155 how I personally would go about it.
433 00:22:51.155 --> 00:22:52.675 And this is what I kind
434 00:22:52.675 --> 00:22:54.075 of have done in this project as well.
435 00:22:54.665 --> 00:22:59.355 Instead of, so I keep track of uh, chunks in two ways.
436 00:22:59.615 --> 00:23:02.955 The ID of the chunk, like every chunk has its own id
437 00:23:03.255 --> 00:23:05.395 and then the actual chunk number
438 00:23:05.535 --> 00:23:09.395 of the specific single website that you're reading, uh,
439 00:23:09.395 --> 00:23:12.635 that you're crawling, that's how I would do it.
440 00:23:12.695 --> 00:23:16.115 And then the very last chunk of that specific website,
441 00:23:16.115 --> 00:23:17.835 that's gonna be a bit shorter,
442 00:23:18.575 --> 00:23:20.515 but uh, I really don't think that
443 00:23:20.515 --> 00:23:21.795 that's that big of a problem.
444 00:23:22.395 --> 00:23:25.915 'cause usually in, especially in financial articles
445 00:23:25.915 --> 00:23:28.515 and things like that, the meat is in between.
446 00:23:29.215 --> 00:23:31.395 So that's in my opinion.
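[Editor's sketch] The two-way chunk bookkeeping described above (a unique ID per chunk, plus the chunk's position within its source page) could look roughly like this. It is a sketch under assumptions, not the speaker's code: the `Chunk` class, the `chunk_page` helper, the UUID-based IDs, and the fixed character window are all hypothetical choices.

```python
import uuid
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str      # unique across the whole collection
    url: str           # page the chunk came from
    chunk_number: int  # position of the chunk within that page
    text: str


def chunk_page(url, text, size=500):
    """Split one page into fixed-size character windows (assumed strategy)."""
    chunks = []
    for number, start in enumerate(range(0, len(text), size)):
        chunks.append(
            Chunk(
                chunk_id=uuid.uuid4().hex,
                url=url,
                chunk_number=number,
                text=text[start:start + size],
            )
        )
    return chunks


# Placeholder page: 1200 characters split into windows of 500.
chunks = chunk_page("https://example.com/article", "x" * 1200, size=500)
for c in chunks:
    print(c.chunk_number, len(c.text))
```

The last chunk of a page simply ends up shorter than the others, which matches the remark that this is rarely a problem in practice.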
447 00:23:32.215 --> 00:23:35.515 And then about the language part, I have experimented
448 00:23:35.545 --> 00:23:38.515 with uh, different embedding models for
449 00:23:39.465 --> 00:23:40.795 different languages.
450 00:23:42.635 --> 00:23:44.515 I still am sad
451 00:23:45.105 --> 00:23:48.405 or whatever to say that uh, the best way to go about this is
452 00:23:48.405 --> 00:23:50.085 to translate them to English
453 00:23:50.515 --> 00:23:55.435 because it'll never perform the same if you use
454 00:23:55.475 --> 00:23:58.355 a multimodel, uh, multi-language model,
455 00:23:58.355 --> 00:24:01.395 like embedding model, it has the drawback
456 00:24:01.395 --> 00:24:03.995 that it's trained on more, uh,
457 00:24:04.335 --> 00:24:06.355 on different languages, multi-language.
458 00:24:06.815 --> 00:24:09.955 So I personally would just translate whatever article in
459 00:24:09.955 --> 00:24:12.635 English and then resume with the chunking logic
460 00:24:12.815 --> 00:24:13.995 and then the ID logic.
461 00:24:14.375 --> 00:24:16.995 And then for the very last part, I really,
462 00:24:17.165 --> 00:24:19.395 especially when it comes to financial, uh,
463 00:24:23.915 --> 00:24:27.675 I really personally would use, uh, BM25
464 00:24:27.695 --> 00:24:31.395 and ANN, like, mm-hmm.
465 00:24:31.945 --> 00:24:35.995 Yeah. But I, that's how I would do it.
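[Editor's sketch] One common way to combine a BM25 result list with an ANN result list is reciprocal rank fusion (RRF). The answer above does not say which fusion method was used, so RRF is an assumption here, and the `rrf_fuse` helper, the document IDs, and the constant `k=60` are illustrative only.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked ID lists using reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists, best hit first.
bm25_hits = ["doc_a", "doc_b", "doc_c"]  # keyword (BM25) ranking
ann_hits = ["doc_b", "doc_d", "doc_a"]   # vector (ANN) ranking

fused = rrf_fuse([bm25_hits, ann_hits])
print(fused)
```

Documents that both retrievers return rise to the top, while documents found by only one retriever still stay in the list.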
466 00:24:37.395 --> 00:24:40.085 Cool, thank you. And yeah, Ari,
467 00:24:40.165 --> 00:24:42.125 I think you already mentioned it's a bit like the
468 00:24:42.835 --> 00:24:45.085 performance for hybrid or semantic search.
469 00:24:45.265 --> 00:24:47.245 You said basically
470 00:24:47.305 --> 00:24:48.525 by using full text search you
471 00:24:48.525 --> 00:24:49.565 have better performance, right?
472 00:24:49.565 --> 00:24:51.085 The first question you have in the Q&A.
473 00:24:51.755 --> 00:24:53.885 Yeah, and it,
474 00:24:54.105 --> 00:24:58.415 and in my experience it really has, I have yet
475 00:24:58.415 --> 00:25:00.695 to find a single drawback, like something
476 00:25:00.695 --> 00:25:02.935 that really makes me question the whole solution.
477 00:25:04.105 --> 00:25:05.575 There might have been caveats,
478 00:25:05.575 --> 00:25:06.695 but I don't even remember them.
479 00:25:07.075 --> 00:25:09.735 So it's worked out pretty well.
480 00:25:10.245 --> 00:25:12.935 Cool. Um, just gonna wait one
481 00:25:12.935 --> 00:25:16.735 or two minutes to know if we have more questions from the
482 00:25:16.735 --> 00:25:20.125 people otherwise, uh, yeah,
483 00:25:20.125 --> 00:25:21.965 maybe they can add you on LinkedIn
484 00:25:22.025 --> 00:25:23.445 or somewhere where can people
485 00:25:23.625 --> 00:25:25.085 follow up if they have questions?
486 00:25:25.945 --> 00:25:28.075 Okay. They can add you on LinkedIn. Uh,
487 00:25:28.705 --> 00:25:31.755 Yeah, pretty sure LinkedIn if you have it,
488 00:25:32.055 --> 00:25:35.305 or either my LinkedIn or my GitHub.
489 00:25:35.565 --> 00:25:37.275 Uh, yeah,
490 00:25:39.875 --> 00:25:44.555 Just waiting quickly and then we'll see.
491 00:25:44.705 --> 00:25:48.165 Otherwise we know wait one at 30
492 00:25:49.065 --> 00:25:52.045 if we don't have questions we can uh, close this one.
493 00:25:53.185 --> 00:25:54.565 But yeah, that was very interesting.
494 00:25:54.685 --> 00:25:57.645 I mean the use case is also interesting as a
495 00:25:58.975 --> 00:26:00.905 more than a toy project as well.
496 00:26:00.905 --> 00:26:03.825 Mm-hmm. So this is, uh, this is quite cool.
497 00:26:04.575 --> 00:26:07.025 It's being used on the daily, uh, it,
498 00:26:07.365 --> 00:26:09.985 of course it has room for improvement as I mentioned.
499 00:26:10.015 --> 00:26:13.225 Like preferably I would like to implement some logic
500 00:26:13.225 --> 00:26:16.985 that differentiates different, uh, site maps, so
501 00:26:17.045 --> 00:26:19.945 of different projects or libraries.
502 00:26:20.765 --> 00:26:24.665 But I have yet to work on something that complex. Mm-hmm.
503 00:26:24.745 --> 00:26:27.145 So I didn't need to refer to multiple
504 00:26:27.765 --> 00:26:28.865 things at the same time.
505 00:26:28.975 --> 00:26:30.065 Like it's been fine.
506 00:26:31.025 --> 00:26:33.115 Okay. Okay. It seems like, yeah,
507 00:26:33.115 --> 00:26:34.635 there's no more questions.
508 00:26:35.065 --> 00:26:36.715 Well, thank you very much for the presentation.
509 00:26:37.095 --> 00:26:39.955 Uh, thank you very everyone, everyone, sorry for attending.
510 00:26:40.415 --> 00:26:42.595 Uh, we'll follow up with a recording
511 00:26:42.985 --> 00:26:44.605 so you will see it in a couple of days
512 00:26:45.465 --> 00:26:47.965 and I will see you soon on my end.
513 00:26:48.365 --> 00:26:50.245 I actually see you next week for the other webinar,
514 00:26:50.335 --> 00:26:53.685 which will be with Feast on how you can do real-time RAG.
515 00:26:54.225 --> 00:26:56.685 So thank you very much. Have a lovely morning, afternoon,
516 00:26:56.685 --> 00:26:57.885 or evening, wherever you are in the world.
517 00:26:58.705 --> 00:26:59.845 And goodbye.