Milvus Hybrid Search: Combining Keyword Precision with Semantic Power for Next-Gen Data Retrieval
Webinar

About this webinar
This webinar demonstrates a unified approach to document retrieval by combining advanced web crawling with a hybrid search architecture that leverages both full text and dense vector search within Milvus. Participants will see how Crawl4AI is used to extract documentation which is then ingested into Milvus. The system utilizes Milvus’ full text search capabilities—powered by built-in BM25 relevance scoring—to handle keyword matching, while dense vector search is employed for semantic similarity. Together, these methods address the challenge of accurately retrieving relevant information.
Topics Covered
- Data Extraction with Crawl4AI: Techniques for scraping documentation for Python and npm libraries.
- Milvus Full Text Search: Internal mechanisms including text tokenization, sparse embedding generation, and BM25 scoring.
- Milvus Dense Vector Search: Use of ANN search and efficient indexing strategies to rapidly compute semantic similarity.
- Hybrid Search Integration: Method of merging keyword-based and vector-based retrieval to improve accuracy and coverage.
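One common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The webinar does not state which fusion method the demo uses, so the following is only a minimal sketch under that assumption; the function name, the chunk IDs, and the constant `k=60` are illustrative choices, not taken from the talk.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first.

    Each ID receives 1 / (k + rank) from every list it appears in,
    so IDs ranked well by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a stable result.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a BM25 query and a dense vector query.
keyword_hits = ["chunk-7", "chunk-2", "chunk-9"]
vector_hits = ["chunk-2", "chunk-5", "chunk-7"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Chunks returned by both retrievers end up ahead of chunks returned by only one, which is the "best of both worlds" effect the session describes.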
1 00:00:03.545 --> 00:00:04.645 I'm pleased to introduce
2 00:00:04.645 --> 00:00:07.405 today's session, Hybrid Search: Combining Keyword Precision
3 00:00:07.405 --> 00:00:10.085 with Semantic Power for Next-Gen Data Retrieval,
4 00:00:10.085 --> 00:00:12.645 with our guest speaker today, kuka.
5 00:00:12.865 --> 00:00:17.165 He is a data and AI engineer at Datamax AI specializing
6 00:00:17.185 --> 00:00:19.325 in ML and AI development and integration.
7 00:00:20.105 --> 00:00:23.165 He has gained deep insight into data, transformer
8 00:00:23.165 --> 00:00:24.685 architecture, attention mechanisms,
9 00:00:24.905 --> 00:00:26.725 and other critical areas of machine learning.
10 00:00:27.515 --> 00:00:29.045 He's passionate about machine learning
11 00:00:29.505 --> 00:00:32.125 and Early is committed to advancing his career,
12 00:00:32.385 --> 00:00:33.685 and ultimately leading teams
13 00:00:33.745 --> 00:00:36.765 to create innovative solutions in cloud development and ai.
14 00:00:37.295 --> 00:00:39.125 Early the stage is yours.
15 00:00:40.325 --> 00:00:42.625 Uh, thank you Stefan. Hi everyone.
16 00:00:43.205 --> 00:00:45.745 So, yes, uh,
17 00:00:46.765 --> 00:00:49.565 the, this presentation.
18 00:00:49.825 --> 00:00:52.005 In this presentation we're gonna be talking about, uh,
19 00:00:52.795 --> 00:00:55.245 combining the Milvus full text search
20 00:00:55.245 --> 00:00:58.045 and embedding search to achieve
21 00:00:58.045 --> 00:00:59.805 something interesting in my opinion.
22 00:01:00.465 --> 00:01:04.345 So the quest for this, uh, project started
23 00:01:04.345 --> 00:01:08.865 because I was in need of a certain tool that was able
24 00:01:08.865 --> 00:01:12.875 to crawl the web, uh, whatever websites they can find,
25 00:01:13.675 --> 00:01:17.925 generate, uh, some data such as a title for the thing,
26 00:01:18.045 --> 00:01:20.525 a summary basically to allow me
27 00:01:20.525 --> 00:01:22.565 to better find whatever pieces
28 00:01:22.625 --> 00:01:25.405 and chunks that I needed to help myself, uh,
29 00:01:25.475 --> 00:01:28.965 code a bit faster or understand, uh, like libraries better.
30 00:01:30.105 --> 00:01:34.125 So I will be sharing the presentation that I have made
31 00:01:34.225 --> 00:01:39.165 and then, uh, we will go deeper into a, the demo which, uh,
32 00:01:39.585 --> 00:01:41.685 in which we will showcase how
33 00:01:42.195 --> 00:01:45.005 this whole logic comes together so that, uh,
34 00:01:45.105 --> 00:01:46.445 you could use it in the future.
35 00:01:54.885 --> 00:01:59.465 So, as I said, uh, the
36 00:02:00.575 --> 00:02:03.435 key point of this presentation is about Milvus
37 00:02:03.575 --> 00:02:07.715 and, uh, their, uh, full text search
38 00:02:07.815 --> 00:02:08.955 and, uh, vector search,
39 00:02:09.735 --> 00:02:13.135 uh, sorry.
40 00:02:13.835 --> 00:02:18.095 So combining them, why would you need to combine these two,
41 00:02:18.475 --> 00:02:20.775 uh, retrieval, uh, methods?
42 00:02:21.325 --> 00:02:22.695 Well, uh, simply
43 00:02:22.725 --> 00:02:25.695 because maybe some of them, for example,
44 00:02:25.795 --> 00:02:28.215 if you use a full text search, maybe you're going
45 00:02:28.215 --> 00:02:31.255 to miss a couple of details in, uh, whatever
46 00:02:32.115 --> 00:02:34.775 big document you are going through.
47 00:02:35.165 --> 00:02:36.935 Same goes with Vector. They have their
48 00:02:37.185 --> 00:02:38.775 advantages and disadvantages.
49 00:02:39.075 --> 00:02:40.615 The key thing here is to make sure
50 00:02:40.615 --> 00:02:42.575 that you get the best of both worlds.
51 00:02:42.955 --> 00:02:45.975 And, uh, it's good to use Milvus
52 00:02:45.975 --> 00:02:48.215 because as I will be showing you later,
53 00:02:48.475 --> 00:02:51.015 it is running locally on my Docker machine
54 00:02:51.515 --> 00:02:55.855 and, uh, it's absolutely very fast and very reliable.
55 00:02:56.155 --> 00:02:59.685 It, uh, spins up in a couple of seconds.
56 00:02:59.685 --> 00:03:01.885 So how do I even do this?
57 00:03:02.145 --> 00:03:06.845 Uh, crawling of the Python and npm, uh, libraries.
58 00:03:07.485 --> 00:03:09.565 I am currently using a GitHub library
59 00:03:09.565 --> 00:03:10.845 called Crawl4AI.
60 00:03:11.025 --> 00:03:14.765 I'm pretty sure that people who are in the AI, LLM, uh,
61 00:03:15.735 --> 00:03:18.365 scene are very familiar with Crawl4AI.
62 00:03:19.115 --> 00:03:22.685 What, um, it basically does is you can, uh,
63 00:03:24.345 --> 00:03:28.555 just crawl a specific website and it formats it
64 00:03:28.655 --> 00:03:31.915 or parses the most important things
65 00:03:32.725 --> 00:03:34.905 for the AI to better understand the context
66 00:03:34.905 --> 00:03:36.625 that it will get later.
67 00:03:37.405 --> 00:03:41.545 Uh, I have also developed some smart logic, which I will
68 00:03:41.545 --> 00:03:45.025 touch on a bit later, that, uh, recognizes code blocks
69 00:03:45.055 --> 00:03:48.825 that you might find regularly in, uh, documentation.
70 00:03:49.725 --> 00:03:50.945 In doc websites.
71 00:03:51.525 --> 00:03:55.245 It, uh, goes through all,
72 00:03:55.465 --> 00:03:58.885 so it uses the sitemap to go through all of the
73 00:03:59.795 --> 00:04:04.605 different websites that are inside a documentation website.
74 00:04:05.025 --> 00:04:09.045 And, uh, it makes everything ready for the AI to understand.
75 00:04:09.225 --> 00:04:13.245 And then after it crawls the websites, it creates embeddings
76 00:04:13.245 --> 00:04:16.165 and whatnot, puts them in the Milvus, uh, database
77 00:04:16.185 --> 00:04:17.925 and then we can access them later.
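The sitemap step described here can be sketched with the Python standard library. This is only an illustration of reading page URLs out of a sitemap document, not the speaker's script; the function name and the example URLs are hypothetical, and a real crawler would first download the sitemap before parsing it.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return every <loc> URL listed in a sitemap XML string."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content for a documentation site.
example_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/</loc></url>
  <url><loc>https://docs.example.com/agents/</loc></url>
</urlset>"""

pages = urls_from_sitemap(example_sitemap)
print(pages)
```

Each returned URL would then be handed to the crawler, chunked, embedded, and stored.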
78 00:04:18.905 --> 00:04:21.725 So the Milvus full text search mechanism,
79 00:04:22.155 --> 00:04:25.445 what it basically does is, uh, by the way, this is
80 00:04:25.465 --> 00:04:28.565 behind the scenes, so it's not like you get to see much of,
81 00:04:28.565 --> 00:04:29.565 uh, the Milvus part.
82 00:04:29.905 --> 00:04:34.125 It, uh, uses BM25 scoring, where
83 00:04:35.585 --> 00:04:37.605 you give it some natural language
84 00:04:37.985 --> 00:04:39.605 and then it generates a query.
85 00:04:39.825 --> 00:04:41.805 And then based on that query, it goes
86 00:04:41.805 --> 00:04:44.725 and retrieves the most relevant, uh, documentation
87 00:04:44.865 --> 00:04:48.085 or whatever chunks that, uh, you might have.
88 00:04:48.145 --> 00:04:51.005 Chunks are very important here. I will show it a bit later.
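To make the BM25 scoring mentioned here concrete, the following is a small pure-Python sketch of the common Okapi BM25 formula with a Lucene-style IDF. It is not Milvus' internal implementation; whitespace tokenization, the parameter defaults `k1=1.2` and `b=0.75`, and the sample chunks are assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            # Length normalization: long documents are penalized slightly.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical chunks standing in for crawled documentation.
chunks = [
    "pydantic validates data with type hints",
    "milvus stores dense and sparse vectors",
    "the weather agent calls a tool",
]
print(bm25_scores("weather agent", chunks))
```

Only chunks that literally contain a query term get a non-zero score, which is why keyword search is precise for exact names but blind to paraphrases.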
89 00:04:52.025 --> 00:04:54.525 The same goes with, uh, vector.
90 00:04:55.145 --> 00:04:58.645 Now, as I said before, there are advantages
91 00:04:58.645 --> 00:05:00.205 and disadvantages to them both.
92 00:05:00.785 --> 00:05:04.405 Uh, they are very strong solutions, each
93 00:05:04.625 --> 00:05:05.925 for their own thing.
94 00:05:06.075 --> 00:05:08.765 However, they do have a very big problem,
95 00:05:08.815 --> 00:05:10.285 especially in RAG.
96 00:05:10.605 --> 00:05:13.685 'cause at the end of the day, the solution is RAG, uh,
97 00:05:14.005 --> 00:05:15.085 retrieval-augmented generation.
98 00:05:15.085 --> 00:05:19.575 Yeah, the problem is that if you give it too much data
99 00:05:19.755 --> 00:05:22.335 or you saturate the database a lot,
100 00:05:22.675 --> 00:05:24.055 it will get confused.
101 00:05:24.185 --> 00:05:25.455 It'll definitely get confused.
102 00:05:26.035 --> 00:05:27.135 So the workaround
103 00:05:27.475 --> 00:05:31.575 and why we are using both of them, uh, is
104 00:05:31.765 --> 00:05:34.575 that we are generating embeddings for each
105 00:05:34.755 --> 00:05:36.775 of the chunks.
106 00:05:37.535 --> 00:05:39.635 We are generating summaries for each of the chunks
107 00:05:39.735 --> 00:05:43.925 and a title, just to have them a bit more.
108 00:05:45.495 --> 00:05:47.195 The title was just for show actually,
109 00:05:47.295 --> 00:05:49.315 but still we are using the summarization
110 00:05:49.855 --> 00:05:51.915 and we're passing queries to the summarization
111 00:05:51.975 --> 00:05:54.315 and the full context of each of the chunks.
112 00:05:55.275 --> 00:05:58.365 Uh, the chunk, as I said earlier, it is a bit, uh,
113 00:05:58.865 --> 00:06:02.525 varied from documentation site to documentation site.
114 00:06:02.685 --> 00:06:04.085 'cause one might have code blocks
115 00:06:04.185 --> 00:06:06.245 and it has some smart logic to either
116 00:06:06.835 --> 00:06:08.965 read the whole code block or stop a bit
117 00:06:08.965 --> 00:06:11.845 before that, either way to improve accuracy
118 00:06:12.105 --> 00:06:14.605 to make it more redundant, better
119 00:06:15.245 --> 00:06:18.495 coverage, scalable of course.
120 00:06:18.895 --> 00:06:22.715 'cause uh, if you're gonna be passing a full documentation
121 00:06:22.715 --> 00:06:26.915 of some NPM library, whatever you might want to add on top
122 00:06:26.915 --> 00:06:29.795 of that, a couple of, uh, libraries and whatnot.
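The code-block-aware chunking described above could look roughly like the sketch below. It is an assumption about one way to do it, not the speaker's actual logic: it splits Markdown text on a character budget but never closes a chunk while inside a fenced code block. The function name and the `max_chars` default are hypothetical.

```python
FENCE = "`" * 3  # Markdown code fence marker


def chunk_markdown(text, max_chars=1000):
    """Split text into chunks of roughly max_chars, keeping code blocks whole."""
    chunks, current = [], []
    size = 0
    in_code = False
    for line in text.splitlines():
        if line.strip().startswith(FENCE):
            in_code = not in_code  # entering or leaving a fenced block
        current.append(line)
        size += len(line) + 1
        # Only close a chunk when the budget is used up AND we are
        # outside a code block, so a block is never cut in half.
        if size >= max_chars and not in_code:
            chunks.append("\n".join(current))
            current, size = [], 0
    if current:
        chunks.append("\n".join(current))
    return chunks


sample = "\n".join(
    ["Intro text about the library.", FENCE + "python", "x = 1", "y = 2", FENCE, "Closing note."]
)
for chunk in chunk_markdown(sample, max_chars=20):
    print(repr(chunk))
```

A chunk may exceed the budget when a code block is long; that trade-off keeps every example runnable as retrieved.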
123 00:06:31.255 --> 00:06:35.235 So why did we do this?
124 00:06:36.665 --> 00:06:40.645 As I said, the main problem is actually saturation.
125 00:06:41.425 --> 00:06:45.435 So for me at least, uh, one of the first projects
126 00:06:45.435 --> 00:06:48.195 that I've started with AI, uh, was a RAG.
127 00:06:48.735 --> 00:06:50.995 And, uh, I noticed that the second that you passed
128 00:06:51.715 --> 00:06:53.075 a couple hundreds
129 00:06:53.175 --> 00:06:57.555 or tens of chunks, actually the model would get very,
130 00:06:57.575 --> 00:07:01.555 the embedding model would get very, uh, confused
131 00:07:01.975 --> 00:07:05.515 and, uh, the solution used to be to just use rerankers
132 00:07:05.735 --> 00:07:06.835 or things like that.
133 00:07:07.415 --> 00:07:08.915 The problem with rerankers is
134 00:07:08.915 --> 00:07:10.675 that they can also get confused.
135 00:07:11.455 --> 00:07:15.705 So yeah, the solution to
136 00:07:15.705 --> 00:07:17.505 that is agentic RAG.
137 00:07:18.125 --> 00:07:22.785 So the summary that we generate for each of the chunks,
138 00:07:23.085 --> 00:07:24.385 we also, in that summary,
139 00:07:24.485 --> 00:07:27.345 we specify which chunk it is and everything.
140 00:07:27.605 --> 00:07:31.225 And then we pass the queries, both vector and uh, full text.
141 00:07:31.225 --> 00:07:34.905 We pass them to the summary of the summaries
142 00:07:35.245 --> 00:07:38.585 and then we pass the most relevant result from the queries,
143 00:07:38.695 --> 00:07:40.545 both of them, to ChatGPT
144 00:07:40.805 --> 00:07:43.945 or whatever LLM of your choice.
145 00:07:45.065 --> 00:07:46.955 This, from my experience
146 00:07:46.955 --> 00:07:48.395 and from the experience of my colleagues,
147 00:07:48.395 --> 00:07:51.355 which have been using this as an internal tool, has
148 00:07:51.875 --> 00:07:56.655 resulted in very accurate, uh, results.
149 00:07:57.495 --> 00:07:59.895 I have yet to implement some logic
150 00:07:59.925 --> 00:08:01.975 that would, uh, differentiate libraries
151 00:08:01.975 --> 00:08:03.455 better and things like that.
152 00:08:03.455 --> 00:08:07.335 However, for what we use it, it's perfectly fine.
153 00:08:09.065 --> 00:08:12.285 And now I will be showcasing the live demo.
154 00:08:12.585 --> 00:08:16.645 The live demo will go as this, uh, we will be crawling a
155 00:08:17.235 --> 00:08:18.445 website of my choice.
156 00:08:18.945 --> 00:08:21.125 Uh, in this example I've chosen Pydantic
157 00:08:21.125 --> 00:08:23.925 because I had a project with that.
158 00:08:24.025 --> 00:08:25.285 So I thought, why not?
159 00:08:25.945 --> 00:08:29.845 Uh, I will showcase the sitemap, uh, logic
160 00:08:30.345 --> 00:08:32.365 and how that actually looks.
161 00:08:33.645 --> 00:08:37.105 And then after it goes through all of the,
162 00:08:37.355 --> 00:08:39.345 after it crawls all of the
163 00:08:41.895 --> 00:08:43.455 documentation websites from Pydantic,
164 00:08:43.545 --> 00:08:46.495 which it finds from the sitemap, then we can query it
165 00:08:46.555 --> 00:08:48.575 as if it were a regular RAG.
166 00:08:48.645 --> 00:08:52.175 However, we're not going to be using just one or the other.
167 00:08:52.355 --> 00:08:53.975 So not just vector
168 00:08:54.195 --> 00:08:56.165 or text, we're gonna be using both of them
169 00:08:56.265 --> 00:08:57.725 to have even better accuracy.
170 00:09:01.745 --> 00:09:04.555 Okay. So this is the script.
171 00:09:04.915 --> 00:09:06.835 I really won't go over too much.
172 00:09:07.175 --> 00:09:11.695 Um, yeah, I think the chunk size
173 00:09:11.795 --> 00:09:13.135 and the,
174 00:09:13.315 --> 00:09:15.775 and the logic to find the code blocks, that's, uh,
175 00:09:16.165 --> 00:09:17.695 like interesting in my opinion.
176 00:09:17.795 --> 00:09:21.075 And I'm sure whoever wants to can have a look at it.
177 00:09:21.655 --> 00:09:24.195 So I will be opening up the sitemap just
178 00:09:24.195 --> 00:09:26.155 so you have an understanding of how
179 00:09:26.155 --> 00:09:27.755 that actually might look like.
180 00:09:29.005 --> 00:09:33.455 This is it basically, I, it has multiple links
181 00:09:33.635 --> 00:09:35.895 inside this one sitemap.
182 00:09:36.115 --> 00:09:38.575 Uh, people usually, so companies usually use this
183 00:09:38.575 --> 00:09:40.455 for SEO search engine optimization,
184 00:09:41.115 --> 00:09:42.695 but, uh, that's good for us
185 00:09:42.695 --> 00:09:47.135 because we get to utilize it for whatever we want.
186 00:09:47.855 --> 00:09:50.935 I should also state that you should also always look at the
187 00:09:51.115 --> 00:09:54.975 robots.txt of every website just to be
188 00:09:55.555 --> 00:09:56.775 on the safer side.
189 00:09:57.375 --> 00:10:00.455 'cause maybe it is maybe illegal or I don't know.
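Checking robots.txt before crawling can be done with Python's standard `urllib.robotparser`. The sketch below parses rules from a string so it runs without network access; the helper name, the user agent string, and the example rules are hypothetical, and a real crawler would fetch the site's actual robots.txt first.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for a documentation site.
example_rules = """User-agent: *
Disallow: /private/
"""

print(allowed_by_robots(example_rules, "docs-crawler", "https://docs.example.com/agents/"))
print(allowed_by_robots(example_rules, "docs-crawler", "https://docs.example.com/private/notes"))
```

Filtering the sitemap URLs through a check like this keeps the crawl within what the site owner has published as acceptable.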
190 00:10:01.115 --> 00:10:05.135 So I have, uh, already crawled the
191 00:10:07.235 --> 00:10:11.285 websites and, uh, it shows, for example, one chunk.
192 00:10:11.385 --> 00:10:16.045 The first chunk just as a demo, it has a title, summary,
193 00:10:16.585 --> 00:10:21.005 uh, the whole context, chunk ID, chunk number, everything
194 00:10:21.025 --> 00:10:23.495 and embeddings a bit, uh,
195 00:10:23.495 --> 00:10:25.335 further down timestamp and everything.
196 00:10:26.035 --> 00:10:30.095 So what we do is we run crawl Pydantic AI documentation.
197 00:10:30.195 --> 00:10:31.415 That's how I named it.
198 00:10:31.955 --> 00:10:35.695 And then, uh, after it crawls everything,
199 00:10:36.075 --> 00:10:39.215 we just do streamlit run Streamlit UI.
200 00:10:40.465 --> 00:10:43.965 So this Streamlit UI is very simple.
201 00:10:44.225 --> 00:10:48.455 It just has, uh, an input, like,
202 00:10:48.965 --> 00:10:51.375 text box. You can clear the chat, upload pictures.
203 00:10:51.595 --> 00:10:54.255 Uh, actually Stefan recommended this.
204 00:10:54.355 --> 00:10:58.815 So if you are at a meeting for example, or a presentation
205 00:10:58.815 --> 00:11:00.615 or something, you can just take a picture of
206 00:11:00.615 --> 00:11:01.615 what you're seeing on the screen
207 00:11:01.995 --> 00:11:05.455 and uh, if you have already crawled their documentation,
208 00:11:05.765 --> 00:11:09.935 then you can consult your own, um, RAG about
209 00:11:10.495 --> 00:11:12.375 whatever it is you're seeing here.
210 00:11:12.635 --> 00:11:14.215 I'm passing a very simple query.
211 00:11:14.395 --> 00:11:15.895 I'm sure everyone is, uh, familiar
212 00:11:16.005 --> 00:11:18.935 with the Pydantic weather agent example.
213 00:11:19.555 --> 00:11:24.215 And uh, what we're looking to get is whatever is on
214 00:11:25.505 --> 00:11:27.165 the Pydantic website.
215 00:11:27.305 --> 00:11:31.245 And I will also copy this so we can run it.
216 00:11:37.015 --> 00:11:40.685 So this is the example code that Pydantic provides.
217 00:11:41.315 --> 00:11:44.405 This is something close to what we're supposed to see.
218 00:11:44.425 --> 00:11:46.885 And if the model does a very good job, it'll understand
219 00:11:46.885 --> 00:11:48.285 that you don't need all of these,
220 00:11:48.505 --> 00:11:50.565 you just need a part, this and that.
221 00:11:50.785 --> 00:11:52.045 You don't really need this.
222 00:11:53.965 --> 00:11:58.385 And it gives you the exact same thing with less, uh,
223 00:11:59.855 --> 00:12:03.535 of these inputs and things like that fully functioning.
224 00:12:03.995 --> 00:12:07.055 It, uh, shows you how to do it or what you need to install.
225 00:12:07.275 --> 00:12:10.115 And, uh, yeah, that's it.
226 00:12:10.495 --> 00:12:15.425 Uh, the way that you, so the way that you kind
227 00:12:15.425 --> 00:12:19.825 of orchestrate the whole operation
228 00:12:20.365 --> 00:12:24.585 is by using this other file in which you just pass a
229 00:12:24.585 --> 00:12:25.745 simple query.
230 00:12:26.225 --> 00:12:29.465 I have: you're an expert at PydanticAI,
231 00:12:29.545 --> 00:12:31.625 a Python AI agent framework with access
232 00:12:31.625 --> 00:12:32.905 to extensive documentation.
233 00:12:33.045 --> 00:12:36.825 You give it some tools, uh, and it just works.
234 00:12:37.605 --> 00:12:40.425 By the way, fun fact, I have also used Pydantic
235 00:12:41.325 --> 00:12:45.105 for this project, so that's how the idea even came about.
236 00:12:45.625 --> 00:12:47.745 I wanted to know a bit more about Pydantic
237 00:12:48.365 --> 00:12:51.405 and uh, this is how the whole thing works.
238 00:12:51.945 --> 00:12:52.945 So yeah.
239 00:12:59.125 --> 00:13:02.725 Hello? Uh, that was quick.
240 00:13:05.335 --> 00:13:07.925 Maybe, uh, I have a couple of questions,
241 00:13:07.985 --> 00:13:10.485 but maybe you can explain them, like
242 00:13:10.485 --> 00:13:13.285 how you use it on a day-to-day basis on your end,
243 00:13:14.185 --> 00:13:15.185 Of course. Like what you did
244 00:13:15.185 --> 00:13:15.765 achieve.
245 00:13:17.065 --> 00:13:21.045 So, um, as I said, with the PydanticAI, uh, example,
246 00:13:21.755 --> 00:13:25.765 what I like, what I was trying to do at first was
247 00:13:25.825 --> 00:13:28.405 to have a tool for myself
248 00:13:28.505 --> 00:13:32.485 and my colleagues maybe so that, uh, you could just scrape
249 00:13:33.085 --> 00:13:36.245 documentation from different NPM
250 00:13:36.245 --> 00:13:39.765 or Python libraries or things that you find on the web.
251 00:13:39.865 --> 00:13:43.085 Mm-hmm. And, uh, it would help you better understand it,
252 00:13:43.085 --> 00:13:44.325 so you would understand it faster.
253 00:13:44.555 --> 00:13:47.245 Instead of you having to go through all the documentation,
254 00:13:47.985 --> 00:13:52.125 you could literally just give, like skim it over
255 00:13:52.125 --> 00:13:54.125 so you understand where to look.
256 00:13:54.465 --> 00:13:56.245 And then you can query ChatGPT
257 00:13:56.245 --> 00:13:59.445 or another LLM to give you actual
258 00:14:00.475 --> 00:14:02.325 like correct responses.
259 00:14:02.325 --> 00:14:06.185 Mm-hmm. That's how the whole uh, thing came about.
260 00:14:07.915 --> 00:14:09.005 Okay. Okay. Cool.
261 00:14:09.305 --> 00:14:11.605 Uh, I don't know if we have questions in the chat.
262 00:14:12.105 --> 00:14:15.165 Uh, so for the people, you can ask directly in the q and a
263 00:14:16.025 --> 00:14:18.605 and otherwise I'm gonna go
264 00:14:18.605 --> 00:14:20.325 and ask some question myself in the meantime.
265 00:14:21.225 --> 00:14:24.005 Um, so you mentioned quickly BM25.
266 00:14:24.465 --> 00:14:26.725 Uh, so can you explain, you know, the role
267 00:14:26.725 --> 00:14:28.965 of BM25 in the scoring for, like,
268 00:14:29.475 --> 00:14:31.445 with Milvus full text search that you have?
269 00:14:32.105 --> 00:14:32.755 Yeah, of course.
270 00:14:37.335 --> 00:14:40.125 Wait, you froze. Hello again.
271 00:14:48.145 --> 00:14:51.595 Getting someone in the chat. Okay, you're back. You froze.
272 00:14:51.855 --> 00:14:55.155 You were freezing. Hear me? Yes, now I can hear you.
273 00:14:55.295 --> 00:14:58.755 I'm very sorry. My internet. I don't know.
274 00:14:59.255 --> 00:15:01.875 Uh, the question was about, uh, BM25. Yes,
275 00:15:02.335 --> 00:15:03.335 Yes.
276 00:15:03.655 --> 00:15:07.915 Okay. So I will,
277 00:15:08.195 --> 00:15:12.625 I will explain how, like, okay, so BM25,
278 00:15:12.625 --> 00:15:15.705 basically the reason why you would use,
279 00:15:16.845 --> 00:15:18.865 the reason why you would use BM25. Yes.
280 00:15:19.455 --> 00:15:22.305 Yeah. Yeah. Why you like, okay.
281 00:15:22.305 --> 00:15:23.865 What's the benefit you saw in yourself?
282 00:15:24.005 --> 00:15:26.065 Was it, uh, when you use it,
283 00:15:26.205 --> 00:15:28.225 did you see like something better than only
284 00:15:28.225 --> 00:15:29.585 using, you know, vector search?
285 00:15:30.175 --> 00:15:31.425 Okay, yeah. Yeah.
286 00:15:31.765 --> 00:15:35.785 So basically since, uh, the idea of the project was
287 00:15:36.005 --> 00:15:40.185 to utilize Milvus in order to make something great,
288 00:15:40.515 --> 00:15:42.425 make something very interesting so
289 00:15:42.425 --> 00:15:45.745 that we could showcase to better understand Milvus
290 00:15:45.765 --> 00:15:48.425 and, uh, our capabilities as a team.
291 00:15:49.335 --> 00:15:51.385 What, uh, BM25 is,
292 00:15:51.485 --> 00:15:53.185 I'm pretty sure it's just Best Matching.
293 00:15:53.485 --> 00:15:56.385 Uh, it's a retrieval
294 00:15:57.095 --> 00:15:58.545 ranking thing for search results.
295 00:15:59.085 --> 00:16:01.865 And, uh, what it does on the backend, at least
296 00:16:01.865 --> 00:16:05.745 for Milvus' side, why I've chosen also to use it is
297 00:16:05.745 --> 00:16:09.425 because since it allows you to utilize just, um,
298 00:16:10.055 --> 00:16:14.425 natural language, it could also have benefits when it comes
299 00:16:14.445 --> 00:16:17.465 to how it understands that natural language
300 00:16:17.525 --> 00:16:18.945 as like a query by itself.
301 00:16:19.605 --> 00:16:20.945 Uh, I have noticed kind
302 00:16:20.945 --> 00:16:24.785 of different results, like, from vector to BM25.
303 00:16:25.245 --> 00:16:29.025 At the end of the day, they both use the same kind of logic,
304 00:16:29.765 --> 00:16:34.095 but, um, it's a bit of Milvus magic, the way
305 00:16:34.095 --> 00:16:35.575 that it, uh, works.
306 00:16:35.575 --> 00:16:39.005 Like it does. I, from my experiments,
307 00:16:39.175 --> 00:16:43.085 after a couple of back and forth, uh, a couple of minutes
308 00:16:43.305 --> 00:16:44.925 or messages of going back
309 00:16:45.045 --> 00:16:49.815 and forth with the chat, the vector part starts
310 00:16:49.815 --> 00:16:54.215 to get saturated and maybe a bit, uh, confused.
311 00:16:54.555 --> 00:16:59.135 Mm-hmm. I can't really put my finger on why.
312 00:16:59.305 --> 00:17:02.335 Maybe, maybe it's also partly
313 00:17:02.335 --> 00:17:05.215 the fault of, uh, my logic, like the way
314 00:17:05.215 --> 00:17:06.575 that I wrote the script.
315 00:17:07.035 --> 00:17:10.575 But, uh, the combination of both of them has proven to just
316 00:17:12.275 --> 00:17:16.625 bring the best, uh, results from each respective query.
317 00:17:17.195 --> 00:17:18.925 Yeah. In this application.
318 00:17:19.465 --> 00:17:22.995 Yeah, no, uh, also I can add, uh, on that, it's, um, it's
319 00:17:22.995 --> 00:17:24.795 what we see as well with the customers.
320 00:17:25.055 --> 00:17:28.195 And in particular it's very useful if you're looking
321 00:17:28.215 --> 00:17:30.635 for specific names or specific brands.
322 00:17:30.705 --> 00:17:33.675 Like you're looking for a specific library name, you know,
323 00:17:33.675 --> 00:17:35.995 maybe you have another one that is similar, uh,
324 00:17:36.065 --> 00:17:38.755 with keyword search, then you will actually find this one
325 00:17:38.755 --> 00:17:40.555 and not, you know, another one that would be similar.
326 00:17:40.615 --> 00:17:42.755 That's usually a good way.
327 00:17:43.095 --> 00:17:45.155 Uh, we have some questions as well,
328 00:17:45.615 --> 00:17:48.355 so I'm gonna ask them is like, first one is, uh,
329 00:17:48.375 --> 00:17:51.365 how do you manage switching between embeddings
330 00:17:51.465 --> 00:17:53.845 to string or string to embeddings on request?
331 00:17:58.765 --> 00:18:02.665 So I'm assuming it's, uh, I'm gonna try to rephrase it
332 00:18:02.665 --> 00:18:04.145 and the person let me know if it's correct.
333 00:18:04.805 --> 00:18:07.665 But how do you go from like having the query, you know,
334 00:18:07.665 --> 00:18:09.465 that you search to then having the embeddings
335 00:18:10.695 --> 00:18:12.625 That's, uh, when it comes
336 00:18:12.625 --> 00:18:14.465 to full text search, at least Milvus does that
337 00:18:14.655 --> 00:18:18.745 because, uh, on my end at least you just import it,
338 00:18:18.805 --> 00:18:21.905 you utilize it and that's it on the backend.
339 00:18:22.365 --> 00:18:23.945 I'm pretty sure you can go a bit more in
340 00:18:23.945 --> 00:18:25.025 detail when it comes to that.
341 00:18:25.685 --> 00:18:29.625 Yes. So basically, yeah, we, we have our own, uh,
342 00:18:29.815 --> 00:18:30.865 analyzers, um,
343 00:18:31.005 --> 00:18:33.985 and different functions, uh, where you have your,
344 00:18:33.985 --> 00:18:35.665 you're gonna pass the input query, uh,
345 00:18:35.665 --> 00:18:36.905 then it's gonna be transformed.
346 00:18:36.925 --> 00:18:38.465 So removing, you know,
347 00:18:38.465 --> 00:18:40.145 maybe you're gonna put everything in
348 00:18:40.145 --> 00:18:41.385 lower case or something.
349 00:18:41.385 --> 00:18:43.305 And then, uh, we have the tokenizer
350 00:18:43.325 --> 00:18:44.865 and analyzer, uh, that is running.
351 00:18:45.565 --> 00:18:48.785 So then you don't have to think about the embeddings.
352 00:18:48.785 --> 00:18:50.865 That's also why we released full text search
353 00:18:51.485 --> 00:18:54.865 and that way you write text as an input, you get text
354 00:18:55.085 --> 00:18:57.745 and in between, as you said, uh, it's Milvus
355 00:18:57.775 --> 00:18:59.105 that is gonna handle everything here.
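The text-in, text-out flow described here (lowercasing, tokenizing, then producing a sparse representation) can be illustrated with a toy analyzer. This is not Milvus' analyzer; the regular expression, the stop word list, and the function names are assumptions chosen to keep the sketch short.

```python
import re
from collections import Counter

# A tiny, illustrative stop word list; real analyzers ship larger ones.
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "in", "is"}


def analyze(text):
    """Lowercase the text, split it into word tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9_]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def sparse_term_counts(text):
    """Map each remaining term to its count: a simple sparse vector."""
    return dict(Counter(analyze(text)))


print(analyze("How to use the Weather Agent in PydanticAI?"))
print(sparse_term_counts("Agent agent tool"))
```

Because this step runs inside the database for both stored text and queries, the application only ever handles plain strings.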
356 00:19:00.365 --> 00:19:05.345 Uh, there is another question, um,
357 00:19:06.755 --> 00:19:10.585 which is, is it an app strategy for RAG index strategy?
358 00:19:11.285 --> 00:19:14.695 So maybe do you have like, I'm gonna try
359 00:19:14.695 --> 00:19:16.335 to rephrase this one as well because I'm not sure.
360 00:19:17.035 --> 00:19:19.495 Uh, do you, did you try different, um,
361 00:19:19.785 --> 00:19:21.615 index strategy maybe for your RAG?
362 00:19:22.835 --> 00:19:25.555 Hmm. Um, in the past?
363 00:19:25.625 --> 00:19:28.875 Yeah, in the couple of, in the last couple of months.
364 00:19:29.015 --> 00:19:31.995 Not really because I found my bread and butter. Mm-hmm.
365 00:19:32.075 --> 00:19:35.195 Which was just using OpenAI's, um, embeddings
366 00:19:35.215 --> 00:19:38.075 and then, uh, using a reranker, which was either
367 00:19:39.245 --> 00:19:40.895 from Hugging Face or somewhere else.
368 00:19:41.115 --> 00:19:42.335 But this was my bread and butter
369 00:19:42.335 --> 00:19:45.615 because up until recently I had never had the need for
370 00:19:46.445 --> 00:19:50.335 such a complex solution to RAG
371 00:19:50.605 --> 00:19:54.135 because uh, I didn't have the need to go through
372 00:19:55.245 --> 00:19:57.685 hundreds of, uh, websites and query them at the same time.
373 00:19:57.685 --> 00:20:01.645 Mm-hmm. So I hope this was, uh, a good enough answer.
374 00:20:02.795 --> 00:20:05.605 Yeah. And I guess we kind of replied to it,
375 00:20:05.625 --> 00:20:07.005 but just gonna ask it again.
376 00:20:07.025 --> 00:20:08.765 So did you try to combine vector
377 00:20:08.765 --> 00:20:09.845 with sparse embedding models?
378 00:20:09.985 --> 00:20:14.325 So like BGE-M3 that we have in Milvus, um, so did you try
379 00:20:15.305 --> 00:20:16.305 sparse vectors?
380 00:20:16.605 --> 00:20:18.865 Um, just to, to understand, you know, the advantages
381 00:20:18.865 --> 00:20:21.585 of using full text instead of sparse embeddings.
382 00:20:22.595 --> 00:20:26.415 So in, uh, when I did, when I started working
383 00:20:27.155 --> 00:20:29.415 on this tool, I have tried, uh,
384 00:20:30.165 --> 00:20:32.255 many different combinations of many different things.
385 00:20:32.795 --> 00:20:36.055 But uh, at the end, I don't know if it was just
386 00:20:37.425 --> 00:20:38.985 convenience plus performance,
387 00:20:39.205 --> 00:20:41.225 but I kept going back to this vector
388 00:20:41.285 --> 00:20:43.545 and the full text, uh, solution.
389 00:20:43.965 --> 00:20:47.845 It just seemed to give the better answers all the time.
390 00:20:48.285 --> 00:20:51.125 'cause with the other like combinations,
391 00:20:52.045 --> 00:20:55.725 I would also just get some very good response
392 00:20:55.825 --> 00:20:59.525 and then a completely out of the ball out of the park,
393 00:20:59.715 --> 00:21:01.205 like query response.
394 00:21:01.995 --> 00:21:06.485 This was the perfect balance to whatever I am working on.
395 00:21:06.485 --> 00:21:10.285 Because at the end of the day, uh, what I envision as a user
396 00:21:10.345 --> 00:21:12.765 for this is someone who also has an understanding of
397 00:21:12.765 --> 00:21:15.325 what they're looking at and what they're going through.
398 00:21:16.145 --> 00:21:20.105 So you have to be a bit more specific maybe.
399 00:21:20.365 --> 00:21:24.295 Uh, and, but it usually has produced better
400 00:21:24.355 --> 00:21:26.895 and more reliable uh, results.
401 00:21:27.135 --> 00:21:30.495 I have been using this for a couple of, mm-hmm.
402 00:21:30.975 --> 00:21:32.135 A couple of days minimum.
403 00:21:33.115 --> 00:21:36.645 Cool. And also to add to to your response, so one
404 00:21:36.645 --> 00:21:38.485 of the advantages of using full text instead of sparse
405 00:21:38.485 --> 00:21:40.925 embeddings, um, when you use sparse
406 00:21:40.925 --> 00:21:43.165 embeddings, you have to compute them yourself usually.
407 00:21:43.785 --> 00:21:45.765 Uh, which is, which can be tricky.
408 00:21:45.905 --> 00:21:48.245 You know, you have to update the statistics yourself.
409 00:21:48.465 --> 00:21:51.445 Um, whereas with full text search, basically we take care of
410 00:21:51.445 --> 00:21:53.685 that for you so you don't have to have another pipeline.
411 00:21:53.685 --> 00:21:56.645 You know, there is um, doing that as well.
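To illustrate why self-managed sparse embeddings need their statistics kept up to date, here is a minimal sketch of a BM25-style inverse document frequency (IDF) calculation. The `idf` helper and the toy corpus are hypothetical and only for illustration; this is not Milvus code. It shows that the weight of a term shifts as soon as new documents arrive, which is the bookkeeping the built-in full text search is described as handling for you.

```python
import math
from typing import List


def idf(term: str, docs: List[List[str]]) -> float:
    """BM25-style inverse document frequency of `term` over tokenized docs."""
    n = len(docs)                               # total number of documents
    df = sum(1 for d in docs if term in d)      # documents containing the term
    return math.log((n - df + 0.5) / (df + 0.5) + 1.0)


# Hypothetical toy corpus of already-tokenized documents.
corpus = [["milvus", "search"], ["vector", "search"]]
before = idf("milvus", corpus)

# New documents arrive: the corpus statistics change, so every weight that
# depends on them is stale until it is recomputed.
corpus.append(["milvus", "index"])
corpus.append(["milvus", "bm25"])
after = idf("milvus", corpus)

print(f"IDF before: {before:.3f}, after: {after:.3f}")
```

In this toy case the term becomes more common, so its IDF drops; any sparse vectors computed with the old value would no longer match the corpus.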
412 00:21:57.625 --> 00:22:01.925 Uh, there is one which is quite long. Uh, yeah, I'm reading
413 00:22:02.235 --> 00:22:03.235 This Like Currently,
414 00:22:03.235 --> 00:22:05.925 Currently building something utilizing
415 00:22:05.925 --> 00:22:08.285 Crawl4AI and Zilliz, basically monitoring financial news
416 00:22:08.285 --> 00:22:09.845 and adding RAG capabilities on top.
417 00:22:10.045 --> 00:22:14.045 I had two issues I'm facing. One, there's a lot
418 00:22:14.045 --> 00:22:15.365 of articles that are very different in
419 00:22:15.365 --> 00:22:16.845 length, some are extremely long.
420 00:22:17.505 --> 00:22:19.285 So what should be my approach to chunking?
421 00:22:20.185 --> 00:22:22.885 And the other one is that they're in different languages.
422 00:22:23.225 --> 00:22:24.965 How would you embed multilingual documents?
423 00:22:24.985 --> 00:22:26.565 How would you embed it in the same space
424 00:22:26.865 --> 00:22:28.245 or separate for each language?
425 00:22:28.545 --> 00:22:30.245 Would you use multilingual embedding models?
426 00:22:31.985 --> 00:22:33.205 Uh, let's go for those two
427 00:22:33.505 --> 00:22:35.045 and then we'll have the follow up questions after.
428 00:22:35.945 --> 00:22:40.075 Okay. So, um, first of all, pretty interesting project.
429 00:22:40.335 --> 00:22:43.755 I'm not even gonna lie, this has some very good, uh,
430 00:22:43.755 --> 00:22:46.635 especially if you, you're into stocks and things like that.
431 00:22:47.175 --> 00:22:49.515 So about uh, the chunking thing, this is
432 00:22:49.535 --> 00:22:51.155 how I personally would go about it.
433 00:22:51.155 --> 00:22:52.675 And this is what I kind
434 00:22:52.675 --> 00:22:54.075 of have done in this project as well.
435 00:22:54.665 --> 00:22:59.355 Instead of, so I keep track of uh, chunks in two ways.
436 00:22:59.615 --> 00:23:02.955 The ID of the chunk, like every chunk has its own id
437 00:23:03.255 --> 00:23:05.395 and then the actual chunk number
438 00:23:05.535 --> 00:23:09.395 of the specific single website that you're reading, uh,
439 00:23:09.395 --> 00:23:12.635 that you're crawling, that's how I would do it.
440 00:23:12.695 --> 00:23:16.115 And then the very last chunk of that specific website,
441 00:23:16.115 --> 00:23:17.835 that's gonna be a bit shorter,
442 00:23:18.575 --> 00:23:20.515 but uh, I really don't think that
443 00:23:20.515 --> 00:23:21.795 that's that big of a problem.
444 00:23:22.395 --> 00:23:25.915 'cause usually in, especially in financial articles
445 00:23:25.915 --> 00:23:28.515 and things like that, the meat is in between.
446 00:23:29.215 --> 00:23:31.395 So that's in my opinion.
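The two-level bookkeeping described above (a unique ID per chunk plus the chunk's position within its page) could be sketched as follows. This is a minimal sketch under stated assumptions, not the speaker's actual code: the `Chunk` class, the `chunk_page` function, the fixed character-based `chunk_size`, and the use of UUIDs are all hypothetical choices made for illustration.

```python
import uuid
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    chunk_id: str      # unique across the whole collection
    source_url: str    # page the chunk was crawled from
    chunk_number: int  # position of the chunk within that page
    text: str


def chunk_page(source_url: str, text: str, chunk_size: int = 1000) -> List[Chunk]:
    """Split one crawled page into fixed-size chunks.

    The final chunk of a page is simply whatever text is left over,
    so it may be shorter than `chunk_size`.
    """
    chunks = []
    for number, start in enumerate(range(0, len(text), chunk_size)):
        chunks.append(
            Chunk(
                chunk_id=str(uuid.uuid4()),
                source_url=source_url,
                chunk_number=number,
                text=text[start:start + chunk_size],
            )
        )
    return chunks


# Hypothetical usage: a 2,500-character page yields chunks 0, 1, 2,
# with the last one holding the remaining 500 characters.
page_chunks = chunk_page("https://example.com/article", "x" * 2500)
```

Keeping `chunk_number` alongside the ID makes it possible to fetch the neighbours of a retrieved chunk from the same page when more context is needed.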
447 00:23:32.215 --> 00:23:35.515 And then about the language part, I have experimented
448 00:23:35.545 --> 00:23:38.515 with uh, different embedding models for
449 00:23:39.465 --> 00:23:40.795 different languages.
450 00:23:42.635 --> 00:23:44.515 I still am sad
451 00:23:45.105 --> 00:23:48.405 or whatever to say that uh, the best way to go about this is
452 00:23:48.405 --> 00:23:50.085 to translate them to English
453 00:23:50.515 --> 00:23:55.435 because it'll never perform the same if you use
454 00:23:55.475 --> 00:23:58.355 a multimodal, uh, multi-language model,
455 00:23:58.355 --> 00:24:01.395 like embedding model, it has the drawback
456 00:24:01.395 --> 00:24:03.995 that it's trained on more, uh,
457 00:24:04.335 --> 00:24:06.355 on different languages, multi-language.
458 00:24:06.815 --> 00:24:09.955 So I personally would just translate whatever article into
459 00:24:09.955 --> 00:24:12.635 English and then resume with the chunking logic
460 00:24:12.815 --> 00:24:13.995 and then the ID logic.
461 00:24:14.375 --> 00:24:16.995 And then for the very last part, I really,
462 00:24:17.165 --> 00:24:19.395 especially when it comes to financial, uh,
463 00:24:23.915 --> 00:24:27.675 I really personally would use, uh, BM25
464 00:24:27.695 --> 00:24:31.395 and ANN, like, mm-hmm.
465 00:24:31.945 --> 00:24:35.995 Yeah. But I, that's how I would do it.
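The talk does not say how the BM25 and ANN result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below in plain Python as an assumption rather than as the speaker's method. The function name `rrf_fuse`, the document IDs, and the constant `k = 60` are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs with reciprocal rank fusion.

    Each document scores 1 / (k + rank) for every list it appears in
    (rank starts at 1); documents are returned by descending total score,
    with ties broken alphabetically by ID.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists: one from keyword (BM25) search, one from ANN search.
bm25_hits = ["a", "b", "c"]
ann_hits = ["b", "d", "a"]
fused = rrf_fuse([bm25_hits, ann_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, while a document found by only one retriever is still kept in the fused ranking.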
466 00:24:37.395 --> 00:24:40.085 Cool, thank you. And yeah, Erbli,
467 00:24:40.165 --> 00:24:42.125 I think you already mentioned it's a bit like the
468 00:24:42.835 --> 00:24:45.085 performance for hybrid or semantic search.
469 00:24:45.265 --> 00:24:47.245 You said basically
470 00:24:47.305 --> 00:24:48.525 by using full text search you
471 00:24:48.525 --> 00:24:49.565 have better performance, right?
472 00:24:49.565 --> 00:24:51.085 The first question you have in the q and a.
473 00:24:51.755 --> 00:24:53.885 Yeah, and it,
474 00:24:54.105 --> 00:24:58.415 and in my experience it really has, I have yet
475 00:24:58.415 --> 00:25:00.695 to find a single drawback, like something
476 00:25:00.695 --> 00:25:02.935 that really makes me question the whole solution.
477 00:25:04.105 --> 00:25:05.575 There might have been caveats,
478 00:25:05.575 --> 00:25:06.695 but I don't even remember them.
479 00:25:07.075 --> 00:25:09.735 So it's worked out pretty well.
480 00:25:10.245 --> 00:25:12.935 Cool. Um, just gonna wait one
481 00:25:12.935 --> 00:25:16.735 or two minutes to know if we have more questions from the
482 00:25:16.735 --> 00:25:20.125 people otherwise, uh, yeah,
483 00:25:20.125 --> 00:25:21.965 maybe they can add you on LinkedIn
484 00:25:22.025 --> 00:25:23.445 or somewhere where can people
485 00:25:23.625 --> 00:25:25.085 follow up if they have questions?
486 00:25:25.945 --> 00:25:28.075 Okay. They can add you on LinkedIn. Uh,
487 00:25:28.705 --> 00:25:31.755 Yeah, pretty sure LinkedIn if you have it,
488 00:25:32.055 --> 00:25:35.305 or either my LinkedIn or my GitHub.
489 00:25:35.565 --> 00:25:37.275 Uh, yeah,
490 00:25:39.875 --> 00:25:44.555 Just waiting quickly and then we'll see.
491 00:25:44.705 --> 00:25:48.165 Otherwise we know wait one at 30
492 00:25:49.065 --> 00:25:52.045 if we don't have questions we can uh, close this one.
493 00:25:53.185 --> 00:25:54.565 But yeah, that was very interesting.
494 00:25:54.685 --> 00:25:57.645 I mean the use case is also interesting as a
495 00:25:58.975 --> 00:26:00.905 more than a toy project as well.
496 00:26:00.905 --> 00:26:03.825 Mm-hmm. So this is, uh, this is quite cool.
497 00:26:04.575 --> 00:26:07.025 It's being used on the daily, uh, it,
498 00:26:07.365 --> 00:26:09.985 of course it has room for improvement as I mentioned.
499 00:26:10.015 --> 00:26:13.225 Like preferably I would like to implement some logic
500 00:26:13.225 --> 00:26:16.985 that differentiates different, uh, site maps, so
501 00:26:17.045 --> 00:26:19.945 of different projects or libraries.
502 00:26:20.765 --> 00:26:24.665 But I have yet to work on something that complex. Mm-hmm.
503 00:26:24.745 --> 00:26:27.145 So I didn't need to refer to multiple
504 00:26:27.765 --> 00:26:28.865 things at the same time.
505 00:26:28.975 --> 00:26:30.065 Like it's been fine.
506 00:26:31.025 --> 00:26:33.115 Okay. Okay. It seems like, yeah,
507 00:26:33.115 --> 00:26:34.635 there's no more questions.
508 00:26:35.065 --> 00:26:36.715 Well, thank you very much for the presentation.
509 00:26:37.095 --> 00:26:39.955 Uh, thank you very everyone, everyone, sorry for attending.
510 00:26:40.415 --> 00:26:42.595 Uh, we'll follow up with a recording
511 00:26:42.985 --> 00:26:44.605 so you will see it in a couple of days
512 00:26:45.465 --> 00:26:47.965 and I will see you soon on my end.
513 00:26:48.365 --> 00:26:50.245 I actually see you next week for the other webinar,
514 00:26:50.335 --> 00:26:53.685 which will be with Feast on how you can do real-time RAG.
515 00:26:54.225 --> 00:26:56.685 So thank you very much. Have a lovely morning, afternoon,
516 00:26:56.685 --> 00:26:57.885 or evening, wherever you are in the world.
517 00:26:58.705 --> 00:26:59.845 And goodbye.
Meet the Speaker
Join the session for live Q&A with the speaker
Erbli Kuka
Data / AI Engineer at datamax.ai
Erbli Kuka is a Data / AI Engineer at datamax.ai specializing in ML/AI development and integration. He has gained deep insights into the Transformers architecture, attention mechanisms, and other critical areas of machine learning. Passionate about ML, Erbli is committed to advancing his career and ultimately leading teams to create innovative solutions in cloud development and AI.