Vector Databases for Enhanced Classification
Webinar

What will you learn?
In this webinar, we dive into the use of Milvus as a high-performance vector database tailored for handling large-scale document collections, focusing on European Commission and Parliament acts. Our approach shifts from traditional RAG-based classification to a hybrid search method, leveraging K-Nearest Neighbor (KNN) for pinpointing top documents relevant to classification tasks. This session is ideal for those aiming to refine classification accuracy by leveraging vector-based indexing and hybrid retrieval in vast datasets.
Topics covered:
- KNN and Sparse Search Integration: How KNN retrieval combined with sparse search helps extract top documents aligned with classification needs.
- Versatile Embeddings for Multilingual and Multi-Domain Applications: The BGE M3-Embedding model is designed to provide robust, high-quality embeddings across multiple languages and domains, making it adaptable for diverse tasks in multilingual and cross-functional environments.
- Real-World Application: Step-by-step demonstration using European legislative acts to showcase KNN-driven retrieval and classification workflows.
1 00:00:03.875 --> 00:00:06.455 So today I'm pleased to introduce
2 00:00:06.675 --> 00:00:09.295 the session "Vector Databases for Enhanced Classification"
3 00:00:09.515 --> 00:00:11.975 and our guest speaker, Alessandro Koya.
4 00:00:12.515 --> 00:00:14.215 He will talk about vector databases
5 00:00:14.395 --> 00:00:15.575 for enhanced classification
6 00:00:15.675 --> 00:00:16.815 and how they use Milvus
7 00:00:16.815 --> 00:00:18.735 for handling large-scale document collections.
8 00:00:19.915 --> 00:00:22.015 Alessandro is a data-driven marketing
9 00:00:22.015 --> 00:00:23.455 and AI solutions expert.
10 00:00:23.795 --> 00:00:26.655 He has extensive experience in international companies like
11 00:00:26.655 --> 00:00:28.575 Nielsen Media and Vodafone.
12 00:00:28.845 --> 00:00:30.375 He's also leading data science
13 00:00:30.375 --> 00:00:32.375 and product teams focused on applying machine
14 00:00:32.575 --> 00:00:33.695 learning to business challenges.
15 00:00:34.365 --> 00:00:35.735 He's also teaching AI
16 00:00:35.795 --> 00:00:39.735 and data-driven marketing at IULM
17 00:00:39.735 --> 00:00:41.375 University in Milan.
18 00:00:42.495 --> 00:00:44.865 Welcome, Alessandro. The stage is yours now.
19 00:00:45.275 --> 00:00:47.545 Thank you, Stephen.
20 00:00:47.545 --> 00:00:48.665 Thanks for the introduction.
21 00:00:48.665 --> 00:00:49.745 That was perfect.
22 00:00:50.605 --> 00:00:52.705 And sorry, everyone, for my throaty voice,
23 00:00:52.725 --> 00:00:54.225 but I'm getting sick
24 00:00:54.245 --> 00:00:56.745 right in this moment.
25 00:00:57.375 --> 00:01:02.105 Okay, let's get to the presentation. One sec.
26 00:01:17.175 --> 00:01:17.525 Sorry.
27 00:01:20.645 --> 00:01:22.455 Hmm, sorry.
28 00:01:22.555 --> 00:01:24.815 One second. Chrome tab.
29 00:01:27.695 --> 00:01:28.695 There we go.
30 00:01:31.555 --> 00:01:32.775 We can go.
31 00:01:32.775 --> 00:01:33.775 We see it. Thank you.
32 00:01:33.775 --> 00:01:37.905 Okay.
33 00:01:38.525 --> 00:01:40.265 So, well, about me,
34 00:01:40.425 --> 00:01:42.745 I just got introduced, thank you, Stephen.
35 00:01:43.525 --> 00:01:47.905 About my current focus:
36 00:01:48.355 --> 00:01:52.745 we've been jump-starting a startup in the last
37 00:01:52.805 --> 00:01:54.825 year,
38 00:01:55.495 --> 00:01:59.225 one that focuses on enhancing policy understanding
39 00:01:59.845 --> 00:02:01.305 and driving
40 00:02:01.625 --> 00:02:03.665 informed decision-making in the public sector.
41 00:02:04.245 --> 00:02:06.385 So we have an attorney who works with us
42 00:02:06.445 --> 00:02:09.305 and is a deep expert in everything
43 00:02:09.305 --> 00:02:11.745 related to public policy and lawmaking.
44 00:02:11.925 --> 00:02:13.545 So this is our current focus,
45 00:02:14.045 --> 00:02:15.945 but we are actually creating products
46 00:02:16.085 --> 00:02:19.425 and algorithms that can be used in a more generic way.
47 00:02:20.445 --> 00:02:23.225 In particular, our focus
48 00:02:23.405 --> 00:02:26.905 and our background are in knowledge solutions.
49 00:02:27.365 --> 00:02:30.385 So at the moment we have a few products out.
50 00:02:30.685 --> 00:02:34.905 One is Policy Manager, which aggregates news
51 00:02:34.905 --> 00:02:37.345 according to topics and policy areas,
52 00:02:37.605 --> 00:02:41.225 and that is why we need classification. Stream Scope
53 00:02:41.335 --> 00:02:45.225 is a real-time broadcast analysis tool
54 00:02:45.575 --> 00:02:49.185 that can be used to extract knowledge in real time from,
55 00:02:49.445 --> 00:02:51.265 for example, parliamentary sessions,
56 00:02:51.605 --> 00:02:53.345 or, as recently,
57 00:02:53.485 --> 00:02:56.865 the commissioner-designate hearings at the EU Commission.
58 00:02:57.445 --> 00:02:58.705 And we focus on policy.
59 00:02:58.885 --> 00:03:02.805 So our founding team, as I said, includes an attorney
60 00:03:03.305 --> 00:03:04.525 and
61 00:03:05.025 --> 00:03:06.165 a couple of people
62 00:03:06.195 --> 00:03:09.045 who are very expert in quantitative analysis.
63 00:03:09.195 --> 00:03:12.565 Because of this, we try to create AI agents
64 00:03:12.835 --> 00:03:15.485 that can give our clients
65 00:03:15.865 --> 00:03:20.565 and customers an advantage in understanding
66 00:03:20.565 --> 00:03:22.325 what is going on in their sector.
67 00:03:24.085 --> 00:03:27.345 So I'll give you a few examples of
68 00:03:27.345 --> 00:03:28.545 what we do,
69 00:03:28.545 --> 00:03:32.025 because I want you to understand why we needed a
70 00:03:32.225 --> 00:03:36.105 very fine-grained way of
71 00:03:36.425 --> 00:03:38.825 classifying documents in this case.
72 00:03:38.965 --> 00:03:43.825 This product, which is live, is used to
73 00:03:44.165 --> 00:03:46.865 transcribe, diarize,
74 00:03:47.205 --> 00:03:49.705 and classify live streams,
75 00:03:49.965 --> 00:03:54.585 classifying them according to the EU policy areas
76 00:03:55.015 --> 00:03:56.065 that are given.
77 00:03:56.295 --> 00:04:00.545 There are quite a lot of them, and they can overlap somehow.
78 00:04:01.005 --> 00:04:05.945 So we wanted to give our clients
79 00:04:06.745 --> 00:04:09.545 a system that does not hallucinate
80 00:04:09.565 --> 00:04:14.305 and that has a very laser-like
81 00:04:14.965 --> 00:04:16.185 way of classification.
82 00:04:16.885 --> 00:04:20.265 Maybe at the end of the webinar,
83 00:04:20.625 --> 00:04:23.625 I will also show you this interface live.
84 00:04:24.775 --> 00:04:28.105 Another thing we do is classic RAG-style projects.
85 00:04:28.485 --> 00:04:32.825 And while we're talking about Milvus:
86 00:04:32.825 --> 00:04:36.025 in all our products we use Milvus, also
87 00:04:36.295 --> 00:04:39.545 when there is a RAG involved, considering that
88 00:04:39.605 --> 00:04:43.425 we are striving for grounding, so giving trustworthy
89 00:04:43.725 --> 00:04:46.025 responses; multimodality,
90 00:04:46.285 --> 00:04:49.825 so of course we ingest documents of all kinds;
91 00:04:50.245 --> 00:04:51.865 and explainability.
92 00:04:52.165 --> 00:04:56.105 So, for example, in this UI, you see
93 00:04:56.105 --> 00:04:58.945 that in our solutions
94 00:04:59.405 --> 00:05:03.985 we are actually showcasing the thought steps
95 00:05:04.005 --> 00:05:05.665 in a chain-of-thought fashion.
96 00:05:06.205 --> 00:05:11.105 So we always want to give our users visibility into
97 00:05:11.245 --> 00:05:14.145 what kind of data
98 00:05:14.245 --> 00:05:16.305 the AI agents have used
99 00:05:16.485 --> 00:05:19.345 to formulate their final answer, which is here.
100 00:05:20.045 --> 00:05:23.025 And because of that, of course, we can handle text,
101 00:05:23.575 --> 00:05:27.545 PDFs, videos, as you just saw, and audio files as well.
102 00:05:29.485 --> 00:05:32.345 All of this,
103 00:05:33.445 --> 00:05:34.945 for managing the knowledge:
104 00:05:35.405 --> 00:05:37.865 we use Django because
105 00:05:37.865 --> 00:05:41.025 it's a very robust system, it's made in Python,
106 00:05:41.165 --> 00:05:44.385 and we all have a Python background,
107 00:05:45.405 --> 00:05:48.785 and we use a parallel
108 00:05:49.705 --> 00:05:51.185 Milvus database.
109 00:05:51.525 --> 00:05:53.185 So we have hooks so that
110 00:05:53.585 --> 00:05:57.685 whenever a new document is added to our Django backend,
111 00:05:58.155 --> 00:06:01.045 it's also added to our Milvus backend,
112 00:06:01.045 --> 00:06:04.805 because we want to use Milvus for its search capabilities
113 00:06:04.865 --> 00:06:06.205 as a vector database,
114 00:06:06.705 --> 00:06:11.085 while we use Django
115 00:06:11.545 --> 00:06:14.485 for keeping all the relational data.
116 00:06:15.105 --> 00:06:17.925 So all the products
117 00:06:18.075 --> 00:06:21.925 that we are building have this common characteristic of
118 00:06:22.105 --> 00:06:25.525 having a backend with Django that is somehow
119 00:06:26.645 --> 00:06:27.765 mirrored in Milvus.
120 00:06:28.745 --> 00:06:31.685 And in fact, this is something that at the beginning,
121 00:06:31.685 --> 00:06:34.485 when we were starting out, I was trying
122 00:06:34.585 --> 00:06:36.365 to understand if it had already been made.
123 00:06:36.945 --> 00:06:41.925 It had not, but using the Django primary
124 00:06:42.265 --> 00:06:46.325 key as the Milvus primary key became really natural,
125 00:06:46.585 --> 00:06:49.205 and the two systems work really well together.
126 00:06:49.425 --> 00:06:52.445 We just created one wrapper class,
127 00:06:52.945 --> 00:06:56.845 and then everything is basically working
128 00:06:57.525 --> 00:06:59.045 seamlessly and transparently.
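The "one wrapper class" idea described here can be sketched as follows. This is a minimal, runnable illustration, not the speaker's code: every save writes the relational row and mirrors the record into the vector store under the same primary key. The real system pairs the Django ORM with a Milvus collection; here both backends are dict stand-ins, and `embed()` is a hypothetical placeholder for the embedding step.

```python
# Stand-in sketch of the Django/Milvus mirroring pattern: one wrapper class
# keeps both backends in sync, keyed by the same primary key.

def embed(text: str) -> list[float]:
    """Placeholder embedding; the production system stores real vectors."""
    return [float(len(text))]

class DocumentStore:
    """Keeps a relational store and a vector store in sync by primary key."""

    def __init__(self):
        self.relational = {}   # stand-in for the Django backend
        self.vectors = {}      # stand-in for the Milvus collection

    def save(self, pk: int, text: str, **fields) -> None:
        self.relational[pk] = {"text": text, **fields}   # relational row
        self.vectors[pk] = embed(text)                   # mirrored vector

    def delete(self, pk: int) -> None:
        # deletes propagate too, so the two backends never drift apart
        self.relational.pop(pk, None)
        self.vectors.pop(pk, None)

store = DocumentStore()
store.save(42, "Regulation on methane emissions", policy_area="climate")
```

In the real system the hook would live in a Django save signal, but the invariant is the same: both stores always contain exactly the same primary keys.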
129 00:07:00.625 --> 00:07:05.095 Other technologies we use are Guardrails AI and Neo4j,
130 00:07:05.115 --> 00:07:08.975 because we use quite a lot of knowledge graphs,
131 00:07:09.435 --> 00:07:12.495 and also MediaWiki.
132 00:07:12.595 --> 00:07:14.935 I'm a Wikimedia member,
133 00:07:15.555 --> 00:07:19.855 and so we think that somehow it's a good way
134 00:07:19.915 --> 00:07:23.895 to use it in a semantic way.
135 00:07:25.135 --> 00:07:28.305 Okay, let's get to the main point.
136 00:07:29.325 --> 00:07:31.905 So we have a technical challenge in
137 00:07:31.905 --> 00:07:33.185 everything that you've seen.
138 00:07:33.845 --> 00:07:35.745 We need to evolve
139 00:07:36.455 --> 00:07:39.945 from basic, keyword-based classification, for example,
140 00:07:40.445 --> 00:07:45.225 to semantic understanding, in domains that are very
141 00:07:45.275 --> 00:07:48.825 jargon-heavy, jargon-rich, that are also evolving,
142 00:07:49.325 --> 00:07:51.425 and that are really deep and vertical.
143 00:07:51.925 --> 00:07:56.185 If you think about policy, each month
144 00:07:56.275 --> 00:07:58.585 there are new words
145 00:07:58.855 --> 00:08:01.465 that can be associated with a classification.
146 00:08:02.005 --> 00:08:05.185 And also,
147 00:08:06.545 --> 00:08:10.765 if you think about it in a vector space: if you take,
148 00:08:10.905 --> 00:08:12.685 for example, generic
149 00:08:12.905 --> 00:08:16.045 or general embedding models,
150 00:08:17.105 --> 00:08:21.525 two fields
151 00:08:21.625 --> 00:08:25.125 of policy
152 00:08:25.655 --> 00:08:28.725 areas are very close in vector space.
153 00:08:29.745 --> 00:08:33.045 And we have quite a lot of policy areas.
154 00:08:33.665 --> 00:08:36.285 I will show you in one
155 00:08:36.285 --> 00:08:38.725 of the next slides that there are thirty-two
156 00:08:39.365 --> 00:08:42.765 official policy areas in the European
157 00:08:43.275 --> 00:08:46.005 legislative system, so you can see
158 00:08:46.005 --> 00:08:49.165 that we cannot rely on any zero-shot
159 00:08:49.545 --> 00:08:51.125 classification, for example.
160 00:08:51.585 --> 00:08:55.685 So we had to engineer a better solution.
161 00:08:59.195 --> 00:09:00.535 So there are some
162 00:09:00.965 --> 00:09:04.495 good things about traditional document classification
163 00:09:04.605 --> 00:09:05.815 that we have inherited.
164 00:09:06.405 --> 00:09:10.655 Just to give a historical background, remember that
165 00:09:10.715 --> 00:09:14.415 before neural classification,
166 00:09:14.985 --> 00:09:18.735 there was pre-processing, vectorization with bag-of-words
167 00:09:19.395 --> 00:09:22.975 or TF-IDF (that is the sparse kind of search),
168 00:09:23.585 --> 00:09:25.175 named entity recognition
169 00:09:25.315 --> 00:09:27.255 (so you check
170 00:09:27.585 --> 00:09:31.175 which are the corporate names, which are the top names
171 00:09:31.395 --> 00:09:36.055 or person names, et cetera), and some manual feature engineering.
172 00:09:36.675 --> 00:09:40.215 And on these features some kind of classifier
173 00:09:40.795 --> 00:09:42.975 has always been built,
174 00:09:42.975 --> 00:09:45.975 which can be support vector machines, random forests,
175 00:09:46.515 --> 00:09:48.095 or other kinds of classifiers.
176 00:09:51.555 --> 00:09:56.005 Jumping to the last few years, and
177 00:09:56.105 --> 00:10:00.445 leaving aside, for example, all the LSTM-based
178 00:10:00.905 --> 00:10:04.805 ways of classifying documents,
179 00:10:05.585 --> 00:10:07.605 we can see a few approaches.
180 00:10:07.785 --> 00:10:09.445 So if I have an unknown document
181 00:10:09.825 --> 00:10:12.765 and I have a corpus that is made
182 00:10:12.985 --> 00:10:17.725 of, let's say, 32 categories, what are my choices
183 00:10:18.105 --> 00:10:21.565 to classify this unknown document into one
184 00:10:22.025 --> 00:10:25.125 category, the most important one,
185 00:10:25.185 --> 00:10:28.485 or multiple categories, if there is an overlap,
186 00:10:28.585 --> 00:10:32.685 but also with a distance score from these categories?
187 00:10:33.485 --> 00:10:37.425 So one option is to use embedding-based classification:
188 00:10:37.805 --> 00:10:41.105 converting the text to vectors
189 00:10:41.275 --> 00:10:45.225 using pre-trained models, calculating the
190 00:10:46.375 --> 00:10:50.255 category centroids for the training corpus,
191 00:10:50.675 --> 00:10:54.215 and then, for each new document, seeing which
192 00:10:54.315 --> 00:10:56.695 of these centroids it is closest to.
193 00:10:57.315 --> 00:11:00.055 But this means
194 00:11:00.355 --> 00:11:03.415 using a generic pre-trained
195 00:11:03.525 --> 00:11:05.455 embedding model from OpenAI,
196 00:11:05.515 --> 00:11:07.935 or open-source embeddings, for example,
197 00:11:08.245 --> 00:11:10.735 also the base ones.
198 00:11:11.755 --> 00:11:14.255 And we found out experimentally...
199 00:11:14.715 --> 00:11:17.015 So, I also wanted to give a warning:
200 00:11:17.555 --> 00:11:20.015 everything we've done
201 00:11:20.285 --> 00:11:21.615 is really experimental,
202 00:11:22.195 --> 00:11:25.295 and it's simply what
203 00:11:25.565 --> 00:11:30.215 gave us the best results for our domain.
204 00:11:30.585 --> 00:11:34.575 Maybe on other domains it will work better or worse.
205 00:11:34.955 --> 00:11:37.655 So, in our case,
206 00:11:37.905 --> 00:11:40.935 this embedding-based classification
207 00:11:40.935 --> 00:11:42.415 didn't give us very good results.
208 00:11:42.515 --> 00:11:44.855 We normally start with an input sentence,
209 00:11:45.345 --> 00:11:48.215 we pass it
210 00:11:48.215 --> 00:11:50.975 through an embedding model, and then we see
211 00:11:51.475 --> 00:11:53.925 to which centroid it is closest.
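The centroid approach just described can be sketched like this. It is an illustration only, not the speaker's code: average the embeddings of each category's training documents into a centroid, then assign a new document to the nearest centroid by cosine similarity. `embed()` here is a toy word-count stand-in for a real pre-trained embedding model.

```python
# Toy centroid-based classification: mean embedding per category, then
# nearest-centroid assignment by cosine similarity.
import math

def embed(text: str) -> list[float]:
    """Toy 2-d 'embedding': counts of two domain words, for illustration."""
    return [float(text.count("climate")), float(text.count("budget"))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def centroids(labeled_docs):
    """labeled_docs: {category: [doc, ...]} -> {category: mean vector}."""
    out = {}
    for cat, docs in labeled_docs.items():
        vecs = [embed(d) for d in docs]
        out[cat] = [sum(col) / len(vecs) for col in zip(*vecs)]
    return out

def classify(text, cents):
    """Pick the category whose centroid is closest to the text embedding."""
    return max(cents, key=lambda c: cosine(embed(text), cents[c]))

train = {
    "Climate action": ["climate emission targets", "climate policy climate"],
    "Budget": ["budget negotiation", "annual budget budget review"],
}
cents = centroids(train)
```

With a generic embedding model the centroids of closely related policy areas end up very near each other, which is exactly the failure mode described in this section.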
212 00:11:55.055 --> 00:11:59.755 The other approach to classification
213 00:12:00.415 --> 00:12:02.595 is by fine-tuning transformers.
214 00:12:03.135 --> 00:12:07.035 So we give an input sentence to an encoder-only
215 00:12:07.635 --> 00:12:09.355 transformer like BERT, for example,
216 00:12:09.855 --> 00:12:13.035 and then we put a classification head on top.
217 00:12:13.545 --> 00:12:15.155 This is normally the Hugging
218 00:12:15.345 --> 00:12:18.315 Face-style kind of classification.
219 00:12:18.815 --> 00:12:23.315 So whenever we ask the Hugging Face
220 00:12:24.115 --> 00:12:25.915 Transformers library for a classification,
221 00:12:26.335 --> 00:12:28.355 it normally gives us a way
222 00:12:28.355 --> 00:12:32.035 to fine-tune just this classification head,
223 00:12:32.495 --> 00:12:36.235 by putting, normally, a fully connected layer on top
224 00:12:36.815 --> 00:12:41.195 of BERT, and that is what is going to be fine-tuned:
225 00:12:41.855 --> 00:12:43.955 just the classification head.
226 00:12:43.975 --> 00:12:46.795 Or there is the possibility of fine-tuning the whole system
227 00:12:47.185 --> 00:12:48.195 down to BERT.
228 00:12:49.095 --> 00:12:52.115 But still, it requires extensive labeled data,
229 00:12:52.775 --> 00:12:55.915 and it's kind of a black box.
230 00:12:56.235 --> 00:13:00.635 I mean, I don't really know how BERT was trained,
231 00:13:01.135 --> 00:13:05.995 so probably... we didn't know
232 00:13:06.065 --> 00:13:07.435 what was in there.
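The idea of "fine-tune only the classification head" can be illustrated without any deep-learning library. In this sketch (an assumption-laden toy, not the speaker's setup), the encoder is frozen, so each document reduces to a fixed feature vector, and only a small logistic-regression head is trained on top of it. `fake_encode()` stands in for a frozen BERT-style encoder.

```python
# Toy "frozen encoder + trainable head": only the logistic head's weights
# w and bias b are updated; the feature extractor never changes.
import math

def fake_encode(text: str) -> list[float]:
    """Frozen 'encoder': fixed, deterministic 2-d features for the demo."""
    return [float(text.count("tax")), float(text.count("farm"))]

def train_head(data, lr=0.5, epochs=200):
    """Train a logistic-regression head with plain SGD on log-loss."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in data:
            x = fake_encode(text)
            p = 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))
            g = p - label                       # gradient of the log-loss
            w = [w[0] - lr * g * x[0], w[1] - lr * g * x[1]]
            b -= lr * g
    return w, b

def predict(w, b, text):
    x = fake_encode(text)
    return 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b))) > 0.5

data = [("tax tax reform", 1), ("farm subsidies", 0),
        ("tax policy", 1), ("farm bill", 0)]
w, b = train_head(data)
```

Fine-tuning "down to BERT" would mean also updating the encoder's parameters, which is where the large labeled-data requirement mentioned above comes from.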
233 00:13:08.495 --> 00:13:10.875 We go to the next possibility,
234 00:13:11.695 --> 00:13:14.795 and I don't like this one too much: it's zero-
235 00:13:15.015 --> 00:13:18.915 and few-shot classification, by giving GPT,
236 00:13:18.935 --> 00:13:21.115 or whatever model, a huge prompt
237 00:13:21.535 --> 00:13:25.835 and asking for structured JSON output,
238 00:13:26.455 --> 00:13:31.115 giving the possible classes
239 00:13:31.775 --> 00:13:34.195 as an enumeration in the prompt,
240 00:13:34.695 --> 00:13:36.795 and then parsing the output.
241 00:13:37.655 --> 00:13:40.675 In this case, I give a long prompt telling it
242 00:13:40.775 --> 00:13:43.875 to provide the answer in the given format,
243 00:13:44.575 --> 00:13:46.275 and then I give the input text.
244 00:13:46.855 --> 00:13:50.235 Why I don't like this: because it's a complete black box,
245 00:13:50.565 --> 00:13:51.995 and there is the risk of bias.
246 00:13:52.295 --> 00:13:54.835 And in our case, we have 32 categories.
247 00:13:55.055 --> 00:13:56.155 So we tested it,
248 00:13:56.415 --> 00:13:58.355 and we found that, actually,
249 00:13:58.935 --> 00:14:01.715 by splitting our initial corpus into a train set
250 00:14:02.215 --> 00:14:07.195 and a test set,
251 00:14:09.615 --> 00:14:14.155 the metrics, so the accuracy and
252 00:14:14.535 --> 00:14:17.435 the confusion matrix, were really bad, actually.
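The prompt-based classification just described can be sketched as follows. The class names and `call_llm()` stub are hypothetical placeholders; in a real system `call_llm()` would hit an LLM API. The shape is the point: classes enumerated in the prompt, a structured JSON answer, and a parse step with a guard against labels outside the allowed set.

```python
# Sketch of zero-shot classification via an enumerated-classes prompt and
# structured JSON output. call_llm() is a stub standing in for a real API.
import json

CLASSES = ["Agriculture", "Climate action", "Energy", "Migration"]  # subset

def build_prompt(text: str) -> str:
    enum = "\n".join(f"- {c}" for c in CLASSES)
    return (
        "Classify the document into exactly one of these policy areas:\n"
        f"{enum}\n"
        'Answer ONLY with JSON like {"category": "<name>"}.\n\n'
        f"Document:\n{text}"
    )

def call_llm(prompt: str) -> str:
    """Stub; a real implementation would call an LLM here."""
    return '{"category": "Climate action"}'

def classify(text: str) -> str:
    reply = call_llm(build_prompt(text))
    category = json.loads(reply)["category"]
    if category not in CLASSES:       # guard against hallucinated labels
        raise ValueError(f"model returned unknown class: {category}")
    return category
```

Even with the guard, the decision itself stays opaque, which is the "complete black box" objection raised in this section.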
253 00:14:18.615 --> 00:14:22.755 So we went for a hybrid approach,
254 00:14:22.755 --> 00:14:23.955 one that actually uses Milvus.
255 00:14:24.895 --> 00:14:29.515 And, in a very experimental way, we took the best
256 00:14:29.735 --> 00:14:31.235 of the two worlds, somehow.
257 00:14:32.255 --> 00:14:35.835 So from traditional methods, we use KeyBERT.
258 00:14:36.175 --> 00:14:37.675 I will show you in the next slide
259 00:14:37.675 --> 00:14:38.755 what KeyBERT is about:
260 00:14:39.025 --> 00:14:43.835 it's a very simple algorithm,
261 00:14:43.855 --> 00:14:45.875 but it works really well.
262 00:14:46.415 --> 00:14:47.755 We took TF-IDF;
263 00:14:47.935 --> 00:14:50.315 in fact, we're talking about hybrid search.
264 00:14:50.735 --> 00:14:51.965 And we use KNN,
265 00:14:52.225 --> 00:14:55.205 which is a simple, interpretable classifier.
266 00:14:55.985 --> 00:14:59.525 And from the neural approaches, we took embeddings:
267 00:15:00.185 --> 00:15:02.765 BGE-M3, that is the embedding model we are
268 00:15:02.765 --> 00:15:03.845 talking about today.
269 00:15:04.385 --> 00:15:07.925 And we use the Milvus vector database
270 00:15:08.225 --> 00:15:10.485 for efficient hybrid search.
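The KNN step named here can be sketched as follows: retrieve the k nearest labeled documents, then vote, weighting each neighbor by its similarity, so the output is either one top category or several overlapping ones with scores. The neighbor list mimics what a vector search would return; it is illustrative, not actual retrieval output.

```python
# Similarity-weighted KNN voting over retrieved neighbors: returns the top
# category, or all categories above a score threshold in multi-label mode.
from collections import defaultdict

def knn_classify(neighbors, k=5, multi_label=False, threshold=0.2):
    """neighbors: (category, similarity) pairs, sorted by similarity desc."""
    votes = defaultdict(float)
    for category, sim in neighbors[:k]:
        votes[category] += sim                    # similarity-weighted vote
    total = sum(votes.values())
    scores = {c: v / total for c, v in votes.items()}   # normalize to 1
    if multi_label:
        return {c: s for c, s in scores.items() if s >= threshold}
    return max(scores, key=scores.get)

hits = [("Climate action", 0.92), ("Energy", 0.88),
        ("Climate action", 0.85), ("Agriculture", 0.40),
        ("Climate action", 0.35)]
```

The normalized vote weights double as the "distance score from these categories" mentioned earlier, which keeps the classifier interpretable.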
271 00:15:11.805 --> 00:15:16.305 So, the pipeline, seen from above:
272 00:15:16.965 --> 00:15:18.545 it takes a text.
273 00:15:19.315 --> 00:15:22.945 There is an initial language detection that we use
274 00:15:23.285 --> 00:15:27.505 to set the language of NLTK
275 00:15:27.685 --> 00:15:28.945 for stop-word removal.
276 00:15:29.365 --> 00:15:32.345 But these are very common pre-processing steps.
277 00:15:33.745 --> 00:15:38.725 We give the sentence to KeyBERT, which extracts keywords.
278 00:15:39.025 --> 00:15:43.805 So KeyBERT is a special algorithm that uses BERT
279 00:15:43.905 --> 00:15:45.445 to extract which are
280 00:15:45.665 --> 00:15:50.045 the most important keywords, given a text.
281 00:15:51.025 --> 00:15:54.445 And we create a new, transformed version of
282 00:15:54.585 --> 00:15:56.045 the input text,
283 00:15:56.505 --> 00:15:59.365 and we give that version to the sparse vectorizer,
284 00:15:59.865 --> 00:16:03.165 and we give the original, well, without stop words,
285 00:16:03.345 --> 00:16:05.925 but we could just as well give it
286 00:16:05.925 --> 00:16:10.245 with the stop words, we give the original sentence
287 00:16:10.305 --> 00:16:13.565 to be transformed into the dense vector.
288 00:16:14.225 --> 00:16:17.765 So this approach, first of all,
289 00:16:17.825 --> 00:16:19.365 gave us the best results,
290 00:16:19.625 --> 00:16:21.445 and it's kind of a hybrid.
291 00:16:22.425 --> 00:16:25.565 I'll show you now how we work with KeyBERT.
292 00:16:25.825 --> 00:16:30.085 But please take into consideration that somehow here
293 00:16:30.775 --> 00:16:35.245 we're creating a new sentence that is really
294 00:16:35.305 --> 00:16:38.245 likely to be well treated
295 00:16:38.625 --> 00:16:41.285 by the TF-IDF algorithm.
296 00:16:41.705 --> 00:16:44.405 So we are kind of cheating, let's say.
297 00:16:44.665 --> 00:16:49.485 We are augmenting the sentence to exploit
298 00:16:50.025 --> 00:16:51.525 how TF-IDF works:
299 00:16:52.075 --> 00:16:54.725 that is, it works with the
300 00:16:54.785 --> 00:16:56.085 term frequency.
301 00:16:56.305 --> 00:17:00.965 So we are somehow artificially inflating the
302 00:17:01.425 --> 00:17:04.685 term frequencies of the keywords we care about.
303 00:17:05.265 --> 00:17:09.885 And in this way, when we get to the
304 00:17:09.885 --> 00:17:12.565 KNN part, we got really good results,
305 00:17:12.565 --> 00:17:15.525 because the sparse TF-IDF version
306 00:17:15.705 --> 00:17:17.965 of this algorithm works really well.
307 00:17:18.465 --> 00:17:23.405 And, as a spoiler, we are actually retrieving
308 00:17:23.405 --> 00:17:26.045 with a weight of 50% on the dense vector,
309 00:17:26.475 --> 00:17:29.765 which gives us the whole meaning of the sentence, of course.
310 00:17:30.305 --> 00:17:34.085 And 50% on TF-IDF, which in this case we use
311 00:17:34.085 --> 00:17:36.005 because we are sure that in such a
312 00:17:36.105 --> 00:17:39.645 domain-specific jargon
313 00:17:39.745 --> 00:17:44.725 and language, we still want to use keywords.
314 00:17:44.875 --> 00:17:47.765 Because keywords in politics,
315 00:17:47.945 --> 00:17:52.085 in lawmaking, the names of the committees,
316 00:17:52.145 --> 00:17:54.725 for example, those are really important names.
317 00:17:54.985 --> 00:17:59.605 We cannot just retrieve documents
318 00:18:00.225 --> 00:18:03.845 for classifying them based on meaning alone.
319 00:18:04.665 --> 00:18:06.325 We still want to keep some meaning,
320 00:18:06.425 --> 00:18:10.325 but this is a way of highlighting the value
321 00:18:10.585 --> 00:18:14.245 of the specific keywords that belong
322 00:18:14.265 --> 00:18:15.485 to our domain.
323 00:18:17.145 --> 00:18:20.925 So what is KeyBERT? KeyBERT is a very
324 00:18:21.285 --> 00:18:24.085 straightforward technique for extracting
325 00:18:24.165 --> 00:18:26.325 keywords from a text.
326 00:18:26.905 --> 00:18:30.325 It actually extracts
327 00:18:31.665 --> 00:18:35.325 the words whose single-word embedding has the highest
328 00:18:35.905 --> 00:18:39.965 cosine similarity
329 00:18:40.065 --> 00:18:41.205 to the whole document
330 00:18:41.355 --> 00:18:44.405 embedding, which means that it extracts the words
331 00:18:44.435 --> 00:18:47.085 that are the most representative of
332 00:18:47.505 --> 00:18:49.765 the document they are within.
333 00:18:50.305 --> 00:18:51.365 And those are the keywords.
334 00:18:51.825 --> 00:18:55.965 So this is very straightforward,
335 00:18:56.225 --> 00:18:57.685 as I said,
336 00:18:58.465 --> 00:19:00.525 and it can be downloaded,
337 00:19:00.665 --> 00:19:01.925 it works really well,
338 00:19:02.225 --> 00:19:04.765 and, from our experiments,
339 00:19:05.025 --> 00:19:08.205 it always gives very high-quality keywords.
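The KeyBERT idea just described can be shown in miniature: embed each candidate word and the whole document, then rank words by cosine similarity to the document embedding. The 3-d embedding table is made up for this demo; real KeyBERT uses BERT-derived embeddings.

```python
# Minimal KeyBERT-style keyword extraction: most-similar-to-document words.
import math

EMB = {  # hypothetical word vectors, for illustration only
    "climate":  [0.9, 0.1, 0.0],
    "emission": [0.8, 0.2, 0.1],
    "the":      [0.0, 0.0, 1.0],
    "spoke":    [0.1, 0.9, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def doc_embedding(words):
    """Mean of word vectors, standing in for a document embedding."""
    vecs = [EMB[w] for w in words]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def extract_keywords(words, top_n=2):
    """Rank words by cosine similarity to the whole-document embedding."""
    doc = doc_embedding(words)
    ranked = sorted(words, key=lambda w: cosine(EMB[w], doc), reverse=True)
    return ranked[:top_n]

doc = ["the", "climate", "emission", "spoke"]
top = extract_keywords(doc)
```

The real library also supports n-gram keyphrases and diversity-aware selection, but the core ranking is exactly this similarity comparison.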
340 00:19:09.845 --> 00:19:13.385 So, just to show you KeyBERT in action:
341 00:19:13.455 --> 00:19:16.145 this is part of a speech from the first system
342 00:19:16.175 --> 00:19:17.705 that I showed you, Stream Scope,
343 00:19:18.245 --> 00:19:21.545 and this is a commissioner-designate speaking at the beginning
344 00:19:21.545 --> 00:19:23.025 of November.
345 00:19:23.645 --> 00:19:28.425 So this commissioner-designate is designate
346 00:19:28.805 --> 00:19:29.905 for climate.
347 00:19:30.485 --> 00:19:34.345 Of course, he was speaking about climate, and this is,
348 00:19:35.115 --> 00:19:38.145 after treating it with KeyBERT,
349 00:19:38.205 --> 00:19:39.425 the output.
350 00:19:39.965 --> 00:19:44.515 So the most important keywords are "climate", "proposal",
351 00:19:44.615 --> 00:19:46.955 "2015", "emission", "proposal".
352 00:19:47.335 --> 00:19:50.835 And as you see, it works
353 00:19:51.055 --> 00:19:53.795 quite well to extract keywords.
354 00:19:54.435 --> 00:19:58.845 Just one warning: we fine-tuned the version of BERT
355 00:19:59.185 --> 00:20:01.325 on the legislative documents,
356 00:20:01.595 --> 00:20:06.085 because, in any case, it's an unsupervised kind of training.
357 00:20:06.705 --> 00:20:09.645 So we had the legislative texts, tagged
358 00:20:10.265 --> 00:20:12.445 by people at the European Commission and
359 00:20:12.445 --> 00:20:15.285 at the European Parliament, tagged very well, also
360 00:20:15.285 --> 00:20:18.485 because we did some
361 00:20:18.865 --> 00:20:20.565 data visualization, of course,
362 00:20:20.665 --> 00:20:22.365 and we saw that
363 00:20:22.365 --> 00:20:25.405 the clusters were quite far away from each other.
364 00:20:26.305 --> 00:20:29.965 And we did the fine-tuning
365 00:20:30.545 --> 00:20:33.525 of BERT for masked language modeling.
366 00:20:33.915 --> 00:20:37.045 That is the kind of
367 00:20:37.295 --> 00:20:40.965 training objective (sorry, not downstream task) where
368 00:20:41.495 --> 00:20:44.445 words are removed from within the sentence,
369 00:20:44.825 --> 00:20:48.205 and that is actually one of the two objectives
370 00:20:48.675 --> 00:20:50.445 that BERT was trained for.
371 00:20:50.945 --> 00:20:54.445 So we really fine-tuned the core
372 00:20:55.065 --> 00:20:56.405 of BERT in this way.
373 00:20:56.985 --> 00:21:00.205 And since KeyBERT uses BERT,
374 00:21:00.875 --> 00:21:05.285 this gave us the possibility of using the new
375 00:21:05.745 --> 00:21:07.965 keywords that come up.
376 00:21:08.185 --> 00:21:11.725 So if we have recent documents, we know
377 00:21:11.725 --> 00:21:13.885 that BERT will know those words.
378 00:21:14.465 --> 00:21:18.005 And if one word is really important,
379 00:21:18.005 --> 00:21:20.205 because BERT saw it in context,
380 00:21:20.625 --> 00:21:25.245 it gave it
381 00:21:25.765 --> 00:21:28.485 a good meaning for our objective.
382 00:21:28.905 --> 00:21:31.245 So we know that BERT,
383 00:21:31.545 --> 00:21:34.845 if we do this fine-tuning, will always be updated
384 00:21:35.225 --> 00:21:38.245 and will also extract very important keywords.
385 00:21:38.865 --> 00:21:42.365 So the result of KeyBERT is
386 00:21:42.725 --> 00:21:45.445 keywords plus the cosine similarity of each word
387 00:21:45.725 --> 00:21:46.885 to the whole document.
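The masked-language-modeling objective described here can be illustrated with just its data-preparation step: some tokens are hidden behind a `[MASK]` placeholder and the model is trained to recover them, which is how domain fine-tuning teaches BERT new jargon. The mask rate and example sentence are illustrative.

```python
# MLM data preparation sketch: randomly replace tokens with [MASK] and keep
# the hidden originals as the prediction targets.
import random

def mask_tokens(tokens, mask_rate=0.15, seed=1):
    """Return (masked_tokens, labels); labels map index -> hidden token."""
    rng = random.Random(seed)
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            labels[i] = tok        # the model is trained to predict this
        else:
            masked.append(tok)
    return masked, labels

tokens = "the commission adopted the methane emission proposal".split()
masked, labels = mask_tokens(tokens)
```

Because the targets come from the text itself, no manual labels are needed, which is why this counts as unsupervised training in the sense used above.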
388 00:21:50.875 --> 00:21:54.765 Okay, so let's get to how
389 00:21:54.865 --> 00:21:56.445 we did it,
390 00:21:56.445 --> 00:21:57.925 how we put this in action.
391 00:21:58.505 --> 00:22:01.365 As I said, there are thirty-
392 00:22:01.385 --> 00:22:03.245 two categories in the EU,
393 00:22:03.785 --> 00:22:07.365 the policy domains, and they keep evolving.
394 00:22:07.675 --> 00:22:11.725 That is, every month there are new acronyms, new names
395 00:22:11.955 --> 00:22:16.085 that appear
396 00:22:17.405 --> 00:22:18.465 in the conversation.
397 00:22:18.885 --> 00:22:22.665 So we wanted to build a system that would
398 00:22:23.175 --> 00:22:24.745 keep being updated.
399 00:22:27.695 --> 00:22:31.195 How did we train the system in a iterative way?
400 00:22:31.975 --> 00:22:34.795 So first of all, taking all the documents
401 00:22:35.755 --> 00:22:36.895 of the previous slide.
402 00:22:36.955 --> 00:22:38.455 So we have 32 topic areas,
403 00:22:38.595 --> 00:22:41.015 and we have quite a lot of documents that belong
404 00:22:41.015 --> 00:22:42.255 to each of the topic areas.
405 00:22:43.075 --> 00:22:47.615 We finetune bird for, for using them with key bird on those,
406 00:22:47.955 --> 00:22:50.855 uh, on those documents, uh,
407 00:22:50.855 --> 00:22:54.695 that are official from the uk from the European Parliament.
408 00:22:55.285 --> 00:22:59.175 Then we do some pre-processing, uh, for example, um,
409 00:22:59.815 --> 00:23:02.655 considering that some documents can have multiple tags,
410 00:23:03.075 --> 00:23:05.615 we just keep the ones that have a single tag.
411 00:23:06.035 --> 00:23:07.855 And, uh, so we skip some of them.
412 00:23:08.635 --> 00:23:11.735 And then through KeyBERT, we create
413 00:23:12.055 --> 00:23:14.015 a transformed document
414 00:23:14.115 --> 00:23:17.255 by repeating the most important keywords found by KeyBERT.
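The transformation step described here can be sketched roughly as follows. This is a minimal stand-in, not the speaker's actual code: the `build_transformed_doc` helper and the keyword scores are hypothetical; in the real pipeline the keyword/relevance pairs come from KeyBERT running on a fine-tuned BERT.

```python
def build_transformed_doc(keywords, scale=10):
    """Build the 'transformed document': repeat each extracted keyword
    proportionally to its relevance, so a downstream TF-IDF / sparse
    embedding assigns it more weight."""
    parts = []
    for word, score in keywords:
        parts.extend([word] * max(1, round(score * scale)))
    return " ".join(parts)

# Hypothetical KeyBERT-style output: (keyword, relevance in [0, 1]).
keywords = [("fisheries policy", 0.7), ("catch quota", 0.5), ("sustainability", 0.3)]
doc = build_transformed_doc(keywords)
```

The key idea, as the speaker says next, is that repetition "tricks" the TF-IDF stage into counting semantically important keywords more often.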
415 00:23:17.955 --> 00:23:20.975 So for example, the segment that I showed you
416 00:23:20.975 --> 00:23:22.375 before,
417 00:23:22.795 --> 00:23:26.655 and this has been cut, it's truncated at the end,
418 00:23:27.115 --> 00:23:29.335 but it becomes something like this.
419 00:23:29.955 --> 00:23:34.775 So you can see that here we are somehow tricking, uh,
420 00:23:34.875 --> 00:23:39.335 the downstream TF-IDF system into giving more importance
421 00:23:39.715 --> 00:23:42.695 to keywords that we know semantically
422 00:23:43.285 --> 00:23:44.855 have more importance.
423 00:23:45.545 --> 00:23:48.165 So basically we are using TF-IDF
424 00:23:48.585 --> 00:23:50.645 as a counter, somehow,
425 00:23:50.795 --> 00:23:53.805 because we have already extracted quite a lot of meaning.
426 00:23:54.425 --> 00:23:58.085 But the cool thing is that, uh, through the Milvus
427 00:23:58.105 --> 00:24:01.485 hybrid search system, if we pass this kind
428 00:24:01.485 --> 00:24:05.205 of pre-processed document, we can use Milvus out
429 00:24:05.205 --> 00:24:07.005 of the box, as it comes.
430 00:24:07.225 --> 00:24:10.845 So we can use hybrid search with 50% weight on dense
431 00:24:10.945 --> 00:24:15.245 and 50% weight on sparse embeddings.
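The 50/50 weighting described here can be illustrated with a small sketch. This is not the Milvus implementation (Milvus performs this server-side, with its own score normalization); the `hybrid_rank` helper, document IDs, and scores below are hypothetical, assuming similarities already normalized to [0, 1].

```python
def hybrid_rank(dense_sims, sparse_sims, w_dense=0.5, w_sparse=0.5):
    """Fuse per-document dense and sparse similarity scores with fixed
    weights, then rank documents by the fused score (highest first)."""
    fused = {
        doc_id: w_dense * dense_sims.get(doc_id, 0.0)
                + w_sparse * sparse_sims.get(doc_id, 0.0)
        for doc_id in set(dense_sims) | set(sparse_sims)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical scores for three candidate documents.
dense = {"doc1": 0.90, "doc2": 0.40, "doc3": 0.70}
sparse = {"doc1": 0.20, "doc2": 0.95, "doc3": 0.60}
ranking = hybrid_rank(dense, sparse)
```

Note how a document that is merely decent in both spaces (doc2, doc3) can outrank one that is excellent in only one of them (doc1), which is exactly why the hybrid beats either search alone in this domain.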
432 00:24:15.785 --> 00:24:18.565 And then we apply BGE M3
433 00:24:18.945 --> 00:24:22.885 and we insert the training documents into Milvus in a special
434 00:24:23.565 --> 00:24:27.405 training collection that we actually
435 00:24:27.755 --> 00:24:31.445 keep feeding every 15 days.
436 00:24:31.465 --> 00:24:34.005 We have a process that takes the latest documents
437 00:24:34.585 --> 00:24:38.485 and iterates here, so that we know that the BERT
438 00:24:38.635 --> 00:24:40.605 that we are using eventually
439 00:24:41.225 --> 00:24:44.645 for transforming these documents actually
440 00:24:45.475 --> 00:24:49.565 knows the meaning of these new tokens.
441 00:24:50.125 --> 00:24:52.885 I mean, even if BERT uses WordPiece tokenization,
442 00:24:53.145 --> 00:24:55.405 so acronyms are not a problem.
443 00:24:55.985 --> 00:25:00.605 Uh, with BERT, uh, we found out experimentally
444 00:25:00.795 --> 00:25:02.005 that it was important,
445 00:25:02.115 --> 00:25:06.325 because those tokens that are surrounded,
446 00:25:08.935 --> 00:25:13.145 as in normal embeddings, that are surrounded
447 00:25:13.245 --> 00:25:16.905 by that context, actually by fine-tuning
448 00:25:17.205 --> 00:25:19.865 we give them the meaning that they deserve.
449 00:25:20.245 --> 00:25:24.825 And so in the process of BERT that I showed you
450 00:25:24.825 --> 00:25:29.745 before, they are kind of enhanced for the
451 00:25:30.815 --> 00:25:34.585 special TF-IDF treatment that we do later.
452 00:25:35.445 --> 00:25:37.215 This is for the training system.
453 00:25:37.475 --> 00:25:39.495 So we end up with a training collection
454 00:25:39.495 --> 00:25:42.735 with all the documents with these two kinds of vectors:
455 00:25:42.915 --> 00:25:47.295 a sparse vector with the transformed document, a dense vector
456 00:25:47.405 --> 00:25:48.975 with the original document.
457 00:25:49.785 --> 00:25:52.285 And then, how do we classify an unknown document?
458 00:25:52.825 --> 00:25:55.485 So we do exactly as we did before.
459 00:25:55.665 --> 00:26:00.645 So we use KeyBERT, our fine-tuned KeyBERT, to get the best,
460 00:26:00.785 --> 00:26:02.765 uh, keywords out of that document.
461 00:26:03.585 --> 00:26:06.005 Uh, we apply BGE M3
462 00:26:06.865 --> 00:26:08.725 in the same way we showed before.
463 00:26:08.825 --> 00:26:11.445 So we have a dense vector on the original one,
464 00:26:12.205 --> 00:26:14.805 a sparse vector on the transformed document.
465 00:26:15.745 --> 00:26:18.005 Uh, we have a special API endpoint,
466 00:26:18.035 --> 00:26:20.725 because we don't do this on the production server,
467 00:26:20.785 --> 00:26:24.245 of course; we have a GPU server with an embed endpoint
468 00:26:24.275 --> 00:26:27.405 that hosts this model.
469 00:26:28.145 --> 00:26:31.845 And then we retrieve the K closest
470 00:26:32.605 --> 00:26:36.525 documents using the vanilla, normal
471 00:26:36.945 --> 00:26:38.125 Milvus hybrid search.
472 00:26:38.945 --> 00:26:43.085 And, uh, since they are retrieved in order
473 00:26:43.385 --> 00:26:47.925 of distance, documents
474 00:26:47.925 --> 00:26:50.965 that have been found to be closer in
475 00:26:51.025 --> 00:26:54.045 both the dense world
476 00:26:54.465 --> 00:26:58.125 and in the sparse world are retrieved first,
477 00:26:58.625 --> 00:27:02.005 of course, and they must have more influence.
478 00:27:03.145 --> 00:27:06.485 And we do a sort of voting, uh, such
479 00:27:06.485 --> 00:27:10.405 that the similarity score determines the voting power of each
480 00:27:10.425 --> 00:27:11.845 of the training documents.
481 00:27:12.265 --> 00:27:15.765 And each document is actually voting for its own category.
482 00:27:16.455 --> 00:27:20.075 So at the end we have a sort of weighted
483 00:27:20.595 --> 00:27:21.915 KNN, you can call it.
484 00:27:22.455 --> 00:27:25.755 So we sum up the weighted votes for each category,
485 00:27:26.055 --> 00:27:27.235 and we choose the category
486 00:27:27.235 --> 00:27:29.275 with the highest total weight.
487 00:27:29.855 --> 00:27:31.395 And this is an example.
488 00:27:31.575 --> 00:27:34.275 So if we have K equals five,
489 00:27:34.615 --> 00:27:38.115 and we find that these are the first five documents,
490 00:27:38.255 --> 00:27:42.355 and we have these kinds of similarities,
491 00:27:42.495 --> 00:27:45.235 since of course they were labeled, we know that
492 00:27:45.955 --> 00:27:48.595 category A eventually wins with a weight
493 00:27:48.595 --> 00:27:53.195 of 2.69 versus category B with a weight of 1.65.
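The weighted-KNN vote can be sketched in a few lines. The individual similarity scores below are illustrative inventions (only the totals, A = 2.69 versus B = 1.65, match the example from the talk), and `weighted_knn_vote` is a hypothetical helper, not the speaker's code.

```python
from collections import defaultdict

def weighted_knn_vote(neighbors):
    """neighbors: (category, similarity) pairs for the top-K retrieved
    training documents. Each document votes for its own category, with
    its similarity score as voting power; the heaviest category wins."""
    totals = defaultdict(float)
    for category, sim in neighbors:
        totals[category] += sim
    winner = max(totals, key=totals.get)
    return winner, totals[winner]

# Hypothetical K=5 retrieval result (similarities chosen so the
# category totals match the slide: A = 2.69, B = 1.65).
top5 = [("A", 0.95), ("B", 0.90), ("A", 0.88), ("B", 0.75), ("A", 0.86)]
winner, weight = weighted_knn_vote(top5)
```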
494 00:27:54.615 --> 00:27:58.995 And so this is the KNN part of this, uh, of this webinar.
495 00:27:59.935 --> 00:28:03.875 Um, and this gave us very good results.
496 00:28:04.375 --> 00:28:09.275 Um, I'll show you how we benchmarked it, actually.
497 00:28:10.015 --> 00:28:13.635 Another question could be: why BGE M3?
498 00:28:14.595 --> 00:28:16.075 Actually, the experimental results.
499 00:28:16.375 --> 00:28:18.715 And, uh, you can find this article on Medium.
500 00:28:19.295 --> 00:28:21.355 Uh, first of all, it's multi-language,
501 00:28:21.535 --> 00:28:23.475 and being multi-language, uh,
502 00:28:23.815 --> 00:28:25.995 and why is it multi-language?
503 00:28:26.145 --> 00:28:28.235 It's really good at most languages.
504 00:28:28.745 --> 00:28:32.075 In mean reciprocal rank,
505 00:28:32.075 --> 00:28:35.955 that is, this metric that, uh, shows how good
506 00:28:36.235 --> 00:28:40.435 a model is at retrieving the most relevant results,
507 00:28:41.135 --> 00:28:45.235 it surpasses most commercial, uh, embeddings.
508 00:28:45.375 --> 00:28:48.635 So OpenAI, for example, these are the two OpenAI ones,
509 00:28:49.055 --> 00:28:53.235 and actually, on the mean of the accuracy
510 00:28:53.495 --> 00:28:55.675 in retrieving the correct
511 00:28:56.355 --> 00:29:00.475 documents, BGE M3 is surpassing, in all languages,
512 00:29:00.895 --> 00:29:01.995 all the other options.
513 00:29:02.265 --> 00:29:04.035 This is why we use BGE M3.
514 00:29:04.215 --> 00:29:06.955 And also because it was embedded in Milvus,
515 00:29:06.955 --> 00:29:10.675 and we actually built a lot of stuff on top of Milvus,
516 00:29:10.815 --> 00:29:13.195 so it was natural for us to use it.
517 00:29:15.285 --> 00:29:18.665 Um, last thing: how did we benchmark this?
518 00:29:19.175 --> 00:29:20.585 It's not an exact science.
519 00:29:21.185 --> 00:29:25.585 I mean, uh, this was made, uh, um, I mean, we are coders.
520 00:29:25.885 --> 00:29:29.705 We do the things
521 00:29:29.735 --> 00:29:33.865 that actually work well for us, and for our specific domain
522 00:29:33.865 --> 00:29:36.345 it worked really well. How did we do it?
523 00:29:36.645 --> 00:29:39.905 So basically we split our training dataset
524 00:29:40.415 --> 00:29:42.245 into training and validation sets,
525 00:29:42.465 --> 00:29:44.805 and then we checked the confusion matrix,
526 00:29:45.225 --> 00:29:49.685 and actually this whole pipeline gave us very, very good
527 00:29:50.325 --> 00:29:52.765 accuracy on all the 32 classes.
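The evaluation step described here, a held-out split scored with accuracy and a confusion matrix, can be sketched as below. The labels and the two helpers are hypothetical toy stand-ins; the real benchmark uses the 32 policy classes and the full pipeline's predictions.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true_label, predicted_label) pairs: the entries of a
    confusion matrix, stored as a sparse Counter."""
    return Counter(zip(y_true, y_pred))

def accuracy(y_true, y_pred):
    """Fraction of validation documents classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy validation labels for three classes.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "C"]
cm = confusion_counts(y_true, y_pred)
acc = accuracy(y_true, y_pred)
```

Off-diagonal entries like `cm[("A", "B")]` show which classes get confused with each other, which is what makes the matrix more informative than accuracy alone.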
528 00:29:53.755 --> 00:29:55.085 Also, we have a human in the loop,
529 00:29:55.365 --> 00:29:57.525 'cause remember, we have our policy specialists,
530 00:29:57.865 --> 00:29:58.965 and actually they were
531 00:29:59.475 --> 00:30:02.405 really happy about how the algorithm works.
532 00:30:03.915 --> 00:30:05.695 We do have next steps.
533 00:30:05.975 --> 00:30:10.255 I mean, this system was made quite fast, also
534 00:30:10.255 --> 00:30:11.415 to create an MVP.
535 00:30:11.675 --> 00:30:13.295 And of course, I think that some
536 00:30:13.295 --> 00:30:15.655 of this complexity could be avoided
537 00:30:15.655 --> 00:30:19.735 by fine-tuning our own version of BGE M3.
538 00:30:21.755 --> 00:30:24.295 And the other thing is that there is no out
539 00:30:24.295 --> 00:30:25.375 of domain detection.
540 00:30:25.595 --> 00:30:28.695 So if a text is talking about, I don't know,
541 00:30:28.855 --> 00:30:32.255 a cooking recipe, we will still be, uh,
542 00:30:32.615 --> 00:30:34.015 classifying it as policy.
543 00:30:34.475 --> 00:30:37.855 But the good thing is that everything we try
544 00:30:37.855 --> 00:30:39.215 to classify is policy.
545 00:30:39.475 --> 00:30:41.535 So we are always somehow in domain.
546 00:30:41.795 --> 00:30:45.175 So this could be something, uh,
547 00:30:45.695 --> 00:30:47.015 a nice-to-have for the future.
548 00:30:49.095 --> 00:30:51.515 Um, I'll give you a small demo.
549 00:30:51.855 --> 00:30:56.475 Uh, this is the last thing, uh, for the,
550 00:30:58.545 --> 00:31:01.085 for the broadcast, um, system.
551 00:31:02.505 --> 00:31:05.695 I need to change window.
552 00:31:06.475 --> 00:31:08.695 Um, there we go.
553 00:31:11.055 --> 00:31:12.355 So it's called the screen scope.
554 00:31:13.815 --> 00:31:17.015 And so,
555 00:31:21.825 --> 00:31:25.005 Hey, so this is the recording.
556 00:31:25.065 --> 00:31:27.165 It was made, um, basically
557 00:31:27.165 --> 00:31:31.805 40 minutes after the end of this three-hour-long,
558 00:31:32.225 --> 00:31:36.965 um, speech; the whole pipeline had already
559 00:31:37.825 --> 00:31:39.805 created this classification.
560 00:31:40.305 --> 00:31:42.005 We do quite a lot of things,
561 00:31:42.005 --> 00:31:45.525 because we segment into parts,
562 00:31:45.585 --> 00:31:47.405 we understand whether it's a question
563 00:31:47.465 --> 00:31:49.445 and there is a follow-up answer.
564 00:31:50.225 --> 00:31:53.605 But, uh, regarding our webinar today, we are
565 00:31:54.605 --> 00:31:57.245 actually tagging into policy areas.
566 00:31:57.865 --> 00:32:01.325 So, uh, if I select that, uh,
567 00:32:01.355 --> 00:32:05.245 when he's talking about enterprise, I get just the segments
568 00:32:05.245 --> 00:32:06.925 that talk about enterprise.
569 00:32:07.305 --> 00:32:10.205 Of course we also have a full-text search.
570 00:32:10.585 --> 00:32:15.125 We have summaries; we have, like, the possible other,
571 00:32:15.705 --> 00:32:19.205 um, topics that we talked about
572 00:32:19.305 --> 00:32:20.325 during this segment.
573 00:32:21.025 --> 00:32:24.445 And last thing, of course, there,
574 00:32:24.655 --> 00:32:29.005 there must be a chatbot somewhere. We can ask, uh,
575 00:32:29.025 --> 00:32:31.125 the person, for example, what are
576 00:32:32.345 --> 00:32:35.315 your priorities as a commissioner?
577 00:32:37.545 --> 00:32:39.645 And also here,
578 00:32:39.875 --> 00:32:44.385 um, the system will
579 00:32:44.945 --> 00:32:49.625 actually, uh, answer,
580 00:32:50.165 --> 00:32:53.385 uh, give a good answer to this question.
581 00:32:53.485 --> 00:32:57.185 So: "based on the hearing transcript from Stemerman, the priorities
582 00:32:57.325 --> 00:32:58.745 are blah, blah, blah."
583 00:32:58.965 --> 00:33:03.025 But actually, as I said, we always want to ground our
584 00:33:03.405 --> 00:33:05.745 answers in something that is real.
585 00:33:06.245 --> 00:33:08.865 So we also give the actual quotes from the commissioner.
586 00:33:09.525 --> 00:33:11.985 So basically I think that any stakeholder
587 00:33:12.175 --> 00:33:15.065 that is interested in knowing, considering
588 00:33:15.065 --> 00:33:18.545 that there were 80 hours of this video,
589 00:33:19.285 --> 00:33:21.305 any stakeholder that is interested in knowing
590 00:33:21.615 --> 00:33:25.625 what one person had to say about a particular topic
591 00:33:26.135 --> 00:33:27.225 will just go here.
592 00:33:27.865 --> 00:33:31.305 Actually we see, through Google Analytics,
593 00:33:31.535 --> 00:33:34.305 that people use this part the most,
594 00:33:34.655 --> 00:33:39.345 because it can answer really well all the,
595 00:33:40.325 --> 00:33:42.225 all the precise questions.
596 00:33:44.445 --> 00:33:48.335 Um, and this was it.
597 00:33:49.545 --> 00:33:52.235 Cool. Thank you very much. Thank you. Thank
598 00:33:52.235 --> 00:33:53.235 You very much.
599 00:33:53.575 --> 00:33:57.555 Uh, yes, we had one question in the chat. Mm-hmm.
600 00:33:57.775 --> 00:33:59.995 Uh, which is: so which LLM are you using here?
601 00:34:00.215 --> 00:34:03.475 Is it like Gemini 1.5, as you're saying it's multimodal?
602 00:34:03.985 --> 00:34:06.315 Yeah, it is. It is Gemini. Yes,
603 00:34:06.385 --> 00:34:07.385 It's Gemini. Okay.
604 00:34:07.385 --> 00:34:11.235 Is it 1.5 then? Yeah. Okay. Yeah.
605 00:34:11.295 --> 00:34:14.475 Are you planning on moving to the 2.0 now that was released?
606 00:34:14.705 --> 00:34:15.915 Yeah, I saw it. It was yesterday.
607 00:34:16.305 --> 00:34:19.395 Yeah, I didn't have time today. That's what I saw.
608 00:34:19.395 --> 00:34:22.155 Incredible stuff. People talking to the UI
609 00:34:24.415 --> 00:34:25.795 And yeah.
610 00:34:25.795 --> 00:34:27.515 Maybe I have one on my end.
611 00:34:27.575 --> 00:34:30.675 So like, have you seen like a really big improvement
612 00:34:30.675 --> 00:34:32.155 with like hybrid search, for example, for you,
613 00:34:32.155 --> 00:34:33.555 was it like a night
614 00:34:33.555 --> 00:34:35.435 and day improvement, for example, by using it?
615 00:34:35.625 --> 00:34:39.075 Yeah, completely, because, uh, as I said, uh,
616 00:34:39.135 --> 00:34:42.515 with vector-only search, uh, in this very deep,
617 00:34:43.075 --> 00:34:44.955 specific, uh, language, uh,
618 00:34:45.255 --> 00:34:48.715 the points in the multidimensional space are too close.
619 00:34:49.055 --> 00:34:51.235 So vectors are, like, too similar,
620 00:34:51.495 --> 00:34:53.475 and you get somehow some random results.
621 00:34:54.015 --> 00:34:55.795 And at the same time, um,
622 00:34:56.585 --> 00:35:00.235 sparse-only search was not the way to go.
623 00:35:00.495 --> 00:35:04.155 And with this kind of hybrid search, we also did some, uh,
624 00:35:04.625 --> 00:35:07.315 grid search, uh, on the parameter of
625 00:35:07.615 --> 00:35:09.955 how much weight to give, mm-hmm,
626 00:35:10.035 --> 00:35:11.275 to the two types.
627 00:35:11.615 --> 00:35:14.435 And we actually found out that 50-50 was the best.
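The grid search over the dense/sparse weight can be sketched like this. The helper and the `toy_validation_accuracy` function are hypothetical: in the real setup, evaluating a weight means re-running the hybrid search and weighted-KNN classification on the validation split and measuring accuracy.

```python
def grid_search_dense_weight(evaluate, steps=10):
    """Try dense weights 0.0, 0.1, ..., 1.0 (sparse weight = 1 - dense)
    and keep the one with the best validation score."""
    best_w, best_score = 0.0, float("-inf")
    for i in range(steps + 1):
        w = i / steps
        score = evaluate(w)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score

def toy_validation_accuracy(w):
    # Stand-in for scoring the full pipeline on the validation split
    # at this weight; this toy version simply peaks at w = 0.5.
    return 1.0 - abs(w - 0.5)

best_w, best_acc = grid_search_dense_weight(toy_validation_accuracy)
```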
628 00:35:15.025 --> 00:35:17.915 Okay. Just as information? Yeah.
629 00:35:18.605 --> 00:35:20.415 Okay. Interesting. Um, yeah,
630 00:35:20.415 --> 00:35:21.975 and I had a follow up question, but I forgot.
631 00:35:21.975 --> 00:35:23.375 Oh yeah. With Neo4j.
632 00:35:23.755 --> 00:35:27.135 Uh, so how's it, like, do you work directly
633 00:35:27.135 --> 00:35:28.335 with GraphRAG, for example,
634 00:35:28.435 --> 00:35:31.855 or is it, like, um, different systems working together?
635 00:35:32.445 --> 00:35:34.695 With Neo4j, you're using Neo4j, right?
636 00:35:34.695 --> 00:35:35.975 So is it like GraphRAG-based,
637 00:35:36.035 --> 00:35:38.925 or is it mostly like entity and then vector? Yeah.
638 00:35:39.065 --> 00:35:41.965 No, no, we do just entities actually; we're just
639 00:35:42.465 --> 00:35:44.645 starting using it now.
640 00:35:44.755 --> 00:35:49.125 Okay. We use, um, NuExtract,
641 00:35:49.355 --> 00:35:52.085 that is an LLM that is specialized, um,
642 00:35:53.195 --> 00:35:56.645 open source, specialized in getting structured output.
643 00:35:57.185 --> 00:35:58.485 Mm-hmm. Well,
644 00:35:58.555 --> 00:36:02.685 because of course the problem is still text-to-graph.
645 00:36:03.085 --> 00:36:06.965 I mean, yeah, graph-to-text, it's done.
646 00:36:07.345 --> 00:36:09.285 But text-to-graph, I think that,
647 00:36:09.285 --> 00:36:11.765 besides a lot of people talking about it,
648 00:36:12.325 --> 00:36:16.005 I still haven't seen, like, a general implementation,
649 00:36:16.005 --> 00:36:18.805 because of course it's, I mean,
650 00:36:19.075 --> 00:36:20.365 it's really complicated
651 00:36:20.465 --> 00:36:22.285 to have a general implementation of that.
652 00:36:22.315 --> 00:36:23.315 Yeah.
653 00:36:23.675 --> 00:36:25.765 Okay. Cool. Perfect. Thank you.
654 00:36:26.125 --> 00:36:27.485 I think that was it for my questions.
655 00:36:28.065 --> 00:36:29.325 I'm just gonna wait quickly
656 00:36:29.385 --> 00:36:30.965 to see if anyone answers the question.
657 00:36:31.955 --> 00:36:34.175 Uh, but otherwise thank you.
658 00:36:34.395 --> 00:36:35.535 And for the people as well,
659 00:36:35.755 --> 00:36:38.015 and also the people that couldn't make it:
660 00:36:38.015 --> 00:36:39.375 everything will be shared online.
661 00:36:39.955 --> 00:36:41.775 Uh, everything will be shared on YouTube in a couple
662 00:36:41.775 --> 00:36:43.575 of days, we'll send it to our editors.
663 00:36:44.075 --> 00:36:46.535 Um, so then, uh, it's also available there.
664 00:36:47.115 --> 00:36:49.815 But I think that's it. Alessandro, thank you very much.
665 00:36:50.145 --> 00:36:51.535 Thank you for the presentation
666 00:36:52.355 --> 00:36:54.135 and see you next time everyone. Yeah.
667 00:36:54.195 --> 00:36:55.735 Bye bye. Bye-bye.
Meet the Speaker
Alessandro Saccoia
Co-Founder, Veridien.ai
Data-driven marketing and AI solutions expert. He has extensive experience in international companies like Nielsen Media and Vodafone, leading data science and product teams focused on applying machine learning to business challenges. Teaches AI and Data-Driven Marketing at the IULM University in Milan.