1
00:00:03.875 --> 00:00:06.455
So today I'm pleased to introduce, um,
2
00:00:06.675 --> 00:00:09.295
to the session, Vector Databases for Enhanced Classification,
3
00:00:09.515 --> 00:00:11.975
and our guest speaker, Alessandro Koya.
4
00:00:12.515 --> 00:00:14.215
He will talk about vector databases
5
00:00:14.395 --> 00:00:15.575
for enhanced classification
6
00:00:15.675 --> 00:00:16.815
and how they use Milvus
7
00:00:16.815 --> 00:00:18.735
for handling large scale document collections.
8
00:00:19.915 --> 00:00:22.015
Alessandro is a data-driven marketing
9
00:00:22.015 --> 00:00:23.455
and AI solution expert.
10
00:00:23.795 --> 00:00:26.655
He has extensive experience in international companies like
11
00:00:26.655 --> 00:00:28.575
Nielsen Media and Vodafone.
12
00:00:28.845 --> 00:00:30.375
He's also leading a data science
13
00:00:30.375 --> 00:00:32.375
and product team focused on applying machine
14
00:00:32.575 --> 00:00:33.695
learning to business challenges.
15
00:00:34.365 --> 00:00:35.735
He's also teaching AI
16
00:00:35.795 --> 00:00:39.735
and data-driven marketing at the IULM
17
00:00:39.735 --> 00:00:41.375
University in Milan.
18
00:00:42.495 --> 00:00:44.865
Welcome, Alessandro. The stage is yours now.
19
00:00:45.275 --> 00:00:47.545
Thank you, Steven. Yeah,
20
00:00:47.545 --> 00:00:48.665
thanks for the introduction.
21
00:00:48.665 --> 00:00:49.745
That was perfect.
22
00:00:50.605 --> 00:00:52.705
And sorry, everyone, for my throaty voice,
23
00:00:52.725 --> 00:00:54.225
but I just got sick today,
24
00:00:54.245 --> 00:00:56.745
I'm getting sick right at this moment.
25
00:00:57.375 --> 00:01:02.105
Okay, let's get to the presentation. One sec.
26
00:01:17.175 --> 00:01:17.525
Sorry.
27
00:01:20.645 --> 00:01:22.455
Hmm, sorry.
28
00:01:22.555 --> 00:01:24.815
One second. Um, Chrome tab.
29
00:01:27.695 --> 00:01:28.695
There we go.
30
00:01:31.555 --> 00:01:32.775
We can go. We
31
00:01:32.775 --> 00:01:33.775
See. Thank you.
32
00:01:33.775 --> 00:01:37.905
Okay.
33
00:01:38.525 --> 00:01:40.265
So, well, about me,
34
00:01:40.425 --> 00:01:42.745
I just got introduced; thank you, Steven.
35
00:01:43.525 --> 00:01:47.905
And so, just about my current focus:
36
00:01:48.355 --> 00:01:52.745
we've been jump-starting a startup in the last
37
00:01:52.805 --> 00:01:54.825
year
38
00:01:55.495 --> 00:01:59.225
that focuses on enhancing policy understanding
39
00:01:59.845 --> 00:02:01.305
and driving
40
00:02:01.625 --> 00:02:03.665
informed decision making in the public sector.
41
00:02:04.245 --> 00:02:06.385
So we have an attorney who works with us
42
00:02:06.445 --> 00:02:09.305
and is a deep expert in everything
43
00:02:09.305 --> 00:02:11.745
related to public policy and lawmaking.
44
00:02:11.925 --> 00:02:13.545
So this is our current focus,
45
00:02:14.045 --> 00:02:15.945
but we are actually creating products
46
00:02:16.085 --> 00:02:19.425
and algorithms that can be used in a more generic way.
47
00:02:20.445 --> 00:02:23.225
In particular, our focus
48
00:02:23.405 --> 00:02:26.905
and our background are in knowledge solutions.
49
00:02:27.365 --> 00:02:30.385
So at the moment we have a few products.
50
00:02:30.685 --> 00:02:34.905
One is Policy Manager, which aggregates news
51
00:02:34.905 --> 00:02:37.345
according to topic and policy areas.
52
00:02:37.605 --> 00:02:41.225
And that is why we need classification. Stream Scope,
53
00:02:41.335 --> 00:02:45.225
is a real-time broadcast analysis tool
54
00:02:45.575 --> 00:02:49.185
that can be used to extract knowledge in real time from,
55
00:02:49.445 --> 00:02:51.265
for example, parliamentary sessions
56
00:02:51.605 --> 00:02:53.345
or, like recently,
57
00:02:53.485 --> 00:02:56.865
for a commissioner-designate of the EU Commission.
58
00:02:57.445 --> 00:02:58.705
And we focus on policy.
59
00:02:58.885 --> 00:03:02.805
And so our founding team, as I said, includes an attorney
60
00:03:03.305 --> 00:03:04.525
and
61
00:03:05.025 --> 00:03:06.165
a couple of people
62
00:03:06.195 --> 00:03:09.045
who are very expert in quantitative analysis.
63
00:03:09.195 --> 00:03:12.565
Because of this, we try to create AI agents
64
00:03:12.835 --> 00:03:15.485
that can give our clients
65
00:03:15.865 --> 00:03:20.565
and customers an advantage in understanding
66
00:03:20.565 --> 00:03:22.325
what is going on in that sector.
67
00:03:24.085 --> 00:03:27.345
So, I'll give you a few examples of
68
00:03:27.345 --> 00:03:28.545
what we do,
69
00:03:28.545 --> 00:03:32.025
because I want you to understand why we needed
70
00:03:32.225 --> 00:03:36.105
a very fine-grained way of
71
00:03:36.425 --> 00:03:38.825
classifying documents in this case.
72
00:03:38.965 --> 00:03:43.825
This product, which is live, is used to
73
00:03:44.165 --> 00:03:46.865
transcribe, diarize,
74
00:03:47.205 --> 00:03:49.705
and classify live streams
75
00:03:49.965 --> 00:03:54.585
and classify them according to EU policy areas
76
00:03:55.015 --> 00:03:56.065
that are given.
77
00:03:56.295 --> 00:04:00.545
There are quite a lot of them, and they can overlap somehow.
78
00:04:01.005 --> 00:04:05.945
So we wanted to give our clients
79
00:04:06.745 --> 00:04:09.545
a system that does not hallucinate
80
00:04:09.565 --> 00:04:14.305
and that has a very laser-like
81
00:04:14.965 --> 00:04:16.185
way of classification.
82
00:04:16.885 --> 00:04:20.265
Maybe at the end of the webinar,
83
00:04:20.625 --> 00:04:23.625
I will also show you this interface live.
84
00:04:24.775 --> 00:04:28.105
Another thing we do is classic RAG-style projects.
85
00:04:28.485 --> 00:04:32.825
But, since we are talking about Milvus —
86
00:04:32.825 --> 00:04:36.025
because in all our products we use Milvus, also
87
00:04:36.295 --> 00:04:39.545
when there is a RAG involved — considering that we
88
00:04:39.605 --> 00:04:43.425
are striving for grounding, so giving trustworthy
89
00:04:43.725 --> 00:04:46.025
responses; multimodality,
90
00:04:46.285 --> 00:04:49.825
so of course we ingest documents of all kinds;
91
00:04:50.245 --> 00:04:51.865
and explainability.
92
00:04:52.165 --> 00:04:56.105
So for example, in this UI, you see
93
00:04:56.105 --> 00:04:58.945
that we actually, in our solutions,
94
00:04:59.405 --> 00:05:03.985
are showcasing the thought steps in a chain
95
00:05:04.005 --> 00:05:05.665
of thought fashion.
96
00:05:06.205 --> 00:05:11.105
So we always want to give our users recognition
97
00:05:11.245 --> 00:05:14.145
of what kind of data
98
00:05:14.245 --> 00:05:16.305
the AI agents have used
99
00:05:16.485 --> 00:05:19.345
to formulate their final answer, which is here.
100
00:05:20.045 --> 00:05:23.025
And because of that, of course, we can handle text,
101
00:05:23.575 --> 00:05:27.545
PDFs, videos, as you just saw, and audio files as well.
102
00:05:29.485 --> 00:05:32.345
All of this, for
103
00:05:33.445 --> 00:05:34.945
managing the knowledge.
104
00:05:35.405 --> 00:05:37.865
We use Django because
105
00:05:37.865 --> 00:05:41.025
it's a very robust system, it's made in Python,
106
00:05:41.165 --> 00:05:44.385
and we all have a Python background,
107
00:05:45.405 --> 00:05:48.785
and we use a parallel
108
00:05:49.705 --> 00:05:51.185
Milvus database.
109
00:05:51.525 --> 00:05:53.185
So we have hooks so that,
110
00:05:53.585 --> 00:05:57.685
whenever a new document is added to our Django backend,
111
00:05:58.155 --> 00:06:01.045
it's also added to our Milvus backend,
112
00:06:01.045 --> 00:06:04.805
because we want to use Milvus for its search capabilities
113
00:06:04.865 --> 00:06:06.205
as a vector database.
114
00:06:06.705 --> 00:06:11.085
And meanwhile, we use Django
115
00:06:11.545 --> 00:06:14.485
for keeping all the relational data.
116
00:06:15.105 --> 00:06:17.925
So all the products
117
00:06:18.075 --> 00:06:21.925
that we are building have this common characteristic
118
00:06:22.105 --> 00:06:25.525
of having a backend with Django that is somehow
119
00:06:26.645 --> 00:06:27.765
paralleled in Milvus.
120
00:06:28.745 --> 00:06:31.685
And in fact, this is something that at the beginning,
121
00:06:31.685 --> 00:06:34.485
when we were starting out, I was trying
122
00:06:34.585 --> 00:06:36.365
to understand if it had already been made.
123
00:06:36.945 --> 00:06:41.925
It was not, but using the Django primary
124
00:06:42.265 --> 00:06:46.325
key as the Milvus primary key became really natural.
125
00:06:46.585 --> 00:06:49.205
And the two systems work really well, actually,
126
00:06:49.425 --> 00:06:52.445
we just created one wrapper class,
127
00:06:52.945 --> 00:06:56.845
and then everything is basically working
128
00:06:57.525 --> 00:06:59.045
seamlessly and transparently.
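The wrapper-class idea just described can be sketched in Python. This is an illustrative stand-in, not the speaker's actual code: plain dicts play the role of the Django model table and the Milvus collection, and `toy_embed` is a made-up embedding function. The point is only the write-through sync and the shared primary key.

```python
# Sketch of one wrapper class keeping a relational store and a vector store
# in sync, reusing the relational primary key as the vector-DB primary key.
from itertools import count


class DocumentStore:
    """Write-through wrapper: every insert lands in both backends."""

    def __init__(self, embed):
        self._pk = count(1)
        self.relational = {}   # stand-in for the Django model table
        self.vectors = {}      # stand-in for the Milvus collection
        self.embed = embed     # any callable: text -> vector

    def add(self, text):
        pk = next(self._pk)                  # Django-style auto primary key
        self.relational[pk] = {"text": text}
        self.vectors[pk] = self.embed(text)  # same PK on the Milvus side
        return pk


def toy_embed(text):
    # Hypothetical embedding: two character counts, just to have a vector.
    return [len(text), text.count(" ")]


store = DocumentStore(toy_embed)
pk = store.add("Directive on energy efficiency")
assert pk in store.relational and pk in store.vectors
```

In the real setup, the `add` body would be a Django save hook that also issues a Milvus insert with the same primary key.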
129
00:07:00.625 --> 00:07:05.095
Other technologies we use are Guardrails AI; Neo4j,
130
00:07:05.115 --> 00:07:08.975
because we use quite a lot of knowledge graphs;
131
00:07:09.435 --> 00:07:12.495
and also MediaWiki.
132
00:07:12.595 --> 00:07:14.935
I'm a Wikimedia member.
133
00:07:15.555 --> 00:07:19.855
And so we think that somehow it's a good way
134
00:07:19.915 --> 00:07:23.895
to use it in a semantic way.
135
00:07:25.135 --> 00:07:28.305
Okay, let's get to the main point.
136
00:07:29.325 --> 00:07:31.905
So we have a technical challenge in
137
00:07:31.905 --> 00:07:33.185
everything that you've seen.
138
00:07:33.845 --> 00:07:35.745
We need to evolve
139
00:07:36.455 --> 00:07:39.945
from basic classification — keyword-based, for example —
140
00:07:40.445 --> 00:07:45.225
to semantic understanding in domains that are very
141
00:07:45.275 --> 00:07:48.825
jargon-heavy, jargon-rich; they are also evolving
142
00:07:49.325 --> 00:07:51.425
and they're really deep and vertical.
143
00:07:51.925 --> 00:07:56.185
If you think about policy, each month
144
00:07:56.275 --> 00:07:58.585
there are new words
145
00:07:58.855 --> 00:08:01.465
that can be associated with a classification.
146
00:08:02.005 --> 00:08:05.185
And also, if you
147
00:08:06.545 --> 00:08:10.765
think about it in a vector space: if you take,
148
00:08:10.905 --> 00:08:12.685
for example, generic
149
00:08:12.905 --> 00:08:16.045
or general embedding models,
150
00:08:17.105 --> 00:08:21.525
the vectors of two fields
151
00:08:21.625 --> 00:08:25.125
of policy, two policy
152
00:08:25.655 --> 00:08:28.725
areas are very close in vector space.
153
00:08:29.745 --> 00:08:33.045
And also we have quite a lot of policy areas.
154
00:08:33.665 --> 00:08:36.285
So if you consider — I will show you in one
155
00:08:36.285 --> 00:08:38.725
of the next slides — that there are 32
156
00:08:39.365 --> 00:08:42.765
official policy areas in the European
157
00:08:43.275 --> 00:08:46.005
legislative system, you can see
158
00:08:46.005 --> 00:08:49.165
that we cannot rely on any zero-shot
159
00:08:49.545 --> 00:08:51.125
classification, for example.
160
00:08:51.585 --> 00:08:55.685
So we had to engineer a better solution.
161
00:08:59.195 --> 00:09:00.535
So, there are some
162
00:09:00.965 --> 00:09:04.495
good things about traditional document classification
163
00:09:04.605 --> 00:09:05.815
that we have inherited.
164
00:09:06.405 --> 00:09:10.655
Just to give a historical background — remember,
165
00:09:10.715 --> 00:09:14.415
before neural classification,
166
00:09:14.985 --> 00:09:18.735
there was pre-processing; vectorization (bag of words,
167
00:09:19.395 --> 00:09:22.975
TF-IDF, that is the sparse kind of search);
168
00:09:23.585 --> 00:09:25.175
named entity recognition —
169
00:09:25.315 --> 00:09:27.255
so you check out
170
00:09:27.585 --> 00:09:31.175
which are the corporate names, which are the place names
171
00:09:31.395 --> 00:09:36.055
or person names, et cetera — and some manual feature engineering.
172
00:09:36.675 --> 00:09:40.215
And on these features there has always been built
173
00:09:40.795 --> 00:09:42.975
some kind of classifier,
174
00:09:42.975 --> 00:09:45.975
which can be support vector machines, random forests,
175
00:09:46.515 --> 00:09:48.095
or other kinds of classifiers.
176
00:09:51.555 --> 00:09:56.005
Jumping to the last few years — and without
177
00:09:56.105 --> 00:10:00.445
taking a look at all the LSTM-based, for example,
178
00:10:00.905 --> 00:10:04.805
ways of classifying documents —
179
00:10:05.585 --> 00:10:07.605
we can see a few approaches.
180
00:10:07.785 --> 00:10:09.445
So if I have an unknown document
181
00:10:09.825 --> 00:10:12.765
and I have a corpus that is made
182
00:10:12.985 --> 00:10:17.725
of, let's say, 32 categories, what are my choices
183
00:10:18.105 --> 00:10:21.565
to classify this unknown document into one
184
00:10:22.025 --> 00:10:25.125
category — the most important one —
185
00:10:25.185 --> 00:10:28.485
or multiple categories, if there is an overlap,
186
00:10:28.585 --> 00:10:32.685
but also with a distance score from these categories?
187
00:10:33.485 --> 00:10:37.425
So, one option is to use embedding-based classification:
188
00:10:37.805 --> 00:10:41.105
converting the text to vectors
189
00:10:41.275 --> 00:10:45.225
using pre-trained models, calculating the
190
00:10:46.375 --> 00:10:50.255
category centroids for the training corpus,
191
00:10:50.675 --> 00:10:54.215
and then, for each new document, seeing to which
192
00:10:54.315 --> 00:10:56.695
of these centroids it is closest.
193
00:10:57.315 --> 00:11:00.055
But this means using
194
00:11:00.355 --> 00:11:03.415
a generic pre-trained
195
00:11:03.525 --> 00:11:05.455
embedding model from OpenAI
196
00:11:05.515 --> 00:11:07.935
or open source embeddings, for example.
197
00:11:08.245 --> 00:11:10.735
Or BERT-based ones.
198
00:11:11.755 --> 00:11:14.255
And we found out experimentally —
199
00:11:14.715 --> 00:11:17.015
so I also want to give a warning:
200
00:11:17.555 --> 00:11:20.015
everything we've done
201
00:11:20.285 --> 00:11:21.615
is really experimental,
202
00:11:22.195 --> 00:11:25.295
and actually, it's something that
203
00:11:25.565 --> 00:11:30.215
gave us the best results for our domain.
204
00:11:30.585 --> 00:11:34.575
On other domains it may work better or worse.
205
00:11:34.955 --> 00:11:37.655
So, in our case,
206
00:11:37.905 --> 00:11:40.935
this embedding-based classification
207
00:11:40.935 --> 00:11:42.415
didn't give us very good results.
208
00:11:42.515 --> 00:11:44.855
So we normally start with an input sentence,
209
00:11:45.345 --> 00:11:48.215
we pass it
210
00:11:48.215 --> 00:11:50.975
through an embedding model, and then we see
211
00:11:51.475 --> 00:11:53.925
to which centroid it is closest.
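The nearest-centroid scheme just described can be sketched with toy 2-D vectors standing in for real embeddings; the category names and numbers below are made up:

```python
# Embedding-based (nearest-centroid) classification: average the training
# vectors per category, then assign a new document to the closest centroid.
import math


def centroid(vectors):
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


training = {
    "climate": [[0.9, 0.1], [0.8, 0.2]],   # hypothetical document embeddings
    "finance": [[0.1, 0.9], [0.2, 0.8]],
}
centroids = {label: centroid(vecs) for label, vecs in training.items()}

new_doc = [0.85, 0.15]                      # embedding of the unknown document
label = max(centroids, key=lambda c: cosine(new_doc, centroids[c]))
print(label)  # climate
```

With a real model, the training vectors would come from the pre-trained embedder rather than being hand-written.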
212
00:11:55.055 --> 00:11:59.755
The other approach to classification
213
00:12:00.415 --> 00:12:02.595
could be fine-tuning transformers.
214
00:12:03.135 --> 00:12:07.035
So we give an input sentence to an encoder-only
215
00:12:07.635 --> 00:12:09.355
transformer, like BERT, for example.
216
00:12:09.855 --> 00:12:13.035
And then we put a classification head on top.
217
00:12:13.545 --> 00:12:15.155
This is normally the Hugging
218
00:12:15.345 --> 00:12:18.315
Face style of classification.
219
00:12:18.815 --> 00:12:23.315
So whenever we ask the Hugging Face
220
00:12:24.115 --> 00:12:25.915
Transformers library for a classification,
221
00:12:26.335 --> 00:12:28.355
it normally gives us a way
222
00:12:28.355 --> 00:12:32.035
to fine-tune just this classification head,
223
00:12:32.495 --> 00:12:36.235
normally by putting a fully connected layer
224
00:12:36.815 --> 00:12:41.195
on top of BERT, and that is what is going to be fine-tuned —
225
00:12:41.855 --> 00:12:43.955
just the classification head.
226
00:12:43.975 --> 00:12:46.795
Or there is the possibility of fine-tuning the whole system,
227
00:12:47.185 --> 00:12:48.195
down to BERT,
228
00:12:49.095 --> 00:12:52.115
but still, it requires extensive labeled data.
229
00:12:52.775 --> 00:12:55.915
And it's kind of a black box.
230
00:12:56.235 --> 00:13:00.635
I mean, I don't really know how BERT was trained.
231
00:13:01.135 --> 00:13:05.995
So, probably — we didn't know
232
00:13:06.065 --> 00:13:07.435
what was in there.
233
00:13:08.495 --> 00:13:10.875
We go to the next possibility —
234
00:13:11.695 --> 00:13:14.795
and I don't like this one too much: it's zero-
235
00:13:15.015 --> 00:13:18.915
and few-shot classification, by giving GPT or Claude
236
00:13:18.935 --> 00:13:21.115
or whatever, a huge prompt.
237
00:13:21.535 --> 00:13:25.835
And asking for structured JSON output,
238
00:13:26.455 --> 00:13:31.115
giving the possible classes
239
00:13:31.775 --> 00:13:34.195
as an enumeration in the prompt,
240
00:13:34.695 --> 00:13:36.795
and then parsing the response.
241
00:13:37.655 --> 00:13:40.675
Like in this case, I give a long prompt telling it
242
00:13:40.775 --> 00:13:43.875
to provide the answer in the given format.
243
00:13:44.575 --> 00:13:46.275
And then I give the input text.
244
00:13:46.855 --> 00:13:50.235
Why don't I like this? Because it's a complete black box.
245
00:13:50.565 --> 00:13:51.995
There is the risk of bias.
246
00:13:52.295 --> 00:13:54.835
And in our case, we have 32 categories.
247
00:13:55.055 --> 00:13:56.155
So we tested it
248
00:13:56.415 --> 00:13:58.355
and we found that actually,
249
00:13:58.935 --> 00:14:01.715
by splitting our initial corpus into a train set
250
00:14:02.215 --> 00:14:07.195
and a test set,
251
00:14:09.615 --> 00:14:14.155
the metrics — the accuracy, the confusion matrix —
252
00:14:14.535 --> 00:14:17.435
were really bad, actually.
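The zero-shot prompting approach just criticized can be sketched as follows. The model call is faked, and the class names and JSON schema are hypothetical examples — not the speaker's actual prompt — but they show the enumerate-classes, ask-for-JSON, parse-reply loop:

```python
# Sketch of zero-shot LLM classification: enumerate the allowed classes in
# the prompt, ask for structured JSON, and parse the reply with the stdlib.
import json

CLASSES = ["Agriculture", "Climate action", "Energy"]  # would be all 32


def build_prompt(text):
    return (
        "Classify the text into exactly one of these policy areas: "
        + ", ".join(CLASSES)
        + '. Answer as JSON: {"label": "...", "confidence": 0-1}.\n\n'
        + text
    )


def fake_llm(prompt):
    # Stand-in for a real GPT/Claude API call.
    return '{"label": "Climate action", "confidence": 0.8}'


reply = json.loads(fake_llm(build_prompt("The 2040 emission targets ...")))
assert reply["label"] in CLASSES
```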
253
00:14:18.615 --> 00:14:22.755
So we went for a hybrid approach — hybrid
254
00:14:22.755 --> 00:14:23.955
in that it actually uses Milvus.
255
00:14:24.895 --> 00:14:29.515
And we took, in a very experimental way, the best
256
00:14:29.735 --> 00:14:31.235
of the two worlds, somehow.
257
00:14:32.255 --> 00:14:35.835
So, from the traditional methods, we use KeyBERT.
258
00:14:36.175 --> 00:14:37.675
I will show you in the next slide
259
00:14:37.675 --> 00:14:38.755
what KeyBERT is about.
260
00:14:39.025 --> 00:14:43.835
It's a very thin algorithm,
261
00:14:43.855 --> 00:14:45.875
but it works really well algorithmically.
262
00:14:46.415 --> 00:14:47.755
We took TF-IDF —
263
00:14:47.935 --> 00:14:50.315
in fact, we're talking about hybrid search —
264
00:14:50.735 --> 00:14:51.965
and we use kNN,
265
00:14:52.225 --> 00:14:55.205
which is a simple, interpretable classifier.
266
00:14:55.985 --> 00:14:59.525
And from the neural approaches, we took embeddings —
267
00:15:00.185 --> 00:15:02.765
BGE-M3, which is the embedding model we are
268
00:15:02.765 --> 00:15:03.845
talking about today.
269
00:15:04.385 --> 00:15:07.925
And we use the Milvus vector database
270
00:15:08.225 --> 00:15:10.485
for efficient hybrid search.
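The kNN step named in this list can be sketched in a few lines. The neighbor list below is a made-up retrieval result, standing in for the top-k training documents returned by the hybrid search:

```python
# Sketch of kNN classification: the k nearest training documents, each
# carrying its category label and a similarity score, vote for the category
# of the new document by simple majority.
from collections import Counter

# Hypothetical retrieval result: (category, similarity) of the top-k neighbors.
neighbors = [("climate", 0.92), ("climate", 0.88), ("energy", 0.85)]

votes = Counter(label for label, _ in neighbors)
label, _ = votes.most_common(1)[0]
print(label)  # climate
```

Keeping the similarity scores around also allows reporting a distance-weighted confidence per category, which matches the "distance score" output described earlier.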
271
00:15:11.805 --> 00:15:16.305
So, the pipeline, seen from above:
272
00:15:16.965 --> 00:15:18.545
it takes a text.
273
00:15:19.315 --> 00:15:22.945
There is an initial language detection that we use
274
00:15:23.285 --> 00:15:27.505
to set the language of NLTK
275
00:15:27.685 --> 00:15:28.945
for stop word removal.
276
00:15:29.365 --> 00:15:32.345
But these are very common pre-processing steps.
277
00:15:33.745 --> 00:15:38.725
We give the sentence to KeyBERT, which extracts keywords.
278
00:15:39.025 --> 00:15:43.805
So KeyBERT is a special algorithm that uses BERT
279
00:15:43.905 --> 00:15:45.445
to extract
280
00:15:45.665 --> 00:15:50.045
the most important keywords, given a text.
281
00:15:51.025 --> 00:15:54.445
And we create a new, transformed version
282
00:15:54.585 --> 00:15:56.045
of the input text,
283
00:15:56.505 --> 00:15:59.365
and we give that version to the sparse vector,
284
00:15:59.865 --> 00:16:03.165
and we give the original, well, without stop words,
285
00:16:03.345 --> 00:16:05.925
but we could just as well give it
286
00:16:05.925 --> 00:16:10.245
with the stop words — we give the original sentence
287
00:16:10.305 --> 00:16:13.565
to be transformed into the dense vector.
288
00:16:14.225 --> 00:16:17.765
So this approach — well, first of all,
289
00:16:17.825 --> 00:16:19.365
it gave us the best results,
290
00:16:19.625 --> 00:16:21.445
and it's kind of a hybrid.
291
00:16:22.425 --> 00:16:25.565
I'll show you now how we work with KeyBERT.
292
00:16:25.825 --> 00:16:30.085
But please take into consideration that somehow, here,
293
00:16:30.775 --> 00:16:35.245
we're creating a new sentence that is really
294
00:16:35.305 --> 00:16:38.245
likely to be well treated
295
00:16:38.625 --> 00:16:41.285
by the TF-IDF algorithm.
296
00:16:41.705 --> 00:16:44.405
So we are kind of cheating, let's say.
297
00:16:44.665 --> 00:16:49.485
So we are augmenting the sentence to exploit
298
00:16:50.025 --> 00:16:51.525
how TF-IDF works.
299
00:16:52.075 --> 00:16:54.725
That is, it works with
300
00:16:54.785 --> 00:16:56.085
the term frequency.
301
00:16:56.305 --> 00:17:00.965
So we are somehow artificially inflating
302
00:17:01.425 --> 00:17:04.685
the term frequencies of the keywords we care about.
303
00:17:05.265 --> 00:17:09.885
And in this way, when we get to the
304
00:17:09.885 --> 00:17:12.565
kNN part, we got really good results,
305
00:17:12.565 --> 00:17:15.525
because the sparse TF-IDF version
306
00:17:15.705 --> 00:17:17.965
of this algorithm works really well.
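The "inflating term frequencies" effect can be verified with a toy TF-IDF. The two-document corpus and the smoothed IDF formula below are illustrative choices, not the speaker's implementation; real sparse scorers such as BM25 differ in detail, but the direction of the effect is the same:

```python
# Why repeating keywords "cheats" TF-IDF: term frequency rises, so the
# sparse-vector weight of that term grows. Raw-count TF, simple smoothed IDF.
import math

corpus = [
    "emission targets for member states",
    "banking regulation update",
]


def tf_idf(term, doc, corpus):
    tokens = doc.split()
    tf = tokens.count(term) / len(tokens)          # term frequency
    df = sum(term in d.split() for d in corpus)    # document frequency
    idf = math.log(len(corpus) / (1 + df)) + 1     # smoothed inverse DF
    return tf * idf


plain = "emission targets for member states"
boosted = plain + " emission emission"             # KeyBERT-style augmentation
assert tf_idf("emission", boosted, corpus) > tf_idf("emission", plain, corpus)
```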
307
00:17:18.465 --> 00:17:23.405
And as a spoiler: we are actually retrieving
308
00:17:23.405 --> 00:17:26.045
with a weight of 50% on the dense vector,
309
00:17:26.475 --> 00:17:29.765
which gives us the whole meaning of the sentence, of course,
310
00:17:30.305 --> 00:17:34.085
and 50% on TF-IDF, which in this case we use
311
00:17:34.085 --> 00:17:36.005
because we are sure that, in such
312
00:17:36.105 --> 00:17:39.645
a domain-specific jargon
313
00:17:39.745 --> 00:17:44.725
and language, we still want to use keywords.
314
00:17:44.875 --> 00:17:47.765
Because keywords in politics,
315
00:17:47.945 --> 00:17:52.085
in lawmaking — the names of the committees,
316
00:17:52.145 --> 00:17:54.725
for example — those are really important names.
317
00:17:54.985 --> 00:17:59.605
We cannot just retrieve documents
318
00:18:00.225 --> 00:18:03.845
for classifying them based on meaning alone.
319
00:18:04.665 --> 00:18:06.325
We still want to keep some meaning,
320
00:18:06.425 --> 00:18:10.325
but this is a way of highlighting the value
321
00:18:10.585 --> 00:18:14.245
of the specific keywords that belong
322
00:18:14.265 --> 00:18:15.485
to our domain.
323
00:18:17.145 --> 00:18:20.925
So what is KeyBERT? KeyBERT is a very
324
00:18:21.285 --> 00:18:24.085
straightforward technique for extracting
325
00:18:24.165 --> 00:18:26.325
keywords from a text.
326
00:18:26.905 --> 00:18:30.325
And it actually extracts
327
00:18:31.665 --> 00:18:35.325
the words that have the highest
328
00:18:35.905 --> 00:18:39.965
cosine similarity, in the single-word embedding,
329
00:18:40.065 --> 00:18:41.205
to the whole document.
330
00:18:41.355 --> 00:18:44.405
embedding. That means that it extracts the words
331
00:18:44.435 --> 00:18:47.085
that are the most representative
332
00:18:47.505 --> 00:18:49.765
of the document they are within.
333
00:18:50.305 --> 00:18:51.365
And those are the keywords.
334
00:18:51.825 --> 00:18:55.965
So this is very straightforward,
335
00:18:56.225 --> 00:18:57.685
as I said, and
336
00:18:58.465 --> 00:19:00.525
it can just be downloaded,
337
00:19:00.665 --> 00:19:01.925
it works really well,
338
00:19:02.225 --> 00:19:04.765
and, from our experiments,
339
00:19:05.025 --> 00:19:08.205
it always gives very high-quality keywords.
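The core idea behind KeyBERT, as just described, can be sketched with toy vectors. The embeddings below are made up (KeyBERT itself obtains them from BERT); the mechanism — rank candidate words by cosine similarity of their embedding to the whole-document embedding — is the same:

```python
# Sketch of KeyBERT's scoring: keep the words whose embeddings are most
# similar (highest cosine similarity) to the document embedding.
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


doc_embedding = [0.7, 0.7]
word_embeddings = {            # hypothetical single-word embeddings
    "climate":  [0.8, 0.6],
    "emission": [0.5, 0.8],
    "the":      [-0.9, 0.1],
}

ranked = sorted(word_embeddings,
                key=lambda w: cosine(word_embeddings[w], doc_embedding),
                reverse=True)
print(ranked[:2])  # ['climate', 'emission']
```

Content words land near the document vector; function words like "the" fall to the bottom of the ranking, which is exactly the behavior the speaker relies on.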
340
00:19:09.845 --> 00:19:13.385
So just to show you KeyBERT in action: this,
341
00:19:13.455 --> 00:19:16.145
this is part of a speech from the first system
342
00:19:16.175 --> 00:19:17.705
that I showed you, Stream Scope,
343
00:19:18.245 --> 00:19:21.545
and this is a commissioner-designate; it was the beginning
344
00:19:21.545 --> 00:19:23.025
of November when he was talking.
345
00:19:23.645 --> 00:19:28.425
So this commissioner-designate is designated
346
00:19:28.805 --> 00:19:29.905
for climate.
347
00:19:30.485 --> 00:19:34.345
Of course, he was speaking about climate, and this is,
348
00:19:35.115 --> 00:19:38.145
after treating it with KeyBERT, the output —
349
00:19:38.205 --> 00:19:39.425
the output of KeyBERT.
350
00:19:39.965 --> 00:19:44.515
So the most important keywords are: climate proposal,
351
00:19:44.615 --> 00:19:46.955
2015 emission proposal.
352
00:19:47.335 --> 00:19:50.835
And, as you see,
353
00:19:51.055 --> 00:19:53.795
it works quite well to extract keywords.
354
00:19:54.435 --> 00:19:58.845
Just one warning: we fine-tuned our version of BERT
355
00:19:59.185 --> 00:20:01.325
on the legislative documents,
356
00:20:01.595 --> 00:20:06.085
because in any case, it's an unsupervised kind of training.
357
00:20:06.705 --> 00:20:09.645
So we had the legislative texts tagged
358
00:20:10.265 --> 00:20:12.445
by people at the European Commission and
359
00:20:12.445 --> 00:20:15.285
the European Parliament — tagged very well; also
360
00:20:15.285 --> 00:20:18.485
because we did some,
361
00:20:18.865 --> 00:20:20.565
of course, data visualization.
362
00:20:20.665 --> 00:20:22.365
And we saw that
363
00:20:22.365 --> 00:20:25.405
the clusters were quite far away from each other.
364
00:20:26.305 --> 00:20:29.965
And we did the fine-tuning
365
00:20:30.545 --> 00:20:33.525
of BERT with masked language modeling.
366
00:20:33.915 --> 00:20:37.045
That is the kind of —
367
00:20:37.295 --> 00:20:40.965
not downstream task, sorry — training objective where
368
00:20:41.495 --> 00:20:44.445
words are removed from within the sentence.
369
00:20:44.825 --> 00:20:48.205
And that is actually one of the two objectives
370
00:20:48.675 --> 00:20:50.445
that BERT was trained for.
371
00:20:50.945 --> 00:20:54.445
So we really fine-tuned the core
372
00:20:55.065 --> 00:20:56.405
of BERT in this way.
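The masked-language-modeling objective just described can be sketched on the data-preparation side. This is a toy illustration, not the speaker's training code: tokens and the masking ratio argument are made up, and a fixed RNG keeps the run reproducible.

```python
# Sketch of MLM data preparation: a fraction of tokens is replaced by a
# [MASK] symbol, and the model is trained to recover the hidden originals.
import random


def mask_tokens(tokens, ratio=0.15, rng=random.Random(0)):
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < ratio:
            masked.append("[MASK]")
            labels.append(tok)      # target the model must predict
        else:
            masked.append(tok)
            labels.append(None)     # position ignored in the loss
    return masked, labels


tokens = "the commission proposed new emission targets".split()
masked, labels = mask_tokens(tokens)
assert len(masked) == len(tokens)
assert all(l is None or masked[i] == "[MASK]" for i, l in enumerate(labels))
```

Because this objective is self-supervised, any fresh batch of untagged legislative text can be fed in, which is what makes the periodic re-fine-tuning described later possible.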
373
00:20:56.985 --> 00:21:00.205
And since KeyBERT uses BERT,
374
00:21:00.875 --> 00:21:05.285
this gave us the possibility of using
375
00:21:05.745 --> 00:21:07.965
the new keywords that come up.
376
00:21:08.185 --> 00:21:11.725
So if we have recent documents, we know
377
00:21:11.725 --> 00:21:13.885
that BERT will know those words.
378
00:21:14.465 --> 00:21:18.005
And if one word is really important — because, of course,
379
00:21:18.005 --> 00:21:20.205
BERT saw it in context,
380
00:21:20.625 --> 00:21:25.245
and so the fine-tuning gave it a
381
00:21:25.765 --> 00:21:28.485
good meaning for our objective.
382
00:21:28.905 --> 00:21:31.245
So we know that BERT,
383
00:21:31.545 --> 00:21:34.845
if we do this fine-tuning, will always be updated
384
00:21:35.225 --> 00:21:38.245
and will extract the very important keywords, too.
385
00:21:38.865 --> 00:21:42.365
So the result of KeyBERT is
386
00:21:42.725 --> 00:21:45.445
keywords, plus the cosine similarity to the whole
387
00:21:45.725 --> 00:21:46.885
document of each word.
388
00:21:50.875 --> 00:21:54.765
Okay, so let's dive into how
389
00:21:54.865 --> 00:21:56.445
we did it —
390
00:21:56.445 --> 00:21:57.925
how we put this into action.
391
00:21:58.505 --> 00:22:01.365
As I said, there are 32
392
00:22:01.385 --> 00:22:03.245
categories in the EU —
393
00:22:03.785 --> 00:22:07.365
policy domains — and they keep evolving.
394
00:22:07.675 --> 00:22:11.725
That is, every month there are new acronyms, new names
395
00:22:11.955 --> 00:22:16.085
that appear
396
00:22:17.405 --> 00:22:18.465
in the conversation.
397
00:22:18.885 --> 00:22:22.665
So we wanted to build a system that would
398
00:22:23.175 --> 00:22:24.745
keep being updated.
399
00:22:27.695 --> 00:22:31.195
How did we train the system, in an iterative way?
400
00:22:31.975 --> 00:22:34.795
So first of all, taking all the documents
401
00:22:35.755 --> 00:22:36.895
of the previous slide.
402
00:22:36.955 --> 00:22:38.455
So we have 32 topic areas,
403
00:22:38.595 --> 00:22:41.015
and we have quite a lot of documents that belong
404
00:22:41.015 --> 00:22:42.255
to each of the topic areas.
405
00:22:43.075 --> 00:22:47.615
We fine-tune BERT, for using it with KeyBERT, on those
406
00:22:47.955 --> 00:22:50.855
documents,
407
00:22:50.855 --> 00:22:54.695
which are official, from the EU, from the European Parliament.
408
00:22:55.285 --> 00:22:59.175
Then we do some pre-processing — for example,
409
00:22:59.815 --> 00:23:02.655
considering that some documents can have multiple tags,
410
00:23:03.075 --> 00:23:05.615
we just keep the ones that have a single tag.
411
00:23:06.035 --> 00:23:07.855
And so we skip some of them.
412
00:23:08.635 --> 00:23:11.735
And then, through KeyBERT, we create
413
00:23:12.055 --> 00:23:14.015
a transformed document,
414
00:23:14.115 --> 00:23:17.255
by repeating the most important keywords found by KeyBERT.
415
00:23:17.955 --> 00:23:20.975
So for example, the segment that I showed you
416
00:23:20.975 --> 00:23:22.375
before —
417
00:23:22.795 --> 00:23:26.655
and this has been cut, it is truncated at the end —
418
00:23:27.115 --> 00:23:29.335
becomes something like this.
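The transformation just described can be sketched in a few lines. The keyword list and the repeat factor below are made up for illustration; the slide presumably shows the real augmented text:

```python
# Sketch of the document transformation: append the KeyBERT keywords several
# times so that the later TF-IDF (sparse) representation gives them extra
# weight.
def augment(text, keywords, repeats=3):
    """Append each extracted keyword `repeats` times to the text."""
    boost = " ".join(kw for kw in keywords for _ in range(repeats))
    return text + " " + boost


doc = "The committee discussed the 2040 targets."
keywords = ["climate proposal", "emission"]   # as KeyBERT might return them
augmented = augment(doc, keywords)
assert augmented.count("emission") == 3
```

A refinement would be to scale `repeats` by each keyword's KeyBERT similarity score, so stronger keywords get proportionally more weight.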
419
00:23:29.955 --> 00:23:34.775
So you can see that here we are somehow tricking
420
00:23:34.875 --> 00:23:39.335
the downstream TF-IDF system into giving more importance
421
00:23:39.715 --> 00:23:42.695
to keywords that we know semantically
422
00:23:43.285 --> 00:23:44.855
have more importance.
423
00:23:45.545 --> 00:23:48.165
So basically we are using TF-IDF
424
00:23:48.585 --> 00:23:50.645
as a counter, somehow,
425
00:23:50.795 --> 00:23:53.805
because we have already extracted quite a lot of meaning.
426
00:23:54.425 --> 00:23:58.085
But the cool thing is that, through the Milvus
427
00:23:58.105 --> 00:24:01.485
hybrid search system, if we pass this kind
428
00:24:01.485 --> 00:24:05.205
of pre-processed document, we can use
429
00:24:05.205 --> 00:24:07.005
of the box Milvus as it comes.
430
00:24:07.225 --> 00:24:10.845
So we can use hybrid search with 50% weight on dense
431
00:24:10.945 --> 00:24:15.245
and 50% weight on, uh, sparse embeddings.
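In Milvus this weighting is handled by its hybrid search ranker; the underlying idea, sketched in plain Python with made-up normalized scores, is a weighted fusion of the dense and sparse similarity lists:

```python
def fuse_scores(dense, sparse, w_dense=0.5, w_sparse=0.5):
    """Weighted fusion of per-document similarity scores (assumed normalized to [0, 1])."""
    ids = set(dense) | set(sparse)
    return {i: w_dense * dense.get(i, 0.0) + w_sparse * sparse.get(i, 0.0) for i in ids}

# Hypothetical scores: doc "b" is strong in both worlds, so it ranks first.
dense_scores = {"a": 0.9, "b": 0.8, "c": 0.1}
sparse_scores = {"b": 0.9, "c": 0.7}
fused = fuse_scores(dense_scores, sparse_scores)
ranked = sorted(fused, key=fused.get, reverse=True)
```

A document that is only strong in one of the two worlds (like "a") gets pulled down by its missing score in the other, which is exactly why hybrid retrieval favours documents close in both spaces.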
432
00:24:15.785 --> 00:24:18.565
And then we apply BGE-M3
433
00:24:18.945 --> 00:24:22.885
and we insert the training documents in Milvus in a special
434
00:24:23.565 --> 00:24:27.405
training collection that we actually keep, uh,
435
00:24:27.755 --> 00:24:31.445
keep feeding, uh, every 15 days.
436
00:24:31.465 --> 00:24:34.005
We have a process that takes the latest documents
437
00:24:34.585 --> 00:24:38.485
and retrains, so that we know that this BERT
438
00:24:38.635 --> 00:24:40.605
that we are using eventually
439
00:24:41.225 --> 00:24:44.645
for transforming these documents actually
440
00:24:45.475 --> 00:24:49.565
knows the meaning of these new tokens.
441
00:24:50.125 --> 00:24:52.885
I mean, even though BERT uses WordPiece tokenization.
442
00:24:53.145 --> 00:24:55.405
So acronyms are not a problem.
443
00:24:55.985 --> 00:25:00.605
Uh, with BERT, we found out experimentally
444
00:25:00.795 --> 00:25:02.005
that it was important
445
00:25:02.115 --> 00:25:06.325
because those tokens that are surrounded,
446
00:25:08.935 --> 00:25:13.145
like, as in normal embeddings, that are surrounded
447
00:25:13.245 --> 00:25:16.905
by that context, actually by fine-tuning
448
00:25:17.205 --> 00:25:19.865
we give them the meaning that they deserve.
449
00:25:20.245 --> 00:25:24.825
And so in the process of BERT that I showed you
450
00:25:24.825 --> 00:25:29.745
before, they are kind of enhanced for the
451
00:25:30.815 --> 00:25:34.585
special TF-IDF treatment that we do later.
452
00:25:35.445 --> 00:25:37.215
This is for the training system.
453
00:25:37.475 --> 00:25:39.495
So we end up with a training collection
454
00:25:39.495 --> 00:25:42.735
with all the documents with these two kinds of vectors:
455
00:25:42.915 --> 00:25:47.295
a sparse vector with the transformed document, a dense vector
456
00:25:47.405 --> 00:25:48.975
with the original document,
457
00:25:49.785 --> 00:25:52.285
and then how do we classify an unknown document?
458
00:25:52.825 --> 00:25:55.485
So we do exactly as we did before.
459
00:25:55.665 --> 00:26:00.645
So we use KeyBERT, our fine-tuned KeyBERT, to get the best,
460
00:26:00.785 --> 00:26:02.765
uh, keywords out of that document.
461
00:26:03.585 --> 00:26:06.005
Uh, we apply BGE-M3
462
00:26:06.865 --> 00:26:08.725
in the same way we showed before.
463
00:26:08.825 --> 00:26:11.445
So we have the dense vector on the original one,
464
00:26:12.205 --> 00:26:14.805
the sparse vector on the transformed document.
465
00:26:15.745 --> 00:26:18.005
Uh, we have a special API endpoint
466
00:26:18.035 --> 00:26:20.725
because we don't do this on the production server,
467
00:26:20.785 --> 00:26:24.245
of course, we have a GPU server with an embed endpoint
468
00:26:24.275 --> 00:26:27.405
that hosts this model.
469
00:26:28.145 --> 00:26:31.845
And then we retrieve the k closest
470
00:26:32.605 --> 00:26:36.525
documents using the vanilla, normal
471
00:26:36.945 --> 00:26:38.125
Milvus hybrid search.
472
00:26:38.945 --> 00:26:43.085
And, uh, since they are retrieved in order
473
00:26:43.385 --> 00:26:47.925
of distance, uh, documents
474
00:26:47.925 --> 00:26:50.965
that have been found to be closer
475
00:26:51.025 --> 00:26:54.045
in both the dense, uh, world
476
00:26:54.465 --> 00:26:58.125
and in the sparse world are retrieved first,
477
00:26:58.625 --> 00:27:02.005
of course, and they must have more influence.
478
00:27:03.145 --> 00:27:06.485
And we do a sort of voting, uh, such
479
00:27:06.485 --> 00:27:10.405
that the similarity score determines the voting power of each
480
00:27:10.425 --> 00:27:11.845
of the training documents.
481
00:27:12.265 --> 00:27:15.765
And each document is actually voting for its own category.
482
00:27:16.455 --> 00:27:20.075
So at the end we have a sort of weighted
483
00:27:20.595 --> 00:27:21.915
KNN you can call it.
484
00:27:22.455 --> 00:27:25.755
So we sum up the weighted votes for each category,
485
00:27:26.055 --> 00:27:27.235
and we choose the category
486
00:27:27.235 --> 00:27:29.275
with the highest total weight.
487
00:27:29.855 --> 00:27:31.395
And this is an example.
488
00:27:31.575 --> 00:27:34.275
So if we have k equals five,
489
00:27:34.615 --> 00:27:38.115
and we, we find that these are the first five documents,
490
00:27:38.255 --> 00:27:42.355
and we have these kinds of similarities, and since,
491
00:27:42.495 --> 00:27:45.235
of course they were labeled, we know that
492
00:27:45.955 --> 00:27:48.595
category A eventually wins with the weight
493
00:27:48.595 --> 00:27:53.195
of 2.69 versus category B with the weight of 1.65.
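The weighted voting described above can be sketched like this; the five similarity scores are made up, chosen only so that they reproduce the 2.69 vs. 1.65 totals from the example:

```python
from collections import defaultdict

def weighted_knn_vote(neighbors):
    """neighbors: (category, similarity) for the k retrieved training documents.
    Each document votes for its own category with its similarity as voting power."""
    totals = defaultdict(float)
    for category, similarity in neighbors:
        totals[category] += similarity
    winner = max(totals, key=totals.get)
    return winner, dict(totals)

# k = 5; hypothetical similarities picked to match the totals on the slide.
winner, totals = weighted_knn_vote(
    [("A", 0.95), ("A", 0.90), ("B", 0.88), ("A", 0.84), ("B", 0.77)]
)
```

Category A wins even though a category-B document is the second-closest neighbor, because the sum of A's voting power is larger.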
494
00:27:54.615 --> 00:27:58.995
And so this is the KNN part of this, uh, of this webinar.
495
00:27:59.935 --> 00:28:03.875
Um, and this gave us very good results.
496
00:28:04.375 --> 00:28:09.275
Um, I'll show you how we benchmarked it, actually. Um,
497
00:28:10.015 --> 00:28:13.635
another question could be: why BGE-M3?
498
00:28:14.595 --> 00:28:16.075
Actually, the experimental results.
499
00:28:16.375 --> 00:28:18.715
And, uh, you can find this article on medium.
500
00:28:19.295 --> 00:28:21.355
Uh, first of all, it's multi-language,
501
00:28:21.535 --> 00:28:23.475
and being multi-language is, uh,
502
00:28:23.815 --> 00:28:25.995
well, why is it multi-language?
503
00:28:26.145 --> 00:28:28.235
It's really good at most languages.
504
00:28:28.745 --> 00:28:32.075
It's actually surpassing, in mean reciprocal rank,
505
00:28:32.075 --> 00:28:35.955
that is the metric that, uh, shows how good, uh,
506
00:28:36.235 --> 00:28:40.435
a model is at retrieving the most relevant results.
507
00:28:41.135 --> 00:28:45.235
It surpasses most commercial, uh, embeddings.
508
00:28:45.375 --> 00:28:48.635
So OpenAI, for example, these are the two OpenAI models,
509
00:28:49.055 --> 00:28:53.235
and actually the mean accuracy
510
00:28:53.495 --> 00:28:55.675
in retrieving the correct
511
00:28:56.355 --> 00:29:00.475
documents with BGE-M3 is surpassing, in all languages,
512
00:29:00.895 --> 00:29:01.995
all the other options.
513
00:29:02.265 --> 00:29:04.035
This is why we use BGE-M3.
514
00:29:04.215 --> 00:29:06.955
And also because it was embedded in Milvus,
515
00:29:06.955 --> 00:29:10.675
and we actually built a lot of stuff on top of Milvus,
516
00:29:10.815 --> 00:29:13.195
so it was natural for us to use it.
517
00:29:15.285 --> 00:29:18.665
Um, last thing, how did we benchmark this?
518
00:29:19.175 --> 00:29:20.585
It's not exact science.
519
00:29:21.185 --> 00:29:25.585
I mean, uh, this was made, um, I mean, we are coders.
520
00:29:25.885 --> 00:29:29.705
We do the things
521
00:29:29.735 --> 00:29:33.865
that actually work well for us and for our specific domain.
522
00:29:33.865 --> 00:29:36.345
It worked really well. How did we do it?
523
00:29:36.645 --> 00:29:39.905
So basically we split our training data set
524
00:29:40.415 --> 00:29:42.245
into training and validation sets,
525
00:29:42.465 --> 00:29:44.805
and then we checked the confusion matrix,
526
00:29:45.225 --> 00:29:49.685
and actually this whole pipeline gave us a very, very good
527
00:29:50.325 --> 00:29:52.765
accuracy on all the 32 classes.
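The evaluation they describe (train/validation split, then a confusion matrix and per-class accuracy) can be sketched with the standard library alone; the class names and toy labels below are made up:

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows = true class, columns = predicted class."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy validation labels for three hypothetical policy classes.
y_true = ["energy", "trade", "energy", "health", "trade"]
y_pred = ["energy", "trade", "trade", "health", "trade"]
cm = confusion_matrix(y_true, y_pred, ["energy", "trade", "health"])
acc = accuracy(y_true, y_pred)
```

Off-diagonal cells (here, one "energy" document predicted as "trade") are exactly the confusions a policy specialist would review in the human-in-the-loop step.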
528
00:29:53.755 --> 00:29:55.085
Also, we have a human in the loop.
529
00:29:55.365 --> 00:29:57.525
'cause remember we have our policy specialists,
530
00:29:57.865 --> 00:29:58.965
and actually they were,
531
00:29:59.475 --> 00:30:02.405
they were really happy about how the algorithm works.
532
00:30:03.915 --> 00:30:05.695
We do have next steps.
533
00:30:05.975 --> 00:30:10.255
I mean, this system was made quite fast, also
534
00:30:10.255 --> 00:30:11.415
to create an MVP.
535
00:30:11.675 --> 00:30:13.295
And of course, I think that some
536
00:30:13.295 --> 00:30:15.655
of this complexity could be avoided
537
00:30:15.655 --> 00:30:19.735
by fine-tuning our own version of BGE-M3.
538
00:30:21.755 --> 00:30:24.295
And the other thing is that there is no out
539
00:30:24.295 --> 00:30:25.375
of domain detection.
540
00:30:25.595 --> 00:30:28.695
So if a text is talking about, I don't know,
541
00:30:28.855 --> 00:30:32.255
a cooking recipe, we will still be, uh,
542
00:30:32.615 --> 00:30:34.015
classifying it as policy.
543
00:30:34.475 --> 00:30:37.855
But the good thing is that everything we try
544
00:30:37.855 --> 00:30:39.215
to classify is policy.
545
00:30:39.475 --> 00:30:41.535
So we are always somehow in domain.
546
00:30:41.795 --> 00:30:45.175
So this could be something, uh,
547
00:30:45.695 --> 00:30:47.015
a nice to have for the future.
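One simple way the out-of-domain detection mentioned here could be added (this is my own sketch of the "nice to have", not something the team implemented) is to reject a text when even its closest training document is not similar enough:

```python
def classify_or_reject(neighbors, threshold=0.5):
    """neighbors: (category, similarity) pairs, sorted best-first.
    If even the best match is below the threshold, treat the text as out of domain."""
    if not neighbors or neighbors[0][1] < threshold:
        return None  # out of domain: e.g. a cooking recipe in a policy corpus
    totals = {}
    for category, similarity in neighbors:
        totals[category] = totals.get(category, 0.0) + similarity
    return max(totals, key=totals.get)
```

The threshold value is an assumption and would need calibrating on held-out in-domain and out-of-domain texts.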
548
00:30:49.095 --> 00:30:51.515
Um, I'll give you a small demo.
549
00:30:51.855 --> 00:30:56.475
Uh, this is the last thing, uh,
550
00:30:58.545 --> 00:31:01.085
for the broadcast, um, system.
551
00:31:02.505 --> 00:31:05.695
I need to, I need to change window.
552
00:31:06.475 --> 00:31:08.695
Um, there we go.
553
00:31:11.055 --> 00:31:12.355
So it's called the screen scope.
554
00:31:13.815 --> 00:31:17.015
And so,
555
00:31:21.825 --> 00:31:25.005
Hey, so this is the recording.
556
00:31:25.065 --> 00:31:27.165
It was made, um, basically:
557
00:31:27.165 --> 00:31:31.805
40 minutes after the end of this, uh, three-hour-long,
558
00:31:32.225 --> 00:31:36.965
um, speech, the whole pipeline had already,
559
00:31:37.825 --> 00:31:39.805
uh, created this classification.
560
00:31:40.305 --> 00:31:42.005
We do quite a lot of things
561
00:31:42.005 --> 00:31:45.525
because we segment into parts,
562
00:31:45.585 --> 00:31:47.405
we understand whether it's a question
563
00:31:47.465 --> 00:31:49.445
and there is a follow up answer.
564
00:31:50.225 --> 00:31:53.605
But, uh, regarding our webinar today, we are
565
00:31:54.605 --> 00:31:57.245
actually tagging into policy areas.
566
00:31:57.865 --> 00:32:01.325
So, uh, if I select, uh, that, uh,
567
00:32:01.355 --> 00:32:05.245
when he's talking about enterprise, I get just the segments
568
00:32:05.245 --> 00:32:06.925
that talk about enterprise.
569
00:32:07.305 --> 00:32:10.205
Of course we have also a full text search.
570
00:32:10.585 --> 00:32:15.125
We have summaries, we have, like, the possible other,
571
00:32:15.705 --> 00:32:19.205
um, topics that we talked about
572
00:32:19.305 --> 00:32:20.325
during this segment.
573
00:32:21.025 --> 00:32:24.445
And last thing, of course, there
574
00:32:24.655 --> 00:32:29.005
must have been a chatbot somewhere. We can ask, uh,
575
00:32:29.025 --> 00:32:31.125
the person, for example, what are
576
00:32:32.345 --> 00:32:35.315
your priorities as a commissioner?
577
00:32:37.545 --> 00:32:39.645
And also here there is some magic,
578
00:32:39.875 --> 00:32:44.385
because, um, the system will
579
00:32:44.945 --> 00:32:49.625
actually, uh, answer,
580
00:32:50.165 --> 00:32:53.385
uh, give a good answer to this question.
581
00:32:53.485 --> 00:32:57.185
So based on the hearing transcript from stemerman priorities
582
00:32:57.325 --> 00:32:58.745
are blah, blah, blah.
583
00:32:58.965 --> 00:33:03.025
But actually, as I said, we always want to ground our
584
00:33:03.405 --> 00:33:05.745
our answers in something that is real.
585
00:33:06.245 --> 00:33:08.865
So we also give the actual quotes from the commissioner.
586
00:33:09.525 --> 00:33:11.985
So basically I think that any stakeholder
587
00:33:12.175 --> 00:33:15.065
that is interested in knowing, considering
588
00:33:15.065 --> 00:33:18.545
that there were 80 hours of this video,
589
00:33:19.285 --> 00:33:21.305
any stakeholder that is interested in knowing
590
00:33:21.615 --> 00:33:25.625
what one person had to say about a particular topic
591
00:33:26.135 --> 00:33:27.225
will just go here.
592
00:33:27.865 --> 00:33:31.305
Actually, we see it through the Google Analytics
593
00:33:31.535 --> 00:33:34.305
that people use this part the most
594
00:33:34.655 --> 00:33:39.345
because it can answer really well to all the,
595
00:33:40.325 --> 00:33:42.225
all the precise questions.
596
00:33:44.445 --> 00:33:48.335
Um, and this was it.
597
00:33:49.545 --> 00:33:52.235
Cool. Thank you very much. Thank you. Thank
598
00:33:52.235 --> 00:33:53.235
you very much.
599
00:33:53.575 --> 00:33:57.555
Uh, yes, we had one question in the chat. Mm-hmm.
600
00:33:57.775 --> 00:33:59.995
Uh, which is: so which LLM are you using here?
601
00:34:00.215 --> 00:34:03.475
Is it like Gemini 1.5, since you're saying it's multimodal?
602
00:34:03.985 --> 00:34:06.315
Yeah, it is. It is Gemini. Yes,
603
00:34:06.385 --> 00:34:07.385
It's Gemini. Okay.
604
00:34:07.385 --> 00:34:11.235
Is it 1.5 then? Yeah. Okay. Yeah.
605
00:34:11.295 --> 00:34:14.475
Are you planning on moving to the 2.0 now that was released?
606
00:34:14.705 --> 00:34:15.915
Yeah, I saw it. It was yesterday.
607
00:34:16.305 --> 00:34:19.395
Yeah, I didn't have time today. That's what I saw.
608
00:34:19.395 --> 00:34:22.155
Incredible stuff. People talking to the UI
609
00:34:24.415 --> 00:34:25.795
And yeah.
610
00:34:25.795 --> 00:34:27.515
Maybe I have one on my end.
611
00:34:27.575 --> 00:34:30.675
So like, have you seen like a really big improvement
612
00:34:30.675 --> 00:34:32.155
with like hybrid search, for example, for you,
613
00:34:32.155 --> 00:34:33.555
was it like a night
614
00:34:33.555 --> 00:34:35.435
and day improvement, for example, by using it?
615
00:34:35.625 --> 00:34:39.075
Yeah, completely because uh, as I said, uh,
616
00:34:39.135 --> 00:34:42.515
vector-only search, uh, in this very deep,
617
00:34:43.075 --> 00:34:44.955
specific, uh, language, uh,
618
00:34:45.255 --> 00:34:48.715
the points in the multidimensional space are too close.
619
00:34:49.055 --> 00:34:51.235
So vectors are, like, too similar
620
00:34:51.495 --> 00:34:53.475
and you get somehow some random results.
621
00:34:54.015 --> 00:34:55.795
And at the same time, um,
622
00:34:56.585 --> 00:35:00.235
only sparse search was not the way to go.
623
00:35:00.495 --> 00:35:04.155
And this kind of hybrid search, we also did some, uh,
624
00:35:04.625 --> 00:35:07.315
grid search, uh, on the parameter of
625
00:35:07.615 --> 00:35:09.955
how much weight to give mm-hmm.
626
00:35:10.035 --> 00:35:11.275
to the two types.
627
00:35:11.615 --> 00:35:14.435
And we actually found out that 50/50 was the best.
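The grid search over the dense/sparse weighting can be sketched like this; `evaluate` is a hypothetical stand-in for running the whole classification pipeline on the validation set with a given dense weight (the sparse weight being its complement):

```python
def grid_search_dense_weight(evaluate, steps=11):
    """Try dense weights 0.0, 0.1, ..., 1.0 (sparse weight = 1 - dense weight)
    and keep the one with the best validation score."""
    best_w, best_score = 0.0, float("-inf")
    for i in range(steps):
        w = i / (steps - 1)
        score = evaluate(w)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score

# Stand-in evaluation whose optimum happens to sit at 0.5, echoing the talk's finding.
best_w, best_score = grid_search_dense_weight(lambda w: 1.0 - abs(w - 0.5))
```

In practice `evaluate` would re-run the hybrid search and weighted-KNN vote over the validation split and return accuracy, which is far more expensive than this toy function.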
628
00:35:15.025 --> 00:35:17.915
Okay. Just as an information? Yeah.
629
00:35:18.605 --> 00:35:20.415
Okay. Interesting. Um, yeah,
630
00:35:20.415 --> 00:35:21.975
and I had a follow up question, but I forgot.
631
00:35:21.975 --> 00:35:23.375
Oh yeah. With Neo4j.
632
00:35:23.755 --> 00:35:27.135
Uh, so how's it, like, do you work directly
633
00:35:27.135 --> 00:35:28.335
with GraphRAG, for example,
634
00:35:28.435 --> 00:35:31.855
or is it like, um, different systems working together
635
00:35:32.445 --> 00:35:34.695
with Neo4j? You're using Neo4j, right?
636
00:35:34.695 --> 00:35:35.975
So is it like GraphRAG-based
637
00:35:36.035 --> 00:35:38.925
or is it mostly like entity and then vector? Yeah.
638
00:35:39.065 --> 00:35:41.965
No, no, we do just entities actually. We're just,
639
00:35:42.465 --> 00:35:44.645
uh, starting using it now.
640
00:35:44.755 --> 00:35:49.125
Okay. We use, um, NuExtract,
641
00:35:49.355 --> 00:35:52.085
which is an LLM that is specialized, um,
642
00:35:53.195 --> 00:35:56.645
an open-source one, specialized in getting structured output.
643
00:35:57.185 --> 00:35:58.485
Mm-hmm. Well, to,
644
00:35:58.555 --> 00:36:02.685
because of course the problem is still text to graph.
645
00:36:03.085 --> 00:36:06.965
I mean, yeah, like, to get graph to text, it's done.
646
00:36:07.345 --> 00:36:09.285
But the text to graph, I think that
647
00:36:09.285 --> 00:36:11.765
despite a lot of people talking about it,
648
00:36:12.325 --> 00:36:16.005
I still haven't seen, like, a general implementation
649
00:36:16.005 --> 00:36:18.805
because of course it's like, I mean,
650
00:36:19.075 --> 00:36:20.365
it's really complicated
651
00:36:20.465 --> 00:36:22.285
to have a general implementation on that.
652
00:36:22.315 --> 00:36:23.315
Yeah.
653
00:36:23.675 --> 00:36:25.765
Okay. Cool. Perfect. Thank you.
654
00:36:26.125 --> 00:36:27.485
I think that was it for my questions.
655
00:36:28.065 --> 00:36:29.325
I'm just gonna wait quickly
656
00:36:29.385 --> 00:36:30.965
to see if anyone asks a question.
657
00:36:31.955 --> 00:36:34.175
Uh, but otherwise thank you.
658
00:36:34.395 --> 00:36:35.535
And for the people as well,
659
00:36:35.755 --> 00:36:38.015
and also the people that couldn't make it,
660
00:36:38.015 --> 00:36:39.375
like, everything will be shared online.
661
00:36:39.955 --> 00:36:41.775
Uh, everything will be shared on YouTube in a couple
662
00:36:41.775 --> 00:36:43.575
of days, we'll send it to our editors.
663
00:36:44.075 --> 00:36:46.535
Um, so then, uh, it's also available there.
664
00:36:47.115 --> 00:36:49.815
But I think that's it. Alessandro, thank you very much.
665
00:36:50.145 --> 00:36:51.535
Thank you for the presentation
666
00:36:52.355 --> 00:36:54.135
and see you next time everyone. Yeah.
667
00:36:54.195 --> 00:36:55.735
Bye bye. Bye-bye.