Time Series to Vectors: Leveraging InfluxDB and Milvus for Similarity Search
Webinar
What will you learn?
In this webinar, we’ll explain how the powerful combination of time series data and vector similarity search can revolutionize urban traffic management. Learn how to transform raw sensor data from InfluxDB into meaningful vectors, enabling advanced pattern recognition and anomaly detection using Milvus, a high-performance vector database.
Through a practical use case of real-time traffic monitoring, we'll demonstrate how this innovative approach can swiftly identify and categorize traffic anomalies, from accidents to construction zones. This webinar is essential for data scientists, traffic engineers, and urban planners looking to harness the full potential of their time series data for complex, real-world applications.
Topics covered:
- Fundamentals of time series vectorization: Converting InfluxDB data for vector database use
- Integrating InfluxDB and Milvus for a comprehensive traffic monitoring solution
- Implementing similarity search in Milvus to classify traffic anomalies
- Best practices for real-time data processing and anomaly detection
Today I'm pleased to introduce the session Time Series to Vectors: Leveraging InfluxDB and Milvus for Similarity Search. Our guest speaker is Anais. She will talk about time series to vectors. Anais is a Developer Advocate for InfluxData with a passion for making data beautiful with the use of data analytics, AI, and machine learning. She takes the data that she collects, does a mix of research, exploration, and engineering to translate the data into something of function, value, and beauty.
When she's not behind the screen, you can find her outside drawing, stretching, boarding, or chasing after a soccer ball. Welcome, Anais. The stage is yours. Thank you so much, Stefan. Welcome, everybody.
So today we're going to learn how we can transform time series data into vectors so that we can store them in Milvus and actually perform similarity searches on that time series data, so that we can handle time series data alongside any other type of unstructured data and combine them together for some imaginary use cases. This talk will include a quick demo running through a Jupyter Notebook showing how to actually convert the data. It will serve as an introduction to how you would do this, so that you can apply the same logic to whatever time series data you're actually storing and leverage it with something like Milvus. As Stefan mentioned, my name is Anais, and I am a Developer Advocate at InfluxData. InfluxData creates InfluxDB, which is a time series database and platform.
I want to encourage you to connect with me on LinkedIn and ask me any questions about today's presentation, about time series, time series forecasting, time series language models, statistics, databases: all things time series. So please feel free to reach out there and connect with me. Today we're going to first talk about InfluxDB and learn a little bit about what it is, with the assumption that if you're here for Milvus, you probably know a little bit about Milvus already. I'll also talk about time series databases in general. Next, we'll move on to comparing time series databases versus vector databases. And quite frankly, this is kind of comparing apples to oranges.
They're pretty different, but I think it can be useful to compare the two to better understand each one. Then we'll go into some various projects that you can try for yourself; I just want to introduce you to some resources and projects in case the time series forecasting and machine learning space interests you and you're looking to do something with time series, but maybe also something with your unstructured data, and see how you could fit both types of data into a project together. Then we'll go over a demo, and we'll talk about leveraging InfluxDB and Milvus for a similarity search.
For this, we'll use an imaginary use case where we're trying to find some anomalies within traffic data. Then we'll talk about some use cases for InfluxDB in the machine learning space, just so that you have a better understanding of some of the spaces that I work in more specifically. I don't think we'll go through the tools that you can use for tasks such as machine learning and data processing, but we might; that's a maybe. So, let's get started: an introduction to InfluxDB and time series databases.
In order to understand InfluxDB and time series databases in general, the first thing that you need to understand is: what is time series data? Time series data is any data that has a timestamp associated with it. The earliest and maybe simplest example is stock market data. One interesting thing about time series, one thing that makes it unique, is that a single value is not interesting like it is with relational data. Instead, you're looking at the trend of the data over time, and that helps you make decisions, like whether or not to buy or sell. That also makes time series interesting from a statistical perspective, because with most data on an x and y axis, you think of the y values, the observations, as being independent.
But in time series, they're actually dependent, and that has all sorts of repercussions and effects on how you analyze the data and what sort of forecasting and anomaly detection tools you need to use. Other examples of time series data include anything from the IoT space or the virtual world. In the IoT or sensor world, you can think of time series data coming from sensors measuring things like pressure, temperature, humidity, concentration, light, flow rate, et cetera. Anything that you're measuring about your physical world is IoT data, and that's all time series data.
And then we apply the same logic to the virtual world as well, with virtual concepts related to software development. So we think of application monitoring, infrastructure monitoring, DevOps monitoring, and just wanting to understand trends in time about our virtual or physical environment. We like to categorize time series data as two different types. The first is metrics, and metrics appear at a regular time interval.
The second is events, and events are unpredictable; we cannot really determine when an event will occur, but we can still store event data all the same. So in the healthcare space, your heart rate would be a metric, and a cardiovascular event like AFib or a heart attack would be an event. One interesting thing about the two types of time series data is that we can transform events into metrics. If we count how many events we have, let's say in an industrial IoT scenario where we're looking at how many machine faults occur daily, then this metric will be zero or more. And so we now get a metric from our event data.
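As a minimal sketch of that events-to-metrics idea (the fault log, column names, and machine IDs here are made up for illustration), counting an unpredictable event stream per day yields a regular metric:

```python
import pandas as pd

# Hypothetical event log: one row per machine fault, timestamped
faults = pd.DataFrame({
    "time": pd.to_datetime([
        "2024-03-01 08:15", "2024-03-01 17:40", "2024-03-03 02:05",
    ]),
    "machine_id": ["press_1", "press_1", "press_2"],
})

# Counting events per day turns the unpredictable stream into a regular metric;
# interior days with no faults (like 2024-03-02) count as zero.
daily_fault_count = faults.set_index("time").resample("1D").size()
print(daily_fault_count)
```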
This is something that InfluxDB is also very good at doing. Additionally, time series data occurs in really every industry. We see it in manufacturing, where, like I said, we're monitoring machines and doing things like predictive maintenance. We see it in renewable energy, from power consumption to production of that renewable energy. We see it in developer tools and APIs.
You know, think about how many times a user triggers a specific request. For example, in Kubernetes we have pod and application monitoring. We even see it in gaming applications; you can think of monitoring gaming servers, looking at trends in gamer activity, wanting to reduce latency. And then of course in network monitoring as well, including SNMP monitoring.
So we also see time series rising as a category, although the new cool kid on the block is now vector databases. But nonetheless, time series databases are also rising, just because time series data is part of so many different use cases. And the category really emerged because, originally, we had relational databases, and those are really great for orders and record retrieval, but they're not really good at handling requests for really large volumes of data. Then we got document databases, which are really high throughput and have that adaptive schema, but they are space- and resource-intensive. And then we have search databases, which are great for searching logs and other text-based time series data, but can be kind of slow for metrics.
So that's why we see the emergence of time series databases that are specifically meant to handle event data, metric data, logs, and traces, as well as spans. So what is a time series database, and what is it good at? What are some of the components or pillars that comprise a time series database? Well, the first is that it can accommodate and write timestamped data; that should be kind of a given. Every point in a time series database is associated with a timestamp. It's indexed and ordered by time so that you can query in time order and actually return your data with meaning. It also has really high write throughput. For example, in industrial manufacturing, maybe we're monitoring some machine or a belt, and we have an industrial vibration sensor. That has a really high throughput, something like 1 to 10 kilohertz.
So that's up to around 10,000 points per second that you are writing into your time series database. And if you are monitoring several pieces of equipment that have vibration sensors, let alone all the other sensors that you have in one single plant, you can see how you quickly have hundreds of thousands or maybe millions of points per second that you need to write. Additionally, if you can write a bunch of data to a time series database but you can't query it, it's not very useful. So you need to also be able to perform really efficient queries over time ranges. And you need to perform aggregations over those time ranges.
You want to be able to find things like local and global minimums and maximums. That means that you need to be able to perform really fast scans across your data as well, so that you can actually get those points returned to you. And then last but not least, though this is not unique to time series databases: scalability and performance. You need a scalable architecture, something that's designed to scale horizontally to handle any sort of increased load across distributed clusters of machines. And so this is what InfluxDB 3.0 looks like; this is our architecture diagram for 3.0. One fun fact is that the open source 3.0 is going to be out in January, so I'm really excited about that.
It's all built on the Apache ecosystem. It's built on DataFusion, and DataFusion is the query execution framework. It allows users to query InfluxDB with both SQL and InfluxQL; InfluxQL is just a SQL-like query language that's proprietary to InfluxDB.
DataFusion is also responsible for predicate pruning and pushdowns to make those queries really efficient. Then we also have Apache Arrow, and that's our in-memory columnar data format. And Parquet is our durable file format, which is also columnar. One reason that we are committed to being a part of the Apache ecosystem is to be a part of that open data architecture. That just means that we are able to have more interoperability with other tools that leverage things like Parquet, that leverage things like Arrow and the Arrow Flight clients, so that we can easily pull data out of InfluxDB and use it with other tools.
A lot of machine learning tools leverage Parquet directly and operate on Parquet files. And so unlike some data historians, where there's a lot of vendor lock-in, one of the main goals with InfluxDB 3.0 is to allow you to eventually pull or query Parquet files directly out of InfluxDB, so that you can leverage them with whatever tools you want to. Additionally, being a part of the Apache ecosystem means that we are continuously contributing to those upstream technologies and being a part of that larger community, which enables not only other companies to take advantage of the contributions that we've made, but also gives us interoperability with all the other tools that leverage them as well. So now let's talk about vector databases.
Hopefully you are familiar with vector databases already, since you're here for a webinar on Milvus. There are a lot of vector databases, and vector databases are definitely the new cool kid on the block. They are databases that specialize in handling vector embeddings, which are used in applications like machine learning, search engines, and recommendation systems to represent and compare really complex data: textual data, video data, et cetera. So we have Milvus, of course, which is an open source vector database, and it supports both approximate nearest neighbor (ANN) search and exact search. Then we have Pinecone; we have FAISS, which was created by Facebook; Annoy; and various other vector databases.
But not very many are open source, and I found Milvus very easy to use. Now I want to take a moment to compare and contrast time series databases versus vector databases, just so that we can have a broad understanding of some of the differences between the two and how they each work. Before I move on to that, I do want to take a quick second in case you aren't familiar with how vector databases work. Basically, the first step is representation. Whether you're working with data like text or images, you first transform that data into a vector. For text, that could mean that you're using some sort of model that converts each word or maybe a sentence into a vector based on the meaning of the words.
For images, it could mean extracting features like edges or colors or textures and representing those as a vector. Then you store those vectors in the vector database. And unlike traditional relational databases that store data in rows and columns, vector databases are optimized to handle the really high dimensionality of vectors. Then you use indexing to make the search more efficient. The database builds an index based off of these vectors.
Indexing is all about organizing the vectors in a way so that similar vectors are located near each other in the database's storage system, using special algorithms like ANN. And then finally you perform searches on that data. When you want to find data that is similar to the data that you're trying to query, like finding images similar to a particular image (think of facial recognition, for example), that query data is also converted to a vector. Then the database uses that index to quickly find the vectors that are closest to the query vector. And closeness here is how we measure distance; literally, we are doing something like a Euclidean distance.
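For a concrete sense of what "closeness" means here, this toy NumPy snippet (the vectors are invented for illustration) computes the Euclidean (L2) distance between a stored vector and a query vector:

```python
import numpy as np

stored = np.array([0.1, 0.8, 0.5, 0.3])  # a vector already in the database
query = np.array([0.2, 0.7, 0.4, 0.3])   # the vectorized query

# Euclidean distance: square root of the sum of squared component differences
distance = np.linalg.norm(stored - query)
print(distance)  # smaller distance means more similar
```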
So that's, in a nutshell, how vector databases work. But now let's talk about the difference between the two. Time series databases: I talked about these use cases already. They're for monitoring and real-time use cases, where we're looking at tracking data over time.
We're looking at monitoring service performance, financial market trends, or environmental sensor data. Some of the advantages are that they're optimized for time series data, allow you to query data over time, and handle really large volumes of sequential data effectively. Also, you can perform time-based aggregations and fast scans over the data to find minimums, maximums, counts, averages, et cetera. They also enable really fast inserts and queries. Vector databases have very different use cases.
You're performing similarity searches, so they're ideal for scenarios where you're trying to find items that are similar to a given query: image search, recommendation systems, or document retrieval based on content. They're also used for machine learning and AI, so they're useful in applications involving really complex data models where you have these embeddings or vectors and you are representing them in a high-dimensional space for tasks specifically like clustering and classification. So now I want to talk a little bit about how we use machine learning with time series databases and how we use machine learning with vector databases. Machine learning with time series databases would include things like forecasting: predicting future values based off of past values.
So, you know, you're forecasting sales, you're forecasting weather conditions, you're forecasting heart rate, et cetera. You also might be doing things like time series classification, where you're identifying categories or events based off of historical time series data, like trying to classify whether or not you have an anomaly in your machine data, for predictive maintenance, for example. Then we also have anomaly detection in time series data: in general, trying to detect unusual patterns which do not conform to the expected behavior.
So you're identifying spikes in network traffic that could maybe indicate a cyber attack, or any sort of anomaly that you might have in your time series data. And you're also frequently doing things like performing regression analysis, where you're modeling the relationships between variables over time. You might be looking at something like the impact of marketing spend on sales, and various correlation analyses as well. Machine learning with vector databases looks like performing similarity searches, which are used for recommendation systems, and clustering, to organize your data into groups based on their similarity, which can also be useful in things like customer segmentation, grouping similar documents, and finding related products.
And then anomaly detection as well: by measuring exactly how much a data point or a vector deviates from the norm of a cluster, you can identify outliers in your data set, and that can help you in detecting things like fraudulent transactions or unusual user behavior. You also perform things like nearest neighbor classification, or really any classification (that was just one very common example of a classification algorithm, like k-NN), to classify new data points based off of the most similar training examples that are stored in a vector database. So just to summarize here: vector databases and machine learning are best suited for applications where the core requirement is understanding the relationship between data points in a vector space, while time series and ML are best used for applications where data is inherently sequential and timestamped.
And the tasks revolve around understanding the trends, changes, anomalies, and forecasts over time. But that doesn't mean that we can't use them together. So I want to share with you some projects that you can try, particularly one about actually vectorizing time series data so that you can use it with something like Milvus, and in general talk about some use cases where you could use time series with Milvus. The very first thing that I want to show you is the Python client library, because if you are writing any data to InfluxDB, I'd highly recommend that you use it, especially if you are wanting to combine it with Milvus, because you'll want to do some data transformation to actually convert that time series data into vectors so that you can use it with Milvus.
What you would do here is first import the library, then specify your host, your org ID, your token, and your database, and instantiate the client providing those credentials. Then you would query your data, or similarly you could write it with the write method instead of the query method, passing your data frame into the data frame option, and you'd be good to go. This is also how, once you have your data in InfluxDB, you would query it out and then leverage it with Python to do some of the data transformation required to make vectors out of your time series data, so that you can use it in a similarity search with Milvus.
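Here's a minimal sketch of that write/query flow using the influxdb3-python client; the host, token, org, database, and measurement name are placeholders, and the exact code in the notebook may differ:

```python
from influxdb_client_3 import InfluxDBClient3

# Placeholder credentials; substitute your own
client = InfluxDBClient3(
    host="us-east-1-1.aws.cloud2.influxdata.com",
    token="MY_TOKEN",
    org="MY_ORG",
    database="traffic",
)

# Assume df is a pandas DataFrame with a "timestamp" column;
# write it as a measurement (table) called "generated_data"
client.write(
    record=df,
    data_frame_measurement_name="generated_data",
    data_frame_timestamp_column="timestamp",
)

# Query it back out as a DataFrame using SQL
result_df = client.query(
    query="SELECT * FROM generated_data WHERE time >= now() - INTERVAL '9 hours'",
    mode="pandas",
)
```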
And this is what the InfluxDB UI looks like. It's pretty easy to use. To create a bucket (a bucket is basically just like a database where you store your data), you go to the Buckets tab in the Load Data page, hit Create Bucket, and give your bucket a name and a retention policy. The retention policy determines when your data is automatically expired, because with time series data, you frequently only care about the current data; you don't care about older data.
So you want to be able to automatically expire data that is too old, and you'd include that retention period preference, you know, 30 days or whatever it might be. Give it a name, hit create, and you have a bucket. Then your org ID can be found right in the URL; that's the fastest way to find it, so that's what I always recommend.
And similarly, to create a token, you can just generate a token. You can create a custom token to scope it, to give read or write permission to particular buckets, up to you. So, that being said, let's talk about how we can transform time series into vectors. You can follow this QR code or the URL below; it will take you to the InfluxCommunity organization on GitHub.
The InfluxCommunity organization on GitHub contains a bunch of different examples for how to use various technology stacks with InfluxDB. So if you're interested in doing any time series forecasting or anomaly detection, or using InfluxDB with a variety of different tech stacks, I highly recommend that you come here. It's also where our client libraries are maintained. We have projects around, for example, creating a stock tracker demo; a coworker of mine just made this to track stock data and do real-time analytics with it. We have a whole bunch of Grafana examples.
We have fake factories that mirror a beverage manufacturing company and all of its bottling and sanitation machines, modeled almost directly from a real-life customer, at nearly real scale.
So that's another cool project we have. Some of my other favorite projects include getting started with MQTT and various simulators there. I recently made another project that shows how to use Kafka and Faust to simulate a PID controller for a continuous stirred tank reactor, which is a common type of chemical reactor used in chemical engineering and manufacturing, to both create a digital twin for monitoring and actually use a controller. So long story short, there's a bunch of different projects worth checking out here if you want to do anything with time series and get in the weeds. But let's talk about the Milvus project today, because that's what we're here for. Basically, to run this, it's really simple.
You're just going to clone this repository onto your machine, and I've done that already. I just need to run the Docker Compose file, so I'll share that with you: I'm just running docker compose up to start an instance of Milvus, and then I can create a new tab and simply run jupyter notebook to actually get that notebook up and running.
We're still on the other tab, by the way. Okay, thank you. And I will now share my screen to go to the right tab. Perfect. Yes.
So this is the notebook for running that, and I will go ahead and clear all the outputs real quickly. Perfect. Okay. So, basically, in this... also, is it big enough? Can you see it well? Yeah, it's fine, totally fine. Okay, thank you.
Okay. Yeah. So basically this notebook will just demonstrate a workflow to process sensor data. Actually, let me come back here. This is the imaginary use case that I created as an excuse to vectorize some time series data.
It kind of helps us better understand the relationship between time series databases and vector databases, and it can be helpful in this imaginary use case to contextualize how you might use the two together. So let's imagine that we're developing a platform that offers real-time traffic monitoring and pattern recognition to improve city traffic management. Specifically, we want to create a solution that identifies when traffic conditions are operating anomalously
and determines what type of anomaly we actually have present. In this example, we might be storing things like the speed limit and the actual positions of cars in traffic in InfluxDB Cloud. That's our actual sensor data, where we're getting things like vehicle count and speed, et cetera.
And then in Milvus, we are storing anomalies that we have seen previously. That way, when we detect that our data is out of the norm, maybe with really simple parameters that just say, hey, our normal statistics are off, we can take some of that time series data, vectorize it, compare it to some of the previous anomalies that we've seen in Milvus, and actually determine what type of anomaly we have. So is it an accident, is it just traffic, is it construction, et cetera. We could also imagine some other use cases where we would combine time series data with other types of data that are typically stored in vector databases. You can think of the healthcare space, for example, where maybe we are monitoring different elements of a patient: things like their heart rate, their oxygen levels, et cetera.
But then combining that with textual data based off of doctor's notes and the long-term health of a patient, and maybe even images, things from radiology, et cetera, to combine all that data together to make predictions about the state of a person and any upcoming illnesses that they might have, based off of the similarity between their health records and other patients'. So that being said, let's go back to the notebook. Okay. I talked about the imaginary use case for this.
That being said, this notebook is really just intended to get you started vectorizing time series data as quickly as possible, not to demonstrate with real data. So the very first step is to just import your dependencies. (One second here, I lost my screen.) I'm using pandas and NumPy here for the data manipulation and the numerical operations, and importing datetime so that I can work with timestamps and convert my timestamps to Unix time, basically normalizing the timestamps so I can actually write them into Milvus, because Milvus doesn't support a timestamp format; it supports types like float and integer.
So that's what you need to work with. We use matplotlib just to plot some of the data so we understand what we're working with. And last but not least, we're using pymilvus, which is the Milvus client library, to interact with the Milvus database. So I'm going to import all of that first, and then I'm going to define a function to generate some traffic data. Like I said, we're just generating some random speed limit, vehicle count, and average speed data so that we can build a data frame with those random values.
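A rough sketch of what that generator might look like; the column names, value ranges, and the third anomaly label are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

def generate_traffic_data(n_points: int = 200, seed: int = 42) -> pd.DataFrame:
    """Random stand-in for real sensor data, mirroring the demo's shape."""
    rng = np.random.default_rng(seed)
    timestamps = pd.date_range(end=pd.Timestamp.now(), periods=n_points, freq="1min")
    return pd.DataFrame({
        "timestamp": timestamps,
        "speed_limit": 65,  # constant posted limit, for context
        "vehicle_count": rng.integers(5, 100, size=n_points),
        "average_speed": rng.uniform(20.0, 70.0, size=n_points),
        # Pretend these readings were already classified in the past
        "anomaly_type": rng.choice(
            ["accident", "construction", "congestion"], size=n_points
        ),
    })

df = generate_traffic_data()
print(df.head())
```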
If you're running this yourself, you'll have something to work with that might look like the data you're working with, so that you can easily learn how to vectorize it yourself. Specifically, we have these timestamps here that we've generated (they're current), we have our vehicle count, and we can see that it's oscillating, as well as our vehicle speed. And here we already see an anomaly type, the idea being that we have already made some classifications that we have put into Milvus. So this is kind of what the data looks like that we might be writing into Milvus, so that when we get new data that just includes vehicle count, average speed, and maybe even video data, we can use it to perform a similarity search and then determine: do we have an accident, or do we have another type of anomaly? I think the three types of anomalies here are accident, construction, and one other one that I can't remember.
Next, we're just going to explore the data so we can see what it looks like. Here we have the blue line representing our vehicle count and the red line representing the average speed at different timestamps. And this is also how we write and query pandas data frames to InfluxDB. I mentioned earlier how you would query data from InfluxDB, which is what I would imagine you would do in this use case: query it, and then you'd have your data frame directly here instead.
The way that you would do that would be to query it right here with pandas. The way that you would write it is to just include your bucket name and pass in your data frame directly as your record. You can include a data frame measurement name; that's basically the same thing as specifying what you want your table to be called. I called it just generated_data.
And you can include any data frame tag columns you have. A tag is just where you have metadata about your time series data, but this is totally optional; it's more for the user to organize their data, and it doesn't matter on the database side, so you really don't have to worry about it. Then you can specify which column your timestamp is actually in, which is our timestamp column. Alternatively, if your data frame's timestamp is in the index, then you can just write it directly without specifying, because that's the default.
And then to query it, you would query it with SQL: you'd say something like, select all the data from the last 90 days or nine hours, whatever it might be, and then use the query method, passing in that query with return mode pandas. Ideally you'd get back some data that looks like this, minus the anomaly column. Okay, so now we're ready to actually create some sensor vector embeddings. This is where we're going to start grouping our data into time windows, and the purpose of that is to prepare it for the vector embeddings.
So what we're going to do is use a sliding window approach with a window size of 24 and a step size of 10. That means each window contains 24 data points, and the next window starts 10 points later. We're going to output a new data frame called embeddings_dataframe that contains the start and end times for each window and the corresponding average speed vectors and their anomaly types. So here I'm just defining those windows.
Then I'm iterating through the windows, extracting the column values for each one, and creating a new data frame from that collected data, so that I end up with an embeddings data frame that looks like this.
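A minimal sketch of that sliding-window step, using the window and step sizes from the talk (taking each window's label from its first row is an assumption):

```python
WINDOW_SIZE = 24  # data points per window
STEP_SIZE = 10    # offset between consecutive windows

rows = []
for start in range(0, len(df) - WINDOW_SIZE + 1, STEP_SIZE):
    window = df.iloc[start : start + WINDOW_SIZE]
    rows.append({
        "start_time": window["timestamp"].iloc[0],
        "stop_time": window["timestamp"].iloc[-1],
        "vector": window["average_speed"].to_list(),
        "anomaly_type": window["anomaly_type"].iloc[0],  # assumed labeling
    })

embeddings_dataframe = pd.DataFrame(rows)
```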
So here we basically have our vectors: sliding windows of time series data with their start and stop times. And now I have the task of normalizing the sensor values. The purpose of normalizing your sensor values is to take your average speeds and make them a value between zero and one for consistent comparison and analysis. Basically, if all the values in a vector are identical, then the normalized result will be a vector of all zeros; otherwise, each value is scaled based off of the minimum and maximum values in the vector. Essentially, for the normalized vector, we take each original value (the average speed), subtract the minimum value in the vector from it, and then divide by the maximum value minus the minimum value. This ensures that the smallest value in the list becomes zero and the largest value in the list becomes one. We also do some edge-case handling in case we'd have to divide by zero. But what's the point of this? Normalization is really crucial when you're working with vector databases like Milvus, to ensure that you can compare your time windows effectively, because different time windows might have different scales or different average speeds.
What we're really trying to do here is not just determine whether or not a particular value is out of the ordinary, because we might be comparing what traffic looks like across a bunch of different parts of cities, with different time zones, speed limits, et cetera. What we really want to do is normalize our time series data so that we can focus not on the absolute values of our speeds, but on the actual shape, so we can compare what the actual shape of a vector looks like. Essentially, when we take a chunk of our time series and put it in that multidimensional space, we're creating a multidimensional shape for that time series.
Without normalization, if we compare vectors with vastly different ranges, it wouldn't necessarily yield meaningful results. So hopefully this also improves our search accuracy, especially because we're relying on distance measures like Euclidean distance. If the vectors aren't normalized, then really large values in one vector would disproportionately affect the search results. So normalization also helps ensure that each vector contributes equally to the similarity calculation. And, like I mentioned, it also handles intentional variability, but also just any real-world variability.
Some periods of traffic might have really low variability if you have steady flow, while others might have really high variability if it's stop-and-go traffic. So normalizing the vectors also helps balance these differences. We create a function to do that math on our vectors, and now we can see that we have our time series data on a scale of zero to one.
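That min-max normalization might look like this, with the divide-by-zero edge case handled by returning all zeros:

```python
def normalize_vector(values):
    """Min-max scale a window to [0, 1]; a constant window maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)  # edge case: avoid dividing by zero
    return [(v - lo) / (hi - lo) for v in values]

embeddings_dataframe["vector"] = embeddings_dataframe["vector"].apply(normalize_vector)
```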
Last but not least, we need to convert our timestamps to Unix time. The reason we need to convert them to Unix is that Milvus doesn't allow you to write timestamps, and specifically we need a format that's useful for performing searches in those queries. So we're basically casting our timestamps to Unix time and casting that to an integer. So there we have our Unix-precision time, and now we actually need to convert our embeddings data frame into entities so that we can prepare to write them to Milvus. Basically, we iterate through every row and column in our data frame to do that.
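A sketch of the timestamp casting and entity preparation (column-wise lists are what the pymilvus insert call below expects; the exact structure in the notebook may differ):

```python
# Milvus fields are numeric, so cast timestamps to integer Unix seconds
embeddings_dataframe["start_time"] = (
    embeddings_dataframe["start_time"].astype("int64") // 10**9
)
embeddings_dataframe["stop_time"] = (
    embeddings_dataframe["stop_time"].astype("int64") // 10**9
)

# Column-wise lists, in the same order as the collection schema
data_to_insert = [
    embeddings_dataframe["vector"].to_list(),
    embeddings_dataframe["start_time"].to_list(),
    embeddings_dataframe["stop_time"].to_list(),
    embeddings_dataframe["anomaly_type"].to_list(),
]
```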
And so this is what an entity would actually look like: we have our vector field, which is all of our normalized time series data, and we also have our start time, our stop time, and our anomaly type. So now we're ready to store our entities in Milvus. Here we specify the vector field, which is the normalized vector of average speed values, and the start time and stop time fields, which are just the start and end times for each window.
And the anomaly type field is going to be the type of anomaly. So here we define the schema for our collection by defining those fields as well as their types. Then we create a collection schema, define our collection and give it a name, and check whether or not the collection already exists, to make sure there aren't any conflicts. There shouldn't be any if you just pulled the repo and booted up the Docker container for the first time. And then you specify your index parameters as well, to perform the search on.
We'll be using L2 specifically, which is the metric for Euclidean distance. That being said, Milvus has a bunch of other metric types for the embeddings: there's inner product, there's cosine similarity, so you have options. But I think Euclidean is usually what most people start with. And now we can see that we have a successful insertion of our data, and we can also list our collections to make sure that the collection, which we named example_collection, has in fact been created.
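Here's a minimal sketch of those schema, index, and insert steps with the pymilvus ORM API; the field names and the IVF_FLAT index choice are assumptions for illustration:

```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections, utility,
)

connections.connect(host="localhost", port="19530")  # the docker-compose Milvus

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="vector_field", dtype=DataType.FLOAT_VECTOR, dim=24),
    FieldSchema(name="start_time", dtype=DataType.INT64),
    FieldSchema(name="stop_time", dtype=DataType.INT64),
    FieldSchema(name="anomaly_type", dtype=DataType.VARCHAR, max_length=64),
]
schema = CollectionSchema(fields, description="Normalized traffic-speed windows")

# Avoid conflicts if the collection is left over from a previous run
if utility.has_collection("example_collection"):
    utility.drop_collection("example_collection")
collection = Collection("example_collection", schema)

# Index the vector field with L2 (Euclidean distance) as the metric
collection.create_index(
    field_name="vector_field",
    index_params={"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}},
)

collection.insert(data_to_insert)
collection.flush()
print(utility.list_collections())  # confirm example_collection exists
```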
So now we're ready to actually perform our similarity search. Here we're going to search for the vectors in Milvus that are most similar to our query vector, using that Euclidean distance. The vectors_to_search line specifies the vector that we want to use as the query vector for this search; specifically, we're selecting the second vector from the data_to_insert list, which was generated earlier, to perform our search.
It just represents the average speed that we had before. And we can see that, because we used the same vector that we already wrote into the collection as our query, we're looking at literally the same values. So it makes a lot of sense, and it's very confirming to us, that our similarity search is working as expected, because we get a distance of zero. And then, similarly, we have some higher distances for the vectors, the time series data, that came right after.
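A sketch of that search call; selecting data_to_insert[0][1] as the query assumes the column-wise layout from the earlier sketch:

```python
collection.load()  # load the collection into memory before searching

query_vector = data_to_insert[0][1]  # reuse the second stored vector as the query

results = collection.search(
    data=[query_vector],
    anns_field="vector_field",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=3,
    output_fields=["start_time", "stop_time", "anomaly_type"],
)

for hit in results[0]:
    # The identical stored vector comes back first with distance 0.0
    print(hit.id, round(hit.distance, 4), hit.entity.get("anomaly_type"))
```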
I'm also listing some of those IDs and returning them as well, so that I can graph a comparison of some of the vectors that we searched for: the top three results that came back. And we can see that our search vector is overlapping with our first result because it is the same vector. So of course it's going to return the same vector to us, because it performed the search correctly, and the one with the closest similarity is the identical one. And then we see some other time series data that it brought back to us that was similar.
You can see there are definitely areas where the trends are kind of similar. Of course, we can also compute a mean absolute error to quantify the similarity between, let's say, our search vector and vector two. And when we do, we actually get a pretty high mean absolute error. This is for the second most similar result that was returned. And you might go, oh no, why is that so high? But if you remember, we were generating completely random data, so this is actually also confirming:
it would be statistically very improbable, with a similarity search across random data, to return data that is truly similar. So this confirms things in two ways: the first vector, the same one that we're querying for, returns a distance of zero, and the other ones, which are random, return a mean absolute error that's pretty high, indicating that they are not similar, because they are random data.
So this really simple tutorial just walks you through how you can convert time series data into vectors. Try it out for yourself and get a feel for it; then I would recommend that you actually include your own time series data and perform some similarity searches, and see how there are actually some similarities in the overall trends of your data, so you can effectively collect and almost classify some of the shapes of your time series data. And then, last but not least, it's just good protocol, if you're not going to continue using the collection, to delete any collections that you have and disconnect from the database. So yeah, that's essentially the demo for how to do that.
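That cleanup step is just a couple of calls, continuing the pymilvus sketch above:

```python
# Tidy up when you're done experimenting
utility.drop_collection("example_collection")
connections.disconnect("default")
```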
I'll briefly walk you through, just as an FYI, some fun other projects that you can do with InfluxDB, in case this encouraged you or got you inspired to work with some time series. This one is called Saving the Holidays, and basically it's an example of using HiveMQ with MQTT data and Quix to look at some machine data from a factory floor and perform some predictive maintenance on it. What's cool about this project is that it also leverages Hugging Face and allows you to scale this project and this architecture for a real-world use case. So while you can run all of this locally, you could easily use this same architecture for your own projects. And it goes through classifying complicated anomalies with nearest-neighbor methods, which are also used in vector databases for classification.
So it's kind of a cool overlap. I talked about this other project already, so I won't go through that. And actually, we're getting short on time, and I want to give people time to ask questions, so I will just talk about some resources before we finish here.
The first one is that I want to encourage you to sign up for InfluxDB Cloud. We have a free cloud trial; I showed you what the UI looks like, and you can get started really easily writing any data with Python or a variety of other methods. Then we also have our blog, which is a great resource if your preferred way of learning is through blogs and not webinars. There are blog posts for almost every project that is in the InfluxCommunity GitHub org, so you can always look there and find corresponding posts on our blog.
Our documentation is great. We also have InfluxDB University, and our community Slack and forums are a fantastic resource for asking any questions that you might have, all things time series and InfluxDB. So with that, I'll leave the floor and ask: do we have any questions? Thank you very much, Anais. That was actually very interesting.
Thank you also for the cool demo. We have two questions so far. One is: generally, we would get an embedding model to generate the embeddings, which are then saved in a vector database to do similarity search. For time series data, how is that process covered? Could you spend some time on that? I personally haven't used an embedding model to generate them; I think this approach is kind of what you could use instead of an embedding model.
I think embedding models are oftentimes used for much more complicated unstructured data, when you have things like video and text, things where you need to perform complex extraction. But this is what I've found for time series specifically.
So for time series you would say we actually don't need embedding models directly, right? Yeah, I mean, I'm not sure; I haven't used one. Okay, cool. There was another question that got deleted.
But I have one as well. Also, first, a friend of mine would be very happy, because he is creating proteins using gen AI and vector search, and he has a lab, and it seems like with InfluxDB you could do detection, for example, when you have problems in the lab. So I might give him a word about that. Ah, the question is back: in the columns, if I have number entries, those should be fine, but what if the entries are some text? Yeah, I think you're then going to have to use an embedding model for the text specifically, and then combine them.
Okay. And I'll ask one on my end: what other tools can be used for data processing, anomaly detection, or forecasting with InfluxDB? Oh, sure. I can actually go back and talk about some of those. Give me one second.
One second to just pull up some slides here. Oh no, lost the slides. I know, I lost them too. Thank you. So we recommend a lot of different tools.
The first, for InfluxDB specifically, would be Quix, which is a streaming platform that's all built on top of Kafka for data processing. It's all abstracted with Python and a UI, so you don't actually have to worry about configuring Kafka, but it's highly scalable. They have a cloud version, and they're releasing a community edition as well.
That's what this whole project is based off of: it uses Quix source plugins for MQTT to get data from an MQTT broker, and Quix has plugins for Hugging Face as well, so that you can apply any machine learning models that you might want. There's also Mage, which considers itself the open source alternative to Airflow; it's an open source ETL tool that's very easy to use. And in that same organization there are some examples for how to do anomaly detection with InfluxDB and Mage using half-space trees, for example.
So you can try that out for yourself as well. There's also AWS Fargate; you can use that to do any sort of task scheduling that you might want. I would imagine that would be more for things like micro-batching, downsampling, or creating micro-batch forecasts. But again, you can try that for yourself. And then Bytewax is another alternative.
It's another open source tool that's used for data processing pipelines. I actually did not include a link for that, but if you look at the blog and just search for Bytewax and InfluxDB, or you look here, you can also find a project for using Bytewax with InfluxDB for some data processing. I can also recommend Kafka and Faust, using those together. Faust is also open source and community maintained, and you can use it to do any sort of stream processing that you might want, and include any sort of anomaly detection or forecasting. And then, in general, there's one fun thing that's happening in tech right now, in machine learning and AI: LLMs.
And that doesn't exclude people using language models for time series specifically, where you're tokenizing the time series instead of words. There are a bunch of different models that do this. Google has one, there's Chronos, there are like five different ones right now and I can't remember all of their names. Tiny Time Mixers is another one, and AutoLab has one.
But basically, those are really cool to look into because they provide zero-shot forecasting, which means that you don't have to train on your data at all; you can simply borrow those weights from Hugging Face, apply them to your time series data, and just get a forecast. So it's kind of like you don't have to have any forecasting knowledge or expertise. The only hard part about using them is actually using them, because they require very specific environments and there isn't a lot of documentation around how to deploy and use them. So yeah. Cool.
Thank you. And we have another question, which is very interesting: would this be useful for crypto prediction, would you say? For what? Crypto, like Bitcoin, or financial prediction in general. Yeah, I think that's the idea behind time series LLMs. They also really perform well if you are doing multivariate analysis.
So if you were looking at, you know, stock trends and NLP and a bunch of different time series data altogether, that's where they really shine. If you're trying to do short-term univariate time series predictions, that's where statistical methods still have an advantage. But the fact that you can now use some of these models to just forecast: it's just wild that there's a model that can forecast very accurately any type of time series data, if you just give it a good enough window and your data exhibits some sort of seasonality.
And it works across all sorts of domains and all different types of time series, so they're looking really promising. We've definitely reached a point in the time series space where machine learning methods are consistently outperforming statistical methods. Now, whether or not you have the team and the ability to maintain that, and whether it's worth it to you, there's a cost-benefit analysis of all that maintenance and domain expertise versus using a statistical method that's easy to use and doesn't require a lot of resources. That's a whole other issue.
Yeah, it's been interesting stuff in the last two years. Okay, cool. Thank you. And a last one to conclude: how do we combine the analytical results from the time series data with the results from vector search? Is it based on the timestamp of the matching result from both sides? I guess that's referring to the demo. It was based on the actual shape of the time series.
Okay. So when we normalized the data, we basically created a multidimensional shape based off of that time series. And then we asked which of our new search, or new time series, looks the most like that previous shape, based off of the Euclidean distance. Okay. But then we included the timestamps in the index so that we could say, oh, this anomaly is happening at this time.
It's just additional information that might be useful to the end user. Cool. Thank you. I think we don't have any more questions. Thank you again, Anais, for the presentation.
That was really cool. Also, kudos on the demo. Thank you everyone else that attended; we'll share the recording in a couple of days. It will also be on YouTube, and you can follow us on socials. Anyway, thank you, everyone.
Have a lovely morning, afternoon, or evening, wherever you are. Bye. Bye, thank you.
Meet the Speaker
Anais Dotis-Georgiou
Lead Developer Advocate at InfluxDB
Anais Dotis-Georgiou is a Developer Advocate for InfluxData with a passion for making data beautiful with the use of Data Analytics, AI, and Machine Learning. She takes the data that she collects, does a mix of research, exploration, and engineering to translate the data into something of function, value, and beauty. When she is not behind a screen, you can find her outside drawing, stretching, boarding, or chasing after a soccer ball.