
What Are Vector Embeddings? A Concise Guide
What Are Vector Embeddings?
Vector embeddings are numerical representations of unstructured data, expressed as arrays of numbers in a multi-dimensional space. These vectors can reveal connections or correlations between data objects that are often complex and not obvious from the raw data. The unstructured data might be text, images, audio, or video. The primary function of vector embeddings is to convert large and complicated data into a machine-readable format that machine learning algorithms can process and analyze easily.
Say, for instance, that you have a lot of text documents. With embeddings, you can map every word in the documents to a numerical vector in a multi-dimensional space. In this space, similar words or concepts sit close together and dissimilar ones sit further apart. For example, in word embeddings, the vectors for “red” and “blue” lie close to each other because the words are semantically similar (both are colors).
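To make "closer" and "further apart" concrete, here is a minimal sketch that compares vectors with cosine similarity, one common similarity measure. The three-dimensional vectors are hand-picked, hypothetical values for illustration only; real word embeddings come from a trained model and have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, chosen by hand (not produced by any model).
red = [0.9, 0.1, 0.2]
blue = [0.8, 0.2, 0.3]
banana = [0.1, 0.9, 0.7]

print(cosine_similarity(red, blue))    # close to 1: the vectors point the same way
print(cosine_similarity(red, banana))  # much lower: the vectors point apart
```

A value near 1 means the two vectors point in almost the same direction, which is how "red" and "blue" would show up as related.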
How to Generate Vector Embeddings
You generate vector embeddings by leveraging a machine-learning model trained on large datasets. The model converts each unit of information into a vector of numbers so that the distance between two vectors signifies how similar they are semantically.
- The first step is selecting the data and deciding which variables will form vectors. These may include words in a text corpus, visual imagery, and user preferences.
- Next, extract relevant features from the data using techniques like tokenization, stemming, and stop word removal for text data. Consider using convolutional neural networks (CNNs) to recognize and extract image features for image data.
- After preprocessing the data, input it into a machine learning model such as Word2Vec, GloVe, an OpenAI embedding model, or another deep neural network model. During training, the model learns to create vector representations for every item in the dataset. The model adjusts these vectors to minimize the distance between similar items while increasing the distance between different ones.
- The result is a multidimensional vector space where a specific vector represents each item in the dataset. Similar items are closer in this space, and dissimilar items are further apart.
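The text preprocessing step above can be sketched roughly as follows. This is a minimal illustration, assuming a tiny hand-written stop word list and a simple letters-only tokenizer; the names `STOP_WORDS`, `preprocess`, and `build_vocabulary` are hypothetical, stemming is left out, and real pipelines typically rely on an NLP library instead.

```python
import re

# Tiny illustrative stop word list; real pipelines use a much larger one.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "in", "on"}

def preprocess(text):
    """Lowercase the text, split it into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_vocabulary(documents):
    """Map each distinct token to an integer id, in first-seen order."""
    vocab = {}
    for doc in documents:
        for token in preprocess(doc):
            vocab.setdefault(token, len(vocab))
    return vocab

docs = ["The ice is cold.", "Water and ice in a glass."]
print(preprocess(docs[0]))     # ['ice', 'cold']
print(build_vocabulary(docs))  # {'ice': 0, 'cold': 1, 'water': 2, 'glass': 3}
```

The integer ids are what an embedding model would then look up or train on; the model itself is not shown here.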
Types of Vector Embeddings
There are different vector embeddings to address different data and tasks. Here are some common types of vector embeddings.
Word Embeddings
Word embeddings represent words as numerical vectors in a continuous vector space. They are trained on large text corpora like Wikipedia using models like Word2Vec, GloVe, or fastText. These models use different mechanisms to represent each word as a vector of numbers. Word2Vec generates word embeddings by training a model to predict a word from its neighboring words (or the neighbors from the word) across a very large text corpus. In doing so, it captures linguistic relations among words. For example, it may represent "king" and "queen" as vectors close together in the space, showing their semantic similarity.
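As a rough illustration of the training signal behind Word2Vec, the sketch below builds the (center, context) word pairs that the skip-gram variant learns from. It only generates the pairs and does not train a model; `skipgram_pairs` and its `window` parameter are names made up for this example.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for words within `window` positions of each other."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["the", "king", "rules", "the", "land"]
for pair in skipgram_pairs(sentence, window=1):
    print(pair)
```

Words that keep appearing in similar contexts end up producing similar pairs, which is what pushes their vectors together during training.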
GloVe, on the other hand, relies on the co-occurrence data of words to construct vectors. GloVe might represent the words "ice" and "water" as vectors that are close together because they often appear together in text and share a semantic relationship.
fastText takes word embeddings to a sub-word level, allowing one to deal with out-of-vocabulary (OOV) words or variations.
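To show what "sub-word level" means, here is a minimal sketch of character n-gram extraction in the style fastText describes: the word is wrapped in `<` and `>` boundary markers and split into overlapping n-grams. The function name and the default range of 3 to 6 characters are assumptions for this example; fastText learns a vector for each n-gram, which this sketch does not do.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word` after adding boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because an unseen word still shares n-grams with known words, a vector can be assembled for it from those shared pieces.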
Sentence and Document Embeddings
Sentence and document embeddings represent entire sentences or documents as numerical vectors using models like Doc2Vec and BERT. Doc2Vec builds upon the Word2Vec paradigm to generate document-level embeddings, allowing the encoding of entire documents or passages. BERT (Bidirectional Encoder Representations from Transformers) considers the context of each term in a sentence, leading to highly context-aware embedding vectors.
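For intuition only, the sketch below builds a sentence vector by averaging word vectors (mean pooling). This is a simple baseline swapped in for illustration; it is not how Doc2Vec or BERT work, and the three-dimensional word vectors are hypothetical values.

```python
def mean_pool(word_vectors):
    """Average a list of equal-length word vectors into a single vector."""
    dim = len(word_vectors[0])
    count = len(word_vectors)
    return [sum(vec[d] for vec in word_vectors) / count for d in range(dim)]

# Hypothetical word vectors, for illustration only.
toy_vectors = {
    "ice": [0.0, 1.0, 2.0],
    "is": [1.0, 1.0, 1.0],
    "cold": [2.0, 1.0, 0.0],
}
sentence = ["ice", "is", "cold"]
sentence_vector = mean_pool([toy_vectors[w] for w in sentence])
print(sentence_vector)  # [1.0, 1.0, 1.0]
```

Averaging ignores word order, which is one reason context-aware models tend to give better sentence representations.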
Image Embeddings
CNNs can produce image embeddings through feature extraction at varying network layers. They provide a valuable approach to image classification and retrieval. For example, a photo of a cat might have an image embedding vector with features that represent its ears, fur, and tail. CNN architectures like Inception and ResNet have layers that can extract such features. An Inception model, for instance, can generate an embedding that represents the visual attributes of an image, like objects or patterns.
Time Series Embeddings
You can embed time series data using long short-term memory (LSTM) and gated recurrent unit (GRU) neural networks, which capture temporal dependencies. For example, LSTM- or GRU-based embeddings of stock prices can represent patterns in price movements.
Audio Embeddings
In audio processing, mel-frequency cepstral coefficients (MFCCs) are a common way to represent audio data as vectors for audio classification tasks. In speech recognition, MFCC-based embeddings capture the acoustic characteristics of audio signals. For example, they can represent the spectral content of spoken words.
Graph Embeddings
Consider a social network graph. Node2Vec can generate node embeddings where similar users (nodes) have closer vectors. For instance, users with similar interests might have similar vector embeddings. You can also use graph neural networks (GNNs) to generate embeddings for nodes and capture complex relationships in graphs. GNNs can represent users and items in a recommendation system and predict user-item interactions.
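Node2Vec starts by sampling random walks over the graph and then treats each walk like a sentence for a Word2Vec-style model. The sketch below shows only the walk sampling, and it uses plain uniform walks rather than Node2Vec's biased walks (controlled by its return and in-out parameters), so treat it as a simplification. The graph and the user names are hypothetical.

```python
import random

def random_walk(graph, start, length, rng):
    """Walk up to `length` nodes from `start`, picking a uniform random neighbor each step."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk

# Hypothetical friendship graph as an adjacency list.
graph = {
    "ana": ["ben", "cy"],
    "ben": ["ana", "cy"],
    "cy": ["ana", "ben", "dee"],
    "dee": ["cy"],
}
rng = random.Random(42)  # fixed seed so runs are repeatable
walks = [random_walk(graph, node, 5, rng) for node in graph]
for walk in walks:
    print(walk)
```

Users who appear together in many walks end up with similar contexts, so an embedding model trained on the walks places them close together.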
Applications of Vector Embeddings
Here's how you can apply vector embeddings in different domains.
Image, video, and audio similarity search: You can extract embeddings from images, video frames, or audio segments using feature extraction techniques such as CNNs. Store these embeddings in a vector database. When a user queries with an image, video frame, or audio clip, retrieve similar items by measuring the similarity between their embeddings.
AI drug discovery: By encoding the chemical structures of compounds into embeddings, you can measure the similarity between drug compounds and predict their potential target proteins.
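At its simplest, the retrieval step is a nearest-neighbor search over the stored vectors. The sketch below does this by brute force with L2 (Euclidean) distance over an in-memory dictionary; the catalog entries and two-dimensional vectors are hypothetical, and a vector database would replace the linear scan with an index.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query, items, k):
    """Return the names of the k items whose vectors are nearest to the query."""
    ranked = sorted(items, key=lambda name: l2_distance(query, items[name]))
    return ranked[:k]

# Hypothetical image embeddings, for illustration only.
catalog = {
    "cat_photo": [0.9, 0.1],
    "kitten_photo": [0.8, 0.2],
    "car_photo": [0.1, 0.9],
}
print(top_k([0.88, 0.12], catalog, k=2))  # ['cat_photo', 'kitten_photo']
```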
Semantic search engine: Vector embeddings enhance search engine capability by matching the meaning of queries to relevant documents, improving search relevance.
Recommender system: You can represent users and items as embeddings. Measure user-item similarity to make personalized recommendations. Tailor-made recommendations enhance user experience.
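One simple way to score user-item similarity is the dot product of the two embeddings. The sketch below ranks unseen items for a user this way; the user vector, item vectors, and item names are all hypothetical, and production recommenders learn these embeddings from interaction data.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def recommend(user_vec, item_vecs, already_seen, k=2):
    """Return the k highest-scoring items the user has not seen yet."""
    scores = {
        name: dot(user_vec, vec)
        for name, vec in item_vecs.items()
        if name not in already_seen
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical embeddings, for illustration only.
user = [0.9, 0.1]
items = {
    "sci_fi_movie": [1.0, 0.0],
    "space_doc": [0.8, 0.3],
    "rom_com": [0.1, 1.0],
    "cooking_show": [0.0, 0.7],
}
print(recommend(user, items, already_seen={"sci_fi_movie"}))
# ['space_doc', 'rom_com']
```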
Anomaly detection: By representing data points as embeddings, you can identify unusual patterns in data. You achieve this by calculating the distance or dissimilarity between data points. Data points that lie far from the rest are potential anomalies. This can be helpful in fraud detection, network security, and industrial equipment or process monitoring.
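A minimal sketch of distance-based anomaly detection: compute the centroid of all embeddings and flag the points that lie further from it than a threshold. The points and the threshold of 2.0 are hypothetical; in practice the threshold would be tuned on your own data, and other distance-based methods may fit better.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def find_anomalies(vectors, threshold):
    """Return indexes of vectors further than `threshold` from the centroid."""
    center = centroid(vectors)
    return [i for i, v in enumerate(vectors) if math.dist(v, center) > threshold]

# Hypothetical embeddings: four clustered points and one outlier.
points = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.2], [5.0, 5.0]]
print(find_anomalies(points, threshold=2.0))  # [4]
```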
Examples of Vector Embeddings
Here's an example of using a pretrained embedding model to generate embeddings for our own words. You'll need to install Python and set up Milvus to follow along.
First, install dependencies: milvus, pymilvus, and gensim. Pymilvus is a Python SDK for Milvus, and gensim is a Python library for NLP.
pip install milvus pymilvus gensim
Import the libraries.
import gensim.downloader as api
from pymilvus import (
    connections,
    FieldSchema,
    CollectionSchema,
    DataType,
    Collection,
)
Create a connection to the Milvus server.
connections.connect(
    alias="default",
    user='username',
    password='password',
    host='localhost',
    port='19530'
)
Create a collection.
# Define the schema: a primary key, the word, and its 50-dimensional embedding
fields = [
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="words", dtype=DataType.VARCHAR, max_length=50),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=50)
]
schema = CollectionSchema(fields, "Demo to store and retrieve embeddings")
demo_milvus = Collection("milvus_demo", schema)
Load the pretrained model from gensim.
model = api.load("glove-wiki-gigaword-50")
Generate embeddings for sample words.
ice = model['ice']
water = model['water']
cold = model['cold']
tree = model['tree']
man = model['man']
woman = model['woman']
child = model['child']
female = model['female']
Here's an example of the vector embedding for the word "female."
array([-0.31575 , 0.74461 , -0.11566 , -0.30607 , 1.524 , 1.9137 ,
-0.392 , -0.67556 , -0.1051 , -0.17457 , 1.0692 , -0.68617 ,
1.2178 , 1.0286 , 0.35633 , -0.40842 , -0.34413 , 0.67533 ,
-0.5443 , -0.21132 , -0.61226 , 0.95619 , 0.43981 , 0.59639 ,
0.02958 , -1.1064 , -0.48996 , -0.82416 , -0.97248 , -0.059594,
2.396 , 0.74269 , -0.16044 , -0.69316 , 0.55892 , 0.22892 ,
0.013605, -0.44858 , -0.52965 , -0.96282 , -0.54444 , 0.18284 ,
0.16551 , 0.33446 , 0.53432 , -1.4824 , -0.34574 , -0.82834 ,
0.10107 , 0.024414], dtype=float32)
Insert the generated vector embeddings into the collection.
# Insert data into the collection, one list per field
data = [
    [1, 2, 3, 4, 5, 6, 7, 8],  # field pk
    ['ice', 'water', 'cold', 'tree', 'man', 'woman', 'child', 'female'],  # field words
    [ice, water, cold, tree, man, woman, child, female],  # field embeddings
]
insert_result = demo_milvus.insert(data)
# After the final entity is inserted, it is best to call flush so that no growing segments are left in memory
demo_milvus.flush()
Create indexes on the entities.
index = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128},
}
demo_milvus.create_index("embeddings", index)
To confirm the upload was successful, load the collection to memory and do a vector similarity search.
demo_milvus.load()
# Perform a vector similarity search using the embedding for "cold" as the query
data = [cold]
search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 10},
}
result = demo_milvus.search(data, "embeddings", search_params, limit=4, output_fields=["words"])
Loop through the results and print the words.
for i in range(0, 4):
    hit = result[0][i]
    print(hit.entity.get('words'))
And here's the expected output.
cold
ice
water
man
Why Use Vector Embeddings
Pattern recognition: Vector embeddings capture patterns, similarities, and dissimilarities in data. Machine learning models benefit from embeddings representing meaningful patterns, improving performance in tasks like classification and clustering.
Dimensionality reduction: Embeddings reduce the dimensionality of data. They transform high-dimensional data into lower-dimensional vectors, simplifying computational tasks and often improving efficiency.
Semantic understanding: Vector embeddings encode semantic relationships between data points, making it easier for machines to understand and interpret complex information.
Efficient processing: Numerical vectors are computationally efficient. Machine learning algorithms can process numerical data faster and with less computational cost.
Transfer learning: Pretrained embeddings, such as word embeddings in NLP or image embeddings in computer vision, can be fine-tuned for specific tasks. This reduces the need for vast amounts of labeled data, accelerates model convergence, and enhances model performance.
Vector Embedding FAQs
Where Do You Store Vector Embeddings?
You can store vector embeddings in a vector database like Milvus, in an in-memory database like Redis, in a relational database like PostgreSQL, or on your file system. The choice of where to store vector embeddings depends on factors like data volume, access patterns, and the application's specific requirements.
What Is the Difference Between Vector Databases and Vector Embeddings?
Vector databases are specialized databases that index, store, and retrieve vector data. In contrast, vector embeddings are numerical representations of data points (such as sentences, images, or other objects) in a continuous vector space. The embeddings capture meaningful relationships and patterns within the data, while databases store the vectors and optimize similarity searches based on vector similarity metrics.
Are Embeddings Different From Vectors?
A vector is a fundamental mathematical object, represented by a multi-dimensional array of numbers. Embeddings are a technique to transform unstructured data into vector representations that capture semantic relationships between data points.
What Is a Vector Index?
A vector index, also known as a vector database index or similarity index, is a data structure used in vector databases to organize and optimize the retrieval of vector data based on similarity metrics. Examples of indexes are FLAT, IVF_FLAT, IVF_PQ, IVF_SQ8, HNSW, and SCANN for CPU-based ANN searches and GPU_IVF_FLAT and GPU_IVF_PQ for GPU-based ANN searches.
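To give a feel for how an IVF-style index narrows a search, here is a toy sketch: vectors are grouped into buckets by their nearest centroid, and a query scans only the `nprobe` closest buckets instead of every vector. This is an illustration of the idea and not Milvus's implementation; the centroids are fixed by hand here, whereas a real index learns them by clustering, and all function names are hypothetical.

```python
import math

def nearest(vec, centroids):
    """Index of the centroid closest to `vec`."""
    return min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))

def build_ivf(vectors, centroids):
    """Assign every vector id to the bucket of its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        buckets[nearest(vec, centroids)].append(vec_id)
    return buckets

def ivf_search(query, vectors, centroids, buckets, nprobe, k):
    """Scan only the `nprobe` buckets whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    candidates = [vid for i in order[:nprobe] for vid in buckets[i]]
    candidates.sort(key=lambda vid: math.dist(query, vectors[vid]))
    return candidates[:k]

vectors = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [9.9, 0.1]]
centroids = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]  # hand-picked, not learned
buckets = build_ivf(vectors, centroids)
print(buckets)  # {0: [0, 1], 1: [2, 3], 2: [4]}
print(ivf_search([5.0, 5.0], vectors, centroids, buckets, nprobe=1, k=2))  # [2, 3]
```

Scanning fewer buckets is faster but can miss a true neighbor that landed in an unscanned bucket, which is why such searches are called approximate.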
What Is the Difference Between Pretrained Embeddings and Custom Embeddings?
Pretrained embeddings come from pretrained deep learning models, whose saved weights are called checkpoints. These checkpoints are usually open-source transformer models that have been trained on public data. Custom embeddings come from your own deep learning models trained on your own custom, domain-specific data.