An Ultimate Guide to Vectorizing and Querying Structured Data
This guide explains why and when you should vectorize your structured data and walks you through vectorizing and querying structured data with Milvus from start to finish.
Before we jump into the specifics, let's cover some fundamental concepts.
What is Vectorizing Data?
Vectorizing data, also known as data vectorization, is the process of converting different types of data (text, images, structured data, etc.) into numerical vectors, typically using machine learning models. These vectors are arrays of numbers that represent the original data in a form that machines can process and analyze. Vectorized data allows algorithms to perform mathematical operations on the data, so you can extract insights, detect patterns, or apply machine learning models.
For example, let’s take the sentence "The cat sat on the mat". A basic approach to vectorization is the bag-of-words (BoW) model, where each unique word in the sentence is represented by its frequency. Using BoW with the vocabulary ordered as [the, cat, sat, on, mat], the sentence would be converted into the vector [2, 1, 1, 1, 1], where each number is the count of the corresponding word ("the" appears twice; the other words appear once). But this only captures the presence or frequency of words, not their meaning or the relationships between them.
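As a minimal sketch of the bag-of-words idea, the snippet below counts word frequencies with the standard library. The helper name bag_of_words and the choice to order the vocabulary by first appearance are assumptions made for illustration; real BoW implementations usually fix a vocabulary over a whole corpus.

```python
from collections import Counter

def bag_of_words(sentence: str):
    """Return (vocabulary, counts) with the vocabulary in first-seen order."""
    tokens = sentence.lower().split()
    counts = Counter(tokens)
    # dict.fromkeys keeps the first occurrence of each token, preserving order
    vocabulary = list(dict.fromkeys(tokens))
    return vocabulary, [counts[word] for word in vocabulary]

vocabulary, vector = bag_of_words("The cat sat on the mat")
print(vocabulary)  # ['the', 'cat', 'sat', 'on', 'mat']
print(vector)      # [2, 1, 1, 1, 1]
```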
In modern machine learning and deep learning models, we use more advanced techniques like embeddings (Word2Vec, GloVe, BERT, etc.) to create vectors. These methods generate dense vectors, often with hundreds of dimensions, that capture the meaning of words based on their context. For example, the word "cat" would be represented by a vector like [0.2, -0.3, 0.4, ...], where the values together encode its meaning and relationships with other words in the language. Unlike frequency-based methods, embeddings allow for more advanced analysis of text, so machines can understand context, similarity, and meaning at a deeper level.
Why Do We Vectorize Data?
We vectorize data for several key reasons:
Machine Learning Compatibility: Most machine learning algorithms work with numerical data. Vectorization allows us to convert various types of data (such as text, images, and structured data) into numerical form, making it suitable for use in these algorithms.
Efficient Computation: Vectors can be processed quickly by computers, enabling faster operations like similarity searches or classification. Mathematical operations on vectors, such as distance calculations or dot products, are computationally efficient.
Capturing Semantic Meaning: Advanced vectorization techniques, such as embeddings, can capture semantic relationships between data points. For example, word embeddings can encode the meaning of words based on their context, allowing for nuanced analysis and comparison of text.
Enabling Dimensionality Reduction: While vectorization itself doesn’t reduce dimensionality, it enables data to be represented in ways that make dimensionality reduction techniques (such as PCA or t-SNE) easier to apply. These techniques can help condense complex data into more manageable forms for storage or further analysis.
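To make the "efficient computation" point concrete, here is a small sketch of the three vector operations mentioned most often in this guide: dot product, Euclidean distance, and cosine similarity. The three-dimensional vectors are made-up values for illustration, not output from a real embedding model.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors (made-up values, not real embeddings)
cat = [0.2, -0.3, 0.4]
kitten = [0.25, -0.2, 0.5]
car = [-0.4, 0.5, 0.1]

# The vector pointing in a similar direction scores higher
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))  # True
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```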
What is Vector Data?
Vector data is data that has been converted into a list of numbers, called a vector. Each vector represents one item, data point, or entity, with the numbers in the vector encoding different features or attributes of that item. Vectors are key in machine learning and data science because they allow complex data (text, images, user profiles, etc.) to be processed by algorithms.
Examples of vector data:
Word Embeddings: Words are represented as dense vectors of real numbers, where each dimension captures some aspect of the word’s meaning or context. For example, embeddings generated by models like Word2Vec or BERT represent words in a way that similar words (like "king" and "queen") have vectors that are close in a high-dimensional space.
Image Encodings: Images can be represented as vectors, either by raw pixel values (for simpler tasks) or by more complex features extracted by models like CNNs (Convolutional Neural Networks). These vectors encode the key features of the image, for tasks like image classification or similarity search.
User Behavior Profiles: In systems like recommendation engines, a user’s behavior (such as preferences, actions or past interactions) can be encoded as a vector. Each dimension of the vector may represent a feature, such as the number of times a user viewed a particular category of products or the frequency of specific actions (e.g., purchases or clicks).
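The user behavior case is the easiest of the three to build by hand. The sketch below turns a list of viewed product categories into a count vector; the category names and the event list are hypothetical.

```python
from collections import Counter

# Fixed category order: each vector position stands for one category (hypothetical names)
CATEGORIES = ["books", "electronics", "clothing", "toys"]

def profile_vector(viewed_categories):
    """Count how often each known category appears in the user's view history."""
    counts = Counter(viewed_categories)
    return [counts[category] for category in CATEGORIES]

events = ["books", "toys", "books", "electronics", "books"]
user_vector = profile_vector(events)
print(user_vector)  # [3, 1, 0, 1]
```

Because every user is mapped onto the same category order, two profile vectors can be compared position by position.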
What is Vector Data Processing?
Vector data processing involves manipulating, analyzing, or querying data that has been transformed into vector form. A vector represents an item or data point as a series of numerical values. Processing these vectors allows machines to efficiently handle tasks that involve comparing, grouping, or predicting relationships between data points.
Common vector data processing tasks include:
Similarity Search: Finding vectors that are most similar to a given query vector. This is essential in tasks like recommendation systems or semantic search, where items or documents similar to a query need to be identified quickly.
Clustering: Grouping similar vectors together based on their characteristics. Techniques like k-means clustering are used to automatically discover groups or clusters in data, often employed in customer segmentation, anomaly detection, or exploratory data analysis.
Classification: Assigning labels to vectors based on their features or patterns. For example, a machine learning model might classify a vector as representing "spam" or "not spam" in an email filtering system.
Dimensionality Reduction: Compressing high-dimensional vectors into lower-dimensional representations while preserving the most important information. Techniques like PCA (Principal Component Analysis) or t-SNE are used to reduce the complexity of data, which helps in visualization, reducing storage needs, and speeding up computations.
Vector data processing is particularly powerful for tasks that involve understanding relationships or similarities between data points. It is widely used in applications like recommendation systems, natural language processing (NLP), and computer vision, where comparing vectors efficiently is crucial to delivering accurate results.
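Similarity search is the task the rest of this guide builds on, so here is a brute-force sketch of it. It scores every stored vector against the query and keeps the best k, which is what a vector database approximates with an index at much larger scale. The item IDs and two-dimensional vectors are made up for illustration.

```python
import heapq
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, items, k=2):
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = ((item_id, cosine_similarity(query, vec)) for item_id, vec in items.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

items = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.8, 0.6],
    "doc_c": [0.0, 1.0],
}
results = top_k([1.0, 0.1], items, k=2)
print([item_id for item_id, _ in results])  # ['doc_a', 'doc_b']
```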
Why Vectorize Structured Data?
While vectorization is commonly associated with unstructured data like text or images, it is also highly useful for structured data (such as numerical, categorical, or temporal data). Here’s why vectorization can benefit structured data:
Unified Representation: Vectorization allows you to represent different types of structured data (e.g., numerical values, categories, dates) in a single, unified format. This makes it easier to process, analyze, and combine different types of features within machine learning models, particularly when integrating structured and unstructured data.
Capturing Complex Relationships: Vector representations, especially when generated by advanced algorithms (like embeddings or neural networks), can capture non-linear relationships and feature interactions that simpler structured data analysis methods (e.g., linear models) might miss. This is particularly useful for more complex machine learning tasks.
Efficient Similarity Search: Once structured data is vectorized, it enables fast similarity searches across large datasets. For example, customer profiles represented as vectors can be efficiently compared to find similar customers based on multiple features, even if those features interact in complex ways.
Compatibility with Advanced ML Models: Many state-of-the-art machine learning models (such as neural networks or gradient-boosting models) work best with vectorized input. Vectorizing structured data allows you to harness the power of these models for tasks like classification, regression, and clustering.
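One common way to get the unified representation described above is to scale numerical columns and one-hot encode categorical ones, then concatenate the results. The sketch below assumes a hypothetical customer schema; the plan names and the numeric ranges are illustrative choices, not values from a real dataset.

```python
def one_hot(value, categories):
    """Encode a categorical value as a 0/1 list with a single 1."""
    return [1.0 if value == category else 0.0 for category in categories]

def min_max(value, low, high):
    """Scale a number into the 0..1 range given its expected bounds."""
    return (value - low) / (high - low)

# Hypothetical schema: the categories and numeric ranges are assumptions
PLANS = ["free", "pro", "enterprise"]

def vectorize_customer(row):
    return (
        [min_max(row["age"], 18, 98)]
        + [min_max(row["monthly_spend"], 0, 500)]
        + one_hot(row["plan"], PLANS)
    )

vector = vectorize_customer({"age": 38, "monthly_spend": 125, "plan": "pro"})
print(vector)  # [0.25, 0.25, 0.0, 1.0, 0.0]
```

Scaling matters here because distance metrics would otherwise be dominated by whichever column has the largest raw values.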
Structured Data vs. Semi-structured Data vs. Unstructured Data
If you're working with data, you've probably heard the terms "structured," "semi-structured," and "unstructured" thrown around quite a bit. But what do they mean, and why should you care?
Let's break down the differences between these data types:
Structured data follows a predefined schema and has a consistent format. Think of it like a perfectly organized spreadsheet, with rows representing individual records and columns representing specific attributes. Structured data is the neat freak of the data world, always adhering to a strict structure.
Semi-structured data has a partially defined structure but allows for more flexibility. It's like a document with some structure, such as headings and paragraphs, but the content within those elements can vary. Emails, XML, and JSON documents are prime examples of semi-structured data.
Unstructured data is the wild child of the data family and comes in various forms like text documents, images, videos, and audio files. Unstructured data is harder to tame and analyze using traditional methods.
Now you might be wondering, "Where do I store all this raw data?" Well, the choice of database depends on the type of your data and the requirements of your application.
Relational databases like MySQL, PostgreSQL and Oracle are the default choices for structured data. These databases provide a robust and efficient way to store, retrieve and manipulate structured data using SQL (Structured Query Language).
If you are dealing with semi-structured data, NoSQL databases such as the document-oriented MongoDB and Couchbase or the wide-column Cassandra are the way to go. These databases are designed to handle flexible schemas and can accommodate hierarchical and nested data structures.
For efficient processing of unstructured data, you will need a specialized vector database like Milvus. These systems can handle massive volumes of unstructured data represented as high-dimensional vectors and provide the scalability and high availability you need.
Why and When Should You Vectorize Structured Data?
Since structured data is typically used for exact search, why vectorize it? Imagine being able to find records in your database based on their meaning, not just exact matches. Or finding hidden patterns and relationships in your data that you didn’t know existed. Vectorization makes all of that possible.
Here are a few cases where you should vectorize your structured data:
When your dataset has both structured and unstructured data.
When your structured data has unstructured values, like a CSV file with customer IDs and customer reviews or profile descriptions.
On the other hand, there are cases where vectorizing your structured data might not be a good idea:
If your data is all quantitative values, like product price sheets, vectorization won’t add much value. Traditional structured databases are well suited for this type of data and are usually the best tool for the job.
Suppose you want to do roll-up analytics, like how many people spent more than 2 seconds on your video. In that case the data has little semantic value that would benefit from vectorization. SQL-like databases are designed for these types of queries so it’s usually best to stick with them for that.
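For the mixed case, a practical first step is deciding which columns to embed and which to keep as ordinary scalar fields. The sketch below does that for a small CSV; the column names, the rows, and the TEXT_COLUMNS set are all hypothetical.

```python
import csv
import io

# Hypothetical CSV: the column names and rows are made up for illustration
RAW = """customer_id,review
101,"Fast delivery, great packaging"
102,"Stopped working after a week"
"""

TEXT_COLUMNS = {"review"}  # free-text columns worth embedding

def split_row(row):
    """Separate the fields to embed from the fields to keep as scalar values."""
    to_embed = {k: v for k, v in row.items() if k in TEXT_COLUMNS}
    scalars = {k: v for k, v in row.items() if k not in TEXT_COLUMNS}
    return to_embed, scalars

rows = [split_row(record) for record in csv.DictReader(io.StringIO(RAW))]
print(rows[0])
```

The text part goes to an embedding model, while the scalar part stays available for exact filters such as looking up a customer ID.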
Performance Comparison: Vector Search vs. Traditional Queries
When deciding whether to vectorize structured data, you need to understand the performance implications. Let’s compare vector similarity searches with traditional structured data queries across different scenarios.
Scenario 1: Exact Match Queries
For simple exact match queries, traditional databases still win. For example, finding a customer by their unique ID:
Traditional SQL: Very fast, especially with proper indexing
Vector Search: Slower, as it typically needs to check many or all items
In this case traditional databases are faster and more efficient.
Scenario 2: Range Queries
For range queries on numerical data, traditional databases also perform well:
Traditional SQL: Fast, especially with proper indexing
Vector Search: Slower, as it typically needs to check many or all items
Again traditional databases win for these types of queries.
Scenario 3: Semantic Similarity Search
This is where vector search excels. Finding semantically similar items is hard for traditional databases but easy for vector search:
Traditional SQL: Slow and complex, often requiring extensive data scanning
Vector Search: Fast, especially with proper indexing techniques
Vector search is much faster for semantic similarity queries as the dataset grows.
Scenario 4: Multi-modal Queries
For queries involving multiple data types (e.g., text and images), vector search has the advantage:
Traditional SQL: Very difficult, often requiring complex data modeling and slow operations
Vector Search: Can easily combine different data types in a single search
Vector search offers a single, unified approach to multi-modal data, which is often impractical with traditional databases.
In summary, traditional databases win at exact match and range queries, while vector search wins at semantic similarity and multi-modal queries. Choose based on the queries you need to serve.
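Scenarios 1 and 2 are easy to try with the sqlite3 module from the Python standard library. This is only a sketch of the query shapes, not a benchmark; the table, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("CREATE INDEX idx_customers_age ON customers (age)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", 36), (2, "Grace", 45), (3, "Linus", 29)],
)

# Scenario 1: exact match on the primary key
exact = conn.execute("SELECT name FROM customers WHERE id = ?", (2,)).fetchone()

# Scenario 2: range query that can use the index on age
in_range = conn.execute(
    "SELECT name FROM customers WHERE age BETWEEN ? AND ? ORDER BY age", (30, 50)
).fetchall()

print(exact)     # ('Grace',)
print(in_range)  # [('Ada',), ('Grace',)]
```

Neither query has a notion of "similar"; a row either satisfies the condition or it does not, which is why these workloads stay with SQL.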
How to Use Milvus to Vectorize and Query Your Structured Data
Milvus is an open-source vector database built for similarity search and GenAI applications. Milvus can store, index, and manage over a billion embedding vectors generated by deep neural networks and other machine learning (ML) models.
Milvus integrates with various popular embedding models, including OpenAI Embedding API, sentence-transformer, BM25, Splade, BGE-M3, and VoyageAI, making it easier to generate vector embeddings for your data. By leveraging these embedding models, Milvus simplifies the process of vectorizing structured data, enabling you to focus on building powerful applications that utilize vector similarity search and retrieval.
Now that we’ve understood the concepts, let’s walk you through using the Milvus integration to vectorize your structured data and perform a similarity search.
In this example, we will create a DataFrame, a structured dataset that contains unstructured data such as text. We will vectorize the unstructured text inside the structured data, create a collection for the vector representation of the entire dataset, and then query it.
Before you leverage the Milvus integration for vector generation, you must install Milvus on your computer and the necessary packages. Follow this guide to install Milvus.
Let’s see the step-by-step procedure for using Milvus to vectorize and query your structured data.
Step 1: Install Required Libraries
# Install the PyMilvus library
pip install pymilvus[model]
This command installs the required Python library: pymilvus, the Python client for Milvus, together with its optional model package.
Step 2: Import Required Libraries
# Import pandas for data manipulation
import pandas as pd
# Import required classes and modules from PyMilvus
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection
from pymilvus import model
from pymilvus import MilvusClient
Step 3: Start Milvus Server
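# Create an instance of the Milvus client (Milvus Lite stores all data in this local file)
client = MilvusClient('./milvus.db')
# Also register a default connection, which the Collection class used in later steps relies on
connections.connect(uri='./milvus.db')
With the modules imported, we create an instance of the MilvusClient to connect to Milvus and pass the URI ./milvus.db into the initialization of MilvusClient. Because the later steps also use the Collection class, which looks up the default connection, we register that connection against the same URI with connections.connect.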
Setting the URI as a local file, e.g., ./milvus.db, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in this file.
Important Note: We recommend setting up a more performant Milvus server on Docker or Kubernetes if you have over a million documents. When using this setup, please use the server URI, e.g., http://localhost:19530, as your URI.
Step 4: Prepare the data
# Import pandas for data manipulation
import pandas as pd
# Define the data as a dictionary
data = {
'id': [1, 2, 3],
'title': ['Introduction to Milvus', 'Milvus Advanced Features', 'Milvus Use Cases'],
'content': [
'Milvus is an open-source vector database for similarity search.',
'Milvus supports various indexes like IVF_FLAT, IVF_SQ8, and HNSW.',
'Milvus can be easily integrated with machine learning frameworks.'
]
}
# Create a pandas DataFrame from the dictionary
df = pd.DataFrame(data)
# Print the first few rows of the DataFrame
df.head()
Output
Here, we prepare the structured data in the form of a pandas DataFrame. The data contains three columns: id, title, and content. The content column contains the textual data that we want to vectorize.
Step 5: Vectorize the Data
# Create an instance of the DefaultEmbeddingFunction
ef = model.DefaultEmbeddingFunction()
# Vectorize (encode) the text data in the 'content' column
embeddings = ef.encode_documents(df['content'].tolist())
Step 6: Create a Collection Schema and Collection
# Define the fields for the collection schema
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=255),
FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=500),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=ef.dim)
]
# Create the collection schema from the defined fields
schema = CollectionSchema(fields)
# Define the name of the collection
collection_name = "structured_data"
# Create the collection with the given name and schema
collection = Collection(name=collection_name, schema=schema)
We define the schema for the collection by specifying the fields and their data types. Each FieldSchema object represents a column in the collection. In this case, we have fields for id, title, content, and embedding.
The embedding field is a float vector with a dimension equal to the output dimension of the embedding function (ef.dim). We then create a CollectionSchema with the defined fields and instantiate a Collection object with the schema and a collection name.
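Since the schema fixes string lengths and the vector dimension, it can help to check rows before inserting them. The helper below is a hypothetical pre-insert check in plain Python, not part of PyMilvus. Its limits mirror the example schema, and it assumes that max_length is counted in UTF-8 bytes.

```python
# Hypothetical pre-insert check; the limits mirror the example schema in this guide
MAX_LENGTHS = {"title": 255, "content": 500}

def validate_row(row, dim):
    """Return a list of problems found in one row (an empty list means OK)."""
    problems = []
    if not isinstance(row.get("id"), int):
        problems.append("id must be an integer")
    for field, limit in MAX_LENGTHS.items():
        # Assumption: the length limit applies to the UTF-8 encoded size
        if len(row.get(field, "").encode("utf-8")) > limit:
            problems.append(f"{field} exceeds {limit} bytes")
    if len(row.get("embedding", [])) != dim:
        problems.append(f"embedding must have {dim} dimensions")
    return problems

good = {"id": 1, "title": "Intro", "content": "Hello", "embedding": [0.1, 0.2, 0.3]}
bad = {"id": 2, "title": "T" * 300, "content": "Hello", "embedding": [0.1]}

print(validate_row(good, dim=3))  # []
print(validate_row(bad, dim=3))
```

In the walkthrough, dim would be ef.dim, the same value used for the embedding field.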
Step 7: Insert Data into the Collection
# Create a new DataFrame with the embeddings column
data_with_embeddings = df.copy()
data_with_embeddings['embedding'] = embeddings
# Insert the data into the Milvus collection
collection.insert(data=data_with_embeddings.to_dict('records'))
We create a new DataFrame data_with_embeddings by copying the original DataFrame df and adding the embedding column with the computed embeddings. We then insert the data into the Milvus collection using the insert method, passing the data as a list of dictionaries.
Step 8: Create an Index
# Define the index parameters
index_params = {
"metric_type": "COSINE", # Use cosine similarity metric
"index_type": "HNSW", # Use the HNSW indexing algorithm
"params": {"M": 48, "efConstruction": 200} # Specify index parameters
}
# Create an index on the 'embedding' field
collection.create_index("embedding", index_params)
To perform efficient similarity searches, we create an index on the embedding field. We specify the index parameters, including the metric type (cosine similarity), the index type (HNSW), and the HNSW build parameters: M, the maximum number of connections each node keeps in the graph, and efConstruction, the size of the candidate list used while building the index.
Step 9: Load the Collection into Memory
# Load the collection into memory
client.load_collection(collection_name=collection_name)
# Check the load state of the collection
res = client.get_load_state(collection_name=collection_name)
print(res)
Output
{'state': <LoadState: Loaded>}
Before performing searches, we must load the collection into memory using client.load_collection. We then check the collection's load state using client.get_load_state to ensure it's ready for queries.
Step 10: Perform a Similarity Search
# Define the query text
query = "what is milvus?"
# Encode the query text
query_embedding = ef.encode_queries([query])
# Define the search parameters (ef is the HNSW search-time candidate list size)
search_params = {"metric_type": "COSINE", "params": {"ef": 64}}
# Perform the similarity search
results = collection.search(query_embedding, anns_field="embedding", param=search_params, limit=1, output_fields=['content'])
# Print the search results
print(results)
Output
data: ['["id: 1, distance: 0.6577072143554688, entity: {\'content\': \'Milvus is an open-source vector database for similarity search.\'}"]'] , cost: 0
Finally, we perform a similarity search. We encode the query text using the same embedding function, this time through its query-side method (ef.encode_queries). We define the search parameters, including the metric type (cosine similarity) and the ef parameter, which controls how many candidates the HNSW index explores during the search (nprobe, by contrast, applies to IVF-type indexes).
We then call the search method on the collection, passing the query embedding, the embedding field to search against, the search parameters, the maximum number of results to return (limit=1), and the output_fields to include in the results (in this case, only the content field).
The search results contain the most similar document(s) based on the cosine similarity between the query embedding and the document embeddings.
By following these steps, you can efficiently leverage Milvus to vectorize and perform similarity searches on your structured data.
Leveraging Vectorized Data for Similarity Retrieval in RAG with Milvus
Now, we will walk through building a simple Retrieval Augmented Generation (RAG) system using Milvus, LangChain, and an OpenAI language model. We'll load and vectorize a structured dataset, ingest it into Milvus, perform a similarity search, send the retrieved results to the language model, and generate the final answer to the user's question.
Install Required Libraries
First, we need to install the necessary Python libraries, including seaborn, which provides the sample dataset:
pip install -U langchain langchain-community langchain-openai pymilvus[model] seaborn
Load and Preprocess Dataset
We'll load a sample dataset (the "Tips" dataset from Seaborn) and preprocess it for ingestion into Milvus.
import pandas as pd
from langchain.text_splitter import CharacterTextSplitter
from typing import List
from langchain_core.documents import Document
import seaborn as sns
# Load the Tips dataset
tips = sns.load_dataset("tips", cache=False)
# Drop rows with missing values
tips = tips.dropna()
tips = tips.head(4) # Take a small subset for demonstration
# Convert the DataFrame to a list of documents
documents = [Document(page_content=str(tips.iloc[i])) for i in range(len(tips))]
# Initialize a CharacterTextSplitter for splitting text into chunks
text_splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=0)
# Split the documents into chunks using the text_splitter
docs = text_splitter.split_documents(documents)
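Calling str() on a pandas row works, but the resulting text also carries the Series footer (name and dtype) into every document. One alternative, sketched here in plain Python, is to serialize each record as explicit "column: value" lines. The helper name row_to_text is hypothetical, and the record is an illustrative one shaped like a row of the Tips dataset.

```python
def row_to_text(row):
    """Serialize one record as 'column: value' lines."""
    return "\n".join(f"{column}: {value}" for column, value in row.items())

# Illustrative record shaped like a row of the Tips dataset
row = {"total_bill": 16.99, "tip": 1.01, "sex": "Female", "day": "Sun", "size": 2}
text = row_to_text(row)
print(text)
```

With a DataFrame, the same helper could be applied to each dictionary returned by to_dict('records') before wrapping the text in a Document.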
Set Up Vector Store and LLM
We'll set up the Milvus vector store, initialize the OpenAI embeddings and language model, and add the documents to the vector store.
from langchain_core.prompts import PromptTemplate
from langchain_community.vectorstores import Milvus
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
import os
# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = "Your API Key"
# Initialize OpenAI embeddings
embeddings = OpenAIEmbeddings()
# Initialize Milvus vector store
vectorstore = Milvus(
embedding_function=embeddings,
auto_id=True,
drop_old=True,
)
# Add documents to the vector store
vectorstore.add_documents(docs)
Define Query and Perform Similarity Search
Now, we'll define the user's query and perform a similarity search in Milvus to retrieve the most relevant documents.
# Define the user's query
query = "what is the average tip?"
# Perform similarity search in Milvus
search_results = vectorstore.similarity_search(query)
Build RAG Chain
We'll define the prompt template for the RAG system, initialize the OpenAI language model, and build the RAG chain using LangChain's Expression Language.
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
# Define the prompt template for the RAG system
PROMPT_TEMPLATE = """
Human: You are an AI assistant, and you provide answers to questions by using fact-based and statistical information when possible.
Use the following pieces of information to provide a concise answer to the question enclosed in <question> tags.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
<context>
{context}
</context>
<question>
{question}
</question>
The response should be specific and use statistics or numbers when possible.
A:"""
# Initialize the prompt template
rag_prompt = PromptTemplate(
template=PROMPT_TEMPLATE, input_variables=["context", "question"]
)
# Initialize the OpenAI language model
llm = ChatOpenAI(model="gpt-4o", temperature=0)
# Convert the vector store to a retriever
retriever = vectorstore.as_retriever()
# Format the search results for the prompt
def format_docs(docs: List[Document]):
    return "\n\n".join(doc.page_content for doc in docs)
# Define the RAG (Retrieval-Augmented Generation) chain for AI response generation
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| rag_prompt
| llm
| StrOutputParser()
)
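To see what the first stage of the chain does with a question, the plain-Python sketch below reproduces the data flow without LangChain: retrieve texts, join them, and fill the prompt. The fake_retriever function and its fixed return values are hypothetical stand-ins for the Milvus retriever, and the template is shortened.

```python
TEMPLATE = "<context>\n{context}\n</context>\n<question>\n{question}\n</question>"

def fake_retriever(question):
    # Hypothetical stand-in for the Milvus retriever: always returns fixed texts
    return ["tip: 1.01", "tip: 1.66"]

def format_docs(texts):
    return "\n\n".join(texts)

def build_prompt(question):
    # The question is used twice: once to retrieve context, once passed through as-is
    context = format_docs(fake_retriever(question))
    return TEMPLATE.format(context=context, question=question)

prompt = build_prompt("what is the average tip?")
print(prompt)
```

In the real chain, the filled prompt is then sent to the language model and the reply is converted to a plain string.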
Generate Answer
Finally, we'll invoke the RAG chain with the user's query and retrieve the generated answer.
# Invoke the RAG chain with the user's question and retrieve the generated answer
res = rag_chain.invoke(query)
print("Question:", query)
print("Answer:", res)
Output
Question: what is the average tip?
Answer: To calculate the average tip, we sum the tips and divide by the number of entries.
The tips are: 1.66, 3.31, 3.50, and 1.01.
Sum of tips = 1.66 + 3.31 + 3.50 + 1.01 = 9.48
Number of entries = 4
Average tip = 9.48 / 4 = 2.37
The average tip is $2.37.
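The arithmetic in this answer is easy to verify directly. Note that the model only sees the rows retrieved into its context (four in this demonstration), so an aggregate over a full table is better computed in code or SQL, as discussed for roll-up analytics earlier. A quick check of the numbers quoted above:

```python
from statistics import mean

# The four tip values quoted in the generated answer
tips = [1.66, 3.31, 3.50, 1.01]
average = round(mean(tips), 2)
print(average)  # 2.37
```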
By combining Milvus for efficient vector similarity search with powerful language models like those from OpenAI, you can create sophisticated RAG systems that provide accurate and contextually relevant answers to user queries in natural language.
Vectorization and Querying with Zilliz Cloud Pipelines
Zilliz Cloud is a fully managed vector database that builds on the Milvus vector database. Zilliz Cloud Pipelines is a one-stop solution for vector creation and retrieval. It provides a comprehensive set of tools and APIs that allow you to easily connect to various data sources, apply pre-built or custom vectorization models, and store the vectorized data in Zilliz Cloud for high-performance similarity search and retrieval.
Let's walk through an example of using Zilliz Cloud Pipelines to vectorize the data and store the embeddings in Zilliz Cloud for similarity search.
Set up Zilliz Cloud Pipelines
Obtain the necessary information about your Zilliz Cloud cluster, including cluster-ID, cloud region, API key, and project ID. For more information, see On Zilliz Cloud Console.
CLOUD_REGION = 'gcp-us-west1'
CLUSTER_ID = 'your CLUSTER_ID'
API_KEY = 'your API_KEY'
PROJECT_ID = 'your PROJECT_ID'
Create an Ingestion Pipeline
Define an Ingestion pipeline to process and vectorize your structured data and store the embeddings in Milvus.
Specify the functions, such as INDEX_TEXT for text data or PRESERVE for additional metadata.
import requests
headers = {
"Content-Type": "application/json",
"Accept": "application/json",
"Authorization": f"Bearer {API_KEY}"
}
create_pipeline_url = f"https://controller.api.{CLOUD_REGION}.zillizcloud.com/v1/pipelines"
collection_name = 'my_structured_data_collection'
embedding_service = "zilliz/bge-base-en-v1.5"
data = {
"name": "my_ingestion_pipeline",
"description": "A pipeline that generates embeddings for structured data",
"type": "INGESTION",
"projectId": PROJECT_ID,
"clusterId": CLUSTER_ID,
"collectionName": collection_name,
"functions": [
{
"name": "index_text",
"action": "INDEX_TEXT",
"language": "ENGLISH",
"embedding": embedding_service
},
{
"name": "preserve_metadata",
"action": "PRESERVE",
"inputField": "metadata",
"outputField": "metadata",
"fieldType": "VarChar"
}
]
}
# Send a POST request to create the Ingestion pipeline
response = requests.post(create_pipeline_url, headers=headers, json=data)
# Extract the pipeline ID from the response
ingestion_pipe_id = response.json()["data"]["pipelineId"]
In the example above, we define an Ingestion pipeline by specifying the necessary details, such as the pipeline name, description, project ID, cluster ID, and the collection name where the embeddings will be stored in Milvus.
We also define two functions within the pipeline:
INDEX_TEXT: This function is used to process and generate embeddings for the text data.
PRESERVE: This function preserves additional metadata associated with the structured data.
Finally, we send a POST request to create the Ingestion pipeline and extract the pipeline ID from the response.
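Indexing straight into the response raises a bare KeyError if the request failed. A slightly more defensive sketch is shown below. The helper extract_pipeline_id is hypothetical, and the two payloads are made up; only the data.pipelineId shape comes from the example above.

```python
def extract_pipeline_id(payload):
    """Return data.pipelineId, or raise with the whole payload for easier debugging."""
    try:
        return payload["data"]["pipelineId"]
    except (KeyError, TypeError) as exc:
        raise RuntimeError(f"Pipeline creation failed: {payload!r}") from exc

# Made-up payloads for illustration
ok = {"data": {"pipelineId": "pipe-123"}}
failed = {"message": "invalid API key"}

print(extract_pipeline_id(ok))  # pipe-123
```

In the walkthrough, the argument would be response.json().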
Run the Ingestion Pipeline
Ingest your structured data into Milvus using the created Ingestion pipeline.
Provide the necessary data and metadata as input to the pipeline.
run_pipeline_url = f"https://controller.api.{CLOUD_REGION}.zillizcloud.com/v1/pipelines/{ingestion_pipe_id}/run"
data = {
"data":
{
"text_list": [
"This is the first text.",
"This is the second text.",
"This is the third text."
],
"metadata": "Sample metadata"
}
}
# Send a POST request to run the Ingestion pipeline with the structured data
response = requests.post(run_pipeline_url, headers=headers, json=data)
To run the Ingestion pipeline, we send a POST request to the pipeline's run endpoint, providing the structured data as input. This example has a list of texts (text_list) and associated metadata in text form (metadata).
The Ingestion pipeline will process the structured data, generate embeddings for the text using the specified embedding service, and store the embeddings along with the metadata in the specified Milvus collection.
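When the texts come from structured rows rather than literals, the request body can be assembled with a small helper. This is a sketch: build_run_payload, the rows, and the metadata value are hypothetical, and the body shape simply follows the run example above.

```python
def build_run_payload(texts, metadata):
    """Build a request body with a list of texts plus one metadata string."""
    return {"data": {"text_list": list(texts), "metadata": metadata}}

# Hypothetical structured rows with a free-text column
rows = [
    {"id": 1, "review": "Great product"},
    {"id": 2, "review": "Arrived late"},
]
payload = build_run_payload((r["review"] for r in rows), metadata="reviews-batch-1")
print(payload)
```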
Using Zilliz Cloud Pipelines, you can easily vectorize your structured data and store the embeddings in Zilliz Cloud for similarity search. The platform provides a seamless and efficient way to process and search your structured data using vector embeddings.
For more detailed information and additional code examples, refer to the Quickstart in the Zilliz Cloud Developer Hub.
Summary
We’ve covered a lot on data vectorization, focusing on vectorizing and querying structured data. Let’s review:
Data vectorization converts various data into numerical vectors for efficient processing. We vectorize data for machine learning, computation, semantic meaning, and dimensionality reduction.
We discussed structured, semi-structured, and unstructured data, and the databases suited to each type. Vectorizing structured data can reveal hidden patterns and enable similarity searches, particularly in mixed data.
While traditional databases excel at exact match and range queries, vector search outperforms them in semantic searches. Efficiently vectorizing structured data is crucial for AI and machine learning applications, offering deeper insights and more powerful analysis.