An Ultimate Guide to Vectorizing and Querying Structured Data
This guide explains why and when you should vectorize your structured data and walks you through vectorizing and querying structured data with Milvus from start to finish.
Read the entire series
- Raft or not? The Best Solution to Data Consistency in Cloud-native Databases
- Mastering Efficient Similarity Search with Faiss Library
- Information Retrieval Metrics
- Advanced Querying Techniques in Vector Databases
- Popular Machine-learning Algorithms Behind Vector Searches
- Hybrid Search: Combining Text and Image for Enhanced Search Capabilities
- Ensuring High Availability of Vector Databases
- Ranking Models: What Are They and When to Use Them?
- Navigating the Nuances of Lexical and Semantic Search with Zilliz
- Enhancing Efficiency in Vector Searches with Binary Quantization and Milvus
- Model Providers: Open Source vs. Closed-Source
- Embedding and Querying Multilingual Languages with Milvus
- An Ultimate Guide to Vectorizing and Querying Structured Data
In today's data-driven world, organizations handle vast amounts of information in various formats. Structured data, like spreadsheets, database tables, and CSV files, is a common type that's highly organized and follows a predefined schema. Structured data is typically used for precise search tasks.
Vectorization is the process of converting data, often unstructured like text, images, or videos, into numerical high-dimensional vectors. This approach allows machines to understand the semantics of the data. Vector embeddings are particularly handy for tasks like similarity search, which can be challenging with traditional methods.
Since structured data is used for precise search, why bother vectorizing it? This guide will answer this question and walk you through vectorizing and querying structured data from start to finish.
Structured Data vs. Semi-structured Data vs. Unstructured Data
If you're working with data, you've probably heard the terms "structured," "semi-structured," and "unstructured" thrown around quite a bit. But what do they mean, and why should you care?
Let's break down the differences between these data types:
Structured data follows a predefined schema and has a consistent format. Think of it like a perfectly organized spreadsheet, with rows representing individual records and columns representing specific attributes. Structured data is the neat freak of the data world, always adhering to a strict structure.
Semi-structured data has a partially defined structure but allows for more flexibility. It's like a document with some structure, such as headings and paragraphs, but the content within those elements can vary. Emails, XML, and JSON documents are prime examples of semi-structured data.
Unstructured Data, as the name suggests, lacks a predefined structure. It's the wild child of the data family, coming in various formats like text documents, images, videos, and audio files. Unstructured data is more challenging to wrangle and analyze using traditional methods.
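To make the distinction concrete, here is roughly the same customer record in each of the three forms. All names and values below are made up for illustration:

```python
import json

# Structured: fixed columns, like a row in a SQL table or a CSV file
structured_row = {"customer_id": 42, "name": "Ada", "signup_year": 2023}

# Semi-structured: JSON with a flexible, nested shape; fields may vary per record
semi_structured = json.loads(
    '{"customer_id": 42, "contact": {"email": "ada@example.com"}, "tags": ["vip"]}'
)

# Unstructured: free text with no schema at all
unstructured = "Ada wrote: onboarding was quick, but the docs could use more examples."

print(structured_row["signup_year"])        # precise lookup by column -> 2023
print(semi_structured["contact"]["email"])  # nested access -> ada@example.com
print(unstructured.upper()[:9])             # free text has no fields to query
```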
Now, you might wonder, "Where do I store all this data?" Well, the choice of database depends on the nature of your data and the specific requirements of your application.
Relational databases like MySQL, PostgreSQL, and Oracle are the go-to choices for structured data. These databases provide a robust and efficient way to store, retrieve, and manipulate structured data using SQL (Structured Query Language).
If you're dealing with semi-structured data, document-oriented databases like MongoDB and Couchbase are your best bet, as is a wide-column store like Cassandra. These databases are designed to handle flexible schemas and can easily accommodate hierarchical and nested data structures.
For unstructured data, you'll need a specialized vector database like Milvus. These systems store massive volumes of unstructured data as high-dimensional vectors and provide the scalability and high availability you need.
Why and When Should You Vectorize Structured Data?
Now that we've covered the basics of structured data and its various types, you might wonder, "What exactly is vectorization, and why should I bother with it?" Great question!
Vectorization is the process of converting data, often unstructured like text, images, or videos, into numerical high-dimensional vectors. Think of it as creating a map of your data, where each data point is represented by a unique set of coordinates. Transforming your data into vectors opens up a whole new world of possibilities for search, analysis, and machine learning.
Vectorization allows machines to understand the semantics of your data. Vector embeddings are particularly handy for tasks like similarity search, which can be challenging with traditional methods.
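To make "similarity" concrete: closeness between embeddings is typically measured with cosine similarity, the same metric used later in this guide. The three-dimensional vectors below are made up for illustration; real embeddings have hundreds of dimensions and come from a model:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings" (real ones have hundreds of dimensions)
review_a = [0.9, 0.1, 0.3]   # e.g., "great battery life"
review_b = [0.8, 0.2, 0.4]   # e.g., "battery lasts long"
review_c = [0.1, 0.9, 0.2]   # e.g., "terrible shipping delay"

print(cosine_similarity(review_a, review_b))  # close in meaning -> higher score
print(cosine_similarity(review_a, review_c))  # unrelated -> lower score
```

Records whose embeddings point in similar directions score near 1.0, which is how "find records similar in meaning" becomes a nearest-neighbor search over vectors.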
Why Bother Vectorizing Your Structured Data?
Since structured data is usually used for precise search, why bother vectorizing it? Imagine being able to find similar records in your database based on their meaning, not just exact matches. Or uncovering hidden patterns and relationships in your data that you didn't know existed. Vectorization makes all of this possible.
There are several cases where you should consider vectorizing your structured datasets:
- When your dataset contains both structured and unstructured data.
- When your structured data includes unstructured values, like a CSV file with customer IDs and customer reviews or profile descriptions.
On the flip side, there are situations where vectorizing your structured data might not be a good idea:
If your data consists solely of quantitative values, such as product price sheets, vectorization might not add much value. Traditional structured databases are well-equipped to handle this type of data, and they're typically the most suitable tool for performing analyses.
If your goal is to perform roll-up analytics, like determining the percentage of people who spent more than 2 seconds on your video, the data may have little semantic value that would benefit from vectorization. SQL-like databases are designed to handle these queries efficiently, so it's often best to stick with them for such tasks.
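The roll-up example above maps directly onto SQL. A minimal sketch using Python's built-in sqlite3 module with made-up watch-time data:

```python
import sqlite3

# In-memory database with hypothetical video watch times (seconds)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE views (user_id INTEGER, watch_seconds REAL)")
conn.executemany(
    "INSERT INTO views VALUES (?, ?)",
    [(1, 0.5), (2, 3.2), (3, 1.9), (4, 7.4), (5, 2.1)],
)

# Percentage of viewers who watched for more than 2 seconds
(pct,) = conn.execute(
    "SELECT 100.0 * SUM(watch_seconds > 2) / COUNT(*) FROM views"
).fetchone()
print(f"{pct:.0f}% of viewers watched more than 2 seconds")  # 60%
conn.close()
```

A plain aggregate query answers this in one pass; no embeddings are needed because nothing here has semantic content to compare.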
How to Use Milvus to Vectorize and Query Your Structured Data
Milvus is an open-source vector database built for vector similarity search and GenAI applications. It can store, index, and manage a billion-plus embedding vectors generated by deep neural networks and other machine learning (ML) models.
Milvus integrates with various popular embedding models, including OpenAI Embedding API, sentence-transformer, BM25, Splade, BGE-M3, and VoyageAI, making it easier to generate vector embeddings for your data. By leveraging these embedding models, Milvus simplifies the process of vectorizing structured data, enabling you to focus on building powerful applications that utilize vector similarity search and retrieval.
Now that we’ve covered the concepts, let’s walk through using the Milvus integration to vectorize your structured data and perform a similarity search.
In this example, we will create a DataFrame, a structured dataset containing unstructured data like text. We will vectorize this unstructured data inside the structured data, create a collection for the entire dataset, and then query it.
Before you leverage the Milvus integration for vector generation, you must install Milvus and the necessary packages on your computer. Follow this guide to install Milvus.
Let’s see the step-by-step procedure for using Milvus to vectorize and query your structured data.
Step 1: Install Required Libraries
# Install the PyMilvus library
pip install "pymilvus[model]"
This command installs the required Python library: pymilvus (with the optional model package). pymilvus is the Python client for Milvus, and the model extra bundles the embedding functions we'll use to vectorize text. The quotes around pymilvus[model] prevent some shells, such as zsh, from misinterpreting the brackets.
Step 2: Import Required Libraries
# Import pandas for data manipulation
import pandas as pd
# Import required classes and modules from PyMilvus
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection
from pymilvus import model
from pymilvus import MilvusClient
Step 3: Start Milvus Server
# Create an instance of the Milvus client
client = MilvusClient('./milvus.db')
# The Collection API used later relies on the ORM connection, so open one to the same file
connections.connect(uri='./milvus.db')
In this step, we create an instance of MilvusClient to connect to Milvus, passing the URI ./milvus.db to the initializer.
Setting the URI to a local file such as ./milvus.db is the most convenient option, as it automatically uses Milvus Lite to store all data in that file.
Important Note: If you have over a million documents, we recommend setting up a more performant Milvus server on Docker or Kubernetes. In that case, use the server address, e.g., http://localhost:19530, as your URI.
Step 4: Prepare the data
# Import pandas for data manipulation
import pandas as pd
# Define the data as a dictionary
data = {
'id': [1, 2, 3],
'title': ['Introduction to Milvus', 'Milvus Advanced Features', 'Milvus Use Cases'],
'content': [
'Milvus is an open-source vector database for similarity search.',
'Milvus supports various indexes like IVF_FLAT, IVF_SQ8, and HNSW.',
'Milvus can be easily integrated with machine learning frameworks.'
]
}
# Create a pandas DataFrame from the dictionary
df = pd.DataFrame(data)
# Print the first few rows of the DataFrame
df.head()
Output
Here, we prepare the structured data in the form of a pandas DataFrame. The data contains three columns: id, title, and content. The content column contains the textual data that we want to vectorize.
Step 5: Vectorize the Data
# Create an instance of the DefaultEmbeddingFunction
ef = model.DefaultEmbeddingFunction()
# Vectorize (encode) the text data in the 'content' column
embeddings = ef.encode_documents(df['content'].tolist())
Step 6: Create a Collection Schema and Collection
# Define the fields for the collection schema
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=255),
FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=500),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=ef.dim)
]
# Create the collection schema from the defined fields
schema = CollectionSchema(fields)
# Define the name of the collection
collection_name = "structured_data"
# Create the collection with the given name and schema
collection = Collection(name=collection_name, schema=schema)
We define the schema for the collection by specifying the fields and their data types. Each FieldSchema object represents a column in the collection; here we have fields for id, title, content, and embedding.
The embedding field is a float vector whose dimension equals the output dimension of the embedding function (ef.dim). We then create a CollectionSchema from the defined fields and instantiate a Collection object with the schema and a collection name.
Step 7: Insert Data into the Collection
# Create a new DataFrame with the embeddings column
data_with_embeddings = df.copy()
data_with_embeddings['embedding'] = embeddings
# Insert the data into the Milvus collection
collection.insert(data=data_with_embeddings.to_dict('records'))
We create a new DataFrame, data_with_embeddings, by copying the original DataFrame df and adding an embedding column with the computed embeddings. We then insert the data into the Milvus collection using the insert method, passing the data as a list of dictionaries.
Step 8: Create an Index
# Define the index parameters
index_params = {
"metric_type": "COSINE", # Use cosine similarity metric
"index_type": "HNSW", # Use the HNSW indexing algorithm
"params": {"M": 48, "efConstruction": 200} # Specify index parameters
}
# Create an index on the 'embedding' field
collection.create_index("embedding", index_params)
To perform efficient similarity searches, we create an index on the embedding field. We specify the index parameters, including the metric type (cosine similarity), the index type (HNSW), and the HNSW build parameters: M, the maximum number of graph links per node, and efConstruction, the size of the candidate list used while building the graph. Larger values generally improve recall at the cost of build time and memory.
Step 9: Load the Collection into Memory
# Load the collection into memory
client.load_collection(collection_name=collection_name)
# Check the load state of the collection
res = client.get_load_state(collection_name=collection_name)
print(res)
Output
{'state': <LoadState: Loaded>}
Before performing searches, we must load the collection into memory using client.load_collection. We then check the collection's load state with client.get_load_state to ensure it's ready for queries.
Step 10: Perform a Similarity Search
# Define the query text
query = "what is milvus?"
# Encode the query text
query_embedding = ef.encode_documents([query])
# Define the search parameters
search_params = {"metric_type": "COSINE", "params": {"nprobe": 10}}
# Perform the similarity search
results = collection.search(query_embedding, anns_field="embedding", param=search_params, limit=1, output_fields=['content'])
# Print the search results
print(results)
Output
data: ['["id: 1, distance: 0.6577072143554688, entity: {\'content\': \'Milvus is an open-source vector database for similarity search.\'}"]'] , cost: 0
Finally, we perform a similarity search. We encode the query text using the same embedding function (ef.encode_documents). We define the search parameters, including the metric type (cosine similarity) and the nprobe parameter, which controls the number of candidate clusters probed by IVF-family indexes. (Strictly speaking, nprobe does not apply to the HNSW index we built; HNSW's search-time knob is ef, which bounds the candidate list explored during graph traversal.)
We then call the search method on the collection, passing the query embedding, the embedding field to search against, the search parameters, the maximum number of results to return (limit=1), and the output_fields to include in the results (in this case, only the content field).
The search results contain the most similar document(s) based on the cosine similarity between the query embedding and the document embeddings.
By following these steps, you can efficiently leverage Milvus to vectorize and perform similarity searches on your structured data.
Leveraging Vectorized Data for Similarity Retrieval in RAG with Milvus
Now, we will see the process of building a simple Retrieval Augmented Generation (RAG) system using Milvus, LangChain, and an OpenAI language model. We'll load and vectorize structured data from web sources, ingest it into Milvus, perform a similarity search, send the retrieved results to the language model, and generate the final answer to the user's question.
Install Required Libraries
First, we need to install the necessary Python libraries:
pip install -U langchain langchain-community langchain-openai "pymilvus[model]"
Load and Preprocess Dataset
We'll load a sample dataset (the "Tips" dataset from Seaborn) and preprocess it for ingestion into Milvus.
import pandas as pd
from langchain.text_splitter import CharacterTextSplitter
from typing import List
from langchain_core.documents import Document
import seaborn as sns
# Load the Tips dataset
tips = sns.load_dataset("tips", cache=False)
# Drop rows with missing values
tips = tips.dropna()
tips = tips.head(4) # Take a small subset for demonstration
# Convert the DataFrame to a list of documents
documents = [Document(page_content=str(tips.iloc[i])) for i in range(len(tips))]
# Initialize a CharacterTextSplitter for splitting text into chunks
text_splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=0)
# Split the documents into chunks using the text_splitter
docs = text_splitter.split_documents(documents)
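Under the hood, a character splitter like this breaks the text at the separator and greedily packs the pieces into chunks of at most chunk_size characters. Here is a simplified, dependency-free sketch of that idea; it is not LangChain's actual implementation (which also handles overlap and oversized pieces):

```python
def split_text(text, separator="\n", chunk_size=1000):
    """Greedy chunking: merge separator-delimited pieces until chunk_size is hit."""
    pieces = text.split(separator)
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # Pieces longer than chunk_size are kept whole here for simplicity
            current = piece
    if current:
        chunks.append(current)
    return chunks

# Made-up sample resembling one stringified DataFrame row
sample = "total_bill 16.99\ntip 1.01\nsex Female\nsmoker No"
result = split_text(sample, chunk_size=25)
print(result)  # two chunks, each at most 25 characters
```

With chunk_size=1000 and our short rows, each document above stays in a single chunk; the splitter matters more for long free-text fields.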
Set Up Vector Store and LLM
We'll set up the Milvus vector store, initialize the OpenAI embeddings and language model, and add the documents to the vector store.
from langchain_core.prompts import PromptTemplate
from langchain_community.vectorstores import Milvus
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
import os
# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = "Your API Key"
# Initialize OpenAI embeddings
embeddings = OpenAIEmbeddings()
# Initialize Milvus vector store (here pointing at a local Milvus Lite file;
# swap the URI for your own server address if you run Milvus on Docker/Kubernetes)
vectorstore = Milvus(
embedding_function=embeddings,
connection_args={"uri": "./milvus_rag.db"},
auto_id=True,
drop_old=True,
)
# Add documents to the vector store
vectorstore.add_documents(docs)
Define Query and Perform Similarity Search
Now, we'll define the user's query and perform a similarity search in Milvus to retrieve the most relevant documents.
# Define the user's query
query = "what is the average tip?"
# Perform similarity search in Milvus
search_results = vectorstore.similarity_search(query)
Build RAG Chain
We'll define the prompt template for the RAG system, initialize the OpenAI language model, and build the RAG chain using LangChain's Expression Language.
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
# Define the prompt template for the RAG system
PROMPT_TEMPLATE = """
Human: You are an AI assistant, and you provide answers to questions by using fact-based and statistical information when possible.
Use the following pieces of information to provide a concise answer to the question enclosed in <question> tags.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
<context>
{context}
</context>
<question>
{question}
</question>
The response should be specific and use statistics or numbers when possible.
A:"""
# Initialize the prompt template
rag_prompt = PromptTemplate(
template=PROMPT_TEMPLATE, input_variables=["context", "question"]
)
# Initialize the OpenAI language model
llm = ChatOpenAI(model="gpt-4o", temperature=0)
# Convert the vector store to a retriever
retriever = vectorstore.as_retriever()
# Format the search results for the prompt
def format_docs(docs: List[Document]):
return "\n\n".join(doc.page_content for doc in docs)
# Define the RAG (Retrieval-Augmented Generation) chain for AI response generation
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| rag_prompt
| llm
| StrOutputParser()
)
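The format_docs step above simply concatenates the retrieved documents' text, which the prompt template then receives as {context}. A standalone sketch of that behavior, using a minimal stand-in for LangChain's Document class and made-up sample rows:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Doc:
    """Minimal stand-in for langchain_core.documents.Document."""
    page_content: str

def format_docs(docs: List[Doc]) -> str:
    # Join retrieved documents with blank lines, as in the RAG chain above
    return "\n\n".join(doc.page_content for doc in docs)

retrieved = [Doc("total_bill 16.99, tip 1.01"), Doc("total_bill 10.34, tip 1.66")]
context = format_docs(retrieved)
print(context)
```

The blank-line separator keeps each retrieved record visually distinct in the prompt, which helps the language model attribute numbers to the right row.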
Generate Answer
Finally, we'll invoke the RAG chain with the user's query and retrieve the generated answer.
# Invoke the RAG chain with the user's question and retrieve the generated answer
res = rag_chain.invoke(query)
print("Question:", query)
print("Answer:", res)
Output
Question: what is the average tip?
Answer: To calculate the average tip, we sum the tips and divide by the number of entries.
The tips are: 1.66, 3.31, 3.50, and 1.01.
Sum of tips = 1.66 + 3.31 + 3.50 + 1.01 = 9.48
Number of entries = 4
Average tip = 9.48 / 4 = 2.37
The average tip is $2.37.
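It's worth sanity-checking the model's arithmetic against the four ingested rows:

```python
# Tip values from the four rows ingested above
tips = [1.66, 3.31, 3.50, 1.01]
total = sum(tips)
average = total / len(tips)
print(f"Sum = {total:.2f}, average = {average:.2f}")  # Sum = 9.48, average = 2.37
```

The numbers match the model's answer, confirming it worked from the retrieved context rather than hallucinating values.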
By leveraging Milvus for efficient vector similarity search and combining it with powerful language models like OpenAI, you can create sophisticated RAG systems that provide accurate and contextually relevant answers to user queries.
Vectorization and Querying with Zilliz Cloud Pipelines
Zilliz Cloud is a fully managed vector database that builds on the Milvus vector database. Zilliz Cloud Pipelines is a one-stop solution for vector creation and retrieval. It provides a comprehensive set of tools and APIs that allow you to easily connect to various data sources, apply pre-built or custom vectorization models, and store the vectorized data in Zilliz Cloud for high-performance similarity search and retrieval.
Let's walk through an example of using Zilliz Cloud Pipelines to vectorize the data and store the embeddings in Zilliz Cloud for similarity search.
Set up Zilliz Cloud Pipelines
Obtain the necessary information about your Zilliz Cloud cluster, including the cluster ID, cloud region, API key, and project ID. For more information, see the On Zilliz Cloud Console guide.
CLOUD_REGION = 'gcp-us-west1'
CLUSTER_ID = 'your CLUSTER_ID'
API_KEY = 'your API_KEY'
PROJECT_ID = 'your PROJECT_ID'
Create an Ingestion Pipeline
Define an Ingestion pipeline to process and vectorize your structured data and store the embeddings in Milvus. Specify the functions it should apply, such as INDEX_TEXT for text data or PRESERVE for additional metadata.
import requests
headers = {
"Content-Type": "application/json",
"Accept": "application/json",
"Authorization": f"Bearer {API_KEY}"
}
create_pipeline_url = f"https://controller.api.{CLOUD_REGION}.zillizcloud.com/v1/pipelines"
collection_name = 'my_structured_data_collection'
embedding_service = "zilliz/bge-base-en-v1.5"
data = {
"name": "my_ingestion_pipeline",
"description": "A pipeline that generates embeddings for structured data",
"type": "INGESTION",
"projectId": PROJECT_ID,
"clusterId": CLUSTER_ID,
"collectionName": collection_name,
"functions": [
{
"name": "index_text",
"action": "INDEX_TEXT",
"language": "ENGLISH",
"embedding": embedding_service
},
{
"name": "preserve_metadata",
"action": "PRESERVE",
"inputField": "metadata",
"outputField": "metadata",
"fieldType": "VarChar"
}
]
}
# Send a POST request to create the Ingestion pipeline
response = requests.post(create_pipeline_url, headers=headers, json=data)
response.raise_for_status()  # Fail fast if the request was rejected
# Extract the pipeline ID from the response
ingestion_pipe_id = response.json()["data"]["pipelineId"]
In the code above, we define an Ingestion pipeline by specifying the necessary details, such as the pipeline name, description, project ID, cluster ID, and the collection name where the embeddings will be stored in Milvus.
We also define two functions within the pipeline:
- INDEX_TEXT: processes the text data and generates embeddings for it.
- PRESERVE: preserves additional metadata associated with the structured data.
Finally, we send a POST request to create the Ingestion pipeline and extract the pipeline ID from the response.
Run the Ingestion Pipeline
Ingest your structured data into Milvus using the created Ingestion pipeline.
Provide the necessary data and metadata as input to the pipeline.
run_pipeline_url = f"https://controller.api.{CLOUD_REGION}.zillizcloud.com/v1/pipelines/{ingestion_pipe_id}/run"
data = {
"data":
{
"text_list": [
"This is the first text.",
"This is the second text.",
"This is the third text."
],
"metadata": "Sample metadata"
}
}
# Send a POST request to run the Ingestion pipeline with the structured data
response = requests.post(run_pipeline_url, headers=headers, json=data)
We send a POST request to the pipeline's run endpoint, providing the structured data as input. This example passes a list of texts (text_list) and some associated metadata (metadata).
The Ingestion pipeline will process the structured data, generate embeddings for the text using the specified embedding service, and store the embeddings along with the metadata in the specified Milvus collection.
Using Zilliz Cloud Pipelines, you can easily vectorize your structured data and store the embeddings in Zilliz Cloud for similarity search. The platform provides a seamless and efficient way to process and search your structured data using vector embeddings.
For more detailed information and additional code examples, refer to the Quickstart guide in the Zilliz Cloud Developer Hub.
Summary
This article explored why to vectorize structured data and demonstrated how to load and vectorize it, ingest it into Milvus, perform similarity searches, and generate contextually relevant answers using the OpenAI language model.