Navigating the Nuances of Lexical and Semantic Search with Zilliz
Learn the mechanics, applications, and benefits of lexical and semantic search and how to perform it in Zilliz.
Information across industries keeps expanding, and plain keyword search over large databases often fails to return the best results, because keywords do not capture the context behind a user’s query or the relationships between concepts. Advanced search techniques address this limitation with algorithms that infer user intent and produce more personalized results. These algorithms use neural networks (NN) and similarity search to retrieve results that match the meaning of a query. Zilliz supports both lexical and semantic search, which together offer faster, more flexible, and more accurate retrieval from large volumes of complex information.
In this article, we’ll examine the transformational impact of these search methodologies on data interpretation and retrieval. We will also explore how lexical and semantic search work and how Zilliz combines their power to enhance search capabilities.
Deciphering Lexical Search
Lexical search relies on exact keyword matching to retrieve relevant documents from a database. It compares the keywords in a search query against every object in the database and returns the objects where an exact match is found. Because matching is strict, accuracy depends on how closely the wording of the query matches the stored text. The algorithm does not consider context or semantics, so the search space is restricted to the provided search terms. Moreover, although the algorithm is relatively simple, a large database can still reduce search speed and efficiency.
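To make the exact-match behavior concrete, here is a minimal sketch of lexical search over an inverted index in plain Python. The sample documents and the helper names `build_inverted_index` and `lexical_search` are hypothetical and chosen only for illustration; this is not how Zilliz implements keyword matching.

```python
from collections import defaultdict

# Hypothetical sample documents, used only for illustration
docs = {
    1: "soft light socks for running",
    2: "heavy wool socks",
    3: "light running shoes",
}

def build_inverted_index(documents):
    """Map each lowercase token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def lexical_search(index, query):
    """Return ids of documents that contain every query token exactly."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

index = build_inverted_index(docs)
print(lexical_search(index, "light socks"))        # → {1}
print(lexical_search(index, "lightweight socks"))  # → set()
```

The second query returns nothing even though document 1 is clearly relevant, which is the limitation that semantic search is meant to address.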
Zilliz Cloud implements modern indexing approaches to improve lexical search efficiency. The vector database indexes the objects being queried, which greatly increases search efficiency. Furthermore, the vectorized objects are divided into subsets based on similarity. A query is directed towards the most relevant subset, which reduces the search space and improves search speed at a slight cost in accuracy.
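The idea of routing a query to the most relevant subset can be sketched as follows. This is a simplified illustration of cluster-based partitioning (the approach commonly known as an inverted-file, or IVF, index), not a description of Zilliz internals. The vectors, the hand-picked centroids, and the function names are all assumptions; a real index would learn its centroids, for example with k-means.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 2-D vectors and hand-picked centroids (assumptions for illustration)
vectors = {
    "a": [0.1, 0.2],
    "b": [0.2, 0.1],
    "c": [0.9, 0.8],
    "d": [0.8, 0.9],
    "e": [0.5, 0.45],
}
centroids = [[0.0, 0.0], [1.0, 1.0]]

def nearest_centroid(vec):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda i: euclidean(vec, centroids[i]))

# Assign every stored vector to the bucket of its nearest centroid
buckets = {i: [] for i in range(len(centroids))}
for key, vec in vectors.items():
    buckets[nearest_centroid(vec)].append(key)

def partitioned_search(query, limit=2):
    """Search only the bucket whose centroid is closest to the query."""
    bucket = buckets[nearest_centroid(query)]
    return sorted(bucket, key=lambda k: euclidean(query, vectors[k]))[:limit]

def exhaustive_search(query, limit=2):
    """Compare the query against every stored vector."""
    return sorted(vectors, key=lambda k: euclidean(query, vectors[k]))[:limit]

query = [0.55, 0.6]
print(partitioned_search(query))  # → ['d', 'c']
print(exhaustive_search(query))   # → ['e', 'd']
```

The partitioned search scans two vectors instead of five but misses `e`, the true nearest neighbor, because `e` sits in the other bucket. That is the speed-for-accuracy trade-off described above; real systems typically soften it by probing more than one bucket.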
Exploring Semantic Search
While lexical search yields precise results through exact word matching, it cannot understand user intent or handle synonyms and grammatical variation. Semantic search, powered by Natural Language Processing (NLP) and Machine Learning (ML), seeks to understand user intent and retrieve relevant data accordingly. Modern NLP models like BERT convert unstructured data into vector embeddings, which represent the data in a high-dimensional vector space. Similarity search algorithms then identify the data points closest to the embedded search query. Current-day semantic search engines also use knowledge graphs (KG) to capture the relationships between data points, targeting the deeper meanings and connections among words.
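The similarity step can be illustrated with a small sketch that ranks stored texts by cosine similarity to a query embedding. The three-dimensional vectors below are made-up stand-ins; a real model such as BERT would produce embeddings with hundreds of dimensions, and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-D "embeddings" (assumption: a real model would generate these)
embeddings = {
    "cozy lightweight socks": [0.9, 0.1, 0.0],
    "soft light socks": [0.8, 0.2, 0.1],
    "steel-toe work boots": [0.1, 0.9, 0.3],
}

def semantic_search(query_vector, limit=2):
    """Return the stored texts most similar to the query embedding."""
    ranked = sorted(
        embeddings,
        key=lambda text: cosine_similarity(query_vector, embeddings[text]),
        reverse=True,
    )
    return ranked[:limit]

# Pretend this is the embedding of the query "comfortable thin socks"
query_vector = [0.85, 0.15, 0.05]
print(semantic_search(query_vector))
```

Neither sock description shares a keyword with the pretend query, yet both rank above the boots because their vectors point in a similar direction.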
Zilliz vector database offers advanced search capabilities with semantic search. Using advanced information retrieval algorithms, it understands user queries and retrieves the most relevant vector embeddings from a complex web of information. The semantic search pipeline on Zilliz Cloud offers unparalleled precision and effortless scaling.
The Synergy of Lexical and Semantic Search in Zilliz
Zilliz is a vector database that stores data points as vector embeddings and supports semantic search for accurate and meaningful vector search. However, Zilliz also supports lexical search by enabling CRUD operations on various data types through modern indexing techniques. This ensures that users don’t have to manage different data infrastructures for different search needs.
Since vector databases are not optimized for relational CRUD operations, Zilliz stores data in a way that resembles tables in relational databases: data is organized into collections, which play the role of tables, and each collection has its own schema.
Zilliz's support for both lexical and semantic searches makes it the ideal data store for a superior search experience. As your data needs change, you can optimize your infrastructure, whether you need vector search capabilities or other databases to store your data, such as a JSON data store.
Real-world Examples of Lexical and Semantic Search Hybridization
Combining the benefits of lexical and semantic search powers various real-world applications. Two of the common examples of hybridization of lexical and semantic search are:
1. E-commerce Product Search
Traditionally, e-commerce product searches rely on lexical search, i.e., keyword matching. For example, if a user searches for “soft light socks”, the e-commerce search engine retrieves products with these keywords in their description. Semantic search, in contrast, leverages NLP to understand user intent and offers a more personalized search experience. By combining these techniques, we can achieve a highly relevant yet fast search system.
Hybridization works by adding an extra layer of semantic search to the e-commerce search system. When a user enters a query, the system decides which search method is suitable according to query complexity and desired accuracy. This results in a faster initial search and improved accuracy with the help of semantic search.
Another scenario is a user entering a misspelled query. Since lexical search can’t handle ambiguity such as typos, it will not find relevant products. However, running an additional semantic search instead of returning an error right away improves the user experience.
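The fallback flow described above can be sketched in a few lines. In this illustration the fallback ranks products by character-trigram overlap, which is only a stand-in for a real embedding model: it tolerates typos but does not capture meaning. The product list and all function names are hypothetical.

```python
def trigrams(text):
    """Set of character trigrams of a padded, lowercased string."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def jaccard(a, b):
    """Jaccard overlap of two sets, 0.0 when both are empty."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical product catalog
products = ["soft light socks", "wool winter socks", "leather office shoes"]

def lexical_search(query):
    """Products that contain every query word exactly."""
    tokens = query.lower().split()
    if not tokens:
        return []
    return [p for p in products if all(t in p.lower().split() for t in tokens)]

def fallback_search(query, limit=1):
    """Stand-in for semantic search: rank products by trigram overlap."""
    q = trigrams(query)
    ranked = sorted(products, key=lambda p: jaccard(q, trigrams(p)), reverse=True)
    return ranked[:limit]

def hybrid_search(query):
    """Try the cheap lexical search first and fall back only when it finds nothing."""
    hits = lexical_search(query)
    if hits:
        return "lexical", hits
    return "fallback", fallback_search(query)

print(hybrid_search("light socks"))      # → ('lexical', ['soft light socks'])
print(hybrid_search("sofft ligt soks"))  # → ('fallback', ['soft light socks'])
```

In a production system the fallback would call an embedding model and a vector search instead of the trigram heuristic, but the control flow stays the same.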
Multiple companies have adopted the combined power of lexical and semantic search in e-commerce. For example, Rakuten Fashion uses semantic search to improve user experience, and Zalando uses NLP to understand user intent.
2. Short Text Classification
The same approach of lexical and semantic search hybridization is also used in developing context-aware text classification systems. Built upon the lexical search systems as a foundation, semantic search allows the system to understand user intent. The system functionality begins by using lexical search for initial keyword search and then using NLP for advanced, contextual search.
The work titled “Combining Lexical and Semantic Features for Short Text Classification” describes this kind of approach for richer text analysis and classification.
Technical Deep-Dive: Building with Zilliz
Zilliz Cloud offers a straightforward approach for implementing lexical and semantic search. Here's the implementation of semantic search and lexical search in Zilliz Cloud, respectively:
Implementing Semantic Search in Zilliz
Zilliz supports all RESTful API endpoints and the Milvus SDKs. You can find examples for all SDKs, including Python, Java, Go, and Node.js, in the Zilliz documentation. Here we use the Python SDK to show how Zilliz vector search works.
Install Python SDK
Run the following commands in your terminal to install PyMilvus, the Python SDK. Make sure you have Python 3.8 or later installed.
# Install specific PyMilvus
python -m pip install pymilvus==2.3.7
# Update PyMilvus to the newest version
python -m pip install --upgrade pymilvus
# Verify installation success
python -m pip list | grep pymilvus
Create a Cluster
Creating a cluster refers to setting up multiple computers that run your database. The cluster provides the resources needed to store and handle your data. Once your cluster is set up, you will be prompted to provide your cluster credentials, which you will need to connect to your cluster later. The following code snippet demonstrates how to create a cluster using RESTful API:
curl --request POST \
--url "https://controller.api.${CLOUD_REGION}.zillizcloud.com/v1/clusters/create" \
--header "Authorization: Bearer ${API_KEY}" \
--header "accept: application/json" \
--header "content-type: application/json" \
--data-raw "{
\"plan\": \"Standard\",
\"clusterName\": \"cluster-standard\",
\"cuSize\": 1,
\"cuType\": \"Performance-optimized\",
\"projectId\": \"${PROJECT_ID}\"
}"
Alternatively, you can use the Zilliz Cloud console to create a cluster. Whichever method you choose, add a subscription plan before you create a cluster.
Connect to Zilliz Cluster
Now that you’ve obtained your cluster credentials, you can connect to your cluster. Run the following code to connect to your Zilliz Cloud cluster:
from pymilvus import MilvusClient, DataType
CLUSTER_ENDPOINT = "YOUR_CLUSTER_ENDPOINT"
TOKEN = "YOUR_CLUSTER_TOKEN"
# Set up a Milvus client
client = MilvusClient(
uri=CLUSTER_ENDPOINT,
token=TOKEN
)
Create a Collection
Zilliz Cloud stores vector embeddings in collections. Vector embeddings stored in the same collection share the same dimensionality and distance metric for similarity search.
# Create a collection in quick setup mode
client.create_collection(
collection_name="quick_setup",
dimension=5
)
The above setup uses the default Cosine metric to measure similarity. The primary field accepts integers and does not increment automatically, and a reserved JSON field named $meta stores non-schema-defined fields and their values.
Insert Data
Once you create a collection, you can insert data into it. Copy the following code to add sample data to your collection:
# 4. Insert data into the collection
# 4.1. Prepare data
data=[
{"id": 0, "vector": [0.3580376395471989, -0.6023495712049978, 0.18414012509913835, -0.26286205330961354, 0.9029438446296592], "color": "pink_8682"},
{"id": 1, "vector": [0.19886812562848388, 0.06023560599112088, 0.6976963061752597, 0.2614474506242501, 0.838729485096104], "color": "red_7025"},
{"id": 2, "vector": [0.43742130801983836, -0.5597502546264526, 0.6457887650909682, 0.7894058910881185, 0.20785793220625592], "color": "orange_6781"},
{"id": 3, "vector": [0.3172005263489739, 0.9719044792798428, -0.36981146090600725, -0.4860894583077995, 0.95791889146345], "color": "pink_9298"},
{"id": 4, "vector": [0.4452349528804562, -0.8757026943054742, 0.8220779437047674, 0.46406290649483184, 0.30337481143159106], "color": "red_4794"},
{"id": 5, "vector": [0.985825131989184, -0.8144651566660419, 0.6299267002202009, 0.1206906911183383, -0.1446277761879955], "color": "yellow_4222"},
{"id": 6, "vector": [0.8371977790571115, -0.015764369584852833, -0.31062937026679327, -0.562666951622192, -0.8984947637863987], "color": "red_9392"},
{"id": 7, "vector": [-0.33445148015177995, -0.2567135004164067, 0.8987539745369246, 0.9402995886420709, 0.5378064918413052], "color": "grey_8510"},
{"id": 8, "vector": [0.39524717779832685, 0.4000257286739164, -0.5890507376891594, -0.8650502298996872, -0.6140360785406336], "color": "white_9381"},
{"id": 9, "vector": [0.5718280481994695, 0.24070317428066512, -0.3737913482606834, -0.06726932177492717, -0.6980531615588608], "color": "purple_4976"}
]
# 4.2. Insert data
res = client.insert(
collection_name="quick_setup",
data=data
)
print(res)
# Output
#
# {
# "insert_count": 10,
# "ids": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# }
- The data is a list of dictionaries, where each dictionary represents a data record, termed an entity.
- Each dictionary contains a non-schema-defined field named color.
- Each dictionary contains the keys corresponding to both pre-defined and dynamic fields.
Similarity Search
Wait a few seconds after inserting data before you search: insertion is asynchronous, so an immediate search may return an empty result. You can run a search in any of the following three ways:
1. Single-vector Search
As the name suggests, single-vector search runs a search with a single query vector. The query is still passed as a list that contains one vector, and that vector must have the same number of dimensions as the collection.
# 6. Search with a single vector
# 6.1. Prepare query vectors
query_vectors = [
[0.041732933, 0.013779674, -0.027564144, -0.013061441, 0.009748648]
]
# 6.2. Start search
res = client.search(
collection_name="quick_setup", # target collection
data=query_vectors, # query vectors
limit=3, # number of returned entities
)
print(res)
# Output (illustrative: these IDs come from a collection larger than the ten entities inserted above)
#
# [
# [
# {
# "id": 548,
# "distance": 0.08589144051074982,
# "entity": {}
# },
# {
# "id": 736,
# "distance": 0.07866684347391129,
# "entity": {}
# },
# {
# "id": 928,
# "distance": 0.07650312781333923,
# "entity": {}
# }
# ]
# ]
2. Bulk-vector Search
Bulk-vector search performs a search with more than one query vector. Following is the code for conducting a batch semantic search:
# 7. Search with multiple vectors
# 7.1. Prepare query vectors
query_vectors = [
[0.041732933, 0.013779674, -0.027564144, -0.013061441, 0.009748648],
[0.0039737443, 0.003020432, -0.0006188639, 0.03913546, -0.00089768134]
]
# 7.2. Start search
res = client.search(
collection_name="quick_setup",
data=query_vectors,
limit=3,
)
print(res)
# Output
#
# [
# [
# {
# "id": 548,
# "distance": 0.08589144051074982,
# "entity": {}
# },
# {
# "id": 736,
# "distance": 0.07866684347391129,
# "entity": {}
# },
# {
# "id": 928,
# "distance": 0.07650312781333923,
# "entity": {}
# }
# ],
# [
# {
# "id": 532,
# "distance": 0.044551681727170944,
# "entity": {}
# },
# {
# "id": 149,
# "distance": 0.044386886060237885,
# "entity": {}
# },
# {
# "id": 271,
# "distance": 0.0442606583237648,
# "entity": {}
# }
# ]
# ]
3. Filtered Searches
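Adding a filter expression restricts the search to entities whose scalar fields satisfy a condition. Note that with only the ten sample entities inserted earlier, a range such as 500 < id < 800 matches nothing; the sample output assumes a larger collection. Below is an example of a filtered search using the schema-defined id field: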
# 8. Search with a filter expression using schema-defined fields
# 8.1. Prepare query vectors
query_vectors = [
[0.041732933, 0.013779674, -0.027564144, -0.013061441, 0.009748648]
]
# 8.2. Start search
res = client.search(
collection_name="quick_setup",
data=query_vectors,
filter="500 < id < 800",
limit=3
)
print(res)
# Output
#
# [
# [
# {
# "id": 548,
# "distance": 0.08589144051074982,
# "entity": {}
# },
# {
# "id": 736,
# "distance": 0.07866684347391129,
# "entity": {}
# },
# {
# "id": 505,
# "distance": 0.0749310627579689,
# "entity": {}
# }
# ]
# ]
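Conceptually, a filtered search keeps only the entities that pass the filter and then ranks the survivors by similarity. The sketch below models that idea in plain Python with made-up two-dimensional entities. Whether a real engine applies the filter before, during, or after the vector scan is an implementation detail that this sketch does not model, and `filtered_search` is a hypothetical helper, not a PyMilvus function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Made-up entities with a vector field and a scalar field
entities = [
    {"id": 0, "vector": [1.0, 0.0], "color": "pink"},
    {"id": 1, "vector": [0.9, 0.1], "color": "red"},
    {"id": 2, "vector": [0.0, 1.0], "color": "red"},
    {"id": 3, "vector": [0.7, 0.7], "color": "red"},
]

def filtered_search(query, predicate, limit):
    """Keep entities that satisfy the predicate, then return the top matches."""
    candidates = [e for e in entities if predicate(e)]
    scored = [(cosine_similarity(query, e["vector"]), e["id"]) for e in candidates]
    scored.sort(reverse=True)
    return [{"id": entity_id, "distance": score} for score, entity_id in scored[:limit]]

hits = filtered_search([1.0, 0.0], lambda e: e["color"] == "red", limit=2)
print([h["id"] for h in hits])  # → [1, 3]
```

Entity 0 is the closest vector to the query, but it never appears in the result because it fails the filter.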
Implementing Lexical Search in Zilliz
Zilliz Cloud supports lexical search by behaving as a JSON datastore and allowing string CRUD operations. Here’s the implementation of lexical search in Zilliz:
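Set Up the Schema
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections
# The ORM-style API needs an open connection; this reuses CLUSTER_ENDPOINT and TOKEN from the connection step above
connections.connect(uri=CLUSTER_ENDPOINT, token=TOKEN)
fields = [
    # Auto-generated integer primary key
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    # Placeholder vector field; the JSON data itself goes into dynamic fields
    FieldSchema(name="_unused", dtype=DataType.FLOAT_VECTOR, dim=1),
]
schema = CollectionSchema(fields, "NoSQL data", enable_dynamic_field=True)
collection = Collection("json_data", schema)
collection.load()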
Lexical Search
In the code snippet below, we download an example dataset from GovTrack to conduct a lexical search. You can use any data you want:
import requests
r = requests.get("https://www.govtrack.us/api/v2/role?current=true&role_type=senator")
data = r.json()["objects"]
len(data)
Once the data is downloaded, insert it into your database. The following command loads the data to Milvus:
# Add the placeholder vector to every record (the dict union operator | needs Python 3.9+)
rows = [{"_unused": [0]} | d for d in data]
collection.insert(rows)
You’re all set to search now. The following code retrieves the top record that matches the query expression. Increase top_k to retrieve more records:
top_k = 1
collection.query(
expr="state in ['OR'] and senator_rank in ['senior']",
limit=top_k,
output_fields=["person"]
)
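To clarify what the expression above asks for, here is a plain-Python sketch of the same filter applied to a list of dictionaries. The records are invented and only loosely shaped like the GovTrack role objects, and the `query` helper is a hypothetical stand-in for illustration, not the PyMilvus method.

```python
from itertools import islice

# Invented records; names and values are placeholders, not real GovTrack data
rows = [
    {"state": "OR", "senator_rank": "junior", "person": {"name": "Example Senator B"}},
    {"state": "OR", "senator_rank": "senior", "person": {"name": "Example Senator A"}},
    {"state": "WA", "senator_rank": "senior", "person": {"name": "Example Senator C"}},
]

def query(records, predicate, limit, output_fields):
    """Return up to `limit` matching records, keeping only the requested fields."""
    matched = (
        {field: record[field] for field in output_fields}
        for record in records
        if predicate(record)
    )
    return list(islice(matched, limit))

# Same condition as: state in ['OR'] and senator_rank in ['senior']
matches = query(
    rows,
    lambda r: r["state"] in ["OR"] and r["senator_rank"] in ["senior"],
    limit=1,
    output_fields=["person"],
)
print(matches)  # → [{'person': {'name': 'Example Senator A'}}]
```

No vector comparison takes part in this lookup: the result depends only on exact matches against scalar fields, which is why the article treats it as lexical search.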
Optimizing Search Functionalities
Optimizing search functionalities goes beyond using a relevant search technique. Factors like relevancy, scalability, and performance influence the search functionality and define user experience. Here are the challenges and best practices for implementing search functionalities:
1. Scalability
Database performance degrades with growing data volumes, suboptimal configuration, and hardware limitations. Finding similar items from vast amounts of information becomes slower and less accurate when databases aren’t scalable.
Solution
Using distributed search architecture and partitioning the search indexes into shards yields faster and more scalable data management. Zilliz Cloud's distributed capabilities can scale the cluster to 500 CUs, serving over 100 billion items.
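A sharded search usually follows a scatter-gather pattern: each shard returns its own top results, and a coordinator merges them into a global top list. The sketch below shows that pattern with made-up shards held in memory. It is a generic illustration under those assumptions; the source does not describe how Zilliz Cloud merges shard results, and the function names are hypothetical.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Made-up shards, each holding (entity_id, vector) pairs
shards = [
    [(0, [1.0, 0.0]), (1, [0.0, 1.0])],
    [(2, [0.8, 0.6]), (3, [0.6, 0.8])],
    [(4, [0.9, 0.1])],
]

def search_shard(shard, query, limit):
    """Local top results of one shard as (score, entity_id) pairs."""
    scored = ((cosine_similarity(query, vec), entity_id) for entity_id, vec in shard)
    return heapq.nlargest(limit, scored)

def search_all(query, limit):
    """Scatter the query to every shard, then merge the partial results."""
    partial = []
    for shard in shards:  # in a real cluster these calls would run in parallel
        partial.extend(search_shard(shard, query, limit))
    return [entity_id for _, entity_id in heapq.nlargest(limit, partial)]

print(search_all([1.0, 0.0], limit=3))  # → [0, 4, 2]
```

Because every shard returns at most `limit` candidates, the coordinator merges a small list no matter how much data each shard holds.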
2. Performance
While vector databases are highly accurate, balancing speed with accuracy can be tricky. This is because complex underlying algorithms require more processing power to retrieve accurate results.
Solution
Optimizing indexing strategies, lazy loading, and filtering leads to faster results with high accuracy. Zilliz uses the Cardinal search engine, which applies algorithmic, engineering, and low-level optimizations and makes vector search ten times faster than Milvus.
3. Accuracy
Maintaining accuracy is crucial in search systems for an improved and consistent user experience. However, consistently delivering accurate results can be difficult due to data inconsistencies and ambiguous queries.
Solution
Using NLP to understand the user intent behind a query improves the relevance of search results. Regular analysis and monitoring also help improve search systems over time. Zilliz leverages the power of deep learning algorithms to transform unstructured data into a searchable format, leading to highly accurate and personalized results.
Advancing Search Capabilities: The Role of Zilliz
Zilliz Cloud, built on Milvus, is the most performant vector database. With its advanced features, Zilliz is a game-changer that uses cutting-edge technologies to redefine search experiences. Here’s how Zilliz sets new standards for search innovation:
- The Cardinal search engine, the most performant search engine for vector searches, makes Zilliz ten times faster than Milvus.
- The AUTOINDEX index type eliminates the need to pick the right index type.
- Enterprise-grade security ensures customer data is secured and compliant with data protection policies.
- Machine learning and deep learning-driven vector search enhance personalized search experiences.
- A variety of similarity metrics allow customized classification and clustering solutions.
- The ability to switch between lexical and semantic search as needs change lets Zilliz manage data in any format in one place.
- Component-based architecture allows effortless horizontal scaling regardless of workload fluctuations.
The Future of Search Technologies
New technologies constantly drive advances in search, giving rise to more relevant and personalized systems. Here’s what the future of search technologies might hold:
1. Faster AI-driven Search
The use of NLP to understand the context of a search query adds processing overhead, and that’s where the traditional lexical search complements the semantic search. However, with more sophisticated algorithms, AI-driven search will become faster.
2. Multilingual Search
Researchers are exploring technologies to implement cross-lingual search systems. These systems will bridge language gaps and make the world's different regions more connected.
3. Explainability
With the development of more complex algorithms, users will need to understand the underlying logic. Explainable AI (XAI) technologies will be common to provide users with an explanation of model functionality and enhance their trust in AI-powered search.
Conclusion
Zilliz Cloud goes beyond keyword matching to understand the user intent and offers data-driven search results. The traditional lexical search approach offers a simplistic implementation but fails in complex and ambiguous scenarios. Semantic search algorithms, on the other hand, leverage NLP to improve search results but carry an additional processing overhead.
A hybrid search approach integrates lexical and semantic algorithms to leverage the best of both worlds. The result is a robust solution that offers speed and accuracy, improving user experience and empowering data-driven insights.
Zilliz offers lexical search for relational databases and a deeper search with semantic search capabilities. With this hybrid approach, you can manage both vector and relational databases in one place while having a suitable infrastructure for both.
Moreover, Zilliz revolutionizes the search experience for companies with diverse data demands by offering high throughput capabilities while prioritizing security and governance.
Explore Zilliz's advanced search solutions and unlock the potential of your data. The Zilliz community is open to anyone seeking to identify profitable patterns in their data and deliver exceptional user experiences. Find answers to your questions through Zilliz training and online resources, contribute to our ecosystem, and get support for implementing advanced search technologies.