Mastering Text Similarity Search with Vectors in Zilliz Cloud

Mar 07, 2024 (6 min read)

We explore the fundamentals of vector embeddings and demonstrate their application in a practical book title search using Zilliz Cloud and OpenAI embedding models. We examine key similarity metrics, such as cosine similarity, and discuss how they affect the relevance and accuracy of search results. We also highlight best practices and optimization strategies for getting the most out of text similarity search.

By Antony G.

Imagine searching for a book on vector databases in an online bookstore. You enter "Vector databases" into the search bar. With traditional search methods, your results will primarily include books whose titles or descriptions contain the exact keywords "vector" and "database." This keyword-based approach is typically sufficient for straightforward queries. However, consider a more complex query like "advanced techniques for managing large-scale vector databases." In this scenario, a keyword search might overlook books titled "Scalable Vector Storage Solutions" or "High-Performance Vector Data Handling," even though these titles are highly relevant to your query. This limitation arises because the books don't match the exact keywords used in your search.

This is where semantic search comes into play. Unlike traditional search methods, semantic search captures the deeper meaning of your query. It identifies related concepts, synonyms, and contextually similar terms, not just exact keyword matches. This broader understanding connects you with the resources most relevant to your needs, even if they don't share the same terminology. In this blog post, we look at text similarity search through the lens of Zilliz Cloud. We'll show you how to transform text queries into dense vectors and use PyMilvus, the Python SDK for Milvus, to connect to Zilliz Cloud. We'll also explore how to use an OpenAI embedding model to perform effective similarity searches.

Let's first understand what text embeddings are

Zilliz Cloud is a cloud-native vector database designed for enterprise-grade similarity search capable of handling billions of embedding vectors. Embedding vectors are high-dimensional data representations where text, images, or other data types are converted into points in a geometric space. This transformation is critical because it allows complex items to be quantified and compared based on their semantic similarities rather than just their literal content. For example, in text applications, words or phrases with similar meanings are mapped close together in this space, enabling the system to understand and retrieve information based on conceptual similarity rather than exact matches. This capability makes Zilliz Cloud particularly powerful for applications like recommendation systems, fraud detection, and similarity search, where understanding the deeper, contextual relationships within the query is key.

Let's explore retrieving book titles using Zilliz Cloud and OpenAI's embedding API

The process begins when documents are ingested into Zilliz Cloud, where a built-in Ingestion Pipeline converts them into vector embeddings and stores them in designated collections. When a user submits a query, Zilliz Cloud's Search Pipeline transforms the query into a vector embedding and searches for similar items. The system retrieves the most relevant results by measuring the spatial distance between the query vector and the document vectors stored in the database. Finally, Zilliz Cloud returns the most relevant results to the user, supporting similarity search at scale. This process is illustrated in Figure 1.

Fig 1. Flowchart of a text similarity search
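
The retrieval step described above, ranking stored vectors by their distance to the query vector, can be sketched in a few lines of plain Python. This is a brute-force illustration only, not how Zilliz Cloud searches internally. The three-dimensional vectors are invented for readability (real embeddings have hundreds or thousands of dimensions), "Italian Cooking Basics" and "A History of Jazz" are hypothetical titles, and `top_k_nearest` is a hypothetical helper name.

```python
import heapq
import math

# Hypothetical 3-dimensional "embeddings"; the values are invented for illustration.
BOOK_VECTORS = {
    "Scalable Vector Storage Solutions": [0.9, 0.1, 0.0],
    "High-Performance Vector Data Handling": [0.8, 0.2, 0.1],
    "Italian Cooking Basics": [0.0, 0.1, 0.9],
    "A History of Jazz": [0.1, 0.9, 0.2],
}


def top_k_nearest(query_vector, vectors, k=2):
    """Return the k (title, distance) pairs closest to the query by Euclidean distance."""
    scored = ((title, math.dist(query_vector, vec)) for title, vec in vectors.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Invented vector standing in for the embedded user query.
query = [0.9, 0.2, 0.0]

for title, distance in top_k_nearest(query, BOOK_VECTORS):
    print(f"{distance:.3f}  {title}")
```

With these made-up vectors, the two vector-database titles rank first because their points lie closest to the query point, while the unrelated titles sit far away.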

In this tutorial, we'll bring our hypothetical book search scenario to life by implementing a code example from the Zilliztech demos, specifically the one that integrates Zilliz Cloud with OpenAI to set up a text similarity search. It covers everything from ingesting a dataset of book titles from Kaggle to executing queries, as shown in Figure 2.

Fig 2. Integrating Zilliz Cloud and OpenAI to search for book titles

Data is first converted to a vector representation. Zilliz Cloud offers two options for converting your data into vector embeddings. The first is to use your own model or one of several integrations with model providers such as OpenAI, Cohere, and Hugging Face. The second is the Zilliz Cloud Pipelines feature, which lets you pick from six available models to create an end-to-end pipeline. In this case, we use OpenAI's embedding API to generate the vector representations and store them in Zilliz Cloud, a managed version of Milvus, the popular open-source vector database. With the representations stored in the database, the application can perform similarity searches: when a user queries a theme such as self-development, they get the titles of the most relevant books.
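
The flow above (embed the titles with OpenAI, store them in Zilliz Cloud, then search) might look roughly like the sketch below. It is not the code from the Zilliz demo notebook. It assumes the `openai` and `pymilvus` packages are installed, that `OPENAI_API_KEY` is set, and that the cluster endpoint and API key are available in the hypothetical environment variables `ZILLIZ_URI` and `ZILLIZ_TOKEN`. The model name `text-embedding-3-small` (1536 dimensions), the collection name, and the helper names are likewise assumptions made for illustration.

```python
import os

EMBEDDING_MODEL = "text-embedding-3-small"  # assumption: any OpenAI embedding model works
EMBEDDING_DIM = 1536                        # output size of the model assumed above
COLLECTION = "book_titles_demo"             # hypothetical collection name


def embed_texts(openai_client, texts):
    """Return one embedding (list of floats) per input text."""
    response = openai_client.embeddings.create(model=EMBEDDING_MODEL, input=texts)
    return [item.embedding for item in response.data]


def build_rows(titles, vectors):
    """Pair each title with its vector in the row format used for insertion."""
    return [
        {"id": i, "vector": vector, "title": title}
        for i, (title, vector) in enumerate(zip(titles, vectors))
    ]


def run_demo(titles, query, top_k=3):
    """Insert titles into Zilliz Cloud and return the top_k (title, score) matches."""
    # Imported here so the helpers above can be used without these packages.
    from openai import OpenAI
    from pymilvus import MilvusClient

    openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
    milvus = MilvusClient(uri=os.environ["ZILLIZ_URI"], token=os.environ["ZILLIZ_TOKEN"])

    if not milvus.has_collection(COLLECTION):
        # Quick setup: an "id" primary key, a "vector" field, and dynamic fields for "title".
        milvus.create_collection(
            collection_name=COLLECTION, dimension=EMBEDDING_DIM, metric_type="COSINE"
        )

    milvus.insert(
        collection_name=COLLECTION, data=build_rows(titles, embed_texts(openai_client, titles))
    )
    hits = milvus.search(
        collection_name=COLLECTION,
        data=embed_texts(openai_client, [query]),
        limit=top_k,
        output_fields=["title"],
    )
    return [(hit["entity"]["title"], hit["distance"]) for hit in hits[0]]
```

Calling `run_demo(["Scalable Vector Storage Solutions", "Italian Cooking Basics"], "managing vector databases")` would then return title and score pairs, provided valid credentials are in place.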

Zilliz Cloud supports several similarity metrics, including Euclidean distance, cosine similarity, and inner product. The Colab notebook for the book title example uses cosine similarity. Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space, so it captures the orientation of the vectors rather than their magnitude. By focusing on the direction in which the vectors point and not their length, it neutralizes the effect of document length variability. This matters when text documents vary significantly in length, which makes magnitude-dependent metrics less effective. Cosine similarity values range from -1 to 1: 1 means the vectors point in exactly the same direction (one is a positive multiple of the other), 0 means they are orthogonal (no similarity), and -1 means they point in opposite directions.
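
To make the magnitude-invariance point concrete, here is a minimal pure-Python sketch of cosine similarity, cos(a, b) = (a · b) / (‖a‖ ‖b‖). The vectors are invented for illustration and `cosine_similarity` is a hypothetical helper. In practice the database computes the metric for you.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Invented vectors: same direction, very different magnitudes
# (think of a short and a long document on the same topic).
short_doc = [1.0, 2.0, 3.0]
long_doc = [10.0, 20.0, 30.0]

print(f"cosine:    {cosine_similarity(short_doc, long_doc):.3f}")  # prints 1.000
print(f"euclidean: {math.dist(short_doc, long_doc):.3f}")          # large despite same direction
print(f"orthogonal: {cosine_similarity([1.0, 0.0], [0.0, 1.0]):.3f}")  # prints 0.000
print(f"opposite:   {cosine_similarity([1.0, 2.0], [-1.0, -2.0]):.3f}")  # prints -1.000
```

The two "documents" score a cosine similarity of 1 even though their Euclidean distance is large, which is the length-independence described above.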

Optimizing Performance and Accuracy

Extending the book search example, suppose an accompanying book summary library wants to improve its article recommendation system to increase user engagement and book sales. Using Zilliz Cloud, the library implements a text similarity search that dynamically recommends articles semantically related to what the user is currently reading. Zilliz's recommended best practices start with high-quality data preparation: cleaning and preprocessing the summaries, including tokenization and stop-word removal. For indexing, selecting the right index type and tuning its parameters is crucial, given the trade-offs between query speed, data size, accuracy, indexing time, and cost. Fine-tuning the vector models and optimizing query parameters, such as the number of nearest neighbors to retrieve, improves the relevance of search results and balances precision with performance. Batch or multi-vector queries reduce the number of calls, which further improves the efficiency of the recommendation system. Similarity search can also be accelerated by organizing the data effectively, a technique known as vector indexing, as detailed in the Zilliz blog.
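
Two of the practices above, preprocessing the summaries and batching requests, can be sketched as follows. This is an illustration under stated assumptions rather than Zilliz's own pipeline: the stop-word list is a tiny hand-picked sample, and `preprocess` and `chunked` are hypothetical helper names. Whether stop-word removal helps depends on the embedding model, so it is worth evaluating on your own data.

```python
import re

# Tiny illustrative stop-word list; a real system would likely use a fuller one.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def preprocess(text):
    """Lowercase the text, split it into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items, for batched requests."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


print(preprocess("Advanced Techniques for Managing Large-Scale Vector Databases"))
# ['advanced', 'techniques', 'managing', 'large', 'scale', 'vector', 'databases']

for batch in chunked(["summary 1", "summary 2", "summary 3"], batch_size=2):
    print(batch)  # each batch would be sent in a single embedding or search call
```

Sending each batch in one call instead of one call per summary is what reduces the number of round trips.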

According to VectorDBBench, an open-source benchmarking tool for vector databases, Zilliz Cloud performs strongly on several metrics. It leads in queries per second (QPS) and in p99 latency, the latency within which 99% of requests complete. Zilliz Cloud also stands out for its cost-performance ratio, measured as cost per one million queries. These benchmarks position Zilliz Cloud as a top performer among vector databases.
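
For readers unfamiliar with these metrics, the sketch below shows one common way to compute them from raw measurements. It uses the nearest-rank percentile method, which is not necessarily how VectorDBBench computes its figures. The latency samples are invented, and both function names are hypothetical.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile (integer 1..100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[rank - 1]


def queries_per_second(num_queries, elapsed_seconds):
    """Throughput: completed queries divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_queries / elapsed_seconds


# Invented measurements: 98 fast requests and 2 slow ones, in milliseconds.
latencies_ms = [5.0] * 98 + [250.0] * 2

print("mean:", sum(latencies_ms) / len(latencies_ms))
print("p50 :", percentile_nearest_rank(latencies_ms, 50))
print("p99 :", percentile_nearest_rank(latencies_ms, 99))  # 250.0
print("QPS :", queries_per_second(len(latencies_ms), elapsed_seconds=2.0))
```

In this made-up sample the mean stays under 10 ms while the p99 is 250 ms, which is why tail latency is reported separately from averages.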

Conclusion

We've explored the fundamentals of vector embeddings and demonstrated their application in a practical book title search using Zilliz Cloud and OpenAI embedding models. We delved into key similarity metrics, such as cosine similarity, and discussed how these metrics play a crucial role in enhancing the relevance and accuracy of search results. Furthermore, we highlighted best practices and optimization strategies essential for maximizing the performance of text similarity searches.

Looking ahead, as vector search technology continues to evolve, tools like Zilliz Cloud are poised to play an even more integral role in building sophisticated, semantically aware search systems. A discussion on Hacker News, the forum run by the startup incubator Y Combinator, suggests that all databases will eventually evolve into vector databases. The discussion reflects a broader trend in database technology, where different needs and contexts drive the adoption of specialized technologies. For now, the choice between traditional and vector databases depends heavily on specific use cases, scalability needs, and existing infrastructure. To stay updated and contribute to these advances, consider following Zilliz on GitHub.

Updated on Dec 25, 2024
