Caching LLM Queries for Performance and Cost Improvements
Are you looking to improve the performance of your large language model (LLM) application while reducing expenses? Consider a semantic cache for storing LLM responses. Caching LLM responses can significantly reduce data retrieval time, cut API call expenses, and improve scalability. Furthermore, by customizing the cache and monitoring its performance, you can make it even more efficient. In this blog, we'll introduce GPTCache, an open-source semantic cache for storing LLM responses, and provide tips on implementing it to get the best results. Keep reading to learn how caching LLM queries can help you achieve better performance and cost savings.
Why use a semantic cache for storing LLM responses?
Building a semantic cache for storing LLM (Large Language Model) responses can bring several benefits, such as:
- Improved performance: Storing LLM responses in a cache can significantly reduce the time it takes to retrieve a response, especially when that response has been requested before and is already present in the cache, improving the overall performance of your application.
- Reduced expenses: Most LLM services charge fees based on a combination of the number of requests and token count. Caching LLM responses can reduce the number of API calls made to the service, translating into cost savings. Caching is particularly relevant when dealing with high traffic levels, where API call expenses can be substantial.
- Better scalability: Caching LLM responses can improve the scalability of your application by reducing the load on the LLM service. Caching helps avoid bottlenecks and ensures that the application can handle a growing number of requests.
- Customization: A semantic cache can be customized to store responses based on specific requirements, such as the type of input, the output format, or the length of the response. This can help to optimize the cache and make it more efficient.
- Reduced network latency: A semantic cache can be located closer to the user, reducing the time it takes to retrieve data from the LLM service. By reducing network latency, you can improve the overall user experience.
In short, a semantic cache for storing LLM responses brings improved performance, reduced expenses, better scalability, customization, and lower network latency.
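To make the cost and performance argument concrete, here is a minimal sketch (not GPTCache itself, and exact-match rather than semantic) of a cache wrapped around a hypothetical `call_llm` function; the counter shows how repeated prompts stop reaching the paid API:

```python
# Illustrative exact-match cache. GPTCache matches *semantically*, but the
# cost-saving mechanism is the same: skip repeated API calls entirely.
api_calls = 0  # how many requests actually reach the paid LLM service

def call_llm(prompt):
    """Stand-in for a real, billable LLM API call (hypothetical)."""
    global api_calls
    api_calls += 1
    return "response to: " + prompt

_cache = {}

def cached_llm(prompt):
    if prompt not in _cache:            # miss: pay for one API call
        _cache[prompt] = call_llm(prompt)
    return _cache[prompt]               # hit: no API call, no fee

first = cached_llm("What is a semantic cache?")
second = cached_llm("What is a semantic cache?")  # served from the cache
```

With two identical requests, only the first incurs an API call; a semantic cache extends the same idea to prompts that are merely similar, not byte-identical.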
What is GPTCache?
While building our ChatGPT demo application, OSS Chat, we saw its performance degrade and its service fees climb the more we tested it. This made us realize that we needed a caching mechanism to address both problems. As we started building this caching layer, we realized it could be useful to the community, so we decided to open-source it as GPTCache.
GPTCache is an open-source tool designed to improve the efficiency and speed of GPT-based applications by implementing a cache to store the responses generated by language models. GPTCache allows users to customize the cache according to their needs, including options for embedding functions, similarity evaluation functions, storage, and eviction policies. In addition, GPTCache currently supports the OpenAI ChatGPT interface and the LangChain interface.
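To illustrate how those customization points fit together, here is a hedged, purely hypothetical sketch of a pluggable semantic cache; the class and parameter names are invented for illustration and do not reflect GPTCache's actual API:

```python
# Hypothetical sketch of a pluggable semantic-cache object. The real
# GPTCache API differs; this only illustrates the customization points:
# an embedding function, a similarity function, and a hit threshold.
class SemanticCache:
    def __init__(self, embed_fn, similar_fn, threshold=0.8):
        self.embed_fn = embed_fn      # turns a request into an embedding
        self.similar_fn = similar_fn  # scores two embeddings in [0, 1]
        self.threshold = threshold    # minimum score for a cache hit
        self.entries = []             # list of (embedding, response)

    def get(self, prompt):
        emb = self.embed_fn(prompt)
        for stored_emb, response in self.entries:
            if self.similar_fn(emb, stored_emb) >= self.threshold:
                return response       # semantic hit
        return None                   # miss: caller queries the LLM

    def put(self, prompt, response):
        self.entries.append((self.embed_fn(prompt), response))

# Toy plug-ins: lowercasing as the "embedding", equality as "similarity".
cache = SemanticCache(embed_fn=lambda p: p.lower(),
                      similar_fn=lambda a, b: 1.0 if a == b else 0.0)
cache.put("What is GPTCache?", "An open-source semantic cache.")
hit = cache.get("what is gptcache?")  # case differs, still a hit
```

Swapping in a real embedding model and a vector-similarity function turns this toy into the semantic matching described below.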
GPTCache also provides a range of options for extracting embeddings from requests for similarity search. In addition, the tool offers a generic interface that supports multiple embedding APIs, allowing users to choose the one that best fits their needs. The list of supported embedding APIs includes:
- OpenAI embedding API
- ONNX with the GPTCache/paraphrase-albert-onnx model
- Hugging Face embedding API
- Cohere embedding API
- fastText embedding API
- SentenceTransformers embedding API
These options give users a range of choices for embedding functions, which can affect the accuracy and efficiency of the similarity search functionality in GPTCache. GPTCache aims to provide flexibility and cater to a wider range of use cases by supporting multiple APIs.
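The common thread across these backends is a simple contract: any function that maps text to a fixed-length vector can be plugged in. The sketch below uses a toy character-frequency "embedding" chosen by name from a registry; the registry and function names are hypothetical, and a real deployment would register OpenAI, ONNX, or similar backends instead:

```python
# Toy illustration of a generic "text -> vector" embedding interface.
# The char_freq backend is purely hypothetical; real backends (OpenAI,
# ONNX, Hugging Face, ...) would plug into the same registry.
import string

def char_freq_embedding(text):
    """Map text to a fixed-length vector of letter frequencies."""
    text = text.lower()
    total = max(1, sum(c in string.ascii_lowercase for c in text))
    return [text.count(c) / total for c in string.ascii_lowercase]

EMBEDDINGS = {
    "char_freq": char_freq_embedding,
    # "openai": openai_embedding,  # would call a hosted embedding API
}

embed = EMBEDDINGS["char_freq"]   # select a backend by name
vec = embed("Hello GPTCache")     # a 26-dimensional frequency vector
```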
GPTCache provides support for storing cached responses in a variety of database management systems, including popular options such as SQL Server. Supporting popular databases means that users can choose the database that best fits their needs, depending on performance, scalability, and cost. In addition, GPTCache offers a universally accessible interface for extending the module, allowing users to add support for different database systems if needed.
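As a rough sketch of what SQL-backed cache storage looks like, the snippet below uses Python's built-in sqlite3 module with an in-memory database; the table layout is hypothetical and not GPTCache's actual schema, and a production setup would point the connection at a server-based database instead:

```python
# Sketch of cache storage behind a SQL database, using Python's built-in
# sqlite3 module. The schema here is hypothetical, not GPTCache's own.
import sqlite3

conn = sqlite3.connect(":memory:")  # a real setup would use a DB server
conn.execute("CREATE TABLE cache (prompt TEXT PRIMARY KEY, response TEXT)")

def store(prompt, response):
    conn.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)",
                 (prompt, response))

def lookup(prompt):
    row = conn.execute("SELECT response FROM cache WHERE prompt = ?",
                       (prompt,)).fetchone()
    return row[0] if row else None

store("what is caching", "Reusing previously computed responses.")
answer = lookup("what is caching")
```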
Vector Store Options
GPTCache supports a Vector Store module, which helps to find the K most similar requests based on the extracted embeddings from the input request. This functionality can help assess the similarity between requests. GPTCache provides a user-friendly interface that supports various vector stores, including Milvus, Zilliz Cloud, and FAISS.
These options give users a range of choices for vector stores, which can affect the efficiency and accuracy of the similarity search functionality in GPTCache. GPTCache aims to provide flexibility and cater to a broader range of use cases by supporting multiple vector stores. We also plan to support other vector databases in the near term.
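The core operation a vector store performs is the K-nearest search itself. The minimal in-memory sketch below (with made-up example data) shows that operation using cosine similarity; Milvus, Zilliz Cloud, and FAISS do the same thing at scale with indexing and approximate search:

```python
# Minimal in-memory vector store sketch: given a query embedding, return
# the K most similar stored requests by cosine similarity. Production
# systems (Milvus, Zilliz Cloud, FAISS) index vectors for speed.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical stored requests with toy 3-dimensional embeddings.
store = [
    ("how do I cache LLM calls", [0.9, 0.1, 0.0]),
    ("best pizza in town",       [0.0, 0.2, 0.9]),
    ("caching strategies",       [0.8, 0.3, 0.1]),
]

def top_k(query_emb, k=2):
    scored = [(cosine(query_emb, emb), text) for text, emb in store]
    return [text for _, text in sorted(scored, reverse=True)[:k]]

nearest = top_k([1.0, 0.0, 0.0])  # the two caching-related requests
```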
Eviction Policy Management
The Cache Manager in GPTCache controls the operations of both the Cache Storage and Vector Store modules. When the cache becomes full, a replacement policy determines which data to evict to make room for new data. GPTCache currently supports two basic options:
- LRU (Least Recently Used) eviction policy
- FIFO (First In, First Out) eviction policy
These are both standard eviction policies used in caching systems.
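The difference between the two policies is whether a lookup refreshes an entry's position. The sketch below (a generic illustration, not GPTCache's implementation) makes that visible with a fixed-capacity cache built on `OrderedDict`:

```python
# Generic illustration of LRU vs. FIFO eviction on a fixed-capacity
# cache; under FIFO, get() never reorders entries, so the oldest insert
# is always evicted first regardless of access patterns.
from collections import OrderedDict

class EvictingCache:
    def __init__(self, capacity, policy="lru"):
        self.capacity = capacity
        self.policy = policy            # "lru" or "fifo"
        self.data = OrderedDict()

    def get(self, key):
        if key in self.data and self.policy == "lru":
            self.data.move_to_end(key)  # mark as recently used
        return self.data.get(key)

    def put(self, key, value):
        if key in self.data:
            del self.data[key]
        elif len(self.data) >= self.capacity:
            self.data.popitem(last=False)  # evict the oldest entry
        self.data[key] = value

lru = EvictingCache(2, policy="lru")
lru.put("a", 1); lru.put("b", 2)
lru.get("a")       # "a" becomes most recently used
lru.put("c", 3)    # under LRU this evicts "b", not "a"
```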
The Similarity Evaluator module in GPTCache collects data from Cache Storage and Vector Store. It uses various strategies to determine the similarity between the input request and the requests from the Vector Store. The similarity determines whether a request matches the cache. GPTCache provides a standardized interface for integrating various similarity strategies and a collection of implementations. These different similarity strategies allow GPTCache to provide flexibility in determining cache matches based on other use cases and needs.
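Because every strategy exposes the same "score two requests" shape, strategies can be swapped without touching the rest of the cache. The sketch below shows two interchangeable evaluators behind one hypothetical interface (the function names are invented for illustration):

```python
# Hypothetical sketch of swappable similarity-evaluation strategies that
# share one interface: strategy(request, candidate) -> score in [0, 1].
def exact_match(request, candidate):
    return 1.0 if request == candidate else 0.0

def token_overlap(request, candidate):
    """Jaccard overlap of word sets: a crude stand-in for a real model."""
    a, b = set(request.lower().split()), set(candidate.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def is_cache_hit(request, candidate, strategy, threshold=0.5):
    return strategy(request, candidate) >= threshold

hit = is_cache_hit("how to cache LLM responses",
                   "how to cache llm responses fast",
                   strategy=token_overlap)   # similar wording, a hit
```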
GPTCache is a project aimed at optimizing the use of language models in GPT-based applications by reducing the need to generate responses from scratch repeatedly, instead reusing a cached response when applicable. GPTCache is an open-source project, so check it out for yourself. We'd love to hear your feedback, and you can even contribute to the project!