Caching LLM Queries for performance & cost improvements


Apr 10, 2023 (5 min read)

Are you looking to improve the performance of your large language model (LLM) application while reducing expenses? Consider a semantic cache for storing LLM responses. Caching LLM responses can significantly reduce response times for repeated or similar queries, cut API call expenses, and improve scalability. By customizing the cache and monitoring its performance, you can also tune it to be more efficient. In this blog, we'll introduce GPTCache, an open-source semantic cache for storing LLM responses, and walk through its main components. Keep reading to learn how caching LLM queries can help you achieve better performance and cost savings.

Why use a semantic cache for storing LLM responses?

Building a semantic cache for storing LLM (Large Language Model) responses can bring several benefits, such as:

Improved performance: Serving a response from the cache can take significantly less time than calling the LLM again, especially when the same or a similar request has already been answered. This improves the overall performance of your application.
Reduced expenses: Most LLM services charge fees based on a combination of the number of requests and token count. Caching LLM responses can reduce the number of API calls made to the service, translating into cost savings. Caching is particularly relevant when dealing with high traffic levels, where API call expenses can be substantial.
Better scalability: Caching LLM responses can improve the scalability of your application by reducing the load on the LLM service. Caching helps avoid bottlenecks and ensures that the application can handle a growing number of requests.
Customization: A semantic cache can be customized to store responses based on specific requirements, such as the type of input, the output format, or the length of the response. This can help to optimize the cache and make it more efficient.
Reduced network latency: A semantic cache can be located closer to the user, reducing the time it takes to retrieve data compared with a round trip to the LLM service. By reducing network latency, you can improve the overall user experience.

Taken together, these benefits make a semantic cache a practical way to keep latency and cost under control as traffic grows.
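To make the expense argument concrete, here is a rough back-of-the-envelope sketch. The function name, the traffic figures, the token count, and the price are all hypothetical values chosen for illustration, not measurements or real pricing, and the model assumes billing is purely per token.

```python
def estimated_cost(requests: int,
                   tokens_per_request: int,
                   price_per_1k_tokens: float,
                   hit_rate: float) -> float:
    """Estimate LLM spend when a fraction `hit_rate` of requests is served from cache.

    Simplifying assumption: cache hits cost nothing and every miss is billed
    for `tokens_per_request` tokens.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    billable_requests = requests * (1.0 - hit_rate)
    return billable_requests * tokens_per_request / 1000 * price_per_1k_tokens


# Hypothetical workload: 100,000 requests, 500 tokens each, $0.002 per 1K tokens.
baseline = estimated_cost(100_000, 500, 0.002, hit_rate=0.0)
with_cache = estimated_cost(100_000, 500, 0.002, hit_rate=0.3)
print(f"without cache: ${baseline:.2f}, with 30% hit rate: ${with_cache:.2f}")
```

In this simplified model the savings scale linearly with the hit rate, so the real benefit depends on how repetitive your traffic actually is.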

What is GPTCache?

While building OSS Chat, our ChatGPT demo application, we saw its performance degrade and its service fees climb the more we tested it. This made us realize that we needed a caching mechanism to combat both problems. As we started building this caching layer, we realized it could be useful to the community, so we decided to open-source it as GPTCache.

GPTCache is an open-source tool designed to improve the efficiency and speed of GPT-based applications by caching the responses generated by language models. Users can customize the cache according to their needs, including the embedding function, the similarity evaluation function, the storage location, and the eviction policy. GPTCache currently supports the OpenAI ChatGPT interface and the LangChain interface.
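The general idea of a semantic cache can be sketched in a few lines of plain Python. This is not GPTCache's implementation or API: the names `toy_embed`, `SemanticCache`, and `cached_ask` are hypothetical, and the bag-of-words "embedding" is a stand-in for a real embedding model.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Optional, Tuple


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts (a real cache would use a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a stored response when a new query is similar enough to an old one."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []

    def lookup(self, query: str) -> Optional[str]:
        query_vec = toy_embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def store(self, query: str, response: str) -> None:
        self.entries.append((toy_embed(query), response))


def cached_ask(cache: SemanticCache, query: str, llm: Callable[[str], str]) -> str:
    """Serve from the cache when possible; otherwise call the LLM and remember the answer."""
    hit = cache.lookup(query)
    if hit is not None:
        return hit
    response = llm(query)
    cache.store(query, response)
    return response
```

Here `llm` is any callable that takes a prompt and returns text, so the sketch can be tried with a stub function before wiring in a real model.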

Supported Embeddings

GPTCache also provides a range of options for extracting embeddings from requests for similarity search. In addition, the tool offers a generic interface that supports multiple embedding APIs, allowing users to choose the one that best fits their needs. The list of supported embedding APIs includes:

OpenAI embedding API
ONNX with the GPTCache/paraphrase-albert-onnx model
Hugging Face embedding API
Cohere embedding API
fastText embedding API
SentenceTransformers embedding API

These options give users a range of choices for embedding functions, which can affect the accuracy and efficiency of the similarity search functionality in GPTCache. GPTCache aims to provide flexibility and cater to a wider range of use cases by supporting multiple APIs.
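One way to picture a generic embedding interface is a small protocol that every backend implements. The sketch below is illustrative only: `Embedder`, `HashingEmbedder`, and `embed_request` are hypothetical names rather than GPTCache classes, and the hashing trick stands in for a real embedding model.

```python
import hashlib
import math
from typing import List, Protocol


class Embedder(Protocol):
    """Hypothetical common interface: any backend exposes a dimension and an embed method."""

    dimension: int

    def embed(self, text: str) -> List[float]:
        ...


class HashingEmbedder:
    """Toy backend: hashes each token into a fixed-size, L2-normalized vector."""

    def __init__(self, dimension: int = 64) -> None:
        self.dimension = dimension

    def embed(self, text: str) -> List[float]:
        vec = [0.0] * self.dimension
        for token in text.lower().split():
            # hashlib gives a hash that is stable across runs, unlike built-in hash().
            digest = hashlib.md5(token.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % self.dimension
            vec[index] += 1.0
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec] if norm else vec


def embed_request(embedder: Embedder, text: str) -> List[float]:
    """The cache only depends on the interface, so backends can be swapped freely."""
    return embedder.embed(text)
```

Because the cache talks only to the interface, switching from one embedding backend to another would not require changes elsewhere, as long as the vector dimension stays consistent with what is stored.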

Cache Storage

GPTCache provides support for storing cached responses in a variety of database management systems. The tool supports multiple popular databases, including:

SQLite
PostgreSQL
MySQL
MariaDB
SQL Server
Oracle

Supporting popular databases means that users can choose the one that best fits their needs in terms of performance, scalability, and cost. In addition, GPTCache offers a general interface for extending the module, allowing users to add support for other database systems if needed.
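As a rough illustration of what a pluggable storage layer can look like, the sketch below defines a small abstract interface and a SQLite-backed implementation using Python's standard `sqlite3` module. The class names, method names, and table schema are assumptions made for this example and do not reflect GPTCache's actual storage module.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Optional


class CacheStorage(ABC):
    """Hypothetical storage interface; another database would implement the same methods."""

    @abstractmethod
    def put(self, question: str, answer: str) -> int:
        """Store a question/answer pair and return its id."""

    @abstractmethod
    def get(self, entry_id: int) -> Optional[str]:
        """Return the cached answer for an id, or None if it does not exist."""


class SQLiteStorage(CacheStorage):
    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "question TEXT NOT NULL, "
            "answer TEXT NOT NULL)"
        )

    def put(self, question: str, answer: str) -> int:
        # Using the connection as a context manager commits on success.
        with self.conn:
            cursor = self.conn.execute(
                "INSERT INTO cache (question, answer) VALUES (?, ?)",
                (question, answer),
            )
        return cursor.lastrowid

    def get(self, entry_id: int) -> Optional[str]:
        row = self.conn.execute(
            "SELECT answer FROM cache WHERE id = ?", (entry_id,)
        ).fetchone()
        return row[0] if row else None
```

The id returned by `put` is what a vector index would typically store alongside the embedding, so that a similarity match can be resolved back to the cached answer.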

Vector Store Options

GPTCache supports a Vector Store module, which helps to find the K most similar requests based on the extracted embeddings from the input request. This functionality can help assess the similarity between requests. GPTCache provides a user-friendly interface that supports various vector stores, including Milvus, Zilliz Cloud, and FAISS.

These options give users a range of choices for vector stores, which can affect the efficiency and accuracy of the similarity search functionality in GPTCache. GPTCache aims to provide flexibility and cater to a broader range of use cases by supporting multiple vector stores. We also plan to support other vector databases in the near term.
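To show what "finding the K most similar requests" means in practice, here is a minimal brute-force sketch using cosine similarity. Dedicated vector stores such as Milvus or FAISS use far more efficient index structures; the class and method names below are hypothetical and serve only to illustrate the operation.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class BruteForceVectorStore:
    """Keeps (id, vector) pairs in memory and scans all of them on every search."""

    def __init__(self) -> None:
        self._vectors: List[Tuple[int, List[float]]] = []

    def add(self, entry_id: int, vector: Sequence[float]) -> None:
        self._vectors.append((entry_id, list(vector)))

    def search(self, query: Sequence[float], k: int = 1) -> List[Tuple[int, float]]:
        """Return up to k (id, similarity) pairs, most similar first."""
        scored = (
            (cosine_similarity(query, vector), entry_id)
            for entry_id, vector in self._vectors
        )
        top = heapq.nlargest(k, scored)
        return [(entry_id, score) for score, entry_id in top]
```

A linear scan like this is fine for a few thousand entries, but its cost grows with every cached request, which is the main reason to hand this step to a purpose-built vector store.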

Eviction Policy Management

The Cache Manager in GPTCache controls the operations of both the Cache Storage and Vector Store modules. When the cache becomes full, a replacement policy determines which data to evict to make room for new data. GPTCache currently supports two basic options:

LRU (Least Recently Used) eviction policy
FIFO (First In, First Out) eviction policy

These are both standard eviction policies used in caching systems.
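The difference between the two policies is easiest to see side by side. The sketch below is a generic illustration built on Python's `collections.OrderedDict`; `EvictingCache` is a hypothetical name and not how GPTCache implements eviction.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class EvictingCache:
    """Fixed-capacity cache that evicts by LRU or FIFO when full."""

    def __init__(self, capacity: int, policy: str = "LRU") -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        if policy not in ("LRU", "FIFO"):
            raise ValueError("policy must be 'LRU' or 'FIFO'")
        self.capacity = capacity
        self.policy = policy
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._data:
            return None
        if self.policy == "LRU":
            # A read counts as "recent use", so move the key to the newest end.
            self._data.move_to_end(key)
        return self._data[key]

    def put(self, key: Hashable, value: Any) -> None:
        if key in self._data:
            self._data[key] = value
            if self.policy == "LRU":
                self._data.move_to_end(key)
            return
        if len(self._data) >= self.capacity:
            # The oldest end holds the least recently used (LRU)
            # or the first inserted (FIFO) entry.
            self._data.popitem(last=False)
        self._data[key] = value
```

With LRU, reading an entry protects it from eviction for a while; with FIFO, entries leave strictly in the order they arrived, no matter how often they are read.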

Similarity Evaluator

The Similarity Evaluator module in GPTCache collects data from Cache Storage and Vector Store. It uses various strategies to determine the similarity between the input request and the candidate requests returned by the Vector Store. That similarity determines whether a request matches the cache. GPTCache provides a standardized interface for integrating similarity strategies, along with a collection of implementations. These strategies give GPTCache the flexibility to decide cache matches differently for different use cases and needs.
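A standardized interface with interchangeable strategies can be sketched as follows. The names `SimilarityStrategy`, `ExactMatch`, `EditRatio`, and `is_cache_hit`, as well as the 0.9 threshold, are assumptions for this example and not GPTCache's actual classes or defaults.

```python
from abc import ABC, abstractmethod
from difflib import SequenceMatcher


class SimilarityStrategy(ABC):
    """Hypothetical interface: every strategy returns a score between 0 and 1."""

    @abstractmethod
    def score(self, request: str, cached_request: str) -> float:
        ...


class ExactMatch(SimilarityStrategy):
    """Strictest strategy: only identical text (ignoring case and outer whitespace) matches."""

    def score(self, request: str, cached_request: str) -> float:
        same = request.strip().lower() == cached_request.strip().lower()
        return 1.0 if same else 0.0


class EditRatio(SimilarityStrategy):
    """Looser strategy: character-level similarity from the standard library."""

    def score(self, request: str, cached_request: str) -> float:
        return SequenceMatcher(None, request.lower(), cached_request.lower()).ratio()


def is_cache_hit(strategy: SimilarityStrategy,
                 request: str,
                 cached_request: str,
                 threshold: float = 0.9) -> bool:
    """A request matches the cache when its score reaches the threshold."""
    return strategy.score(request, cached_request) >= threshold
```

The threshold is the main tuning knob: a higher value reduces the risk of returning an answer to the wrong question, while a lower value raises the hit rate.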

In Summary

GPTCache is a project aimed at optimizing the use of language models in GPT-based applications: instead of generating every response from scratch, it serves a cached response when one applies. GPTCache is open source, so check it out yourself. We'd love to hear your feedback, and contributions to the project are welcome!


Updated on Mar 28, 2025

Chris Churilo
Chris Churilo is the VP of Marketing & Community at Zilliz, where she leads all community, developer relations, and marketing efforts. Prior to Zilliz, Chris was a founding member of InfluxData’s go-to-market efforts and helped propel the time series database platform to dominance in the market. In earlier roles, she defined and designed a SaaS monitoring solution at Centroid, and prior to that she was the VP of product management at iPass and was the LOB for several cloud services, which required her to track business and operational metrics and analytics to help identify and resolve issues.
