What is GPTCache?
GPTCache is an open-source library designed to improve the efficiency and speed of GPT-based applications by implementing a cache that stores the responses generated by language models. GPTCache lets users customize the cache to their needs, with options for embedding functions, similarity evaluation functions, storage location, and eviction policy. In addition, GPTCache currently supports the OpenAI ChatGPT interface and the Langchain interface.
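For a sense of what this looks like in practice, the sketch below follows the drop-in pattern shown in the project's README: the OpenAI-style call goes through GPTCache's adapter module instead of the openai package, and repeated questions are answered from the cache. Treat the exact calls as version-dependent rather than authoritative usage.

```python
# Sketch of a GPTCache-backed ChatGPT call, following the pattern in the project's README.
from gptcache import cache
from gptcache.adapter import openai  # drop-in replacement for the openai module

cache.init()             # start with the default (exact-match) cache
cache.set_openai_key()   # reads OPENAI_API_KEY from the environment

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What is GPTCache?"}],
)
# The adapter mirrors the pre-1.0 OpenAI SDK response shape.
print(response["choices"][0]["message"]["content"])
```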
Why Use GPTCache?
Improved performance
Storing LLM responses in a cache significantly reduces the time it takes to retrieve a response when it has been requested before and is already present in the cache, improving the overall performance of your application.
Reduced expenses
Most LLM services charge fees based on a combination of the number of requests and token count. Caching LLM responses can reduce the number of API calls made to the service, translating into cost savings. Caching is particularly relevant when dealing with high traffic levels, where API call expenses can be substantial.
Better scalability
Caching LLM responses can improve the scalability of your application by reducing the load on the LLM service. Caching helps avoid bottlenecks and ensures that the application can handle a growing number of requests.
Minimized development cost
A semantic cache can be a valuable tool for reducing costs during the development phase of an LLM (large language model) app. An LLM application needs a connection to LLM APIs even during development, which can become costly. GPTCache offers the same interface as the LLM APIs and can store LLM-generated or mocked-up data, so you can verify your application's features without connecting to the LLM APIs or the network.
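As a rough sketch of that workflow, the snippet below pre-loads the cache with a mocked answer and reads it back without touching the network. It assumes the put/get helpers in gptcache.adapter.api and the get_prompt pre-processor, which may differ between GPTCache releases.

```python
# Sketch: seed the cache with mocked data and serve it offline.
# Assumes gptcache.adapter.api.put/get and gptcache.processor.pre.get_prompt
# exist in your GPTCache version.
from gptcache import cache
from gptcache.adapter.api import put, get
from gptcache.processor.pre import get_prompt

cache.init(pre_embedding_func=get_prompt)  # exact-match cache keyed on the prompt

put("what is github?", "GitHub is a code hosting platform.")  # mocked-up answer
print(get("what is github?"))              # served from the cache, no API call made
```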
Lower network latency
A semantic cache sits closer to the user's machine, reducing the time it takes to retrieve data from the LLM service. By reducing network latency, you improve the overall user experience.
Improved availability
LLM services frequently enforce rate limits, which are constraints that APIs place on the number of times a user or client can access the server within a given timeframe. Hitting a rate limit means that additional requests will be blocked until a certain period has elapsed, leading to a service outage. With GPTCache, you can quickly scale to accommodate an increasing volume of queries, ensuring consistent performance as your application's user base expands.
Overall, developing a semantic cache for storing LLM responses can offer various benefits, including improved performance, reduced expenses, better scalability, customization, and reduced network latency.
How GPTCache Works
GPTCache takes advantage of data locality in online services by storing commonly accessed data, reducing retrieval time, and easing the backend server load. Unlike traditional cache systems, GPTCache uses semantic caching, identifying and storing similar or related queries to improve cache hit rates.
Using embedding algorithms, GPTCache converts queries into embeddings and employs a vector store for similarity search, enabling retrieval of related queries from the cache. The modular design of GPTCache allows users to customize their semantic cache with various implementations for each module.
While semantic caching may result in false positives and negatives, GPTCache offers three performance metrics to help developers optimize their caching systems.
This process allows GPTCache to identify and retrieve similar or related queries from the cache storage, as illustrated in the diagram below.
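Conceptually, the lookup path can be sketched in a few lines of plain Python. The names below are illustrative and are not GPTCache's internal APIs: embed the incoming query, search for its nearest cached neighbor, and return the cached answer only if the similarity clears a threshold.

```python
# Conceptual sketch of a semantic cache lookup (illustrative names, not GPTCache internals).
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def lookup(query, embed, index, answers, threshold=0.9):
    """Return a cached answer if a semantically similar query exists, else None."""
    q_vec = embed(query)
    best_key, best_score = None, -1.0
    for key, vec in index.items():      # a real system uses a vector store for this search
        score = cosine_similarity(q_vec, vec)
        if score > best_score:
            best_key, best_score = key, score
    if best_score >= threshold:         # similarity evaluation step
        return answers[best_key]        # cache hit
    return None                         # cache miss: call the LLM and store the result
```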
Customize semantic cache
GPTCache was built with a modular design to make it easy for users to customize their semantic cache. Each module has options for the users to choose from to fit their needs.
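As a sketch of how the modules plug together, following the pattern in the project's README (details may vary by version), a customized cache wires a concrete choice for each module into cache.init, here an ONNX embedding model, SQLite plus FAISS storage, and a search-distance evaluator.

```python
# Sketch of a customized semantic cache, following the README's modular pattern.
from gptcache import cache
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

onnx = Onnx()  # local ONNX embedding model
data_manager = get_data_manager(
    CacheBase("sqlite"),                            # cache storage for responses
    VectorBase("faiss", dimension=onnx.dimension),  # vector storage for similarity search
)
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()
```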
Pre-processor
The Pre-processor manages, analyzes, and formats the queries sent to LLMs, including removing redundant information from inputs, compressing input information, truncating long texts, and performing other related tasks.
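For illustration, a custom pre-processing function can be passed to the cache at initialization time. The helper below is hypothetical; its signature mirrors GPTCache's built-in pre-processors (such as last_content), which take the request payload and return the text to embed.

```python
# Sketch of a hypothetical pre-processing function (not a GPTCache built-in):
# embed only the latest user message and truncate very long inputs.
def truncate_last_message(data: dict, **_) -> str:
    last = data.get("messages", [])[-1]["content"]  # assumes an OpenAI-style payload
    return last[:2048]                              # cut long texts to a fixed budget

# cache.init(pre_embedding_func=truncate_last_message, ...)
```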
LLM Adapter
The LLM Adapter integrates with different LLM models by standardizing on the OpenAI API, unifying their APIs and request protocols. This makes it easier to experiment and test with various LLM models, as you can switch between them without rewriting your code or learning a new API. Support is available for the following, with a usage sketch after the list:
- OpenAI ChatGPT API
- langchain
- Minigpt4
- Llamacpp
- Roadmap — Hugging Face Hub, Bard and Anthropic
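As one example of that drop-in behavior, the sketch below wraps a LangChain LLM so its calls go through the cache. The import path (gptcache.adapter.langchain_models.LangChainLLMs) and LangChain's OpenAI wrapper are assumptions that may vary by version, so check the project docs for your release.

```python
# Sketch: caching a LangChain LLM instead of calling OpenAI directly.
# Assumes gptcache.adapter.langchain_models.LangChainLLMs and langchain's OpenAI wrapper.
from gptcache import cache
from gptcache.adapter.langchain_models import LangChainLLMs
from langchain.llms import OpenAI

cache.init()
cache.set_openai_key()

llm = LangChainLLMs(llm=OpenAI(temperature=0))  # wrap any LangChain LLM
print(llm("Tell me about semantic caching"))    # cached like any other GPTCache request
```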
Embedding Generator
The Embedding Generator creates embeddings from the request query with the model of your choice to run a similarity search. Supported models include the OpenAI embedding API, ONNX with the GPTCache/paraphrase-albert-onnx model, the Hugging Face embedding API, the Cohere embedding API, the fastText embedding API, the SentenceTransformers embedding API, and Timm models for image embeddings.
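A short sketch of one supported generator is below; it assumes the Onnx class and its to_embeddings/dimension interface from GPTCache's embedding package.

```python
# Sketch: generating an embedding with the ONNX model that ships with GPTCache.
from gptcache.embedding import Onnx

onnx = Onnx()                                    # defaults to GPTCache/paraphrase-albert-onnx
vec = onnx.to_embeddings("How do I cache LLM responses?")
print(len(vec), onnx.dimension)                  # vector length matches the model dimension
```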
Cache Store
The Cache Store is where responses from LLMs, such as ChatGPT, are stored. Cached responses are retrieved to assist in evaluating similarity and are returned to the requester if there is a good semantic match. GPTCache supports SQLite, PostgreSQL, MySQL, MariaDB, SQL Server, and Oracle. Supporting popular databases means users can choose the database that best fits their needs in terms of performance, scalability, and cost.
Vector Store
GPTCache supports a Vector Store module, which finds the K most similar requests based on the embeddings extracted from the input request and helps assess the similarity between requests. GPTCache provides a user-friendly interface that supports various vector stores, including Milvus, Zilliz Cloud, Milvus Lite, Hnswlib, PGVector, Chroma, DocArray, and FAISS. The choice of vector store can affect the efficiency and accuracy of the similarity search; by supporting multiple options, GPTCache aims to provide flexibility and cater to a broader range of use cases.
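Backends are typically selected by name when building the data manager, as in the sketch below. The Milvus connection parameters are illustrative and may differ by GPTCache version and deployment.

```python
# Sketch: swapping storage backends by name. Connection parameters are illustrative
# and may differ by GPTCache version and deployment.
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager

onnx = Onnx()
data_manager = get_data_manager(
    CacheBase("sqlite"),  # or "postgresql", "mysql", "mariadb", "sqlserver", "oracle"
    VectorBase("milvus", host="localhost", port="19530", dimension=onnx.dimension),
)
```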
Eviction Policy Management
The Cache Manager in GPTCache controls the operations of both the Cache Store and Vector Store modules. When the cache becomes full, a replacement policy determines which data to evict to make room for new data. GPTCache currently supports two standard eviction policies:
- LRU (Least Recently Used) eviction policy
- FIFO (First In, First Out) eviction policy
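A hedged sketch of configuring eviction is below; it assumes the data-manager factory accepts max_size and eviction keyword arguments, which may vary across GPTCache versions.

```python
# Sketch: bounding the cache and choosing an eviction policy.
# Assumes get_data_manager accepts max_size / eviction keyword arguments (may vary by version).
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager

onnx = Onnx()
data_manager = get_data_manager(
    CacheBase("sqlite"),
    VectorBase("faiss", dimension=onnx.dimension),
    max_size=1000,    # start evicting once 1,000 entries are cached
    eviction="LRU",   # or "FIFO"
)
```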
Similarity Evaluator
The Similarity Evaluator module in GPTCache collects data from the Cache Store and Vector Store and uses various strategies to determine the similarity between the input request and the requests retrieved from the Vector Store. This similarity determines whether a request matches the cache. GPTCache provides a standardized interface for integrating similarity strategies, along with a collection of implementations, giving it the flexibility to determine cache matches according to different use cases and needs.
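As an illustration of plugging in a custom strategy, the toy evaluator below assumes the SimilarityEvaluation base class with evaluation() and range() methods; treat the exact interface and the keys available in the evaluated dictionaries as version-dependent.

```python
# Sketch of a custom similarity strategy. Assumes the SimilarityEvaluation base class
# with evaluation()/range() methods; the exact interface may vary by GPTCache version.
from gptcache.similarity_evaluation import SimilarityEvaluation

class ExactQuestionEvaluation(SimilarityEvaluation):
    """Toy strategy: score 1.0 only when the cached question matches the new one exactly."""

    def evaluation(self, src_dict: dict, cache_dict: dict, **kwargs) -> float:
        return 1.0 if src_dict.get("question") == cache_dict.get("question") else 0.0

    def range(self):
        return 0.0, 1.0  # minimum and maximum possible scores

# cache.init(similarity_evaluation=ExactQuestionEvaluation(), ...)
```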
Post-processor
The Post-processor prepares the final response to return to the user when the cache is hit. If the answer is not in the cache, the LLM Adapter requests a response from the LLM and writes it back to the Cache Manager.
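For illustration, a custom post-processing function might choose among several candidate cached answers. The helper below is hypothetical; GPTCache's built-in post-processors follow the same "list of candidates in, one answer out" shape, and the parameter name used to register it may vary by version.

```python
# Sketch of a hypothetical post-processing function: return the shortest candidate answer.
def shortest_answer(messages: list, **_) -> str:
    return min(messages, key=len)

# cache.init(post_process_messages_func=shortest_answer, ...)  # parameter name may vary
```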