GPTCache and LangChain: A Strong Alliance
Introduction to GPTCache
ChatGPT and other large language models (LLMs) are incredibly versatile and can power a wide range of applications. However, as applications built on LLMs gain popularity and traffic grows, the cost of LLM API calls can become prohibitively expensive. LLM services may also exhibit slow response times when handling large volumes of requests.
To address this challenge, we created the GPTCache project, dedicated to building a semantic cache for storing LLM responses.
Introduction to LangChain
Large language models (LLMs) are emerging as a transformative technology that enables developers to build applications that were previously impossible. However, a single LLM on its own often isn't enough to create a truly powerful application; the real power lies in combining it with other sources of computation or knowledge. The LangChain library aims to assist in developing exactly these kinds of applications.
Current status of LangChain Cache
Before integrating GPTCache, the LangChain cache was based on string matching: when two requests contain identical strings, the later request retrieves the corresponding data from the cache. The implementations include Memory Cache, SQLite Cache, and Redis Cache.
The usage is roughly as follows:
import langchain
from langchain.cache import InMemoryCache
from langchain.llms import OpenAI

langchain.llm_cache = InMemoryCache()
llm = OpenAI(model_name="text-davinci-002", n=2, best_of=2)

# First call: the answer is not yet cached, so it takes longer
# CPU times: user 14.2 ms, sys: 4.9 ms, total: 19.1 ms
# Wall time: 1.1 s
llm("Tell me a joke")

# Second call with the identical string: served from the cache
# CPU times: user 162 µs, sys: 7 µs, total: 169 µs
# Wall time: 175 µs
llm("Tell me a joke")
LangChain Cache Analysis
From a runtime perspective, it is clear that a request that hits the cache gets a significantly faster response. At the same time, using an LLM is relatively expensive: online services such as OpenAI and Cohere generally charge by the token, while deploying an LLM yourself means inference time depends on the available computing resources (CPU, memory, GPU, etc.), and serving many concurrent requests raises those requirements further. Every request that hits the cache reduces the pressure on computing resources and frees capacity for other tasks.
For LangChain to hit the cache, however, the two questions must be identical strings. Unfortunately, this makes cache hits rare in actual use, and there is much room for improvement in the cache utilization rate.
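To see the limitation, note that even a trivially different phrasing of the same question misses the string-matching cache (a minimal sketch reusing the llm and InMemoryCache setup from the snippet above):
# Identical string: cache hit, answered in microseconds
llm("Tell me a joke")

# Semantically the same question, but a different string: cache miss,
# so another (billable) LLM call is made
llm("Tell me a joke.")
llm("Please tell me a joke")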
GPTCache Integration
Integrating GPTCache significantly improves the functionality of the LangChain cache module and increases the cache hit rate, thus reducing LLM usage costs and response times. GPTCache first runs an embedding operation on the input to obtain a vector, then conducts a vector similarity search in the cache storage. After receiving the search results, it performs a similarity evaluation and returns the cached answer when the configured threshold is reached. Adjusting this threshold changes the accuracy of the fuzzy search results; a sketch of how it can be tuned follows the example below.
An example of using GPTCache for similarity search in LangChain:
import hashlib

import langchain
from gptcache import Cache
from gptcache.adapter.api import init_similar_cache
from langchain.cache import GPTCache

def get_hashed_name(name):
    return hashlib.sha256(name.encode()).hexdigest()

def init_gptcache(cache_obj: Cache, llm: str):
    # Keep a separate cache directory per LLM so different models do not share entries
    hashed_llm = get_hashed_name(llm)
    init_similar_cache(cache_obj=cache_obj, data_dir=f"similar_cache_{hashed_llm}")

langchain.llm_cache = GPTCache(init_gptcache)
# The first time, it is not yet in cache, so it should take longer
# CPU times: user 1.42 s, sys: 279 ms, total: 1.7 s
# Wall time: 8.44 s
llm("Tell me a joke")
# This is an exact match, so it finds it in the cache
# CPU times: user 866 ms, sys: 20 ms, total: 886 ms
# Wall time: 226 ms
llm("Tell me a joke")
# This is not an exact match, but semantically within distance so it hits!
# CPU times: user 853 ms, sys: 14.8 ms, total: 868 ms
# Wall time: 224 ms
llm("Tell me joke")
We continue to build out the functionality of GPTCache, and if you have a chance to try it out, let us know what you think. We also have several great resources for you to check out!