Using Your Vector Database as a JSON (or Relational) Datastore
TL;DR: Vector databases support CRUD over "traditional" data formats such as JSON. If you're a solo developer or a small team and don't want to manage many different pieces of data infrastructure, you can use Milvus or Zilliz Cloud (the managed Milvus) as your only datastore and easily migrate vectorless collections to different databases as you scale.
Powered by the popularity of ChatGPT and other autoregressive language models, vector search has exploded in popularity in the past year. As a result, we've seen many companies and organizations hop on the vector search bandwagon, from NoSQL database providers such as MongoDB (via Atlas Vector Search) to traditional relational databases such as Postgres (via pgvector). The general messaging I hear around these vector search plugins is largely the same and goes something like this: developers should stick with us since you can store tables/JSON in addition to vectors, so there is no need to manage multiple pieces of infrastructure!
This kind of statement always cracks me up, as it's clearly crafted by unsophisticated marketing teams. Not only is the technology behind vector search vastly different from storage and querying strategies in relational & NoSQL databases, but it's fairly well-known now that vector databases can store relations, JSON documents, and other structured data sources. The first point is difficult to illustrate concisely without deep prior knowledge of database management systems, but the second point is fairly easy to show through some short sample code snippets. That's what this blog post is dedicated to.
Setting Up
Milvus stores data in units known as collections, analogous to tables in relational databases. Each collection can have its own schema, and schemas have a vector field of fixed dimensionality, e.g. 768 for vector embeddings based on e5-base. Let's create a collection to store JSON documents rather than vector data. To better illustrate this point, I've left out some of the earlier and later steps, such as calling connections.connect:
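from pymilvus import Collection, CollectionSchema, DataType, FieldSchema
fields = [
    # Auto-generated integer primary key.
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    # The schema needs a vector field, so we add a 1-dimensional placeholder.
    FieldSchema(name="_unused", dtype=DataType.FLOAT_VECTOR, dim=1)
]
# enable_dynamic_field=True lets rows carry keys that the schema doesn't declare.
schema = CollectionSchema(fields, "NoSQL data", enable_dynamic_field=True)
collection = Collection("json_data", schema)
# load() expects an index on the vector field; the full script below creates one.
collection.load()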
Here, we've told the collection to use Milvus' dynamic schema capability, which lets it accept JSON payloads as "extra" data associated with each row.
I hope you weren't expecting more, 'cause that's it - it's really that simple.
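To make the idea of "extra" data concrete, here is a small pure-Python sketch of how a row divides into schema-declared fields and dynamic ones. It only illustrates the concept and says nothing about how Milvus stores dynamic fields internally; split_row is a hypothetical helper, and the sample row is made up.

```python
# Field names declared in the schema from this post.
DECLARED_FIELDS = {"id", "_unused"}


def split_row(row, declared=DECLARED_FIELDS):
    """Separate schema-declared fields from the "extra" dynamic ones."""
    fixed = {k: v for k, v in row.items() if k in declared}
    dynamic = {k: v for k, v in row.items() if k not in declared}
    return fixed, dynamic


# A made-up row: one declared field plus two undeclared keys.
row = {"_unused": [0.0], "party": "Democrat", "state": "OR"}
fixed, dynamic = split_row(row)
print(fixed)    # {'_unused': [0.0]}
print(dynamic)  # {'party': 'Democrat', 'state': 'OR'}
```

Everything that lands in the second dictionary is what the dynamic schema lets us insert without declaring it up front.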
CRUD Operations
You can interact with the collection we've created above as you would do with any other database. As an example, let's download the list of current US senators (in JSON format) from govtrack:
import requests
r = requests.get("https://www.govtrack.us/api/v2/role?current=true&role_type=senator")
data = r.json()["objects"]
len(data)
100
We can store these documents directly in Milvus with just a tiny bit of data wrangling:
rows = [{"_unused": [0]} | d for d in data]
collection.insert(rows)
(insert count: 100, delete count: 0, upsert count: 0, ...
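The | operator used above merges two dictionaries and requires Python 3.9 or later. The sketch below shows an equivalent that also works on older versions; add_placeholder is a hypothetical helper, and the sample document is made up for illustration.

```python
def add_placeholder(doc, vector_field="_unused"):
    """Return a copy of doc with a 1-dimensional placeholder vector added.

    Keys already present in doc take precedence, which matches the
    behavior of {"_unused": [0]} | doc.
    """
    return {vector_field: [0.0], **doc}


sample = {"party": "Democrat", "state": "OR"}  # made-up document
row = add_placeholder(sample)
print(row)  # {'_unused': [0.0], 'party': 'Democrat', 'state': 'OR'}
```

With the senator data, the same wrangling step would read rows = [add_placeholder(d) for d in data].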
From here, we can perform queries directly over that data:
collection.query(
expr="party like 'Dem%'",
limit=1,
output_fields=["person"]
)
[{'person': {'bioguideid': 'C000127',
'birthday': '1958-10-13',
'cspanid': 26137,
'fediverse_webfinger': None,
'firstname': 'Maria',
'gender': 'female',
'gender_label': 'Female',
'lastname': 'Cantwell',
'link': 'https://www.govtrack.us/congress/members/maria_cantwell/300018',
'middlename': '',
'name': 'Sen. Maria Cantwell [D-WA]',
'namemod': '',
'nickname': '',
'osid': 'N00007836',
'pvsid': None,
'sortname': 'Cantwell, Maria (Sen.) [D-WA]',
'twitterid': 'SenatorCantwell',
'youtubeid': 'SenatorCantwell'},
'id': 447376465724036249}]
Without specifying any vector field, we've queried our database for the first result where the party field in the accompanying JSON payload has "Dem" (Democrat) as a prefix.
This step hopefully demonstrates the capability to perform structured data searches in Milvus, but it's not a particularly useful query. Let's find all senators from my home state of Oregon:
collection.query(
expr="state in ['OR']",
limit=10,
output_fields=["person"]
)
[{'person': {'bioguideid': 'M001176',
'birthday': '1956-10-24',
'cspanid': 1029842,
'fediverse_webfinger': None,
'firstname': 'Jeff',
'gender': 'male',
'gender_label': 'Male',
'lastname': 'Merkley',
'link': 'https://www.govtrack.us/congress/members/jeff_merkley/412325',
'middlename': '',
'name': 'Sen. Jeff Merkley [D-OR]',
'namemod': '',
'nickname': '',
'osid': 'N00029303',
'pvsid': None,
'sortname': 'Merkley, Jeff (Sen.) [D-OR]',
'twitterid': 'SenJeffMerkley',
'youtubeid': 'SenatorJeffMerkley'},
'id': 447376465724036286},
{'person': {'bioguideid': 'W000779',
'birthday': '1949-05-03',
'cspanid': 1962,
'fediverse_webfinger': None,
'firstname': 'Ron',
'gender': 'male',
'gender_label': 'Male',
'lastname': 'Wyden',
'link': 'https://www.govtrack.us/congress/members/ron_wyden/300100',
'middlename': '',
'name': 'Sen. Ron Wyden [D-OR]',
'namemod': '',
'nickname': '',
'osid': 'N00007724',
'pvsid': None,
'sortname': 'Wyden, Ron (Sen.) [D-OR]',
'twitterid': 'RonWyden',
'youtubeid': 'senronwyden'},
'id': 447376465724036331}]
Even though we specified limit=10, only two documents were returned (since each state has only two senators). Let's narrow down our query even more to get only the senior senator:
collection.query(
expr="state in ['OR'] and senator_rank in ['senior']",
limit=10,
output_fields=["person"]
)
[{'person': {'bioguideid': 'W000779',
'birthday': '1949-05-03',
'cspanid': 1962,
'fediverse_webfinger': None,
'firstname': 'Ron',
'gender': 'male',
'gender_label': 'Male',
'lastname': 'Wyden',
'link': 'https://www.govtrack.us/congress/members/ron_wyden/300100',
'middlename': '',
'name': 'Sen. Ron Wyden [D-OR]',
'namemod': '',
'nickname': '',
'osid': 'N00007724',
'pvsid': None,
'sortname': 'Wyden, Ron (Sen.) [D-OR]',
'twitterid': 'RonWyden',
'youtubeid': 'senronwyden'},
'id': 447376465724036331}]
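Filter expressions like the ones above are plain strings, so they are easy to assemble programmatically. Here is a minimal sketch of two hypothetical helpers, in_filter and all_of, that rebuild the expression from the previous query. To stay on the safe side, the sketch rejects values containing quotes or backslashes instead of guessing at escaping rules.

```python
def in_filter(field, values):
    """Build an expression such as: state in ['OR', 'WA']."""
    quoted = []
    for value in values:
        # Escaping rules are out of scope for this sketch, so refuse
        # anything that could break out of the quoted string.
        if not isinstance(value, str) or "'" in value or "\\" in value:
            raise ValueError(f"unsupported filter value: {value!r}")
        quoted.append(f"'{value}'")
    return f"{field} in [{', '.join(quoted)}]"


def all_of(*clauses):
    """Combine clauses so that every one of them must match."""
    return " and ".join(clauses)


expr = all_of(
    in_filter("state", ["OR"]),
    in_filter("senator_rank", ["senior"]),
)
print(expr)  # state in ['OR'] and senator_rank in ['senior']
```

The resulting string can be passed as the expr argument of collection.query, exactly like the hand-written version.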
Perhaps I would like to update Senator Ron Wyden's profile with a bit more info. We can easily do so by retrieving the entire document with output_fields=["*"], updating the resulting document, and inserting it back into our database without the old primary key:
expr = "state in ['OR'] and senator_rank in ['senior']"
res = collection.query(
expr=expr,
limit=10,
output_fields=["*"]
)
res[0].update({
"elected_in": "1996",
"college": "Stanford University",
"college_major": "Political Science"
})
del res[0]["id"]
collection.delete(expr)
collection.insert(res)
(insert count: 1, delete count: 0, upsert count: 0, ...
Let's see if this worked as expected.
collection.query(
expr=expr,
limit=10,
output_fields=["elected_in", "college", "college_major"]
)
[{'elected_in': '1996',
'college': 'Stanford University',
'college_major': 'Political Science',
'id': 447376465724036353}]
The data indeed matches the updates we made.
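The "modify, drop the key, re-insert" step can be wrapped in a small function. This is a sketch under the same assumptions as the example above (an auto-generated primary key named id); prepare_replacement is a hypothetical helper, and the row below is made up.

```python
import copy


def prepare_replacement(doc, changes, primary_key="id"):
    """Return an updated deep copy of doc without its auto-generated key."""
    updated = copy.deepcopy(doc)
    updated.update(changes)
    # The database assigns a fresh primary key on insert, so drop the old one.
    updated.pop(primary_key, None)
    return updated


# A made-up row standing in for a query result.
old = {"id": 1, "_unused": [0.0], "state": "OR", "senator_rank": "senior"}
new = prepare_replacement(old, {"elected_in": "1996"})
print(new)
# {'_unused': [0.0], 'state': 'OR', 'senator_rank': 'senior', 'elected_in': '1996'}
```

Against the collection from this post, the follow-up calls would be collection.delete(expr) and then collection.insert([new]). Since these are two separate calls, it is worth keeping the original document around until the insert has succeeded.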
The Full Script
Here's the whole script from front to back for convenience, without the extraneous queries. I used our Zilliz Cloud free tier instead of milvus-lite:
import os
import requests
from pymilvus import connections
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema
# Uncomment this if you're using `milvus-lite`
#from milvus import default_server
#default_server.start()
#connections.connect(host="127.0.0.1", port=default_server.listen_port)
# Uncomment this if you're using Zilliz Cloud
connections.connect(
    uri=os.environ["ZILLIZ_URI"],
    token=os.environ["ZILLIZ_TOKEN"]
)
# Create the schema for our new collection. We turn on Milvus' dynamic
# schema capability in order to store arbitrarily large (or small) JSON blobs
# in each row.
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="_unused", dtype=DataType.FLOAT_VECTOR, dim=1)
]
schema = CollectionSchema(fields, "Milvus as a JSON datastore", enable_dynamic_field=True)
# Now let's create the collection.
collection = Collection("json_data", schema)
index_params = {
"index_type": "AUTOINDEX",
"metric_type": "L2",
"params": {}
}
collection.create_index(
field_name="_unused",
index_params=index_params
)
collection.load()
# Insert US senator JSON data into our newly formed collection.
r = requests.get("https://www.govtrack.us/api/v2/role?current=true&role_type=senator")
data = r.json()["objects"]
rows = [{"_unused": [0]} | d for d in data]
collection.insert(rows)
# Fetch the senior senator from Oregon
top_k = 1
collection.query(
expr="state in ['OR'] and senator_rank in ['senior']",
limit=top_k,
output_fields=["person"]
)
[{'person': {'bioguideid': 'W000779',
'birthday': '1949-05-03',
'cspanid': 1962,
'fediverse_webfinger': None,
'firstname': 'Ron',
'gender': 'male',
'gender_label': 'Male',
'lastname': 'Wyden',
'link': 'https://www.govtrack.us/congress/members/ron_wyden/300100',
'middlename': '',
'name': 'Sen. Ron Wyden [D-OR]',
'namemod': '',
'nickname': '',
'osid': 'N00007724',
'pvsid': None,
'sortname': 'Wyden, Ron (Sen.) [D-OR]',
'twitterid': 'RonWyden',
'youtubeid': 'senronwyden'},
'id': 447376465724036331}]
Try it for yourself!
pymongo -> milvusmongo
One last note before wrapping up - I've created a small Python package called milvusmongo that implements pymongo's most basic CRUD functionality across collections. It uses Milvus as the underlying database rather than MongoDB. Like pymongo, milvusmongo supports both dictionary and attribute-style access and abstracts away extra logic needed for CRUD calls (i.e. insert_one, update_one, and delete_one). You can install it with pip install milvusmongo:
% pip install milvus
% pip install milvusmongo
Once that's done, you can start up an embedded Milvus instance and use it in conjunction with milvusmongo to perform queries over JSON data. For example:
from milvus import default_server
default_server.start()
from milvusmongo import MongoClient
client = MongoClient("127.0.0.1", 19530)
client.insert_one(my_document)
Please note that this library is meant to demonstrate Milvus' flexibility as a datastore rather than serve as something you should use in large-scale production environments.
Closing Words
Milvus isn't here to replace NoSQL databases or lexical text search engines; we're here to provide you with the best possible vector/filtered search experience. More importantly, we're here to help accelerate the adoption of vector search as a technology - it's why open source is such a core part of our ethos.
But that doesn't mean we don't support other types of data. As a solo developer or a small startup, you're free to use Milvus as your only data store. You can always optimize your infrastructure usage later as you grow. Milvus will provide you with best-in-class vector search capabilities, while other databases are there to store, index, and search other forms of data. Once your application starts requiring more complex workloads (such as joins or aggregations), that's when you'll want to contemplate using different data stores.