Using Similarity Search - How Not to Lose Meetup Content on the Internet
Have you ever experienced the frustration of recalling a brilliant idea shared at a Meetup, only to find it lost in the maze of past events? As an organizer and attendee, I've felt the sting of valuable content slipping through the cracks once the event wraps up. Too often, insights shared in one corner of the globe remain siloed, inaccessible to those who could benefit elsewhere.
To tackle this issue, I turned to similarity search techniques to sift through the extensive array of unstructured data. Unstructured data is commonly estimated to account for around 80% of the world’s data and can be converted into vectors using different machine learning models. My chosen tool for this task was Milvus, a popular open-source vector database that excels in managing and searching through complex data landscapes. Milvus makes it possible to discover underlying connections and similarities.
To compute Embeddings, I am using SentenceTransformers. It is a Python framework for state-of-the-art sentence, text, and image embeddings. You can use it to compute sentence/text embeddings for more than 100 languages. These embeddings can then be compared, e.g., with cosine similarity, to find sentences with a similar meaning.
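To make the comparison step concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made-up toy values for illustration only; real embeddings from SentenceTransformers have hundreds of dimensions, and the function name is my own.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, not real embeddings)
talk_a = [0.9, 0.1, 0.0]
talk_b = [0.8, 0.2, 0.1]
talk_c = [0.0, 0.1, 0.9]

# talk_a points in nearly the same direction as talk_b, but not talk_c
print(cosine_similarity(talk_a, talk_b))
print(cosine_similarity(talk_a, talk_c))
```

A score close to 1 means the two vectors point in almost the same direction, which is how two texts with a similar meaning should look after embedding.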
Downloading the Data
Meetup.com does not have a free public API. You will need a Pro account to access it, and you can check it yourself here.
Because the API isn’t public, I generated some data from a Meetup group I run. You can find the sample data on GitHub and load it with Pandas.
import pandas as pd
df = pd.read_csv('data/data_meetup.csv')
The Tech Stack: Milvus and SentenceTransformers
We will use Milvus as the vector database, SentenceTransformers to generate text embeddings, and OpenAI GPT-3.5-turbo to summarize the essence of each Meetup. Event descriptions usually contain a lot of noise, so summarizing them helps.
Milvus Lite
Milvus offers different deployment options to suit different needs. For a lightweight and straightforward setup, Milvus Lite is ideal. It can be installed from PyPI with pip install milvus and run directly within a Jupyter notebook, providing a hassle-free way to incorporate vector database capabilities into our project.
Milvus with Docker/Docker Compose
For more robust needs, Milvus can be deployed using Docker Compose, as it is a distributed system.
The Docker Compose file is available on the Install Milvus Standalone page and the Milvus GitHub. When you spin up Milvus with Docker Compose, you will see three containers and connect to Milvus through port 19530 by default.
SentenceTransformers
SentenceTransformers is used to create our embeddings. It is available on PyPI with pip install sentence-transformers. We will use the model all-MiniLM-L6-v2, as it is 5 times smaller and 5 times faster than the best model they offer while still providing good quality.
Perform Similarity Search Queries
Start Milvus
To perform a similarity search, we need to have a vector database. To start it, simply import the default_server and call the start() function.
from milvus import default_server
default_server.start()
Populate the Data into Milvus
Before being able to add data to Milvus, we need to create a Collection and prepare a Schema. First, prepare the necessary parameters, including the field schema, collection schema, and collection name.
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection
# Connect to the Milvus Lite server started above
connections.connect(host="127.0.0.1", port=default_server.listen_port)
# We insert the object in the format of title, date, content, content embedding
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=500),
    FieldSchema(name="date", dtype=DataType.VARCHAR, max_length=100),
    FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=10000),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384)
]
schema = CollectionSchema(fields=fields)
collection = Collection(name="mlops_meetups", schema=schema)
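The schema fixes a maximum length for each text field and a dimension of 384 for the vector, so a row that breaks one of those limits will be rejected at insert time. As a rough sketch, the check below mirrors those limits in plain Python so problems surface before anything is sent to Milvus. The helper validate_row and the constants are hypothetical names of mine, not part of pymilvus, and measuring length in UTF-8 bytes is a conservative assumption on my side.

```python
# Limits copied from the schema in this article
MAX_LENGTHS = {"title": 500, "date": 100, "content": 10000}
EMBEDDING_DIM = 384


def validate_row(row):
    """Return a list of problems found in one row (empty list means OK)."""
    problems = []
    for field, limit in MAX_LENGTHS.items():
        value = row.get(field)
        if not isinstance(value, str):
            problems.append(f"{field}: expected a string")
        elif len(value.encode("utf-8")) > limit:
            # Assumption: count bytes rather than characters to stay on the safe side
            problems.append(f"{field}: longer than {limit} bytes")
    embedding = row.get("embedding")
    if embedding is None or len(embedding) != EMBEDDING_DIM:
        problems.append(f"embedding: expected {EMBEDDING_DIM} values")
    return problems


# Hypothetical example row with a placeholder embedding
example_row = {
    "title": "Example Meetup",
    "date": "2023-06-30",
    "content": "A talk about open-source ML platforms.",
    "embedding": [0.0] * EMBEDDING_DIM,
}
print(validate_row(example_row))
```

Running a check like this over every row of the DataFrame before inserting makes it easier to see which Meetup description is too long, rather than having the whole insert fail.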
Now that the Collection and schema are created, we can create an index for the embedding field and load the data into memory with the load() function.
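# Set the metric explicitly so it matches the IP metric used at search time
collection.create_index(field_name="embedding", index_params={"index_type": "FLAT", "metric_type": "IP", "params": {}})
collection.load()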
Create Embeddings with SentenceTransformer
As said previously, we will use SentenceTransformer and the model 'all-MiniLM-L6-v2' to create our embeddings. Let’s import what is needed for that.
from sentence_transformers import SentenceTransformer
transformer = SentenceTransformer('all-MiniLM-L6-v2')
content_detail = df['content'].tolist()
embeddings = [transformer.encode(c) for c in content_detail]
# Create an embedding column in our Dataframe
df['embedding'] = embeddings
# Insert the data in the collection
collection.insert(data=df)
Summarize the Content of Meetups
Descriptions of Meetups, while informative, can also be quite noisy. They usually contain schedule information, sponsor details, and rules about the venue or event. While those matter a lot when you attend a Meetup, they don’t matter for our use case, so I use OpenAI GPT-3.5-turbo to summarize the content.
# Requires the openai package (v1 or later); the API key is read from the OPENAI_API_KEY environment variable
import openai

def summarise_meetup_content(content: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": "Summarize content you are provided with."
            },
            {
                "role": "user",
                "content": f"{content}"
            }
        ],
        temperature=0,
        max_tokens=1024,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    summary = response.choices[0].message.content
    return summary
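Since every summary is a separate API request, the same Meetup description may get summarized again each time it shows up in a search result. One possible way to avoid that, sketched here under my own assumptions, is to memoize the function with functools.lru_cache. To keep the example runnable without an API key, fake_summarise is a hypothetical stand-in for summarise_meetup_content; it simply returns the first sentence.

```python
from functools import lru_cache

# Counts how often the (stand-in) summarizer is really invoked
calls = {"count": 0}


def fake_summarise(content: str) -> str:
    """Hypothetical stand-in for summarise_meetup_content (no API call)."""
    calls["count"] += 1
    return content.split(".")[0].strip() + "."


@lru_cache(maxsize=256)
def cached_summary(content: str) -> str:
    """Summarize content once and reuse the result for identical input."""
    return fake_summarise(content)


description = "A main talk on ML monitoring. Doors open at 6 pm. Food is provided."
first = cached_summary(description)
second = cached_summary(description)  # Served from the cache
print(first, calls["count"])
```

In the real pipeline you would wrap summarise_meetup_content instead of the stand-in. The cache only lives for the duration of the Python session, so this is a convenience rather than persistent storage.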
Return Similar Content
For our similarity search to work, the search terms must live in the same vector space as the stored content. Let’s create embeddings of our search terms with the same model.
search_terms = "The speaker speaks about Open Source and ML Platform"
search_data = [transformer.encode(search_terms)] # Must be a list.
Search the Milvus Collection for Similar Content
res = collection.search(
    data=search_data,  # Embedded search value
    anns_field="embedding",  # Search across embeddings
    param={"metric_type": "IP"},
    limit=3,  # Limit to top_k results per search
    output_fields=["title", "content"]  # Include title and content in the results
)
for hits in res:
    print("Search Terms:", search_terms)
    print("Results:")
    for hit in hits:
        print(hit.entity.get("title"), "----", hit.distance)
        print(f'{summarise_meetup_content(hit.entity.get("content"))} \n')
Results
Search terms: The speaker speaks about Open Source and ML Platform
Results:
First MLOps.community Berlin Meetup ---- 0.5537542700767517
The MLOps.community meetup in Berlin on June 30th will feature a main talk by Stephen Batifol from Wolt on Scaling Open-Source Machine Learning. The event will also include lightning talks, networking, and food and drinks. The agenda includes opening doors at 6:00 pm, Stephen's talk at 7:00 pm, lightning talks at 7:50 pm, and socializing at 8:15 pm. Attendees can sign up for lightning talks on Meetup.com. The event is in collaboration with neptune.ai.
MLOps.community Berlin 04: Pre-event Women+ In Data and AI Festival ---- 0.4623506963253021
The MLOps.community Berlin is hosting a special edition event on June 29th and 30th at Thoughtworks. The event is a warm-up for the Women+ In Data and AI festival. The meetup will feature speakers Fiona Coath discussing surveillance capitalism and Magdalena Stenius talking about the carbon footprint of machine learning. The agenda includes talks, lightning talks, and networking opportunities. Attendees are encouraged to review and abide by the event's Code of Conduct for an inclusive and respectful environment.
MLOps.community Berlin Meetup 02 ---- 0.41342616081237793
The MLOps.community meetup in Berlin on October 6th will feature a main talk by Lina Weichbrodt on ML Monitoring, lightning talks, and networking opportunities. The event will be held at Wolt's office with a capacity limit of 150 people. Lina has extensive experience in developing scalable machine learning models and has worked at companies like Zalando and DKB. The agenda includes food, a bonding activity, the main talk, lightning talks, and socializing. Attendees can also sign up to give lightning talks on various MLOps-related topics. The event is in collaboration with neptune.ai.
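To help interpret the scores printed next to each title, here is a small sketch of what a search with the IP (inner product) metric computes conceptually: score every stored vector against the query and keep the highest ones. This is a brute-force illustration in plain Python, not how Milvus works internally, and the titles and two-dimensional vectors are made-up toy values.

```python
import heapq


def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_inner_product(query, vectors, k=3):
    """Return the k (score, title) pairs with the highest inner product."""
    scored = [(inner_product(query, vec), title) for title, vec in vectors.items()]
    return heapq.nlargest(k, scored)


# Hypothetical toy data, each vector already has unit length
stored = {
    "open source talk": [0.8, 0.6],
    "monitoring talk": [0.6, 0.8],
    "social evening": [0.0, 1.0],
}
query = [1.0, 0.0]

for score, title in top_k_by_inner_product(query, stored, k=2):
    print(title, "----", score)
```

A higher inner product means a closer match, which is why the results are listed from the highest score to the lowest. When all vectors have unit length, as in this toy data, the inner product is the same number as the cosine similarity.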
Feel free to check out the code on GitHub.