Using Similarity Search - How Not to Lose Meetup Content on the Internet

Mar 19, 2024 · 5 min read

Have you ever experienced the frustration of recalling a brilliant idea shared at a Meetup, only to find it lost in the maze of past events? As an organizer and attendee, I've felt the sting of valuable content slipping through the cracks once the event wraps up. Too often, insights shared in one corner of the globe remain siloed, inaccessible to those who could benefit elsewhere.

To tackle this issue, I turned to similarity search techniques to sift through the extensive array of unstructured data. Unstructured data is commonly estimated to make up around 80% of the world’s data, and it can be converted into vectors using different machine learning models. My chosen tool for this task was Milvus, a popular open-source vector database that excels at managing and searching through complex data landscapes. Milvus makes it possible to discover underlying connections and similarities.

To compute embeddings, I use SentenceTransformers, a Python framework for state-of-the-art sentence, text, and image embeddings. You can use it to compute sentence and text embeddings for more than 100 languages. These embeddings can then be compared, e.g., with cosine similarity, to find sentences with a similar meaning.
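
To make the comparison step concrete, here is a minimal sketch of cosine similarity in plain Python. The function name `cosine_similarity` and the toy vectors are illustrative assumptions; real embeddings have hundreds of dimensions and would normally be compared with a vectorized library or by the vector database itself.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (made up for illustration).
talk_a = [0.9, 0.1, 0.0]
talk_b = [0.8, 0.2, 0.1]
talk_c = [0.0, 0.1, 0.9]

print(cosine_similarity(talk_a, talk_b))  # close to 1: similar direction
print(cosine_similarity(talk_a, talk_c))  # close to 0: nearly orthogonal
```

Texts with a similar meaning should end up with vectors that point in a similar direction, which is what a score close to 1 expresses.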

Downloading the Data

Meetup.com does not have a free public API; you need a Meetup Pro account to access it.

Because the API isn’t freely accessible, I generated some data from a Meetup group I run. You can find the sample data on GitHub and load it with Pandas.

import pandas as pd
df = pd.read_csv('data/data_meetup.csv')

The Tech Stack: Milvus and SentenceTransformers

We will use Milvus as the vector database, SentenceTransformers to generate text embeddings, and OpenAI GPT-3.5-turbo to summarize the essence of each Meetup. Event descriptions usually contain a lot of noise, so summarizing them helps.

Milvus Lite

Milvus offers several deployment options to suit different needs. For a lightweight and straightforward setup, Milvus Lite is ideal. It can be installed from PyPI with pip install milvus and run directly within a Jupyter notebook, providing a hassle-free way to add vector database capabilities to our project.

Milvus with Docker / Docker Compose

Because Milvus is a distributed system made up of several components, Docker Compose is a convenient way to deploy it for more robust needs.
The Docker Compose file is available on the Install Milvus Standalone page and in the Milvus GitHub repository. When you spin up Milvus with Docker Compose, you will see three containers, and you connect to Milvus through port 19530 by default.

SentenceTransformers

SentenceTransformers is used to create our embeddings; it is available on PyPI with pip install sentence-transformers. We will use the model all-MiniLM-L6-v2, as it is 5 times smaller and 5 times faster than the best model they offer, and it still offers good quality.

Perform Similarity Search Queries

Start Milvus

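To perform a similarity search, we need a running vector database. To start Milvus Lite, import default_server, call its start() function, and then connect to the server with pymilvus.

from milvus import default_server
from pymilvus import connections
default_server.start()
connections.connect(host="127.0.0.1", port=default_server.listen_port)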

Populate the Data into Milvus

Before being able to add data to Milvus, we need to create a Collection and prepare a Schema. First, prepare the necessary parameters, including the field schema, collection schema, and collection name.

from pymilvus import FieldSchema, CollectionSchema, DataType, Collection

# We insert the object in the format of title, date, content, content embedding
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=500),
    FieldSchema(name="date", dtype=DataType.VARCHAR, max_length=100),
    FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=10000),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384)
]
schema = CollectionSchema(fields=fields)
collection = Collection(name="mlops_meetups", schema=schema)

Now that the Collection and schema are created, we can create an index for the embedding field and load the data into memory with the load() function.

collection.create_index(field_name="embedding")
collection.load()
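
If you prefer to spell out the index settings rather than rely on defaults, the sketch below passes explicit `index_params` to `create_index`. The helper name `create_embedding_index` and the HNSW values (`M`, `efConstruction`) are illustrative assumptions, not settings taken from this project; the metric defaults to `IP` so that it matches the search call used later in this post.

```python
def create_embedding_index(collection, metric_type="IP"):
    """Create an HNSW index on the "embedding" field and load the collection.

    `collection` is expected to be a pymilvus Collection. The HNSW
    parameters below are common starting values, not tuned ones.
    """
    index_params = {
        "index_type": "HNSW",
        "metric_type": metric_type,  # should match the metric used at search time
        "params": {"M": 16, "efConstruction": 200},
    }
    collection.create_index(field_name="embedding", index_params=index_params)
    collection.load()
    return index_params
```

Keeping the metric in one place makes it harder to index with one metric and search with another.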

Create Embeddings with SentenceTransformer

As mentioned earlier, we will use SentenceTransformer with the model 'all-MiniLM-L6-v2' to create our embeddings. Let’s import what we need.

from sentence_transformers import SentenceTransformer

transformer = SentenceTransformer('all-MiniLM-L6-v2')
content_detail = df['content']
content_detail = content_detail.tolist()
embeddings = [transformer.encode(c) for c in content_detail]

# Create an embedding column in our Dataframe
df['embedding'] = embeddings

# Insert the data in the collection
collection.insert(data=df)
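
Inserting a whole DataFrame only works when its columns line up with the schema. As a hedged alternative, the sketch below builds one dictionary per Meetup and checks each row against the schema limits first. The helpers `truncate_utf8` and `prepare_rows` are hypothetical, and the sketch assumes that the CSV provides `title`, `date`, and `content` columns, that your pymilvus version accepts a list of row dictionaries in `insert`, and that VARCHAR `max_length` is a byte limit (worth verifying for your Milvus version).

```python
EMBEDDING_DIM = 384        # matches dim=384 in the schema
MAX_CONTENT_BYTES = 10000  # matches max_length=10000 for "content"

def truncate_utf8(text, max_bytes):
    """Cut text so that its UTF-8 encoding fits in max_bytes."""
    encoded = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops a multi-byte character split by the cut.
    return encoded.decode("utf-8", errors="ignore")

def prepare_rows(records, embeddings):
    """Pair each record with its embedding and validate against the schema."""
    if len(records) != len(embeddings):
        raise ValueError("need exactly one embedding per record")
    rows = []
    for record, embedding in zip(records, embeddings):
        vector = [float(x) for x in embedding]
        if len(vector) != EMBEDDING_DIM:
            raise ValueError(
                f"expected {EMBEDDING_DIM} dimensions, got {len(vector)}"
            )
        rows.append({
            "title": truncate_utf8(str(record["title"]), 500),
            "date": truncate_utf8(str(record["date"]), 100),
            "content": truncate_utf8(str(record["content"]), MAX_CONTENT_BYTES),
            "embedding": vector,
        })
    return rows
```

With the DataFrame from this post, this could be used as `rows = prepare_rows(df.to_dict("records"), embeddings)` followed by `collection.insert(rows)`.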

Summarize the Content of Meetups

Meetup descriptions, while informative, can also be quite noisy. They usually contain schedule information, sponsor details, and various rules about the venue or event. While those are very important when you attend a Meetup, they don’t matter for our use case, so I use OpenAI GPT-3.5-turbo to summarize the content.

import openai  # expects the OPENAI_API_KEY environment variable to be set

def summarise_meetup_content(content: str) -> str:
    # Ask GPT-3.5-turbo for a summary of one Meetup description
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": "Summarize content you are provided with."
            },
            {
                "role": "user",
                "content": content
            }
        ],
        temperature=0,  # keep the output as deterministic as possible
        max_tokens=1024,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    summary = response.choices[0].message.content
    return summary
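
Since the same Meetup can come back for many different searches, it may be worth summarizing each description only once. The sketch below wraps any summarizer function in a small in-memory cache; `make_cached_summarizer` and `fake_summarize` are hypothetical names used for illustration, and the fake summarizer stands in for the OpenAI call so the example runs without an API key.

```python
from functools import lru_cache

def make_cached_summarizer(summarize, maxsize=256):
    """Wrap a summarizer so identical content is only summarized once."""
    @lru_cache(maxsize=maxsize)
    def cached(content: str) -> str:
        return summarize(content)
    return cached

# Stand-in summarizer for demonstration; swap in summarise_meetup_content.
calls = []

def fake_summarize(content: str) -> str:
    calls.append(content)
    return content[:20]

cached_summarize = make_cached_summarizer(fake_summarize)
cached_summarize("Doors open at 6:00 pm, main talk at 7:00 pm.")
cached_summarize("Doors open at 6:00 pm, main talk at 7:00 pm.")
print(len(calls))  # number of times the underlying summarizer actually ran
```

The cache lives only for the lifetime of the process, so a notebook restart starts from scratch.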

Return Similar Content

For our similarity search to work, the vector database has to be able to compare our search terms with the stored content, so let’s create an embedding of the search terms with the same model.

search_terms = "The speaker speaks about Open Source and ML Platform"
search_data = [transformer.encode(search_terms)] # Must be a list.

Search the Milvus Collection for Similar Content

res = collection.search(
    data=search_data,  # Embedded search value
    anns_field="embedding",  # Search across embeddings
    param={"metric_type": "IP"},  # Inner product: a higher score means more similar
    limit=3,  # Limit to top_k results per search
    output_fields=["title", "content"]  # Include title and content fields in result
)

for hits in res:
    print("Search Terms:", search_terms)
    print("Results:")
    for hit in hits:
        content = hit.entity.get("content")
        print(hit.entity.get("title"), "----", hit.distance)
        print(f"{summarise_meetup_content(content)} \n")
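
Conceptually, the search call ranks stored vectors by their inner product with the query vector and keeps the best `limit` matches. The brute-force sketch below illustrates that idea in plain Python; `top_k_inner_product`, the titles, and the toy vectors are made up for illustration, and Milvus itself relies on an index rather than scanning every vector. With the `IP` metric a larger score means a closer match, and for unit-length vectors the inner product equals cosine similarity.

```python
import heapq

def top_k_inner_product(query, vectors, k=3):
    """Return the k (score, key) pairs with the highest inner product."""
    scored = (
        (sum(q * v for q, v in zip(query, vec)), key)
        for key, vec in vectors.items()
    )
    return heapq.nlargest(k, scored)

# Toy unit-length vectors keyed by a made-up meetup title.
stored = {
    "open source ML": [1.0, 0.0],
    "ML monitoring": [0.6, 0.8],
    "venue rules": [0.0, 1.0],
}
query = [0.8, 0.6]

for score, title in top_k_inner_product(query, stored, k=2):
    print(title, "----", round(score, 2))
```

The printed pairs mirror the `title ---- distance` lines produced by the loop above, with the best match first.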

Results

Milvus returned three meetups about Open-Source and ML Platform hosted by MLOps.community and Neptune.ai.

Below are details:

Search terms: The speaker speaks about Open Source and ML Platform

Results:

First MLOps.community Berlin Meetup ---- 0.5537542700767517
The MLOps.community meetup in Berlin on June 30th will feature a main talk by Stephen Batifol from Wolt on Scaling Open-Source Machine Learning. The event will also include lightning talks, networking, and food and drinks. The agenda includes opening doors at 6:00 pm, Stephen's talk at 7:00 pm, lightning talks at 7:50 pm, and socializing at 8:15 pm. Attendees can sign up for lightning talks on Meetup.com. The event is in collaboration with neptune.ai.

MLOps.community Berlin 04: Pre-event Women+ In Data and AI Festival ---- 0.4623506963253021
The MLOps.community Berlin is hosting a special edition event on June 29th and 30th at Thoughtworks. The event is a warm-up for the Women+ In Data and AI festival. The meetup will feature speakers Fiona Coath discussing surveillance capitalism and Magdalena Stenius talking about the carbon footprint of machine learning. The agenda includes talks, lightning talks, and networking opportunities. Attendees are encouraged to review and abide by the event's Code of Conduct for an inclusive and respectful environment. 

MLOps.community Berlin Meetup 02 ---- 0.41342616081237793
The MLOps.community meetup in Berlin on October 6th will feature a main talk by Lina Weichbrodt on ML Monitoring, lightning talks, and networking opportunities. The event will be held at Wolt's office with a capacity limit of 150 people. Lina has extensive experience in developing scalable machine learning models and has worked at companies like Zalando and DKB. The agenda includes food, a bonding activity, the main talk, lightning talks, and socializing. Attendees can also sign up to give lightning talks on various MLOps-related topics. The event is in collaboration with neptune.ai.

Feel free to check out the code on GitHub.

Updated on Oct 25, 2024

Stephen Batifol
Stephen Batifol is a Developer Advocate at Zilliz. He previously worked as a Machine Learning Engineer at Wolt, where he was working on the ML Platform and as a Data Scientist at Brevo. Stephen studied Computer Science and Artificial Intelligence. He enjoys dancing and surfing.
