Exploring BERTopic: An Advanced Neural Topic Modeling Technique
BERTopic is a novel topic modeling technique that allows for easily interpretable topics while keeping important words in the topic descriptions.
Read the entire series
- Natural Language Processing Fundamentals: Tokens, N-Grams, and Bag-of-Words Models
- Primer on Neural Networks and Embeddings for Language Models
- Sparse and Dense Embeddings
- Sentence Transformers for Long-Form Text
- Training Your Own Text Embedding Model
- Evaluating Your Embedding Model
- Class Activation Mapping (CAM): Better Interpretability in Deep Learning Models
- CLIP Object Detection: Merging AI Vision with Language Understanding
- Discover SPLADE: Revolutionizing Sparse Data Processing
- Exploring BERTopic: An Advanced Neural Topic Modeling Technique
- Streamlining Data: Effective Strategies for Reducing Dimensionality
- All-Mpnet-Base-V2: Enhancing Sentence Embedding with AI
- Time Series Embedding in Data Analysis
- Enhancing Information Retrieval with Sparse Embeddings
- What is BERT (Bidirectional Encoder Representations from Transformers)?
- What is Mixture of Experts (MoE)?
As we navigate the vast ocean of digital information, the need for tools to extract meaningful insights from unstructured text data has never been more critical. BERTopic stands at the forefront of this transformative era of machine learning, employing neural network-based techniques to uncover themes and patterns in large text corpora with unprecedented accuracy and depth.
In this blog, we will explore the intricacies of the BERTopic topic modeling technique, from its reliance on transformer models to its innovative approach to clustering and dimensionality reduction.
What Is Topic Modeling?
Before we dive into BERTopic, we should first understand topic modeling.
Topic modeling is a method for unearthing the latent themes or “topics” within a document or a collection of documents. It involves examining the text within these documents to detect patterns and relationships that indicate the presence of those topics. For instance, a document focused on artificial intelligence will likely contain terms like “large language models” (LLMs) and “ChatGPT,” unlike a document centered on baking bread.
Topic modeling has existed since the 1990s, and popular techniques include Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Non-negative Matrix Factorization (NMF). However, these conventional techniques either fail to capture the semantic relationships between words or struggle to extract the most representative topic description for a given group of documents. Evaluating topic models can also be challenging due to the somewhat subjective nature of the task, which makes it important to visualize different aspects of a topic model to improve understanding and adjust it to user preferences.
What Is BERTopic?
BERTopic is a novel topic modeling technique that simplifies the topic modeling process. It uses various embedding techniques and class-based TF-IDF (c-TF-IDF) to create dense clusters, allowing for easily interpretable topics while keeping important words in the topic descriptions. It can analyze latent topics in clusters of varying densities and extract topics with the most relevant keywords. BERTopic builds on existing document embedding-based topic modeling techniques, but its flexibility and robustness set it apart from pre-existing solutions. Additionally, BERTopic leverages pre-trained models and supports incremental, batched, and federated learning, which helps address challenges around data privacy and computational resources.
At a high level, the BERTopic model approaches topic modeling in four steps:
- Document Embedding: Convert documents into vector embeddings using a pre-trained transformer language model like Bidirectional Encoder Representations from Transformers (BERT).
- Dimensionality Reduction: Compress the vector embeddings into a lower-dimensional space.
- Clustering: Group these embeddings to gather similar documents into the same category.
- Topic Extraction: Extract topic names using a class-based variation of TF-IDF.
Document Embedding
BERTopic starts by transforming our input documents into numerical representations called vector embeddings. BERTopic allows you to choose any state-of-the-art embedding model capable of capturing the semantic essence of the text. In the original BERTopic paper, Sentence-BERT (SBERT) was employed as the embedding model due to its robust performance across various sentence embedding tasks. One such model, all-MiniLM-L6-v2, is accessible through the Hugging Face Hub. Alternatively, proprietary models from OpenAI, such as text-embedding-ada-002, text-embedding-3-small, or text-embedding-3-large, are other options for generating embeddings, though these require paid API access.
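As a concrete illustration, here is a minimal sketch of the embedding step using the sentence-transformers library with the all-MiniLM-L6-v2 model mentioned above; the sample documents are hypothetical:

```python
# A minimal sketch of the document embedding step, assuming the
# sentence-transformers package and the all-MiniLM-L6-v2 model.
from sentence_transformers import SentenceTransformer

docs = [
    "Large language models like ChatGPT are built on transformers.",
    "Sourdough bread needs flour, water, salt, and a ripe starter.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(docs, show_progress_bar=True)
print(embeddings.shape)  # (2, 384): each document becomes a 384-dimensional vector
```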
Dimensionality Reduction
Embeddings are high-dimensional, which can slow down the subsequent clustering step. Dimensionality reduction also helps us visualize our data when assessing whether it can be clustered. Therefore, after building our embeddings, BERTopic compresses them into a lower-dimensional space.
In this step, the embedded document vectors are projected into a smaller embedding space, allowing the clustering algorithm to form coherent clusters. Many solutions are available for dimensionality reduction, such as Principal Component Analysis (PCA) or t-SNE (t-distributed Stochastic Neighbor Embedding). Still, the paper's author recommends UMAP (Uniform Manifold Approximation and Projection), as it preserves both local and global structure while projecting the embeddings to lower dimensions.
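To make this concrete, here is a sketch of the reduction step with the umap-learn package; the parameter values mirror commonly cited BERTopic defaults but should be treated as illustrative:

```python
import numpy as np
from umap import UMAP

# Stand-in for the document embeddings produced in the previous step
embeddings = np.random.rand(1000, 384).astype(np.float32)

# n_neighbors, n_components, min_dist, and metric are illustrative values
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
reduced = umap_model.fit_transform(embeddings)
print(reduced.shape)  # (1000, 5): compressed to a 5-dimensional space
```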
Clustering
After reducing the dimensionality of our input embeddings, we can apply a clustering algorithm to create document clusters. This step is important because the more performant our clustering technique is, the more accurate our topic representations will be.
A density-based clustering approach such as DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is recommended here, as it allows clusters of varying densities, which suits topic modeling well. In the BERTopic paper, the author recommends HDBSCAN, a hierarchical variant of the original DBSCAN algorithm. HDBSCAN is better suited than DBSCAN because it:
- Does not need the number of topics specified beforehand.
- Effectively handles outliers.
However, there is no perfect clustering model; you might want to use something entirely different for your use case.
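Below is a small sketch of this clustering step with the hdbscan package; min_cluster_size is an illustrative value, and note that no number of clusters is specified up front:

```python
import numpy as np
from hdbscan import HDBSCAN

# Stand-in for the reduced embeddings from the previous step
reduced = np.random.rand(1000, 5)

clusterer = HDBSCAN(min_cluster_size=15, metric="euclidean",
                    cluster_selection_method="eom")
labels = clusterer.fit_predict(reduced)
print(set(labels))  # cluster ids; the label -1 marks outlier documents
```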
Topic Extraction
The final step in BERTopic is extracting topics for each of our clusters. To do this, BERTopic uses a modified version of TF-IDF called class-based TF-IDF, also known as c-TF-IDF.
TF-IDF stands for Term Frequency-Inverse Document Frequency, an algorithm used to quantify a word's relevance to a document. In the class-based variant, all documents in a cluster are concatenated and treated as a single document. Instead of measuring a word's relevance to a document, c-TF-IDF reflects a word's relevance to a cluster.
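The idea can be sketched in a few lines of NumPy. This toy example follows the c-TF-IDF weighting from the BERTopic paper, tf(t, c) * log(1 + A / f(t)), where A is the average number of words per class and f(t) is the frequency of term t across all classes; the clusters and documents are hypothetical, and this is not BERTopic's internal code:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Each "document" is all documents of one cluster concatenated together
cluster_docs = [
    "bread yeast flour bake oven bread dough",       # hypothetical baking cluster
    "model llm chatgpt transformer model training",  # hypothetical AI cluster
]

vectorizer = CountVectorizer()
tf = vectorizer.fit_transform(cluster_docs).toarray()  # term frequency per class

A = tf.sum() / tf.shape[0]         # average number of words per class
f_t = tf.sum(axis=0)               # frequency of each term across all classes
ctfidf = tf * np.log(1 + A / f_t)  # c-TF-IDF weight of each term per class

terms = vectorizer.get_feature_names_out()
for c in range(ctfidf.shape[0]):
    top = np.argsort(ctfidf[c])[::-1][:3]
    print(f"Cluster {c}:", [terms[i] for i in top])
```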
Other Optional Steps: Visualize Topic Hierarchy
In addition to the four key steps outlined above, the BERTopic approach also involves optional steps such as tokenization and representation fine-tuning, depending on users' specific requirements. These optional steps can provide interesting perspectives by exploring different topic representations, enhancing the coherence and quality of topic interpretations.
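On a fitted model (see the Getting Started section below), the topic hierarchy can be inspected with a single call:

```python
# Assumes `model` is a fitted BERTopic instance
model.visualize_hierarchy()
```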
An overview of the BERTopic library (image source)
BERTopic Practical Use Cases and Applications
BERTopic has seen many applications across different sectors and industries in recent years, including use at Fortune 500 companies like Meta, Microsoft, Cisco, NVIDIA, and Amazon. Developers and organizations use BERTopic in use cases ranging from cancer research and voice perception studies to analyzing employee surveys and social media content. Visualizing the topic hierarchy is valuable in these applications, as it helps users understand complex topic structures and clarifies the connections between topics.
Some of the real-world applications of BERTopic include:
- Telefonica, a multinational telecommunications company, adopted BERTopic for topic modeling and classification of customer reviews to improve user experience (UX) and reveal useful customer insights.
- The U.S. Department of Homeland Security uses BERTopic to analyze employee surveys, identifying the key topics discussed and assessing their sentiment.
BERTopic Challenges and Considerations
Using BERTopic for topic modeling can pose challenges, such as choosing the right embedding model, multilingual support, offline execution, slow inference, and more. Let's shed light on some common issues and their solutions when generating topics with BERTopic:
Memory: BERTopic tends to run into out-of-memory issues when modeling topics on larger datasets. The primary cause is usually UMAP, which can be configured to run with a smaller memory footprint. Second, we can skip calculating the full document-topic probability distribution in the topic extraction phase and limit the probability matrix to the relevant top-K topics. Third, we can shrink the TF-IDF matrix by raising the minimum frequency a word must reach to be considered a candidate topic word.
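These remedies map onto a few constructor options; a hedged sketch with illustrative values:

```python
from bertopic import BERTopic
from umap import UMAP
from sklearn.feature_extraction.text import CountVectorizer

umap_model = UMAP(low_memory=True)             # run UMAP with a smaller memory footprint
vectorizer_model = CountVectorizer(min_df=10)  # ignore rare words to shrink the TF-IDF matrix

model = BERTopic(
    umap_model=umap_model,
    vectorizer_model=vectorizer_model,
    calculate_probabilities=False,  # skip the full document-topic probability matrix
)
```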
Speed: BERTopic runs slowly on larger document collections, mostly due to the embedding phase. To speed up the algorithm, you can compute the embeddings beforehand, in parallel if possible. Another solution is to use a GPU if you have access to one, or to use the free tiers on Google Colab or Kaggle.
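For example, embeddings can be computed once (ideally on a GPU) and passed directly to fit_transform so they are not recomputed on every run; a sketch:

```python
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# Precompute embeddings once, then reuse them across experiments
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(docs, show_progress_bar=True)

model = BERTopic(embedding_model=embedder)
topics, probs = model.fit_transform(docs, embeddings)  # skips re-embedding
```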
Topic Number: BERTopic often generates a large number of topics, which is not always helpful. The easiest way to reduce the number of topics is to raise the minimum topic size. Conversely, when too few topics are generated, you can increase the number and diversity of documents in the dataset or lower the minimum topic size.
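Both knobs are available on the model; the values below are illustrative, and docs stands for your document list:

```python
from bertopic import BERTopic

# Option 1: a larger minimum topic size yields fewer, broader topics
model = BERTopic(min_topic_size=50)
topics, probs = model.fit_transform(docs)  # docs: your list of documents

# Option 2: merge similar topics after fitting
model.reduce_topics(docs, nr_topics=20)
```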
Getting Started with BERTopic
BERTopic is an open-source project hosted on GitHub. The package has over 5,000 stars and provides extensive documentation for getting started with the framework. BERTopic offers a modular approach to using different algorithms for each step, allowing you to build your own customized topic model.
BERTopic essentially allows you to build your own topic model (image source)
Each step is a building block, and the library offers multiple options for each phase. For example, if we choose spaCy for document embedding, PCA for dimensionality reduction, K-means for clustering, CountVectorizer for tokenization, and c-TF-IDF for topic extraction, we get an end-to-end customized BERTopic model, as sketched below.
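Here is a hedged sketch of that combination; it assumes spaCy's en_core_web_md model is installed, and relies on BERTopic accepting any reducer with fit/transform and any clusterer with fit/predict in place of UMAP and HDBSCAN:

```python
from bertopic import BERTopic
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
import spacy

embedding_model = spacy.load("en_core_web_md")  # spaCy pipeline with word vectors (assumed installed)
dim_model = PCA(n_components=5)                 # swapped in for UMAP
cluster_model = KMeans(n_clusters=10)           # swapped in for HDBSCAN
vectorizer_model = CountVectorizer(stop_words="english")

model = BERTopic(
    embedding_model=embedding_model,
    umap_model=dim_model,
    hdbscan_model=cluster_model,
    vectorizer_model=vectorizer_model,
)
```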
To start with BERTopic, first install this package using:
!pip install bertopic
!pip install bertopic[visualization]
Next, set up the dataset whose topics you want to model. We will use the 20 Newsgroups dataset, which contains English documents.
# Load the Data
from sklearn.datasets import fetch_20newsgroups
docs2 = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
docs2
Initialize the BERTopic model and find the topics.
from bertopic import BERTopic

# Initialize the BERTopic model
model = BERTopic(verbose=True, nr_topics=10)

# Let's find the latent topics
topics, probs = model.fit_transform(docs2)
Display the top topics extracted, then inspect one topic closely to see its top words and understand the relation between the topic and its relevant documents.
# Select the most frequent topics
model.get_topic_freq().head(11)

# Let's look at the top words of one topic
model.get_topic(3)
BERTopic supports many visualizations. Let's start by visualizing the topic clusters:
# Visualize topic clusters
model.visualize_topics()
Finally, we can also look at the top words of each topic:
# Visualize top words in a topic
model.visualize_barchart()
If you want to build more sophisticated topic modeling solutions with more control, check the official BERTopic documentation and its GitHub repository.
In Summary About BERTopic
BERTopic is an excellent framework for quick, off-the-shelf topic modeling that helps make sense of large document collections. BERTopic has many advantages, including:
- No need for data preprocessing.
- Flexibility to try different document embeddings from Gensim, Flair, spaCy, and even state-of-the-art models from OpenAI or Hugging Face.
- A wide range of visualizations to inspect and analyze the modeled topics.
- Wide applicability, from scholarly research in cancer and voice perception studies to practical analyses in corporate settings such as employee feedback and social media content.