Exploring BERTopic: A New Era of Neural Topic Modeling
BERTopic is a novel topic modeling technique that allows for easily interpretable topics while keeping important words in the topic descriptions.
Read the entire series
- Natural Language Processing Fundamentals: Tokens, N-Grams, and Bag-of-Words Models
- Primer on Neural Networks and Embeddings for Language Models
- Sparse and Dense Embeddings
- Sentence Transformers for Long-Form Text
- Training Your Own Text Embedding Model
- Evaluating Your Embedding Model
- Class Activation Mapping: Unveiling The Visual Story
- CLIP Object Detection: Merging AI Vision with Language Understanding
- Discover SPLADE: Revolutionizing Sparse Data Processing
- Exploring BERTopic: A New Era of Neural Topic Modeling
- Streamlining Data: Effective Strategies for Reducing Dimensionality
- All-Mpnet-Base-V2: Enhancing Sentence Embedding with AI
- Time Series Embedding in Data Analysis
- Enhancing Information Retrieval with Sparse Embeddings
As we navigate the vast ocean of digital information, the need for tools to extract meaningful insights from unstructured text data has never been more critical. BERTopic stands at the forefront of this transformative era, employing neural network-based techniques to uncover themes and patterns in large text corpora with unprecedented accuracy and depth.
In this blog, we will explore the intricacies of the BERTopic topic modeling technique, from its reliance on transformer models to its innovative approach to clustering and dimensionality reduction.
What Is Topic Modeling?
Before we dive into BERTopic, let's review what topic modeling is.
Topic modeling is a method for unearthing the latent themes or "topics" within a collection of documents. It involves examining the text within these documents to detect patterns and relationships that indicate the presence of these topics. For instance, a document focused on artificial intelligence will likely contain terms like "large language models" and "ChatGPT," unlike a document centered on baking bread.
Topic modeling has existed since 1990, and popular techniques include Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Non-negative Matrix Factorization (NMF). However, these conventional techniques either don't capture the semantic relationships between words or struggle to extract the most representative topic for a given group of documents.
What Is BERTopic?
BERTopic is a novel topic modeling technique that simplifies the topic modeling process. It uses various embedding techniques and class-based TF-IDF (c-TF-IDF) to create dense clusters, allowing for easily interpretable topics while keeping important words in the topic descriptions. It can analyze latent topics in clusters of varying densities and extract topics with the most relevant keywords. BERTopic builds on existing embedding-based topic modeling approaches, but its flexibility and robustness set it apart from earlier solutions.
BERTopic approaches topic modeling in four steps on a high level:
Document Embedding: Convert documents into embeddings using a pre-trained transformer language model like Bidirectional Encoder Representations from Transformers (BERT).
Dimensionality Reduction: Compress the embeddings into a lower-dimensional space.
Clustering: Group these embeddings to gather similar documents in one category.
Topic Extraction: Extract topic representations (the most relevant keywords per cluster) using a class-based variation of TF-IDF.
Document Embedding
BERTopic starts by transforming the input documents into numerical representations called embeddings. BERTopic lets you choose any embedding model capable of capturing the semantic essence of the text. In the original BERTopic paper, Sentence-BERT (SBERT) was used as the embedding model because of its robust performance across various sentence embedding tasks. A commonly used SBERT model, all-MiniLM-L6-v2, is available through the Hugging Face Hub. Alternatively, proprietary models from OpenAI, such as text-embedding-ada-002, text-embedding-3-small, or text-embedding-3-large, are other options for generating embeddings, though these are paid API services.
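As a rough sketch of this step, the snippet below wraps the `sentence-transformers` package in a hypothetical helper, `embed_documents`. The package choice and the helper name are assumptions; the text above names the model, not the loading code. A small cosine-similarity function applied to hand-made toy vectors illustrates what "semantically close" means. The toy vectors are invented for illustration and are not real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed_documents(docs, model_name="all-MiniLM-L6-v2"):
    """Hypothetical helper: encode documents with a Sentence Transformers model.

    Assumes `pip install sentence-transformers`. The import is done lazily so
    the toy example below runs without the package installed.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(docs, show_progress_bar=False)


# Toy 3-dimensional "embeddings", invented for illustration only.
toy = {
    "doc_llm": [0.9, 0.1, 0.0],
    "doc_chatgpt": [0.8, 0.2, 0.1],
    "doc_bread": [0.0, 0.2, 0.9],
}

sim_ai = cosine_similarity(toy["doc_llm"], toy["doc_chatgpt"])
sim_mixed = cosine_similarity(toy["doc_llm"], toy["doc_bread"])
print(f"AI vs AI: {sim_ai:.2f}, AI vs bread: {sim_mixed:.2f}")
```

Calling `embed_documents(docs)` on real text returns one vector per document. Documents about similar subjects should end up with a higher cosine similarity than unrelated ones, which is the property the later clustering step relies on.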
Dimensionality Reduction
Embeddings are high-dimensional, which can slow down the clustering step that follows. Dimensionality reduction also helps us visualize the data when assessing whether it can be clustered. Therefore, after building the embeddings, BERTopic compresses them into a lower-dimensional space.
In this step, the document embeddings are projected into a smaller space, which helps the clustering algorithm form coherent clusters. Many techniques are available for dimensionality reduction, such as Principal Component Analysis (PCA) or t-distributed Stochastic Neighbor Embedding (t-SNE). Still, the paper's author recommends UMAP (Uniform Manifold Approximation and Projection), as it preserves both local and global structure when projecting to lower dimensions.
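A minimal sketch of this step with the `umap-learn` package might look like the following. The helper name `reduce_embeddings` is hypothetical, and the parameter values are illustrative starting points rather than tuned recommendations.

```python
def reduce_embeddings(embeddings, n_components=5, random_state=42):
    """Hypothetical helper: project embeddings to `n_components` dimensions with UMAP.

    Assumes `pip install umap-learn`. Parameter values are illustrative, not tuned.
    """
    from umap import UMAP  # imported lazily: umap-learn is a separate install

    reducer = UMAP(
        n_neighbors=15,             # size of the local neighborhood considered
        n_components=n_components,  # dimensionality of the output space
        min_dist=0.0,               # allow points to pack tightly, which suits clustering
        metric="cosine",            # a common choice for text embeddings
        random_state=random_state,  # fixes the seed for repeatable results
    )
    return reducer.fit_transform(embeddings)
```

Given an array with one embedding per row, `reduce_embeddings(embeddings)` returns an array with the same number of rows and `n_components` columns.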
Clustering
After reducing the dimensionality of our input embeddings, we can apply a clustering algorithm to create document clusters. This process is important because the more performant our clustering technique is, the more accurate our topic representations will be.
Density-based clustering is well suited to topic modeling because it does not force every document into a cluster. In the BERTopic paper, the author recommends HDBSCAN, a hierarchical variant of the original DBSCAN algorithm that can find clusters of varying densities. HDBSCAN suits this task because it:
Does not require specifying the number of topics beforehand.
Effectively handles outliers.
However, there is no perfect clustering model; you might want to use something entirely different for your use case.
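To make the two properties listed above concrete, here is a small pure-Python sketch of plain DBSCAN. It is a simplified stand-in, not HDBSCAN: it uses a single global distance threshold `eps`, which is exactly the restriction HDBSCAN relaxes to handle varying densities. In practice you would use a library implementation instead. The sample points are invented for illustration.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: returns one cluster label per point, -1 for outliers."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # All points within distance eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point, so keep expanding
        cluster_id += 1
    return labels


# Two tight groups and one far-away point.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (50, 50),
]
labels = dbscan(points, eps=1.5, min_samples=3)
print(labels)
```

The number of clusters is never passed in, and the isolated point receives the label -1 instead of being forced into a group. BERTopic treats such outlier documents as belonging to no topic.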
Topic Extraction
The final step in BERTopic is extracting topics for each of our clusters. To do this, BERTopic uses a modified version of TF-IDF called class-based TF-IDF, also known as c-TF-IDF.
TF-IDF stands for Term Frequency-Inverse Document Frequency, a measure of how relevant a word is to a document within a collection. In the class-based variant, all documents in a cluster are concatenated and treated as a single document. Instead of measuring a word's relevance to a document, c-TF-IDF measures a word's relevance to a cluster.
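As I understand the BERTopic documentation, the weight of a term t in a cluster c can be written as tf(t, c) * log(1 + A / tf(t)), where tf(t, c) is the frequency of t in the cluster, tf(t) is its frequency across all clusters, and A is the average number of words per cluster. The sketch below implements that formula in plain Python on invented toy clusters. The function name `c_tf_idf` is hypothetical, and the library's own implementation may differ in details such as normalization.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_cluster):
    """Score words per cluster as tf(t, c) * log(1 + A / tf(t)).

    docs_per_cluster maps a cluster label to its list of documents.
    A is the average number of words per cluster and tf(t) is the
    frequency of word t across all clusters.
    """
    # Concatenate each cluster's documents into one bag of words.
    counts = {
        label: Counter(" ".join(docs).lower().split())
        for label, docs in docs_per_cluster.items()
    }
    total = Counter()
    for cluster_counts in counts.values():
        total.update(cluster_counts)
    avg_words = sum(total.values()) / len(counts)

    return {
        label: {
            word: tf * math.log(1 + avg_words / total[word])
            for word, tf in cluster_counts.items()
        }
        for label, cluster_counts in counts.items()
    }


# Toy clusters, invented for illustration only.
clusters = {
    "ai": ["the model learns language", "the model generates text"],
    "baking": ["the bread dough rises", "bake the bread"],
}
scores = c_tf_idf(clusters)
top_words = {label: max(words, key=words.get) for label, words in scores.items()}
print(top_words)
```

In this toy data, "the" and "model" occur equally often inside the "ai" cluster, but "the" also occurs in the other cluster, so it receives a lower score. That is how c-TF-IDF favors words that characterize one cluster over words that are common everywhere.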
Other Optional Steps
In addition to the four key steps mentioned above, the BERTopic approach also involves optional steps such as tokenization and representation fine-tuning based on users’ specific requirements.
An overview of the BERTopic library; Image source: [https://maartengr.github.io/BERTopic/algorithm/algorithm.html](https://maartengr.github.io/BERTopic/algorithm/algorithm.html)
BERTopic Practical Use Cases and Applications
BERTopic has been applied across many sectors and industries in recent years, including at Fortune 500 companies like Meta, Microsoft, Cisco, NVIDIA, and Amazon. Developers and organizations use BERTopic in a range of domains, from cancer research and voice perception studies to analyzing employee surveys and social media content.
Some of the real-world applications of BERTopic include:
Telefonica, a multinational telecommunications company, adopted BERTopic for topic modeling and classification of customer reviews to improve user experience (UX) and reveal useful customer information.
The U.S. Department of Homeland Security uses BERTopic to analyze employee surveys, identifying the key topics discussed and assessing sentiment.
McMaster University uses BERTopic in a research project to categorize first impressions based on voice perception by analyzing descriptions from participants who listen to recordings of people saying "Hi."
Iodine Software uses BERTopic to analyze physician-created documents from hospitals. They employ it to identify themes within the text and ensure a clear interpretation of medical documentation.
Challenges and Considerations
Using BERTopic for topic modeling can pose challenges such as choosing the right embedding model, multilingual support, running offline, slow inference, and more. Let's look at some common issues and their solutions:
Memory: BERTopic can run out of memory when modeling topics for larger datasets. The usual culprit is UMAP, which can be run in a low-memory mode. Second, we can skip computing the full document-topic probability distribution during topic extraction, or limit the probability matrix to the top-K most relevant topics. Third, we can shrink the c-TF-IDF matrix by raising the minimum frequency a word must reach before it is considered for a topic representation.
Speed: BERTopic runs slowly on large document collections, mainly because of the embedding phase. To speed things up, you can compute the embeddings ahead of time, in batches or in parallel, and pass them to the model. Another option is to use a GPU if you have access to one, for example through the free tiers of Google Colab or Kaggle.
Topic Number: BERTopic often generates a very large number of topics, which makes the results hard to interpret. A simple way to reduce the number of topics is to increase the minimum topic size. In the opposite scenario, when very few topics are generated, you can add more (and more diverse) documents to the dataset or decrease the minimum topic size.
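The settings discussed above can be sketched as a single configuration. The helper names `build_lean_topic_model` and `fit_with_precomputed_embeddings` are hypothetical, and the numeric values are placeholders to tune for your own corpus, not recommendations.

```python
def build_lean_topic_model(min_topic_size=50, min_df=10):
    """Hypothetical helper: a BERTopic configuration aimed at large corpora.

    Assumes `pip install bertopic`. The numeric values are placeholders to tune.
    """
    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    # Drop rare words to shrink the c-TF-IDF matrix.
    vectorizer = CountVectorizer(stop_words="english", min_df=min_df)

    return BERTopic(
        low_memory=True,                # ask UMAP to trade speed for lower memory use
        calculate_probabilities=False,  # skip the full document-topic probability matrix
        min_topic_size=min_topic_size,  # larger values yield fewer, bigger topics
        vectorizer_model=vectorizer,
    )


def fit_with_precomputed_embeddings(topic_model, docs, embeddings):
    """Reuse embeddings computed ahead of time (for example on a GPU machine)."""
    return topic_model.fit_transform(docs, embeddings)
```

Precomputing the embeddings once and passing them to `fit_transform` means you can experiment with different clustering or vectorizer settings without paying the embedding cost again.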
Getting Started with BERTopic
BERTopic is an open-source project hosted on GitHub. The package has over 5,000 stars and provides extensive documentation for getting started with the framework. BERTopic offers a modular approach to using different algorithms for each step, allowing you to build your own topic model.
BERTopic essentially allows you to build your own topic model. Image source: [https://maartengr.github.io/BERTopic/algorithm/algorithm.html](https://maartengr.github.io/BERTopic/algorithm/algorithm.html)
Each step is a building block, and the library offers multiple options for each phase. For example, choosing spaCy for document embedding, PCA for dimensionality reduction, K-means for clustering, CountVectorizer for tokenizing the combined documents of each cluster, and c-TF-IDF for topic extraction gives us an end-to-end customized BERTopic model.
To start with BERTopic, first install this package using:
!pip install bertopic
!pip install bertopic[visualization]
Next, set up the dataset for which you want to model the topics. I am using the 20 Newsgroups dataset, which contains English documents.
# Load the Data
from sklearn.datasets import fetch_20newsgroups
docs2 = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
docs2
Initialize the BERTopic model and find the topics.
from bertopic import BERTopic
# Initialize the BERTopic model and ask it to reduce the result to 10 topics
model = BERTopic(verbose=True, nr_topics=10)
# Let’s find the latent topics
topics, probs = model.fit_transform(docs2)
Display the top topics extracted and inspect one topic closely to see its top words, to understand the relation between the topic and its relevant documents.
# Select top topics by frequency
model.get_topic_freq().head(11)
# Let's look at the top words of topic 3
model.get_topic(3)
BERTopic supports many visualizations for inspecting topics. Let's start with the intertopic distance map, which shows the size of each topic and how the topics relate to one another:
# Visualize topic clusters
model.visualize_topics()
Finally, we can also look at the top words in a cluster or in a topic:
# Visualize top words in a topic
model.visualize_barchart()
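The modular design described earlier can also be sketched in code. The example below swaps in PCA and K-means, as in the customized pipeline mentioned above, but uses a Sentence Transformers model name instead of spaCy to keep the sketch short. The helper name `build_custom_topic_model` and the value of `n_topics` are assumptions made for illustration.

```python
def build_custom_topic_model(n_topics=20):
    """Hypothetical helper: BERTopic with PCA and K-means instead of UMAP and HDBSCAN.

    Assumes `pip install bertopic`. `n_topics` is a placeholder value.
    """
    from bertopic import BERTopic
    from bertopic.vectorizers import ClassTfidfTransformer
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.feature_extraction.text import CountVectorizer

    return BERTopic(
        embedding_model="all-MiniLM-L6-v2",         # document embedding
        umap_model=PCA(n_components=5),             # dimensionality reduction
        hdbscan_model=KMeans(n_clusters=n_topics),  # clustering
        vectorizer_model=CountVectorizer(stop_words="english"),  # tokenization
        ctfidf_model=ClassTfidfTransformer(),       # topic extraction
    )
```

One trade-off to keep in mind: K-means assigns every document to a cluster and needs the number of clusters up front, so this variant gives up the outlier handling discussed in the clustering section.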
If you want to build more sophisticated topic modeling solutions with more control, check these resources:
In Summary About BERTopic
BERTopic is an amazing framework that allows quick off-the-shelf topic modeling to help understand a large collection of documents. BERTopic has many advantages, including:
No need for data preprocessing.
Flexibility to try different document embeddings from Gensim, Flair, spaCy, and now even state-of-the-art LLMs from OpenAI or Hugging Face.
Awesome range of visualizations to inspect and analyze the modeled topics.
The wide applicability of BERTopic, from scholarly research in cancer and voice perception to practical analyses in corporate settings like employee feedback and social media monitoring, underscores its significance and potential impact across different sectors. Companies like Meta, Microsoft, and Amazon, among others, have already harnessed the power of BERTopic to improve their operations and strategies.
As we continue to generate and rely on vast amounts of text data, tools like BERTopic become indispensable in our toolkit for data analysis and knowledge discovery. It democratizes the power of advanced topic modeling for a broader audience and sets a new standard in the field, promising ongoing evolution and enhancement to meet the dynamic demands of data-driven industries.