1.7. Top2Vec in Python#
Now that we understand the primary concepts behind leveraging transformers and sentence embeddings to perform topic modeling, let’s examine a key library that simplifies this entire workflow into a single line of Python: Top2Vec.
Unlike a traditional topic modeling approach, or even the manual approach with transformers that we saw in the previous section, Top2Vec embeds not only all the documents in your corpus, but all the words as well. In addition, it embeds the topics themselves. This means that words, documents, and topics all occupy the same high-dimensional space. On the surface, this may not seem that important, but it allows us to reason mathematically about the relationships between these three interconnected pieces of topic modeling. Top2Vec, for example, lets us measure the mathematical similarity between a given word and a document or a topic. Understanding these relationships reveals broader patterns in our topics that may otherwise be missed.
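To make “same space” concrete: the similarity between any two of these embeddings is typically measured with cosine similarity. Below is a minimal sketch using small, made-up vectors purely for illustration; real Top2Vec embeddings are learned and have hundreds of dimensions, and the library computes these scores for you.

import numpy as np

# Hypothetical 3-dimensional embeddings for a word and a topic;
# real Top2Vec vectors are much higher-dimensional.
word_vec = np.array([0.9, 0.1, 0.2])
topic_vec = np.array([0.8, 0.3, 0.1])

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; values near 1 indicate similarity.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(word_vec, topic_vec))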
That said, Top2Vec does have several drawbacks. First, it can produce a high number of outliers: documents that do not fit neatly into any one topic. Top2Vec has a rather unique way of handling outliers, however. It generates an embedding for each of its topics (numbered 0 and above) and then assigns each outlier to its nearest topic embedding. This means that outliers may be quite pronounced within certain topics. For some workflows, this may not be an issue. Second, Top2Vec can produce results that are occasionally difficult to interpret; certain topics may not seem to have any coherence. This is common to all approaches to topic modeling, however.
To model our data, we once again need to load it. As before, we will use pandas.
import pandas as pd
df = pd.read_json("https://raw.githubusercontent.com/wjbmattingly/bap_sent_embedding/main/data/vol7.json")
df
| | names | descriptions |
|---|---|---|
| 0 | AARON, Thabo Simon | An ANCYL member who was shot and severely inju... |
| 1 | ABBOTT, Montaigne | A member of the SADF who was severely injured ... |
| 2 | ABDUL WAHAB, Zakier | A member of QIBLA who disappeared in September... |
| 3 | ABRAHAM, Nzaliseko Christopher | A COSAS supporter who was kicked and beaten wi... |
| 4 | ABRAHAMS, Achmat Fardiel | Was shot and blinded in one eye by members of ... |
| ... | ... | ... |
| 21742 | ZWENI, Ernest | One of two South African Police (SAP) members ... |
| 21743 | ZWENI, Lebuti | An ANC supporter who was shot dead by a named ... |
| 21744 | ZWENI, Louis | Was shot dead in Tokoza, Transvaal, on 22 May ... |
| 21745 | ZWENI, Mpantesa William | His home was lost in an arson attack by Witdoe... |
| 21746 | ZWENI, Xolile Milton | A Transkei Defence Force (TDF) soldier who was... |

21747 rows × 2 columns
Once again, we will convert all our descriptions into a single list of documents.
docs = df.descriptions.tolist()
docs[0]
"An ANCYL member who was shot and severely injured by SAP members at Lephoi, Bethulie, Orange Free State (OFS) on 17 April 1991. Police opened fire on a gathering at an ANC supporter's house following a dispute between two neighbours, one of whom was linked to the ANC and the other to the SAP and a councillor."
1.7.1. Creating a Top2Vec Model#
Now that we have our data, we can get started with Top2Vec. First, you will need to install the library on your system. To do this, you can use pip with the following command:
pip install top2vec
This will install Top2Vec as well as all its dependencies. Once it is installed, you will be able to import the Top2Vec class with the following line:
from top2vec import Top2Vec
Once we have imported the Top2Vec class, we can create our Top2Vec topic model with just one line of Python. We can pass many keyword arguments to the Top2Vec class, but the results from the default settings are usually quite good. Depending on your system, the code below may take several minutes or even hours to run.
model = Top2Vec(docs)
2022-11-13 13:58:09,857 - top2vec - INFO - Pre-processing documents for training
2022-11-13 13:58:11,379 - top2vec - INFO - Creating joint document/word embedding
2022-11-13 13:58:53,163 - top2vec - INFO - Creating lower dimension embedding of documents
2022-11-13 13:59:16,683 - top2vec - INFO - Finding dense areas of documents
2022-11-13 13:59:18,018 - top2vec - INFO - Finding topics
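Should the defaults not suit your corpus, a few keyword arguments are worth knowing. The sketch below uses illustrative values, not tuned recommendations, and would train a second model from scratch.

# Illustrative settings only; the defaults used above are usually a good first pass.
model_custom = Top2Vec(
    docs,
    embedding_model="doc2vec",  # the default; pretrained sentence encoders are also supported
    speed="learn",              # doc2vec training quality: "fast-learn", "learn", or "deep-learn"
    workers=4,                  # number of worker threads used during training
    min_count=50,               # ignore words that appear fewer than 50 times
)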
1.7.2. Analyzing our Topic Model#
Once our Top2Vec topic model has finished training, we can analyze the results. A good first step is to examine the topic sizes, which we can do with the model’s .get_topic_sizes() method. This method returns two arrays: the size of each topic and the corresponding topic number.
topic_sizes, topic_nums = model.get_topic_sizes()
We can better understand these results by zipping the two lists together and iterating over the first five topics to see how large they are.
for topic_size, topic_num in zip(topic_sizes[:5], topic_nums[:5]):
print(f"Topic Num {topic_num} has {topic_size} documents.")
Topic Num 0 has 763 documents.
Topic Num 1 has 346 documents.
Topic Num 2 has 247 documents.
Topic Num 3 has 232 documents.
Topic Num 4 has 231 documents.
Notice that the higher the topic number, the fewer documents it contains. This is because Top2Vec orders topics by size.
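As a quick sanity check (a small sketch using the model and topic_sizes from above), we can confirm that Top2Vec assigned every document to a topic: the topic sizes should sum to the size of the corpus. The .get_num_topics() method reports how many topics were found.

# The number of topics found, and a check that the topic sizes cover the whole corpus.
print(model.get_num_topics())
print(sum(topic_sizes), len(docs))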
We can use the .get_topics() method to extract information about our topics from the model. It takes one argument: the number of topics you want to examine. It returns three items: a list of the top words for each topic, a list of the corresponding word scores, and a list of the topic numbers.
topic_words, word_scores, topics = model.get_topics(2)
Let’s now examine the data for topic 1.
for words, scores, num in zip(topic_words[1:], word_scores[1:], topics[1:]):
print(f"Topic {num}")
for word, score in zip(words, scores):
print(word, score)
Topic 1
matlala 0.8607659
bk 0.8543051
gamatlala 0.8241883
proposed 0.794874
lebowa 0.74230933
resisted 0.67116654
independence 0.65576303
africa 0.6098439
chief 0.49963036
she 0.39457357
because 0.3908882
goederede 0.36000514
february 0.3470583
mahlangu 0.34650403
home 0.34591654
down 0.34060025
south 0.33364677
dennilton 0.3321176
burnt 0.32735756
would 0.3133811
supported 0.31247702
incorporation 0.3080003
stolen 0.28846616
from 0.28013867
had 0.27430868
it 0.2732216
lost 0.2716071
kwandebele 0.266503
congress 0.26175743
allegedly 0.25916496
supporters 0.25534543
village 0.25463325
her 0.25239387
mr 0.2508838
homeland 0.24748528
spite 0.24678819
about 0.24613273
maboloko 0.24536304
resulted 0.24119411
into 0.24059604
opposed 0.23594634
pietersburg 0.23558588
alleged 0.22179778
possessions 0.21927707
many 0.21916199
house 0.21861681
imbokodo 0.21361831
leeuwfontein 0.21190558
tribal 0.21142614
zeerust 0.21082786
This gives us a list of the 50 words most closely associated with Topic 1, each paired with a score. Remember, words, documents, and topics all occupy the same high-dimensional space, and this is that advantage in practice: we can calculate the degree to which a word is relevant to a topic from its proximity to the topic’s center. The higher the score, the closer a word lies to the topic’s center and, therefore, in theory, the more closely it represents that cluster.
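Because words also sit near other words in this space, we can ask the model for a given word’s nearest neighbors with the similar_words() method. A minimal sketch, assuming the word "shot" survived preprocessing into the model’s vocabulary:

# The ten words closest to "shot" in the shared embedding space.
words, scores = model.similar_words(keywords=["shot"], num_words=10)
for word, score in zip(words, scores):
    print(word, score)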
We can also use the search_documents_by_topic() method to learn about the documents that reside within a topic. This method takes two arguments: topic_num (the number of the topic you wish to examine) and num_docs (the number of documents you want returned). It returns three things: the documents, their scores (the degree to which they represent the topic), and their unique IDs.
documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=0, num_docs=10)
Now that we have our data, we can iterate over all three simultaneously, following the example provided in the Top2Vec README on GitHub.
for doc, score, doc_id in zip(documents, document_scores, document_ids):
print(f"Document: {doc_id}, Score: {score}")
print("-----------")
print(doc)
print("-----------")
print()
Document: 15060, Score: 0.9702943563461304
-----------
An ANC supporter who had her house burnt down by IFP supporters in Sonkombo, Ndwedwe, KwaZulu, near Durban, on 16 March 1994. See Sonkombo arson attacks.
-----------
Document: 5686, Score: 0.965286374092102
-----------
She lost her house in Sonkombo, Ndwedwe, KwaZulu, near Durban, in an arson attack by IFP supporters on 16 March 1994. See Sonkombo arson attacks.
-----------
Document: 21471, Score: 0.9650641083717346
-----------
An ANC supporter whose house was burnt down by IFP supporters in Sonkombo, Ndwedwe, KwaZulu, near Durban, on 16 March 1994. See Sonkombo arson attacks.
-----------
Document: 2957, Score: 0.9647031426429749
-----------
His home was burnt down by IFP supporters on 16 March 1994 at Sonkombo, Ndwedwe, KwaZulu, near Durban, in intense political conflict in the area. See Sonkombo arson attacks.
-----------
Document: 714, Score: 0.9634069204330444
-----------
An ANC supporter who had her home burnt down by IFP supporters at Sonkombo, Ndwedwe, KwaZulu, near Durban, on 16 March 1994. See Sonkombo arson attacks.
-----------
Document: 9855, Score: 0.9628040194511414
-----------
He had his house burnt down by IFP supporters in Sonkombo, Ndwedwe, KwaZulu, near Durban, on 16 March 1994. See Sonkombo arson attacks.
-----------
Document: 759, Score: 0.9617692232131958
-----------
Had her home burnt down by IFP supporters at Sonkombo, Ndwedwe, KwaZulu, near Durban, on 16 March 1994. See Sonkombo arson attacks.
-----------
Document: 519, Score: 0.9617335796356201
-----------
An ANC supporter who had her house burnt down by IFP supporters on 16 March 1994 at Sonkombo, Ndwedwe, KwaZulu, near Durban, in intense political conflict in the area. See Sonkombo arson attacks.
-----------
Document: 3424, Score: 0.9609926342964172
-----------
An ANC supporter whose home was burnt down by IFP supporters on 20 March 1994 at Sonkombo, Ndwedwe, KwaZulu, near Durban. See Sonkombo arson attacks.
-----------
Document: 12659, Score: 0.9603915810585022
-----------
An ANC supporter who had her house burnt down by IFP supporters in Sonkombo, Ndwedwe, KwaZulu, near Durban, on 20 March 1994. See Sonkombo arson attacks.
-----------
These results are good. All of these are individuals with near-identical descriptions in the TRC Volume 7 Final Report. In other words, Top2Vec has clearly isolated documents with identical or heavily overlapping language.
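Because documents and words occupy the same space, we can also retrieve documents by keyword rather than by topic, using the search_documents_by_keywords() method. A minimal sketch, assuming "arson" is in the model’s vocabulary:

# The three documents closest to the word "arson" in the embedding space.
documents, document_scores, document_ids = model.search_documents_by_keywords(keywords=["arson"], num_docs=3)
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print(doc)
    print()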
1.7.3. Working with Bigrams and Trigrams#
Top2Vec also wraps around the Gensim library and allows users to automatically create bigrams and trigrams. We can initiate this process by setting the keyword argument ngram_vocab to True:
model_ngram = Top2Vec(docs, ngram_vocab=True)
2022-11-13 13:59:18,293 - top2vec - INFO - Pre-processing documents for training
2022-11-13 13:59:19,889 - top2vec - INFO - Creating joint document/word embedding
2022-11-13 14:00:07,661 - top2vec - INFO - Creating lower dimension embedding of documents
2022-11-13 14:00:19,175 - top2vec - INFO - Finding dense areas of documents
2022-11-13 14:00:19,748 - top2vec - INFO - Finding topics
Now that our model_ngram is trained, let’s explore topic 1 once again.
topic_sizes_ngram, topic_nums_ngram = model_ngram.get_topic_sizes()
topic_words_ngram, word_scores_ngram, topics_ngram = model_ngram.get_topics(2)
for words, scores, num in zip(topic_words_ngram[1:], word_scores_ngram[1:], topics_ngram[1:]):
print(f"Topic {num}")
for word, score in zip(words, scores):
print(word, score)
Topic 1
island 0.7406236
imprisonment 0.7383562
years 0.7350636
robben 0.73397404
half years 0.7247492
years imprisonment 0.7206887
sentenced 0.6986825
served 0.695978
arrested 0.66353947
charged 0.66336066
banned 0.66275764
imprisoned 0.6580816
five years 0.6498868
suspended sentence 0.6363463
prison sentence 0.6340719
trial 0.6326254
sentence 0.62952983
activist 0.6273678
months 0.62542427
detained 0.6249351
prison 0.6236554
until 0.6223862
poqo 0.6147149
custody 0.6120131
year sentence 0.6114016
held 0.6040434
worcester 0.60011864
pac 0.5935757
released 0.5893775
poqo activities 0.58773917
charges 0.5808439
zwelethemba worcester 0.57607645
zweletemba worcester 0.575092
arrest 0.5745207
robben island 0.571052
under 0.56998795
worcester cape 0.56770444
detention 0.5670366
with public 0.56048626
ashton 0.55939454
later acquitted 0.5478358
tortured 0.5445701
charged with 0.54265773
act 0.5392955
interrogation 0.5369543
without 0.53288996
paarl 0.53190106
public 0.5305038
prison paarl 0.5296649
year prison 0.5292457
It is important to note that this Topic 1 is entirely different from Topic 1 of our first model, both because the initialization of the process is random and because the data used to embed the documents has changed (we are now accepting bigrams). Our results do reflect a deeper representation of our corpus: prison sentence, for example, is a bigram whose collective meaning is distinct from that of its individual words.
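Since topics themselves are embedded, we can also search for the topics most similar to a keyword with the search_topics() method. A minimal sketch against model_ngram, assuming "imprisonment" is in its vocabulary:

# The three topics whose embeddings lie closest to the word "imprisonment".
topic_words_s, word_scores_s, topic_scores_s, topic_nums_s = model_ngram.search_topics(keywords=["imprisonment"], num_topics=3)
for topic_num, topic_score in zip(topic_nums_s, topic_scores_s):
    print(f"Topic {topic_num} (similarity: {topic_score})")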
1.7.4. Saving and Loading a Top2Vec Model#
Once we have a model that we are happy with, we can save it using the save() method.
model_ngram.save("../data/top2vec_ngram")
Likewise, we can load our saved model with load():
new_model = Top2Vec.load("../data/top2vec_ngram")
And for good measure, we can run the same code as above to print our bigrams once again.
topic_words_ngram, word_scores_ngram, topics_ngram = new_model.get_topics(2)
for words, scores, num in zip(topic_words_ngram[1:], word_scores_ngram[1:], topics_ngram[1:]):
print(f"Topic {num}")
for word, score in zip(words, scores):
print(word, score)
Topic 1
island 0.7406236
imprisonment 0.7383562
years 0.7350636
robben 0.73397404
half years 0.7247492
years imprisonment 0.7206887
sentenced 0.6986825
served 0.695978
arrested 0.66353947
charged 0.66336066
banned 0.66275764
imprisoned 0.6580816
five years 0.6498868
suspended sentence 0.6363463
prison sentence 0.6340719
trial 0.6326254
sentence 0.62952983
activist 0.6273678
months 0.62542427
detained 0.6249351
prison 0.6236554
until 0.6223862
poqo 0.6147149
custody 0.6120131
year sentence 0.6114016
held 0.6040434
worcester 0.60011864
pac 0.5935757
released 0.5893775
poqo activities 0.58773917
charges 0.5808439
zwelethemba worcester 0.57607645
zweletemba worcester 0.575092
arrest 0.5745207
robben island 0.571052
under 0.56998795
worcester cape 0.56770444
detention 0.5670366
with public 0.56048626
ashton 0.55939454
later acquitted 0.5478358
tortured 0.5445701
charged with 0.54265773
act 0.5392955
interrogation 0.5369543
without 0.53288996
paarl 0.53190106
public 0.5305038
prison paarl 0.5296649
year prison 0.5292457
If you are working with topic modeling, or just want a general sense of how the documents in a large corpus align, Top2Vec is a great solution. It offers a lot of features for a single line of code. Other libraries do similar things, notably BERTopic. One should never rely on a single topic modeling approach, nor rest an argument heavily upon its results. Topic modeling, regardless of method, is imperfect. It is, however, a great way to explore your data and begin finding patterns that you might otherwise not notice.