1.4. LDA Topic Modeling#
A common way to perform topic modeling in the digital humanities is Latent Dirichlet Allocation (LDA). The underlying method originated in population genetics in 2000 as a way to understand larger patterns in genomic data. In 2003, it was applied in machine learning, specifically to texts, to solve the problem of topic discovery. It uses a generative statistical model to identify the topics that run across a collection of documents.
1.4.1. Process of Topic Modeling#
In order to engage in LDA topic modeling, one must clean a corpus significantly. Common steps that we will cover in this chapter are:
removing stopwords
lemmatization (optional)
removing punctuation
Stopwords are words so common across texts that they contribute little to distinguishing topics. The default English stopword list from the NLTK library is as follows:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
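This list can be pulled directly from NLTK. Below is a minimal sketch, assuming NLTK is installed and its stopword corpus has been downloaded:

```python
import nltk

# download the stopword corpus the first time (a no-op if it is already present)
nltk.download("stopwords")

from nltk.corpus import stopwords

# the default English stopword list shown above
stop_words = stopwords.words("english")
print(stop_words)
```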
Here, let’s consider a simple example. We will work with the following document from the TRC Volume 7 dataset.
An ANCYL member who was shot and severely injured by SAP members at Lephoi, Bethulie, Orange Free State (OFS) on 17 April 1991. Police opened fire on a gathering at an ANC supporter's house following a dispute between two neighbours, one of whom was linked to the ANC and the other to the SAP and a councillor.
After we perform the cleaning steps above (with the exception of lemmatization), we have the following result:
['ancyl', 'member', 'shot', 'severely', 'injured', 'sap', 'members', 'lephoi', 'bethulie', 'orange', 'free', 'state', 'ofs', '17', 'april', '1991', 'police', 'opened', 'fire', 'gathering', 'anc', 'supporters', 'house', 'following', 'dispute', 'two', 'neighbours', 'one', 'linked', 'anc', 'sap', 'councillor']
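A minimal sketch of how this cleaning might be performed in Python, assuming the NLTK stopword list from above (the chapter's actual preprocessing code may differ):

```python
import string
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

def clean_doc(text):
    """Lowercase a document, strip punctuation, and remove stopwords."""
    # lowercase and strip all punctuation characters
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # split on whitespace and drop any stopwords
    return [word for word in text.split() if word not in stop_words]

document = (
    "An ANCYL member who was shot and severely injured by SAP members at "
    "Lephoi, Bethulie, Orange Free State (OFS) on 17 April 1991. Police opened "
    "fire on a gathering at an ANC supporter's house following a dispute between "
    "two neighbours, one of whom was linked to the ANC and the other to the SAP "
    "and a councillor."
)
print(clean_doc(document))
```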
Note that the document is now a list of lowercase words. There are no punctuation marks, and all stopwords have been removed.
In addition to cleaning the corpus, one must also reduce all words to unique numbers. These numbers have no relevance to the words themselves; each is simply an integer identifier. In this approach, a researcher creates a bag-of-words (BOW) dictionary, a mapping between integers and their corresponding words. Each text is then reduced to a sequence of (word id, word count) pairs. Once we perform this step, our document looks like this:
[(0, 1), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 2), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1)]
This sequence of pairs records which words from the dictionary appear in this document and how many times. If we were to print off each individual entry from our entire dictionary, we would see that it looks like this:
0 17
1 1991
2 anc
3 ancyl
4 april
5 bethulie
6 councillor
7 dispute
8 fire
9 following
10 free
11 gathering
12 house
13 injured
14 lephoi
15 linked
16 member
17 members
18 neighbours
19 ofs
20 one
21 opened
22 orange
23 police
24 sap
25 severely
26 shot
27 state
28 supporters
29 two
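One common way to build such a dictionary and bag-of-words corpus in Python is with the gensim library (an assumption about tooling here; other libraries can produce the same structure). A minimal sketch using the cleaned document from above:

```python
from gensim.corpora import Dictionary

# the cleaned, tokenized document from the previous step
tokens = ['ancyl', 'member', 'shot', 'severely', 'injured', 'sap', 'members',
          'lephoi', 'bethulie', 'orange', 'free', 'state', 'ofs', '17', 'april',
          '1991', 'police', 'opened', 'fire', 'gathering', 'anc', 'supporters',
          'house', 'following', 'dispute', 'two', 'neighbours', 'one', 'linked',
          'anc', 'sap', 'councillor']

# map every unique word to an integer id
dictionary = Dictionary([tokens])

# reduce the document to (word id, word count) pairs
print(dictionary.doc2bow(tokens))

# print each id alongside its word, as in the listing above
for word, word_id in sorted(dictionary.token2id.items(), key=lambda item: item[1]):
    print(word_id, word)
```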
As we will see when we explore Top2Vec later in this chapter, a major drawback of this approach is that it retains no information about the syntax or semantics of the language. We only know whether a word appears in a document and how often. Language, however, is far more complex than the mere presence or absence of words. For a word to have meaning, we must know how it is used; word order, for example, dramatically affects the meaning of individual words because it supplies context. Traditional LDA topic modeling does not offer solutions to these problems.
1.4.2. Knowing the Total Number of Topics#
Another issue with LDA topic modeling is that one must specify, in advance, the number of topics to identify across the entire corpus. Choosing this number can be quite challenging and usually requires a series of trial-and-error passes. As we will see in the next section, many of these issues, from cleaning a corpus to guessing the number of topics in the corpus, are eliminated by transformers and more recent machine learning methods and algorithms.
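To make this requirement concrete, here is a minimal sketch of training an LDA model with gensim (again an assumption about tooling), in which the number of topics must be supplied up front:

```python
from gensim.models import LdaModel

# bow_corpus is a list of bag-of-words documents and dictionary is the
# id-to-word mapping built earlier; both are assumed to already exist
lda_model = LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=10,   # the researcher must choose this value in advance
    passes=10,
    random_state=42,
)
```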
1.4.3. Applying a Topic Model#
Once an LDA topic model is trained, it can be used to interrogate the corpus collectively. One can explore the topics identified and read the documents that the model placed in the same category. Depending on the corpus, certain topics can be difficult to interpret. At this stage, the topics do not have actual names; rather, they are purely numerical.
If we want to assign labels to each topic, we have two options available to us. First, a content expert can examine the results and intuitively assign a label that makes the most sense given the corpus. The second approach is automation: a researcher can find the most common and distinctive words that appear in a topic and use them to construct the label, as sketched below.
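A minimal sketch of the automated option, using a hypothetical helper that labels each topic with its highest-probability words (how one weighs "common" against "unique" words is a design choice, not a prescribed method):

```python
def label_topics(lda_model, num_words=3):
    """Build a simple label for each topic from its highest-probability words."""
    labels = {}
    for topic_id in range(lda_model.num_topics):
        # show_topic returns (word, probability) pairs for the given topic
        top_words = [word for word, _ in lda_model.show_topic(topic_id, topn=num_words)]
        labels[topic_id] = "_".join(top_words)
    return labels

# prints something like {0: 'word_word_word', 1: ...}, depending on the trained model
print(label_topics(lda_model))
```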
1.4.4. Summary of Key Issues with LDA Topic Modeling#
In summary, topic modeling is a worthwhile endeavor for many text-based classification problems. It is particularly useful when trying to understand large corpora that cannot possibly be read in a reasonable amount of time, and for understanding how documents in those corpora cluster together around large, latent (hidden) themes. LDA topic modeling, however, presents several limitations: it requires a good deal of preprocessing, it requires knowing in advance how many topics one wants to find in the corpus, and it cannot retain semantic or syntactic meaning.