1.1. What is Topic Modeling?

Topic modeling is an approach in NLP for finding hidden themes within a corpus, or collection of documents. These themes are the topics. Topic modeling is particularly useful when we do not know the subject of each document in our collection and the corpus is too vast to tag each document by hand with a specific topic name. It allows us to automate that tagging.

Topic modeling is distinctly different from an approach in NLP known as text classification. In text classification, we train a machine learning model to recognize specific known labels. We do this with training data that consists of texts and their corresponding labels. This use of training data is similar to the machine learning NER approach we saw in Part Three of this textbook. When we use labeled data to train a machine learning model, this is known as supervised learning. Topic modeling is designed for a different problem: since we do not know our labels and do not have training data, supervised learning will not work. Topic modeling is instead an unsupervised learning approach to finding and identifying those labels.

Today, there are many approaches to topic modeling. In Chapter 2, we will learn how to build an LDA (Latent Dirichlet Allocation) model. While useful, this approach to topic modeling has largely been replaced by transformer-based topic models (Chapter 3). Nevertheless, it is important to understand LDA topic models in order to appreciate the novelty of transformer-based approaches. Further, LDA topic modeling has a strong history in the digital humanities, and knowing how and why it works helps us understand the literature produced by digital humanists in the early twenty-first century. Before we can explore each of these approaches, however, it is important to have a firm grasp of the key themes and concepts of topic modeling. That is the subject of this chapter.

1.1.1. Rules-Based Methods

A simple approach to identifying subjects in a collection of documents is what we would call a rules-based approach. Here, we argue that documents in which certain words appear indicate a specific topic. This manual approach can yield good results. For example, if I had 100,000 documents and I knew that some dealt with medical details, I could look for a handful of terms, such as doctor, medicine, and hospital, to extract those documents as dealing with a medical topic.
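Below is a minimal sketch of this idea. The document list and the keyword set are invented purely for illustration; a real project would need far richer keyword lists and more careful tokenization.

```python
import re

# Hypothetical keyword set and documents, invented purely for illustration.
medical_keywords = {"doctor", "medicine", "hospital", "nurse", "surgery"}

documents = [
    "The doctor prescribed a new medicine at the hospital.",
    "The harvest failed after a long drought in the valley.",
]

def is_medical(text, keywords=medical_keywords):
    # Tokenize into lowercase words and check for any keyword overlap.
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(words & keywords)

for doc in documents:
    print(is_medical(doc), "->", doc)
```

Even in this toy example, the rule is brittle: a document about a "physician" or a "clinic" would be missed entirely unless we keep expanding the keyword list.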

The key problem with this approach is scale. We cannot practically construct lists like this for all possible topics, especially if we do not know the topics found within a corpus. Further, constructing lists like this requires detailed knowledge about a subject. For these reasons, a rules-based approach to classifying documents is not practical for large corpora.

1.1.2. Machine Learning-Based Methods

Another option for identifying topics in a text is a machine learning-based approach. In this method, we do not give a computer system a set of rules; rather, we let the computer generate its own rules for identifying topics in a corpus. This is done in two different ways: supervised and unsupervised learning.

In supervised learning, we know the key subjects in a corpus. We give a computer system a set of documents with their corresponding labels to teach it to identify the characteristics that make each topic or class unique. This is mostly used for text classification.
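As a rough illustration of supervised text classification (not the workflow we will use later in this part), here is a minimal sketch using scikit-learn. The tiny labeled dataset is invented for demonstration; real classifiers need far more training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny labeled dataset, invented purely for illustration.
texts = [
    "The doctor examined the patient at the hospital.",
    "The nurse administered the medicine.",
    "The striker scored a late goal in the match.",
    "The team won the championship final.",
]
labels = ["medical", "medical", "sports", "sports"]

# TF-IDF features feed a simple linear classifier.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, labels)

# Predict the label of a new, unseen document.
print(classifier.predict(["The doctor prescribed new medicine."]))
```

Notice that this only works because we already know the labels "medical" and "sports" and can supply examples of each.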

Another approach is unsupervised learning. In unsupervised learning, we do not know the topics of our documents; instead, we let the system identify those topics and cluster (or group together) the documents that are highly similar to one another. We then examine the words that occur most frequently in each cluster to get a sense of the topics at hand. The classic example of machine learning topic modeling is LDA, or Latent Dirichlet Allocation. We will learn about this method in far more detail in Chapter 2.
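As a brief preview, the sketch below uses scikit-learn's implementation of LDA. The tiny corpus and the choice of two topics are assumptions made only for illustration; no labels are supplied, and the topics emerge from the word counts alone.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# A tiny invented corpus; real topic modeling needs far more documents.
documents = [
    "The doctor prescribed medicine at the hospital.",
    "The nurse checked on the patient in the hospital ward.",
    "The striker scored a goal and the team won the match.",
    "Fans cheered as the team lifted the championship trophy.",
]

# Convert documents to word counts, dropping common English stop words.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)

# Ask LDA to find two topics (an assumption made for this toy example).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Show the highest-weighted words in each discovered topic.
words = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_words = [words[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")
```

The printed word lists are what we, as researchers, would then interpret and name, perhaps as "medicine" and "sports" in this toy case.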

1.1.3. Why use Topic Modeling?

All of this leads to a vital question: Why use topic modeling? Topic modeling affords researchers the ability to learn a lot about their corpus very quickly. It is often used when the corpus is so large that no one person could read it all in a lifetime.

With both rules-based and machine learning-based approaches, a researcher can see what major subjects are discussed in a corpus. This information can be used to perform targeted research by weeding out the documents that likely do not contain the information the researcher needs. Additionally, the information drawn from topic modeling can be used to make broader deductions about the corpus at hand. We will also see, however, that topic modeling can lead to imprecise or incorrect conclusions.

It is vital, however, to understand the limitations of topic modeling. There is always a potential for the researcher to use topic modeling to validate a wrong presumption about the data. Throughout this part of the textbook, I will emphasize methodological steps that can (and should) be taken to limit these mistakes. Despite this potential for error, topic modeling can provide valuable insight into a large corpus relatively quickly.