1. Topic Modeling: Concepts and Theory

The purpose of this part of the textbook is fivefold.

  1. Introduce the reader to the core concepts of topic modeling and text classification

  2. Provide an introduction to three libraries used for traditional topic modeling (scikit-learn, Gensim, and spaCy) for those with limited Python knowledge

  3. Detail the problems that commonly arise in topic modeling and their solutions

  4. Provide an overview of transformer-based topic modeling

  5. Provide code that will be easily reproducible for readers who wish to apply these methods to their own domains

Throughout this part of the textbook, we will work with one dataset, a collection of short descriptions of violence in apartheid South Africa, which comes from Volume 7 of the Truth and Reconciliation Commission’s final report (hereafter, TRC Vol. 7). I have chosen this dataset because I have experience with it and I know that it is well suited to topic modeling: the descriptions contain hidden topics.

In any real-world project, you should know the data that you are working with, so I think it’s worthwhile to spend a little time on the data used throughout this notebook. It is a collection of descriptions of violence from apartheid South Africa in the 20th century. The descriptions were released in the early 2000s by the Truth and Reconciliation Commission (TRC), a body put together at the end of the twentieth century to catalog the violence that befell victims during the apartheid period. The dataset is what is known as Volume 7 of the TRC Final Report. It is organized as a collection of individuals, each with a brief description of the violence that befell them and the victim’s age. The descriptions of violence are not structured in any way, but they do contain rich metadata, such as dates, places, and organizations active during the 20th century.

The data that I have prepared for you is a JSON file of the TRC Volume 7 report. The JSON file is structured as a dictionary with two keys. The first key, names, corresponds to a list of the victims’ names. The second key, descriptions, holds the descriptions of violence. This is the key piece of the data that we will be working with, because it is within these descriptions that we are trying to identify topics.
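To make this structure concrete, here is a minimal sketch of how such a file could be read with Python’s built-in json module. The helper name load_trc_data, the file name sample.json, and the two placeholder records are assumptions made for illustration. They are not the real file or real TRC entries. The sketch also assumes that the two lists are parallel, so that the first name belongs with the first description.

```python
import json
import tempfile
from pathlib import Path


def load_trc_data(path):
    """Read a JSON file with 'names' and 'descriptions' keys.

    Hypothetical helper: returns the two lists as a tuple.
    """
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data["names"], data["descriptions"]


# Placeholder records that only mimic the two-key structure
# described above. They are NOT real TRC entries.
sample = {
    "names": ["Person A", "Person B"],
    "descriptions": [
        "Placeholder description one.",
        "Placeholder description two.",
    ],
}

# Write the placeholder data to a temporary file so the example runs
# without the real dataset. The file name is an assumption.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.json"
    sample_path.write_text(json.dumps(sample), encoding="utf-8")
    names, descriptions = load_trc_data(sample_path)

# Assumption: the lists are parallel, so we can pair them by position.
for name, description in zip(names, descriptions):
    print(name, "->", description)
```

With the real file, you would pass its path to the same function. The descriptions list is the one that would then be handed to a topic model.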