1.5. Creating LDA in Python

In the last section, we learned about the key ideas and concepts behind LDA topic modeling. Let’s now put those ideas into practice. In this section, we will use the Gensim library to create our topic model and the pyLDAvis library to visualize it. You can install both libraries with pip using the following commands:

pip install gensim

and

pip install pyldavis

We will also need to install NLTK, or the Natural Language Toolkit, in order to get a list of stop words. We can install the library with pip:

pip install nltk

Once you have installed NLTK, you will need to download the list of English stop words. You can do so by running the following in Python:

import nltk
nltk.download('stopwords')

1.5.1. Importing the Required Libraries and Data

Now that we have our libraries installed correctly, we can import everything.

import pandas as pd
from nltk.corpus import stopwords
import string
import gensim.corpora as corpora
from gensim.models import LdaModel
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()

This imports everything we need, including the LdaModel class from Gensim, which we will use to create our LDA model. Before we can train the model, however, we must first load and clean our data.

df = pd.read_csv("../data/trc.csv")
df = df[["Last", "First", "Description"]]
df
       Last            First                  Description
0      AARON           Thabo Simon            An ANCYL member who was shot and severely inju...
1      ABBOTT          Montaigne              A member of the SADF who was severely injured ...
2      ABRAHAM         Nzaliseko Christopher  A COSAS supporter who was kicked and beaten wi...
3      ABRAHAMS        Achmat Fardiel         Was shot and blinded in one eye by members of ...
4      ABRAHAMS        Annalene Mildred       Was shot and injured by members of the SAP in ...
...    ...             ...                    ...
20829  XUZA            Mandla                 Was severely injured when he was stoned by a f...
20830  YAKA            Mbangomuni             An IFP supporter and acting induna who was sho...
20831  YALI            Khayalethu             Was shot by members of the SAP in Lingelihle, ...
20832  YALO            Bikiwe                 An IFP supporter whose house and possessions w...
20833  YALOLO-BOOYSEN  Geoffrey Yali          An ANC supporter and youth activist who was to...

20834 rows × 3 columns

Our goal in this section will be to model all the descriptions of violence in the TRC Volume 7 Final Report. We will, therefore, grab all documents and place them into a list.

docs = df.Description.tolist()
docs[0]
"An ANCYL member who was shot and severely injured by SAP members at Lephoi, Bethulie, Orange Free State (OFS) on 17 April 1991. Police opened fire on a gathering at an ANC supporter's house following a dispute between two neighbours, one of whom was linked to the ANC and the other to the SAP and a councillor."

1.5.2. Cleaning Documents

Now that we have our documents, let’s go ahead and load up our stop words. These will be the words that we remove from our documents.

stop_words = stopwords.words('english')

Next, we need a function to clean our documents. The purpose of the function below is to take a single document as an input and return a cleaned sequence of words with no punctuation or stop words.

def clean_doc(doc):
    no_punct = ''
    for c in doc:
        if c not in string.punctuation:
            no_punct = no_punct+c
    # with list comprehension
    # no_punct = ''.join([c for c in doc if c not in string.punctuation])
    
    words = no_punct.lower().split()
    
    final_words = []
    for word in words:
        if word not in stop_words:
            final_words.append(word)
    
    # with list comprehension
    # final_words = [word for word in words if word not in stop_words]

    return final_words

Let’s take a look and see what this looks like now when we run it over our first document.

cleaned = clean_doc(docs[0])
print(docs[0])
print(cleaned)
An ANCYL member who was shot and severely injured by SAP members at Lephoi, Bethulie, Orange Free State (OFS) on 17 April 1991. Police opened fire on a gathering at an ANC supporter's house following a dispute between two neighbours, one of whom was linked to the ANC and the other to the SAP and a councillor.
['ancyl', 'member', 'shot', 'severely', 'injured', 'sap', 'members', 'lephoi', 'bethulie', 'orange', 'free', 'state', 'ofs', '17', 'april', '1991', 'police', 'opened', 'fire', 'gathering', 'anc', 'supporters', 'house', 'following', 'dispute', 'two', 'neighbours', 'one', 'linked', 'anc', 'sap', 'councillor']
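
The same cleaning steps can also be written more compactly. The sketch below is one possible variant, not the chapter's function: it uses str.translate() to delete punctuation in a single pass and a set for stop word lookups, which tends to be faster than a list on a large corpus. The names clean_doc_fast, STOP_WORDS, and PUNCT_TABLE are hypothetical, and the tiny stop word set is an assumption made only to keep the example self-contained; in practice it would be set(stopwords.words('english')).

```python
import string

# Hypothetical, tiny stop word set used only to keep this sketch self-contained.
# In the chapter's setting this would be set(stopwords.words('english')).
STOP_WORDS = {"an", "who", "was", "and", "by", "at", "on", "a", "the", "to", "of"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_doc_fast(doc, stop_words=STOP_WORDS):
    """Strip punctuation, lowercase, split on whitespace, drop stop words."""
    words = doc.translate(PUNCT_TABLE).lower().split()
    return [word for word in words if word not in stop_words]


sample = "An ANCYL member who was shot and severely injured by SAP members."
print(clean_doc_fast(sample))
```

Because punctuation is deleted rather than replaced with a space, a possessive such as "supporter's" becomes "supporters", exactly as in the output above.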

With our function created, we can now process all our documents. In the line below, we will convert all documents into a list called cleaned_docs, in which each document is represented as a sequence of words.

cleaned_docs = [clean_doc(doc) for doc in docs]

1.5.3. Create ID-Word Index

Remember, an LDA topic model cannot work with words directly; the algorithm needs numbers. This means that we need to convert all words into numbers. We can do this with a bag-of-words approach. In this approach, we create an ID-Word Index. This is essentially a dictionary where each unique word has a unique number. The numbers are not a global alphabetical ranking: Gensim assigns them as it encounters new words, document by document, and the words that are new in a given document are added in sorted order.

id2word = corpora.Dictionary(cleaned_docs)

With the ID-Word Index created, let’s query it and see what word maps to index number 250.

id2word[250]
'bmw'

As we can see, it is BMW, a political organization in South Africa. Now that we have our dictionary, we can convert all our documents into a sequence of numbers, rather than words. We can do this via the doc2bow() method.

corpus = [id2word.doc2bow(cleaned_doc) for cleaned_doc in cleaned_docs]

The corpus object now contains all the documents, each represented as a bag of words: a list of (word ID, count) pairs rather than a sequence of words. Word order is discarded; only the number of times each word occurs is kept. This is the format that our LDA model expects.

print(corpus[0])
[(0, 1), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 2), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1)]

In order to see what these numbers correspond to, let’s take a look at our first document and map each number to the ID-Word Index.

for num in corpus[0]:
    num = num[0]
    print(f"{num}\t{id2word[num]}")
0	17
1	1991
2	anc
3	ancyl
4	april
5	bethulie
6	councillor
7	dispute
8	fire
9	following
10	free
11	gathering
12	house
13	injured
14	lephoi
15	linked
16	member
17	members
18	neighbours
19	ofs
20	one
21	opened
22	orange
23	police
24	sap
25	severely
26	shot
27	state
28	supporters
29	two
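
To make the ID-Word Index and the bag-of-words conversion more concrete, here is a pure Python sketch of the two steps on a toy corpus. It is an illustration under stated assumptions, not Gensim's implementation: build_id2word and doc_to_bow are hypothetical helper names, and the ID assignment rule (new words added document by document, in sorted order within a document) is modeled on the behavior visible in the output above.

```python
from collections import Counter


def build_id2word(tokenized_docs):
    """Assign each previously unseen word the next free integer ID.

    Assumption: words are added document by document, and the words that
    are new in a document are added in sorted order.
    """
    word2id = {}
    for tokens in tokenized_docs:
        for word in sorted(set(tokens)):
            if word not in word2id:
                word2id[word] = len(word2id)
    return word2id


def doc_to_bow(tokens, word2id):
    """Return sorted (word_id, count) pairs for words present in the index."""
    counts = Counter(word for word in tokens if word in word2id)
    return sorted((word2id[word], n) for word, n in counts.items())


toy_docs = [["shot", "sap", "anc", "sap"], ["anc", "police", "shot"]]
word2id = build_id2word(toy_docs)
print(word2id)
print(doc_to_bow(toy_docs[0], word2id))
```

In the toy corpus, "sap" occurs twice in the first document, so its pair carries a count of 2, just as IDs 2 (anc) and 24 (sap) do in corpus[0] above.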

1.5.4. Creating LDA Topic Model

With our corpus now prepared, we can pass it to our LDA model. The LdaModel class accepts many parameters, but three are essential for our purposes:

  • corpus which will be our corpus object

  • id2word which will be our ID-Word Index

  • num_topics which will be the number of topics we want the model to find.

Now, we can train our topic model.

lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=100)

Training a model may take several seconds to minutes. The more documents you have, the longer it will take. Once the model is trained, we can begin to analyze the model.

1.5.5. Analyze a Document

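If we want to get the topics for each document in our corpus, we can do so via get_document_topics(). Here we pass it a single argument, the corpus object. For each document, it returns a list of (topic ID, score) pairs, where the score is the estimated share of the document that belongs to that topic. Topics whose score falls below a minimum probability threshold are left out, which is why only a handful of the 100 topics are listed for any one document.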

topics = lda_model.get_document_topics(corpus)
len(topics)
20834

We can analyze the topics for our first document.

topics[0]
[(6, 0.037855655),
 (15, 0.037333403),
 (27, 0.5222747),
 (36, 0.06073708),
 (43, 0.041372776),
 (74, 0.21932246),
 (94, 0.05292057)]

These topics were assigned to the following document:

print(docs[0])
An ANCYL member who was shot and severely injured by SAP members at Lephoi, Bethulie, Orange Free State (OFS) on 17 April 1991. Police opened fire on a gathering at an ANC supporter's house following a dispute between two neighbours, one of whom was linked to the ANC and the other to the SAP and a councillor.
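
One detail worth checking is that the scores listed for a document do not add up to 1. The short sketch below sums the scores copied from the topics[0] output above. The interpretation rests on an assumption: that Gensim's default minimum probability of 0.01 was in effect, since the call above does not change it, so the remaining probability is spread thinly over the topics that were filtered out.

```python
# Topic scores for the first document, copied from the topics[0] output above.
doc_topics = [
    (6, 0.037855655),
    (15, 0.037333403),
    (27, 0.5222747),
    (36, 0.06073708),
    (43, 0.041372776),
    (74, 0.21932246),
    (94, 0.05292057),
]

num_topics = 100  # the value passed to LdaModel above
total = sum(prob for _, prob in doc_topics)

print(f"topics listed: {len(doc_topics)} of {num_topics}")
print(f"probability covered by listed topics: {total:.3f}")
print(f"left for the other {num_topics - len(doc_topics)} topics: {1 - total:.3f}")
```

The seven listed topics cover roughly 97% of the document, and the remaining 93 topics share the rest.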

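Let’s now print some of the key words for the first three topics in that list. Note that the list is ordered by topic ID, not by score, so these are topics 6, 15, and 27 rather than the three highest-scoring topics. For each topic, we will print the top 10 words with the get_topic_terms() method, which takes two arguments: the topic ID and the number of words to return. The scores printed below differ slightly from those shown earlier, most likely because Gensim re-estimates a document's topic distribution each time it is requested.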

for topic in topics[0][:3]:
    terms = lda_model.get_topic_terms(topic[0], 10)
    print(topic)
    for num in terms:
        num = num[0]
        print(num, id2word[num])
    print()
(6, 0.038197085)
733 seriously
1394 boipatong
13 injured
809 left
750 massacre
0 17
284 thirteen
68 people
551 five
406 refused

(15, 0.037302956)
625 campaign
3085 soldiers
226 torture
37 transvaal
7 dispute
1350 military
33 landmine
1224 thrown
711 section
2 anc

(27, 0.5223308)
26 shot
8 fire
21 opened
17 members
23 police
24 sap
13 injured
255 march
58 town
61 1990

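These results seem plausible. Topic 27, which has by far the highest score for this document, is dominated by words such as shot, fire, opened, police, and sap, which closely match the document's account of police opening fire on a gathering.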
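
The loop above slices the first three entries of topics[0], and those entries are ordered by topic ID. To look at the three highest-scoring topics instead, the pairs can be sorted by score first. The sketch below shows this on the scores copied from the earlier topics[0] output; doc_topics and top_three are names introduced here for illustration.

```python
# Topic scores for the first document, copied from the topics[0] output above.
doc_topics = [
    (6, 0.037855655),
    (15, 0.037333403),
    (27, 0.5222747),
    (36, 0.06073708),
    (43, 0.041372776),
    (74, 0.21932246),
    (94, 0.05292057),
]

# Sort by score (the second element of each pair), highest first, keep three.
top_three = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)[:3]
print(top_three)
```

With the trained model available, iterating over a list sorted this way in place of topics[0][:3] would print the key words for topics 27, 74, and 36.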

1.5.6. Analyze the Topic Model

It is often difficult to gauge the quality of a topic model by simply looking at raw output. For this reason, it is useful to be able to study a model in its entirety. We can do this with pyLDAvis, a library designed to display the topics of a topic model in two-dimensional space, along with the key words associated with each topic.

vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, mds="mmds", R=30)
vis

Now we can see all 100 topics plotted in two-dimensional space, and we can study the degree to which the words in each topic are unique to that topic when compared with the corpus as a whole.

While LDA is useful as a first step in exploring data, more powerful ways of clustering documents are available today. As we will see over the remainder of the chapter, other libraries and approaches that leverage state-of-the-art language models are perhaps better suited to finding and extracting topics from your corpus.