1.5. Creating LDA in Python

In the last section, we learned about the key ideas and concepts behind LDA topic modeling. Let’s now put those ideas into practice. In this section, we will use the Gensim library to create our topic model and the pyLDAvis library to visualize it. You can install both libraries with pip using the following commands:

pip install gensim

and

pip install pyldavis

We will also need to install NLTK, or the Natural Language Toolkit, in order to get a list of stop words. We can install the library with pip:

pip install nltk

Once you have installed NLTK, you will need to download the list of English stop words. You can do so by running the following in Python:

import nltk
nltk.download('stopwords')

1.5.1. Importing the Required Libraries and Data

Now that we have our libraries installed correctly, we can import everything.

import pandas as pd
from nltk.corpus import stopwords
import string
import gensim.corpora as corpora
from gensim.models import LdaModel
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()

These imports include the LdaModel class from Gensim, which we will use to create our LDA model. Before we can train the model, however, we must first load and clean our data.

df = pd.read_csv("../data/trc.csv")
df = df[["Last", "First", "Description"]]
df

	Last	First	Description
0	AARON	Thabo Simon	An ANCYL member who was shot and severely inju...
1	ABBOTT	Montaigne	A member of the SADF who was severely injured ...
2	ABRAHAM	Nzaliseko Christopher	A COSAS supporter who was kicked and beaten wi...
3	ABRAHAMS	Achmat Fardiel	Was shot and blinded in one eye by members of ...
4	ABRAHAMS	Annalene Mildred	Was shot and injured by members of the SAP in ...
...	...	...	...
20829	XUZA	Mandla	Was severely injured when he was stoned by a f...
20830	YAKA	Mbangomuni	An IFP supporter and acting induna who was sho...
20831	YALI	Khayalethu	Was shot by members of the SAP in Lingelihle, ...
20832	YALO	Bikiwe	An IFP supporter whose house and possessions w...
20833	YALOLO-BOOYSEN	Geoffrey Yali	An ANC supporter and youth activist who was to...

20834 rows × 3 columns

Our goal in this section will be to model all the descriptions of violence in the TRC Volume 7 Final Report. We will, therefore, grab all documents and place them into a list.

docs = df.Description.tolist()
docs[0]

"An ANCYL member who was shot and severely injured by SAP members at Lephoi, Bethulie, Orange Free State (OFS) on 17 April 1991. Police opened fire on a gathering at an ANC supporter's house following a dispute between two neighbours, one of whom was linked to the ANC and the other to the SAP and a councillor."

1.5.2. Cleaning Documents

Now that we have our documents, let’s go ahead and load up our stop words. These will be the words that we remove from our documents.

stop_words = stopwords.words('english')

Next, we need a function to clean our documents. The purpose of the function below is to take a single document as an input and return a cleaned sequence of words with no punctuation or stop words.

def clean_doc(doc):
    no_punct = ''
    for c in doc:
        if c not in string.punctuation:
            no_punct = no_punct+c
    # with list comprehension
    # no_punct = ''.join([c for c in doc if c not in string.punctuation])
    
    words = no_punct.lower().split()
    
    final_words = []
    for word in words:
        if word not in stop_words:
            final_words.append(word)
    
    # with list comprehension
    # final_words = [word for word in words if word not in stop_words]

    return final_words

Let’s take a look and see what this looks like now when we run it over our first document.

cleaned = clean_doc(docs[0])
print(docs[0])
print(cleaned)

An ANCYL member who was shot and severely injured by SAP members at Lephoi, Bethulie, Orange Free State (OFS) on 17 April 1991. Police opened fire on a gathering at an ANC supporter's house following a dispute between two neighbours, one of whom was linked to the ANC and the other to the SAP and a councillor.
['ancyl', 'member', 'shot', 'severely', 'injured', 'sap', 'members', 'lephoi', 'bethulie', 'orange', 'free', 'state', 'ofs', '17', 'april', '1991', 'police', 'opened', 'fire', 'gathering', 'anc', 'supporters', 'house', 'following', 'dispute', 'two', 'neighbours', 'one', 'linked', 'anc', 'sap', 'councillor']
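
The same cleaning steps can also be written more compactly. The following is a minimal sketch, not a replacement for the function above: it assumes a small, hand-picked set of stop words (the hypothetical STOP_WORDS) in place of the NLTK list so that it runs without any downloads, and the name clean_doc_fast is purely illustrative. It uses str.translate() to delete punctuation in one pass and a set for faster stop word lookups.

```python
import string

# Hypothetical, hand-picked stop words for illustration only; the section
# itself uses the much longer NLTK English list.
STOP_WORDS = {"an", "who", "was", "and", "by", "at", "on", "a", "the", "of", "to"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_doc_fast(doc, stop_words=STOP_WORDS):
    """Strip punctuation, lowercase, split on whitespace, drop stop words."""
    words = doc.translate(PUNCT_TABLE).lower().split()
    return [word for word in words if word not in stop_words]


sample = "An ANCYL member who was shot at Lephoi, Bethulie, at an ANC supporter's house."
result = clean_doc_fast(sample)
print(result)
```

Note that removing the apostrophe turns "supporter's" into "supporters", which is also what happens in the output above.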

With our function created, we can now process all our documents. In the line below, we will run every document through clean_doc() and store the results in a list called cleaned_docs. Each document is now represented as a sequence of words.

cleaned_docs = [clean_doc(doc) for doc in docs]

1.5.3. Create ID-Word Index

Remember, an LDA topic model cannot work with words directly; the algorithm needs numbers. This means that we need to convert all words into numbers. We can do this with a bag-of-words approach. In this approach, we create an ID-Word Index. This is essentially a dictionary where each unique word has a unique number. Gensim assigns these numbers as it encounters new words, document by document, so the dictionary as a whole is not sorted alphabetically, although the new words from any one document receive their IDs in alphabetical order.

id2word = corpora.Dictionary(cleaned_docs)
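
To make the idea of an ID-Word Index concrete, here is a plain-Python sketch of one way such an index could be built. It is an illustration under simplifying assumptions, not Gensim's implementation: the function name build_index and the toy documents are hypothetical, and the real Dictionary class tracks additional statistics.

```python
def build_index(tokenized_docs):
    """Map each unique token to an integer ID.

    Documents are processed in order; the unseen tokens of each document
    are sorted before they receive the next free IDs.
    """
    token2id = {}
    for tokens in tokenized_docs:
        new_tokens = sorted(t for t in set(tokens) if t not in token2id)
        for token in new_tokens:
            token2id[token] = len(token2id)
    return token2id


# Hypothetical toy documents, already cleaned and tokenized.
toy_docs = [["shot", "police", "anc"], ["police", "house", "burnt"]]
token2id = build_index(toy_docs)

# Reverse mapping, so that an ID can be looked up to recover its word.
id2token = {token_id: token for token, token_id in token2id.items()}
print(token2id)
print(id2token[3])
```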

With the ID-Word Index created, let’s query it and see what word maps to index number 250.

id2word[250]

'bmw'

As we can see, it is BMW, a political organization in South Africa. Now that we have our dictionary, we can convert all our documents into a sequence of numbers, rather than words. We can do this via the doc2bow() method.

corpus = [id2word.doc2bow(cleaned_doc) for cleaned_doc in cleaned_docs]

The corpus object now contains all the documents, each represented as a bag of words: a list of (word ID, count) pairs rather than a sequence of words. This is the format that our LDA model expects.

print(corpus[0])

[(0, 1), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 2), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1)]
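
Each pair above holds a word ID followed by the number of times that word occurs in the document. The following minimal sketch shows how such pairs could be produced by hand with collections.Counter. The function to_bow and the four-word index are hypothetical stand-ins used only for illustration; in this section that work is done by id2word.doc2bow().

```python
from collections import Counter

# Hypothetical toy index; in the section this role is played by id2word.
token2id = {"anc": 0, "police": 1, "sap": 2, "shot": 3}


def to_bow(tokens, token2id):
    """Return (token_id, count) pairs sorted by ID, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


bow = to_bow(["anc", "shot", "sap", "anc", "sap", "unknown"], token2id)
print(bow)
```

Word order is discarded in this representation; only the counts survive, which is why it is called a bag of words.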

In order to see what these numbers correspond to, let’s take a look at our first document and map each number to the ID-Word Index.

for token_id, count in corpus[0]:
    print(f"{token_id}\t{id2word[token_id]}")

0	17
1	1991
2	anc
3	ancyl
4	april
5	bethulie
6	councillor
7	dispute
8	fire
9	following
10	free
11	gathering
12	house
13	injured
14	lephoi
15	linked
16	member
17	members
18	neighbours
19	ofs
20	one
21	opened
22	orange
23	police
24	sap
25	severely
26	shot
27	state
28	supporters
29	two

1.5.4. Creating LDA Topic Model

With our corpus now prepared, we can pass it to our LDA model. The LdaModel class accepts many parameters, but three matter most here:

corpus, which will be our corpus object
id2word, which will be our ID-Word Index
num_topics, which will be the number of topics we want the model to find

Now, we can train our topic model.

lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=100)

Training a model may take several seconds to minutes. The more documents you have, the longer it will take. Once the model is trained, we can begin to analyze the model.

1.5.5. Analyze a Document

If we want to get the topics for each document in our corpus, we can do so via get_document_topics(). Here we pass it a single argument, the corpus object. For each document, it returns a list of (topic number, score) pairs, leaving out topics whose score falls below a minimum threshold.

topics = lda_model.get_document_topics(corpus)
len(topics)

We can analyze the topics for our first document.

topics[0]

[(6, 0.037855655),
 (15, 0.037333403),
 (27, 0.5222747),
 (36, 0.06073708),
 (43, 0.041372776),
 (74, 0.21932246),
 (94, 0.05292057)]

These topics were assigned to the following document:

print(docs[0])

An ANCYL member who was shot and severely injured by SAP members at Lephoi, Bethulie, Orange Free State (OFS) on 17 April 1991. Police opened fire on a gathering at an ANC supporter's house following a dispute between two neighbours, one of whom was linked to the ANC and the other to the SAP and a councillor.
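
Because the pairs returned for a document are ordered by topic number rather than by score, picking out the strongest topics takes an explicit sort. The following is a minimal sketch that uses the scores printed above, copied by hand into a plain list so that it runs without the trained model; the variable names are illustrative.

```python
# Values copied by hand from the topics[0] output shown above.
doc_topics = [
    (6, 0.037855655),
    (15, 0.037333403),
    (27, 0.5222747),
    (36, 0.06073708),
    (43, 0.041372776),
    (74, 0.21932246),
    (94, 0.05292057),
]

# Sort by score, highest first, and keep the three strongest topics.
top_three = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)[:3]
print(top_three)

# Low-scoring topics are left out of the list, so the scores that remain
# add up to slightly less than 1.
total = sum(score for _, score in doc_topics)
print(round(total, 3))
```

Ranked this way, the three strongest topics for this document are 27, 74, and 36.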

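Let’s now print off some of the key words for the first three topics in this list. Note that the list is ordered by topic number, not by score, so these are not necessarily the three highest-scoring topics. For each topic, we will print off the top 10 words with the get_topic_terms() method, which takes two arguments: the topic number and the number of words to return. The scores printed here differ slightly from those above, most likely because the model infers the topics afresh each time a document is accessed and that inference involves some randomness.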

for topic in topics[0][:3]:
    terms = lda_model.get_topic_terms(topic[0], 10)
    print(topic)
    for num in terms:
        num = num[0]
        print(num, id2word[num])
    print()

(6, 0.038197085)
seriously
boipatong
injured
left
massacre
17
thirteen
people
five
refused

(15, 0.037302956)
campaign
soldiers
torture
transvaal
dispute
military
landmine
thrown
section
anc

(27, 0.5223308)
shot
fire
opened
members
police
sap
injured
march
town
1990

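These results seem acceptable. Topic 27, by far the highest-scoring of the three at roughly 0.52, is dominated by words such as shot, fire, opened, members, police, and sap, all of which appear in the document.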
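
One rough way to back up this impression is to count how many of each topic's top words actually occur in the document. The sketch below does this with the word lists printed above, copied by hand so that it runs on its own. It is an informal sanity check with illustrative variable names, not a standard evaluation metric.

```python
# Cleaned tokens of the first document, copied from the earlier output.
cleaned = [
    "ancyl", "member", "shot", "severely", "injured", "sap", "members",
    "lephoi", "bethulie", "orange", "free", "state", "ofs", "17", "april",
    "1991", "police", "opened", "fire", "gathering", "anc", "supporters",
    "house", "following", "dispute", "two", "neighbours", "one", "linked",
    "anc", "sap", "councillor",
]

# Top-10 words per topic, copied from the output above.
topic_words = {
    6: ["seriously", "boipatong", "injured", "left", "massacre",
        "17", "thirteen", "people", "five", "refused"],
    15: ["campaign", "soldiers", "torture", "transvaal", "dispute",
         "military", "landmine", "thrown", "section", "anc"],
    27: ["shot", "fire", "opened", "members", "police",
         "sap", "injured", "march", "town", "1990"],
}

# For each topic, collect the top words that also occur in the document.
doc_vocab = set(cleaned)
overlap = {
    topic_id: sorted(doc_vocab.intersection(words))
    for topic_id, words in topic_words.items()
}

for topic_id, shared in overlap.items():
    print(topic_id, len(shared), shared)
```

Topic 27 shares seven of its ten top words with the document, while topics 6 and 15 share two each, which is consistent with their much lower scores.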

1.5.6. Analyze the Topic Model

It is often difficult to gauge the quality of a topic model by simply looking at raw output. For this reason, it is useful to be able to study a model in its entirety. We can do this with pyLDAvis, a library designed to display the topics of a topic model, and the key words associated with each topic, in two-dimensional space.

vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, mds="mmds", R=30)

vis

Now, we can see all 100 topics plotted in two-dimensional space, and we can study the degree to which the words in each topic are unique to that topic when compared with the corpus as a whole.

While LDA is useful as a first step in exploring data, we have more powerful ways of clustering documents today. As we will see over the remainder of the chapter, other libraries and approaches that leverage state-of-the-art language models are perhaps better suited to finding and extracting topics from your corpus.