# 1.5. Creating LDA in Python

In the last section, we learned a lot about the key ideas and concepts behind LDA topic modeling. Let’s now put those ideas into practice. In this section, we will be using the Gensim library to create our topic model and the pyLDAvis library to visualize it. You can install both libraries with pip using the following commands:

pip install gensim


and

pip install pyldavis


We will also need to install NLTK, or the Natural Language Toolkit, in order to get a list of stop words. We can install the library with pip:

pip install nltk


Once you have installed NLTK, you will need to download the list of English stop words. You can do so with the following commands:

import nltk
nltk.download('stopwords')


## 1.5.1. Importing the Required Libraries and Data

Now that we have our libraries installed correctly, we can import everything.

import pandas as pd
from nltk.corpus import stopwords
import string
import gensim.corpora as corpora
from gensim.models import LdaModel
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()


This will import the requisite libraries. For this notebook, we will be using the LdaModel class from Gensim, which allows us to create an LDA model. Before we can populate our model, however, we must first load and clean our data.

df = pd.read_csv("../data/trc.csv")
df = df[["Last", "First", "Description"]]
df

Last First Description
0 AARON Thabo Simon An ANCYL member who was shot and severely inju...
1 ABBOTT Montaigne A member of the SADF who was severely injured ...
2 ABRAHAM Nzaliseko Christopher A COSAS supporter who was kicked and beaten wi...
3 ABRAHAMS Achmat Fardiel Was shot and blinded in one eye by members of ...
4 ABRAHAMS Annalene Mildred Was shot and injured by members of the SAP in ...
... ... ... ...
20829 XUZA Mandla Was severely injured when he was stoned by a f...
20830 YAKA Mbangomuni An IFP supporter and acting induna who was sho...
20831 YALI Khayalethu Was shot by members of the SAP in Lingelihle, ...
20832 YALO Bikiwe An IFP supporter whose house and possessions w...
20833 YALOLO-BOOYSEN Geoffrey Yali An ANC supporter and youth activist who was to...

20834 rows × 3 columns

Our goal in this section will be to model all the descriptions of violence in the TRC Volume 7 Final Report. We will, therefore, grab all documents and place them into a list.

docs = df.Description.tolist()
docs[0]

"An ANCYL member who was shot and severely injured by SAP members at Lephoi, Bethulie, Orange Free State (OFS) on 17 April 1991. Police opened fire on a gathering at an ANC supporter's house following a dispute between two neighbours, one of whom was linked to the ANC and the other to the SAP and a councillor."


## 1.5.2. Cleaning Documents

Now that we have our documents, let’s go ahead and load up our stop words. These will be the words that we remove from our documents.

stop_words = stopwords.words('english')


Next, we need a function to clean our documents. The purpose of the function below is to take a single document as an input and return a cleaned sequence of words with no punctuation or stop words.

def clean_doc(doc):
    no_punct = ''
    for c in doc:
        if c not in string.punctuation:
            no_punct = no_punct + c
    # with list comprehension:
    # no_punct = ''.join([c for c in doc if c not in string.punctuation])

    words = no_punct.lower().split()

    final_words = []
    for word in words:
        if word not in stop_words:
            final_words.append(word)
    # with list comprehension:
    # final_words = [word for word in words if word not in stop_words]

    return final_words


Let’s take a look and see what this looks like now when we run it over our first document.

cleaned = clean_doc(docs[0])
print(docs[0])
print(cleaned)

An ANCYL member who was shot and severely injured by SAP members at Lephoi, Bethulie, Orange Free State (OFS) on 17 April 1991. Police opened fire on a gathering at an ANC supporter's house following a dispute between two neighbours, one of whom was linked to the ANC and the other to the SAP and a councillor.
['ancyl', 'member', 'shot', 'severely', 'injured', 'sap', 'members', 'lephoi', 'bethulie', 'orange', 'free', 'state', 'ofs', '17', 'april', '1991', 'police', 'opened', 'fire', 'gathering', 'anc', 'supporters', 'house', 'following', 'dispute', 'two', 'neighbours', 'one', 'linked', 'anc', 'sap', 'councillor']


With our function created, we can now process all our documents. In the line below, we will convert all documents into a list called cleaned_docs, in which each document is represented as a sequence of words.

cleaned_docs = [clean_doc(doc) for doc in docs]


## 1.5.3. Create ID-Word Index

Remember, an LDA topic model cannot look at words; rather, it must look at numbers in order for the algorithm to work. This means that we need to convert all words into numbers. We can do this with a bag-of-words approach. In this approach, we create an ID-Word Index. This is essentially a dictionary where each unique word has a unique number. The dictionary is sorted alphabetically.
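Conceptually, an ID-Word Index can be sketched in a few lines of plain Python. The snippet below is a toy illustration of the idea, not Gensim’s actual implementation:

```python
# Toy sketch of an ID-Word Index: each unique word gets a unique integer ID.
# This mimics the idea behind gensim's corpora.Dictionary, not its internals.
toy_docs = [
    ["police", "opened", "fire"],
    ["anc", "supporter", "shot"],
]

# Collect the unique vocabulary across all documents, sorted alphabetically
vocab = sorted({word for doc in toy_docs for word in doc})

# Map each ID to its word, and each word back to its ID
id2word_toy = dict(enumerate(vocab))
word2id_toy = {word: i for i, word in id2word_toy.items()}

print(id2word_toy[0])  # the alphabetically first word in the vocabulary
```

Gensim builds exactly this kind of two-way mapping for us with corpora.Dictionary.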

id2word = corpora.Dictionary(cleaned_docs)


With the ID-Word Index created, let’s query it and see what word maps to index number 250.

id2word[250]

'bmw'


As we can see, it is BMW, a political organization in South Africa. Now that we have our dictionary, we can convert all our documents into a sequence of numbers, rather than words. We can do this via the doc2bow() method.

corpus = [id2word.doc2bow(cleaned_doc) for cleaned_doc in cleaned_docs]


The corpus object now contains all the documents but represented as a bag-of-words index, rather than as a sequence of words. This is the precise data that our LDA model will expect.
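The bag-of-words conversion itself can be sketched in plain Python with collections.Counter. This is a toy stand-in for doc2bow(), assuming a word-to-ID mapping already exists; Gensim’s real method handles unseen words and other details:

```python
from collections import Counter

# A small hypothetical word-to-ID mapping for illustration
word2id_toy = {"anc": 0, "police": 1, "sap": 2, "shot": 3}

def toy_doc2bow(words):
    # Map each known word to its ID and count occurrences,
    # returning (id, count) pairs sorted by ID
    counts = Counter(word2id_toy[w] for w in words if w in word2id_toy)
    return sorted(counts.items())

print(toy_doc2bow(["police", "shot", "anc", "anc"]))
```

Each pair reads as (word ID, number of times that word occurs in the document), which is the same shape as the entries in our corpus object.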

print(corpus[0])

[(0, 1), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 2), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1)]


In order to see what these numbers correspond to, let’s take a look at our first document and map each number to the ID-Word Index.

for num in corpus[0]:
    num = num[0]
    print(f"{num}\t{id2word[num]}")

0	17
1	1991
2	anc
3	ancyl
4	april
5	bethulie
6	councillor
7	dispute
8	fire
9	following
10	free
11	gathering
12	house
13	injured
14	lephoi
15	linked
16	member
17	members
18	neighbours
19	ofs
20	one
21	opened
22	orange
23	police
24	sap
25	severely
26	shot
27	state
28	supporters
29	two


## 1.5.4. Creating LDA Topic Model

With our corpus now prepared, we can pass it to our LDA model. There are a lot of optional parameters that we can set here, but three things are mandatory:

- corpus, which will be our corpus object

- id2word, which will be our ID-Word Index

- num_topics, which will be the number of topics we want the model to find

Now, we can train our topic model.

lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=100)


Training a model may take several seconds to minutes. The more documents you have, the longer it will take. Once the model is trained, we can begin to analyze the model.

## 1.5.5. Analyze a Document

If we want to get the topics for each document in our corpus, we can do so via get_document_topics(). This will take a single argument, the corpus object. It will return, for each document, a list of topics along with a score for each topic.

topics = lda_model.get_document_topics(corpus)
len(topics)

20834


We can analyze the topics for our first document.

topics[0]

[(46, 0.2453331),
(49, 0.1601346),
(57, 0.24741054),
(60, 0.064691655),
(65, 0.09262718),
(67, 0.06555787),
(87, 0.058863647),
(91, 0.037500985)]
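To find a document’s dominant topic, we can sort these (topic, score) pairs by score. A small sketch, using the scores copied from the output above:

```python
# (topic_id, probability) pairs as returned by get_document_topics()
# for the first document (values copied from the output above)
doc_topics = [
    (46, 0.2453331), (49, 0.1601346), (57, 0.24741054),
    (60, 0.064691655), (65, 0.09262718), (67, 0.06555787),
    (87, 0.058863647), (91, 0.037500985),
]

# Sort descending by probability to rank the topics for this document
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
dominant_topic, score = ranked[0]
print(dominant_topic)  # the single most probable topic for this document
```

Note that the top two topics here have very close scores, so the "dominant" topic label should be read with that uncertainty in mind.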


These topics were assigned to the following document:

print(docs[0])

An ANCYL member who was shot and severely injured by SAP members at Lephoi, Bethulie, Orange Free State (OFS) on 17 April 1991. Police opened fire on a gathering at an ANC supporter's house following a dispute between two neighbours, one of whom was linked to the ANC and the other to the SAP and a councillor.


Let’s now print off some of the key words for each of the top-3 topics. For each topic, we will print the top-10 words with the get_topic_terms() method, which takes two arguments: the topic number and the number of words to return.

for topic in topics[0][:3]:
    terms = lda_model.get_topic_terms(topic[0], 10)
    print(topic)
    for num in terms:
        num = num[0]
        print(num, id2word[num])
    print()

(46, 0.24563931)
19 ofs
27 state
10 free
22 orange
1250 wife
2 anc
4 april
279 13
17 members

(49, 0.16010018)
26 shot
8 fire
21 opened
24 sap
17 members
888 funeral
13 injured
23 police
255 march

(57, 0.24714684)
1200 students
3044 parents
1268 william’s
17 members
2040 amabutho
2560 beat
2284 associated
224 school
24 sap
237 assaulted


These results seem acceptable.

## 1.5.6. Analyze the Topic Model

It is often difficult to gauge the quality of a topic model by simply looking at raw output. For this reason, it is useful to be able to study a model in its entirety. We can do this with pyLDAvis, a library designed to display the topics of a topic model in two-dimensional space, along with the key words associated with each topic.

vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, mds="mmds", R=30)

vis


Now, we can see all 100 topics plotted in two-dimensional space, and we can study the degree to which the words in each topic are unique to that topic when compared with the corpus as a whole.

While useful as a first step for exploring data, we have more powerful ways of clustering documents today. As we will see over the remainder of the chapter, other libraries and approaches that leverage state-of-the-art language models are perhaps better suited to finding and extracting topics from your corpus.