1.6. Transformer Models

Transformer-based language models are robust machine learning models capable of solving many complex problems. It is beyond the scope of this textbook to explain the architecture of transformers or precisely how they work. Nevertheless, it is important to understand a few things. First, transformer models are particularly suited for working with multilingual documents. Second, they are powerful but slow. Third, they represent texts differently from other language models: they can change the vector, or numerical representation, of an individual word based on its context.

This makes them particularly suited to the problem of topic modeling. Rather than converting each document into a sequence of numbers that record the presence of individual words, transformer models can convert a document into a deeply semantic representation that retains much of its context and syntax.

In order to leverage transformer models, we need to install SentenceTransformers (from HuggingFace). We can do that with pip via the command below:

pip install sentence_transformers

In this section, we will also be working with two other libraries: umap (for representing our complex numerical representation of documents in 2-dimensional space) and hdbscan (for finding clusters). UMAP has gained popularity in recent years as a quick, effective, and fairly accurate way to represent higher-dimensional data in lower dimensions. In Python, we can access the UMAP algorithm through the UMAP library, which can be installed with pip by typing the following command:

pip install umap-learn

Note the -learn after umap. This is very important, as umap is an entirely different library. HDBSCAN can be installed with pip in the same way:

pip install hdbscan

Now that our libraries are installed, we can begin importing them.

1.6.1. Importing Libraries and Gathering Data

from sentence_transformers import SentenceTransformer
import umap
import hdbscan
import pandas as pd

Before we begin leveraging advanced transformer-based topic modeling libraries, like Top2Vec, we should have a good understanding of how they work. Top2Vec leverages three libraries: Sentence Transformers, UMAP, and HDBSCAN. Here, we will explore each of these so that the reader will have a basic understanding of the theory and methods behind Top2Vec.

df = pd.read_csv("../data/trc.csv")
df = df[["Last", "First", "Description"]]
df.head(1)
   Last   First        Description
0  AARON  Thabo Simon  An ANCYL member who was shot and severely inju...
documents = df.Description.tolist()

1.6.2. Embedding the Documents

The first step in using transformers in topic modeling is to convert the text into a vector. We met vectors when we explored LDA topic modeling in the previous chapter. The arrays for LDA topic modeling were rooted in a bag-of-words (BOW) index. This index, while computationally light, did not retain semantic meaning or word order.
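To make the limitation of a bag-of-words index concrete, here is a minimal sketch using only the standard library. The two sentences and the helper function bag_of_words are hypothetical and are not part of the TRC data or of any library; they simply show that word counts alone cannot tell the two sentences apart.

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase token appears, ignoring word order."""
    return Counter(text.lower().split())


# Two hypothetical sentences with the same words in a different order.
doc_a = "the dog chased the cat"
doc_b = "the cat chased the dog"

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)

# The counts are identical even though the sentences mean different things.
print(bow_a == bow_b)
```

Because the two counters are equal, any model that sees only these counts must treat the sentences as the same document.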

When we are working with transformers, we can create a vector for each document in our dataset. This vector is not an index of the words used; rather, it is an embedding of the entire document that captures how its words are used semantically. To a degree, it also preserves word order in this same vector space. This document vector is similar to the word vector that we met in Part Three of this textbook. Instead of embedding a single word, however, the entire document receives an embedding. This allows us to mathematically compare documents across an entire corpus.
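One common way to compare two document vectors mathematically is cosine similarity. The sketch below is only an illustration: the three 4-dimensional vectors are invented stand-ins for real embeddings (which have hundreds of dimensions), and cosine similarity is one possible measure, not necessarily the one a given library uses.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: doc_1 and doc_2 point in a similar direction,
# while doc_3 points somewhere quite different.
doc_1 = [0.9, 0.1, 0.0, 0.2]
doc_2 = [0.8, 0.2, 0.1, 0.3]
doc_3 = [0.0, 0.9, 0.8, 0.1]

sim_12 = cosine_similarity(doc_1, doc_2)
sim_13 = cosine_similarity(doc_1, doc_3)

# The first pair scores higher than the second.
print(sim_12 > sim_13)
```

A score close to 1 means the two vectors point in nearly the same direction; a score close to 0 means they have little in common.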

To convert our documents into vectors, we first need a transformer model. Fortunately, the SentenceTransformers library from HuggingFace allows us to easily load robust pre-trained language models. In our case, we will be using the all-MiniLM-L6-v2 model. We can load this model by calling the SentenceTransformer class from the sentence_transformers library.

model = SentenceTransformer('all-MiniLM-L6-v2')

Once our model object is created, we can use its .encode() method. This method will encode all the documents that we pass to it. In our case, this is the approximately 22,000 descriptions from the TRC dataset. The .encode() method takes a single mandatory argument, a list of data to embed.

doc_embeddings = model.encode(documents)

Now that we have the vectors for each document, let’s examine one.

doc_embeddings[0][:10]
array([-0.07123438,  0.00332599, -0.05571468,  0.08363082,  0.09066874,
        0.05503598,  0.08029197,  0.01370712,  0.05912026,  0.06226278],
      dtype=float32)

As we can see, this looks remarkably similar to our word embeddings. While this is useful for mathematically comparing the similarity between documents, it can be difficult to parse this numerical data visually. For this reason, it is useful to flatten the data into 2 or 3 dimensions. This allows the data to be graphed. In the previous chapter, we learned how to flatten data with PCA. In this chapter, we will meet a new dimensionality reduction algorithm, UMAP.

1.6.3. Flattening the Data

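Once you have installed UMAP correctly, you can access the UMAP class. It takes several parameters that can be adjusted to yield different results. Here, n_neighbors controls how much local versus global structure is preserved, min_dist controls how tightly points may be packed together in the projection, and metric sets how the distance between embeddings is measured. Because we do not set n_components, UMAP uses its default of 2 and returns a 2-dimensional projection.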

umap_proj = umap.UMAP(n_neighbors=10,
                          min_dist=0.01,
                          metric='correlation').fit_transform(doc_embeddings)

1.6.4. Isolating Clusters with HDBSCAN

Once our data has been flattened, we can use the HDBSCAN algorithm to automatically identify the number of clusters within it and assign each document to a cluster.

hdbscan_labels = hdbscan.HDBSCAN(min_samples=2, min_cluster_size=3).fit_predict(umap_proj)
print(len(set(hdbscan_labels)))
2317
df["x"] = umap_proj[:, 0]
df["y"] = umap_proj[:, 1]
df["topic"] = hdbscan_labels
df.head(1)
   Last   First        Description                                        x         y          topic
0  AARON  Thabo Simon  An ANCYL member who was shot and severely inju...  7.330942  -0.935302  -1
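The number printed above counts every distinct label, including -1, which HDBSCAN uses for points it treats as noise rather than as members of a cluster. A minimal sketch of separating these counts follows. The short list labels is a hypothetical stand-in for hdbscan_labels, so the numbers it produces do not describe the TRC data.

```python
from collections import Counter

# Hypothetical labels standing in for hdbscan_labels; -1 marks noise.
labels = [-1, 0, 0, 1, -1, 2, 2, 2, 1, -1]

sizes = Counter(labels)

n_labels = len(sizes)                            # distinct labels, noise included
n_clusters = len([t for t in sizes if t != -1])  # real clusters only
n_noise = sizes[-1]                              # documents labelled as noise

# The largest real cluster and the number of documents in it.
largest_topic, largest_size = max(
    ((t, n) for t, n in sizes.items() if t != -1),
    key=lambda pair: pair[1],
)

print(n_labels, n_clusters, n_noise, largest_topic, largest_size)
```

Applied to the real labels, the same idea would show that the number of actual clusters is one less than the count of distinct labels whenever any document is marked as noise.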

1.6.5. Analyzing the Labels

Now that we have the labels loaded into our DataFrame, we can use Pandas to interrogate that data. Let’s grab a topic and examine it. Here, we will examine topic 100.

for d in df.loc[df.topic == 100].Description.tolist():
    print(d)
    print()
Was one of thirteen people killed in an attack by UDF and ANC supporters on Inkatha supporters in the Mahwaqa area, near Port Shepstone, Natal, on 24 March 1990. Two UDF supporters were granted amnesty (AC/2000/041).

An ANC supporter who was shot dead on 6 April 1990 when a group of Inkatha supporters attacked UDF supporters and residents at Mpumalanga, KwaZulu, near Durban, in spite of a heavy police and military presence. Fourteen people were killed and at least one hundred and twenty homes were burnt down. One former IFP member was granted amnesty (AC/1999/0332).

An ANC supporter who was shot and killed when a group of Inkatha supporters and Caprivi trainees attacked a UDF meeting in a house at Mpumalanga, KwaZulu, near Durban, on 18 January 1988. Nine people were killed and an estimated two hundred people were injured in the attack. The group went on to destroy around eight houses. One former Inkatha member was granted amnesty (AC/1999/0332).

An ANC supporter who was shot and killed when a group of Inkatha supporters and Caprivi trainees attacked a UDF meeting in a house at Mpumalanga, KwaZulu, near Durban, on 18 January 1988. Nine people were killed and an estimated two hundred people were injured in the attack. The group went on to destroy around eight houses. One former Inkatha member was granted amnesty (AC/1999/0332).

An ANC supporter who was shot dead on 18 January 1988 when a group of Inkatha supporters, including some Caprivi trainees, opened fire on a UDF meeting in a house at Mpumalanga, KwaZulu, near Durban. Nine people were killed and an estimated two hundred people were injured. The group went on to destroy about eight houses. Amnesty applications were received for this incident.

Was one of thirteen people killed in an attack by UDF and ANC supporters on Inkatha supporters in the Mahwaqa area, near Port Shepstone, Natal, on 24 March 1990. Two UDF supporters were granted amnesty (AC/2000/041).

Was one of thirteen people killed in an attack by UDF and ANC supporters on Inkatha supporters in the Mahwaqa area, near Port Shepstone, Natal, on 24 March 1990. Two UDF supporters were granted amnesty (AC/2000/041).

Was abducted, interrogated and stabbed to death by UDF and ANC supporters at Mpumalanga, Natal, in July 1989. He was believed to be an Inkatha member. One UDF and ANC supporter was granted amnesty (AC/200/011).

Was one of thirteen people killed in an attack by UDF and ANC supporters on Inkatha supporters in the Mahwaqa area, near Port Shepstone, Natal, on 24 March 1990. Two UDF supporters were granted amnesty (AC/2000/041).

Was one of thirteen people killed in an attack by UDF and ANC supporters on Inkatha supporters in the Mahwaqa area, near Port Shepstone, Natal, on 24 March 1990. Two UDF supporters were granted amnesty (AC/2000/041).

An Inkatha supporter who was stabbed and severely injured by named ANC supporters in Mtwalume, near Umzinto, Natal, in political conflict in the area on 4 February 1990 following the unbanning of political organisations two days earlier. Two UDF supporters were granted amnesty for the attack (AC/2000/041).

Was one of thirteen people killed in an attack by UDF and ANC supporters on Inkatha supporters in the Mahwaqa area, near Port Shepstone, Natal, on 24 March 1990. Two UDF supporters were granted amnesty (AC/2000/041). 

Was one of thirteen people killed in an attack by UDF and ANC supporters on Inkatha supporters in the Mahwaqa area, near Port Shepstone, Natal, on 24 March 1990. Two UDF supporters were granted amnesty (AC/2000/041). 

Was one of thirteen people killed in an attack by UDF and ANC supporters on Inkatha supporters in the Mahwaqa area, near Port Shepstone, Natal, on 24 March 1990. Two UDF supporters were granted amnesty (AC/2000/041). 

Was one of thirteen people killed in an attack by UDF and ANC supporters on Inkatha supporters in the Mahwaqa area, near Port Shepstone, Natal, on 24 March 1990. Two UDF supporters were granted amnesty (AC/2000/041). 

Notice that the results for Topic 100 clearly overlap. Some descriptions are identical; others are quite similar. Nevertheless, this approach has one major limitation: outliers, or noise. Let’s examine Topic -1.
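Before turning to Topic -1, it can help to quantify how much of a topic consists of exact duplicates. The sketch below is one simple way to do so with the standard library. The list topic_docs holds invented placeholder strings rather than real TRC descriptions, and stripping whitespace is an assumption made so that strings differing only by a trailing space are counted together.

```python
from collections import Counter

# Hypothetical stand-ins for the descriptions printed for one topic.
topic_docs = [
    "Report on incident A.",
    "Report on incident A. ",  # same text with a trailing space
    "Report on incident B.",
    "Report on incident A.",
]

# Strip surrounding whitespace so near-identical strings are counted together.
counts = Counter(doc.strip() for doc in topic_docs)

# Print each distinct description with the number of times it occurs.
for text, n in counts.most_common():
    print(n, text)
```

With the DataFrame from this section, the same counter could be built from the Description values of a single topic.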

1.6.6. Outliers (Noise)

df.loc[df.topic == -1]
       Last      First                Description                                        x          y          topic
0      AARON     Thabo Simon          An ANCYL member who was shot and severely inju...  7.330942   -0.935302  -1
7      ABRAHAMS  Moegsien             Was stabbed and stoned to death by a group of ...  5.861767   10.907988  -1
16     ADAMS     Zwelinzima Sidwell   Was severely beaten and shot in the leg in Gug...  5.548826   9.517207   -1
31     AGGETT    Neil Hudson          Died in detention at John Vorster Square, Joha...  11.213459  8.340458   -1
34     ALBERT    Nombuyiselo Francis  Was beaten and stabbed to death on 10 December...  6.867918   2.346848   -1
...    ...       ...                  ...                                                ...        ...        ...
20804  XULU      Josephina            An IFP supporter who had three rondavels burnt...  2.483765   3.955176   -1
20810  XULU      Mzomonje Phineas     Was stabbed to death in Inchanga, Natal, on 15...  5.103362   11.622403  -1
20820  XULU      Sipho Aubrey         An ANC supporter who was shot and fatally woun...  7.560771   0.817540   -1
20821  XULU      Sipho Brigitte       An MK operative who was executed in Pretoria C...  3.011799   -0.359574  -1
20830  YAKA      Mbangomuni           An IFP supporter and acting induna who was sho...  8.195865   2.783923   -1

4579 rows × 6 columns

We have 4,579 outliers, or documents that do not neatly cluster into any one category. This is a clear limitation of this approach. What do we do with these outliers? One option is to assign them to the nearest topic. The process of doing this, however, can be quite complex, as the way we calculate the distance between a document vector and a topic can vary depending on which measurements we wish to use. In the final section of this chapter, we will examine a library that handles this problem for us.
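To illustrate what assigning outliers to the nearest topic might look like, here is a minimal sketch under stated assumptions: each topic is summarized by the centroid (mean x and y) of its points in the 2-dimensional projection, and distance is measured as plain Euclidean distance. The points and the helper nearest_topic are hypothetical, and this is only one of several possible choices, not the method used by any particular library.

```python
import math
from collections import defaultdict

# Hypothetical (x, y, topic) rows; -1 marks an outlier.
points = [
    (0.0, 0.0, 0), (0.2, 0.1, 0),
    (5.0, 5.0, 1), (5.2, 4.9, 1),
    (0.5, 0.4, -1), (4.6, 5.3, -1),
]

# Collect the coordinates of every point that already belongs to a topic.
grouped = defaultdict(list)
for x, y, topic in points:
    if topic != -1:
        grouped[topic].append((x, y))

# Summarize each topic by the mean of its coordinates (its centroid).
centroids = {
    topic: (
        sum(x for x, _ in coords) / len(coords),
        sum(y for _, y in coords) / len(coords),
    )
    for topic, coords in grouped.items()
}


def nearest_topic(x, y):
    """Return the topic whose centroid is closest to the point (x, y)."""
    return min(centroids, key=lambda t: math.dist((x, y), centroids[t]))


# Keep existing topics; replace -1 with the nearest topic.
reassigned = [
    (x, y, nearest_topic(x, y) if topic == -1 else topic)
    for x, y, topic in points
]

print([topic for _, _, topic in reassigned])
```

Choosing a different distance measure, or measuring distance in the original embedding space instead of the 2-dimensional projection, could assign some outliers to different topics, which is exactly why this step is less straightforward than it first appears.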