1.2. Topics and Clusters

1.2.1. What are Topics?

Topics are labels assigned to textual data that describe the subjects contained within a given text. In topic modeling, we try to create computer systems that can assign topics the way a human would. To understand this process, it’s best to take a step back and think about how we, as humans, assign topics.

To do this, let’s examine these two texts.

| Number | Text |
| --- | --- |
| Text 1 | Thomas enjoys playing basketball. He is an exceptionally good point guard. |
| Text 2 | Victoria enjoys playing baseball. She is exceptionally good at playing first base. |

If I asked you to assign two topics to these texts, what might they be? Basketball and baseball are the likely candidates: Text 1 would have the topic of basketball, and Text 2 the topic of baseball. Now, let’s consider these same texts, but add two more into the mix.

| Number | Text |
| --- | --- |
| Text 1 | Thomas enjoys playing basketball. He is an exceptionally good point guard. |
| Text 2 | Victoria enjoys playing baseball. She is exceptionally good at playing first base. |
| Text 3 | John is a talented chef. He enjoys making pasta professionally. |
| Text 4 | Jeff is a talented cook. He owns a pizzeria. |

Now, if I asked you to assign two topics to all four texts, what might those topics be? It is likely that your answer changed. The topics of baseball and basketball are no longer adequate because Text 3 and Text 4 do not fit either of them. A better pair of topics might be sports and cooking. What changed? The collection of texts in our corpus.

What does this demonstrate? It tells us that topics are corpus-dependent: the topics we assign to a text depend on the other texts in the collection. The same holds true for topic modeling via computer systems.
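One way to see corpus dependence in a computational setting is to count how many texts in a corpus contain each word. The sketch below is a minimal illustration, not a topic model: the helper names (`tokenize`, `document_share`) and the small stop-word list are assumptions made for this example. It shows that the word "playing" cannot separate the texts in the two-text corpus, because both contain it, yet it does separate the sports texts from the cooking texts once Texts 3 and 4 are added.

```python
import re
from collections import Counter

# Hypothetical stop-word list, chosen for this example only.
STOP_WORDS = {"a", "an", "at", "he", "is", "she"}


def tokenize(text):
    """Lowercase a text and return its set of non-stop-word terms."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def document_share(corpus):
    """Map each term to the fraction of texts in the corpus that contain it."""
    counts = Counter()
    for text in corpus:
        counts.update(tokenize(text))  # each text counts a term at most once
    return {term: n / len(corpus) for term, n in counts.items()}


two_texts = [
    "Thomas enjoys playing basketball. He is an exceptionally good point guard.",
    "Victoria enjoys playing baseball. She is exceptionally good at playing first base.",
]
four_texts = two_texts + [
    "John is a talented chef. He enjoys making pasta professionally.",
    "Jeff is a talented cook. He owns a pizzeria.",
]

share_two = document_share(two_texts)
share_four = document_share(four_texts)

# "playing" is in every text of the small corpus, so it separates nothing there.
print(share_two["playing"])   # 1.0
# In the larger corpus only the two sports texts contain it.
print(share_four["playing"])  # 0.5
```

The texts themselves did not change between the two runs; only the corpus around them did, and that alone changed which words are informative.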

1.2.2. What are Clusters?

In topic modeling, computer systems do not generate topics directly; rather, they generate lists of words that are highly concentrated in particular groups of texts. Texts that share common terms are clustered together by similarity. A cluster is nothing more than a collection of similar pieces of data. When we are working with texts, a cluster is a collection of texts with similar, overlapping themes.
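To make the idea of clustering by shared terms concrete, the following sketch groups the four example texts from the previous section. It is a minimal illustration under stated assumptions, not the method any particular topic-modeling library uses: it measures similarity as the Jaccard overlap between word sets and repeatedly merges the two most similar groups until the requested number remains (single-linkage agglomerative clustering). The function names and the stop-word list are hypothetical.

```python
import re
from itertools import combinations

# Hypothetical stop-word list, chosen for this example only.
STOP_WORDS = {"a", "an", "at", "he", "is", "she"}


def tokenize(text):
    """Lowercase a text and return its set of non-stop-word terms."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def jaccard(a, b):
    """Share of terms two sets have in common: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_texts(corpus, n_clusters):
    """Merge the two most similar clusters until n_clusters remain.

    Cluster similarity is the best Jaccard score between any pair of
    member texts (single linkage). Returns sorted lists of text indices.
    """
    term_sets = [tokenize(text) for text in corpus]
    clusters = [[i] for i in range(len(corpus))]  # start with one text each

    def similarity(c1, c2):
        return max(jaccard(term_sets[i], term_sets[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: similarity(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(sorted(c1 + c2))
    return sorted(clusters)


four_texts = [
    "Thomas enjoys playing basketball. He is an exceptionally good point guard.",
    "Victoria enjoys playing baseball. She is exceptionally good at playing first base.",
    "John is a talented chef. He enjoys making pasta professionally.",
    "Jeff is a talented cook. He owns a pizzeria.",
]

# Indices are zero-based: 0 and 1 are the sports texts, 2 and 3 the cooking texts.
print(cluster_texts(four_texts, 2))  # [[0, 1], [2, 3]]
```

Note that Texts 3 and 4 share only the word "talented" after stop-word removal, so the cooking cluster rests on thin evidence. Word overlap alone has no way of knowing that "chef" and "cook" are related, which is one reason to look at other ways of representing texts.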

There are various ways to cluster textual data that we will explore throughout this chapter.