1.3. Bigrams and Trigrams
Let’s take a moment to step away from topic modeling. Instead, let’s think about language. The essential medium of topic modeling is text, and text is a product of language. And language is nothing more than a collection of words whose usage is rooted in grammar and syntax. A given word’s meaning, in other words, is often defined by how, when, and where it is used in a given text. Words can also mean different things when used together.
1.3.1. Textual Ambiguity
If I use the word “apple”, you are likely thinking of the fruit. Apple is a simple word, yet it can mean different things in different contexts. What if I said the following: “My Apple is better than a PC.” Now, what image comes to mind? Perhaps the computer.
This is an example of textual ambiguity, and syntactical context helps eliminate it. Apple is a single word here that names a single concept. It’s a relatively simple word, perhaps one of the earliest that a native speaker of English learns as a child, and yet it is textually ambiguous. Because “apple” is a single word, it is known as a unigram: a single word that represents a single concept.
Textual ambiguity, however, occurs in more dynamic ways when we think about concepts that span more than a single word. In this section, we will focus on two such cases that are essential for natural language processing: bigrams and trigrams. Bigrams are two words that carry a distinct meaning when used together, while trigrams are three words that carry a distinct meaning when used together.
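To make the terms concrete, here is a minimal sketch of how unigrams, bigrams, and trigrams can be pulled out of a sentence with plain Python. The helper name `ngrams` and the whitespace tokenization are assumptions made for illustration. Note that this sliding window lists every run of adjacent words; only some of those runs carry the kind of distinct meaning discussed in this section.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Naive tokenization: lowercase text split on whitespace (an assumption).
tokens = "my apple is better than a pc".split()

unigrams = ngrams(tokens, 1)  # single words
bigrams = ngrams(tokens, 2)   # adjacent pairs of words
trigrams = ngrams(tokens, 3)  # adjacent triples of words

print(bigrams[:3])
# [('my', 'apple'), ('apple', 'is'), ('is', 'better')]
```

A sentence of seven words yields seven unigrams, six bigrams, and five trigrams, because each window slides forward one word at a time.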
Understanding bigrams and trigrams is essential because, in order for a computer to truly understand language the way a human does, it must be able to understand the nuances of a single word: how that word’s meaning shifts not only with context, but also when it is used in conjunction with other words.
1.3.2. Bigrams
As noted above, a bigram is a combination of two words that has a distinct meaning. To demonstrate this, let us quickly consider the word “French”. It is a single word that may have multiple meanings. Perhaps the word French refers to the language; perhaps it refers to a French person.
Let’s hold off on the word French for just a moment. Let’s now use the word “revolution”. Again, there is textual ambiguity. Perhaps I am referring to a political revolution, or perhaps I am referring to revolution in the sense of the Earth’s movement.
Now you may already see where I am going with this, but let’s think about what happens when I put those two textually ambiguous unigrams together: “The French Revolution”. “The” here is a stop word that is frequently dropped in natural language processing, so “French Revolution” is all that we should consider. These two words, when combined, express a distinct concept.
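The step described here, dropping the stop word and keeping the remaining pair, can be sketched in a few lines. The stop-word list below is a tiny, hypothetical one written for this example, and the function name `content_bigrams` is likewise an assumption; real stop-word lists are much longer.

```python
# A deliberately tiny, hypothetical stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and"}

def content_bigrams(text):
    """Lowercase the text, drop stop words, and pair up adjacent words."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return list(zip(tokens, tokens[1:]))

print(content_bigrams("The French Revolution"))
# [('french', 'revolution')]
```

Once “the” is removed, the only bigram left is the pair that carries the distinct concept.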
But we can think about language in even more nuanced ways.
1.3.3. Trigrams
Trigrams, as noted above, are the same as bigrams, except with three words instead of two. Let’s continue with our example of “French”. What might you think about if I used the word “army”? Perhaps something distinct to your own experiences with the word. As a modern American, I initially think of the American Army in the modern sense of the word, so I may picture something like modern soldiers in my mind’s eye.
For others in other parts of the world, other images may be more relevant. However, when I use the phrase “The French Revolutionary Army”, I now have something defined, something distinct. With the stop word “The” dropped, this is a trigram: a distinct concept consisting of three words that may each have their own meanings when used alone.
1.3.4. Why Are These Important?
So, why are bigrams and trigrams so important? The reason comes down to getting machines to understand that certain words, when used together, bear a distinct meaning. To produce a good topic model, therefore, the model must be able to process words in this manner, the way we humans use the language we are trying to get the machine to understand.
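One common way to hand this information to a model is to join each recognized phrase into a single token before modeling, so that “french revolution” is counted as one unit rather than two unrelated words. The sketch below assumes a hand-written phrase set (`PHRASES`) and a hypothetical helper (`merge_phrases`); in practice, such phrases are typically learned from how often words occur together in a corpus rather than listed by hand.

```python
# Hypothetical, hand-written phrases for illustration only.
PHRASES = {
    ("french", "revolution"),
    ("french", "revolutionary", "army"),
}

def merge_phrases(tokens, phrases, max_len=3):
    """Join known bigrams and trigrams into single underscore-linked tokens."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest phrase first so a trigram wins over a bigram.
        for n in range(max_len, 1, -1):
            candidate = tuple(tokens[i:i + n])
            if len(candidate) == n and candidate in phrases:
                merged.append("_".join(candidate))
                i += n
                break
        else:
            # No phrase starts here, so keep the word as a unigram.
            merged.append(tokens[i])
            i += 1
    return merged

tokens = "french revolutionary army fought during french revolution".split()
print(merge_phrases(tokens, PHRASES))
# ['french_revolutionary_army', 'fought', 'during', 'french_revolution']
```

After merging, a model that only counts tokens can still tell the French Revolution apart from other uses of “french” and “revolution”.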