2.3. The PhraseMatcher

2.3.1. Introduction

Another rules-based component built into spaCy is the PhraseMatcher. Like the Matcher, the PhraseMatcher does not sit inside a spaCy pipeline. It therefore does not write its results to a Doc attribute, such as doc.ents, the way the EntityRuler does. Instead, it is meant to run over a Doc container, just like the Matcher. Unlike the Matcher, however, the PhraseMatcher does not rely on a sequence of linguistic features at the token level; rather, it is focused on matching at the phrase level.

In practice, you would use the Matcher when you need to rely on a sequence of linguistic features at the token level to extract data. This is powerful, but it can be difficult to write patterns robust enough to match every instance of the data you wish to extract. The PhraseMatcher, on the other hand, should be used when you know relatively well how the data will appear in a text. It is easier to use than the Matcher, but it is not as dynamic.
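The contrast between the two components can be sketched in a few lines. This is a minimal illustration, not part of the original example: it assumes a blank English pipeline (which is enough, since both matchers only need a tokenizer and a vocab), and the sentence and the label HARRY_SURNAME are invented for the demonstration.

```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher

# A blank pipeline provides the tokenizer and vocab that both matchers need.
nlp = spacy.blank("en")
doc = nlp("Harry Potter met Harry Dresden in London.")

# Matcher: describe each token by its features.
# This pattern reads "the exact word Harry, followed by any title-cased token".
token_matcher = Matcher(nlp.vocab)
token_matcher.add("HARRY_SURNAME", [[{"ORTH": "Harry"}, {"IS_TITLE": True}]])

# PhraseMatcher: list the exact phrases you expect to find.
phrase_matcher = PhraseMatcher(nlp.vocab)
phrase_matcher.add("HARRY_POTTER", [nlp.make_doc("Harry Potter")])

token_hits = [doc[start:end].text for _, start, end in token_matcher(doc)]
phrase_hits = [doc[start:end].text for _, start, end in phrase_matcher(doc)]
print(token_hits)
print(phrase_hits)
```

The token-level pattern generalizes to surnames it has never seen, while the phrase-level pattern only finds the phrase it was given. That is the trade-off between flexibility and ease of use described above.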

Basic Example

As with the Matcher, it is best to see the PhraseMatcher in action with a basic example. First, let’s import the PhraseMatcher class and load up the default small English pipeline.

import spacy
from spacy.matcher import PhraseMatcher
nlp = spacy.load("en_core_web_sm")

Now, consider the text below. Here, we wish to find and extract the instances where Harry Potter appears in the text. Harry appears in four different ways in the text: 1) Harry Potter, 2) Harry, 3) Potter, and 4) the Boy who Lived.

text = """
Harry Potter was the main character in the book.
Harry was a normal boy who discovered he was a wizard.
Ultimately, Potter goes to Hogwarts.
He is also known as the Boy who Lived.
The Boy who Lived has an enemy named Voldemorte who is known as He who Must not be Named.
"""
matcher = PhraseMatcher(nlp.vocab)
matcher.add("HARRY_POTTER", [nlp("Harry Potter"), nlp("Harry"), nlp("Potter"), nlp("the Boy who Lived")])
doc = nlp(text)

Again, we will create our matches.

matches = matcher(doc)

Let’s iterate over our matches.

for match in matches:
    print(match)
(12243270181114079557, 1, 2)
(12243270181114079557, 1, 3)
(12243270181114079557, 2, 3)
(12243270181114079557, 12, 13)
(12243270181114079557, 27, 28)
(12243270181114079557, 38, 42)

And now let’s iterate over our matches and grab a bit more data, including the token spans and the sentence in which a match was found.

for match_id, start, end in matches:
    # match_id is a hash; look up its string label in the vocab's string store
    print(nlp.vocab.strings[match_id], doc[start:end])
    print(f"Sentence: {doc[start].sent}")
HARRY_POTTER Harry
Sentence: Harry Potter was the main character in the book.

HARRY_POTTER Harry Potter
Sentence: Harry Potter was the main character in the book.

HARRY_POTTER Potter
Sentence: Harry Potter was the main character in the book.

HARRY_POTTER Harry
Sentence: Harry was a normal boy who discovered he was a wizard.

HARRY_POTTER Potter
Sentence: Ultimately, Potter goes to Hogwarts.

HARRY_POTTER the Boy who Lived
Sentence: He is also known as the Boy who Lived.

As we can tell, the results are good, but we are missing one case. The second instance of the Boy who Lived was not grabbed because The is capitalized at the start of the sentence, whereas our pattern begins with a lowercase the. We can account for this by changing the main attribute of the PhraseMatcher.

2.3.2. Setting a Custom Attribute

Unlike the Matcher, the PhraseMatcher does not let us control how it reads each individual token in the pattern. Instead, the PhraseMatcher parses the phrase at the sequence level. By default, it reads the entire pattern as ORTH, or raw text. In other words, the text must match the pattern exactly, including its case, in order to be flagged and extracted. In some instances, however, a pattern should match regardless of case. This is particularly important for phrases like the boy who lived, where the word the may be capitalized if it is at the start of a sentence. In these instances, we can change the way the PhraseMatcher compares tokens by using the attr argument. By using attr="LOWER", we can make our PhraseMatcher patterns case-insensitive.

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("HARRY_POTTER", [nlp("Harry Potter"), nlp("Harry"), nlp("Potter"), nlp("the Boy who Lived")])
doc = nlp(text)
matches = matcher(doc)
for match_id, start, end in matches:
    # match_id is a hash; look up its string label in the vocab's string store
    print(nlp.vocab.strings[match_id], doc[start:end])
    print(f"Sentence: {doc[start].sent}")
HARRY_POTTER Harry
Sentence: Harry Potter was the main character in the book.

HARRY_POTTER Harry Potter
Sentence: Harry Potter was the main character in the book.

HARRY_POTTER Potter
Sentence: Harry Potter was the main character in the book.

HARRY_POTTER Harry
Sentence: Harry was a normal boy who discovered he was a wizard.

HARRY_POTTER Potter
Sentence: Ultimately, Potter goes to Hogwarts.

HARRY_POTTER the Boy who Lived
Sentence: He is also known as the Boy who Lived.

HARRY_POTTER The Boy who Lived
Sentence: The Boy who Lived has an enemy named Voldemorte who is known as He who Must not be Named.

Notice that we have now grabbed both ways the phrase the boy who lived is written in our text.
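The output above also contains overlapping matches: Harry, Harry Potter, and Potter are all reported for the first sentence. If only the longest match is wanted, one option is to ask the PhraseMatcher for Span objects and pass them through spaCy's filter_spans utility. The sketch below is an assumption about how you might want to post-process the results, not a step from the original example; it uses a blank English pipeline and a shortened, invented text.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
phrases = ["Harry Potter", "Harry", "Potter", "the Boy who Lived"]
# make_doc only tokenizes, which is all the PhraseMatcher needs for LOWER
matcher.add("HARRY_POTTER", [nlp.make_doc(phrase) for phrase in phrases])

doc = nlp("Harry Potter is the Boy who Lived. The Boy who Lived is Harry.")

# as_spans=True returns Span objects labelled with the match key instead of tuples
spans = matcher(doc, as_spans=True)

# filter_spans removes overlapping spans, preferring the longest one
longest = filter_spans(spans)
for span in longest:
    print(span.label_, span.text)
```

With this filtering, Harry Potter should be kept as a single match while the shorter overlapping matches for Harry and Potter are dropped. The standalone Harry at the end of the text is unaffected because it overlaps nothing.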

2.3.3. Adding a Function with on_match

In production, it is rarely practical to paste a for loop into your code each time you want to iterate over the results. Usually, you want to automate certain tasks so that when a match is found, some event occurs in your code. The PhraseMatcher allows you to pass an extra keyword argument when you add your patterns: on_match. This argument takes a callback function, which the PhraseMatcher calls once for each match with four arguments: matcher (the PhraseMatcher), doc (the Doc container that the PhraseMatcher just passed over), i (the index of the current match), and matches (the full list of matches found in the Doc).

Let’s create a basic function that will print off the current match, its label, and the sentence in which it was found.

def on_match(matcher, doc, i, matches):
    # The callback fires once per match, so handle only the match at index i
    match_id, start, end = matches[i]
    print(nlp.vocab.strings[match_id], doc[start:end])
    print(f"Sentence: {doc[start].sent}")

Just as before, we will create our PhraseMatcher.

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

This time, however, when we add our patterns to the PhraseMatcher, we will also add the keyword argument on_match that will point to the above function.

matcher.add("HARRY_POTTER", [nlp("Harry Potter")], on_match=on_match)

All that is left to do is create the Doc container from the text and run the PhraseMatcher over it.

doc = nlp(text)
matches = matcher(doc)
HARRY_POTTER Harry Potter
Sentence: Harry Potter was the main character in the book.

The Matcher accepts the same on_match keyword argument as the PhraseMatcher.
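As a brief illustration of the same callback mechanism on the Matcher, the sketch below collects matches into a list instead of printing them. The list found, the function name collect_match, and the sample text are hypothetical names chosen for this example, and a blank English pipeline is assumed.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
found = []  # hypothetical container that the callback fills in


def collect_match(matcher, doc, i, matches):
    # spaCy calls this once per match; i is the index of the current match
    match_id, start, end = matches[i]
    found.append((doc.vocab.strings[match_id], doc[start:end].text))


matcher = Matcher(nlp.vocab)
# Token-level pattern: "harry" followed by "potter", compared in lowercase
pattern = [{"LOWER": "harry"}, {"LOWER": "potter"}]
matcher.add("HARRY_POTTER", [pattern], on_match=collect_match)

doc = nlp("Harry Potter went to Hogwarts. Everyone there knew harry potter.")
matcher(doc)
print(found)
```

Collecting results in the callback, rather than printing them, keeps the matching logic in one place and leaves the caller free to decide what to do with the matches afterwards.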