2.2. The Matcher
2.2.1. Introduction
The built-in spaCy Matcher is another way to use linguistic annotations to find and extract structured data from unstructured text. Unlike the EntityRuler that we met in the previous section, the Matcher does not write what it finds to an attribute of the Doc, such as doc.ents. Nor does it sit inside a spaCy pipeline as a component. Instead, it is created separately and run over a spaCy Doc container, returning its matches directly.
2.2.2. A Basic Example
To understand how this works, let’s first look at a very basic example of the spaCy Matcher in practice. We will be working with the small spaCy English pipeline in this example, so let’s import the Matcher class and load the pipeline.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
Now that the pipeline is loaded into memory, we can start working with the Matcher. To create one, we use the Matcher class that we imported above. It takes one required argument: the vocabulary of the nlp pipeline, which we can access with nlp.vocab. The Matcher must share this vocabulary with the Doc containers it will run over.
matcher = Matcher(nlp.vocab)
With our Matcher created, it is time to write a basic pattern. The pattern that we want to find and extract from our texts is an email address. As we will learn below, there are many token attributes you can use to create powerful rules-based pattern matchers in spaCy, much as with the EntityRuler we learned about in the previous section. One such attribute is LIKE_EMAIL, a Boolean (True or False) that indicates whether a token looks like an email address. A pattern is a list of dictionaries, one for each token to be matched.
pattern = [
    {"LIKE_EMAIL": True}
]
As with the EntityRuler, we can add these patterns to the Matcher. Unlike the EntityRuler, we use the .add method rather than .add_patterns. This method takes two positional arguments: first, the label that you want to assign to the matched token(s), and second, a list of patterns. Because each pattern is itself a list of token dictionaries, our single pattern must be wrapped in another list.
matcher.add("EMAIL_ADDRESS", [pattern])
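Because .add takes a list of patterns, one label can cover several alternatives. The following is a minimal sketch of that idea, not part of the example above: it assumes a blank English pipeline (no trained model) and a hypothetical CONTACT label, and it pairs LIKE_EMAIL with the LIKE_URL attribute.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline is enough here: LIKE_EMAIL and LIKE_URL are
# computed from the token text, not predicted by a trained model.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

email_pattern = [{"LIKE_EMAIL": True}]
url_pattern = [{"LIKE_URL": True}]

# One label (hypothetical name), two alternative patterns:
# a hit on either pattern is reported under "CONTACT".
matcher.add("CONTACT", [email_pattern, url_pattern])

doc = nlp("Write to wmattingly@aol.com or visit https://spacy.io for details.")
contacts = [doc[start:end].text for _, start, end in matcher(doc)]
print(contacts)
```

Both the email address and the URL should be reported under the same label, which is convenient when several surface forms represent the same kind of information.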
With patterns added to the Matcher, we can run it over some text. Because the Matcher works on tokens and their attributes rather than on raw strings, that text must first be converted into a spaCy Doc container. We can create that Doc container just as we normally would.
doc = nlp("This is an email address: wmattingly@aol.com")
We can then pass that Doc container to the Matcher as a single argument.
matches = matcher(doc)
Let’s print the matches object now.
print(matches)
[(16571425990740197027, 6, 7)]
This may not be what you were expecting, so let’s explore what is going on with this output. The output is a list with a single item. Each item is a tuple with three parts: the match ID (called lexeme in the code below), the start token index, and the end token index. The end index is exclusive, so (6, 7) covers only the token at index 6. The match ID is the hash of our label in nlp.vocab. We can recover the label name by indexing nlp.vocab at that hash and then accessing the raw text with .text.
Let’s do each of these steps in turn. First, we will create an object called first_match. This will be our first hit with the Matcher.
first_match = matches[0]
print(first_match)
(16571425990740197027, 6, 7)
Next, we will grab the lexeme value, which is the first item in our first_match.
lexeme = first_match[0]
print(lexeme)
16571425990740197027
Finally, we will print off the raw text of the label by indexing the nlp.vocab at that specific lexeme numerical value.
print(nlp.vocab[lexeme].text)
EMAIL_ADDRESS
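The same lookup can be written a little more compactly. The sketch below assumes a blank English pipeline in place of en_core_web_sm (LIKE_EMAIL does not need a trained model). It resolves the hash through nlp.vocab.strings, slices the Doc with the start and end indices, and then uses the as_spans=True option, which asks the Matcher to return Span objects instead of tuples.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: blank pipeline instead of en_core_web_sm
matcher = Matcher(nlp.vocab)
matcher.add("EMAIL_ADDRESS", [[{"LIKE_EMAIL": True}]])
doc = nlp("This is an email address: wmattingly@aol.com")

# Default output: (match_id, start, end) tuples; the end index is exclusive.
for match_id, start, end in matcher(doc):
    label = nlp.vocab.strings[match_id]  # hash -> original label string
    span = doc[start:end]                # slice the Doc to get the matched Span
    print(label, start, end, span.text)

# as_spans=True returns Span objects whose label_ is the match label.
spans = matcher(doc, as_spans=True)
print([(span.label_, span.text) for span in spans])
```

Working with Span objects saves the manual slicing step when the matched text itself is what you need.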
This is the basic idea behind the Matcher. It is a way of finding structured text using a pattern-based matching technique. As with the EntityRuler, there are a lot of token attributes we can leverage.
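2.2.3. Attributes Taken by Matcher
ORTH - The exact verbatim text of a token (str)
TEXT - The exact verbatim text of a token (str)
LOWER - The lowercase form of the token text (str)
LENGTH - The length of the token text (int)
IS_ALPHA - The token text consists of alphabetic characters (bool)
IS_ASCII - The token text consists of ASCII characters (bool)
IS_DIGIT - The token text consists of digits (bool)
IS_LOWER - The token text is lowercase (bool)
IS_UPPER - The token text is uppercase (bool)
IS_TITLE - The token text is titlecase (bool)
IS_PUNCT - The token is punctuation (bool)
IS_SPACE - The token is whitespace (bool)
IS_STOP - The token is a stop word (bool)
IS_SENT_START - The token starts a sentence (bool)
LIKE_NUM - The token text resembles a number (bool)
LIKE_URL - The token text resembles a URL (bool)
LIKE_EMAIL - The token text resembles an email address (bool)
SPACY - The token has a trailing space (bool)
POS - The token’s coarse-grained part-of-speech tag (str)
TAG - The token’s fine-grained part-of-speech tag (str)
MORPH - The token’s morphological analysis (str)
DEP - The token’s dependency label (str)
LEMMA - The token’s lemma (str)
SHAPE - The token’s shape (str)
ENT_TYPE - The token’s entity label (str)
_ - Custom extension attributes (Dict[str, Any])
OP - An operator or quantifier that sets how often the token pattern may match (str)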
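To see a few of these attributes working together, here is a small sketch. The sentence, the two labels, and the use of a blank English pipeline are assumptions made for illustration. It combines a string attribute (LOWER) with Boolean ones (IS_DIGIT, IS_TITLE), and it shows that several keys in one dictionary must all hold for the same token.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # these attributes come from the token text alone
matcher = Matcher(nlp.vocab)

# LOWER compares the lowercased token text; IS_DIGIT is a Boolean flag.
matcher.add("CHAPTER_REF", [[{"LOWER": "chapter"}, {"IS_DIGIT": True}]])

# Two keys in one dictionary: both must be true of the same token.
matcher.add(
    "TITLECASE_CHAPTER_REF",
    [[{"LOWER": "chapter", "IS_TITLE": True}, {"IS_DIGIT": True}]],
)

doc = nlp("See Chapter 12 and chapter 3, not the chapter summary.")

# Group the matched text by label name.
found = {}
for match_id, start, end in matcher(doc):
    label = nlp.vocab.strings[match_id]
    found.setdefault(label, []).append(doc[start:end].text)
print(found)
```

The first rule should pick up both chapter references regardless of case, while the second should keep only the titlecase one. The final "chapter" is skipped by both because the token after it is not a digit.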
2.2.4. Applied Matcher
Let’s now take a look at a practical application of the Matcher. We will work with the following text:
text = """
Harry Potter was the main character in the book.
Harry was a normal boy who discovered he was a wizard.
Ultimately, Potter goes to Hogwarts.
He is also known as the Boy who Lived.
The Boy who Lived has an enemy named Voldemorte who is known as He who Must not be Named.
"""
Our goal in this exercise is to isolate all proper nouns and extract them. Ideally, we would like to extract proper nouns that are bigrams or trigrams and keep them intact.
To do this, we will need to load up the spaCy small English pipeline.
nlp = spacy.load("en_core_web_sm")
2.2.4.1. Grabbing all Proper Nouns
Our initial pattern will be quite simple. We will grab every single token whose part-of-speech (POS) tag is “PROPN”, or proper noun.
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN"}]
matcher.add("PROPER_NOUNS", [pattern])
doc = nlp(text)
matches = matcher(doc)
print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])
6
(3232560085755078826, 1, 2) Harry
(3232560085755078826, 2, 3) Potter
(3232560085755078826, 12, 13) Harry
(3232560085755078826, 27, 28) Potter
(3232560085755078826, 30, 31) Hogwarts
(3232560085755078826, 52, 53) Voldemorte
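Once the matches are in hand, a common next step is to tally them. The sketch below is one way to do that with collections.Counter. To keep it runnable without a trained pipeline, it builds a small Doc by hand, so the words and POS tags are assumptions for illustration and not output from en_core_web_sm.

```python
from collections import Counter

from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-built Doc with hand-assigned POS tags (illustrative, not model output).
vocab = Vocab()
words = ["Harry", "met", "Ron", ".", "Harry", "left", "Hogwarts", "."]
pos = ["PROPN", "VERB", "PROPN", "PUNCT", "PROPN", "VERB", "PROPN", "PUNCT"]
doc = Doc(vocab, words=words, pos=pos)

matcher = Matcher(vocab)
matcher.add("PROPER_NOUNS", [[{"POS": "PROPN"}]])

# Count how often each matched proper noun occurs.
counts = Counter(doc[start:end].text for _, start, end in matcher(doc))
print(counts.most_common())
```

With a real pipeline, the same Counter line can be applied to the matches produced from the text above.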
2.2.4.2. Improving it with Multi-Word Tokens
While the above output is good, it does not capture multi-word proper nouns, such as Harry Potter. Ideally, we would like to find and extract these instances. We can do this by adding a second key to the token dictionary in our pattern, after POS: the key OP, set to +. This operator matches one or more consecutive tokens that fit the dictionary, so the pattern will find any sequence of proper nouns.
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}]
matcher.add("PROPER_NOUNS", [pattern])
doc = nlp(text)
matches = matcher(doc)
print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])
7
(3232560085755078826, 1, 2) Harry
(3232560085755078826, 1, 3) Harry Potter
(3232560085755078826, 2, 3) Potter
(3232560085755078826, 12, 13) Harry
(3232560085755078826, 27, 28) Potter
(3232560085755078826, 30, 31) Hogwarts
(3232560085755078826, 52, 53) Voldemorte
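The + operator is one of several that OP accepts; others include ? (match zero or one time), * (match zero or more times), and ! (match zero times, that is, negate the token pattern). As a hedged illustration, the sketch below compares a pattern with no operator, one with +, and one with ? on a hand-built Doc. The words, the POS tags, the DEMO label, and the matched_texts helper are all assumptions for the example.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
doc = Doc(
    vocab,
    words=["Harry", "Potter", "goes", "to", "Hogwarts"],
    pos=["PROPN", "PROPN", "VERB", "ADP", "PROPN"],  # hand-assigned tags
)


def matched_texts(pattern):
    """Hypothetical helper: the set of span texts one pattern matches in doc."""
    matcher = Matcher(vocab)
    matcher.add("DEMO", [pattern])
    return {doc[start:end].text for _, start, end in matcher(doc)}


# No operator: exactly one proper noun per match.
single = matched_texts([{"POS": "PROPN"}])

# "+": one or more proper nouns, including every overlapping candidate.
one_or_more = matched_texts([{"POS": "PROPN", "OP": "+"}])

# "?": an optional proper noun, then a required proper noun and a verb.
optional_first = matched_texts(
    [{"POS": "PROPN", "OP": "?"}, {"POS": "PROPN"}, {"POS": "VERB"}]
)

print(single)
print(one_or_more)
print(optional_first)
```

Comparing the three sets makes it easier to see which spans each operator adds before applying it to a longer text.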
2.2.4.3. Greedy Keyword Argument
As we can see, this has worked pretty well, but we have a key issue. Harry is grabbed, as is Harry Potter, as is Potter. All three matches come from a single multi-word name: Harry Potter. By default, the Matcher returns every span that satisfies the pattern, including overlapping ones.
The .add method can also take another keyword argument: greedy. It accepts one of two values: FIRST or LONGEST. Each defines how spaCy behaves when it encounters matches that have overlapping spans. FIRST keeps the first hit among overlapping spans, while LONGEST keeps only the longest. Let’s consider the same example above.
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}]
matcher.add("PROPER_NOUNS", [pattern], greedy='LONGEST')
doc = nlp(text)
matches = matcher(doc)
print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])
5
(3232560085755078826, 1, 3) Harry Potter
(3232560085755078826, 12, 13) Harry
(3232560085755078826, 27, 28) Potter
(3232560085755078826, 30, 31) Hogwarts
(3232560085755078826, 52, 53) Voldemorte
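To see the effect of greedy side by side, the following sketch runs the same pattern with and without greedy="LONGEST" on a small hand-built Doc. The sentence, the POS tags, and the run helper are assumptions for illustration. Only LONGEST is shown, and the matches are sorted by position so the two results are easy to compare.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
doc = Doc(
    vocab,
    words=["Harry", "Potter", "visited", "Albus", "Dumbledore", "yesterday"],
    pos=["PROPN", "PROPN", "VERB", "PROPN", "PROPN", "NOUN"],  # hand-assigned
)
pattern = [{"POS": "PROPN", "OP": "+"}]


def run(greedy=None):
    """Hypothetical helper: match with an optional greedy filter, in text order."""
    matcher = Matcher(vocab)
    matcher.add("PROPER_NOUNS", [pattern], greedy=greedy)
    # Sort by (start, end) so the output follows the order of the text.
    matches = sorted(matcher(doc), key=lambda m: (m[1], m[2]))
    return [doc[start:end].text for _, start, end in matches]


all_candidates = run()          # every overlapping candidate
longest_only = run("LONGEST")   # overlapping candidates reduced to the longest
print(all_candidates)
print(longest_only)
```

Without the filter, each two-word name produces three matches; with LONGEST, only the full names should remain.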
2.2.4.4. Adding in Sequences
Let’s say I wanted not only to extract these types of instances but also to look for more robust patterns. What if I wanted to find places in the text where Harry Potter is followed by a verb of action (not a form of “to be”)? We can do this by adding a second token dictionary to our pattern, one whose POS is VERB. Forms of “to be” are typically tagged AUX rather than VERB, so they will not match.
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}, {"POS": "VERB"}]
matcher.add("PROPER_NOUNS", [pattern], greedy='LONGEST')
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1])
print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])
1
(3232560085755078826, 27, 29) Potter goes
And just like this, we were able to extract an instance of Harry Potter’s name followed by a verb of action, in our case goes. While this example is quite simple, the Matcher allows for robust pattern matching with spaCy containers. The patterns that we looked at here can also be used with the EntityRuler.
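As a brief illustration of that last point, the sketch below reuses the LIKE_EMAIL pattern from the start of this section inside an EntityRuler. It assumes spaCy v3 and a blank English pipeline, and the label name is carried over from the earlier example. Here the result ends up in doc.ents and is not returned as a list of match tuples.

```python
import spacy

nlp = spacy.blank("en")  # assumption: blank pipeline, no trained components
ruler = nlp.add_pipe("entity_ruler")

# Same token pattern as the Matcher example, wrapped in the
# EntityRuler's {"label": ..., "pattern": ...} format.
ruler.add_patterns([
    {"label": "EMAIL_ADDRESS", "pattern": [{"LIKE_EMAIL": True}]},
])

doc = nlp("This is an email address: wmattingly@aol.com")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

The choice between the two comes down to where you want the result: in the Doc itself as entities, or in a separate list of matches that leaves the Doc untouched.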