1.2. Getting Started with spaCy and its Linguistic Annotations#

The goals of this section are twofold. First, it is my hope that you understand the basic spaCy syntax for creating a Doc container and how to call specific attributes of that container. Second, it is my hope that you leave this section with a basic understanding of the vast linguistic annotations available in spaCy. While we will not explore all of the attributes, we will deal with many of the most important ones, such as lemmas, parts of speech, and named entities. By the time you are finished with this chapter, you should have a solid enough understanding of spaCy to begin applying it to your own texts.

1.2.1. Importing spaCy and Loading Data#

As with all Python libraries, the first thing we need to do is import spaCy. In the last notebook, I walked you through how to install it and download the small English model. If you have followed those steps, you should be able to import it like so:

import spacy

With spaCy imported, we can now create our nlp object. This is the standard Pythonic way to create your model in a Python script. Unless you are working with multiple models in a script, it is best to always name your model nlp; it will make your script much easier to read. To do this, we will use spacy.load(). This command tells spaCy to load up a model. In order to know which model to load, it needs a string argument that corresponds to the model name. Since we will be working with the small English model, we will use “en_core_web_sm”. This function can also take keyword arguments to identify which parts of the model you want to load, but we will get to that later. For now, we want to import the whole thing.

nlp = spacy.load("en_core_web_sm")
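As a quick preview of those keyword arguments, spacy.load() accepts a disable parameter that lets you skip pipeline components you do not need, which speeds up loading and processing. A minimal example (the variable name nlp_light is just for illustration):

# Load the model without the parser and NER components;
# useful when you only need the tokenizer and tagger.
nlp_light = spacy.load("en_core_web_sm", disable=["parser", "ner"])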

Great! With the model loaded, let’s go ahead and import our text. For this chapter, we will be working with the opening description from the Wikipedia article on the United States. In this repo, it is found in the subfolder data and is entitled wiki_us.txt.

with open("../data/wiki_us.txt", "r") as f:
    text = f.read()

Now, let’s see what this text looks like. It can be a bit difficult to read in a JupyterBook, but notice the horizontal slider below. You don’t need to read this in its entirety.

print(text)
The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country located in North America. It consists of 50 states, a federal district, five major unincorporated territories, nine Minor Outlying Islands, and 326 Indian reservations. It is the third-largest country by both land and total area. The United States shares land borders with Canada to its north and with Mexico to its south. It has maritime borders with the Bahamas, Cuba, Russia, and other nations. With a population of over 331 million, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city and financial center is New York City.

1.2.2. Creating a Doc Container#

With the data loaded in, it’s time to make our first Doc container. Unless you are working with multiple Doc containers, it is best practice to always call this object “doc”, all lowercase. To create a doc container, we will usually just call our nlp object and pass our text to it as a single argument.

doc = nlp(text)

Great! Let’s see what this looks like.

print (doc)
The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country located in North America. It consists of 50 states, a federal district, five major unincorporated territories, nine Minor Outlying Islands, and 326 Indian reservations. It is the third-largest country by both land and total area. The United States shares land borders with Canada to its north and with Mexico to its south. It has maritime borders with the Bahamas, Cuba, Russia, and other nations. With a population of over 331 million, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city and financial center is New York City.

If you are trying to spot the difference between this and the text above, good luck. You will not see a difference when printing off the doc container. But I promise you, it is quite different behind the scenes. The Doc container, unlike the text object, contains a lot of valuable metadata, or attributes, hidden behind it. To prove this, let’s examine the length of the doc object and the text object.

print (len(doc))
print (len(text))
146
716

What’s going on here? It is the same text, but the lengths are different. Why does this occur? To answer that, let’s explore it more deeply and try to print off each item in each object.

for token in text[:10]:
    print (token)
T
h
e
 
U
n
i
t
e
d

As we would expect, we have printed off each character, including white spaces. Let’s try to do the same with the Doc container.

for token in doc[:10]:
    print (token)
The
United
States
of
America
(
U.S.A.
or
USA
)

And now we see the magical difference. While on the surface it may seem that the Doc container’s length depends on the number of words, look more closely. You should notice that the open and close parentheses each count as an item in the container. These items are known as tokens. Tokens are the fundamental building block of spaCy, or of any NLP framework: self-contained pieces of a sentence that serve a syntactic purpose, such as words and punctuation marks. A good example is the contraction “don’t” in English. When the text is tokenized, that is, converted into tokens, “don’t” becomes two tokens, “do” and “n’t”, because the contraction represents two words, “do” and “not”.
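A quick way to see this for yourself; the tokens shown in the comment are what the small English model produces:

# Tokenize a sentence containing a contraction.
contraction_doc = nlp("I don't think so.")
print([token.text for token in contraction_doc])
# ['I', 'do', "n't", 'think', 'so', '.']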

On the surface, this may not seem exceptional. But it is. You may be thinking to yourself that you could easily use the split method in Python to split by whitespace and have the same result. But you’d be wrong. Let’s see why.

for token in text.split()[:10]:
    print (token)
The
United
States
of
America
(U.S.A.
or
USA),
commonly
known

Notice that the parentheses are not removed or handled individually. To see this more clearly, let’s print off tokens 5 through 7 in both the doc object and the split text.

words = text.split()[:10]
i=5
for token in doc[i:8]:
    print (f"SpaCy Token {i}:\n{token}\nWord Split {i}:\n{words[i]}\n\n")
    i=i+1
SpaCy Token 5:
(
Word Split 5:
(U.S.A.


SpaCy Token 6:
U.S.A.
Word Split 6:
or


SpaCy Token 7:
or
Word Split 7:
USA),

We can see clearly now how the spaCy Doc container does much more with its tokenization than a simple split method. We could, surely, write complex rules for a language to achieve the same results, but why bother? SpaCy does this exceptionally well for the many languages it supports. In my entire time using spaCy, I have never seen the tokenizer make a mistake. I am sure mistakes do occur, but they are probably rare exceptions.

Let’s see what else this Doc Container holds.

1.2.3. Sentence Boundary Detection (SBD)#

In NLP, sentence boundary detection, or SBD, is the identification of sentences in a text. Again, this may seem fairly easy to do with rules. One could use split("."), but in English we also use the period to denote abbreviations. You could, again, write rules to look for periods that are not followed by a lowercase word, but again I ask the question, “why bother?” With spaCy, we can have all sentences fully separated through SBD in seconds.

To access the sentences in the Doc container, we can use the attribute sents, like so:

for sent in doc.sents:
    print (sent)
The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, nine Minor Outlying Islands, and 326 Indian reservations.
It is the third-largest country by both land and total area.
The United States shares land borders with Canada to its north and with Mexico to its south.
It has maritime borders with the Bahamas, Cuba, Russia, and other nations.
With a population of over 331 million, it is the third most populous country in the world.
The national capital is Washington, D.C., and the most populous city and financial center is New York City.

Notice that we iterated over the sentences rather than indexing into them. That is because the sents attribute is a generator that yields Span containers. Generators are beyond the scope of this textbook, but in short, generators yield results from functions rather than return them. While you can iterate over a generator like a list, they are not the same thing: a generator does not populate all of its results in memory at once, so you cannot index it directly, and something like doc.sents[0] would raise an error. Instead, we can use next to grab the first sentence.

sentence1 = next(doc.sents)
print(sentence1)
The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country located in North America.
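If you need random access to the sentences, for example to grab one by index, you can materialize the generator into a list first; a small sketch:

# Converting the generator to a list loads every sentence into memory,
# which is fine for a short text like ours.
sentences = list(doc.sents)
print(len(sentences))   # 7 sentences in our text
print(sentences[0])     # the same first sentence as above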

Now we have the first sentence. With this smaller text in hand, let’s explore spaCy’s other building block, the token.

1.2.4. Token Attributes#

The token object contains a lot of different attributes that are VITAL to performing NLP in spaCy. We will be working with a few of them, such as:

| Name | Description |
| --- | --- |
| sent | the sentence to which the token belongs |
| text | the raw text of the token |
| head | the parent of the token |
| left_edge | the left edge of the token’s descendants |
| right_edge | the right edge of the token’s descendants |
| ent_type_ | the label of the token, if it is an entity |
| lemma_ | the lemmatized form of the token |
| morph | the morphology of the token |
| pos_ | the part of speech of the token |
| dep_ | the dependency relation of the token |
| lang_ | the language of the parent document |

I will briefly describe these here and show you how to grab each one and what it looks like. We will be exploring each of these attributes more deeply in this chapter and future chapters. To demonstrate each of these attributes, we will use one token, “States”, which is part of the sequence of tokens that make up “The United States of America”.

token2 = sentence1[2]
print (token2)
States

1.2.4.1. Text#

Verbatim text content. -spaCy docs

token2.text
'States'
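
1.2.4.2. Head#

The syntactic parent, or “governor”, of this token. -spaCy docs

token2.head
is

For “States”, the head is “is”, the root of the sentence, as the dependency parse later in this chapter will confirm.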

1.2.4.3. Left Edge#

The leftmost token of this token’s syntactic descendants. -spaCy docs

token2.left_edge
The

If a token is part of a sequence of tokens that are collectively meaningful, known as a multi-word token, this attribute will tell us where the multi-word token begins.

1.2.4.4. Right Edge#

The rightmost token of this token’s syntactic descendants. -spaCy docs

token2.right_edge
,

This will tell us where the multi-word token ends.
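Because left_edge and right_edge are themselves tokens, we can use their .i attribute (each token’s index in the document) to slice the full phrase out of the doc; a minimal sketch:

# Slice from the leftmost to the rightmost descendant of token2.
subtree = doc[token2.left_edge.i : token2.right_edge.i + 1]
print(subtree)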

1.2.4.5. Entity Type#

Named entity type. -spaCy docs

token2.ent_type
384

Note the absence of the _ at the end of the attribute. Without it, spaCy returns an integer that corresponds to the entity type, whereas the _ version gives you the human-readable string equivalent, as shown below.

token2.ent_type_
'GPE'

We will learn all about the types of entities in our chapter on named entity recognition, or NER. For now, simply understand that GPE stands for geopolitical entity, and it is correct here.
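Under the hood, this integer is an ID in spaCy’s StringStore, so you can also convert it back to a string yourself:

# Look up the string that corresponds to the integer ID.
print(nlp.vocab.strings[token2.ent_type])
# GPE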

1.2.4.6. Ent IOB#

IOB code of named entity tag. “B” means the token begins an entity, “I” means it is inside an entity, “O” means it is outside an entity, and "" means no entity tag is set.

token2.ent_iob_
'I'

IOB is a method of annotating a text. In this case, we see “I” because “States” is inside an entity; that is to say, it is part of “the United States of America”.
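We can see the full IOB pattern by printing the codes for the first few tokens of the sentence; the expected output is shown in the comments:

for token in sentence1[:6]:
    print(token.text, token.ent_iob_, token.ent_type_)
# The B GPE
# United I GPE
# States I GPE
# of I GPE
# America I GPE
# ( O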

1.2.4.7. Lemma#

Base form of the token, with no inflectional suffixes. -spaCy docs

token2.lemma_
'States'

Since “States” is a proper noun, its lemma is simply itself. Lemmatization is more interesting for an inflected form such as “known”, token 12 of our sentence:

sentence1[12].lemma_
'know'

1.2.4.8. Morph#

Morphological analysis -spaCy docs

sentence1[12].morph
Aspect=Perf|Tense=Past|VerbForm=Part
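The morph attribute returns a MorphAnalysis object; individual features can be pulled out with its get method:

# Retrieve a single morphological feature; returns a list of values.
print(sentence1[12].morph.get("Tense"))
# ['Past']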

1.2.4.9. Part of Speech#

Coarse-grained part-of-speech from the Universal POS tag set. -spaCy docs

token2.pos_
'PROPN'

1.2.4.10. Syntactic Dependency#

Syntactic dependency relation. -spaCy docs

token2.dep_
'nsubj'

1.2.4.11. Language#

Language of the parent document’s vocabulary. -spaCy docs

token2.lang_
'en'

1.2.5. Part of Speech Tagging (POS)#

In the field of computational linguistics, understanding parts of speech is essential. SpaCy offers an easy way to parse a text and identify its parts of speech. Below, we will iterate over each token (word or punctuation) in our first sentence and print its text, its part of speech, and its syntactic dependency.

for token in sentence1:
    print (token.text, token.pos_, token.dep_)
The DET det
United PROPN compound
States PROPN nsubj
of ADP prep
America PROPN pobj
( PUNCT punct
U.S.A. PROPN appos
or CCONJ cc
USA PROPN conj
) PUNCT punct
, PUNCT punct
commonly ADV advmod
known VERB acl
as ADP prep
the DET det
United PROPN compound
States PROPN pobj
( PUNCT punct
U.S. PROPN appos
or CCONJ cc
US PROPN conj
) PUNCT punct
or CCONJ cc
America PROPN conj
, PUNCT punct
is AUX ROOT
a DET det
country NOUN attr
located VERB acl
in ADP prep
North PROPN compound
America PROPN pobj
. PUNCT punct
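
If any of the labels in this output are unfamiliar, you can ask spaCy to decode them. spacy.explain is part of spaCy’s public API and returns a short gloss for a tag:

print(spacy.explain("PROPN"))
# proper noun
print(spacy.explain("nsubj"))
# nominal subject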

In the output above, we can see three vital pieces of information about each token: the string, the corresponding part of speech (pos), and the syntactic dependency. For a complete list of the POS labels, see the spaCy documentation (https://spacy.io/api/annotation#pos-tagging). Most of these, however, should be apparent, i.e. PROPN is proper noun, AUX is an auxiliary verb, ADJ is adjective, etc. We can visualize this sentence with a diagram through spaCy’s displaCy feature. To make this easier to read in a book, we will use a different, smaller sentence.

from spacy import displacy
sample_doc = nlp("The dog ran over the fence.")
displacy.render(sample_doc, style="dep", options={"compact":True})
(Here displaCy renders a dependency-parse diagram of the sentence: each token is labeled with its part of speech, and labeled arcs mark the det, nsubj, prep, det, and pobj relations.)

1.2.6. Named Entity Recognition#

Another essential task of NLP is named entity recognition, or NER. I spoke about NER in the last notebook. Here, I’d like to demonstrate how to perform basic NER via spaCy. Again, we will iterate over the doc object as we did above, but instead of iterating over doc.sents, we will iterate over doc.ents. For our purposes right now, I simply want to print off each entity’s text (the string itself) and its corresponding label (note the _ after label). I will be explaining this process in much greater detail in the next two notebooks.

for ent in doc.ents:
    print (ent.text, ent.label_)
The United States of America GPE
U.S.A. GPE
USA GPE
the United States GPE
U.S. GPE
US GPE
America GPE
North America LOC
50 CARDINAL
five CARDINAL
nine CARDINAL
Minor Outlying Islands PERSON
326 CARDINAL
Indian NORP
third ORDINAL
The United States GPE
Canada GPE
Mexico GPE
Bahamas GPE
Cuba GPE
Russia GPE
over 331 million MONEY
third ORDINAL
Washington GPE
D.C. GPE
New York City GPE

Sometimes it can be difficult to read this output as raw data. Notice, too, that the model is not perfect: it tags “Minor Outlying Islands” as a PERSON and “over 331 million” as MONEY rather than CARDINAL. To read the results in context, we can again leverage spaCy’s displaCy feature. Notice that this time we are altering the keyword argument style with the string “ent”. This tells displaCy to display the text with its NER annotations.

displacy.render(doc, style="ent")
The United States of America GPE ( U.S.A. GPE or USA GPE ), commonly known as the United States GPE ( U.S. GPE or US GPE ) or America GPE , is a country located in North America LOC . It consists of 50 CARDINAL states, a federal district, five CARDINAL major unincorporated territories, nine CARDINAL Minor Outlying Islands PERSON , and 326 CARDINAL Indian NORP reservations. It is the third ORDINAL -largest country by both land and total area. The United States GPE shares land borders with Canada GPE to its north and with Mexico GPE to its south. It has maritime borders with the Bahamas GPE , Cuba GPE , Russia GPE , and other nations. With a population of over 331 million MONEY , it is the third ORDINAL most populous country in the world. The national capital is Washington GPE , D.C. GPE , and the most populous city and financial center is New York City GPE .
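Finally, because ent.label_ is just a string, filtering the entities by type is straightforward; for example, to collect only the geopolitical entities (a minimal sketch):

# Keep only the entities tagged as geopolitical entities (GPE).
gpes = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
print(gpes)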

1.2.7. Conclusion#

I recommend spending a little bit of time going through this notebook a few times. The information covered throughout this notebook will be reinforced as we explore each of these areas in more depth with real-world examples of how to implement them to tackle different problems.