3.3. Creating Rules-Based Pipeline for Holocaust Documents#

In this section, we will walk through how to develop a complete heuristic spaCy pipeline for performing named entity recognition. We will leverage a combination of EntityRuler pipes and custom spaCy pipes that use RegEx to find longer, multi-token matches.

3.3.1. Loading a spaCy Model#

The first thing we need to do is import all of the different components from spaCy and other libraries that we will need. I will explain these as we go forward.

import spacy
from spacy.util import filter_spans
from spacy.tokens import Span
from spacy.language import Language
import re
import pandas as pd
import unidecode
from spacy import displacy

We will be using Pandas in this notebook to work with a CSV file originally produced by the Holocaust Geographies Collaborative, headed by Anne Knowles, Tim Cole, Alberto Giordano, Paul Jaskot, and Anika Walke. We will be importing re, Python’s RegEx library, because many of our heuristics rely on capturing multi-word matches.

Now that we have imported everything, let’s load spaCy’s small English pipeline. Loading a pretrained pipeline gives us an existing ner pipe that we can place our custom, rules-based pipes before. As we work through this notebook, we will add pipes to it.

nlp = spacy.load("en_core_web_sm")

3.3.2. Creating EntityRulers#

In the first section of this chapter, we looked at acquiring a camps dataset from the USHMM website. We will now build off that section by normalizing our dataset to include camp names with and without accent marks.

ushmm_camps = ['Alderney', 'Amersfoort', 'Auschwitz', 'Banjica', 'Bełżec', 'Bergen-Belsen', 'Bernburg', 'Bogdanovka', 'Bolzano', 'Bor', 'Breendonk',
         'Breitenau', 'Buchenwald', 'Chełmno', 'Dachau', 'Drancy', 'Falstad', 'Flossenbürg', 'Fort VII', 'Fossoli', 'Grini', 'Gross-Rosen',
         'Herzogenbusch', 'Hinzert', 'Janowska', 'Jasenovac', 'Kaiserwald', 'Kaunas', 'Kemna', 'Klooga', 'Le Vernet', 'Majdanek', 'Malchow',
         'Maly Trostenets', 'Mechelen', 'Mittelbau-Dora', 'Natzweiler-Struthof', 'Neuengamme', 'Niederhagen', 'Oberer Kuhberg', 'Oranienburg',
         'Osthofen', 'Płaszów', 'Ravensbruck', 'Risiera di San Sabba', 'Sachsenhausen', 'Sajmište', 'Salaspils', 'Sobibór', 'Soldau', 'Stutthof',
         'Theresienstadt', 'Trawniki', 'Treblinka', 'Vaivara']

camp_patterns = []
for camp in ushmm_camps:
    # strip diacritics so that both spellings (e.g. Płaszów and Plaszow) can be matched
    normalized = unidecode.unidecode(camp)
    camp_patterns.append({"pattern": camp, "label": "CAMP"})
    # only add the normalized form when it actually differs from the original
    if normalized != camp:
        camp_patterns.append({"pattern": normalized, "label": "CAMP"})

Note that we are creating a list called camp_patterns. This is a list of pattern dictionaries in the format expected by a spaCy EntityRuler. With our patterns created, we can create our EntityRuler. Note that we are placing the ruler before the ner pipe in the model, so that our heuristic EntityRuler annotates the doc container before the NER model does. We are also giving it a custom name, camp_ruler, so that we know precisely what this ruler does when we examine the pipeline structure.

camp_ruler = nlp.add_pipe("entity_ruler", before="ner", name="camp_ruler")

With the ruler created, we can then populate it with the patterns.

camp_ruler.add_patterns(camp_patterns)
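Because we gave the ruler a custom name, we can verify its placement by examining the pipeline structure:

print(nlp.pipe_names)
# camp_ruler should appear immediately before ner, something like:
# ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'camp_ruler', 'ner']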

Now, it is time to test the pipeline to make sure it is working as intended.

doc = nlp("Auschwitz was a camp during WWII. Płaszów was also a camp. Another spelling is Plaszow.")
displacy.render(doc, style="ent", jupyter=True)
Auschwitz CAMP was a camp during WWII EVENT . Płaszów CAMP was also a camp. Another spelling is Plaszow CAMP .

Note that our pipeline now correctly identifies both spellings of these camps. This ensures that our pipeline can handle the different forms of these words that appear in real documents.

3.3.3. Creating Function for Matching RegEx#

As noted in the previous section, one of the main challenges with Holocaust-related data is the high degree of ambiguity in toponyms, words that have identical spellings but different classifications. We will see this issue with several of our classes. Many US ships, for example, bear the names of states. Additionally, named ghettos, such as the Warsaw Ghetto, can also function as a general place (GPE). If we want to use heuristics, therefore, we must create rather robust rules that have a high degree of accuracy. We can use the following function to do just that by leveraging RegEx.

We will be using this function quite a bit in our custom spaCy pipes, so let’s explore it in depth.

def regex_match(doc, pattern, label, filter=True,
                context=False, context_list=[],
                window_start=100, window_end=100):
    text = doc.text
    new_ents = []
    original_ents = list(doc.ents)
    for match in re.finditer(pattern, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        if context==True:
            # only keep the match if a context term appears within the window
            window = text[max(0, start-window_start):end+window_end]
            if any(term in window.lower() for term in context_list):
                if span is not None:
                    new_ents.append((span.start, span.end, span.text))
        else:
            if span is not None:
                new_ents.append((span.start, span.end, span.text))
    for ent in new_ents:
        start, end, name = ent
        new_ent = Span(doc, start, end, label=label)
        original_ents.append(new_ent)
    if filter==True:
        # remove overlapping spans, keeping the longest
        filtered = filter_spans(original_ents)
        final_ents = filtered
    else:
        final_ents = original_ents
    doc.ents = final_ents
    return doc

The purpose of this function is to leverage RegEx and spaCy together to perform full-text matching, which allows us to use the strengths of both techniques. With RegEx, we can perform complex string matching beyond an individual token. With spaCy, we can leverage linguistic knowledge about the text. This function lets us pass a doc container, a RegEx pattern, and a custom label, and then finds all items that match the pattern. It then converts those RegEx matches into spaCy Span containers and manually inserts them into the doc’s entities. Let’s break this down in the code.

The first line of our code is as follows:

def regex_match(doc, pattern, label, filter=True,
                context=False, context_list=[],
                window_start=100, window_end=100):

Here, we are creating a function that will take three mandatory arguments: doc, pattern, and label. The doc parameter will be the Doc container that is created by the spaCy nlp pipeline. The pattern will be the RegEx pattern that we want to use to perform a match. Finally, the label is the label that we want to assign to the items matched.

We also have several keyword arguments:

  • filter

  • context

  • context_list

  • window_start

  • window_end

These keyword arguments allow for greater versatility in the function. The filter argument allows a user to filter overlapping entities and keep only the longest ones. Remember, spaCy entities are hard classifications, meaning a token can only belong to a single class, or label. Filtering ensures that when we manually place our RegEx matches into the Doc container’s entities, there are no overlapping entities.

The context parameter is a Boolean that allows a user to pass a context_list to the function. This is a list of words whose presence increases the probability that an identified token belongs to a specific class. We will see this at play when we write rules for the labels GHETTO and SHIP. As noted above, ghettos frequently function as both GHETTO and GPE. The presence of certain words around a match, such as “ghetto”, radically increases the chance that our hit is a GHETTO and not a GPE. Likewise, with ships, many ships in the US fleet are named after states. Without this context check, we could mark New York, for example, as a SHIP rather than as a GPE.

The window_start and window_end parameters control the size of the context window, so the function only looks window_start characters backward and window_end characters forward for the presence of a word in the context_list. This is a simple heuristic, but one that proves quite effective.

The next bit of code is as follows:

    text = doc.text
    new_ents = []
    original_ents = list(doc.ents)

Here, we are grabbing the raw text of the Doc container. This will be necessary if the user has enabled context searching. Next, we create an empty list of new entities. This will be the list to which we append our new hits. Finally, we grab the original entities already found by earlier pipes. It is important to convert these to a list because doc.ents is stored as a tuple, which we cannot append to.
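We can verify this with the doc from the camp example above:

print(type(doc.ents))   # <class 'tuple'> – immutable, which is why we copy it into a list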

The next block of code looks like this:

    for match in re.finditer(pattern, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        if context==True:
            # only keep the match if a context term appears within the window
            window = text[max(0, start-window_start):end+window_end]
            if any(term in window.lower() for term in context_list):
                if span is not None:
                    new_ents.append((span.start, span.end, span.text))
        else:
            if span is not None:
                new_ents.append((span.start, span.end, span.text))

Here, we are using re.finditer() to find every place where our RegEx pattern matches in doc.text. For each match, we convert the character offsets into a spaCy Span with doc.char_span(), which returns None if the match does not align with token boundaries. If the user has enabled context searching, we also check whether any term in the context_list appears within the window they have established. If everything checks out, we append the span’s token boundaries and text to new_ents.
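The None check matters because doc.char_span() only returns a Span when the character offsets line up exactly with token boundaries. A quick illustration:

example = nlp("Berlinstrasse was busy.")
print(example.char_span(0, 6))    # None – "Berlin" falls inside the token "Berlinstrasse"
print(example.char_span(0, 13))   # Berlinstrasse – aligns with a full token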

Our final section of code reads:

    for ent in new_ents:
        start, end, name = ent
        new_ent = Span(doc, start, end, label=label)
        original_ents.append(new_ent)
    if filter==True:
        # remove overlapping spans, keeping the longest
        filtered = filter_spans(original_ents)
        final_ents = filtered
    else:
        final_ents = original_ents
    doc.ents = final_ents
    return doc

Here, we are iterating over each of our newly found matches. We then place the new_ent match into the original_ents list after assigning it the specified label. Finally, we filter the spans so that there are no overlapping entities.
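filter_spans keeps the longest span when spans overlap (and the first span when lengths are tied), which is exactly the behavior we want here. A small illustration:

d = nlp("The Warsaw Ghetto")
overlapping = [Span(d, 1, 2, label="GPE"), Span(d, 1, 3, label="GHETTO")]
print(filter_spans(overlapping))   # [Warsaw Ghetto] – the longer span wins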

Once complete, we reset the doc.ents to the final_ents and return the doc object back to the user.
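To see the whole function at work before we build pipes around it, we can call it directly on a doc. The sentence, pattern, and context list here are purely illustrative:

demo_doc = nlp("They boarded the General Hosey and sailed for three days.")
demo_doc = regex_match(demo_doc, r"(The|the) General Hosey", "SHIP",
                       context=True, context_list=["sail", "aboard", "boat"],
                       window_start=50, window_end=50)
# the General Hosey should now carry the SHIP label
print([(ent.text, ent.label_) for ent in demo_doc.ents])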

3.3.4. Adding a Pipe for Finding Streets#

Let’s examine this function in practice by trying to find and extract all streets in a text. Because our data comes from Europe, we need to account for the ways streets are written in multiple languages. Let’s look at the example below, where we use RegEx to find the ways a street may appear in German (and Dutch) as well as English. Unlike English, German street names usually contain the word for street (Straße) as part of a compound. English, on the other hand, has many different ways to render the concept of a street, in both abbreviated and full forms. RegEx is a perfect tool for working with this degree of variance.

nlp = spacy.load("en_core_web_sm")

german_streets = r"[A-Z][a-z]*(strasse|straße|straat)\b"
english_streets = r"([A-Z][a-z]* (Street|St|Boulevard|Blvd|Avenue|Ave|Road|Rd|Lane|Ln|Place|Pl)(\.)*)"
@Language.component("find_streets")
def find_streets(doc):
    doc = regex_match(doc, german_streets, "STREET")
    doc = regex_match(doc, english_streets, "STREET")
    return (doc)
nlp.add_pipe("find_streets", before="ner")

doc = nlp("Berlinstrasse was a famous street. In America, many towns have a Main Street. It can also be spelled Main St.")
displacy.render(doc, style="ent", jupyter=True)
Berlinstrasse STREET was a famous street. In America GPE , many towns have a Main Street. STREET It can also be spelled Main St. STREET

3.3.5. Creating a Pipe for Finding Ships#

Ships frequently appear in Holocaust documents. We could provide a list of ships to an EntityRuler, but finding one of these patterns isn’t enough; we want to ensure that the thing referenced is in fact a ship. Many of these names could easily be toponyms, entities that share the same spelling but mean different things in different contexts. The Ile de France, for example, could easily be a GPE that refers to the region around Paris, and General Hosey could easily be a PERSON; both are also the names of known ships. To ensure toponym disambiguation, I set up several contextual clues, e.g. the list of nautical terms. If any of these words appear in the area around the hit, then the heuristic assigns that token the label SHIP. If not, it ignores the hit and allows later pipes or the machine learning model to annotate it.

We can gather a list of known WWII ships from the WWII Database (https://ww2db.com/). One of the problems with the lists available there is that they are not UTF-8 encoded, which means the text is not normalized. Inside this repository is a CSV file with a normalized version of these ship names. This CSV file also stores metadata about the ships, namely their class, country of origin, and the year they were built.
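Assuming the same relative path used in the code below, we can peek at this file with pandas; the getter function that follows relies on its name, class, country, and year columns:

df = pd.read_csv("../data/wwii-ships.csv")
print(df.columns.tolist())
print(df.head())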

In spaCy, we can store this metadata inside the entity Span container with a custom extension. This is done via a function known as a getter function that maps the data to the custom extension. Let’s see how this works in practice. We will use a synthetic sentence that references the Activity, a British vessel, as well as New York, which is both a place and the name of a US ship active during WWII.

def ship_metadata_getter(ent):
    df = pd.read_csv("../data/wwii-ships.csv")
    if ent.label_ == "SHIP":
        ship_name = ent.text.replace("The", "").replace("the", "").strip()
        row = df.loc[df.name==ship_name]
        if len(row) > 0:
            row = row.iloc[0]
            return {"class": row["class"], "country": row.country, "year_built": row.year}
        else:
            return None
    else:
        return False
nlp = spacy.load("en_core_web_sm")
df = pd.read_csv("../data/wwii-ships.csv")

named_ships_pattern = f"(The|the) ({'|'.join(df.name.tolist())})"

@Language.component("named_ships")
def named_ships(doc):
    nautical = ["ship", "boat", "sail", "captain", "sea", "harbor", "aboard", "admiral", "liner"]
    doc = regex_match(doc, named_ships_pattern, "SHIP", context=True, context_list=nautical)
    for ent in doc.ents:
        if ent.label_ == "SHIP":
            ent.set_extension('ship_metadata', getter=ship_metadata_getter, force=True)
    return (doc)
nlp.add_pipe("named_ships", before="ner")
doc = nlp("The Activity set sail from New York")
for ent in doc.ents:
    print(ent.text, ent.label_, ent._.ship_metadata)
The Activity SHIP {'class': 'Activity-class Escort Carrier', 'country': 'United Kingdom', 'year_built': '1942'}
New York GPE False

There is a lot happening in this section of code, so let’s break down the different components here. The first function is our getter function: ship_metadata_getter().

def ship_metadata_getter(ent):
    df = pd.read_csv("../data/wwii-ships.csv")
    if ent.label_ == "SHIP":
        ship_name = ent.text.replace("The", "").replace("the", "").strip()
        row = df.loc[df.name==ship_name]
        if len(row) > 0:
            row = row.iloc[0]
            return {"class": row["class"], "country": row.country, "year_built": row.year}
        else:
            return None
    else:
        return False

This function takes a single argument, the entity identified from the RegEx match. Here, we strip out the “The” from the entity’s text and check to see whether the ship found appears in our dataset of known ships. If it does, we isolate the ship’s metadata in the dataset and return that data as a dictionary.

df = pd.read_csv("../data/wwii-ships.csv")

named_ships_pattern = f"(The|the) ({'|'.join(df.name.tolist())})"

@Language.component("named_ships")
def named_ships(doc):
    nautical = ["ship", "boat", "sail", "captain", "sea", "harbor", "aboard", "admiral", "liner"]
    doc = regex_match(doc, named_ships_pattern, "SHIP", context=True, context_list=nautical)
    for ent in doc.ents:
        if ent.label_ == "SHIP":
            ent.set_extension('ship_metadata', getter=ship_metadata_getter, force=True)
    return (doc)

In this next section of code, we load up the dataset to construct our pattern of names to find via our RegEx function. As we iterate over each entity in our Doc container, we check to see if that entity’s label is a SHIP. If it is, then we set the extension of ship_metadata via the getter function referenced above.

Finally, we add our pipe to the spaCy pipeline and test the model with our constructed sentence.

nlp.add_pipe("named_ships", before="ner")
doc = nlp("The Activity set sail from New York")
for ent in doc.ents:
    print(ent.text, ent.label_, ent._.ship_metadata)
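One small design note: set_extension() is a classmethod on Span, so calling it on an individual entity registers the attribute for all Spans anyway. An equivalent, slightly more conventional approach is to register the extension once, outside the component (Span is already imported at the top of the notebook):

Span.set_extension("ship_metadata", getter=ship_metadata_getter, force=True)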

3.3.6. Creating a Pipe for Identifying Military Personnel#

Military personnel are often referenced in English documents with their military rank. This means that we can use basic RegEx rules to find and extract those individuals. When we encounter a sequence of capitalized tokens after a military rank, we can make a fairly safe presumption that those tokens are the name associated with the rank. For this pipeline, we can again use the WWII Database to grab all known military ranks used during WWII for all countries.

nlp = spacy.load("en_core_web_sm")
with open("../data/military_ranks.txt", "r") as f:
    ranks = f.read().splitlines()
military_pattern = rf"({'|'.join(ranks)})((?=\s[A-Z])(?:\s[A-Z][a-z\.]+)+)"
@Language.component("find_military")
def find_military(doc):
    doc = regex_match(doc, military_pattern, "MILITARY")
    return (doc)
nlp.add_pipe("find_military", before="ner")

doc = nlp("Captain James T. Kirk commanded the ship.")
displacy.render(doc, style="ent", jupyter=True)
Captain James T. Kirk MILITARY commanded the ship.

3.3.7. Creating a Pipe for Identifying Spouses#

Oftentimes in historical documents, people are referenced collectively. In some instances, such as those of spouses, this results in the name of the woman being attached to the name of her husband. The purpose of the SPOUSAL entity is to identify such constructs so that users can manipulate the output and reconstruct each individual separately (a small post-processing sketch follows the example below).

nlp = spacy.load("en_core_web_sm")
spousal_pattern = r"((Mr|Mrs|Miss|Dr)(\.)* and (Mr|Mrs|Miss|Dr)(\.)*((?=\s[A-Z])(?:\s[A-Z][a-z\.]+)+))"
@Language.component("find_spousal")
def find_spousal(doc):
    doc = regex_match(doc, spousal_pattern, "SPOUSAL")
    return (doc)
nlp.add_pipe("find_spousal", before="ner")

doc = nlp("Mr. and Mrs. Smith are going to the movie")
for ent in doc.ents:
    print(ent.text, ent.label_)
Mr. and Mrs. Smith SPOUSAL
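Because the stated goal of the SPOUSAL label is to let users reconstruct each individual, here is a minimal, purely illustrative post-processing sketch (the split_spousal helper is hypothetical, not part of the pipeline):

def split_spousal(ent_text):
    # naive split: assumes the pattern "<Title> and <Title> <Surname>"
    titles, _, surname = ent_text.rpartition(" ")
    return [f"{t.strip(' .')}. {surname}" for t in titles.split(" and ")]

print(split_spousal("Mr. and Mrs. Smith"))   # ['Mr. Smith', 'Mrs. Smith']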

3.3.8. Creating a Pipe for Finding Ghettos#

In Holocaust documents, identifying ghettos can be difficult. Frequently, a ghetto has the same name as the city in which it is found. To distinguish between the city and the ghetto, speakers and writers often specify that they are referencing the ghetto, not the city generally, by using the word “ghetto” in close proximity to the city name. The pipe below leverages this tendency by looking for capitalized words that immediately precede the word “ghetto”, as well as constructions such as “ghetto of/in/at” followed by a capitalized place name.

ghettos_pattern1 = r"[A-Z]\w+((-| )*[A-Z]\w+)* (g|G)hetto"
ghettos_pattern2 = r"(g|G)hetto (of|in|at)((?=\s[A-Z])(?:\s[A-Z][a-z\.]+)+)"
nlp = spacy.load("en_core_web_sm")
@Language.component("find_ghettos")
def find_ghettos(doc):
    doc = regex_match(doc, ghettos_pattern1, "GHETTO")
    doc = regex_match(doc, ghettos_pattern2, "GHETTO")
    return (doc)
nlp.add_pipe("find_ghettos", before="ner")
doc = nlp("The Warsaw Ghetto was in Poland. The ghetto at Warsaw.")
displacy.render(doc, style="ent", jupyter=True)
The Warsaw Ghetto GHETTO was in Poland GPE . The ghetto at Warsaw. GHETTO

3.3.9. Creating a Geography Pipe#

Often in historical documents, the named geographic features of Europe are referenced. We can use both a list and a set of patterns to look for named mountains, forests, rivers, etc.

general_pattern = r"([A-Z]\w+) (River|Mountain|Mountains|Forest|Forests|Sea|Ocean)"
river_pattern = "(the|The) (Rhone|Volga|Danube|Ural|Dnieper|Don|Pechora|Kama|Oka|Belaya|Dniester|Rhine|Desna|Elbe|Donets|Vistula|Tagus|Daugava|Loire|Tisza|Ebro|Prut|Neman|Sava|Meuse|Kuban River|Douro|Mezen|Oder|Guadiana|Rhône|Kuma|Warta|Seine|Mureș|Northern Dvina|Vychegda|Drava|Po|Guadalquivir|Bolshoy Uzen|Siret|Maly Uzen|Terek|Olt|Vashka|Glomma|Garonne|Usa|Kemijoki|Great Morava|Moselle|Main|Torne|Dalälven|Inn|Maritsa|Marne|Neris|Júcar|Dordogne|Saône|Ume|Mur|Ångerman|Klarälven|Lule|Gauja|Weser|Kalix|Vindel River|Ljusnan|Indalsälven|Vltava|Ponoy|Ialomița|Onega|Somes|Struma|Adige|Skellefte|Tiber|Vah|Pite|Faxälven|Vardar|Shannon|Charente|Iskar|Tundzha|Ems|Tana|Scheldt|Timiș|Genil|Severn|Morava|Luga|Argeș|Ljungan|Minho|Venta|Thames|Drina|Jiu|Drin|Segura|Torne|Osam|Arda|Yantra|Kamchiya|Mesta)"

nlp = spacy.load("en_core_web_sm")

@Language.component("find_geography")
def find_geography(doc):
    doc = regex_match(doc, general_pattern, "GEOGRAPHY")
    doc = regex_match(doc, river_pattern, "GEOGRAPHY")    
    return (doc)
nlp.add_pipe("find_geography", before="ner")
doc = nlp("We entered the Black Forest and eventually walked to the Rhine.")
displacy.render(doc, style="ent", jupyter=True)
We entered the Black Forest GEOGRAPHY and eventually walked to the Rhine GEOGRAPHY .

3.3.10. Seeing the Pipes at Work#

Let’s now bring all this work together and assemble our pipes into a single pipeline. We will disable the standard ner pipe for now so that only our heuristic pipes annotate the text.

nlp = spacy.load("en_core_web_sm", disable=["ner"])
camp_ruler = nlp.add_pipe("entity_ruler", before="ner", name="camp_ruler")
camp_ruler.add_patterns(camp_patterns)
nlp.add_pipe("find_streets", before="ner")
nlp.add_pipe("find_spousal", before="ner")
nlp.add_pipe("find_ghettos", before="ner")
nlp.add_pipe("find_military", before="ner")
nlp.add_pipe("find_geography", before="ner")
nlp.add_pipe("named_ships", before="ner")
/home/wjbmattingly/anaconda3/envs/python-textbook/lib/python3.9/site-packages/spacy/language.py:1895: UserWarning: [W123] Argument disable with value ['ner'] is used instead of ['senter'] as specified in the config. Be aware that this might affect other components in your pipeline.
  warnings.warn(
<function __main__.named_ships(doc)>

Now that we have compiled our pipeline, let’s test it. We will grab the raw text of a USHMM oral history testimony and process it through our pipeline. As we can see below, we were able to identify three different types of entities: CAMP, GHETTO, and MILITARY. We can also see the specific entities we grabbed. Note that the purpose of a heuristic pipeline is not to grab every entity, but rather to extract as many true positives as possible at the expense of missing some examples. This can improve an overall pipeline that also leverages machine learning models. All 56 extracted entities are true positives.

import requests
import json
s = requests.get("https://collections.ushmm.org/search/catalog/irn505576.json")

data = json.loads(s.content)
doc = nlp(data["response"]["document"]["fnd_content_web"][0])
ents = list(doc.ents)
ent_types = [ent.label_ for ent in ents]
ent_types = list(set(ent_types))
ent_types.sort()
print(f"Found a Total of {len(list(doc.ents))} in {len(ent_types)} categories: {ent_types}")
for ent in ents:
    print(ent.text, ent.label_)
Found a total of 56 entities in 3 categories: ['CAMP', 'GHETTO', 'MILITARY']
Sobibór CAMP
Sobibór CAMP
Sobibór CAMP
Oberscharführer Gustav Wagner MILITARY
Sobibór CAMP
Sobibór CAMP
Oberscharführer Karl Frenzel MILITARY
Sobibór CAMP
Sobibor CAMP
Scharführer Hubert Gomerski MILITARY
Scharführer Michel MILITARY
Oberscharführer Hermann Michel MILITARY
Sobibór CAMP
Majdanek CAMP
Treblinka CAMP
Treblinka CAMP
Sobibór CAMP
Treblinka CAMP
Belzec CAMP
Belzec CAMP
Sobibór CAMP
Sobibór CAMP
Treblinka CAMP
Belzec CAMP
Untersturmführer Neumann MILITARY
Hauptscharführer Johann Niemann MILITARY
Sobibór CAMP
Sobibór CAMP
Sobibór CAMP
Sobibór CAMP
Auschwitz CAMP
Auschwitz CAMP
Sobibór CAMP
Oberscharführer Gustav Wagner MILITARY
Sobibór CAMP
Scharführer Wolf MILITARY
Unterscharführer Franz Wolf MILITARY
Sobibor CAMP
Sobibór CAMP
Sobibór CAMP
Sobibór CAMP
Sobibór CAMP
Sobibór CAMP
Sobibór CAMP
Sobibór CAMP
Sobibór CAMP
ghetto in Helm GHETTO
Sobibór CAMP
Sobibór CAMP
Sobibór CAMP
Sobibór CAMP
Sobibór CAMP
Sobibor CAMP
Sobibór CAMP
Sobibór CAMP
Sobibór CAMP

This pipeline can now extract structured metadata from unstructured raw text. With heuristics like the ones above, we can construct fairly robust NLP pipelines that leverage lists of known entities as well as spaCy’s built-in linguistic features to identify and extract useful and relevant information from our text data. If we wanted to improve our pipeline further, we could also train custom spaCy NER machine learning models, but this is beyond the scope of this book. With a strong basis in spaCy, however, and the plentiful resources available online, this should provide you with a good starting point to begin working with spaCy in your own projects.