{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Creating Rules-Based Pipeline for Holocaust Documents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section, we will walk through how to develop a complete heuristic spaCy pipeline for performing named entity recognition. We will leverage a combination of EntityRuler pipes and custom spaCy pipes that use RegEx to find larger matches." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a Blank spaCy Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first thing we need to do is import all of the different components from spaCy and other libraries that we will need. I will explain these as we go forward." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import spacy\n", "from spacy.util import filter_spans\n", "from spacy.tokens import Span\n", "from spacy.language import Language\n", "import re\n", "import pandas as pd\n", "import unidecode\n", "from spacy import displacy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will be using Pandas in this notebook to run data checks in a CSV file originally produced by the Holocaust Geographies Collaborative, headed by Anne Knowles, Tim Cole, Alberto Giordano, Paul Jaskot, and Anika Walke. We will be importing RegEx because a lot of our heuristics will rely on capturing multi-word tokens.\n", "\n", "Now that we have imported everything, let's create a blank English pipeline in spaCy. As we work through this notebook, we will add pipes to it." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "nlp = spacy.load(\"en_core_web_sm\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating EntityRulers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the first section of this chapter, we looked at acquiring a camps dataset from the USHMM website. We will now build off that section by normalizing our dataset to include camp names with and without accent marks." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "ushmm_camps = ['Alderney', 'Amersfoort', 'Auschwitz', 'Banjica', 'Bełżec', 'Bergen-Belsen', 'Bernburg', 'Bogdanovka', 'Bolzano', 'Bor', 'Breendonk',\n", " 'Breitenau', 'Buchenwald', 'Chełmno', 'Dachau', 'Drancy', 'Falstad', 'Flossenbürg', 'Fort VII', 'Fossoli', 'Grini', 'Gross-Rosen',\n", " 'Herzogenbusch', 'Hinzert', 'Janowska', 'Jasenovac', 'Kaiserwald', 'Kaunas', 'Kemna', 'Klooga', 'Le Vernet', 'Majdanek', 'Malchow',\n", " 'Maly Trostenets', 'Mechelen', 'Mittelbau-Dora', 'Natzweiler-Struthof', 'Neuengamme', 'Niederhagen', 'Oberer Kuhberg', 'Oranienburg',\n", " 'Osthofen', 'Płaszów', 'Ravensbruck', 'Risiera di San Sabba', 'Sachsenhausen', 'Sajmište', 'Salaspils', 'Sobibór', 'Soldau', 'Stutthof',\n", " 'Theresienstadt', 'Trawniki', 'Treblinka', 'Vaivara']\n", "\n", "camp_patterns = []\n", "for camp in ushmm_camps:\n", " normalized = unidecode.unidecode(camp)\n", " camp_patterns.append({\"pattern\": camp, \"label\": \"CAMP\"})\n", " camp_patterns.append({\"pattern\": normalized, \"label\": \"CAMP\"})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we are creating a list called `camp_patterns`. These are a set of patterns expected by a spaCy EntityRuler. With our patterns created, we can create our EntityRuler. Note that we are placing the ruler before the NER pipe in the model. 
This is so that our heuristic EntityRuler will be able to annotate the doc container before the NER model. We are also giving it a custom name, `camp_ruler`. This is so that we know precisely what this ruler does in our pipeline when we examine the pipeline structure." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "camp_ruler = nlp.add_pipe(\"entity_ruler\", before=\"ner\", name=\"camp_ruler\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the ruler created, we can then populate it with the patterns." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "camp_ruler.add_patterns(camp_patterns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, it is time to test the pipeline to make sure it is working as intended." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " Auschwitz\n", " CAMP\n", "\n", " was a camp during \n", "\n", " WWII\n", " EVENT\n", "\n", ". \n", "\n", " Płaszów\n", " CAMP\n", "\n", " was also a camp. Another spelling is \n", "\n", " Plaszow\n", " CAMP\n", "\n", ".
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "doc = nlp(\"Auschwitz was a camp during WWII. Płaszów was also a camp. Another spelling is Plaszow.\")\n", "displacy.render(doc, style=\"ent\", jupyter=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that our pipeline is now correctly identifying all forms of camps as camps correctly. This will ensure that our pipeline can function with different spellings of these words." ] }, { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true, "tags": [] }, "source": [ "## Creating Function for Matching RegEx" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As noted in the previous section, one of the main challenges with Holocaust-related data is the high degree of toponyms, or words that have identical spelling but different classifications. We will see this issue with several of our classes. Many US ships, for example, bear the name of states. Additionally, named ghettos, such as the Warsaw Ghetto, also can function as a general place (GPE). If we want to use heuristics, therefore, we must create rather robust rules to ensure that our rules have a high degree of accuracy. We can use the following function to do just that by leveraging RegEx.\n", "\n", "We will be using this function quite a bit for each of our custom spaCy pipes, so let's explore this function in depth." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def regex_match(doc, pattern, label, filter=True,\n", " context=False, context_list=[],\n", " window_start=100, window_end=100):\n", " text = doc.text\n", " new_ents = []\n", " original_ents = list(doc.ents)\n", " for match in re.finditer(pattern, doc.text):\n", " start, end = match.span()\n", " span = doc.char_span(start, end)\n", " if context==True:\n", " window = text[start-window_start:end+window_end]\n", " if any(term in window.lower() for term in context_list):\n", " if span is not None:\n", " new_ents.append((span.start, span.end, span.text))\n", " else:\n", " if span is not None:\n", " new_ents.append((span.start, span.end, span.text))\n", " for ent in new_ents:\n", " start, end, name = ent\n", " new_ent = Span(doc, start, end, label=label)\n", " original_ents.append(new_ent)\n", " if filter==True:\n", " filtered = filter_spans(original_ents)\n", " final_ents = filtered\n", " else:\n", " final_ents = new_ents\n", " doc.ents = final_ents\n", " return doc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "def The purpose of this function is to leverage RegEx and spaCy to perform full-text matching. This allows us to use the strengths of both NLP techniques. With RegEx, we can perform complex fuzzy string matching beyond an individual token. With spaCy, we can leverage linguistic knowledge about the text. This function will let us pass a RegEx pattern, a doc container, and a custom label, then find all items that match the pattern. It will then inject those RegEx matches into spaCy `Span` containers that will then be manually inserted into the found entities. Let's break this down in the code.\n", "\n", "The first line of our code is as follows:\n", "\n", "```\n", "def regex_match(doc, pattern, label, filter=True,\n", " context=False,The context_list=[],\n", " window_start=100, window_end=100):\n", "```\n", "\n", "Here, we are creating a function that will take three mandatory arguments: `doc`, `pattern`, and `label`. 
The `doc` parameter will be the Doc container created by the spaCy nlp pipeline. The `pattern` will be the RegEx pattern that we want to use to perform a match. Finally, the `label` is the label that we want to assign to the items matched.\n", "\n", "We also have several keyword arguments:\n", "- `filter`\n", "- `context`\n", "- `context_list`\n", "- `window_start`\n", "- `window_end`\n", "\n", "These keyword arguments allow for greater versatility in the function. The `filter` argument lets a user filter the entities and keep only the longest ones wherever matches overlap. Remember, spaCy entities are hard classifications, meaning a token can only belong to a single class, or label. Filtering ensures that when we manually place our found RegEx matches into the Doc container's entities, there are no overlapping entities.\n", "\n", "The `context` parameter is a Boolean that allows a user to pass a `context_list` to the function. This will be a list of words that increases the probability that an identified token belongs to a specific class. We will see this at play when we write rules for the labels `GHETTO` and `SHIP`. As noted above, ghetto names frequently function as both `GHETTO` and `GPE`. The presence of certain words around a match, such as `ghetto`, radically increases the chance that our hit is a `GHETTO` and not a `GPE`. Likewise, many ships in the US fleet are named after states. This means we could potentially mark New York, for example, as a `SHIP` rather than as a `GPE`.\n", "\n", "The `window_start` and `window_end` arguments control the size of the context window, so the function will only look forward and backward n characters for the presence of a word in the `context_list`. This is a simple heuristic, but one that proves quite effective.\n", "\n", "The next bit of code is as follows:\n", "\n", "```\n", " text = doc.text\n", " new_ents = []\n", " original_ents = list(doc.ents)\n", "```\n", "\n", "Here, we are grabbing the raw text of the Doc container. This will be necessary if the user has enabled context searching. Next, we create an empty list of new entities. This will be the list to which we append our new hits. Finally, we grab the original entities already found by earlier pipes. It is important to convert this to a list because `doc.ents` is an immutable tuple; converting it to a list lets us append our new entities to it.\n", "\n", "The next block of code looks like this:\n", "```\n", " for match in re.finditer(pattern, doc.text):\n", " start, end = match.span()\n", " span = doc.char_span(start, end)\n", " if context==True:\n", " window = text[start-window_start:end+window_end]\n", " if any(term in window.lower() for term in context_list):\n", " if span is not None:\n", " new_ents.append((span.start, span.end, span.text))\n", " else:\n", " if span is not None:\n", " new_ents.append((span.start, span.end, span.text))\n", "```\n", "\n", "Here, we are using `re.finditer()` to find all cases where our RegEx pattern matches in `doc.text`. We iterate over each match and create a `Span` from its start and end characters with `doc.char_span()`. If the user has enabled `context` searching, then we look for any term in their `context_list` to see if it appears in the given window they have established. 
If there is a match (and the RegEx hit aligns with token boundaries), then we append the span's start, end, and text to `new_ents`.\n", "\n", "Our final section of code reads:\n", "\n", "```\n", " for ent in new_ents:\n", " start, end, name = ent\n", " new_ent = Span(doc, start, end, label=label)\n", " original_ents.append(new_ent)\n", " if filter==True:\n", " filtered = filter_spans(original_ents)\n", " final_ents = filtered\n", " else:\n", " final_ents = new_ents\n", " doc.ents = final_ents\n", " return doc\n", "```\n", "\n", "Here, we are iterating over each of our newly found matches. We then convert each match into a `Span` with the specified `label` and place it into the `original_ents` list. Finally, we filter the spans so that there are no overlapping entities.\n", "\n", "Once complete, we reset `doc.ents` to the `final_ents` and return the `doc` object back to the user." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Adding a Pipe for Finding Streets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's examine this function in practice by trying to find and extract all streets in a text. Because our data comes from Europe, we need to account for the ways streets are written in multiple languages. Let's look at the example below, where we use RegEx to capture the ways in which a street may appear in German (and Dutch) as well as English. Unlike English, German street names often contain the word for street (strasse) as a compound within the name itself. English, on the other hand, has many different ways to render the concept of a street, both in abbreviated and full forms. RegEx is a perfect tool for working with this degree of variance." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " Berlinstrasse\n", " STREET\n", "\n", " was a famous street. In \n", "\n", " America\n", " GPE\n", "\n", ", many towns have a \n", "\n", " Main Street.\n", " STREET\n", "\n", " It can also be spelled \n", "\n", " Main St.\n", " STREET\n", "\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "nlp = spacy.load(\"en_core_web_sm\")\n", "\n", "german_streets = r\"[A-Z][a-z]*(strasse|straße|straat)\\b\"\n", "english_streets = r\"([A-Z][a-z]* (Street|St|Boulevard|Blvd|Avenue|Ave|Road|Rd|Lane|Ln|Place|Pl)(\\.)*)\"\n", "@Language.component(\"find_streets\")\n", "def find_streets(doc):\n", " doc = regex_match(doc, german_streets, \"STREET\")\n", " doc = regex_match(doc, english_streets, \"STREET\")\n", " return (doc)\n", "nlp.add_pipe(\"find_streets\", before=\"ner\")\n", "\n", "doc = nlp(\"Berlinstrasse was a famous street. In America, many towns have a Main Street. It can also be spelled Main St.\")\n", "displacy.render(doc, style=\"ent\", jupyter=True)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Creating a Pipe for Finding Ships" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ships frequently appear in Holocaust documents. We could provide a list of ships to an EntityRuler, but finding one of these patterns isn't enough. We want to ensure that the thing referenced is in fact a ship. Many of these terms could easily be toponyms, or entities that share the same spelling but mean different things in different contexts, e.g. the Ile de France, could easily be a GPE that refers to the area around Paris. General Hosey could easily be a PERSON. These are also known ships. To ensure toponym disambiguation, I set up several contextual clues, e.g. the list of nautical terms. If any of these words appear in area around the hit, then the heuristics assign that token the label of SHIP. If not, it ignores it and allows later pipes or the machine learning model to annotate it.\n", "\n", "We can gather a list of known WWII ships from the WWII Database (https://ww2db.com/). One of the problems with the lists available here is that they are not utf-8 encoded. This means that the text is not normalized. Inside this repository is a csv file with a normalized version of these ship names. This CSV file also stores metadata about ships, namely their class, country of origin, and the year they were built.\n", "\n", "In spaCy, we can store this metadata inside the entity Span container with a custom extension. This is done via a function known as a `getter` function that maps the data to the custom extension. Let's see how this works in practice. We will use a synthetic sentence in which the `Activity`, a British vessel, is referenced as is `New York`, a place as well as a US ship active during WWII." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Activity SHIP {'class': 'Activity-class Escort Carrier', 'country': 'United Kingdom', 'year_built': '1942'}\n", "New York GPE False\n" ] } ], "source": [ "def ship_metadata_getter(ent):\n", " df = pd.read_csv(\"../data/wwii-ships.csv\")\n", " if ent.label_ == \"SHIP\":\n", " ship_name = ent.text.replace(\"The\", \"\").replace(\"the\", \"\").strip()\n", " row = df.loc[df.name==ship_name]\n", " if len(row) > 0:\n", " row = row.iloc[0]\n", " return {\"class\": row[\"class\"], \"country\": row.country, \"year_built\": row.year}\n", " else:\n", " return None\n", " else:\n", " return False\n", "nlp = spacy.load(\"en_core_web_sm\")\n", "df = pd.read_csv(\"../data/wwii-ships.csv\")\n", "\n", "named_ships_pattern = f\"(The|the) ({'|'.join(df.name.tolist())})\"\n", "\n", "@Language.component(\"named_ships\")\n", "def named_ships(doc):\n", " nautical = [\"ship\", \"boat\", \"sail\", \"captain\", \"sea\", \"harbor\", \"aboard\", \"admiral\", \"liner\"]\n", " doc = regex_match(doc, named_ships_pattern, \"SHIP\", context=True, context_list=nautical)\n", " for ent in doc.ents:\n", " if ent.label_ == \"SHIP\":\n", " ent.set_extension('ship_metadata', getter=ship_metadata_getter, force=True)\n", " return (doc)\n", "nlp.add_pipe(\"named_ships\", before=\"ner\")\n", "doc = nlp(\"The Activity set sail from New York\")\n", "for ent in doc.ents:\n", " print(ent.text, ent.label_, ent._.ship_metadata)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a lot happening in this section of code, so let's break down the different components here. The first function is our `getter` function: `ship_metadata_getter()`.\n", "\n", "```\n", "def ship_metadata_getter(ent):\n", " df = pd.read_csv(\"../data/wwii-ships.csv\")\n", " if ent.label_ == \"SHIP\":\n", " ship_name = ent.text.replace(\"The\", \"\").replace(\"the\", \"\").strip()\n", " row = df.loc[df.name==ship_name]\n", " if len(row) > 0:\n", " row = row.iloc[0]\n", " return {\"class\": row[\"class\"], \"country\": row.country, \"year_built\": row.year}\n", " else:\n", " return None\n", " else:\n", " return False\n", "```\n", "\n", "This function takes a single argument, the entity identified from the RegEx match. Here, we strip out the `The` from the entity's name and check to see if the ship found appears in our dataset of known ships. We then isolate the ship's metadata in the dataset and return that data as a dictionary.\n", "\n", "```\n", "df = pd.read_csv(\"../data/wwii-ships.csv\")\n", "\n", "named_ships_pattern = f\"(The|the) ({'|'.join(df.name.tolist())})\"\n", "\n", "@Language.component(\"named_ships\")\n", "def named_ships(doc):\n", " nautical = [\"ship\", \"boat\", \"sail\", \"captain\", \"sea\", \"harbor\", \"aboard\", \"admiral\", \"liner\"]\n", " doc = regex_match(doc, named_ships_pattern, \"SHIP\", context=True, context_list=nautical)\n", " for ent in doc.ents:\n", " if ent.label_ == \"SHIP\":\n", " ent.set_extension('ship_metadata', getter=ship_metadata_getter, force=True)\n", " return (doc)\n", "```\n", "\n", "In this next section of code, we load up the dataset to construct our pattern of names to find via our RegEx function. As we iterate over each entity in our Doc container, we check to see if that entity's label is a `SHIP`. 
If it is, then we set the extension `ship_metadata` via the `getter` function referenced above.\n", "\n", "Finally, we add our pipe to the spaCy pipeline and test the model with our constructed sentence.\n", "\n", "```\n", "nlp.add_pipe(\"named_ships\", before=\"ner\")\n", "doc = nlp(\"The Activity set sail from New York\")\n", "for ent in doc.ents:\n", " print(ent.text, ent.label_, ent._.ship_metadata)\n", "```" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Creating a Pipe for Identifying Military Personnel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Military personnel are often referenced in English documents with their military rank. This means that we can use basic RegEx rules to find and extract those individuals. When we encounter a sequence of capitalized tokens after a military rank, we can make a fairly safe presumption that these tokens are the name associated with that rank. For this pipeline, we can again use the WWII Database to isolate and grab all known military ranks used during WWII across all countries." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " Captain James T. Kirk\n", " MILITARY\n", "\n", " commanded the ship.
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "nlp = spacy.load(\"en_core_web_sm\")\n", "with open(\"../data/military_ranks.txt\", \"r\") as f:\n", " ranks = f.read().splitlines()\n", "military_pattern = f\"({'|'.join(ranks)})((?=\\s[A-Z])(?:\\s[A-Z][a-z\\.]+)+)\"\n", "@Language.component(\"find_military\")\n", "def find_military(doc):\n", " doc = regex_match(doc, military_pattern, \"MILITARY\")\n", " return (doc)\n", "nlp.add_pipe(\"find_military\", before=\"ner\")\n", "\n", "doc = nlp(\"Captain James T. Kirk commanded the ship.\")\n", "displacy.render(doc, style=\"ent\", jupyter=True)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Create Pipe for Identifying Spouses" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often times in historical documents the identity of people are referenced collectively. In some instances, such as those of spouses, this results in the name of the woman being attached to the name of her husband. The purpose of this SPOUSAL entity is to identify such constructs so that users can manipulate the output and reconstruct each individual singularly." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mr. and Mrs. Smith SPOUSAL\n" ] } ], "source": [ "nlp = spacy.load(\"en_core_web_sm\")\n", "spousal_pattern = r\"((Mr|Mrs|Miss|Dr)(\\.)* and (Mr|Mrs|Miss|Dr)(\\.)*((?=\\s[A-Z])(?:\\s[A-Z][a-z\\.]+)+))\"\n", "@Language.component(\"find_spousal\")\n", "def find_spousal(doc):\n", " doc = regex_match(doc, spousal_pattern, \"SPOUSAL\")\n", " return (doc)\n", "nlp.add_pipe(\"find_spousal\", before=\"ner\")\n", "\n", "doc = nlp(\"Mr. and Mrs. Smith are going to the movie\")\n", "for ent in doc.ents:\n", " print(ent.text, ent.label_)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Creating a Pipe for Finding Ghettos" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Holocaust documents, identifying ghettos can be difficult. Frequently a ghetto has the same name as the city in which it is found. To distinguish between the city and the ghetto, speakers and writers often specify that they are referencing the ghetto, not the city generally, by using the word \"ghetto\" in near proximity to the city name. The below pipe leverages this tendency by looking for capitalized words that precede the word \"ghetto\"." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " The Warsaw Ghetto\n", " GHETTO\n", "\n", " was in \n", "\n", " Poland\n", " GPE\n", "\n", ". The \n", "\n", " ghetto at Warsaw.\n", " GHETTO\n", "\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "ghettos_pattern1 = r\"[A-Z]\\w+((-| )*[A-Z]\\w+)* (g|G)hetto\"\n", "ghettos_pattern2 = r\"(g|G)hetto (of|in|at)((?=\\s[A-Z])(?:\\s[A-Z][a-z\\.]+)+)\"\n", "nlp = spacy.load(\"en_core_web_sm\")\n", "@Language.component(\"find_ghettos\")\n", "def find_ghettos(doc):\n", " doc = regex_match(doc, ghettos_pattern1, \"GHETTO\")\n", " doc = regex_match(doc, ghettos_pattern2, \"GHETTO\")\n", " return (doc)\n", "nlp.add_pipe(\"find_ghettos\", before=\"ner\")\n", "doc = nlp(\"The Warsaw Ghetto was in Poland. The ghetto at Warsaw.\")\n", "displacy.render(doc, style=\"ent\", jupyter=True)" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Creating a Geography Pipe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often in historical documents, the named geographic features of Europe are referenced. We can use both a list and a set of patterns to look for named mountains, forests, rivers, etc." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
We entered the \n", "\n", " Black Forest\n", " GEOGRAPHY\n", "\n", " and eventually walked to \n", "\n", " the Rhine\n", " GEOGRAPHY\n", "\n", ".
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "general_pattern = r\"([A-Z]\\w+) (River|Mountain|Mountains|Forest|Forests|Sea|Ocean)*\"\n", "river_pattern = \"(the|The) (Rhone|Volga|Danube|Ural|Dnieper|Don|Pechora|Kama|Oka|Belaya|Dniester|Rhine|Desna|Elbe|Donets|Vistula|Tagus|Daugava|Loire|Tisza|Ebro|Prut|Neman|Sava|Meuse|Kuban River|Douro|Mezen|Oder|Guadiana|Rhône|Kuma|Warta|Seine|Mureș|Northern Dvina|Vychegda|Drava|Po|Guadalquivir|Bolshoy Uzen|Siret|Maly Uzen|Terek|Olt|Vashka|Glomma|Garonne|Usa|Kemijoki|Great Morava|Moselle|Main 525|Torne|Dalälven|Inn|Maritsa|Marne|Neris|Júcar|Dordogne|Saône|Ume|Mur|Ångerman|Klarälven|Lule|Gauja|Weser|Kalix|Vindel River|Ljusnan|Indalsälven|Vltava|Ponoy|Ialomița|Onega|Somes|Struma|Adige|Skellefte|Tiber|Vah|Pite|Faxälven|Vardar|Shannon|Charente|Iskar|Tundzha|Ems|Tana|Scheldt|Timiș|Genil|Severn|Morava|Luga|Argeș|Ljungan|Minho|Venta|Thames|Drina|Jiu|Drin|Segura|Torne|Osam|Arda|Yantra|Kamchiya|Mesta)\"\n", "\n", "nlp = spacy.load(\"en_core_web_sm\")\n", "\n", "@Language.component(\"find_geography\")\n", "def find_geography(doc):\n", " doc = regex_match(doc, general_pattern, \"GEOGRAPHY\")\n", " doc = regex_match(doc, river_pattern, \"GEOGRAPHY\") \n", " return (doc)\n", "nlp.add_pipe(\"find_geography\", before=\"ner\")\n", "doc = nlp(\"We entered the Black Forest and eventually walked to the Rhine.\")\n", "displacy.render(doc, style=\"ent\", jupyter=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Seeing the Pipes at Work" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now bring all this work together and assemble all our pipes into a single pipeline. We will disable the standard \"ner\" in this pipeline for now." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/wjbmattingly/anaconda3/envs/python-textbook/lib/python3.9/site-packages/spacy/language.py:1895: UserWarning: [W123] Argument disable with value ['ner'] is used instead of ['senter'] as specified in the config. Be aware that this might affect other components in your pipeline.\n", " warnings.warn(\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlp = spacy.load(\"en_core_web_sm\", disable=[\"ner\"])\n", "camp_ruler = nlp.add_pipe(\"entity_ruler\", before=\"ner\", name=\"camp_ruler\")\n", "camp_ruler.add_patterns(camp_patterns)\n", "nlp.add_pipe(\"find_streets\", before=\"ner\")\n", "nlp.add_pipe(\"find_spousal\", before=\"ner\")\n", "nlp.add_pipe(\"find_ghettos\", before=\"ner\")\n", "nlp.add_pipe(\"find_military\", before=\"ner\")\n", "nlp.add_pipe(\"find_geography\", before=\"ner\")\n", "nlp.add_pipe(\"named_ships\", before=\"ner\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have compiled our pipeline, let's test it. We will grab the raw text of a USHMM oral history testimony and process it through our pipeline. As we can see below, we were able to identify 3 different types of entities: `CAMP`, `GHETTO`, and `MILITARY`. We can also see the specific entities we grabbed. Note that the purpose of a heuristic pipeline is not to grab all entities, rather extract as many true positives as possible at the expense of missing a few examples. This can improve an overall pipeline that also leverages machine learning models. All 56 extracted entities are true positives." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found a Total of 56 in 3 categories: ['CAMP', 'GHETTO', 'MILITARY']\n", "Sobibór CAMP\n", "Sobibór CAMP\n", "Sobibór CAMP\n", "Oberscharführer Gustav Wagner MILITARY\n", "Sobibór CAMP\n", "Sobibór CAMP\n", "Oberscharführer Karl Frenzel MILITARY\n", "Sobibór CAMP\n", "Sobibor CAMP\n", "Scharführer Hubert Gomerski MILITARY\n", "Scharführer Michel MILITARY\n", "Oberscharführer Hermann Michel MILITARY\n", "Sobibór CAMP\n", "Majdanek CAMP\n", "Treblinka CAMP\n", "Treblinka CAMP\n", "Sobibór CAMP\n", "Treblinka CAMP\n", "Belzec CAMP\n", "Belzec CAMP\n", "Sobibór CAMP\n", "Sobibór CAMP\n", "Treblinka CAMP\n", "Belzec CAMP\n", "Untersturmführer Neumann MILITARY\n", "Hauptscharführer Johann Niemann MILITARY\n", "Sobibór CAMP\n", "Sobibór CAMP\n", "Sobibór CAMP\n", "Sobibór CAMP\n", "Auschwitz CAMP\n", "Auschwitz CAMP\n", "Sobibór CAMP\n", "Oberscharführer Gustav Wagner MILITARY\n", "Sobibór CAMP\n", "Scharführer Wolf MILITARY\n", "Unterscharführer Franz Wolf MILITARY\n", "Sobibor CAMP\n", "Sobibór CAMP\n", "Sobibór CAMP\n", "Sobibór CAMP\n", "Sobibór CAMP\n", "Sobibór CAMP\n", "Sobibór CAMP\n", "Sobibór CAMP\n", "Sobibór CAMP\n", "ghetto in Helm GHETTO\n", "Sobibór CAMP\n", "Sobibór CAMP\n", "Sobibór CAMP\n", "Sobibór CAMP\n", "Sobibór CAMP\n", "Sobibor CAMP\n", "Sobibór CAMP\n", "Sobibór CAMP\n", "Sobibór CAMP\n" ] } ], "source": [ "import requests\n", "import json\n", "s = requests.get(\"https://collections.ushmm.org/search/catalog/irn505576.json\")\n", "\n", "data = json.loads(s.content)\n", "doc = nlp(data[\"response\"][\"document\"][\"fnd_content_web\"][0])\n", "ents = list(doc.ents)\n", "ent_types = [ent.label_ for ent in ents]\n", "ent_types = list(set(ent_types))\n", "ent_types.sort()\n", "print(f\"Found a Total of {len(list(doc.ents))} in {len(ent_types)} categories: {ent_types}\")\n", "for ent in ents:\n", " print(ent.text, ent.label_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This pipeline can now extract structured metadata from unstructured raw text. With heuristics like the ones above, we can construct fairly robust NLP pipelines that leverage lists of known entities as well as spaCy's built in linguistic features to identify and extract useful and relevant from our text data. If we wanted to improve our pipeline, we could also train custom spaCy NER machine learning models, but this is beyond the scope of this book. With a strong basis in spaCy, however, and the plentiful resources available online, this should provide you with a good starting point to begin working with spaCy in your own projects." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 4 }