Cultivating Good Datasets for Entities

3.1. Cultivating Good Datasets for Entities#

3.1.1. Introduction to Datasets#

One of the greatest challenges when developing a rules-based pipeline for named entity recognition is finding quality datasets. In certain domains, data is readily available, but in many of the domains associated with the humanities, datasets can be hard to find. For this reason, we often need to create our own datasets. In this section, we will set it upon ourselves to find a list of concentration camps from data that is publicly available on the web. The data we acquire will not be perfect, nor will it be complete, but it will provide a starting point for creating a dataset of known concentration camps.

3.1.2. Acquiring the Data#

A good way to think about datasets for named entity recognition heuristics is as lists. We want to construct a list of known concentration camps that we can then pass to a spaCy EntityRuler pipe as a set of patterns to which we can assign the label CAMP. One of the first questions you will ask yourself in this pursuit, is “where can I acquire lists?” The answer is, unfortunately, “it depends.” Sometimes good datasets exist. There are a few good places to look, such as GitHub, Wikipedia, and academic digital projects. For each project, you must don your detective goggles and explore the web to find places to acquire this data. Most times, it will take a bit of work (and some Python code) to get the data into a structured form.

If we are looking to generate entities for concentration camps, we have a wealth of data, but this data is not necessary cleaned or structured. Let’s examine three different locations where we can collate a list of concentration camps from the web and the strengths and weaknesses of those sources.

3.1.3. United States Holocaust Memorial Museum#

The United States Holocaust Memorial Museum (USHMM), located in Washington, D.C. in the United States, is an excellent source for data on the Holocaust. When searching the USHMM collections, one way to limit your search is by Key Camps (https://www.ushmm.org/). This list looks like this:

ushmm_camps = ['Alderney', 'Amersfoort', 'Auschwitz', 'Banjica', 'Bełżec', 'Bergen-Belsen', 'Bernburg', 'Bogdanovka', 'Bolzano', 'Bor', 'Breendonk',
         'Breitenau', 'Buchenwald', 'Chełmno', 'Dachau', 'Drancy', 'Falstad', 'Flossenbürg', 'Fort VII', 'Fossoli', 'Grini', 'Gross-Rosen',
         'Herzogenbusch', 'Hinzert', 'Janowska', 'Jasenovac', 'Kaiserwald', 'Kaunas', 'Kemna', 'Klooga', 'Le Vernet', 'Majdanek', 'Malchow',
         'Maly Trostenets', 'Mechelen', 'Mittelbau-Dora', 'Natzweiler-Struthof', 'Neuengamme', 'Niederhagen', 'Oberer Kuhberg', 'Oranienburg',
         'Osthofen', 'Płaszów', 'Ravensbruck', 'Risiera di San Sabba', 'Sachsenhausen', 'Sajmište', 'Salaspils', 'Sobibór', 'Soldau', 'Stutthof',
         'Theresienstadt', 'Trawniki', 'Treblinka', 'Vaivara']
print (ushmm_camps)

['Alderney', 'Amersfoort', 'Auschwitz', 'Banjica', 'Bełżec', 'Bergen-Belsen', 'Bernburg', 'Bogdanovka', 'Bolzano', 'Bor', 'Breendonk', 'Breitenau', 'Buchenwald', 'Chełmno', 'Dachau', 'Drancy', 'Falstad', 'Flossenbürg', 'Fort VII', 'Fossoli', 'Grini', 'Gross-Rosen', 'Herzogenbusch', 'Hinzert', 'Janowska', 'Jasenovac', 'Kaiserwald', 'Kaunas', 'Kemna', 'Klooga', 'Le Vernet', 'Majdanek', 'Malchow', 'Maly Trostenets', 'Mechelen', 'Mittelbau-Dora', 'Natzweiler-Struthof', 'Neuengamme', 'Niederhagen', 'Oberer Kuhberg', 'Oranienburg', 'Osthofen', 'Płaszów', 'Ravensbruck', 'Risiera di San Sabba', 'Sachsenhausen', 'Sajmište', 'Salaspils', 'Sobibór', 'Soldau', 'Stutthof', 'Theresienstadt', 'Trawniki', 'Treblinka', 'Vaivara']

3.1.4. Normalizing Data#

While this dataset is cleaned and good, it has certain limitations. First, it is not complete. This is a list of key camps, not all camps. Note that subcamps are left off the list. The second problem we have is that these camps of certain characters in their names that reflect the accent marks or letters that are not in the English alphabet. Some Holocaust texts, however, use only English letters and characters. Therefore, searches for certain words, such as Bełżec will not return a hit in a search for Belzec. It is important, therefore, to make sure both forms of the word are represented in a rules-based search.

We can normalize accented text in Python with a single line of code via the unidecode library which can be installed with pip install unidecode. Once installed, we can import unidecode.

import unidecode

The unidecode library comes with the class unidcode() which will take a single argument, the string that we wish to standardize. Let’s examine how this works in practice with the camp Bełżec.

for camp in ushmm_camps[4:5]:
    normalized = unidecode.unidecode(camp)
    print(camp, normalized)

Bełżec Belzec

Note the standardization of ‘Bełżec’ as ‘Belzec’ in the list. Both forms are now represented in our dataset, meaning we can develop a rules-based EntityRuler that can find both forms of these words in texts. While we were able to solve the first problem, that of standardized data, we cannot solve the first, however. Should we wish, though, we could add this dataset to our Wikipedia datasets, but as we will see below, a larger dataset presents new challenges.