3.6. SNA on Humanities Data: Structuring the Data

In this section, we will apply what we have learned about SNA to real humanities data. Again, we will be working with the Volume 7 Dataset from the South African TRC. Unlike the last chapter on Topic Modeling, we will not be interested in how the descriptions of violence cluster together. Instead, we will explore how organizations relate to specific individuals as they appear in those individuals’ descriptions. It is important to note that this approach does not tell us the nature of the relationship between the individual and the organization: a person may have been a member of the organization or a victim of it. To understand these relationships, we would first want to use spaCy and a few NLP techniques to extract meaning about them.

3.6.1. Examining the Data

First, let’s import our required libraries. We are only concerned with structuring the data in this section, so the main library we need is Pandas. We will also import random because we want to assign a random color to each unique organization. This lets our approach scale easily if we were to add thousands of new organizations to our dataset. For a polished, final version, one would want to manually assign a color to each organization so that the results are reproducible.

import pandas as pd
import random
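Because the colors are drawn at random, they will differ every time the notebook is run. As a minimal sketch of one way to make them repeatable without assigning each color by hand, we can draw from a seeded random number generator. The helper name random_hex_color and the seed value 42 are arbitrary choices made for this illustration; they are not part of the code used in the rest of this section.

```python
import random


def random_hex_color(rng):
    """Return a random six-digit hex color string such as '#3FA2C1'."""
    return "#" + "".join(rng.choice("0123456789ABCDEF") for _ in range(6))


# Two generators created with the same (arbitrary) seed produce the same colors
rng = random.Random(42)
first_run = [random_hex_color(rng) for _ in range(3)]

rng = random.Random(42)
second_run = [random_hex_color(rng) for _ in range(3)]

print(first_run == second_run)  # True
```

With a fixed seed, the colors stay stable as long as the organizations are encountered in the same order.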

We will now load our data. We will work with the first 1,000 rows and only need four of the columns: Last, First, ORG, and Place.

df = pd.read_csv("../data/trc.csv")
df = df[:1000]
df = df[["Last", "First","ORG", "Place"]]
df
     Last      First                       ORG                   Place
0    AARON     Thabo Simon                 ANC|ANCYL|Police|SAP  Bethulie
1    ABBOTT    Montaigne                   SADF                  Messina
2    ABRAHAM   Nzaliseko Christopher       COSAS|Police          Mdantsane
3    ABRAHAMS  Achmat Fardiel              SAP                   Athlone
4    ABRAHAMS  Annalene Mildred            Police|SAP            Robertson
...  ...       ...                         ...                   ...
995  CELE      Nompumelelo Iris ‘Magwaza’  ANC                   Ndwedwe
996  CELE      Nomvula Eunice              ANC                   Umbumbulu
997  CELE      Nonhlanhla Evelina          ANC                   Umzimkulu
998  CELE      Nozimpahla                  NaN                   Sonkombo
999  DLAMINI   Cedric Bongani              ANC                   Durban

1000 rows × 4 columns
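Two features of the ORG column shape everything that follows: some rows list several organizations separated by a | character, and some rows have no organization at all (NaN). The sketch below shows one way to check for both. To stay self-contained it rebuilds three of the rows displayed above in a small DataFrame called sample, a name used only for this illustration; on the real data the same calls would be applied to df.

```python
import pandas as pd

# Three rows copied from the output above, standing in for the full dataset
sample = pd.DataFrame({
    "Last": ["AARON", "ABBOTT", "CELE"],
    "First": ["Thabo Simon", "Montaigne", "Nozimpahla"],
    "ORG": ["ANC|ANCYL|Police|SAP", "SADF", None],
    "Place": ["Bethulie", "Messina", "Sonkombo"],
})

# How many rows have no organization listed?
missing_org = sample["ORG"].isnull().sum()

# How many organizations does each row mention? (NaN where ORG is missing)
orgs_per_row = sample["ORG"].str.split("|").str.len()

print(missing_org)
print(orgs_per_row)
```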

Now that we have our data, we can begin to clean it and prepare it for inclusion in a NetworkX graph. Our goal is to create a list of nodes and a list of edges, which we will then store as two separate Pandas DataFrames. We will then be able to use this data for graph creation in the next section. To do this, we will use the following code.

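nodes = []       # one dictionary per node (individuals and organizations)
edge_list = []   # one dictionary per organization-individual edge
found_orgs = []  # organizations that already have a node
for idx, row in df.iterrows():
    # Prefix the row index so that people who share a name get distinct IDs
    node_id = f"{idx}_{row.First} {row.Last}"
    place = row.Place
    nodes.append({"name": node_id, "color": "green", "place": place})
    # Skip rows that have no organization listed
    if pd.isnull(row.ORG) == False:
        orgs = row.ORG.split("|")
        for org in orgs:
            # Add each organization as a node only the first time we see it
            if org not in found_orgs:
                color = "#"+''.join([random.choice('0123456789ABCDEF') for j in range(6)])
                nodes.append({"name": org, "color": color})
                found_orgs.append(org)
            # Connect the organization to the individual
            edge_list.append({"source": org, "target": node_id, "place": place})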
print(nodes[:1])
print(edge_list[:1])
print(len(nodes))
[{'name': '0_Thabo Simon AARON', 'color': 'green', 'place': 'Bethulie'}]
[{'source': 'ANC', 'target': '0_Thabo Simon AARON', 'place': 'Bethulie'}]
1021

Let’s break this down. First, we create three separate lists that we will append to:

nodes = []
edge_list = []
found_orgs = []

The list nodes will contain all of the nodes for the graph. The list edge_list will contain all of the edges in our graph. The list found_orgs gives us an easy way to know which organizations have already been found, which prevents us from adding an organization to our nodes list more than once.
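As a possible variation, not the approach used in this section, the "have we seen this organization?" bookkeeping could be kept in a set rather than a list. Checking membership in a list scans it item by item, while a set lookup stays fast even with thousands of organizations. The short list of organization names below is made up for illustration, and the name org_nodes is likewise only used here.

```python
found_orgs = set()
org_nodes = []

# A made-up stream of organization names containing repeats
for org in ["ANC", "SAP", "ANC", "Police", "SAP"]:
    if org not in found_orgs:
        org_nodes.append({"name": org})
        found_orgs.add(org)

print([node["name"] for node in org_nodes])  # ['ANC', 'SAP', 'Police']
```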

Next, we begin iterating over our DataFrame and creating a unique node_id for each person. Remember, some individuals share the same first and last names, so we prefix each name with the row’s index to keep every ID unique. We also want to give each node some extra metadata, namely the place that is referenced in their description and the color green. This will keep all individuals in our graph the same node color.

for idx, row in df.iterrows():
    node_id = f"{idx}_{row.First} {row.Last}"
    place = row.Place
    nodes.append(({"name": node_id, "color": "green", "place": place}))
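To see why the index prefix matters, consider two rows with identical names. The names below are invented for this illustration and do not come from the TRC dataset. Without the prefix, both people would collapse into a single node once the data is loaded into a graph; with it, each row keeps its own ID.

```python
import pandas as pd

# Two hypothetical people who share a first and last name
people = pd.DataFrame({"First": ["Jane", "Jane"], "Last": ["DOE", "DOE"]})

ids_without_index = [f"{row.First} {row.Last}" for idx, row in people.iterrows()]
ids_with_index = [f"{idx}_{row.First} {row.Last}" for idx, row in people.iterrows()]

print(len(set(ids_without_index)))  # 1 distinct ID
print(len(set(ids_with_index)))     # 2 distinct IDs
```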

After this, we add each organization to the node list if it does not already appear there, and then add an edge between it and the individual to which it is connected.

    if pd.isnull(row.ORG) == False:
        orgs = row.ORG.split("|")
        for org in orgs:
            if org not in found_orgs:
                color = "#"+''.join([random.choice('0123456789ABCDEF') for j in range(6)])
                nodes.append({"name": org, "color": color})
                found_orgs.append(org)
            edge_list.append({"source": org, "target": node_id, "place": place})

Not all individuals, however, are connected to organizations. Sometimes, our ORG column is null. For this reason, we use the condition:

    if pd.isnull(row.ORG) == False:

This checks whether the ORG column is empty. If it is, we skip the row’s organizations. If not, we split the value on the | character into individual organization names and check whether each one is in found_orgs. If it is not, we assign it a random color and add it to our list of nodes. We then add it to found_orgs so that we do not add it again. Finally, whether or not the organization is new, we append an edge connecting it to the individual.
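For comparison, the edge list could also be sketched without an explicit loop, using Pandas to drop the rows with no organization, split the ORG string, and expand each organization onto its own row with explode. This is an alternative offered for illustration, not the method used in this section, and it builds only the edges, not the nodes. The DataFrame sample reuses two rows shown earlier so that the block is self-contained.

```python
import pandas as pd

# Two rows from the dataset: one with several organizations, one with none
sample = pd.DataFrame({
    "Last": ["AARON", "CELE"],
    "First": ["Thabo Simon", "Nozimpahla"],
    "ORG": ["ANC|ANCYL|Police|SAP", None],
    "Place": ["Bethulie", "Sonkombo"],
})

# Build the same index-prefixed ID used for the individual nodes
sample["target"] = [
    f"{idx}_{row.First} {row.Last}" for idx, row in sample.iterrows()
]

edges = (
    sample.dropna(subset=["ORG"])                          # skip rows with no ORG
    .assign(source=lambda d: d["ORG"].str.split("|"))      # string -> list of orgs
    .explode("source")                                     # one row per organization
    .rename(columns={"Place": "place"})[["source", "target", "place"]]
    .reset_index(drop=True)
)

print(edges)
```

The explicit loop remains easier to follow step by step, which is why it is a reasonable choice for a walkthrough like this one.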

Once we have created our data, we can save both the node list and the edge list to CSV files.

node_df = pd.DataFrame(nodes)
node_df.to_csv("../data/nodes.csv", index=False)
node_df.head(1)
name color place
0 0_Thabo Simon AARON green Bethulie
edge_df = pd.DataFrame(edge_list)
edge_df.to_csv("../data/edges.csv", index=False)
edge_df.head(1)
source target place
0 ANC 0_Thabo Simon AARON Bethulie
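One detail worth keeping in mind when these files are read back in: organization nodes were created without a place key, so their place value is empty in the CSV and comes back as NaN. The sketch below demonstrates this with a two-node list written to a temporary directory, so it does not touch the real files. The color "#1F77B4" is a placeholder standing in for whatever random color an organization happened to receive.

```python
import tempfile
from pathlib import Path

import pandas as pd

# One individual node (with a place) and one organization node (without)
nodes = [
    {"name": "0_Thabo Simon AARON", "color": "green", "place": "Bethulie"},
    {"name": "ANC", "color": "#1F77B4"},  # placeholder color
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "nodes.csv"
    pd.DataFrame(nodes).to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The organization row has no place after the round trip
print(reloaded["place"].isnull().tolist())  # [False, True]
```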