3.6. SNA on Humanities Data: Structuring the Data
In this section, we will apply what we have learned about SNA to real humanities data. Again, we will be working with the Volume 7 Dataset from the South African TRC. Unlike the last chapter on Topic Modeling, we will not be interested in how the descriptions of violence cluster together. Instead, we will explore how organizations relate to specific individuals as they appear in their descriptions. It is important to note that this approach does not tell us the nature of the relationship between a victim and an organization: the individual may have been a member of the organization or a victim of it. To understand these relationships, we would first need to use spaCy and a few NLP techniques to extract that information from the descriptions.
3.6.1. Examining the Data
First, let’s import our required libraries. We are only concerned with structuring the data in this section, so the only third-party library we need is Pandas. We will also import random because we want to assign a random color to each unique organization. This lets our approach scale if we were to add thousands of new organizations to our dataset, since no one has to pick colors by hand. For a polished, final version, one would want to manually assign a color to each organization so that the results are reproducible; random colors change on every run.
import pandas as pd
import random
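If reproducible colors matter before a manual palette exists, one possible alternative is to derive each color from the organization's name rather than from the random module. The sketch below is not part of the original workflow; the helper name color_for is hypothetical, and the choice of an MD5 digest is only an assumption made for illustration.

```python
import hashlib


def color_for(name: str) -> str:
    """Return a hex color derived from the name (hypothetical helper).

    The same name always yields the same color, across runs and
    regardless of the order in which organizations are encountered.
    """
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    # Use the first six hex digits of the digest as the RGB value.
    return "#" + digest[:6].upper()


print(color_for("ANC"))
print(color_for("SADF"))
```

The trade-off is that two organizations can occasionally land on similar colors, which a hand-picked palette would avoid.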
We will now load our data. We only need four of the columns: Last, First, ORG, and Place. We will also limit ourselves to the first 1,000 rows.
df = pd.read_csv("../data/trc.csv")
df = df[:1000]
df = df[["Last", "First","ORG", "Place"]]
df
 | Last | First | ORG | Place |
---|---|---|---|---|
0 | AARON | Thabo Simon | ANC|ANCYL|Police|SAP | Bethulie |
1 | ABBOTT | Montaigne | SADF | Messina |
2 | ABRAHAM | Nzaliseko Christopher | COSAS|Police | Mdantsane |
3 | ABRAHAMS | Achmat Fardiel | SAP | Athlone |
4 | ABRAHAMS | Annalene Mildred | Police|SAP | Robertson |
... | ... | ... | ... | ... |
995 | CELE | Nompumelelo Iris ‘Magwaza’ | ANC | Ndwedwe |
996 | CELE | Nomvula Eunice | ANC | Umbumbulu |
997 | CELE | Nonhlanhla Evelina | ANC | Umzimkulu |
998 | CELE | Nozimpahla | NaN | Sonkombo |
999 | DLAMINI | Cedric Bongani | ANC | Durban |
1000 rows × 4 columns
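Note that the ORG column packs several organizations into one string separated by the | character, and that it can be NaN. Before building the graph, it can help to see how often each organization occurs. The following is a minimal sketch on a four-row sample copied from the table above; the variable names sample and org_counts are illustrative only.

```python
import numpy as np
import pandas as pd

# Four rows copied from the table above, including one with no organization.
sample = pd.DataFrame({
    "Last": ["AARON", "ABBOTT", "ABRAHAM", "CELE"],
    "First": ["Thabo Simon", "Montaigne", "Nzaliseko Christopher", "Nozimpahla"],
    "ORG": ["ANC|ANCYL|Police|SAP", "SADF", "COSAS|Police", np.nan],
    "Place": ["Bethulie", "Messina", "Mdantsane", "Sonkombo"],
})

# Drop missing values, split on "|", give each organization its own row,
# then count how many individuals mention each organization.
org_counts = sample["ORG"].dropna().str.split("|").explode().value_counts()
print(org_counts)

# Number of individuals with no organization at all.
missing_orgs = int(sample["ORG"].isna().sum())
print(missing_orgs)
```

On the full DataFrame, the same two expressions could be applied to df["ORG"] directly.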
Now that we have our data, we can begin cleaning it and preparing it for inclusion in a NetworkX graph. Our goal is to create a list of nodes and a list of edges, which we will then store as two separate Pandas DataFrames. We will be able to use this data for graph creation in the next section. To do this, we will use the following code.
nodes = []
edge_list = []
found_orgs = []
for idx, row in df.iterrows():
    # One node per person; the row index keeps duplicate names distinct.
    node_id = f"{idx}_{row.First} {row.Last}"
    place = row.Place
    nodes.append(({"name": node_id, "color": "green", "place": place}))
    # Only rows with at least one organization produce edges.
    if pd.isnull(row.ORG) == False:
        orgs = row.ORG.split("|")
        for org in orgs:
            # Add each organization as a node only the first time we see it.
            if org not in found_orgs:
                color = "#"+''.join([random.choice('0123456789ABCDEF') for j in range(6)])
                nodes.append({"name": org, "color": color})
                found_orgs.append(org)
            # Every person-organization pair gets an edge.
            edge_list.append({"source": org, "target": node_id, "place": place})
print(nodes[:1])
print(edge_list[:1])
print(len(nodes))
[{'name': '0_Thabo Simon AARON', 'color': 'green', 'place': 'Bethulie'}]
[{'source': 'ANC', 'target': '0_Thabo Simon AARON', 'place': 'Bethulie'}]
1021
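The count of 1021 is the number of person nodes plus the number of distinct organization nodes. Since each of the 1,000 rows contributes exactly one person node, the remaining 21 should be organizations. One way to check this split is to rely on the fact that only person nodes carry a place key. The sketch below uses a tiny stand-in list; the hex colors are arbitrary placeholders, not values from the real run.

```python
# A tiny stand-in for the node list built above. The hex colors are
# arbitrary placeholders, not values produced by the real run.
nodes = [
    {"name": "0_Thabo Simon AARON", "color": "green", "place": "Bethulie"},
    {"name": "ANC", "color": "#1A2B3C"},
    {"name": "1_Montaigne ABBOTT", "color": "green", "place": "Messina"},
    {"name": "SADF", "color": "#4D5E6F"},
]

# Person nodes were created with a "place" key; organization nodes were not.
person_nodes = [n for n in nodes if "place" in n]
org_nodes = [n for n in nodes if "place" not in n]

print(len(person_nodes), len(org_nodes))
```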
Let’s break this down. First, we create three separate lists that we will append to:
nodes = []
edge_list = []
found_orgs = []
The list nodes will contain all the nodes for the graph. The list edge_list will contain all the edges in our graph. The list found_orgs gives us an easy way to know which organizations we have already seen, which prevents us from adding an organization to our nodes list more than once.
Next, we begin iterating over our DataFrame and creating a unique node_id for each person. Remember, some individuals share the same first and last names, so we prefix each name with the row index to keep the IDs unique. We also want to give each node some extra metadata, namely the place that is referenced in their description and the color green. This keeps all individuals in our graph the same node color.
for idx, row in df.iterrows():
    node_id = f"{idx}_{row.First} {row.Last}"
    place = row.Place
    nodes.append(({"name": node_id, "color": "green", "place": place}))
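To see why the index prefix matters, consider two rows with identical names. The short sketch below uses made-up placeholder names, not names from the TRC dataset, and compares IDs built with and without the index.

```python
import pandas as pd

# Made-up placeholder names, used only to illustrate a name collision.
people = pd.DataFrame({
    "Last": ["DOE", "DOE"],
    "First": ["Jane", "Jane"],
})

# Without the index, both rows collapse into the same identifier.
plain_ids = [f"{row.First} {row.Last}" for _, row in people.iterrows()]

# With the index prefix, each row keeps its own identifier.
indexed_ids = [f"{idx}_{row.First} {row.Last}" for idx, row in people.iterrows()]

print(len(set(plain_ids)))
print(len(set(indexed_ids)))
```

In a graph, the first scheme would silently merge two different people into a single node.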
After this, we add each organization to the node list if it is not already there, and then add an edge between it and the individual to whom it is connected.
    if pd.isnull(row.ORG) == False:
        orgs = row.ORG.split("|")
        for org in orgs:
            if org not in found_orgs:
                color = "#"+''.join([random.choice('0123456789ABCDEF') for j in range(6)])
                nodes.append({"name": org, "color": color})
                found_orgs.append(org)
            edge_list.append({"source": org, "target": node_id, "place": place})
Not all individuals, however, are connected to organizations. Sometimes the ORG column is null. For this reason, we use the condition:
if pd.isnull(row.ORG) == False:
This checks whether the ORG column is empty. If it is, we add no organizations or edges for that person. If not, we split the value on the | character into individual organization names and check whether each one is in found_orgs. If it is not, we assign it a random color, add it to our list of nodes, and add it to found_orgs so that we do not add it again. Finally, whether or not the organization is new, we append an edge from the organization to the individual.
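The found_orgs list works well at this scale. As a variation rather than the approach used in this section, the same bookkeeping can be done with a dictionary that maps each organization to its color, which makes the membership check fast and keeps the colors available for later lookup. The sketch below runs on three rows copied from the table shown earlier; the names sample and org_colors are illustrative only.

```python
import random

import numpy as np
import pandas as pd

# Three rows copied from the table shown earlier.
sample = pd.DataFrame({
    "Last": ["AARON", "ABRAHAM", "CELE"],
    "First": ["Thabo Simon", "Nzaliseko Christopher", "Nozimpahla"],
    "ORG": ["ANC|ANCYL|Police|SAP", "COSAS|Police", np.nan],
    "Place": ["Bethulie", "Mdantsane", "Sonkombo"],
})

nodes = []
edge_list = []
org_colors = {}  # organization name -> hex color (stands in for found_orgs)

for idx, row in sample.iterrows():
    node_id = f"{idx}_{row.First} {row.Last}"
    nodes.append({"name": node_id, "color": "green", "place": row.Place})
    if pd.isnull(row.ORG):
        continue  # person node only; no organizations, no edges
    for org in row.ORG.split("|"):
        if org not in org_colors:
            # First sighting: pick a random color and create the node.
            org_colors[org] = "#" + "".join(
                random.choice("0123456789ABCDEF") for _ in range(6)
            )
            nodes.append({"name": org, "color": org_colors[org]})
        # Every person-organization pair gets an edge.
        edge_list.append({"source": org, "target": node_id, "place": row.Place})

print(len(nodes), len(edge_list), len(org_colors))
```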
Once we have created our data, we can save both the node list and the edge list to CSV files.
node_df = pd.DataFrame(nodes)
node_df.to_csv("../data/nodes.csv", index=False)
node_df.head(1)
 | name | color | place |
---|---|---|---|
0 | 0_Thabo Simon AARON | green | Bethulie |
edge_df = pd.DataFrame(edge_list)
edge_df.to_csv("../data/edges.csv", index=False)
edge_df.head(1)
 | source | target | place |
---|---|---|---|
0 | ANC | 0_Thabo Simon AARON | Bethulie |
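Before moving on to graph creation, it may be worth confirming that the two files agree with each other, that is, that every edge endpoint appears in the node list. The sketch below is an optional sanity check, not part of the original workflow. It writes a two-node example to a temporary directory instead of ../data, and the hex color is an arbitrary placeholder.

```python
import os
import tempfile

import pandas as pd

# Minimal stand-ins for node_df and edge_df; the hex color is a placeholder.
node_df = pd.DataFrame([
    {"name": "0_Thabo Simon AARON", "color": "green", "place": "Bethulie"},
    {"name": "ANC", "color": "#1A2B3C"},
])
edge_df = pd.DataFrame([
    {"source": "ANC", "target": "0_Thabo Simon AARON", "place": "Bethulie"},
])

# Round-trip through CSV in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    node_path = os.path.join(tmp, "nodes.csv")
    edge_path = os.path.join(tmp, "edges.csv")
    node_df.to_csv(node_path, index=False)
    edge_df.to_csv(edge_path, index=False)
    nodes_back = pd.read_csv(node_path)
    edges_back = pd.read_csv(edge_path)

# Every source and target in the edge list should be a known node name.
known = set(nodes_back["name"])
sources_ok = edges_back["source"].isin(known).all()
targets_ok = edges_back["target"].isin(known).all()
print(sources_ok, targets_ok)
```

Organization nodes have no place value, so that column comes back as NaN for them after the round trip.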