{
"cells": [
{
"cell_type": "markdown",
"id": "b899b5cb-e954-4a20-a346-1ba90b702b1a",
"metadata": {},
"source": [
"# SNA on Humanities Data: Structuring the Data"
]
},
{
"cell_type": "markdown",
"id": "0f4c2208-374c-471c-be80-6b57b2ec382f",
"metadata": {},
"source": [
"In this section, we will apply the skills we have learned about SNA on real humanities data. Again, we will be working with the Volume 7 Dataset from the South African TRC. Unlike the last chapter on Topic Modeling, we will not be interested in how the descriptions of violence cluster together. Instead, we will be interested in exploring how organizations relate to specific individuals as they appear in their descriptions. It is important to note here, in this approach we do not know the relationship between the victim and the organization. There is equal possibility that they were a member or victim of the organization. To understand these relationships, we would want to use spaCy and a few NLP techniques to extract meaning about these relationships first."
]
},
{
"cell_type": "markdown",
"id": "f1e8b360-4d50-4934-bf3f-8328ee6430d4",
"metadata": {},
"source": [
"## Examining the Data"
]
},
{
"cell_type": "markdown",
"id": "8b4f48a0-96bf-4608-95dc-b8ba6a007a12",
"metadata": {},
"source": [
"First, let's import our required libraries. We will only be concerned with structuring the data in this section, so we will only import Pandas. We will also import random because we want to assign a random color to each unique organization. This ensures that our approach scales quickly if we were to add thousands of new organizations into our dataset. For a polished, final version, one would want to manually assign a color to each organization so that the results would be more reproducible."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "d9d7daed-5d39-4e5c-9d62-63c88cbf0ff3",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import random"
]
},
{
"cell_type": "markdown",
"id": "48df6644-fe1b-4b1c-a2dc-0d2a59d28a34",
"metadata": {},
"source": [
"We will now load our data. We only need four of the columns: `Last`, `First`, `ORG`, `Place`"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "84b5d73c-f9c1-4af0-99e9-1a1d4cd93e1f",
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"../data/trc.csv\")\n",
"df = df[:1000]\n",
"df = df[[\"Last\", \"First\",\"ORG\", \"Place\"]]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "c63387f8-4a37-47e6-9ff5-3820ca68cd1c",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Last | \n",
" First | \n",
" ORG | \n",
" Place | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" AARON | \n",
" Thabo Simon | \n",
" ANC|ANCYL|Police|SAP | \n",
" Bethulie | \n",
"
\n",
" \n",
" 1 | \n",
" ABBOTT | \n",
" Montaigne | \n",
" SADF | \n",
" Messina | \n",
"
\n",
" \n",
" 2 | \n",
" ABRAHAM | \n",
" Nzaliseko Christopher | \n",
" COSAS|Police | \n",
" Mdantsane | \n",
"
\n",
" \n",
" 3 | \n",
" ABRAHAMS | \n",
" Achmat Fardiel | \n",
" SAP | \n",
" Athlone | \n",
"
\n",
" \n",
" 4 | \n",
" ABRAHAMS | \n",
" Annalene Mildred | \n",
" Police|SAP | \n",
" Robertson | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 995 | \n",
" CELE | \n",
" Nompumelelo Iris ‘Magwaza’ | \n",
" ANC | \n",
" Ndwedwe | \n",
"
\n",
" \n",
" 996 | \n",
" CELE | \n",
" Nomvula Eunice | \n",
" ANC | \n",
" Umbumbulu | \n",
"
\n",
" \n",
" 997 | \n",
" CELE | \n",
" Nonhlanhla Evelina | \n",
" ANC | \n",
" Umzimkulu | \n",
"
\n",
" \n",
" 998 | \n",
" CELE | \n",
" Nozimpahla | \n",
" NaN | \n",
" Sonkombo | \n",
"
\n",
" \n",
" 999 | \n",
" DLAMINI | \n",
" Cedric Bongani | \n",
" ANC | \n",
" Durban | \n",
"
\n",
" \n",
"
\n",
"
1000 rows × 4 columns
\n",
"
"
],
"text/plain": [
" Last First ORG Place\n",
"0 AARON Thabo Simon ANC|ANCYL|Police|SAP Bethulie\n",
"1 ABBOTT Montaigne SADF Messina\n",
"2 ABRAHAM Nzaliseko Christopher COSAS|Police Mdantsane\n",
"3 ABRAHAMS Achmat Fardiel SAP Athlone\n",
"4 ABRAHAMS Annalene Mildred Police|SAP Robertson\n",
".. ... ... ... ...\n",
"995 CELE Nompumelelo Iris ‘Magwaza’ ANC Ndwedwe\n",
"996 CELE Nomvula Eunice ANC Umbumbulu\n",
"997 CELE Nonhlanhla Evelina ANC Umzimkulu\n",
"998 CELE Nozimpahla NaN Sonkombo\n",
"999 DLAMINI Cedric Bongani ANC Durban\n",
"\n",
"[1000 rows x 4 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"id": "d9100d57-c02d-4052-86f9-740f569e2560",
"metadata": {},
"source": [
"Now that we have our data, we can begin clean it and prepare it for inclusion in a NetworkX graph. Our goal is to create a list of nodes and edges separately which we will then store as two separate Pandas DataFrames. We will then be able to use this data for graph creation in the next section. To do this, we will use the following code."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "15639064-4851-4ebf-a9e5-d8656d6804e9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[{'name': '0_Thabo Simon AARON', 'color': 'green', 'place': 'Bethulie'}]\n",
"[{'source': 'ANC', 'target': '0_Thabo Simon AARON', 'place': 'Bethulie'}]\n",
"1021\n"
]
}
],
"source": [
"nodes = []\n",
"edge_list = []\n",
"found_orgs = []\n",
"for idx, row in df.iterrows():\n",
" node_id = f\"{idx}_{row.First} {row.Last}\"\n",
" place = row.Place\n",
" nodes.append(({\"name\": node_id, \"color\": \"green\", \"place\": place}))\n",
" if pd.isnull(row.ORG) == False:\n",
" orgs = row.ORG.split(\"|\")\n",
" for org in orgs:\n",
" if org not in found_orgs:\n",
" color = \"#\"+''.join([random.choice('0123456789ABCDEF') for j in range(6)])\n",
" nodes.append({\"name\": org, \"color\": color})\n",
" found_orgs.append(org)\n",
" edge_list.append({\"source\": org, \"target\": node_id, \"place\": place})\n",
"print(nodes[:1])\n",
"print(edge_list[:1])\n",
"print(len(nodes))"
]
},
{
"cell_type": "markdown",
"id": "fba1dc12-2518-41f8-9953-ea3060f8c7a6",
"metadata": {},
"source": [
"Let's break this down. First, we create three separate lists that we will append to:\n",
"\n",
"```\n",
"nodes = []\n",
"edge_list = []\n",
"found_orgs = []\n",
"```\n",
"\n",
"The list `nodes` will contain a list of all nodes for the graph. The list `edge_list` will contain all the edges in our graph. The list `found_orgs` will be an easy way to know which organizations have already been found. This is to prevent us from adding an organization into our nodes list more than once.\n",
"\n",
"Next, we begin iterating over our DataFrame and creating a unique `node_id` for each person. Remember, some individuals have the same first and last names. This means we need to assign a unique number to their name as well. We also want to give each node some extra metadata, namely the `place` that is referenced in their description and the `color` of `green`. This will keep all individuals in our graph the same node color.\n",
"\n",
"```\n",
"for idx, row in df.iterrows():\n",
" node_id = f\"{idx}_{row.First} {row.Last}\"\n",
" place = row.Place\n",
" nodes.append(({\"name\": node_id, \"color\": \"green\", \"place\": place}))\n",
"```\n",
"\n",
"After this, we want to then add each organization to the node list if it does not appear already in there and then add an edge between it and the individual to which it is connected.\n",
"\n",
"```\n",
" if pd.isnull(row.ORG) == False:\n",
" orgs = row.ORG.split(\"|\")\n",
" for org in orgs:\n",
" if org not in found_orgs:\n",
" color = \"#\"+''.join([random.choice('0123456789ABCDEF') for j in range(6)])\n",
" nodes.append({\"name\": org, \"color\": color})\n",
" found_orgs.append(org)\n",
" edge_list.append({\"source\": org, \"target\": node_id, \"place\": place})\n",
"```\n",
"\n",
"Not all individual's, however, are connected to organizations. Sometimes, our `ORG` column is null. For this reason, we use the condition:\n",
"\n",
"```\n",
" if pd.isnull(row.ORG) == False:\n",
"```\n",
"\n",
"This checks to see if the `ORG` column is empty. If it is, we ignore it. If not, then we split up all the organizations into individual strings and then check to see if is in `found_orgs`. If not, then we assign it a random color and add it to our list of nodes. Next, we add it to `found_orgs` so that we do not add it again.\n",
"\n",
"Once we have created our data, we can save each the node list and the edge list to CSV files."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "3120e3be-270b-4fc8-8290-6f34ab6fb304",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" name | \n",
" color | \n",
" place | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0_Thabo Simon AARON | \n",
" green | \n",
" Bethulie | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" name color place\n",
"0 0_Thabo Simon AARON green Bethulie"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"node_df = pd.DataFrame(nodes)\n",
"node_df.to_csv(\"../data/nodes.csv\", index=False)\n",
"node_df.head(1)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "d6eea2ea-5568-402b-b450-94bfa030bc2c",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" source | \n",
" target | \n",
" place | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" ANC | \n",
" 0_Thabo Simon AARON | \n",
" Bethulie | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" source target place\n",
"0 ANC 0_Thabo Simon AARON Bethulie"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"edge_df = pd.DataFrame(edge_list)\n",
"edge_df.to_csv(\"../data/edges.csv\", index=False)\n",
"edge_df.head(1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b09ce55c-4a5f-4b67-907b-7b08b0ab8ded",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "2440ed66-64b9-4a34-893b-4280d3394e8c",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"id": "e6d288e8-758b-49d8-8712-ec960d0fee98",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}