{ "cells": [ { "cell_type": "markdown", "id": "dominant-celebration", "metadata": {}, "source": [ "# Events Analysis" ] }, { "cell_type": "markdown", "id": "075538b3-d372-4025-b0d6-808882086c0b", "metadata": {}, "source": [ "The only output file that details event data is the .tokens file. As a result, this file will be the focus of this chapter. Each section of this chapter will analyze the .tokens file in greater depth to identify and extract event data. At the end of the chapter, we will bring everything together with a single function that can recreate these results on any BookNLP output .tokens file." ] }, { "cell_type": "markdown", "id": "dependent-greeting", "metadata": {}, "source": [ "## Exploring the Tokens File" ] }, { "cell_type": "markdown", "id": "30d6d992-ae58-4a43-b28c-ad5c95121671", "metadata": {}, "source": [ "Let's first open up the .tokens file and take a look at it so we can remember precisely what this .tsv file looks like. If you remember from chapter 3, we can analyze these files more easily if we use pandas, a tabular data analysis library in Python." ] }, { "cell_type": "code", "execution_count": 8, "id": "cultural-medicine", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
paragraph_IDsentence_IDtoken_ID_within_sentencetoken_ID_within_documentwordlemmabyte_onsetbyte_offsetPOS_tagfine_POS_tagdependency_relationsyntactic_head_IDevent
00000Mr.Mr.03PROPNNNPnmod3O
10011andand47CCONJCCcc0O
20022Mrs.Mrs.812PROPNNNPcompound3O
30033DursleyDursley1320PROPNNNPnsubj12O
40044,,2021PUNCT,punct3O
..........................................
99251299561721099251DudleyDudley438929438935PROPNNNPpobj99250O
99252299561721199252thisthis438936438940DETDTdet99253O
99253299561721299253summersummer438941438947NOUNNNnpadvmod99245O
99254299561721399254........438947438951PUNCT.punct99243O
99255299561721499255\\t438951438952PUNCT''punct99243ONaN
\n", "

99256 rows × 13 columns

\n", "
" ], "text/plain": [ " paragraph_ID sentence_ID token_ID_within_sentence \\\n", "0 0 0 0 \n", "1 0 0 1 \n", "2 0 0 2 \n", "3 0 0 3 \n", "4 0 0 4 \n", "... ... ... ... \n", "99251 2995 6172 10 \n", "99252 2995 6172 11 \n", "99253 2995 6172 12 \n", "99254 2995 6172 13 \n", "99255 2995 6172 14 \n", "\n", " token_ID_within_document word lemma byte_onset byte_offset \\\n", "0 0 Mr. Mr. 0 3 \n", "1 1 and and 4 7 \n", "2 2 Mrs. Mrs. 8 12 \n", "3 3 Dursley Dursley 13 20 \n", "4 4 , , 20 21 \n", "... ... ... ... ... ... \n", "99251 99251 Dudley Dudley 438929 438935 \n", "99252 99252 this this 438936 438940 \n", "99253 99253 summer summer 438941 438947 \n", "99254 99254 .... .... 438947 438951 \n", "99255 99255 \\t 438951 438952 PUNCT \n", "\n", " POS_tag fine_POS_tag dependency_relation syntactic_head_ID event \n", "0 PROPN NNP nmod 3 O \n", "1 CCONJ CC cc 0 O \n", "2 PROPN NNP compound 3 O \n", "3 PROPN NNP nsubj 12 O \n", "4 PUNCT , punct 3 O \n", "... ... ... ... ... ... \n", "99251 PROPN NNP pobj 99250 O \n", "99252 DET DT det 99253 O \n", "99253 NOUN NN npadvmod 99245 O \n", "99254 PUNCT . punct 99243 O \n", "99255 '' punct 99243 O NaN \n", "\n", "[99256 rows x 13 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "df = pd.read_csv(\"data/harry_potter/harry_potter.tokens\", delimiter=\"\\t\")\n", "df" ] }, { "cell_type": "markdown", "id": "ethical-tourist", "metadata": {}, "source": [ "We have approximately 99,000 rows and 13 columns of data. Throughout this chapter, we will focus on only four columns in particular:\n", "\n", "- sentence_ID\n", "- word\n", "- lemma\n", "- event\n", "\n", "As such, let's go ahead and remove all the extra data for now so that we can just view the columns we care about." 
] }, { "cell_type": "code", "execution_count": null, "id": "86db149b-10d1-4ae3-bf5f-b4acdd904077", "metadata": {}, "outputs": [], "source": [ "df = df[[\"sentence_ID\", \"word\", \"lemma\", \"event\"]]\n", "df" ] }, { "cell_type": "markdown", "id": "7c0324bf-5c65-4c77-94d8-5b9bb5c6d815", "metadata": {}, "source": [ "Excellent! Now we can analyze the event column a bit more easily." ] }, { "cell_type": "markdown", "id": "b2615c17-bdd0-450c-9e25-ee1a105d69e3", "metadata": {}, "source": [ "## Grabbing the Events" ] }, { "cell_type": "markdown", "id": "e1f476a5-4633-40a6-aa55-fc323f2e9052", "metadata": {}, "source": [ "One of the things we can see above is that some rows contain NaN in the event column. Ideally, we want to ignore these rows entirely. We can do this in pandas by combining the isnull() method with the ~ (negation) operator, keeping only the rows whose event value is not null." ] }, { "cell_type": "code", "execution_count": 57, "id": "secure-radiation", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sentence_IDwordlemmaevent
00Mr.Mr.O
10andandO
20Mrs.Mrs.O
30DursleyDursleyO
40,,O
...............
992506172withwithO
992516172DudleyDudleyO
992526172thisthisO
992536172summersummerO
992546172........O
\n", "

94498 rows × 4 columns

\n", "
" ], "text/plain": [ " sentence_ID word lemma event\n", "0 0 Mr. Mr. O\n", "1 0 and and O\n", "2 0 Mrs. Mrs. O\n", "3 0 Dursley Dursley O\n", "4 0 , , O\n", "... ... ... ... ...\n", "99250 6172 with with O\n", "99251 6172 Dudley Dudley O\n", "99252 6172 this this O\n", "99253 6172 summer summer O\n", "99254 6172 .... .... O\n", "\n", "[94498 rows x 4 columns]" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "events = df[~df['event'].isnull()]\n", "events" ] }, { "cell_type": "markdown", "id": "dedfaeec-a371-449b-b830-4245d5b6f0c5", "metadata": {}, "source": [ "As we can see, this eliminated roughly 5,000 rows. Let's take a closer look at the event column and see what kind of data we can expect to find in it." ] }, { "cell_type": "code", "execution_count": 12, "id": "f2e3507b-01bc-4a08-8287-3a52d3feb37b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'EVENT', 'O'}\n" ] } ], "source": [ "event_options = set(events.event.tolist())\n", "print(event_options)" ] }, { "cell_type": "markdown", "id": "palestinian-movement", "metadata": {}, "source": [ "By converting this column to a list and then to a set (which eliminates the duplicates), we can see that we have two types of data in the event column:\n", "\n", "- EVENT\n", "- O\n", "\n", "If a row has \"EVENT\" in the column, it means the corresponding word was identified by the BookNLP pipeline as an event-triggering word. Now that we know this, let's take a look at only the rows that have EVENT in the event column." ] }, { "cell_type": "code", "execution_count": 58, "id": "db43a6d1-20ab-4028-a8ed-5e3912713b26", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sentence_IDwordlemmaevent
2429shudderedshudderEVENT
30812wokewakeEVENT
34613hummedhumEVENT
34913pickedpickEVENT
36113gossipedgossipEVENT
...............
991526167hunghangEVENT
991856169saidsayEVENT
992096170saidsayEVENT
992156170surprisedsurprisedEVENT
992186170gringrinEVENT
\n", "

6029 rows × 4 columns

\n", "
" ], "text/plain": [ " sentence_ID word lemma event\n", "242 9 shuddered shudder EVENT\n", "308 12 woke wake EVENT\n", "346 13 hummed hum EVENT\n", "349 13 picked pick EVENT\n", "361 13 gossiped gossip EVENT\n", "... ... ... ... ...\n", "99152 6167 hung hang EVENT\n", "99185 6169 said say EVENT\n", "99209 6170 said say EVENT\n", "99215 6170 surprised surprised EVENT\n", "99218 6170 grin grin EVENT\n", "\n", "[6029 rows x 4 columns]" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "real_events = events.loc[events[\"event\"] == \"EVENT\"]\n", "real_events" ] }, { "cell_type": "markdown", "id": "fe03c383-35f1-4488-b4e1-e588a3cfb8ed", "metadata": {}, "source": [ "We now have only 6,029 rows to analyze!" ] }, { "cell_type": "markdown", "id": "previous-physiology", "metadata": {}, "source": [ "## Analyzing Event Words and Lemmas" ] }, { "cell_type": "markdown", "id": "black-frame", "metadata": {}, "source": [ "Let's dig a little deeper and analyze the words and lemmas of these rows to see how many unique values of each we have."
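] }, { "cell_type": "markdown", "id": "aside-nunique-md", "metadata": {}, "source": [ "As a quick aside, pandas can also count distinct values directly with the nunique() method. This is only a sketch of an alternative; the next cells compute the same counts with Python sets, which also give us the values themselves, not just the counts." ] }, { "cell_type": "code", "execution_count": null, "id": "aside-nunique-code", "metadata": {}, "outputs": [], "source": [ "# nunique() counts the distinct values in a column without building a list first\n", "print(real_events.word.nunique())\n", "print(real_events.lemma.nunique())" ] }, { "cell_type": "markdown", "id": "aside-nunique-note", "metadata": {}, "source": [ "Both numbers should match the set-based counts below."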
] }, { "cell_type": "code", "execution_count": 59, "id": "animal-governor", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1501" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "event_words = set(real_events.word.tolist())\n", "len(event_words)" ] }, { "cell_type": "code", "execution_count": 60, "id": "3bfb56e0-4290-4368-b1e6-126229788675", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1021" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "event_lemmas = list(set(real_events.lemma.tolist()))\n", "event_lemmas.sort()\n", "len(event_lemmas)" ] }, { "cell_type": "code", "execution_count": 61, "id": "b578c827-81c6-48b8-be34-5487a0756e2c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['BOOM', 'Bludger', 'Pompously', 'Scowling', 'Smelting', 'Whispers', 'aback', 'accept', 'ache', 'act']\n" ] } ], "source": [ "print(event_lemmas[:10])" ] }, { "cell_type": "markdown", "id": "bcf20a20-e739-4004-8314-5b09672e2b08", "metadata": {}, "source": [ "While we have 1501 unique words, we only have 1021 unique lemmas. If we were interested in seeing which event words and lemmas appear in Harry Potter, we could now do that. One thing I notice quickly, though, is that some lemmas are capitalized. Let's eliminate these remaining duplicates by lowercasing all lemmas."
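] }, { "cell_type": "markdown", "id": "aside-lowercase-md", "metadata": {}, "source": [ "A compact alternative, sketched below, is a set comprehension: lowercase every lemma and let the set drop the duplicates. Note that sorted() returns an alphabetical ordering, whereas the loop in the next cell preserves the original order of event_lemmas." ] }, { "cell_type": "code", "execution_count": null, "id": "aside-lowercase-code", "metadata": {}, "outputs": [], "source": [ "# Lowercase each lemma; the set drops duplicates, sorted() gives a stable order\n", "lowered_lemmas = sorted({lemma.lower() for lemma in event_lemmas})\n", "len(lowered_lemmas)" ] }, { "cell_type": "markdown", "id": "aside-lowercase-note", "metadata": {}, "source": [ "The length matches the loop-based result in the next cell: 1020."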
] }, { "cell_type": "code", "execution_count": 62, "id": "bb396dd1-abe5-4307-b6c0-cfdddc2a7b34", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1020\n", "['boom', 'bludger', 'pompously', 'scowling', 'smelting', 'whispers', 'aback', 'accept', 'ache', 'act']\n" ] } ], "source": [ "final_lemmas = []\n", "for lemma in event_lemmas:\n", "    lemma = lemma.lower()\n", "    if lemma not in final_lemmas:\n", "        final_lemmas.append(lemma)\n", "\n", "print(len(final_lemmas))\n", "print(final_lemmas[:10])" ] }, { "cell_type": "markdown", "id": "513694a5-aad7-4c10-8e05-9d29079a554d", "metadata": {}, "source": [ "We eliminated only one duplicate." ] }, { "cell_type": "markdown", "id": "floral-antique", "metadata": {}, "source": [ "## Grabbing Event Sentences" ] }, { "cell_type": "markdown", "id": "hybrid-milwaukee", "metadata": {}, "source": [ "Now that we know how to grab individual event-triggering words, what about the sentences that contain events? To analyze this, we can use the sentence_ID column, which contains a unique number for each sentence." ] }, { "cell_type": "code", "execution_count": 63, "id": "plain-chester", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[9, 12, 13, 13, 13, 13, 13, 14, 15, 15]\n", "['shuddered', 'woke', 'hummed', 'picked', 'gossiped', 'wrestled', 'screaming', 'flutter', 'picked', 'pecked']\n" ] } ], "source": [ "sentences = real_events.sentence_ID.tolist()\n", "event_word_list = real_events.word.tolist()\n", "print(sentences[:10])\n", "print(event_word_list[:10])" ] }, { "cell_type": "markdown", "id": "critical-capital", "metadata": {}, "source": [ "We can see that some sentences appear multiple times. This is because they contain multiple event-triggering words.\n", "\n", "Let's take a look at our initial DataFrame once again." ] }, { "cell_type": "code", "execution_count": 64, "id": "fantastic-provincial", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sentence_IDwordlemmaevent
00Mr.Mr.O
10andandO
20Mrs.Mrs.O
30DursleyDursleyO
40,,O
...............
992516172DudleyDudleyO
992526172thisthisO
992536172summersummerO
992546172........O
992556172\\t438951NaN
\n", "

99256 rows × 4 columns

\n", "
" ], "text/plain": [ " sentence_ID word lemma event\n", "0 0 Mr. Mr. O\n", "1 0 and and O\n", "2 0 Mrs. Mrs. O\n", "3 0 Dursley Dursley O\n", "4 0 , , O\n", "... ... ... ... ...\n", "99251 6172 Dudley Dudley O\n", "99252 6172 this this O\n", "99253 6172 summer summer O\n", "99254 6172 .... .... O\n", "99255 6172 \\t 438951 NaN\n", "\n", "[99256 rows x 4 columns]" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "id": "parallel-element", "metadata": {}, "source": [ "Let's say we are interested in grabbing the sentence that contains the first event. We can grab all rows that have a matching sentence_ID." ] }, { "cell_type": "code", "execution_count": 65, "id": "headed-dollar", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sentence_IDwordlemmaevent
2409ThetheO
2419DursleysDursleysO
2429shudderedshudderEVENT
2439totoO
2449thinkthinkO
2459whatwhatO
2469thetheO
2479neighborsneighborO
2489wouldwouldO
2499saysayO
2509ififO
2519thetheO
2529PottersPottersO
2539arrivedarriveO
2549ininO
2559thetheO
2569streetstreetO
2579..O
\n", "
" ], "text/plain": [ " sentence_ID word lemma event\n", "240 9 The the O\n", "241 9 Dursleys Dursleys O\n", "242 9 shuddered shudder EVENT\n", "243 9 to to O\n", "244 9 think think O\n", "245 9 what what O\n", "246 9 the the O\n", "247 9 neighbors neighbor O\n", "248 9 would would O\n", "249 9 say say O\n", "250 9 if if O\n", "251 9 the the O\n", "252 9 Potters Potters O\n", "253 9 arrived arrive O\n", "254 9 in in O\n", "255 9 the the O\n", "256 9 street street O\n", "257 9 . . O" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentence1 = sentences[0]\n", "result = df[df[\"sentence_ID\"] == sentence1]\n", "result" ] }, { "cell_type": "markdown", "id": "747260e2-3d9a-4762-a997-829681692573", "metadata": {}, "source": [ "With this data, we can then grab all the words and reconstruct the sentence." ] }, { "cell_type": "code", "execution_count": 46, "id": "convertible-arrow", "metadata": {}, "outputs": [], "source": [ "words = result.word.tolist()\n", "resentence = \" \".join(words)" ] }, { "cell_type": "code", "execution_count": 47, "id": "ongoing-determination", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Dursleys shuddered to think what the neighbors would say if the Potters arrived in the street .\n" ] } ], "source": [ "print(resentence)" ] }, { "cell_type": "markdown", "id": "dcaeb98c-844b-4ce1-99cc-5f76be4adb1f", "metadata": {}, "source": [ "## Bringing Everything Together" ] }, { "cell_type": "markdown", "id": "c8de1569-cd7c-400b-9dbd-f3fde0eed7fd", "metadata": {}, "source": [ "Let's now bring together everything we just learned in this chapter and turn it into a function. This function receives the path to a .tokens file. It finds the relevant event rows and then reconstructs the sentence that corresponds to each event word. The output is a list of event-centric dictionaries. 
Each dictionary will have three keys:\n", "\n", "- event_word = the event-triggering word\n", "- event_lemma = the lemma of event_word\n", "- sentence = the sentence that contains the event-triggering word" ] }, { "cell_type": "code", "execution_count": 53, "id": "de4726b0-167c-48ad-ae37-4a6ed18f4649", "metadata": {}, "outputs": [], "source": [ "def grab_event_sentences(file):\n", "    df = pd.read_csv(file, delimiter=\"\\t\")\n", "    real_events = df.loc[df[\"event\"] == \"EVENT\"]\n", "    final_sentences = []\n", "    for sentence_id, word, lemma in zip(real_events.sentence_ID,\n", "                                        real_events.word,\n", "                                        real_events.lemma):\n", "        # Rebuild the full sentence from every token that shares this sentence_ID\n", "        result = df[df[\"sentence_ID\"] == sentence_id]\n", "        resentence = \" \".join(result.word.tolist())\n", "        final_sentences.append({\"event_word\": word,\n", "                                \"event_lemma\": lemma,\n", "                                \"sentence\": resentence})\n", "    return final_sentences\n", "\n", "event_data = grab_event_sentences(\"data/harry_potter/harry_potter.tokens\")" ] }, { "cell_type": "markdown", "id": "15095738-6b42-4671-bca0-ff91fb5fd095", "metadata": {}, "source": [ "Let's take a look at the output now." ] }, { "cell_type": "code", "execution_count": 54, "id": "a51d0ea6-2444-4c44-af10-4050fbbe1ff8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'event_word': 'shuddered', 'event_lemma': 'shudder', 'sentence': 'The Dursleys shuddered to think what the neighbors would say if the Potters arrived in the street .'}\n" ] } ], "source": [ "print(event_data[0])" ] }, { "cell_type": "markdown", "id": "c824f1bf-7065-4309-a151-fddc5f8e48ae", "metadata": {}, "source": [ "## Creating a .events File" ] }, { "cell_type": "markdown", "id": "de2211e1-39d2-48e9-bd92-4d7ee22b17a0", "metadata": {}, "source": [ "This allows us to analyze the events identified by the BookNLP pipeline a bit more easily. 
Since BookNLP does not produce a .events output file, building this event-centric output ourselves is one way to simulate one. With this data, we can now create a new DataFrame." ] }, { "cell_type": "code", "execution_count": 66, "id": "d0ecc2d9-e4e9-44a2-9337-74e498a7a9d7", "metadata": {}, "outputs": [], "source": [ "new_df = pd.DataFrame(event_data)" ] }, { "cell_type": "code", "execution_count": 67, "id": "393e714b-6e09-4eb4-89fc-7c2ae7fc442f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
event_wordevent_lemmasentence
0shudderedshudderThe Dursleys shuddered to think what the neigh...
1wokewakeWhen Mr. and Mrs. Dursley woke up on the dull ...
2hummedhumMr. Dursley hummed as he picked out his most b...
3pickedpickMr. Dursley hummed as he picked out his most b...
4gossipedgossipMr. Dursley hummed as he picked out his most b...
............
6024hunghangHarry hung back for a last word with Ron and H...
6025saidsay\\t \\t Hope you have -- er -- a good holiday , ...
6026saidsay\\t Oh , I will , \\t said Harry , and they were...
6027surprisedsurprised\\t Oh , I will , \\t said Harry , and they were...
6028gringrin\\t Oh , I will , \\t said Harry , and they were...
\n", "

6029 rows × 3 columns

\n", "
" ], "text/plain": [ " event_word event_lemma sentence\n", "0 shuddered shudder The Dursleys shuddered to think what the neigh...\n", "1 woke wake When Mr. and Mrs. Dursley woke up on the dull ...\n", "2 hummed hum Mr. Dursley hummed as he picked out his most b...\n", "3 picked pick Mr. Dursley hummed as he picked out his most b...\n", "4 gossiped gossip Mr. Dursley hummed as he picked out his most b...\n", "... ... ... ...\n", "6024 hung hang Harry hung back for a last word with Ron and H...\n", "6025 said say \\t \\t Hope you have -- er -- a good holiday , ...\n", "6026 said say \\t Oh , I will , \\t said Harry , and they were...\n", "6027 surprised surprised \\t Oh , I will , \\t said Harry , and they were...\n", "6028 grin grin \\t Oh , I will , \\t said Harry , and they were...\n", "\n", "[6029 rows x 3 columns]" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_df" ] }, { "cell_type": "markdown", "id": "e8e7f25f-5b27-4145-9a33-db6202a81e2c", "metadata": {}, "source": [ "We can also write it out to the same subdirectory as the other BookNLP output files." ] }, { "cell_type": "code", "execution_count": 70, "id": "984b898a-0a19-4f24-a6e2-af0f4652bccc", "metadata": {}, "outputs": [], "source": [ "new_df.to_csv(\"data/harry_potter/harry_potter.events\", index=False)" ] }, { "cell_type": "markdown", "id": "d53e481f-aaa4-4f0e-aa5a-2d3ec07e8e7d", "metadata": {}, "source": [ "And now you have a .events file!" ] }, { "cell_type": "markdown", "id": "c586cbe3-d4f8-4b55-8bc1-074979c21d38", "metadata": {}, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "id": "8e1914f2-1ba2-4ec7-ac6c-52557eaae17f", "metadata": {}, "source": [ "You should now have a basic understanding of BookNLP and what it can do. While the results will not be perfect, they will give you a great starting point for identifying the salient characters, extracting quotes, and finding the major events within a large work of fiction."
] }, { "cell_type": "code", "execution_count": null, "id": "99000543-38e7-48c5-b15f-5573cbc2ced1", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 5 }