{ "cells": [ { "cell_type": "markdown", "id": "dominant-celebration", "metadata": {}, "source": [ "# Events Analysis" ] }, { "cell_type": "markdown", "id": "075538b3-d372-4025-b0d6-808882086c0b", "metadata": {}, "source": [ "The .tokens file is the only BookNLP output file that details event data, so it is the focus of this chapter. Each section analyzes the .tokens file in more depth to identify and extract event data. At the end of the chapter, we will bring everything together in a single function that can recreate these results on any BookNLP .tokens output file." ] }, { "cell_type": "markdown", "id": "dependent-greeting", "metadata": {}, "source": [ "## Exploring the Tokens File" ] }, { "cell_type": "markdown", "id": "30d6d992-ae58-4a43-b28c-ad5c95121671", "metadata": {}, "source": [ "Let's first open the .tokens file and take a look, so we can recall precisely what this tab-separated (.tsv) file looks like. As you may remember from chapter 3, the files are easier to analyze with pandas, a tabular data analysis library in Python." ] }, { "cell_type": "code", "execution_count": 8, "id": "cultural-medicine", "metadata": {}, "outputs": [ { "data": { "text/html": [ "

| | paragraph_ID | sentence_ID | token_ID_within_sentence | token_ID_within_document | word | lemma | byte_onset | byte_offset | POS_tag | fine_POS_tag | dependency_relation | syntactic_head_ID | event |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | Mr. | Mr. | 0 | 3 | PROPN | NNP | nmod | 3 | O |
| 1 | 0 | 0 | 1 | 1 | and | and | 4 | 7 | CCONJ | CC | cc | 0 | O |
| 2 | 0 | 0 | 2 | 2 | Mrs. | Mrs. | 8 | 12 | PROPN | NNP | compound | 3 | O |
| 3 | 0 | 0 | 3 | 3 | Dursley | Dursley | 13 | 20 | PROPN | NNP | nsubj | 12 | O |
| 4 | 0 | 0 | 4 | 4 | , | , | 20 | 21 | PUNCT | , | punct | 3 | O |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99251 | 2995 | 6172 | 10 | 99251 | Dudley | Dudley | 438929 | 438935 | PROPN | NNP | pobj | 99250 | O |
| 99252 | 2995 | 6172 | 11 | 99252 | this | this | 438936 | 438940 | DET | DT | det | 99253 | O |
| 99253 | 2995 | 6172 | 12 | 99253 | summer | summer | 438941 | 438947 | NOUN | NN | npadvmod | 99245 | O |
| 99254 | 2995 | 6172 | 13 | 99254 | .... | .... | 438947 | 438951 | PUNCT | . | punct | 99243 | O |
| 99255 | 2995 | 6172 | 14 | 99255 | \t | 438951 | 438952 | PUNCT | '' | punct | 99243 | O | NaN |

99256 rows × 13 columns
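
A minimal sketch of how a table like the one above might be loaded, assuming the .tokens file is tab-separated with the header row shown. The `load_tokens` helper and the three sample rows are illustrative only (the quote-token row is made up), and nothing here is part of BookNLP's own API. The `quoting=csv.QUOTE_NONE` argument is also an assumption on our part: in the last row of the output above, the values sit one column to the left of where they belong and `word` holds a stray `\t`, which is what pandas' default quote handling would produce if the token were a bare double-quote character.

```python
import csv
import io

import pandas as pd

COLUMNS = [
    "paragraph_ID", "sentence_ID", "token_ID_within_sentence",
    "token_ID_within_document", "word", "lemma", "byte_onset",
    "byte_offset", "POS_tag", "fine_POS_tag", "dependency_relation",
    "syntactic_head_ID", "event",
]

# Illustrative stand-in for a .tokens file; the last row is a made-up quote token.
ROWS = [
    ["0", "0", "0", "0", "Mr.", "Mr.", "0", "3", "PROPN", "NNP", "nmod", "3", "O"],
    ["0", "0", "1", "1", "and", "and", "4", "7", "CCONJ", "CC", "cc", "0", "O"],
    ["0", "0", "2", "2", '"', '"', "8", "9", "PUNCT", "''", "punct", "0", "O"],
]
SAMPLE_TSV = "\n".join("\t".join(row) for row in [COLUMNS] + ROWS) + "\n"


def load_tokens(source):
    """Read a tab-separated .tokens file (path or file-like object) into a DataFrame."""
    # QUOTE_NONE treats quote characters as ordinary text rather than field delimiters.
    return pd.read_csv(source, sep="\t", quoting=csv.QUOTE_NONE)


df = load_tokens(io.StringIO(SAMPLE_TSV))
print(df[["word", "lemma", "event"]])
```

With a real file, the same helper would be called with the path to the .tokens file instead of the in-memory sample.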

| | sentence_ID | word | lemma | event |
|---|---|---|---|---|
| 0 | 0 | Mr. | Mr. | O |
| 1 | 0 | and | and | O |
| 2 | 0 | Mrs. | Mrs. | O |
| 3 | 0 | Dursley | Dursley | O |
| 4 | 0 | , | , | O |
| ... | ... | ... | ... | ... |
| 99250 | 6172 | with | with | O |
| 99251 | 6172 | Dudley | Dudley | O |
| 99252 | 6172 | this | this | O |
| 99253 | 6172 | summer | summer | O |
| 99254 | 6172 | .... | .... | O |

94498 rows × 4 columns

| | sentence_ID | word | lemma | event |
|---|---|---|---|---|
| 242 | 9 | shuddered | shudder | EVENT |
| 308 | 12 | woke | wake | EVENT |
| 346 | 13 | hummed | hum | EVENT |
| 349 | 13 | picked | pick | EVENT |
| 361 | 13 | gossiped | gossip | EVENT |
| ... | ... | ... | ... | ... |
| 99152 | 6167 | hung | hang | EVENT |
| 99185 | 6169 | said | say | EVENT |
| 99209 | 6170 | said | say | EVENT |
| 99215 | 6170 | surprised | surprised | EVENT |
| 99218 | 6170 | grin | grin | EVENT |

6029 rows × 4 columns
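
Rows like those above, where the `event` column reads `EVENT`, could be isolated with a boolean mask. This is a hedged sketch, not necessarily the exact code used in the chapter: the small DataFrame is built by hand from the first five tokens of sentence 9 in this chapter's output, and the variable names are our own.

```python
import pandas as pd

# First five tokens of sentence 9, keyed by their position in the document.
df = pd.DataFrame(
    {
        "sentence_ID": [9, 9, 9, 9, 9],
        "word": ["The", "Dursleys", "shuddered", "to", "think"],
        "lemma": ["the", "Dursleys", "shudder", "to", "think"],
        "event": ["O", "O", "EVENT", "O", "O"],
    },
    index=[240, 241, 242, 243, 244],
)

# Keep only the tokens tagged as events; the original index is preserved.
events = df[df["event"] == "EVENT"]
print(events)
```

Because the original index is kept, each event row still points back to its token position in the full table.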

| | sentence_ID | word | lemma | event |
|---|---|---|---|---|
| 0 | 0 | Mr. | Mr. | O |
| 1 | 0 | and | and | O |
| 2 | 0 | Mrs. | Mrs. | O |
| 3 | 0 | Dursley | Dursley | O |
| 4 | 0 | , | , | O |
| ... | ... | ... | ... | ... |
| 99251 | 6172 | Dudley | Dudley | O |
| 99252 | 6172 | this | this | O |
| 99253 | 6172 | summer | summer | O |
| 99254 | 6172 | .... | .... | O |
| 99255 | 6172 | \t | 438951 | NaN |

99256 rows × 4 columns

| | sentence_ID | word | lemma | event |
|---|---|---|---|---|
| 240 | 9 | The | the | O |
| 241 | 9 | Dursleys | Dursleys | O |
| 242 | 9 | shuddered | shudder | EVENT |
| 243 | 9 | to | to | O |
| 244 | 9 | think | think | O |
| 245 | 9 | what | what | O |
| 246 | 9 | the | the | O |
| 247 | 9 | neighbors | neighbor | O |
| 248 | 9 | would | would | O |
| 249 | 9 | say | say | O |
| 250 | 9 | if | if | O |
| 251 | 9 | the | the | O |
| 252 | 9 | Potters | Potters | O |
| 253 | 9 | arrived | arrive | O |
| 254 | 9 | in | in | O |
| 255 | 9 | the | the | O |
| 256 | 9 | street | street | O |
| 257 | 9 | . | . | O |
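
The tokens of sentence 9 shown above can be turned back into a readable sentence by filtering on `sentence_ID` and joining the `word` values. The sketch below assumes tokens are joined with single spaces, which leaves a space before punctuation. The `sentence_text` helper is a hypothetical name introduced for illustration.

```python
import pandas as pd

# The words of sentence 9, copied from the table above.
words = [
    "The", "Dursleys", "shuddered", "to", "think", "what", "the",
    "neighbors", "would", "say", "if", "the", "Potters", "arrived",
    "in", "the", "street", ".",
]
df = pd.DataFrame({"sentence_ID": 9, "word": words}, index=range(240, 258))


def sentence_text(df, sentence_id):
    """Join every token of one sentence into a single space-separated string."""
    rows = df[df["sentence_ID"] == sentence_id]
    return " ".join(rows["word"].astype(str))


print(sentence_text(df, 9))
```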

| | event_word | event_lemma | sentence |
|---|---|---|---|
| 0 | shuddered | shudder | The Dursleys shuddered to think what the neigh... |
| 1 | woke | wake | When Mr. and Mrs. Dursley woke up on the dull ... |
| 2 | hummed | hum | Mr. Dursley hummed as he picked out his most b... |
| 3 | picked | pick | Mr. Dursley hummed as he picked out his most b... |
| 4 | gossiped | gossip | Mr. Dursley hummed as he picked out his most b... |
| ... | ... | ... | ... |
| 6024 | hung | hang | Harry hung back for a last word with Ron and H... |
| 6025 | said | say | \t \t Hope you have -- er -- a good holiday , ... |
| 6026 | said | say | \t Oh , I will , \t said Harry , and they were... |
| 6027 | surprised | surprised | \t Oh , I will , \t said Harry , and they were... |
| 6028 | grin | grin | \t Oh , I will , \t said Harry , and they were... |

6029 rows × 3 columns
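
One way a table like the one above could be assembled in a single function is sketched here. It is an approximation under stated assumptions, not necessarily the chapter's own function: `extract_events` is a hypothetical name, the input is assumed to have the `sentence_ID`, `word`, `lemma`, and `event` columns seen earlier, and the sample rows are truncated versions of sentences 9 and 13.

```python
import pandas as pd


def extract_events(df):
    """Return one row per event token with its word, lemma, and full sentence."""
    # Rebuild every sentence once by joining its tokens with single spaces.
    sentences = df["word"].astype(str).groupby(df["sentence_ID"]).agg(" ".join)
    events = df[df["event"] == "EVENT"]
    return pd.DataFrame(
        {
            "event_word": events["word"].to_list(),
            "event_lemma": events["lemma"].to_list(),
            "sentence": events["sentence_ID"].map(sentences).to_list(),
        }
    )


# Truncated sample: the opening tokens of sentences 9 and 13.
sample = pd.DataFrame(
    [
        (9, "The", "the", "O"),
        (9, "Dursleys", "Dursleys", "O"),
        (9, "shuddered", "shudder", "EVENT"),
        (13, "Mr.", "Mr.", "O"),
        (13, "Dursley", "Dursley", "O"),
        (13, "hummed", "hum", "EVENT"),
        (13, "as", "as", "O"),
        (13, "he", "he", "O"),
        (13, "picked", "pick", "EVENT"),
    ],
    columns=["sentence_ID", "word", "lemma", "event"],
)

result = extract_events(sample)
print(result)
```

Building the sentence strings once and mapping them onto the event rows avoids re-filtering the full table for every event.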
\n", "