{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Creating LDA in Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " In the last section, we learned the key ideas and concepts behind LDA topic modeling. Let's now put those ideas into practice. In this section, we will be using the Gensim library to create our topic model and the pyLDAvis library to visualize it. You can install both libraries with `pip` using the following commands:\n", " \n", "```\n", "pip install gensim\n", "```\n", "\n", "and\n", "\n", "```\n", "pip install pyldavis\n", "```\n", "\n", "We will also need to install NLTK, or the Natural Language Toolkit, in order to get a list of stop words. We can install the library with pip:\n", "\n", "```\n", "pip install nltk\n", "```\n", "\n", "Once you have installed `NLTK`, you will need to download the list of English stop words. You can do so by running the following in Python:\n", "\n", "```\n", "import nltk\n", "nltk.download('stopwords')\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Importing the Required Libraries and Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have our libraries installed correctly, we can import everything." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from nltk.corpus import stopwords\n", "import string\n", "import gensim.corpora as corpora\n", "from gensim.models import LdaModel\n", "import pyLDAvis.gensim_models\n", "pyLDAvis.enable_notebook()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These imports include the `LdaModel` class from Gensim, which we will use to create our LDA model. Before we can populate our model, however, we must first load and clean our data." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "  <thead>\n", "    <tr><th></th><th>Last</th><th>First</th><th>Description</th></tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>AARON</td><td>Thabo Simon</td><td>An ANCYL member who was shot and severely inju...</td></tr>\n",
"    <tr><th>1</th><td>ABBOTT</td><td>Montaigne</td><td>A member of the SADF who was severely injured ...</td></tr>\n",
"    <tr><th>2</th><td>ABRAHAM</td><td>Nzaliseko Christopher</td><td>A COSAS supporter who was kicked and beaten wi...</td></tr>\n",
"    <tr><th>3</th><td>ABRAHAMS</td><td>Achmat Fardiel</td><td>Was shot and blinded in one eye by members of ...</td></tr>\n",
"    <tr><th>4</th><td>ABRAHAMS</td><td>Annalene Mildred</td><td>Was shot and injured by members of the SAP in ...</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>20829</th><td>XUZA</td><td>Mandla</td><td>Was severely injured when he was stoned by a f...</td></tr>\n",
"    <tr><th>20830</th><td>YAKA</td><td>Mbangomuni</td><td>An IFP supporter and acting induna who was sho...</td></tr>\n",
"    <tr><th>20831</th><td>YALI</td><td>Khayalethu</td><td>Was shot by members of the SAP in Lingelihle, ...</td></tr>\n",
"    <tr><th>20832</th><td>YALO</td><td>Bikiwe</td><td>An IFP supporter whose house and possessions w...</td></tr>\n",
"    <tr><th>20833</th><td>YALOLO-BOOYSEN</td><td>Geoffrey Yali</td><td>An ANC supporter and youth activist who was to...</td></tr>\n",
"  </tbody>\n", "</table>\n", "<p>20834 rows × 3 columns</p>\n", "</div>
" ], "text/plain": [ " Last First \\\n", "0 AARON Thabo Simon \n", "1 ABBOTT Montaigne \n", "2 ABRAHAM Nzaliseko Christopher \n", "3 ABRAHAMS Achmat Fardiel \n", "4 ABRAHAMS Annalene Mildred \n", "... ... ... \n", "20829 XUZA Mandla \n", "20830 YAKA Mbangomuni \n", "20831 YALI Khayalethu \n", "20832 YALO Bikiwe \n", "20833 YALOLO-BOOYSEN Geoffrey Yali \n", "\n", " Description \n", "0 An ANCYL member who was shot and severely inju... \n", "1 A member of the SADF who was severely injured ... \n", "2 A COSAS supporter who was kicked and beaten wi... \n", "3 Was shot and blinded in one eye by members of ... \n", "4 Was shot and injured by members of the SAP in ... \n", "... ... \n", "20829 Was severely injured when he was stoned by a f... \n", "20830 An IFP supporter and acting induna who was sho... \n", "20831 Was shot by members of the SAP in Lingelihle, ... \n", "20832 An IFP supporter whose house and possessions w... \n", "20833 An ANC supporter and youth activist who was to... \n", "\n", "[20834 rows x 3 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"../data/trc.csv\")\n", "df = df[[\"Last\", \"First\", \"Description\"]]\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our goal in this section will be to model all the descriptions of violence in the TRC Volume 7 Final Report. We will, therefore, grab all documents and place them into a list." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\"An ANCYL member who was shot and severely injured by SAP members at Lephoi, Bethulie, Orange Free State (OFS) on 17 April 1991. 
Police opened fire on a gathering at an ANC supporter's house following a dispute between two neighbours, one of whom was linked to the ANC and the other to the SAP and a councillor.\"" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs = df.Description.tolist()\n", "docs[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleaning Documents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have our documents, let's go ahead and load up our stop words. These will be the words that we remove from our documents." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "stop_words = stopwords.words('english')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we need a function to clean our documents. The purpose of the function below is to take a single document as an input and return a cleaned sequence of words with no punctuation or stop words." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def clean_doc(doc):\n", "    no_punct = ''\n", "    for c in doc:\n", "        if c not in string.punctuation:\n", "            no_punct = no_punct+c\n", "    # with list comprehension\n", "    # no_punct = ''.join([c for c in doc if c not in string.punctuation])\n", "    \n", "    words = no_punct.lower().split()\n", "    \n", "    final_words = []\n", "    for word in words:\n", "        if word not in stop_words:\n", "            final_words.append(word)\n", "    \n", "    # with list comprehension\n", "    # final_words = [word for word in words if word not in stop_words]\n", "\n", "    return final_words" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look and see what this looks like now when we run it over our first document."
 ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "An ANCYL member who was shot and severely injured by SAP members at Lephoi, Bethulie, Orange Free State (OFS) on 17 April 1991. Police opened fire on a gathering at an ANC supporter's house following a dispute between two neighbours, one of whom was linked to the ANC and the other to the SAP and a councillor.\n", "['ancyl', 'member', 'shot', 'severely', 'injured', 'sap', 'members', 'lephoi', 'bethulie', 'orange', 'free', 'state', 'ofs', '17', 'april', '1991', 'police', 'opened', 'fire', 'gathering', 'anc', 'supporters', 'house', 'following', 'dispute', 'two', 'neighbours', 'one', 'linked', 'anc', 'sap', 'councillor']\n" ] } ], "source": [ "cleaned = clean_doc(docs[0])\n", "print(docs[0])\n", "print(cleaned)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With our function created, we can now process all our documents. In the line below, we will convert all documents into a list called `cleaned_docs`, in which each document is represented as a list of words." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "cleaned_docs = [clean_doc(doc) for doc in docs]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create ID-Word Index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remember, an LDA topic model cannot work with words directly; the algorithm operates on numbers. This means that we need to convert all words into numbers. We can do this with a bag-of-words approach. In this approach, we create an ID-Word Index. This is essentially a dictionary in which each unique word has a unique number. The dictionary is not sorted alphabetically as a whole: Gensim assigns IDs as it encounters new words, document by document, and only the new words within each document are numbered alphabetically."
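, "

To make the bag-of-words idea concrete, here is a minimal pure-Python sketch. It is not Gensim's implementation; `build_index`, `to_bow`, and the two toy documents are hypothetical names and data chosen only for illustration.

```python
from collections import Counter


def build_index(tokenized_docs):
    # Give every unseen word the next free integer ID (hypothetical helper).
    word2id = {}
    for doc in tokenized_docs:
        # New words within one document are numbered in alphabetical order.
        for word in sorted(set(doc)):
            if word not in word2id:
                word2id[word] = len(word2id)
    return word2id


def to_bow(doc, word2id):
    # Count each known word and return sorted (word_id, count) pairs.
    counts = Counter(word2id[word] for word in doc if word in word2id)
    return sorted(counts.items())


toy_docs = [['anc', 'member', 'shot', 'anc'], ['police', 'shot', 'member']]
word2id = build_index(toy_docs)
bow = to_bow(toy_docs[0], word2id)
print(word2id)
print(bow)
```

Each pair holds a word ID followed by the number of times that word occurs in the document, which is the same shape of data that `doc2bow()` produces."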
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "id2word = corpora.Dictionary(cleaned_docs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the ID-Word Index created, let's query it and see what word maps to index number `250`." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'bmw'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "id2word[250]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, it is `BMW`, a political organization in South Africa. Now that we have our dictionary, we can convert all our documents into a sequence of numbers, rather than words. We can do this via the `doc2bow()` method. " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "corpus = [id2word.doc2bow(cleaned_doc) for cleaned_doc in cleaned_docs]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `corpus` object now contains all the documents but represented as a bag-of-words index, rather than as a sequence of words. This is the precise data that our LDA model will expect." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(0, 1), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 2), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1)]\n" ] } ], "source": [ "print(corpus[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to see what these numbers correspond to, let's take a look at our first document and map each number to the ID-Word Index." 
 ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\t17\n", "1\t1991\n", "2\tanc\n", "3\tancyl\n", "4\tapril\n", "5\tbethulie\n", "6\tcouncillor\n", "7\tdispute\n", "8\tfire\n", "9\tfollowing\n", "10\tfree\n", "11\tgathering\n", "12\thouse\n", "13\tinjured\n", "14\tlephoi\n", "15\tlinked\n", "16\tmember\n", "17\tmembers\n", "18\tneighbours\n", "19\tofs\n", "20\tone\n", "21\topened\n", "22\torange\n", "23\tpolice\n", "24\tsap\n", "25\tseverely\n", "26\tshot\n", "27\tstate\n", "28\tsupporters\n", "29\ttwo\n" ] } ], "source": [ "for num in corpus[0]:\n", "    num = num[0]\n", "    print(f\"{num}\\t{id2word[num]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating LDA Topic Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With our `corpus` now prepared, we can pass it to our LDA model. The `LdaModel` class accepts many parameters, but three are essential here:\n", "\n", "- `corpus` which will be our corpus object\n", "- `id2word` which will be our ID-Word Index\n", "- `num_topics` which will be the number of topics we want the model to find.\n", "\n", "Now, we can train our topic model." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Training a model may take anywhere from several seconds to several minutes. The more documents you have, the longer it will take. Once the model is trained, we can begin to analyze it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analyze a Document" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to get the topics for each document in our corpus, we can do so via `get_document_topics()`. Here we pass it a single argument, the `corpus` object. For each document, it returns a list of topics and a score for each topic."
 ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "20834" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "topics = lda_model.get_document_topics(corpus)\n", "len(topics)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can analyze the topics for our first document." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(6, 0.037855655),\n", " (15, 0.037333403),\n", " (27, 0.5222747),\n", " (36, 0.06073708),\n", " (43, 0.041372776),\n", " (74, 0.21932246),\n", " (94, 0.05292057)]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "topics[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These topics were assigned to the following document:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "An ANCYL member who was shot and severely injured by SAP members at Lephoi, Bethulie, Orange Free State (OFS) on 17 April 1991. Police opened fire on a gathering at an ANC supporter's house following a dispute between two neighbours, one of whom was linked to the ANC and the other to the SAP and a councillor.\n" ] } ], "source": [ "print(docs[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now print off some of the key words for the first three topics listed for this document. Note that the list is ordered by topic ID rather than by score, so these are not necessarily the three highest-scoring topics. For each topic, we will print off the top 10 words with the `get_topic_terms()` method, which takes two arguments: the topic number and the number of words to return."
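, "

Because the list of topics shown above is ordered by topic ID, slicing off the first few entries does not necessarily give the highest-scoring topics. Here is a minimal sketch of ranking by score instead, using the values printed above for the first document (`doc_topics` and `top_three` are illustrative names, not part of Gensim):

```python
# Topic scores for the first document, copied from the output shown above.
doc_topics = [
    (6, 0.037855655), (15, 0.037333403), (27, 0.5222747), (36, 0.06073708),
    (43, 0.041372776), (74, 0.21932246), (94, 0.05292057),
]

# Sort by score (the second item of each pair), highest first, and keep three.
top_three = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)[:3]
print(top_three)
```

With a trained model, the same `sorted()` call could be applied to `topics[0]` directly. The cell that follows keeps the simpler slice."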
 ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(6, 0.038197085)\n", "733 seriously\n", "1394 boipatong\n", "13 injured\n", "809 left\n", "750 massacre\n", "0 17\n", "284 thirteen\n", "68 people\n", "551 five\n", "406 refused\n", "\n", "(15, 0.037302956)\n", "625 campaign\n", "3085 soldiers\n", "226 torture\n", "37 transvaal\n", "7 dispute\n", "1350 military\n", "33 landmine\n", "1224 thrown\n", "711 section\n", "2 anc\n", "\n", "(27, 0.5223308)\n", "26 shot\n", "8 fire\n", "21 opened\n", "17 members\n", "23 police\n", "24 sap\n", "13 injured\n", "255 march\n", "58 town\n", "61 1990\n", "\n" ] } ], "source": [ "for topic in topics[0][:3]:\n", "    terms = lda_model.get_topic_terms(topic[0], 10)\n", "    print(topic)\n", "    for num in terms:\n", "        num = num[0]\n", "        print(num, id2word[num])\n", "    print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These results seem plausible. Topic 27, which has by far the highest score of the three (about 0.52), is dominated by words such as `shot`, `fire`, `opened`, `police`, and `sap`, which match the police shooting described in the document." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analyze the Topic Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is often difficult to gauge the quality of a topic model by simply looking at raw output. For this reason, it is useful to be able to study a model in its entirety. We can do this with pyLDAvis, a library that was designed to display in two-dimensional space the topics of a topic model and the key words associated with each topic." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "c:\\Users\\wma22\\anaconda3\\envs\\textbook\\lib\\site-packages\\sklearn\\manifold\\_mds.py:298: FutureWarning: The default value of `normalized_stress` will change to `'auto'` in version 1.4.
To suppress this warning, manually set the value of `normalized_stress`.\n", " warnings.warn(\n" ] } ], "source": [ "vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, mds=\"mmds\", R=30)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "PreparedData(topic_coordinates= x y topics cluster Freq\n", "topic \n", "64 -0.263505 -0.019382 1 1 4.214569\n", "60 -0.378828 -0.058674 2 1 4.087336\n", "32 -0.311093 0.087178 3 1 3.365182\n", "3 -0.475425 -0.041942 4 1 3.204816\n", "7 0.187479 -0.492747 5 1 2.998725\n", "... ... ... ... ... ...\n", "22 0.116352 -0.170869 96 1 0.325247\n", "45 0.336949 0.072965 97 1 0.294095\n", "98 0.444777 0.159917 98 1 0.259635\n", "46 -0.170074 -0.317671 99 1 0.189554\n", "12 0.038489 0.276368 100 1 0.188674\n", "\n", "[100 rows x 5 columns], topic_info= Term Freq Total Category logprob loglift\n", "347 ifp 7949.000000 7949.000000 Default 30.0000 30.0000\n", "28 supporters 10675.000000 10675.000000 Default 29.0000 29.0000\n", "505 1994 2512.000000 2512.000000 Default 28.0000 28.0000\n", "994 tvl 1298.000000 1298.000000 Default 27.0000 27.0000\n", "2 anc 11188.000000 11188.000000 Default 26.0000 26.0000\n", ".. ... ... ... ... ... ...\n", "2 anc 9.235738 11188.061947 Topic100 -4.5341 -0.8266\n", "13 injured 5.861281 5936.602787 Topic100 -4.9888 -0.6476\n", "29 two 3.980958 2343.026874 Topic100 -5.3756 -0.1048\n", "352 allegedly 3.854998 2549.389728 Topic100 -5.4078 -0.2213\n", "85 dead 4.208761 4723.345846 Topic100 -5.3200 -0.7502\n", "\n", "[4666 rows x 6 columns], token_table= Topic Freq Term\n", "term \n", "697 3 0.150945 1\n", "697 17 0.009635 1\n", "697 18 0.001606 1\n", "697 23 0.081895 1\n", "697 25 0.043356 1\n", "... ... ... 
...\n", "1577 31 0.988717 ‘red’\n", "2855 67 0.877141 ‘sharpeville\n", "3006 84 0.398323 ‘terrorist’\n", "518 74 0.672499 ‘whites’\n", "520 5 0.949699 ’s\n", "\n", "[10878 rows x 3 columns], R=30, lambda_step=0.01, plot_opts={'xlab': 'PC1', 'ylab': 'PC2'}, topic_order=[65, 61, 33, 4, 8, 73, 60, 6, 76, 39, 53, 41, 9, 56, 77, 2, 50, 28, 7, 66, 64, 18, 90, 55, 87, 48, 43, 67, 36, 27, 17, 100, 26, 15, 11, 51, 40, 14, 75, 91, 68, 37, 98, 96, 35, 38, 16, 3, 71, 21, 79, 83, 1, 49, 86, 19, 89, 24, 31, 34, 70, 80, 22, 52, 59, 94, 58, 5, 93, 10, 97, 54, 62, 92, 45, 12, 72, 44, 78, 69, 42, 32, 81, 85, 63, 20, 25, 84, 30, 74, 95, 82, 57, 29, 88, 23, 46, 99, 47, 13])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can see all 100 topics plotted in two-dimensional space, and we can study the degree to which the words in each topic are unique to that topic when compared with the corpus as a whole.\n", "\n", "While LDA is useful as a first step in exploring data, we have more powerful ways of clustering documents today. As we will see over the remainder of the chapter, other libraries and approaches that leverage state-of-the-art language models are perhaps better suited to finding and extracting topics from your corpus." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 4 }