{ "cells": [ { "cell_type": "markdown", "id": "210870de-b1d8-4c75-b2c9-57877c87ce70", "metadata": {}, "source": [ "# Introduction to HTML" ] }, { "cell_type": "markdown", "id": "3ba0a7db-3563-4542-b312-a0b6acb6020e", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "id": "0b59ad74-ec41-4b3b-8465-218e4ef0ff49", "metadata": {}, "source": [ "In this chapter, we will learn about web scraping, one of the more vital skills of a digital humanist. **Web scraping** is process by which we automate the calling of a server (which hosts a website) and parsing that request which is an HTML file. **HTML** stands for HyperText Markup Language. It is the way in which websites are structured. When we scrape a website, we write rules for extracting pieces of information from it based on how that data is structured within the HTML. To be competent at web scraping, therefore, one must be able to understand and parse HTML.\n", "\n", "In this section, we will break down HTML and you will learn the most common tags, such as div, p, strong, and span. You will also learn about attributes within these tags, such as href, class, and id. It is vital that you understand this before moving onto the next chapter in which we learn how to web scrape with Python." ] }, { "cell_type": "markdown", "id": "38b63b13-214f-45af-8cea-1ee9e9d02482", "metadata": {}, "source": [ "## Diving into HTML" ] }, { "cell_type": "markdown", "id": "aebb20af-21b9-4a0e-aa31-6c74c1deb04b", "metadata": {}, "source": [ "So why is HTML useful? HTML, like other markup languages, such as **XML**, or eXstensible Markup Language, allows users to structure data within data. This is achieved by what are known as tags. I think It is best to see what this looks like in practice. Let's examine a simple HTML file.\n", "\n", "```\n", "
\n", "

This is a paragraph

\n", "
\n", "```\n", "\n", "Above, we see a very simple HTML file structure. In the first line of this HTML, we see `
`. Note the use of `<` and `>`. The opening `<` indicates the start of a tag in HTML. A **tag** is a way of denoting structure within an HTML file. It is a way of saying that what comes after this nested bit of code is this type of data. After the `<`, we see the word `div`. This word denotes the type of tag that we are using. In this case, we are creating a `div` tag. This is one of the most common types of tags in HTML. After the tag's name, we see `>`. This identifies the end of the tag creation.\n", "\n", "In line 2, we see a nested, or indented bit of HTML. In HTML, unlike Python, indentation is optional. It is, however, good practice to use line breaks and indentation in HTML to make the document easier for humans to parse. Line two begins with a `p` tag. The `p` tag in HTML is used to denote the start of a paragraph.\n", "\n", "After the creation of the `p tag`, we see `This is a paragraph`. In HTML, text that lies outside of the tags is text that appears on the web page. In this case, the HTML file would display the text `This is a paragraph`. Immediately after this bit of text we encounter our first close tag. A `close tag` in HTML indicates that this nested bit of structure is over. In our case, the first close tag is `

`. We know that it is a close tag because of the `\n", "

This is a paragraph

\n", "
\n", "```\n", "\n", "If you said the `class=\"content\"` portion of the open div tag, then you would be right. This bit is nested within the tag and is known as an **attribute**. In our case, the special attribute used is a `class` attribute (a very common attribute type). This attribute has a value of `content`. \n", "\n", "When scraping websites, you can use these attributes to specify which div tag to grab. There are several common attributes, specifically `class` and `id`." ] }, { "cell_type": "markdown", "id": "84f9b071-55f4-4ad5-bb72-02ebe495248e", "metadata": {}, "source": [ "## Parsing HTML with BeautifulSoup" ] }, { "cell_type": "markdown", "id": "d186f8d9-8075-4e50-a844-b04f9b0d8623", "metadata": {}, "source": [ "Now that we understand a bit about HTML, let's start trying to parse it in Python. When we **parse** HTML, we attempt to automate the identification of HTML's structure and systematically interpret it. This is the basis for web scraping. To begin, let's create a simple HTML file in memory." ] }, { "cell_type": "code", "execution_count": 9, "id": "cda73f06-d7c5-4c3f-b014-f7f3c7fe2560", "metadata": {}, "outputs": [], "source": [ "html = \"\"\"\n", "\n", " \n", "
\n", "

This is our first content

\n", "
\n", "
\n", "

This is is another piece of content

\n", "
\n", " \n", "\n", "\"\"\"" ] }, { "cell_type": "markdown", "id": "65943aff-6a1f-4064-92b5-6fbe00a14e43", "metadata": {}, "source": [ "There are many Python libraries available to you for parsing HTML data. For more robust problems, [Selinium](https://selenium-python.readthedocs.io) is the industry standard. While Selinium is powerful, it has a steep learning curve and can be challenging for those new to Python, especially for those programming on Windows. Additionally, most web scraping problems can be solved without Selinium.\n", "\n", "For those reasons, we will not be using Selinium in this textbook, rather [BeautifulSoup](https://pypi.org/project/beautifulsoup4/). BeautifulSoup is a light-weight Python library for parsing HTML. It is quick and effective. The most challenging thing about BeautifulSoup is remembering how to install it and import it in Python.\n", "\n", "To install BeautifulSoup, you must run the following command in your terminal:\n", "\n", "```\n", "pip install beautifulsoup4\n", "```\n", "\n", "Once it is installed, you can then import BeautifulSoup in the following way:" ] }, { "cell_type": "code", "execution_count": 23, "id": "4f6d217b-e204-4be6-bdf4-d0381d0211c8", "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup" ] }, { "cell_type": "markdown", "id": "585725ac-e316-45ad-b176-4da2dc26aa92", "metadata": {}, "source": [ "With BeautifulSoup imported, we can use the `BeautifulSoup` class. This class allows us to parse HTML. As we will see throughout this chapter, you will rarely have HTML as a string within your Python script, but for now, since we are starting, let's try to parse the above `html` string by passing it to the BeautifulSoup class. It is Pythonic to name your BeautifulSoup variable `soup`. If you have a more complex script that is parsing soup objects from multiple websites, you may want to be more original in your naming conventions, but for our purposes,`soup` will serve us well." ] }, { "cell_type": "code", "execution_count": 10, "id": "6aeb540b-dd99-42c4-9fe9-d942a39445f1", "metadata": {}, "outputs": [], "source": [ "soup = BeautifulSoup(html)" ] }, { "cell_type": "markdown", "id": "22f87fec-e941-415b-8c9c-e2460faf4e14", "metadata": {}, "source": [ "With our soup object created in memory, we can now begin to examine it. If we print it off, we won't notice anything special about it. It appears as a regular string." ] }, { "cell_type": "code", "execution_count": 11, "id": "30d72710-f8db-43f9-a81d-1590cf40ea4f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "
\n", "

This is our first content

\n", "
\n", "
\n", "

This is is another piece of content

\n", "
\n", "\n", "\n", "\n" ] } ], "source": [ "print(soup)" ] }, { "cell_type": "markdown", "id": "42e0dcc5-6677-48e7-8a63-37270b5eafa3", "metadata": {}, "source": [ "While it may look like a string, it is not. We can observe this by asking Python what type of object it is with the `type` function" ] }, { "cell_type": "code", "execution_count": 26, "id": "aedc53d8-c730-42e4-96ea-47134906f9c3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bs4.BeautifulSoup" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(soup)" ] }, { "cell_type": "markdown", "id": "a1c62ef7-22fe-4a68-9a37-69cf0c663cd0", "metadata": {}, "source": [ "As we can see, this is a bs4.BeautifulSoup class. That means that while it may look like a string, it actually contains more data that can be accessed. For example, We can use the `.find` method to find specific the first occurrence of a specific tag. The `find` method has one mandatory argument, a string that will be the tag name that you wish to extract. Let's grab the first `div` tag." ] }, { "cell_type": "code", "execution_count": 27, "id": "0942f3e2-11cf-4b1f-87df-a7e0201a6e5c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "
\n", "

This is our first content

\n", "
\n" ] } ], "source": [ "first_div = soup.find(\"div\")\n", "print(first_div)" ] }, { "cell_type": "markdown", "id": "b3ec5583-2afc-4f0d-96fa-46b186aa2ef1", "metadata": {}, "source": [ "As you can see, we were able to grab the first `div` tag with `.find`, but we know that there are multiple `div` tags in our HTML string. What if we wanted to grab all of them? For that, we can use the `.find_all` method. Like `.find`, `.find_all` takes a single mandatory argument. Again, it is the string of the tag name you wish to extract." ] }, { "cell_type": "code", "execution_count": 28, "id": "bae6a79a-120d-4383-8896-a4f27bcb1435", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[
\n", "

This is our first content

\n", "
,
\n", "

This is is another piece of content

\n", "
]\n" ] } ], "source": [ "all_divs = soup.find_all(\"div\")\n", "print(all_divs)" ] }, { "cell_type": "markdown", "id": "91e4d602-0865-4154-9ff4-55df4351d43b", "metadata": {}, "source": [ "Unlike `.find` which returns a single item to us, the `.find_all` method returns a list of tags. What if we did not want all tags? What if we only wanted to grab the `div` tags with a specific class attribute. BeautifulSoup allows us to do this by passing a second argument to `.find` or `.find_all`. This argument will be a dictionary whose keys will be the attributes and whose values will be the attribute names of the tags you wish to extract." ] }, { "cell_type": "code", "execution_count": 29, "id": "198c29dd-ef8c-42fd-91b6-1c46ce74baaa", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[
\n", "

This is is another piece of content

\n", "
]\n" ] } ], "source": [ "div_other = soup.find_all(\"div\", {\"class\": \"other\"})\n", "print(div_other)" ] }, { "cell_type": "markdown", "id": "b9c9cb7f-7049-44e0-b2f3-859a968053a1", "metadata": {}, "source": [ "Once we have obtained the specific tag that we want to extract from the HTML, we can then access its nested components. Each `bs4.BeautifulSoup` object functions the same way as the original soup. It contains all children (nested) tags. We can, therefore, grab the `p` tag embedded within the `div` tag that has a `class` attribute of `other` by using `.find(\"p\")`." ] }, { "cell_type": "code", "execution_count": 30, "id": "ae8de037-74f0-464a-8553-9ddedf04850a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "

This is is another piece of content

\n" ] } ], "source": [ "paragraph = div_other[0].find(\"p\")\n", "print(paragraph)" ] }, { "cell_type": "markdown", "id": "b72bff36-7d55-49c9-a299-ca216387d0ec", "metadata": {}, "source": [ "If we are working with textual data, though, we rarely want to retain the HTML, rather we want to extract the text within the HTML. For this, we can access the raw text that falls within the HTML with `.text`." ] }, { "cell_type": "code", "execution_count": 31, "id": "b31fe9a2-7781-44ee-a024-39002d767d4e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is is another piece of content\n" ] } ], "source": [ "print(paragraph.text)" ] }, { "cell_type": "markdown", "id": "8cc361f7-2b3b-4df5-964f-add6e5f8036f", "metadata": {}, "source": [ "Being able to do this programatically means that we can automate the scraping of HTML files via Python. We can, for example, extract all the text from each `div` tag in our file via the following two lines of code." ] }, { "cell_type": "code", "execution_count": 33, "id": "f71bc91a-12bb-419b-8dad-2257f61eb114", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "This is our first content\n", "\n", "\n", "This is is another piece of content\n", "\n" ] } ], "source": [ "for div in all_divs:\n", " print(div.text)" ] }, { "cell_type": "markdown", "id": "6e763b26-d076-4221-b167-c3c359fa3976", "metadata": {}, "source": [ "## How to Find a Website's HTML" ] }, { "cell_type": "markdown", "id": "13c136fe-2f1e-467c-af34-96fe45aacdfd", "metadata": {}, "source": [ "Now that you are familiar generally with HTML and how it works, let's dive in and take a look at some real-world HTML from a real website. In the next chapter, we will web scrape Wikipedia, so let's go ahead and focus on Wikipedia here as well. If you are using a web browser that supports it (Chrome and Firefox), you can enter developer mode. Each operating system and browser has a different set of hotkeys to do this, but on all you can right click the webpage and click \"inspect\". This will open up developer mode. At this point in the chapter, I highly recommend switching over to the video as it will be a bit easier to follow along.\n", "\n", "For this chapter (and the next), we will be working specifically with this page:\n", "https://en.wikipedia.org/wiki/List_of_French_monarchs\n", "\n", "Go ahead and open it up on another screen or in a new tab. This Wikipedia article contains some text, but primarily it hosts a list of all French monarchs, from the Carolingians up through the mid-19th century with Napoleon III.\n", "\n", "When you inspect this page, you will see all the nested HTML within it. Spend some time and go through these tags. In the next section, we will learn how to scrape this page, but for now take a look at what we can do with a few basic commands in Python." ] }, { "cell_type": "code", "execution_count": 6, "id": "3554f4b6-92e9-4709-8d03-82258b2dee6e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The monarchs of the Kingdom of France ruled from the establishment of the Kingdom of the West Franks in 843 until the fall of the Second French Empire in 1870, with several interruptions. Between the period from King Charles the Bald in 843 to King Louis XVI in 1792, France had 45 kings. Adding the 7 emperors and kings after the French Revolution, this comes to a total of 52 monarchs of France.\n", "\n", "In August 843 the Treaty of Verdun divided the Frankish realm into three kingdoms, one of which (Middle Francia) was short-lived; the other two evolved into France (West Francia) and, eventually, Germany (East Francia). By this time, the eastern and western parts of the land had already developed different languages and cultures.\n", "\n", "Initially, the kingdom was ruled primarily by two dynasties, the Carolingians and the Robertians, who alternated rule from 843 until 987, when Hugh Capet, the progenitor of the Capetian dynasty, took the throne. The kings use the title \"King of the Franks\" until the late twelfth century; the first to adopt the title of \"King of France\" was Philip II (r. 1180–1223). The Capetians ruled continuously from 987 to 1792 and again from 1814 to 1848. The branches of the dynasty which ruled after 1328, however, are generally given the specific branch names of Valois (until 1589), Bourbon (from 1589 until 1792 and from 1814 until 1830), and the Orléans (from 1830 until 1848).\n", "\n" ] } ], "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "\n", "url = \"https://en.wikipedia.org/wiki/List_of_French_monarchs\"\n", "s = requests.get(url)\n", "\n", "soup = BeautifulSoup(s.content)\n", "body = soup.find(\"div\", {\"id\": \"mw-content-text\"})\n", "for paragraph in body.find_all(\"p\")[:5]:\n", " if paragraph.text.strip() != \"\":\n", " print (paragraph.text)" ] }, { "cell_type": "markdown", "id": "ae7c0842-3877-4b5f-a297-a1017acb7c4c", "metadata": {}, "source": [ "In the cell above, we used two libraries, requests and BeautifulSoup to call the Wikipedia server that hosts that particular page. We then searched for the div tag that contains the main body of the page. In this case, it was a div tag whose attribute \"id\" corresponded to \"mw-content-text\". I thin searched for all of the \"p\" tags, or paragraphs within that body and printed off the the text if the text was not blank. By the end of the next chapter, you will not only understand the code above, but you will be able to write it and code it out yourself. For now, inspect that page and see if you can find where the div tag whose id corresponds to \"mw-content-text\" is located in the HTML. It is okay if this is hard! It is not something that you naturally can do quickly. It takes practice. A trick to help get you started, however, is to right click the area that you want to scrape and then click inspect.\n", "\n", "Once you feel comfortable with this, feel free to move on to the next chapter to start learning how to web scrape!" ] }, { "cell_type": "code", "execution_count": null, "id": "fb314d49-7da3-4aea-8133-46db2aea02e6", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 5 }