{ "cells": [ { "cell_type": "markdown", "id": "210870de-b1d8-4c75-b2c9-57877c87ce70", "metadata": {}, "source": [ "# Introduction to HTML" ] }, { "cell_type": "markdown", "id": "3ba0a7db-3563-4542-b312-a0b6acb6020e", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "id": "0b59ad74-ec41-4b3b-8465-218e4ef0ff49", "metadata": {}, "source": [ "In this chapter, we will learn about web scraping, one of the more vital skills of a digital humanist. **Web scraping** is the process by which we automate requests to a server (which hosts a website) and parse the response, which is typically an HTML file. **HTML** stands for HyperText Markup Language. It is the markup language used to structure web pages. When we scrape a website, we write rules for extracting pieces of information from it based on how that data is structured within the HTML. To be competent at web scraping, therefore, one must be able to understand and parse HTML.\n", "\n", "In this section, we will break down HTML and you will learn the most common tags, such as div, p, strong, and span. You will also learn about attributes within these tags, such as href, class, and id. It is vital that you understand this before moving on to the next chapter, in which we learn how to web scrape with Python." ] }, { "cell_type": "markdown", "id": "38b63b13-214f-45af-8cea-1ee9e9d02482", "metadata": {}, "source": [ "## Diving into HTML" ] }, { "cell_type": "markdown", "id": "aebb20af-21b9-4a0e-aa31-6c74c1deb04b", "metadata": {}, "source": [ "So why is HTML useful? HTML, like other markup languages such as **XML** (eXtensible Markup Language), allows users to structure data within data. This is achieved by what are known as tags. It is best to see what this looks like in practice. Let's examine a simple HTML file.\n", "\n", "```\n", "
This is a paragraph
\n", "This is a paragraph
\n", "This is our first content
\n", "This is is another piece of content
\n", "This is our first content
\n", "This is is another piece of content
\n", "This is our first content
\n", "This is our first content
\n", "This is is another piece of content
\n", "This is is another piece of content
\n", "This is is another piece of content
\n" ] } ], "source": [ "paragraph = div_other[0].find(\"p\")\n", "print(paragraph)" ] }, { "cell_type": "markdown", "id": "b72bff36-7d55-49c9-a299-ca216387d0ec", "metadata": {}, "source": [ "If we are working with textual data, though, we rarely want to retain the HTML; rather, we want to extract the text within it. For this, we can access the raw text that falls within a tag with `.text`." ] }, { "cell_type": "code", "execution_count": 31, "id": "b31fe9a2-7781-44ee-a024-39002d767d4e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is is another piece of content\n" ] } ], "source": [ "print(paragraph.text)" ] }, { "cell_type": "markdown", "id": "8cc361f7-2b3b-4df5-964f-add6e5f8036f", "metadata": {}, "source": [ "Being able to do this programmatically means that we can automate the scraping of HTML files via Python. We can, for example, extract all the text from each `div` tag in our file via the following two lines of code." ] }, { "cell_type": "code", "execution_count": 33, "id": "f71bc91a-12bb-419b-8dad-2257f61eb114", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "This is our first content\n", "\n", "\n", "This is is another piece of content\n", "\n" ] } ], "source": [ "for div in all_divs:\n", "    print(div.text)" ] }, { "cell_type": "markdown", "id": "6e763b26-d076-4221-b167-c3c359fa3976", "metadata": {}, "source": [ "## How to Find a Website's HTML" ] }, { "cell_type": "markdown", "id": "13c136fe-2f1e-467c-af34-96fe45aacdfd", "metadata": {}, "source": [ "Now that you are generally familiar with HTML and how it works, let's dive in and take a look at some real-world HTML from a real website. In the next chapter, we will web scrape Wikipedia, so let's go ahead and focus on Wikipedia here as well. If you are using a web browser that supports it (such as Chrome or Firefox), you can enter developer mode. 
Each operating system and browser has a different set of hotkeys to do this, but in all of them you can right-click the webpage and click \"Inspect\". This will open up developer mode. At this point in the chapter, I highly recommend switching over to the video as it will be a bit easier to follow along.\n", "\n", "For this chapter (and the next), we will be working specifically with this page:\n", "https://en.wikipedia.org/wiki/List_of_French_monarchs\n", "\n", "Go ahead and open it up on another screen or in a new tab. This Wikipedia article contains some text, but primarily it hosts a list of all French monarchs, from the Carolingians up through the mid-19th century with Napoleon III.\n", "\n", "When you inspect this page, you will see all the nested HTML within it. Spend some time and go through these tags. In the next chapter, we will learn how to scrape this page, but for now take a look at what we can do with a few basic commands in Python." ] }, { "cell_type": "code", "execution_count": 6, "id": "3554f4b6-92e9-4709-8d03-82258b2dee6e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The monarchs of the Kingdom of France ruled from the establishment of the Kingdom of the West Franks in 843 until the fall of the Second French Empire in 1870, with several interruptions. Between the period from King Charles the Bald in 843 to King Louis XVI in 1792, France had 45 kings. Adding the 7 emperors and kings after the French Revolution, this comes to a total of 52 monarchs of France.\n", "\n", "In August 843 the Treaty of Verdun divided the Frankish realm into three kingdoms, one of which (Middle Francia) was short-lived; the other two evolved into France (West Francia) and, eventually, Germany (East Francia). 
By this time, the eastern and western parts of the land had already developed different languages and cultures.\n", "\n", "Initially, the kingdom was ruled primarily by two dynasties, the Carolingians and the Robertians, who alternated rule from 843 until 987, when Hugh Capet, the progenitor of the Capetian dynasty, took the throne. The kings use the title \"King of the Franks\" until the late twelfth century; the first to adopt the title of \"King of France\" was Philip II (r. 1180–1223). The Capetians ruled continuously from 987 to 1792 and again from 1814 to 1848. The branches of the dynasty which ruled after 1328, however, are generally given the specific branch names of Valois (until 1589), Bourbon (from 1589 until 1792 and from 1814 until 1830), and the Orléans (from 1830 until 1848).\n", "\n" ] } ], "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "\n", "url = \"https://en.wikipedia.org/wiki/List_of_French_monarchs\"\n", "s = requests.get(url)\n", "\n", "soup = BeautifulSoup(s.content, \"html.parser\")\n", "body = soup.find(\"div\", {\"id\": \"mw-content-text\"})\n", "for paragraph in body.find_all(\"p\")[:5]:\n", "    if paragraph.text.strip() != \"\":\n", "        print(paragraph.text)" ] }, { "cell_type": "markdown", "id": "ae7c0842-3877-4b5f-a297-a1017acb7c4c", "metadata": {}, "source": [ "In the cell above, we used two libraries, requests and BeautifulSoup, to request that particular page from the Wikipedia server and parse the HTML it returned. We then searched for the div tag that contains the main body of the page. In this case, it was a div tag whose attribute \"id\" corresponded to \"mw-content-text\". We then searched for all of the \"p\" tags, or paragraphs, within that body, took the first five, and printed the text of each one that was not blank. By the end of the next chapter, you will not only understand the code above, but you will also be able to write it yourself. 
For now, inspect that page and see if you can find where the div tag whose id corresponds to \"mw-content-text\" is located in the HTML. It is okay if this is hard! It is not something that comes naturally or quickly. It takes practice. A trick to help get you started, however, is to right-click the area that you want to scrape and then click \"Inspect\".\n", "\n", "Once you feel comfortable with this, feel free to move on to the next chapter to start learning how to web scrape!" ] }, { "cell_type": "code", "execution_count": null, "id": "fb314d49-7da3-4aea-8133-46db2aea02e6", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 5 }