6.2. Scraping Web Pages with Requests and BeautifulSoup#
6.2.1. Introduction#
In the last section, we learned about the basics of HTML and the BeautifulSoup library. We were not, however, working with data found on the web; rather, our HTML was stored locally. In this section, we will learn how to make calls to a remote server with the requests library and then parse that data with BeautifulSoup.
6.2.2. Requests#
The requests library allows us to send a signal via Python to a server. A good way to think about requests is as an invisible browser that opens in the background of your computer. Requests does precisely what your browser does: it sends a signal over the internet to connect to a specific server address. Every server has a unique IP address, but a domain name typically points to that address, so you can connect to a server without having to type out its IP address. Unlike your browser, however, requests does not render the results for you to see. Instead, it receives the return signal and simply stores the HTML data in memory.
To begin learning how requests works, let’s first import the requests library.
import requests
Now that we have imported requests, let’s go ahead and create a string object that contains the URL of the website we want to scrape. I always call this string “url” in my code.
url = "https://en.wikipedia.org/wiki/List_of_French_monarchs"
Excellent! Now we can use the requests library to make a call to this particular page. We will do this via the get function in the requests library. The get function has one mandatory argument: the URL of the website that you want to request. In our case, this will be our “url” string.
s = requests.get(url)
Now that we have made the request and stored the response in a variable, let’s take a look at what it looks like.
print (s)
<Response [200]>
On the surface, this looks like we may have failed. What is this odd “Response 200” and what does it mean? This particular response means that our attempt to connect to the server was successful. There are many types of responses, but 200 is the one we want to see. If you ever see a response that is not 200, you can Google that particular server response to find out what is happening. Sometimes, a response indicates that your request was blocked. This may be because the website has anti-web-scraping measures in place. In other cases, the page may be protected, meaning that it lies behind a login. There are too many potential errors for me to detail them all in this basic introductory chapter. I will, however, give you a solution to one very common problem: the 403 response. For that solution, check out the final section of this chapter.
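If you want your code to check the outcome itself rather than reading the printout, the response object stores the numeric code in its status_code attribute, and its raise_for_status() method raises an error for any 4xx or 5xx response. Here is a minimal sketch using the response we just created:

# Check the numeric status code stored on the response object
if s.status_code == 200:
    print ("Success: the page's HTML is stored in s.content")
else:
    print ("Something went wrong. Status code:", s.status_code)

# Alternatively, this line raises an exception for any 4xx or 5xx response
# s.raise_for_status()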
6.2.3. BeautifulSoup#
Now that we have learned how to make a call to a server and stored the response (the HTML) in memory, we need a way to parse that data. Buried within the s response object is the HTML content. We can access that data through the content attribute of the response object. Let’s do that and check out the first 100 characters.
print (s.content[:100])
b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title'
Notice that this data is HTML. At this stage, however, we do not have an easy way to take this string and process it as structured data. This is where BeautifulSoup comes into play. BeautifulSoup allows us to convert s.content into structured data that we can then parse. To do this, we first need to import BeautifulSoup. Unlike most libraries, BeautifulSoup is installed as beautifulsoup4 but imported as bs4. Because of this, we need to import the BeautifulSoup class from bs4. The command below does this for us.
from bs4 import BeautifulSoup
Next, we need to create a new soup object. We pass in the HTML content and tell BeautifulSoup which parser to use; here we use Python’s built-in html.parser.
soup = BeautifulSoup(s.content, "html.parser")
If we don’t see an error, then it means we have successfully created a soup object. Let’s print it off to see what it looks like:
print (str(soup)[:200])
<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of French monarchs - Wikipedia</title>
<script>document.documentElement.className="client-js";
While the soup object looks precisely the same as s.content when printed, it is entirely different. It retains the structure of the HTML because BeautifulSoup has parsed it for us. This means that we can use special methods for navigating and searching the parsed HTML. In this section of the textbook, we will use find and find_all.
find - this will allow us to find the first instance of a given tag and grab that tag and all of its nested components.
find_all - this will return a list of all tags that match the given tag name, each with its nested components.
Let’s take a look at a basic example of the find method.
div = soup.find("div")
Here, we grabbed the first div tag on the page. Div tags, however, are quite common because they are one of the essential building blocks of HTML. Let’s take a look at how many there are on the entire page. We can do this with the find_all method and then count the list size with the len function that we met in chapter 02_02.
all_divs = soup.find_all("div")
print (len(all_divs))
97
So, if we wanted to grab a specific div with only find and find_all, we would have to count all the div tags on the page, work out the right index, and grab it. This would not work at scale because the count would vary from page to page, even if the HTML structure were similar across all pages on a site. We need a better solution. This is where find’s second, optional argument comes in. The find function can also take a dictionary that allows us to pass some specificity to our parsing of the soup object.
Let’s say I want to grab the main body of the Wikipedia article. If I inspect the page, I will notice that one particular div tag contains all the data corresponding to the body of the Wikipedia article. This div tag has a unique id attribute whose value is “mw-content-text”. This means that I can pass as a second argument a dictionary where id is the key and the corresponding id value is the value. This will tell BeautifulSoup to find the first div tag that has an id attribute matching mw-content-text. Let’s take a look at this in code.
body = soup.find("div", {"id": "mw-content-text"})
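As an aside, BeautifulSoup also accepts common attributes as keyword arguments, so a line like the following should, to my knowledge, be equivalent to the dictionary version above:

body = soup.find("div", id="mw-content-text")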
Now that we have grabbed this body portion of the article, we can print it off with the text attribute.
print (body.text[:500])
This article is about French monarchs. For Frankish kings, see List of Frankish kings.
Division of the Frankish Empire at the Treaty of Verdun in 843
The monarchs of the Kingdom of France ruled from the establishment of the Kingdom of the West Franks in 843 until the fall of the Second French Empire in 1870, with several interruptions. Between the period from King Charles the Bald in 843 to King Louis XVI in 1792, France had 45 kings. Adding the 7 emperors and kings after the French Revolut
This is fantastic! But what if we wanted to grab the text and maintain the structure of the paragraphs? We can do this by searching the soup object at the body level. The body object that we created is still a soup object, and it retains all of the HTML data found under that particular div. We can therefore use find_all on the body object to grab all of the paragraphs, like so:
paragraphs = body.find_all("p")
We can now iterate over the paragraphs. Let’s do that for the first five paragraphs and print off their text.
for paragraph in paragraphs[:5]:
    print (paragraph.text)
The monarchs of the Kingdom of France ruled from the establishment of the Kingdom of the West Franks in 843 until the fall of the Second French Empire in 1870, with several interruptions. Between the period from King Charles the Bald in 843 to King Louis XVI in 1792, France had 45 kings. Adding the 7 emperors and kings after the French Revolution, this comes to a total of 52 monarchs of France.
In August 843 the Treaty of Verdun divided the Frankish realm into three kingdoms, one of which (Middle Francia) was short-lived; the other two evolved into France (West Francia) and, eventually, Germany (East Francia). By this time, the eastern and western parts of the land had already developed different languages and cultures.
Initially, the kingdom was ruled primarily by two dynasties, the Carolingians and the Robertians, who alternated rule from 843 until 987, when Hugh Capet, the progenitor of the Capetian dynasty, took the throne. The kings use the title "King of the Franks" until the late twelfth century; the first to adopt the title of "King of France" was Philip II (r. 1180–1223). The Capetians ruled continuously from 987 to 1792 and again from 1814 to 1848. The branches of the dynasty which ruled after 1328, however, are generally given the specific branch names of Valois (until 1589), Bourbon (from 1589 until 1792 and from 1814 until 1830), and the Orléans (from 1830 until 1848).
Voilà! You have now officially scraped your first web page in Python and grabbed some relevant data. Scraping data from the web is never a copy-and-paste task because every website structures its HTML a bit differently. Nevertheless, the mechanics are the same, and the basic methods you learned in this chapter should allow you to scrape the majority of static websites.
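To tie everything from this section together, here is a minimal sketch of the full workflow in a single script. It assumes the same Wikipedia URL used above; for other sites you will need to inspect the page yourself and adjust which tags and attributes you search for.

import requests
from bs4 import BeautifulSoup

# Request the page and parse the returned HTML
url = "https://en.wikipedia.org/wiki/List_of_French_monarchs"
s = requests.get(url)
soup = BeautifulSoup(s.content, "html.parser")

# Narrow the soup down to the main body of the article
body = soup.find("div", {"id": "mw-content-text"})

# Print the text of the first five paragraphs
for paragraph in body.find_all("p")[:5]:
    print (paragraph.text)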