{ "cells": [ { "cell_type": "markdown", "id": "suited-breathing", "metadata": {}, "source": [ "# Scraping Web Pages with Requests and BeautifulSoup" ] }, { "cell_type": "markdown", "id": "posted-netherlands", "metadata": {}, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "id": "weekly-locking", "metadata": {}, "source": [ "In the last section, we learned about the basics of HTML and the BeautifulSoup library. We were not, however, working with data found on the web; rather, our HTML was stored locally. In this section, we will learn how to make calls to a remote server with the requests library and then parse that data with BeautifulSoup." ] }, { "cell_type": "markdown", "id": "exposed-ethiopia", "metadata": {}, "source": [ "## Requests" ] }, { "cell_type": "markdown", "id": "buried-remove", "metadata": {}, "source": [ "The **requests** library allows us to send a request from Python to a server. A good way to think about requests is as an invisible browser that opens in the background of your computer. Requests does much the same thing your browser does: it sends a request over the internet to a specific server address. While every server has an IP address, a domain name is usually linked to it, so that you can connect to the server without having to type out the IP address. Unlike your browser, however, requests does not display the results for you to see. Instead, it receives the response and simply stores the HTML data in memory.\n", "\n", "To begin learning how requests works, let's first import the requests library." ] }, { "cell_type": "code", "execution_count": 1, "id": "beginning-socket", "metadata": {}, "outputs": [], "source": [ "import requests" ] }, { "cell_type": "markdown", "id": "3bf7a346-988b-43e1-b9ec-ae5d6dea588c", "metadata": {}, "source": [ "Now that we have imported requests, let's go ahead and create a string that holds the address of the website we want to scrape. 
I always call this string \"url\" in my code. " ] }, { "cell_type": "code", "execution_count": 2, "id": "three-action", "metadata": {}, "outputs": [], "source": [ "url = \"https://en.wikipedia.org/wiki/List_of_French_monarchs\"" ] }, { "cell_type": "markdown", "id": "fef3e8b3-eb1d-40d4-b519-918aac67bee2", "metadata": {}, "source": [ "Excellent! Now we can use the requests library to make a call to this particular page. We will do this via the get function in the requests library. The get function has one mandatory argument: the URL of the page that you want to request. In our case, this will be our \"url\" string." ] }, { "cell_type": "code", "execution_count": 3, "id": "invisible-military", "metadata": {}, "outputs": [], "source": [ "s = requests.get(url)" ] }, { "cell_type": "markdown", "id": "ebd78656-86df-4f8a-bc95-ed6a2a3d4f3c", "metadata": {}, "source": [ "Now that we have created a response object, let's take a look at it." ] }, { "cell_type": "code", "execution_count": 4, "id": "d9a4835e-b418-4718-936a-e6e7f497788d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<Response [200]>\n" ] } ], "source": [ "print(s)" ] }, { "cell_type": "markdown", "id": "298f20ef-0204-46e7-8e72-f3be096282b3", "metadata": {}, "source": [ "On the surface, this looks like we may have failed. What is this odd \"Response [200]\" and what does it mean? This particular response is an HTTP status code, and it means that our request to the server was successful. There are many status codes, but 200 is the one we want to see. If you ever see a response that is not 200, you can look up that status code to find out what is happening. Sometimes, a response indicates that your request was blocked. This may be because the website has anti-scraping measures in place. In other cases, the page may be protected, meaning that it lies behind a login. 
So many potential errors may surface that I cannot detail them all in this introductory chapter. I will, however, give you a solution to one very common problem: the 403 response. For that solution, check out the final section of this chapter." ] }, { "cell_type": "markdown", "id": "f8a79a16-0f26-4a71-b1f0-839fa94dc46d", "metadata": {}, "source": [ "## BeautifulSoup" ] }, { "cell_type": "markdown", "id": "9253b7c0-f6fa-4b94-9a2e-6794e8bdef83", "metadata": {}, "source": [ "Now that we have learned how to make a call to a server and store the response (the HTML) in memory, we need a way to parse that data. Buried within the s response object is the HTML content. We can access that data through the content attribute of the response object, which holds the raw bytes of the page. Let's do that and check out the first 100 characters." ] }, { "cell_type": "code", "execution_count": 5, "id": "ee722a71-2b17-4cb0-b635-81df38b69556", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "b'\\n\\n\\n\\n\n", "\n", "\n", "\n", "List of French monarchs - Wikipedia\n", "