Web Scraping

Most webpages are created for humans, but that doesn't mean we can't write programs to "scrape" information out of them. Today, we're going to learn Selenium, a powerful tool that allows us to automate typing, clicks, and other actions in a real web browser. It will help us pull data from sites with a number of challenging designs, several of which are illustrated by simple examples here: https://tyler.caraza-harter.com/cs320/tricky/scrape.html

First, let's take a look at web scraping via two simpler tools, requests and BeautifulSoup.

requests and BeautifulSoup

If you don't already have these packages, install them: pip3 install requests beautifulsoup4. requests lets us execute GET (download) and other HTTP requests. Often, the requested file is an HTML file; in that case, BeautifulSoup lets us extract information from its tags.

We'll try to scrape the tables from this page: https://tyler.caraza-harter.com/cs320/tricky/page1.html. Visit it, right-click on the page, then click "View Page Source".

You'll see something like this:

<html>
  <head>
    <script src="https://code.jquery.com/jquery-3.4.1.js"></script>
    <script>
      ... LOTS OF JAVASCRIPT CODE HERE...
    </script>
  </head>
  <body onload="main()">
    <h1>Welcome</h1>
    <h3>Here's a table</h3>
    <table border=1 id='alpha'>
      <tr><td>A</td><td>B</td><td>C</td></tr>
      <tr><td>1</td><td>2</td><td>3</td></tr>
      <tr><td>4</td><td>5</td><td>6</td></tr>
    </table>

    <h3>And another one...</h3>
  </body>
</html>

Inside the <script> tags there is code in the JavaScript programming language. Once the page is loaded in the browser, this code starts executing, and may make some changes to the tags/elements.

In the above HTML, we see one table (<table>); however, the JavaScript code will automatically generate a second table. With the requests module, we can only grab the version of the page before the JavaScript runs. So in this example, we'll just extract data from that first table (later we'll use Selenium to get data from the second table too).

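A minimal sketch of the fetch (the variable names url and r are my own):

import requests

url = "https://tyler.caraza-harter.com/cs320/tricky/page1.html"
r = requests.get(url)   # send an HTTP GET request; returns a Response object
r.raise_for_status()    # raise an exception if we got an error page
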
Above, we see we can use .get to request a web page; a Response object is returned. The .raise_for_status() call makes sure we crash if it is an error page (such as a "404 missing" error).

Once we're sure the page is good, we can access the .text attribute of the Response object; as shown below, this is a regular string.
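
For instance (continuing with the r from above):

html = r.text    # the body of the response, as a regular str
type(html)       # str
html[:50]        # the first 50 characters of the HTML source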

We'll use BeautifulSoup to convert this text to a BeautifulSoup object, which is a searchable tree of elements.
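
A sketch (doc is my name for the parsed tree):

from bs4 import BeautifulSoup

doc = BeautifulSoup(r.text, "html.parser")  # parse the HTML text into a searchable tree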

Looks like there are three tr tags (table rows) on the page:
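
Something like this, assuming the doc from above:

trs = doc.findAll("tr")   # every <tr> in the (pre-JavaScript) HTML
len(trs)                  # 3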

They are represented as Tag objects; Tag objects look like HTML when converted to a string. Alternatively, the .text attribute shows us the raw content, without all the surrounding HTML.
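
For example:

tr = trs[0]
type(tr)    # bs4.element.Tag
str(tr)     # '<tr><td>A</td><td>B</td><td>C</td></tr>'
tr.text     # 'ABC'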

The findAll method can be used on the whole page, or to search within a single element/Tag.
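
A sketch of both:

doc.findAll("td")       # all nine cells, anywhere in the page
trs[0].findAll("td")    # only the three cells inside the first row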

Let's write a function that does three things (sketched after the list):

  1. fetch a page using requests.get
  2. parse that page with BeautifulSoup
  3. search through the tree for rows/cells, placing them in a list of lists
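
Here's one possible sketch (the name fetch_rows is my own):

import requests
from bs4 import BeautifulSoup

def fetch_rows(url):
    r = requests.get(url)                        # 1. fetch the page
    r.raise_for_status()
    doc = BeautifulSoup(r.text, "html.parser")   # 2. parse it
    rows = []
    for tr in doc.findAll("tr"):                 # 3. rows/cells => list of lists
        rows.append([td.text for td in tr.findAll("td")])
    return rows

rows = fetch_rows("https://tyler.caraza-harter.com/cs320/tricky/page1.html")
rows   # [['A', 'B', 'C'], ['1', '2', '3'], ['4', '5', '6']]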

The above is a short step away from getting a useful DataFrame:
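
For example, assuming the rows list from fetch_rows above:

import pandas as pd

df = pd.DataFrame(rows[1:], columns=rows[0])  # the first row holds the headers
df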

Selenium

Selenium is able to automate clicks and typing in an actual web browser. This will let us do things like take a screenshot of a page and actually execute the JavaScript in the <script> tags (to get the data in the second table too).

Selenium has some features of both requests and BeautifulSoup: it can both grab content, and provide a searchable tree.

Let's create a web browser (of type WebDriver):
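
Something like this (note: in newer Selenium releases, options.add_argument("--headless") replaces the headless attribute):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True                # hide the browser window
b = webdriver.Chrome(options=options)  # b is a WebDriver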

Setting headless = True means the browser is hidden in the background (this is necessary to run on a virtual machine without graphics). But we can still manipulate this browser and see what it is doing by taking screenshots:
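
For example:

b.get("https://tyler.caraza-harter.com/cs320/tricky/page1.html")  # load the page
b.save_screenshot("page1.png")   # save a PNG of what the browser shows right now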

Looks like that second table wasn't loaded yet. But if we wait a few seconds, it will be.
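
A sketch:

import time

time.sleep(5)                         # let the JavaScript finish building the second table
b.save_screenshot("page1-later.png")  # this time, both tables appear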

Here's what happened in the browser:

  1. Chrome converted HTML tags to elements in the DOM (Document Object Model)
  2. JavaScript added some elements to the DOM (for the second table)
  3. the new version with both tables got shown

Selenium's .page_source attribute lets us get the new version of the page back as HTML (notice there are two <table>'s instead of the original one -- requests could only see the first one, but Selenium also sees the one generated by the JavaScript code).
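
For example:

html = b.page_source    # HTML for the current, post-JavaScript version of the page
html.count("<table")    # 2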

One option would be to use this updated HTML string with BeautifulSoup to search for tables and data:
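
A sketch:

doc = BeautifulSoup(b.page_source, "html.parser")
for tr in doc.findAll("tr"):                    # now we get rows from BOTH tables
    print([td.text for td in tr.findAll("td")])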

Alternatively, Selenium has a (somewhat clunkier) interface for directly searching for elements:
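
For example, with By-style locators (older Selenium versions spelled this find_elements_by_tag_name):

from selenium.webdriver.common.by import By

elems = b.find_elements(By.TAG_NAME, "tr")  # a list of WebElement objects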

Instead of BeautifulSoup Tag objects, we get Selenium WebElement objects. Fortunately, both have a .text attribute to see the raw data (without surrounding HTML tags):
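
For example:

for e in elems:
    print(e.text)   # e.g. 'A B C' for the header row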

Conclusion

A web browser like Chrome understands both HTML and JavaScript (and is able to run JavaScript that may update the elements corresponding to HTML tags). For simple pages, the requests module is a great way to fetch HTML and other resources, without any JavaScript engine. requests does not parse HTML, so it is often used in conjunction with BeautifulSoup (which can parse an HTML document into a searchable tree structure).

Selenium is much slower than requests, but it can control a real web browser (capable of executing JavaScript), so it lets data scientists scrape many pages that would not otherwise be scrapable. Selenium also provides a searchable tree of elements, like BeautifulSoup, but its methods tend to be less convenient, so one may still choose to use BeautifulSoup alongside Selenium (though it's less necessary there than when working with requests).