{
"cells": [
{
"cell_type": "markdown",
"id": "03b14f3a",
"metadata": {},
"source": [
"# Web Scraping\n",
"\n",
"Most webpages are created for humans, but that doesn't mean we can't write computers to \"scrape\" information out of them. Today, we're going to learn selenium, a powerful tool that allows us to automate typing, clicks, and other actions in a real web browser. It will help us pull data from sites with a number of challenging designs, several of which are illustrated by simple examples here: https://tyler.caraza-harter.com/cs320/tricky/scrape.html\n",
"\n",
"First, let's take a look web scraping via two simpler tools, `requests` and `BeautifulSoup`.\n",
"\n",
"## requests and BeautifulSoup\n",
"\n",
"If you don't already have these packages, install them: `pip3 install requests beautifulsoup4`. `requests` lets us to execute GET (download) and other HTTP requests. Often, the file requested might be an HTML file. In this case, BeautifulSoup lets us extract information from HTML tags.\n",
"\n",
"We'll try to scrape the tabels from this page: https://tyler.caraza-harter.com/cs320/tricky/page1.html. Visit it, right click on the page, then click \"View Page Source\".\n",
"\n",
"You'll see something like this:"
]
},
{
"cell_type": "markdown",
"id": "3e078b28",
"metadata": {},
"source": [
"```html\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" Welcome
\n",
" Here's a table
\n",
" \n",
" A | B | C |
\n",
" 1 | 2 | 3 |
\n",
" 4 | 5 | 6 |
\n",
"
\n",
"\n",
" And another one...
\n",
" \n",
"\n",
"```"
]
},
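{
"cell_type": "markdown",
"id": "b7c4e2a1",
"metadata": {},
"source": [
"Before we dig into the tags, here is a minimal sketch of how the two packages fit together on this page: download page1.html with `requests`, parse it with `BeautifulSoup`, and print the text in each table row. The URL is the page1.html link above; the variable names are just for illustration.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c8d5f3b2",
"metadata": {},
"outputs": [],
"source": [
"# minimal sketch: fetch page1.html and print the text in every table row\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"url = \"https://tyler.caraza-harter.com/cs320/tricky/page1.html\"\n",
"r = requests.get(url)\n",
"r.raise_for_status()  # raise an exception if the GET failed (e.g., 404)\n",
"\n",
"doc = BeautifulSoup(r.text, \"html.parser\")\n",
"for tbl in doc.find_all(\"table\"):    # every <table> on the page\n",
"    for tr in tbl.find_all(\"tr\"):    # every row in that table\n",
"        print([td.get_text() for td in tr.find_all(\"td\")])\n",
"    print(\"----\")"
]
},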
{
"cell_type": "markdown",
"id": "fa3f0b8a",
"metadata": {},
"source": [
"Inside the `\\n \n",
" \n",
" \n",
" \n",
" Welcome
\n",
" Here's a table
\n",
" \n",
" A | B | C |
\n",
" 1 | 2 | 3 |
\n",
" 4 | 5 | 6 |
\n",
"
\n",
"\n",
" And another one...
\n",
" \n",
"\n",
"\n",
"x | y |
\n",
"0 | 1 |
\n",
"2 | 3 |
\n",
"4 | 5 |
\n",
"6 | 7 |
\n",
"8 | 9 |
\n",
"10 | 11 |
\n",
"12 | 13 |
\n",
"14 | 15 |
\n",
"16 | 17 |
\n",
"18 | 19 |
\n",
"