{
"cells": [
{
"cell_type": "markdown",
"id": "03b14f3a",
"metadata": {},
"source": [
"# Web Scraping\n",
"\n",
"Most webpages are created for humans, but that doesn't mean we can't write computers to \"scrape\" information out of them. Today, we're going to learn selenium, a powerful tool that allows us to automate typing, clicks, and other actions in a real web browser. It will help us pull data from sites with a number of challenging designs, several of which are illustrated by simple examples here: https://tyler.caraza-harter.com/cs320/tricky/scrape.html\n",
"\n",
"First, let's take a look web scraping via two simpler tools, `requests` and `BeautifulSoup`.\n",
"\n",
"## requests and BeautifulSoup\n",
"\n",
"If you don't already have these packages, install them: `pip3 install requests beautifulsoup4`. `requests` lets us to execute GET (download) and other HTTP requests. Often, the file requested might be an HTML file. In this case, BeautifulSoup lets us extract information from HTML tags.\n",
"\n",
"We'll try to scrape the tabels from this page: https://tyler.caraza-harter.com/cs320/tricky/page1.html. Visit it, right click on the page, then click \"View Page Source\".\n",
"\n",
"You'll see something like this:"
]
},
{
"cell_type": "markdown",
"id": "3e078b28",
"metadata": {},
"source": [
"```html\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" Welcome
\n",
" Here's a table
\n",
" \n",
" A | B | C |
\n",
" 1 | 2 | 3 |
\n",
" 4 | 5 | 6 |
\n",
"
\n",
"\n",
" And another one...
\n",
" \n",
"\n",
"```"
]
},
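{
"cell_type": "markdown",
"id": "b7c4e2a1",
"metadata": {},
"source": [
"Before we dig into the tags, here is a minimal sketch of how the two packages fit together on this page: download page1.html with `requests`, parse it with `BeautifulSoup`, and print the text in each table row. The URL is the page1.html link above; the variable names are just for illustration.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c8d5f3b2",
"metadata": {},
"outputs": [],
"source": [
"# minimal sketch: fetch page1.html and print the text in every table row\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"\n",
"url = \"https://tyler.caraza-harter.com/cs320/tricky/page1.html\"\n",
"r = requests.get(url)\n",
"r.raise_for_status()  # raise an exception if the GET failed (e.g., 404)\n",
"\n",
"doc = BeautifulSoup(r.text, \"html.parser\")\n",
"for tbl in doc.find_all(\"table\"):    # every <table> on the page\n",
"    for tr in tbl.find_all(\"tr\"):    # every row in that table\n",
"        print([td.get_text() for td in tr.find_all(\"td\")])\n",
"    print(\"----\")"
]
},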
{
"cell_type": "markdown",
"id": "fa3f0b8a",
"metadata": {},
"source": [
"Inside the `\\n \n",
" \n",
" \n",
" \n",
" Welcome
\n",
" Here's a table
\n",
" \n",
" A | B | C |
\n",
" 1 | 2 | 3 |
\n",
" 4 | 5 | 6 |
\n",
"
\n",
"\n",
" And another one...
\n",
" \n",
"\n",
"\n",
"x | y |
\n",
"0 | 1 |
\n",
"2 | 3 |
\n",
"4 | 5 |
\n",
"6 | 7 |
\n",
"8 | 9 |
\n",
"10 | 11 |
\n",
"12 | 13 |
\n",
"14 | 15 |
\n",
"16 | 17 |
\n",
"18 | 19 |
\n",
"