Web scraping

Objectives

Define what web scraping is and the issues surrounding it
Use the requests and BeautifulSoup modules to parse HTML
Explain some common problems with web scraping
Explore other tools that can interact with web pages

Introduction to Web Scraping

Web scraping involves programmatically grabbing data from a web page
Three steps: Download, extract data, PROFIT!

Okay...more like, do something with data

Why Scrape?

There's data on a site that you want to store or analyze
You can't get by other means (e.g. an API)
You want to programmatically grab the data (instead of lots of manual copying/pasting)

Is it...ok?

Some websites don't want people scraping them
Best practice: consult the robots.txt file
If making many requests, time them out
If you're too aggressive, your IP can be blocked

Introduction to Beautiful Soup

Getting started with Beautiful Soup

To extract data from HTML, we'll use Beautiful Soup
Install it with pip
Beautiful Soup lets us navigate through HTML with Python
Beautiful Soup does NOT download HTML - for this, we need the requests module!

Parsing and Navigating HTML

BeautifulSoup(html_string, "html.parser") - parse HTML
Once parsed, There are several ways to navigate:

By Tag Name
Using find - returns one matching tag
Using find_all - returns a list of matching tags

Navigating with CSS Selectors

select - returns a list of elements matching a CSS selector

Selector Cheatsheet

Select by id of foo: #foo
Select by class of bar: .bar
Select children: div > p
Select descendents: div p

Selecting Elements by Attribute

# find an element with an id of foo
soup.find(id="foo")
soup.select("#foo")[0]

# find all elements with a class of bar
# careful! "class" is a reserved word in Python
soup.find_all(class_="bar")
soup.select(".bar")

# find all elements with a data
# attribute of "baz"
# using the general attrs kwarg
soup.find_all(attrs={"data-baz": True})
soup.select("[data-baz]")

Accessing Data in Elements

get_text - access the inner text in an element
name - tag name
attrs - dictionary of attributes
You can also access attribute values using brackets!

Navigating with Beautiful Soup

Via Tags

```
parent / parents
```
```
contents
```
```
next_sibling / next_siblings
```
```
previous_sibling / previous_siblings
```

Via Searching

```
find_parent / find_parents
```
```
find_next_sibling / find_next_siblings
```

find_previous_sibling / find_previous_siblings

Web Scraping Example with Beautiful Soup

Requests + Beautiful Soup Example

Let's scrape data into a CSV!
Goal: Grab all links from Rithm School blog
Data: store URL, anchor tag text, and date

Common Issues with Web Scraping

Gnarly HTML
Code tightly coupled to UI
Sanitizing data after grabbing it
Data that isn't part of HTML, but is loaded later!

Other Tools for Web Scraping

Other Tools

Scrapy: https://scrapy.org/
Selenium: http://www.seleniumhq.org/

Scrapy

A more streamlined way to build web crawlers, which can programmatically navigate across multiple pages
Can export to many different file formats from the command line

Selenium

Allows you to open up a browser window from your code!
Often used with testing
Requires a driver for your browser of choice
Doesn't navigate through the page until all contents have loaded

Recap

Web scraping is the process of downloading, extracting, and storing data from a web page
It's helpful when there's no other way to grab data you want
Be sure you're allowed to scrape before you do so
BeautifulSoup + requests allow you to scrape websites in Python
Building scrapers can take time up front, but should save you time in the long term
Other helpful tools include Scrapy and Selenium

Web scraping

Objectives

YOUR TURN