Web scraping
Objectives
- Define what web scraping is and the issues surrounding it
- Use the requests and BeautifulSoup modules to parse HTML
- Explain some common problems with web scraping
- Explore other tools that can interact with web pages
Introduction to Web Scraping
- Web scraping involves programmatically grabbing data from a web page
- Three steps: Download, extract data, PROFIT!
Okay...more like, do something with data
Why Scrape?
- There's data on a site that you want to store or analyze
- You can't get it by other means (e.g. an API)
- You want to programmatically grab the data (instead of lots of manual copying/pasting)
Is it...ok?
- Some websites don't want people scraping them
- Best practice: consult the robots.txt file
- If making many requests, space them out (pause between requests)
- If you're too aggressive, your IP can be blocked
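The two practices above (checking robots.txt and spacing out requests) might be sketched with the standard library as follows. The robots.txt content, the `my-scraper` user agent, the example.com URLs, and the `crawl` helper are all hypothetical, and the rules are inlined so the sketch runs without a network connection.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def crawl(urls, fetch, parser, user_agent="my-scraper", sleep=time.sleep):
    """Fetch each allowed URL, pausing between consecutive requests."""
    # Use the site's Crawl-delay if it declares one, else wait 1 second.
    delay = parser.crawl_delay(user_agent) or 1
    results = {}
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # robots.txt disallows this path, so skip it
        if results:
            sleep(delay)  # space requests out to avoid being blocked
        results[url] = fetch(url)
    return results
```

In a real scraper, `fetch` could be a function that calls `requests.get`, and the parser could be filled with `set_url(...)` followed by `read()` instead of inlined text.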
Introduction to Beautiful Soup
Getting started with Beautiful Soup
- To extract data from HTML, we'll use Beautiful Soup
- Install it with pip
- Beautiful Soup lets us navigate through HTML with Python
- Beautiful Soup does NOT download HTML - for this, we need the requests module!
Parsing and Navigating HTML
- BeautifulSoup(html_string, "html.parser") - parse HTML
- Once parsed, there are several ways to navigate:
- By tag name
- Using find - returns the first matching tag (or None if nothing matches)
- Using find_all - returns a list of matching tags
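A minimal sketch of the three navigation styles listed above, run against a small made-up HTML string (the markup is an assumption for illustration only):

```python
from bs4 import BeautifulSoup

# Made-up HTML, used only to illustrate the calls.
html = """
<html><body>
  <h1>Posts</h1>
  <ul>
    <li>First</li>
    <li>Second</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

heading = soup.h1                 # by tag name: the first <h1>
first_item = soup.find("li")      # find: the first matching tag
all_items = soup.find_all("li")   # find_all: every matching tag
missing = soup.find("table")      # no match, so find gives back None

item_texts = [li.get_text() for li in all_items]
```

Because `find` can return None, it is usually worth checking the result before calling methods on it.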
Navigating with CSS Selectors
- select - returns a list of elements matching a CSS selector
Selector Cheatsheet
- Select by id of foo: #foo
- Select by class of bar: .bar
- Select children: div > p
- Select descendants: div p
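The cheatsheet selectors can be tried out with `select`. This is a small sketch on invented markup, chosen so that the child and descendant selectors give different results:

```python
from bs4 import BeautifulSoup

# Invented markup: one <p> is a direct child of the div, one is nested deeper.
html = """
<div id="foo">
  <p class="bar">direct child</p>
  <section><p>nested</p></section>
</div>
<p class="bar">outside</p>
"""

soup = BeautifulSoup(html, "html.parser")

by_id = soup.select("#foo")          # elements with id="foo"
by_class = soup.select(".bar")       # elements with class="bar"
children = soup.select("div > p")    # <p> tags directly inside a <div>
descendants = soup.select("div p")   # <p> tags anywhere inside a <div>

child_texts = [p.get_text() for p in children]
descendant_texts = [p.get_text() for p in descendants]
```

`select` always returns a list, even for an id selector, which is why a single element is usually picked out with `[0]`.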
Selecting Elements by Attribute
# find an element with an id of foo
soup.find(id="foo")
soup.select("#foo")[0]
# find all elements with a class of bar
# careful! "class" is a reserved word in Python
soup.find_all(class_="bar")
soup.select(".bar")
# find all elements with a data-baz attribute
# using the general attrs kwarg
soup.find_all(attrs={"data-baz": True})
soup.select("[data-baz]")
Accessing Data in Elements
- get_text - access the inner text in an element
- name - tag name
- attrs - dictionary of attributes
- You can also access attribute values using brackets!
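A short sketch of these accessors on a single made-up anchor tag:

```python
from bs4 import BeautifulSoup

# Made-up anchor tag with a few attributes and nested markup.
html = '<a href="/blog/post-1" class="link external" data-id="42">Read <b>more</b></a>'

soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

text = link.get_text()      # inner text, including text of nested tags
tag_name = link.name        # the tag's name as a string
attributes = link.attrs     # dictionary of all attributes
href = link["href"]         # bracket access; raises KeyError if missing
classes = link["class"]     # class is multi-valued, so this is a list
title = link.get("title")   # .get returns None for a missing attribute
```

Bracket access is convenient when the attribute is known to exist. `.get` is the safer choice when it might be absent.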
Navigating with Beautiful Soup
Via Tags
- parent / parents
- contents
- next_sibling / next_siblings
- previous_sibling / previous_siblings
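A rough sketch of these tag-based properties. The HTML is made up and deliberately written with no whitespace between tags, because whitespace in the source shows up as text nodes among the siblings and contents.

```python
from bs4 import BeautifulSoup

# No whitespace between tags, so every sibling here is a tag, not a text node.
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"

soup = BeautifulSoup(html, "html.parser")
second = soup.find_all("li")[1]

parent_name = second.parent.name                    # the enclosing <ul>
ancestor_names = [p.name for p in second.parents]   # nearest ancestor first
child_count = len(second.parent.contents)           # direct children of <ul>
next_text = second.next_sibling.get_text()          # the <li> after this one
prev_text = second.previous_sibling.get_text()      # the <li> before this one
later_texts = [s.get_text() for s in second.next_siblings]
```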
Via Searching
- find_parent / find_parents
- find_next_sibling / find_next_siblings
- find_previous_sibling / find_previous_siblings
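The search-based methods take the same kind of filters as `find`, and when given a tag name they skip over whitespace text nodes. A small sketch on invented markup:

```python
from bs4 import BeautifulSoup

# Invented markup with ordinary indentation (so text nodes sit between tags).
html = """
<div class="post">
  <h2>Title</h2>
  <p>Intro</p>
  <p>Body</p>
  <p>Outro</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
intro = soup.find("p")
outro = soup.find_all("p")[-1]

container = intro.find_parent("div")             # nearest <div> ancestor
heading = intro.find_previous_sibling("h2")      # nearest earlier <h2> sibling
next_p = intro.find_next_sibling("p")            # nearest later <p> sibling
later_ps = intro.find_next_siblings("p")         # all later <p> siblings
earlier_ps = outro.find_previous_siblings("p")   # all earlier <p>, nearest first

later_texts = [p.get_text() for p in later_ps]
earlier_texts = [p.get_text() for p in earlier_ps]
```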
Web Scraping Example with Beautiful Soup
Requests + Beautiful Soup Example
- Let's scrape data into a CSV!
- Goal: Grab all links from Rithm School blog
- Data: store URL, anchor tag text, and date
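One way the example above could look is sketched here. The slides do not show the blog's actual HTML, so the `article.post` and `time` selectors, the sample markup, and the helper names `fetch_html`, `extract_posts`, and `write_csv` are all assumptions for illustration. The sketch runs on inlined sample HTML so it works without a network connection.

```python
import csv
import io

import requests
from bs4 import BeautifulSoup

# Hypothetical markup: the real blog's structure is an assumption here.
SAMPLE_HTML = """
<article class="post">
  <a href="/blog/first-post">First Post</a>
  <time>2018-01-15</time>
</article>
<article class="post">
  <a href="/blog/second-post">Second Post</a>
  <time>2018-02-01</time>
</article>
"""


def fetch_html(url):
    """Download a page and return its HTML (not called in this offline sketch)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def extract_posts(html):
    """Return one dict per post with its URL, anchor text, and date."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for article in soup.select("article.post"):
        link = article.find("a")
        date = article.find("time")
        if link is None:
            continue  # skip posts without a link
        rows.append({
            "url": link.get("href", ""),
            "text": link.get_text(strip=True),
            "date": date.get_text(strip=True) if date else "",
        })
    return rows


def write_csv(rows, file_obj):
    """Write the rows to an open file object as CSV with a header row."""
    writer = csv.DictWriter(file_obj, fieldnames=["url", "text", "date"])
    writer.writeheader()
    writer.writerows(rows)


rows = extract_posts(SAMPLE_HTML)
buffer = io.StringIO()  # stands in for a real file
write_csv(rows, buffer)
```

For a real run, the HTML would come from `fetch_html(url)` and the output would go to a file opened with `open("posts.csv", "w", newline="")`.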
Common Issues with Web Scraping
- Gnarly HTML
- Code tightly coupled to UI
- Sanitizing data after grabbing it
- Data that isn't part of HTML, but is loaded later!
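A couple of these issues (messy markup and sanitizing data) can be softened with small defensive helpers. The sketch below uses made-up HTML, and the helpers `text_or_default` and `parse_price` are hypothetical names, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Made-up HTML with stray whitespace, as scraped pages often have.
html = """
<div class="title">
     Widget   Pro
</div>
<div class="price">  $1,299.00
</div>
"""


def text_or_default(soup, selector, default=""):
    """Return whitespace-normalized text for a selector, or a default if absent."""
    element = soup.select_one(selector)
    if element is None:
        return default  # the page layout changed or the element is missing
    return " ".join(element.get_text().split())


def parse_price(raw):
    """Turn a string such as '$1,299.00' into a float."""
    return float(raw.replace("$", "").replace(",", ""))


soup = BeautifulSoup(html, "html.parser")
title = text_or_default(soup, ".title")
price = parse_price(text_or_default(soup, ".price"))
rating = text_or_default(soup, ".rating", default="n/a")
```

Helpers like these do not fix the coupling to the page's UI, but they make a scraper fail more gracefully when the markup shifts.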
Other Tools for Web Scraping
Other Tools
- Scrapy: https://scrapy.org/
- Selenium: http://www.seleniumhq.org/
Scrapy
- A more streamlined way to build web crawlers, which can programmatically navigate across multiple pages
- Can export to many different file formats from the command line
Selenium
- Allows you to open up a browser window from your code!
- Often used with testing
- Requires a driver for your browser of choice
- Doesn't navigate through the page until all contents have loaded
Recap
- Web scraping is the process of downloading, extracting, and storing data from a web page
- It's helpful when there's no other way to grab data you want
- Be sure you're allowed to scrape before you do so
- BeautifulSoup + requests allow you to scrape websites in Python
- Building scrapers can take time up front, but should save you time in the long term
- Other helpful tools include Scrapy and Selenium
YOUR TURN
Web Scraping
By colt