Introduction to Web Scraping
Okay...more like, do something with data
Why Scrape?
Is it...ok?
Introduction to Beautiful Soup
Getting started with Beautiful Soup
Parsing and Navigating HTML
BeautifulSoup(html_string, "html.parser")
- parse HTML
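A minimal sketch of the constructor call above. The HTML string and the variable names are made up for illustration; "html.parser" is the parser that ships with Python, so nothing beyond bs4 needs installing.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration.
html_string = (
    "<html><head><title>Demo</title></head>"
    "<body><p>Hello, <b>world</b>!</p></body></html>"
)

# Build a parse tree from the string using the standard-library parser.
soup = BeautifulSoup(html_string, "html.parser")

# Attribute access on the soup returns the first tag with that name.
title_text = soup.title.string
first_paragraph = soup.p
```

The resulting soup object is the starting point for every search and navigation call on the following slides.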
find
- returns the first matching tag, or None if nothing matches
find_all
- returns a list of matching tags
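A small sketch contrasting the two calls. The list markup and variable names are assumptions for the example, not part of the original slides.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html_string = """
<ul>
  <li class="item">one</li>
  <li class="item">two</li>
</ul>
"""
soup = BeautifulSoup(html_string, "html.parser")

first_item = soup.find("li")      # a single Tag (the first match)
all_items = soup.find_all("li")   # a list-like collection of every match
missing = soup.find("table")      # no <table> in the document, so None

item_texts = [li.get_text() for li in all_items]
```

Because find can return None, it is worth checking the result before calling methods on it.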
Navigating with CSS Selectors
select
- returns a list of elements matching a CSS selector
Selector Cheatsheet
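#foo
- the element with an id of foo
.bar
- elements with a class of bar
div > p
- p elements that are direct children of a div
div p
- p elements nested anywhere inside a div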
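The four selectors in the cheatsheet can be tried against a small document. This is a sketch with made-up markup; the only API used is select, which returns a list of matching tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html_string = """
<div id="foo">
  <p class="bar">direct child</p>
  <section><p>nested descendant</p></section>
</div>
<p class="bar">outside the div</p>
"""
soup = BeautifulSoup(html_string, "html.parser")

by_id = soup.select("#foo")              # the element whose id is "foo"
by_class = soup.select(".bar")           # every element with class "bar"
direct_children = soup.select("div > p") # <p> tags directly inside a <div>
descendants = soup.select("div p")       # <p> tags anywhere inside a <div>

direct_texts = [p.get_text() for p in direct_children]
descendant_texts = [p.get_text() for p in descendants]
```

The difference between the last two is the nested paragraph: it sits inside a section, so only the descendant selector picks it up.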
Selecting Elements by Attribute
# find an element with an id of foo
soup.find(id="foo")
soup.select("#foo")[0]
# find all elements with a class of bar
# careful! "class" is a reserved word in Python
soup.find_all(class_="bar")
soup.select(".bar")
# find all elements that have a data-baz attribute
# (hyphenated names are not valid keyword arguments,
# so use the general attrs kwarg)
soup.find_all(attrs={"data-baz": True})
soup.select("[data-baz]")
Accessing Data in Elements
get_text
- access the inner text in an element
name
- tag name
attrs
- dictionary of attributes
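A short sketch of the three accessors above on a single tag. The link markup is hypothetical. One detail worth knowing: class can hold several values, so Beautiful Soup stores it as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html_string = '<a id="home" class="nav link" href="/index.html">Go <b>home</b></a>'
soup = BeautifulSoup(html_string, "html.parser")
link = soup.find("a")

text = link.get_text()           # inner text, including text of child tags
tag_name = link.name             # the tag's name as a string
attributes = link.attrs          # dict of all attributes on the tag

href = link["href"]              # dictionary-style access to one attribute
maybe_title = link.get("title")  # .get returns None when the attribute is absent
```

Using link.get instead of link["title"] avoids a KeyError when an attribute may be missing.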
Navigating with Beautiful Soup
Via Tags
parent / parents
contents
next_sibling / next_siblings
previous_sibling / previous_siblings
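A sketch of the tag-based navigation attributes listed above. The markup is hypothetical and deliberately written without whitespace between tags, because whitespace in the source becomes text nodes that also show up as siblings and in contents.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; no whitespace between tags, so every sibling is a Tag.
html_string = '<ul id="menu"><li>first</li><li>second</li><li>third</li></ul>'
soup = BeautifulSoup(html_string, "html.parser")

items = soup.find_all("li")
first, second = items[0], items[1]

parent_name = second.parent.name                # the enclosing <ul>
ancestor_names = [p.name for p in second.parents]  # every ancestor, nearest first
children = soup.ul.contents                     # list of the <ul>'s direct children

next_text = second.next_sibling.get_text()
previous_text = second.previous_sibling.get_text()
later_texts = [s.get_text() for s in second.next_siblings]
nothing_before_first = first.previous_sibling   # None: there is no earlier sibling
```

The plural forms (parents, next_siblings, previous_siblings) are generators, so they are wrapped in a list comprehension here to inspect them.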
Via Searching
find_parent / find_parents
find_next_sibling / find_next_siblings
find_previous_sibling / find_previous_siblings
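A sketch of the search-based navigation methods listed above. The markup and ids are made up for the example. Unlike the plain sibling attributes, these methods take the same filters as find, so asking for a tag name skips over whitespace text nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration (whitespace between tags is fine here).
html_string = """
<div id="outer">
  <div id="card">
    <h2>Title</h2>
    <p>intro</p>
    <span>note</span>
    <p>body</p>
  </div>
</div>
"""
soup = BeautifulSoup(html_string, "html.parser")
heading = soup.find("h2")
last_paragraph = soup.find_all("p")[-1]

nearest_div_id = heading.find_parent("div")["id"]
all_div_ids = [d["id"] for d in heading.find_parents("div")]  # nearest first

next_p_text = heading.find_next_sibling("p").get_text()
following_p_texts = [p.get_text() for p in heading.find_next_siblings("p")]

previous_p_text = last_paragraph.find_previous_sibling("p").get_text()
earlier_p_texts = [p.get_text() for p in last_paragraph.find_previous_siblings("p")]
```

The singular methods return one tag or None; the plural methods return a list, which may be empty.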
Web Scraping Example with Beautiful Soup
Requests + Beautiful Soup Example
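One possible shape for a Requests + Beautiful Soup script, as a sketch rather than the example from the original slides. The function names fetch_html, extract_links, and main are hypothetical, and https://example.com is a placeholder URL. Fetching and parsing are kept in separate functions so the parsing step can be tried on a saved HTML string without touching the network.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML text; raises on HTTP error codes."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def extract_links(html, base_url):
    """Return (link text, absolute URL) pairs for every <a> that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        text = anchor.get_text(strip=True)
        # Resolve relative hrefs such as "/about" against the page URL.
        links.append((text, urljoin(base_url, anchor["href"])))
    return links


def main():
    # Placeholder URL (an assumption, not taken from the slides).
    page_url = "https://example.com"
    for text, href in extract_links(fetch_html(page_url), page_url):
        print(text, href)
```

Calling main() performs a live HTTP request, so it is left uncalled here; extract_links can be exercised on any HTML string.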
Common Issues with Web Scraping
Other Tools for Web Scraping
Other Tools
Scrapy
Selenium
Recap