Python excels in web scraping due to its versatile libraries. I’ve used these tools extensively to gather data from diverse websites, each with unique requirements. Here’s a practical overview of six essential libraries, complete with code samples from real projects.
Beautiful Soup handles HTML parsing elegantly. When I needed product details from an e-commerce site, it efficiently processed messy markup. Install it with `pip install beautifulsoup4`. Consider this product page extraction:
```python
from bs4 import BeautifulSoup
import requests

url = "https://example-store.com/products"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

product_cards = soup.find_all("div", class_="product-card")
for card in product_cards:
    name = card.find("h3").text.strip()
    price = card.find("span", class_="price").text
    print(f"{name}: {price}")
```
The `find_all` method locates repeating elements, while `find` extracts specifics. For complex hierarchies, chain CSS selectors like `card.select("div > a.tag")`. I often pair Beautiful Soup with Requests for static sites; it has saved me hours on data extraction tasks.
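For example, the same cards can be walked with a pure CSS-selector pass; the class names below are simply the ones assumed in the snippet above:

```python
# select() returns a list of Tag objects matching a CSS selector
for link in soup.select("div.product-card div > a.tag"):
    print(link.get_text(strip=True), link.get("href"))
```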
Scrapy scales to industrial-level scraping. Building a spider for news archives, I processed 50,000 pages daily. Start a project with `scrapy startproject news_crawler`, then define items in `items.py`:
```python
import scrapy

class NewsItem(scrapy.Item):
    headline = scrapy.Field()
    author = scrapy.Field()
    publish_date = scrapy.Field()
```
Create a spider in `spiders/news.py`:
```python
import scrapy

class NewsSpider(scrapy.Spider):
    name = "news_spider"
    start_urls = ["https://example-news.com/archives"]

    def parse(self, response):
        articles = response.css("article.post")
        for article in articles:
            yield {
                "headline": article.css("h2.title::text").get(),
                "author": article.css("span.byline::text").get(),
                "date": article.xpath(".//time/@datetime").get(),
            }
        # Follow pagination until no next link is found
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```
Run it with `scrapy crawl news_spider -o output.json`. The built-in scheduler handles concurrency and retries. For e-commerce scraping, I added auto-throttling in `settings.py` to prevent bans: `AUTOTHROTTLE_ENABLED = True`.
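A sketch of the throttling block I put in `settings.py`; the numbers are starting points I tune per site, not defaults you must use:

```python
# settings.py – polite crawling; adjust per target site
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0          # initial download delay in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # average concurrent requests per domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4
RETRY_TIMES = 3                         # retries handled by the retry middleware
```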
Selenium automates browsers for JavaScript-heavy sites. When a real estate portal loaded listings dynamically, this script worked:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example-homes.com/listings")

try:
    # Wait up to 10 seconds for the dynamically loaded cards to appear
    listings = WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.listing-card"))
    )
    for listing in listings:
        address = listing.find_element(By.CLASS_NAME, "address").text
        beds = listing.find_element(By.XPATH, ".//span[@data-role='beds']").text
        print(f"{address} | Beds: {beds}")
finally:
    driver.quit()
```
Explicit waits prevent timing issues. For login-protected data, I use `send_keys()`:
```python
driver.find_element(By.ID, "username").send_keys("user@example.com")  # placeholder credentials
driver.find_element(By.ID, "password").send_keys("secure_pass123")
driver.find_element(By.XPATH, "//button[text()='Login']").click()
```
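On a server I run the same flow headless; with Selenium 4 the Chrome setup looks roughly like this (the exact flags can vary with your Chrome version):

```python
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")          # no visible browser window
options.add_argument("--window-size=1920,1080") # consistent layout for selectors
driver = webdriver.Chrome(options=options)
```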
Requests manages HTTP operations cleanly. When APIs aren’t available, I simulate sessions:
```python
import requests

session = requests.Session()
login_payload = {"user": "my_user", "pass": "secure123"}
session.post("https://example.com/login", data=login_payload)

profile_page = session.get("https://example.com/profile")
print(f"Logged in as: {session.cookies.get('username')}")  # cookies persist on the session
```
For paginated APIs, this pattern works well:
```python
page = 1
while True:
    response = requests.get(
        f"https://api.example-data.com/records?page={page}",
        headers={"Authorization": "Bearer API_KEY123"},
    )
    data = response.json()
    if not data["results"]:
        break  # an empty page means we've reached the end
    process_records(data["results"])  # your own handler for each batch
    page += 1
```
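In practice I also give each call a timeout and raise on bad status codes, so a flaky endpoint fails loudly instead of looping forever; for example:

```python
response = requests.get(
    f"https://api.example-data.com/records?page={page}",
    headers={"Authorization": "Bearer API_KEY123"},
    timeout=10,               # fail fast on a hung connection
)
response.raise_for_status()   # surface 4xx/5xx before touching the JSON
```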
lxml delivers speed for large XML datasets. Parsing a 2GB sitemap took seconds:
```python
from lxml import etree

parser = etree.XMLParser(recover=True)  # tolerate minor markup errors
tree = etree.parse("sitemap.xml", parser)

# Note: standard sitemaps declare the sitemaps.org namespace; if yours does, use
# namespaces={"s": "http://www.sitemaps.org/schemas/sitemap/0.9"} and "//s:loc/text()"
urls = tree.xpath("//loc/text()")

with open("urls.txt", "w") as f:
    f.write("\n".join(urls))
```
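When a file is too large to hold comfortably in memory, a streaming pass with `etree.iterparse` is a reasonable alternative; this sketch assumes the standard sitemaps.org namespace, so check it against your file:

```python
# Stream <loc> elements one at a time and clear them once written,
# so the full tree never sits in memory.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
with open("urls.txt", "w") as f:
    for _, elem in etree.iterparse("sitemap.xml", tag=NS + "loc"):
        f.write(elem.text + "\n")
        elem.clear()
```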
For HTML, you can mix XPath and CSS selectors (the latter requires the `cssselect` package):
```python
html = etree.HTML(response.content)  # response from an earlier requests call
titles = html.xpath("//div[contains(@class,'product')]/h3/text()")
prices = html.cssselect("div.product > span.price")
```
PyQuery offers jQuery-style syntax, a natural fit for frontend developers. Scraping a forum:
```python
from pyquery import PyQuery as pq

doc = pq(url="https://example-forum.com/python")
threads = doc("div.thread-list > div.thread")

for thread in threads:
    item = pq(thread)  # wrap the raw lxml element back into a PyQuery object
    title = item.find("h3").text()
    replies = item("span.reply-count").text()
    print(f"Topic: {title} ({replies} replies)")
```
Chain methods for complex queries:

```python
# The second-to-last pagination child is typically the last page number
last_page = doc("ul.pagination").children().eq(-2).text()
```
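Attribute access chains the same way; a hypothetical next-page link, for instance, could be followed like this:

```python
# .attr() reads an attribute from the first matched element (selector assumed)
next_href = doc("a.next-page").attr("href")
if next_href:
    doc = pq(url=next_href)
```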
Key Considerations (tied together in the sketch after this list):
- Rotate user-agents: `headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}`
- Handle errors with retries: `from tenacity import retry, stop_after_attempt`
- Respect robots.txt: `from urllib import robotparser; rp = robotparser.RobotFileParser()`
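Here is a minimal sketch combining the three, assuming the `tenacity` package is installed and a made-up pool of user-agent strings:

```python
import random
import requests
from tenacity import retry, stop_after_attempt, wait_exponential
from urllib import robotparser

# Hypothetical pool of user-agent strings to rotate through
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def fetch(url):
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # raise so tenacity retries on HTTP errors
    return response

url = "https://example.com/products"
if rp.can_fetch("*", url):       # only request what robots.txt allows
    page = fetch(url)
```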
These tools form a versatile scraping toolkit. I choose based on project needs: Beautiful Soup for quick extracts, Scrapy for pipelines, Selenium for dynamic content. Always verify site permissions before scraping.