11.4 Scraping and Extracting Data
Course: Claude Code - Power User
Section: Accessing the Web
Video Length: 2-5 minutes
Presenter: Daniel Treasure
Opening Hook
"You need data from a website: product prices, table rows, a list of links, JSON embedded in a page. You could manually copy-paste. Or you could let Claude scrape it—find the elements, extract the data, and give it back to you in a clean, structured format."
Key Talking Points
1. What Data Scraping Is
- Extracting structured information from web pages
- Can be static (HTML tables, lists) or dynamic (JavaScript-rendered content)
- Returns data in usable formats: JSON, CSV, arrays, objects
- Uses CSS selectors or XPath to target specific elements
- Often combined with Playwright for dynamic/interactive pages
- Respects robots.txt and rate limits—use responsibly
What to say: "Scraping is taking a web page and pulling out the useful data. A product listing becomes a spreadsheet. A table becomes JSON. A list of links becomes an array. Claude does the extraction work."
What to show on screen: Show before/after: raw HTML page vs. clean extracted data in JSON or CSV format.
2. Static vs. Dynamic Content
- Static: the HTML already contains the data. Use WebFetch; Claude parses the returned markdown. Fast and simple.
- Dynamic: Content loads via JavaScript. Use Playwright to wait for content, then extract. Slower but necessary for modern web apps.
What to say: "Static content? Fetch the page, done. Dynamic content? That's where Playwright shines. It waits for the JavaScript to render, then Claude extracts the data from the fully-loaded page."
What to show on screen: Example of each: a static table (GitHub README) vs. a dynamic SPA (product listing with JS). A short code sketch contrasting the two approaches follows below.
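Optional on-screen sketch of the two paths, assuming Node 18+ with Playwright installed. In practice Claude drives this through WebFetch or the Playwright MCP server rather than a hand-written script; the URL and the .product-item selector here are placeholders.

```ts
import { chromium } from 'playwright';

// Static page: a plain HTTP GET is enough because the HTML already contains the data.
async function fetchStatic(url: string): Promise<string> {
  const res = await fetch(url);   // global fetch (Node 18+)
  return await res.text();        // raw HTML, ready to parse
}

// Dynamic page: let the browser run the JavaScript, then read the rendered DOM.
async function fetchDynamic(url: string): Promise<string[]> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle' });  // wait for JS-driven requests to settle
  const items = await page.locator('.product-item').allTextContents();
  await browser.close();
  return items;
}
```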
3. Extraction Methods
- CSS Selectors: .product-item, [data-price], button[type="submit"]
- XPath: //div[@class='product']//span[@class='price']
- Text matching: "Find all links with text 'Download'"
- Attribute extraction: Get href, data-*, id, class
- Nested extraction: Pull parent elements, then children
What to say: "Claude knows how to find elements and pull their data. You just describe what you want: 'Get all product names and prices' or 'Extract every link from this page.'"
What to show on screen: Show a few selector examples. Highlight how Claude can use both exact selectors and fuzzy text matching. A selector-by-selector sketch follows below.
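A rough TypeScript/Playwright sketch of those four methods. The selectors mirror the examples above and are placeholders, not selectors from a real site.

```ts
import { chromium } from 'playwright';

async function demoSelectors(url: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  // CSS selector: grab product names by class
  const names = await page.locator('.product-item .name').allTextContents();

  // XPath: Playwright accepts XPath selectors with an xpath= prefix
  const prices = await page
    .locator("xpath=//div[@class='product']//span[@class='price']")
    .allTextContents();

  // Text matching: links whose visible name is "Download"
  const downloads = page.getByRole('link', { name: 'Download' });

  // Attribute extraction: pull the href off every matched link
  const hrefs = await downloads.evaluateAll(links => links.map(a => a.getAttribute('href')));

  await browser.close();
  return { names, prices, hrefs };
}
```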
4. Handling Dynamic and Complex Scenarios
- Pagination: Loop through pages, extract data from each
- Lazy loading: Scroll to trigger loading, wait for new content
- Modal/Popup content: Dismiss modals, extract from main page
- Filtered data: Use filters/search before extracting
- Real-time updates: Refresh page, re-extract updated data
What to say: "Complex scenarios just mean more steps. Claude can loop, wait, dismiss popups, apply filters, whatever it takes to get the data you need."
What to show on screen: Flow diagram of a multi-step scraping task: navigate → apply filter → scroll → wait → extract → loop for pagination. A scroll-and-dismiss code sketch follows below.
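A hedged sketch of the scroll-and-dismiss pattern. The "Accept" button and the .result-card selector are stand-ins, and the fixed scroll count is a simplification of what Claude would do adaptively.

```ts
import { chromium } from 'playwright';

async function scrapeWithScrollAndModal(url: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  // Modal/popup: dismiss a cookie banner if one is present (selector is a placeholder)
  const dismiss = page.locator('button:has-text("Accept")');
  if (await dismiss.count()) await dismiss.first().click();

  // Lazy loading: scroll a few times and give new items a moment to render
  for (let i = 0; i < 5; i++) {
    await page.mouse.wheel(0, 2000);   // scroll down
    await page.waitForTimeout(500);    // crude wait; prefer waiting on a selector when you can
  }

  const items = await page.locator('.result-card').allTextContents();
  await browser.close();
  return items;
}
```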
Demo Plan
- Demo 1: Static Table Extraction
- Navigate to a page with a clean HTML table
- Ask Claude to extract all rows into JSON or CSV
- Show the clean output
- Mention that no waiting is needed; extraction is effectively instant
- Demo 2: Dynamic Content Extraction
- Navigate to a site with JavaScript-rendered content (product listing, search results)
- Ask Claude to wait for content to load, then extract
- Show data being scraped from dynamic elements
- Display the structured result
- Demo 3: Paginated Data
- Start on page 1 of a paginated listing
- Ask Claude to scrape all pages (3-4 pages)
- Show the accumulated data
- Explain how Claude handles looping and pagination
Code Examples & Commands
Example 1: Static Table Extraction
User: "Go to the page at example.com/data and extract the entire table into JSON format."
Claude:
1. Fetches or navigates to the page
2. Identifies the table element
3. Parses rows, columns, headers
4. Returns JSON: { headers: [...], rows: [...] }
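If you want to show the mechanics behind those steps, here is a minimal Playwright sketch. It assumes a conventional thead/tbody table, which the placeholder example.com/data page may or may not have.

```ts
import { chromium } from 'playwright';

async function tableToJson(url: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  // Headers come from <thead>, data rows from <tbody>
  const headers = await page.$$eval('table thead th',
    ths => ths.map(th => th.textContent?.trim() ?? ''));
  const rows = await page.$$eval('table tbody tr',
    trs => trs.map(tr =>
      Array.from(tr.querySelectorAll('td')).map(td => td.textContent?.trim() ?? '')));

  await browser.close();
  return { headers, rows };   // same { headers: [...], rows: [...] } shape as above
}
```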
Example 2: Product Scraping
User: "Extract all products from example.com/shop. For each, get: name, price, availability."
Claude with Playwright:
1. Navigates to the shop page
2. Waits for product list to load
3. Extracts data using selectors or text matching
4. Returns array of objects:
[
{ name: "Product A", price: "$29.99", availability: "In Stock" },
{ name: "Product B", price: "$49.99", availability: "Out of Stock" }
]
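A sketch of what the Playwright-driven extraction could look like. The .product-card, .name, .price, and .stock selectors are assumptions for a typical shop layout, not guaranteed to match the placeholder example.com/shop.

```ts
import { chromium } from 'playwright';

type Product = { name: string; price: string; availability: string };

async function scrapeProducts(url: string): Promise<Product[]> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);
  await page.waitForSelector('.product-card');   // wait for the JS-rendered list

  // One object per card; the child selectors are guesses for a typical shop layout
  const products = await page.$$eval('.product-card', cards =>
    cards.map(card => ({
      name: card.querySelector('.name')?.textContent?.trim() ?? '',
      price: card.querySelector('.price')?.textContent?.trim() ?? '',
      availability: card.querySelector('.stock')?.textContent?.trim() ?? '',
    })));

  await browser.close();
  return products;
}
```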
Example 3: Paginated Scraping
User: "Scrape all results from example.com/search?q=books (3 pages). Get: title, author, rating."
Claude:
1. Navigates to page 1
2. Extracts all items with specified fields
3. Clicks "Next" or navigates to page 2
4. Repeats for all pages
5. Combines and returns all data
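A sketch of the pagination loop. The .result selector and its children are placeholders, and the "Next" control is assumed to be a real link rather than a button.

```ts
import { chromium } from 'playwright';

type Result = { title: string; author: string; rating: string };

async function scrapeAllPages(startUrl: string, maxPages = 3): Promise<Result[]> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(startUrl);

  const results: Result[] = [];
  for (let i = 0; i < maxPages; i++) {
    // Extract this page's items (selectors are placeholders)
    const pageItems = await page.$$eval('.result', els =>
      els.map(el => ({
        title: el.querySelector('.title')?.textContent?.trim() ?? '',
        author: el.querySelector('.author')?.textContent?.trim() ?? '',
        rating: el.querySelector('.rating')?.textContent?.trim() ?? '',
      })));
    results.push(...pageItems);

    // Stop when there is no "Next" link; otherwise click it and wait for the next page
    const next = page.getByRole('link', { name: 'Next' });
    if (await next.count() === 0) break;
    await next.click();
    await page.waitForLoadState('networkidle');
  }

  await browser.close();
  return results;   // all pages combined
}
```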
Example 4: Conditional Extraction
User: "Extract all links from example.com/directory, but only those that contain 'github' in the URL."
Claude:
1. Finds all links
2. Filters by URL content
3. Returns filtered list with title, URL, description (if available)
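A sketch of the extract-then-filter approach; it returns title and URL only, since not every directory exposes a description.

```ts
import { chromium } from 'playwright';

async function extractGithubLinks(url: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  // Grab every anchor's text and href, then filter in plain TypeScript
  const links = await page.$$eval('a[href]', anchors =>
    anchors.map(a => ({
      title: a.textContent?.trim() ?? '',
      url: a.getAttribute('href') ?? '',
    })));

  await browser.close();
  return links.filter(link => link.url.includes('github'));
}
```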
Gotchas & Tips
- Selector Fragility: If a site redesigns, your selectors break. Use text-based matching when possible.
- Rate Limiting: Scraping too fast can get you IP-banned. Playwright can add delays, and Claude is respectful by default (a small throttling helper appears after this list).
- robots.txt and Terms: Always respect a site's robots.txt and ToS. Some sites explicitly forbid scraping.
- Large Datasets: Scraping 10,000 items takes time. Set realistic expectations and consider pagination limits.
- Image URLs: Extracting src attributes is easy. Downloading images requires additional steps.
- Nested/Complex HTML: Some sites have deeply nested, messy HTML. Claude can handle it, but may need guidance on what counts as a "row" or "item."
Pro tip: Start with a small sample (first page, first 5 items) to verify your extraction is working correctly, then scale up.
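A tiny throttling helper that could be folded into any of the sketches above; the one-second figure in the comment is an arbitrary example, not a site-specific requirement.

```ts
// Pause between page loads so the target site isn't hammered
const politeDelay = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Inside a pagination loop:
//   await politeDelay(1000);   // roughly one second between pages; tune per site
```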
Lead-out
"Scraping turns web pages into data. In the next video, we'll shift focus: instead of extracting random data, we'll look at systematically pulling documentation and reference material—the information developers actually need most."
Reference URLs
- https://playwright.dev/docs/evaluator
- https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
- https://www.w3.org/TR/xpath-10/
Prep Reading
- Test scraping a static table and a dynamic page before recording
- Have a realistic scraping task ready (not too simple, not too complex)
- Understand CSS selectors and how Claude targets elements
- Know how to handle pagination in Playwright
- Test rate limiting and respectful scraping practices
- Prepare examples of clean vs. messy HTML
Notes for Daniel: Show the value here—"Instead of manually copying data, you ask Claude, it scrapes, you get clean data." The demos should feel practical. Use real websites or realistic test data. Emphasize the robots.txt and rate-limits angle: this is a tool to be used ethically. If a scrape takes >10 seconds per page, use prerecorded results.