11.4 Scraping and Extracting Data
Course: Claude Code - Power User
Section: Accessing the Web
Video Length: 2-5 minutes
Presenter: Daniel Treasure
Opening Hook
"You need data from a website: product prices, table rows, a list of links, JSON embedded in a page. You could manually copy-paste. Or you could let Claude scrape it—find the elements, extract the data, and give it back to you in a clean, structured format."
Key Talking Points
1. What Data Scraping Is
- Extracting structured information from web pages
- Can be static (HTML tables, lists) or dynamic (JavaScript-rendered content)
- Returns data in usable formats: JSON, CSV, arrays, objects
- Uses CSS selectors or XPath to target specific elements
- Often combined with Playwright for dynamic/interactive pages
- Respects robots.txt and rate limits—use responsibly
What to say: "Scraping is taking a web page and pulling out the useful data. A product listing becomes a spreadsheet. A table becomes JSON. A list of links becomes an array. Claude does the extraction work."
What to show on screen: Show before/after: raw HTML page vs. clean extracted data in JSON or CSV format.
2. Static vs. Dynamic Content
- Static: the HTML already contains the data. Use WebFetch; Claude parses the returned markdown. Fast and simple.
- Dynamic: Content loads via JavaScript. Use Playwright to wait for content, then extract. Slower but necessary for modern web apps.
What to say: "Static content? Fetch the page, done. Dynamic content? That's where Playwright shines. It waits for the JavaScript to render, then Claude extracts the data from the fully-loaded page."
What to show on screen: Example of each: a static table (GitHub README) vs. a dynamic SPA (product listing with JS). A short code sketch contrasting the two approaches follows below.
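Optional on-screen sketch of the two paths, assuming Node 18+ with Playwright installed. In practice Claude drives this through WebFetch or the Playwright MCP server rather than a hand-written script; the URL and the .product-item selector here are placeholders.

```ts
import { chromium } from 'playwright';

// Static page: a plain HTTP GET is enough because the HTML already contains the data.
async function fetchStatic(url: string): Promise<string> {
  const res = await fetch(url);   // global fetch (Node 18+)
  return await res.text();        // raw HTML, ready to parse
}

// Dynamic page: let the browser run the JavaScript, then read the rendered DOM.
async function fetchDynamic(url: string): Promise<string[]> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle' });  // wait for JS-driven requests to settle
  const items = await page.locator('.product-item').allTextContents();
  await browser.close();
  return items;
}
```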
3. Extraction Methods
- CSS Selectors: .product-item, [data-price], button[type="submit"]
- XPath: //div[@class='product']//span[@class='price']
- Text matching: "Find all links with text 'Download'"
- Attribute extraction: Get href, data-*, id, class
- Nested extraction: Pull parent elements, then children
What to say: "Claude knows how to find elements and pull their data. You just describe what you want: 'Get all product names and prices' or 'Extract every link from this page.'"
What to show on screen: Show a few selector examples. Highlight how Claude can use both exact selectors and fuzzy text matching. A selector-by-selector sketch follows below.
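A rough TypeScript/Playwright sketch of those four methods. The selectors mirror the examples above and are placeholders, not selectors from a real site.

```ts
import { chromium } from 'playwright';

async function demoSelectors(url: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  // CSS selector: grab product names by class
  const names = await page.locator('.product-item .name').allTextContents();

  // XPath: Playwright accepts XPath selectors with an xpath= prefix
  const prices = await page
    .locator("xpath=//div[@class='product']//span[@class='price']")
    .allTextContents();

  // Text matching: links whose visible name is "Download"
  const downloads = page.getByRole('link', { name: 'Download' });

  // Attribute extraction: pull the href off every matched link
  const hrefs = await downloads.evaluateAll(links => links.map(a => a.getAttribute('href')));

  await browser.close();
  return { names, prices, hrefs };
}
```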
4. Handling Dynamic and Complex Scenarios
- Pagination: Loop through pages, extract data from each
- Lazy loading: Scroll to trigger loading, wait for new content
- Modal/Popup content: Dismiss modals, extract from main page
- Filtered data: Use filters/search before extracting
- Real-time updates: Refresh page, re-extract updated data
What to say: "Complex scenarios just mean more steps. Claude can loop, wait, dismiss popups, apply filters, whatever it takes to get the data you need."
What to show on screen: Flow diagram of a multi-step scraping task: navigate → apply filter → scroll → wait → extract → loop for pagination. A scroll-and-dismiss code sketch follows below.
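A hedged sketch of the scroll-and-dismiss pattern. The "Accept" button and the .result-card selector are stand-ins, and the fixed scroll count is a simplification of what Claude would do adaptively.

```ts
import { chromium } from 'playwright';

async function scrapeWithScrollAndModal(url: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  // Modal/popup: dismiss a cookie banner if one is present (selector is a placeholder)
  const dismiss = page.locator('button:has-text("Accept")');
  if (await dismiss.count()) await dismiss.first().click();

  // Lazy loading: scroll a few times and give new items a moment to render
  for (let i = 0; i < 5; i++) {
    await page.mouse.wheel(0, 2000);   // scroll down
    await page.waitForTimeout(500);    // crude wait; prefer waiting on a selector when you can
  }

  const items = await page.locator('.result-card').allTextContents();
  await browser.close();
  return items;
}
```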
Demo Plan
- Demo 1: Static Table Extraction
- Navigate to a page with a clean HTML table
- Ask Claude to extract all rows into JSON or CSV
- Show the clean output
- Mention that no waiting is needed; extraction is effectively instant
- Demo 2: Dynamic Content Extraction
- Navigate to a site with JavaScript-rendered content (product listing, search results)
- Ask Claude to wait for content to load, then extract
- Show data being scraped from dynamic elements
- Display the structured result
- Demo 3: Paginated Data
- Start on page 1 of a paginated listing
- Ask Claude to scrape all pages (3-4 pages)
- Show the accumulated data
- Explain how Claude handles looping and pagination
Code Examples & Commands
Example 1: Static Table Extraction
User: "Go to the page at example.com/data and extract the entire table into JSON format."
Claude:
1. Fetches or navigates to the page
2. Identifies the table element
3. Parses rows, columns, headers
4. Returns JSON: { headers: [...], rows: [...] }
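If you want to show the mechanics behind those steps, here is a minimal Playwright sketch. It assumes a conventional thead/tbody table, which the placeholder example.com/data page may or may not have.

```ts
import { chromium } from 'playwright';

async function tableToJson(url: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  // Headers come from <thead>, data rows from <tbody>
  const headers = await page.$$eval('table thead th',
    ths => ths.map(th => th.textContent?.trim() ?? ''));
  const rows = await page.$$eval('table tbody tr',
    trs => trs.map(tr =>
      Array.from(tr.querySelectorAll('td')).map(td => td.textContent?.trim() ?? '')));

  await browser.close();
  return { headers, rows };   // same { headers: [...], rows: [...] } shape as above
}
```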
Example 2: Product Scraping
User: "Extract all products from example.com/shop. For each, get: name, price, availability."
Claude with Playwright:
1. Navigates to the shop page
2. Waits for product list to load
3. Extracts data using selectors or text matching
4. Returns array of objects:
[
{ name: "Product A", price: "$29.99", availability: "In Stock" },
{ name: "Product B", price: "$49.99", availability: "Out of Stock" }
]
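A sketch of what the Playwright-driven extraction could look like. The .product-card, .name, .price, and .stock selectors are assumptions for a typical shop layout, not guaranteed to match the placeholder example.com/shop.

```ts
import { chromium } from 'playwright';

type Product = { name: string; price: string; availability: string };

async function scrapeProducts(url: string): Promise<Product[]> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);
  await page.waitForSelector('.product-card');   // wait for the JS-rendered list

  // One object per card; the child selectors are guesses for a typical shop layout
  const products = await page.$$eval('.product-card', cards =>
    cards.map(card => ({
      name: card.querySelector('.name')?.textContent?.trim() ?? '',
      price: card.querySelector('.price')?.textContent?.trim() ?? '',
      availability: card.querySelector('.stock')?.textContent?.trim() ?? '',
    })));

  await browser.close();
  return products;
}
```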
Example 3: Paginated Scraping
User: "Scrape all results from example.com/search?q=books (3 pages). Get: title, author, rating."
Claude:
1. Navigates to page 1
2. Extracts all items with specified fields
3. Clicks "Next" or navigates to page 2
4. Repeats for all pages
5. Combines and returns all data
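A sketch of the pagination loop. The .result selector and its children are placeholders, and the "Next" control is assumed to be a real link rather than a button.

```ts
import { chromium } from 'playwright';

type Result = { title: string; author: string; rating: string };

async function scrapeAllPages(startUrl: string, maxPages = 3): Promise<Result[]> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(startUrl);

  const results: Result[] = [];
  for (let i = 0; i < maxPages; i++) {
    // Extract this page's items (selectors are placeholders)
    const pageItems = await page.$$eval('.result', els =>
      els.map(el => ({
        title: el.querySelector('.title')?.textContent?.trim() ?? '',
        author: el.querySelector('.author')?.textContent?.trim() ?? '',
        rating: el.querySelector('.rating')?.textContent?.trim() ?? '',
      })));
    results.push(...pageItems);

    // Stop when there is no "Next" link; otherwise click it and wait for the next page
    const next = page.getByRole('link', { name: 'Next' });
    if (await next.count() === 0) break;
    await next.click();
    await page.waitForLoadState('networkidle');
  }

  await browser.close();
  return results;   // all pages combined
}
```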
Example 4: Conditional Extraction
User: "Extract all links from example.com/directory, but only those that contain 'github' in the URL."
Claude:
1. Finds all links
2. Filters by URL content
3. Returns filtered list with title, URL, description (if available)
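A sketch of the extract-then-filter approach; it returns title and URL only, since not every directory exposes a description.

```ts
import { chromium } from 'playwright';

async function extractGithubLinks(url: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  // Grab every anchor's text and href, then filter in plain TypeScript
  const links = await page.$$eval('a[href]', anchors =>
    anchors.map(a => ({
      title: a.textContent?.trim() ?? '',
      url: a.getAttribute('href') ?? '',
    })));

  await browser.close();
  return links.filter(link => link.url.includes('github'));
}
```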
Gotchas & Tips
- Selector Fragility: If a site redesigns, your selectors break. Use text-based matching when possible.
- Rate Limiting: Scraping too fast can get you IP-banned. Playwright can add delays, and Claude is respectful by default (a small throttling helper appears after this list).
- robots.txt and Terms: Always respect a site's robots.txt and ToS. Some sites explicitly forbid scraping.
- Large Datasets: Scraping 10,000 items takes time. Set realistic expectations and consider pagination limits.
- Image URLs: Extracting src attributes is easy. Downloading images requires additional steps.
- Nested/Complex HTML: Some sites have deeply nested, messy HTML. Claude can handle it, but may need guidance on what counts as a "row" or "item."
Pro tip: Start with a small sample (first page, first 5 items) to verify your extraction is working correctly, then scale up.
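A tiny throttling helper that could be folded into any of the sketches above; the one-second figure in the comment is an arbitrary example, not a site-specific requirement.

```ts
// Pause between page loads so the target site isn't hammered
const politeDelay = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Inside a pagination loop:
//   await politeDelay(1000);   // roughly one second between pages; tune per site
```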
Lead-out
"Scraping turns web pages into data. In the next video, we'll shift focus: instead of extracting random data, we'll look at systematically pulling documentation and reference material—the information developers actually need most."
Reference URLs
- https://playwright.dev/docs/evaluator
- https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
- https://www.w3.org/TR/xpath-10/
Prep Reading
- Test scraping a static table and a dynamic page before recording
- Have a realistic scraping task ready (not too simple, not too complex)
- Understand CSS selectors and how Claude targets elements
- Know how to handle pagination in Playwright
- Test rate limiting and respectful scraping practices
- Prepare examples of clean vs. messy HTML
Notes for Daniel: Show the value here—"Instead of manually copying data, you ask Claude, it scrapes, you get clean data." The demos should feel practical. Use real websites or realistic test data. Emphasize the robots.txt and rate-limits angle: this is a tool to be used ethically. If a scrape takes >10 seconds per page, use prerecorded results.