Max B.

Originally published at blog.domharvest.dev

Building a Production Web Scraper: A Real-World Case Study

Building a web scraper that works on your laptop is easy. Making it reliable in production is hard. Here's how I used domharvest-playwright to build a scraper that's been running smoothly for months.

The Challenge

Goal: Scrape product listings from an e-commerce site daily

Volume: ~10,000 products

Requirements:

  • Run daily at 2 AM UTC
  • Handle pagination (200+ pages)
  • Detect and skip unchanged products
  • Alert on failures
  • Store results in PostgreSQL

Why domharvest-playwright?

I evaluated several options:

Tool                    Pro                                           Con
Cheerio                 Fast                                          No JavaScript execution
Puppeteer               Powerful                                      Complex API
Scrapy                  Battle-tested                                 Python (team uses JS)
domharvest-playwright   Simple + JS rendering + built-in resilience   New tool

domharvest won because it renders JavaScript-heavy pages and ships with retry logic, rate limiting, and batch processing built in.

Architecture Overview

┌─────────────┐
│   Cron Job  │
└──────┬──────┘
       │
       ▼
┌─────────────────────────┐
│  domharvest-playwright  │
│  (with built-in:)       │
│  - Retry logic          │
│  - Rate limiting        │
│  - Batch processing     │
│  - Error handling       │
└────────┬────────────────┘
         │
    ┌────┴────┐
    ▼         ▼
┌───────┐  ┌──────────┐
│Change │  │  Alert   │
│Detect │  │ System   │
└───┬───┘  └──────────┘
    │
    ▼
┌──────────┐
│PostgreSQL│
└──────────┘

Implementation

1. Production-Ready Configuration

domharvest-playwright comes with built-in resilience features:

import { DOMHarvester } from 'domharvest-playwright'

const harvester = new DOMHarvester({
  headless: true,
  timeout: 30000,

  // Built-in rate limiting
  rateLimit: {
    requests: 10,
    per: 60000 // 10 requests per minute
  },

  // Structured logging
  logging: {
    level: 'info'
  },

  // Centralized error handling
  onError: (error, context) => {
    console.error(`Error on ${context.url}:`, error.message)
    // Send to monitoring service
    if (error.name === 'TimeoutError') {
      sendAlert(`Timeout on ${context.url}`)
    }
  }
})

await harvester.init()

Key features:

  • Rate limiting prevents server overload
  • Automatic error context tracking
  • Structured logging for debugging
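
The sendAlert call in the onError handler above isn't part of the library. Here's a minimal sketch of the monitoring.js module the rest of this post imports, assuming a Slack-style incoming webhook (the env var name is hypothetical):

// monitoring.js: hypothetical helpers, not part of domharvest-playwright
const WEBHOOK_URL = process.env.ALERT_WEBHOOK_URL

export async function sendAlert(payload) {
  // Accept either a plain string or a structured object
  const text = typeof payload === 'string' ? payload : JSON.stringify(payload)
  // Node 18+ has global fetch
  await fetch(WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text })
  })
}

export async function sendMetrics(metrics) {
  // Swap in StatsD, Prometheus Pushgateway, or your APM of choice
  console.log('metrics', metrics)
}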

2. Batch Scraping with Retry Logic

Use harvestBatch() with automatic retries:

// Generate configs for all pages
const configs = Array.from({ length: 200 }, (_, i) => ({
  url: `https://example.com/products?page=${i + 1}`,
  selector: '.product-card',
  extractor: (el) => ({
    id: el.querySelector('[data-product-id]')?.getAttribute('data-product-id'),
    name: el.querySelector('.product-name')?.textContent?.trim(),
    price: parseFloat(el.querySelector('.price')?.textContent?.replace(/[^0-9.]/g, '') || '0'),
    inStock: !el.querySelector('.out-of-stock'),
    scrapedAt: new Date().toISOString()
  }),
  options: {
    retries: 3,
    backoff: 'exponential',
    maxBackoff: 10000,
    retryOn: ['TimeoutError', 'NavigationError']
  }
}))

// Batch scrape with concurrency control
const results = await harvester.harvestBatch(configs, {
  concurrency: 5,
  onProgress: (done, total) => {
    console.log(`Progress: ${done}/${total}`)
  }
})

// Filter successful results
const allProducts = results
  .filter(r => r.success)
  .flatMap(r => r.data)

console.log(`Successfully scraped ${allProducts.length} products`)
console.log(`Failed: ${results.filter(r => !r.success).length}`)

Built-in resilience:

  • Automatic retries with exponential backoff (1s, 2s, 4s, capped at 10s; see the sketch after this list)
  • Retry only specific error types
  • Concurrency control prevents overwhelming servers
  • Individual error handling (one failure doesn't stop batch)
  • Progress tracking for long operations
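
To make the retry timing concrete, here's the delay schedule that retries: 3 with exponential backoff and maxBackoff: 10000 implies. This is an illustration of the math, not the library's internals:

// Illustration only: exponential backoff with a cap
function backoffDelay(attempt, baseMs = 1000, maxMs = 10000) {
  return Math.min(baseMs * 2 ** attempt, maxMs) // 1s, 2s, 4s, then capped
}

for (let attempt = 0; attempt < 5; attempt++) {
  console.log(`retry ${attempt + 1}: wait ${backoffDelay(attempt)}ms`)
}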

3. Product Extraction Function

Extract structured data with defensive coding:

function extractProduct(element) {
  return {
    id: element.querySelector('[data-product-id]')
      ?.getAttribute('data-product-id'),
    name: element.querySelector('.product-name')
      ?.textContent?.trim(),
    price: parsePrice(
      element.querySelector('.price')?.textContent
    ),
    imageUrl: element.querySelector('.product-image')
      ?.getAttribute('src'),
    inStock: !element.querySelector('.out-of-stock'),
    url: element.querySelector('a')
      ?.getAttribute('href'),
    scrapedAt: new Date().toISOString()
  }
}

function parsePrice(priceText) {
  if (!priceText) return null
  const match = priceText.match(/[\d,]+\.?\d*/)?.[0]
  // Strip all thousands separators before parsing ("1,299.99" -> 1299.99)
  return match ? parseFloat(match.replace(/,/g, '')) : null
}

Defensive extraction:

  • Optional chaining everywhere (?.)
  • Trim whitespace
  • Parse prices consistently (see the checks below)
  • Handle missing elements gracefully
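
For example, the parsePrice helper handles the usual messy inputs:

// Quick sanity checks for parsePrice
console.log(parsePrice('$1,299.99')) // 1299.99
console.log(parsePrice('Sale: 49'))  // 49
console.log(parsePrice(''))          // null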

4. Change Detection (Application Layer)

Only store what changed:

import { createHash } from 'crypto'

async function saveProducts(products, db) {
  let newCount = 0
  let updatedCount = 0

  for (const product of products) {
    const hash = hashProduct(product)
    const existing = await db.findProduct(product.id)

    if (!existing) {
      await db.insertProduct({ ...product, hash })
      newCount++
    } else if (existing.hash !== hash) {
      await db.updateProduct(product.id, { ...product, hash })
      updatedCount++
    }
    // Skip unchanged products
  }

  return { newCount, updatedCount }
}

function hashProduct(product) {
  const relevant = {
    price: product.price,
    inStock: product.inStock,
    name: product.name
  }
  return createHash('md5')
    .update(JSON.stringify(relevant))
    .digest('hex')
}

This reduced database writes by 90%.
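
For completeness, here's a minimal sketch of the database helpers saveProducts() relies on, using node-postgres. The module name matches the import in the full script below; the schema and helper bodies are assumptions, not code from the original project:

// database.js: hypothetical helpers backing saveProducts().
// Assumed schema:
//   CREATE TABLE products (
//     id TEXT PRIMARY KEY, name TEXT, price NUMERIC,
//     in_stock BOOLEAN, hash TEXT, scraped_at TIMESTAMPTZ
//   );
import pg from 'pg'

const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL })

export const db = {
  async findProduct(id) {
    const { rows } = await pool.query(
      'SELECT id, hash FROM products WHERE id = $1', [id]
    )
    return rows[0] ?? null
  },

  async insertProduct(p) {
    await pool.query(
      `INSERT INTO products (id, name, price, in_stock, hash, scraped_at)
       VALUES ($1, $2, $3, $4, $5, $6)`,
      [p.id, p.name, p.price, p.inStock, p.hash, p.scrapedAt]
    )
  },

  async updateProduct(id, p) {
    await pool.query(
      `UPDATE products SET name = $2, price = $3, in_stock = $4,
         hash = $5, scraped_at = $6 WHERE id = $1`,
      [id, p.name, p.price, p.inStock, p.hash, p.scrapedAt]
    )
  },

  async close() {
    await pool.end()
  }
}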

5. Complete Production Script

Putting it all together:

import { DOMHarvester } from 'domharvest-playwright'
import { db } from './database.js'
import { sendMetrics, sendAlert } from './monitoring.js'

async function runDailyScraper() {
  const startTime = Date.now()
  const harvester = new DOMHarvester({
    headless: true,
    timeout: 30000,
    rateLimit: {
      requests: 10,
      per: 60000
    },
    logging: {
      level: 'info'
    },
    onError: (error, context) => {
      console.error(`Error: ${error.message} at ${context.url}`)
    }
  })

  try {
    await harvester.init()

    // Generate page configs
    const configs = Array.from({ length: 200 }, (_, i) => ({
      url: `https://example.com/products?page=${i + 1}`,
      selector: '.product-card',
      extractor: extractProduct,
      options: {
        retries: 3,
        backoff: 'exponential'
      }
    }))

    // Batch scrape
    const results = await harvester.harvestBatch(configs, {
      concurrency: 5,
      onProgress: (done, total) => {
        console.log(`Progress: ${done}/${total}`)
      }
    })

    // Process results
    const products = results
      .filter(r => r.success)
      .flatMap(r => r.data)

    const { newCount, updatedCount } = await saveProducts(products, db)

    // Send metrics
    await sendMetrics({
      totalScraped: products.length,
      newProducts: newCount,
      updatedProducts: updatedCount,
      failedPages: results.filter(r => !r.success).length,
      duration: Date.now() - startTime
    })

    console.log(`✅ Scraping complete: ${products.length} products`)

  } catch (error) {
    await sendAlert({
      message: `Scraper failed: ${error.message}`,
      stack: error.stack,
      timestamp: new Date().toISOString()
    })
    throw error
  } finally {
    await harvester.close()
    await db.close()
  }
}

// Run (a non-zero exit code lets systemd record the failure)
runDailyScraper().catch((error) => {
  console.error(error)
  process.exitCode = 1
})

6. Deployment with Systemd

Running on a small VPS with systemd:

# /etc/systemd/system/product-scraper.timer
[Unit]
Description=Daily product scraper

[Timer]
OnCalendar=*-*-* 02:00:00 UTC
Persistent=true

[Install]
WantedBy=timers.target

# /etc/systemd/system/product-scraper.service
[Unit]
Description=Product Scraper Service

[Service]
Type=oneshot
User=scraper
WorkingDirectory=/opt/scraper
ExecStart=/usr/bin/node scraper.js
StandardOutput=journal
StandardError=journal
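
After installing both unit files, reload systemd and enable the timer:

sudo systemctl daemon-reload
sudo systemctl enable --now product-scraper.timer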

Built-in Solutions to Common Challenges

Automatic Resource Management

domharvest-playwright handles browser lifecycle automatically:

// No manual page management needed
const results = await harvester.harvestBatch(configs, {
  concurrency: 5
})
// Pages are automatically created, used, and cleaned up

Benefit: Eliminates memory leaks from forgotten page.close() calls.
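
For contrast, this is the page lifecycle you'd otherwise manage by hand in raw Playwright; forgetting a finally block here is exactly how leaks happen:

// Manual page management in plain Playwright (what domharvest hides)
import { chromium } from 'playwright'

const browser = await chromium.launch({ headless: true })
try {
  const page = await browser.newPage()
  try {
    await page.goto('https://example.com/products?page=1')
    const names = await page.$$eval('.product-name', els =>
      els.map(el => el.textContent?.trim())
    )
    console.log(names)
  } finally {
    await page.close() // easy to forget without the try/finally
  }
} finally {
  await browser.close()
}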

Rate Limiting Built-in

Configure once, enforced everywhere:

const harvester = new DOMHarvester({
  rateLimit: {
    global: { requests: 20, per: 60000 },
    perDomain: { requests: 5, per: 60000 }
  }
})
// Automatically enforced across all harvest operations

Benefit: No need to manually track request timestamps or implement delays.
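
For comparison, this is the kind of hand-rolled sliding-window limiter the config above replaces (illustration only):

// A minimal sliding-window rate limiter you no longer have to maintain
function createRateLimiter(requests, perMs) {
  const timestamps = []
  return async function waitForSlot() {
    for (;;) {
      const now = Date.now()
      // Drop timestamps that have left the window
      while (timestamps.length && now - timestamps[0] >= perMs) {
        timestamps.shift()
      }
      if (timestamps.length < requests) {
        timestamps.push(now)
        return
      }
      // Wait until the oldest request exits the window, then re-check
      await new Promise(resolve =>
        setTimeout(resolve, perMs - (now - timestamps[0]))
      )
    }
  }
}

const waitForSlot = createRateLimiter(10, 60000) // 10 requests per minute
// await waitForSlot() before every page navigation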

Retry Logic for Flaky Networks

Handles transient failures automatically:

const data = await harvester.harvest(url, selector, extractor, {
  retries: 3,
  backoff: 'exponential', // 1s, 2s, 4s
  retryOn: ['TimeoutError', 'NavigationError']
})

Benefit: Recovers from temporary network issues, slow page loads, and intermittent errors automatically.

Error Context for Debugging

Rich error information out of the box:

const harvester = new DOMHarvester({
  onError: (error, context) => {
    // Automatically includes:
    // - error.name (TimeoutError, NavigationError, ExtractionError)
    // - context.url
    // - context.selector
    // - context.operation
    // - context.timestamp

    logger.error('Scraping failed', {
      type: error.name,
      url: context.url,
      operation: context.operation
    })
  }
})

Benefit: Know exactly what failed, where, and when.

Custom Browser Configuration

Control user agent, headers, viewport, and more:

const harvester = new DOMHarvester({
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  viewport: { width: 1920, height: 1080 },
  extraHTTPHeaders: {
    'Accept-Language': 'en-US,en;q=0.9'
  }
})

Benefit: Appear as a regular browser, avoid basic bot detection.

Results

After 3 months in production using domharvest-playwright:

  • ✅ 99.2% uptime
  • ✅ ~300,000 products scraped
  • ✅ Average runtime: 45 minutes (with built-in rate limiting)
  • ✅ Zero manual interventions needed (thanks to retry logic)
  • ✅ Automatic error recovery handled 95%+ of transient failures
  • ✅ Database size: 2.3 GB

Lessons Learned

  1. Built-in resilience beats custom code - Using domharvest's retry logic and rate limiting saved weeks of debugging custom error handling
  2. Configuration over code - Declarative config (retries, backoff, rate limits) is clearer and more maintainable than imperative error handling
  3. Batch processing with concurrency control - harvestBatch() handles parallelism safely without overwhelming servers
  4. Centralized error handling - onError callback provides one place to handle all failures
  5. Structured logging from day one - Built-in logging made production debugging trivial

Feature-Rich Yet Simple

The production scraper is ~250 lines of application code. domharvest-playwright's built-in features (retry logic, rate limiting, batch processing, error handling, logging) eliminated the need for hundreds of lines of boilerplate.

Focus on business logic, not infrastructure.

Try It Yourself

npm install domharvest-playwright

Check out the full documentation for:

  • Retry logic configuration
  • Rate limiting options
  • Batch processing examples
  • Error handling patterns
  • Custom browser configuration


What's your experience with production scrapers? Share in the comments!
