Max B.

Originally published at blog.domharvest.dev

Building a Production Web Scraper: A Real-World Case Study

Building a web scraper that works on your laptop is easy. Making it reliable in production is hard. Here's how I used domharvest-playwright to build a scraper that's been running smoothly for months.

The Challenge

Goal: Scrape product listings from an e-commerce site daily

Volume: ~10,000 products

Requirements:

  • Run daily at 2 AM UTC
  • Handle pagination (200+ pages)
  • Detect and skip unchanged products
  • Alert on failures
  • Store results in PostgreSQL

Why domharvest-playwright?

I evaluated several options:

Tool                    Pro                                           Con
Cheerio                 Fast                                          No JavaScript execution
Puppeteer               Powerful                                      Complex API
Scrapy                  Battle-tested                                 Python (team uses JS)
domharvest-playwright   Simple + JS rendering + built-in resilience   New tool

domharvest won because it renders JavaScript-heavy pages and ships with retry logic, rate limiting, and batch processing built in.

Architecture Overview

┌─────────────┐
│   Cron Job  │
└──────┬──────┘
       │
       ▼
┌─────────────────────────┐
│  domharvest-playwright  │
│  (with built-in:)       │
│  - Retry logic          │
│  - Rate limiting        │
│  - Batch processing     │
│  - Error handling       │
└────────┬────────────────┘
         │
    ┌────┴────┐
    ▼         ▼
┌───────┐  ┌──────────┐
│Change │  │  Alert   │
│Detect │  │ System   │
└───┬───┘  └──────────┘
    │
    ▼
┌──────────┐
│PostgreSQL│
└──────────┘

Implementation

1. Production-Ready Configuration

domharvest-playwright comes with built-in resilience features:

import { DOMHarvester } from 'domharvest-playwright'

const harvester = new DOMHarvester({
  headless: true,
  timeout: 30000,

  // Built-in rate limiting
  rateLimit: {
    requests: 10,
    per: 60000 // 10 requests per minute
  },

  // Structured logging
  logging: {
    level: 'info'
  },

  // Centralized error handling
  onError: (error, context) => {
    console.error(`Error on ${context.url}:`, error.message)
    // Send to monitoring service
    if (error.name === 'TimeoutError') {
      sendAlert(`Timeout on ${context.url}`)
    }
  }
})

await harvester.init()

Key features:

  • Rate limiting prevents server overload
  • Automatic error context tracking
  • Structured logging for debugging
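
The sendAlert call in the onError handler above isn't part of the library. Here's a minimal sketch of the monitoring.js module the rest of this post imports, assuming a Slack-style incoming webhook (the env var name is hypothetical):

// monitoring.js: hypothetical helpers, not part of domharvest-playwright
const WEBHOOK_URL = process.env.ALERT_WEBHOOK_URL

export async function sendAlert(payload) {
  // Accept either a plain string or a structured object
  const text = typeof payload === 'string' ? payload : JSON.stringify(payload)
  // Node 18+ has global fetch
  await fetch(WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text })
  })
}

export async function sendMetrics(metrics) {
  // Swap in StatsD, Prometheus Pushgateway, or your APM of choice
  console.log('metrics', metrics)
}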

2. Batch Scraping with Retry Logic

Use harvestBatch() with automatic retries:

// Generate configs for all pages
const configs = Array.from({ length: 200 }, (_, i) => ({
  url: `https://example.com/products?page=${i + 1}`,
  selector: '.product-card',
  extractor: (el) => ({
    id: el.querySelector('[data-product-id]')?.getAttribute('data-product-id'),
    name: el.querySelector('.product-name')?.textContent?.trim(),
    price: parseFloat(el.querySelector('.price')?.textContent?.replace(/[^0-9.]/g, '') || '0'),
    inStock: !el.querySelector('.out-of-stock'),
    scrapedAt: new Date().toISOString()
  }),
  options: {
    retries: 3,
    backoff: 'exponential',
    maxBackoff: 10000,
    retryOn: ['TimeoutError', 'NavigationError']
  }
}))

// Batch scrape with concurrency control
const results = await harvester.harvestBatch(configs, {
  concurrency: 5,
  onProgress: (done, total) => {
    console.log(`Progress: ${done}/${total}`)
  }
})

// Filter successful results
const allProducts = results
  .filter(r => r.success)
  .flatMap(r => r.data)

console.log(`Successfully scraped ${allProducts.length} products`)
console.log(`Failed: ${results.filter(r => !r.success).length}`)

Built-in resilience:

  • Automatic retries with exponential backoff (1s, 2s, 4s, capped at 10s; see the sketch after this list)
  • Retry only specific error types
  • Concurrency control prevents overwhelming servers
  • Individual error handling (one failure doesn't stop batch)
  • Progress tracking for long operations
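
To make the retry timing concrete, here's the delay schedule that retries: 3 with exponential backoff and maxBackoff: 10000 implies. This is an illustration of the math, not the library's internals:

// Illustration only: exponential backoff with a cap
function backoffDelay(attempt, baseMs = 1000, maxMs = 10000) {
  return Math.min(baseMs * 2 ** attempt, maxMs) // 1s, 2s, 4s, then capped
}

for (let attempt = 0; attempt < 5; attempt++) {
  console.log(`retry ${attempt + 1}: wait ${backoffDelay(attempt)}ms`)
}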

3. Product Extraction Function

Extract structured data with defensive coding:

function extractProduct(element) {
  return {
    id: element.querySelector('[data-product-id]')
      ?.getAttribute('data-product-id'),
    name: element.querySelector('.product-name')
      ?.textContent?.trim(),
    price: parsePrice(
      element.querySelector('.price')?.textContent
    ),
    imageUrl: element.querySelector('.product-image')
      ?.getAttribute('src'),
    inStock: !element.querySelector('.out-of-stock'),
    url: element.querySelector('a')
      ?.getAttribute('href'),
    scrapedAt: new Date().toISOString()
  }
}

function parsePrice(priceText) {
  if (!priceText) return null
  const match = priceText.match(/[\d,]+\.?\d*/)?.[0]
  // Strip all thousands separators before parsing ("1,299.99" -> 1299.99)
  return match ? parseFloat(match.replace(/,/g, '')) : null
}

Defensive extraction:

  • Optional chaining everywhere (?.)
  • Trim whitespace
  • Parse prices consistently (see the checks below)
  • Handle missing elements gracefully
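
For example, the parsePrice helper handles the usual messy inputs:

// Quick sanity checks for parsePrice
console.log(parsePrice('$1,299.99')) // 1299.99
console.log(parsePrice('Sale: 49'))  // 49
console.log(parsePrice(''))          // null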

4. Change Detection (Application Layer)

Only store what changed:

import { createHash } from 'crypto'

async function saveProducts(products, db) {
  let newCount = 0
  let updatedCount = 0

  for (const product of products) {
    const hash = hashProduct(product)
    const existing = await db.findProduct(product.id)

    if (!existing) {
      await db.insertProduct({ ...product, hash })
      newCount++
    } else if (existing.hash !== hash) {
      await db.updateProduct(product.id, { ...product, hash })
      updatedCount++
    }
    // Skip unchanged products
  }

  return { newCount, updatedCount }
}

function hashProduct(product) {
  const relevant = {
    price: product.price,
    inStock: product.inStock,
    name: product.name
  }
  return createHash('md5')
    .update(JSON.stringify(relevant))
    .digest('hex')
}

This reduced database writes by 90%.
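
For completeness, here's a minimal sketch of the database helpers saveProducts() relies on, using node-postgres. The module name matches the import in the full script below; the schema and helper bodies are assumptions, not code from the original project:

// database.js: hypothetical helpers backing saveProducts().
// Assumed schema:
//   CREATE TABLE products (
//     id TEXT PRIMARY KEY, name TEXT, price NUMERIC,
//     in_stock BOOLEAN, hash TEXT, scraped_at TIMESTAMPTZ
//   );
import pg from 'pg'

const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL })

export const db = {
  async findProduct(id) {
    const { rows } = await pool.query(
      'SELECT id, hash FROM products WHERE id = $1', [id]
    )
    return rows[0] ?? null
  },

  async insertProduct(p) {
    await pool.query(
      `INSERT INTO products (id, name, price, in_stock, hash, scraped_at)
       VALUES ($1, $2, $3, $4, $5, $6)`,
      [p.id, p.name, p.price, p.inStock, p.hash, p.scrapedAt]
    )
  },

  async updateProduct(id, p) {
    await pool.query(
      `UPDATE products SET name = $2, price = $3, in_stock = $4,
         hash = $5, scraped_at = $6 WHERE id = $1`,
      [id, p.name, p.price, p.inStock, p.hash, p.scrapedAt]
    )
  },

  async close() {
    await pool.end()
  }
}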

5. Complete Production Script

Putting it all together:

import { DOMHarvester } from 'domharvest-playwright'
import { db } from './database.js'
import { sendMetrics, sendAlert } from './monitoring.js'

async function runDailyScraper() {
  const startTime = Date.now()
  const harvester = new DOMHarvester({
    headless: true,
    timeout: 30000,
    rateLimit: {
      requests: 10,
      per: 60000
    },
    logging: {
      level: 'info'
    },
    onError: (error, context) => {
      console.error(`Error: ${error.message} at ${context.url}`)
    }
  })

  try {
    await harvester.init()

    // Generate page configs
    const configs = Array.from({ length: 200 }, (_, i) => ({
      url: `https://example.com/products?page=${i + 1}`,
      selector: '.product-card',
      extractor: extractProduct,
      options: {
        retries: 3,
        backoff: 'exponential'
      }
    }))

    // Batch scrape
    const results = await harvester.harvestBatch(configs, {
      concurrency: 5,
      onProgress: (done, total) => {
        console.log(`Progress: ${done}/${total}`)
      }
    })

    // Process results
    const products = results
      .filter(r => r.success)
      .flatMap(r => r.data)

    const { newCount, updatedCount } = await saveProducts(products, db)

    // Send metrics
    await sendMetrics({
      totalScraped: products.length,
      newProducts: newCount,
      updatedProducts: updatedCount,
      failedPages: results.filter(r => !r.success).length,
      duration: Date.now() - startTime
    })

    console.log(`✅ Scraping complete: ${products.length} products`)

  } catch (error) {
    await sendAlert({
      message: `Scraper failed: ${error.message}`,
      stack: error.stack,
      timestamp: new Date().toISOString()
    })
    throw error
  } finally {
    await harvester.close()
    await db.close()
  }
}

// Run (a non-zero exit code lets systemd record the failure)
runDailyScraper().catch((error) => {
  console.error(error)
  process.exitCode = 1
})

6. Deployment with Systemd

Running on a small VPS with systemd:

# /etc/systemd/system/product-scraper.timer
[Unit]
Description=Daily product scraper

[Timer]
OnCalendar=*-*-* 02:00:00 UTC
Persistent=true

[Install]
WantedBy=timers.target

# /etc/systemd/system/product-scraper.service
[Unit]
Description=Product Scraper Service

[Service]
Type=oneshot
User=scraper
WorkingDirectory=/opt/scraper
ExecStart=/usr/bin/node scraper.js
StandardOutput=journal
StandardError=journal
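
After installing both unit files, reload systemd and enable the timer:

sudo systemctl daemon-reload
sudo systemctl enable --now product-scraper.timer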

Built-in Solutions to Common Challenges

Automatic Resource Management

domharvest-playwright handles browser lifecycle automatically:

// No manual page management needed
const results = await harvester.harvestBatch(configs, {
  concurrency: 5
})
// Pages are automatically created, used, and cleaned up

Benefit: Eliminates memory leaks from forgotten page.close() calls.
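
For contrast, this is the page lifecycle you'd otherwise manage by hand in raw Playwright; forgetting a finally block here is exactly how leaks happen:

// Manual page management in plain Playwright (what domharvest hides)
import { chromium } from 'playwright'

const browser = await chromium.launch({ headless: true })
try {
  const page = await browser.newPage()
  try {
    await page.goto('https://example.com/products?page=1')
    const names = await page.$$eval('.product-name', els =>
      els.map(el => el.textContent?.trim())
    )
    console.log(names)
  } finally {
    await page.close() // easy to forget without the try/finally
  }
} finally {
  await browser.close()
}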

Rate Limiting Built-in

Configure once, enforced everywhere:

const harvester = new DOMHarvester({
  rateLimit: {
    global: { requests: 20, per: 60000 },
    perDomain: { requests: 5, per: 60000 }
  }
})
// Automatically enforced across all harvest operations

Benefit: No need to manually track request timestamps or implement delays.
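
For comparison, this is the kind of hand-rolled sliding-window limiter the config above replaces (illustration only):

// A minimal sliding-window rate limiter you no longer have to maintain
function createRateLimiter(requests, perMs) {
  const timestamps = []
  return async function waitForSlot() {
    for (;;) {
      const now = Date.now()
      // Drop timestamps that have left the window
      while (timestamps.length && now - timestamps[0] >= perMs) {
        timestamps.shift()
      }
      if (timestamps.length < requests) {
        timestamps.push(now)
        return
      }
      // Wait until the oldest request exits the window, then re-check
      await new Promise(resolve =>
        setTimeout(resolve, perMs - (now - timestamps[0]))
      )
    }
  }
}

const waitForSlot = createRateLimiter(10, 60000) // 10 requests per minute
// await waitForSlot() before every page navigation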

Retry Logic for Flaky Networks

Handles transient failures automatically:

const data = await harvester.harvest(url, selector, extractor, {
  retries: 3,
  backoff: 'exponential', // 1s, 2s, 4s
  retryOn: ['TimeoutError', 'NavigationError']
})

Benefit: Recovers from temporary network issues, slow page loads, and intermittent errors automatically.

Error Context for Debugging

Rich error information out of the box:

const harvester = new DOMHarvester({
  onError: (error, context) => {
    // Automatically includes:
    // - error.name (TimeoutError, NavigationError, ExtractionError)
    // - context.url
    // - context.selector
    // - context.operation
    // - context.timestamp

    logger.error('Scraping failed', {
      type: error.name,
      url: context.url,
      operation: context.operation
    })
  }
})

Benefit: Know exactly what failed, where, and when.

Custom Browser Configuration

Control user agent, headers, viewport, and more:

const harvester = new DOMHarvester({
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  viewport: { width: 1920, height: 1080 },
  extraHTTPHeaders: {
    'Accept-Language': 'en-US,en;q=0.9'
  }
})

Benefit: Appear as a regular browser, avoid basic bot detection.

Results

After 3 months in production using domharvest-playwright:

  • ✅ 99.2% uptime
  • ✅ ~300,000 products scraped
  • ✅ Average runtime: 45 minutes (with built-in rate limiting)
  • ✅ Zero manual interventions needed (thanks to retry logic)
  • ✅ Automatic error recovery handled 95%+ of transient failures
  • ✅ Database size: 2.3 GB

Lessons Learned

  1. Built-in resilience beats custom code - Using domharvest's retry logic and rate limiting saved weeks of debugging custom error handling
  2. Configuration over code - Declarative config (retries, backoff, rate limits) is clearer and more maintainable than imperative error handling
  3. Batch processing with concurrency control - harvestBatch() handles parallelism safely without overwhelming servers
  4. Centralized error handling - onError callback provides one place to handle all failures
  5. Structured logging from day one - Built-in logging made production debugging trivial

Feature-Rich Yet Simple

The production scraper is ~250 lines of application code. domharvest-playwright's built-in features (retry logic, rate limiting, batch processing, error handling, logging) eliminated the need for hundreds of lines of boilerplate.

Focus on business logic, not infrastructure.

Try It Yourself

npm install domharvest-playwright

Check out the full documentation for:

  • Retry logic configuration
  • Rate limiting options
  • Batch processing examples
  • Error handling patterns
  • Custom browser configuration


What's your experience with production scrapers? Share in the comments!
