Building a web scraper that works on your laptop is easy. Making it reliable in production is hard. Here's how I used domharvest-playwright to build a scraper that's been running smoothly for months.
The Challenge
Goal: Scrape product listings from an e-commerce site daily
Volume: ~10,000 products
Requirements:
- Run daily at 2 AM UTC
- Handle pagination (200+ pages)
- Detect and skip unchanged products
- Alert on failures
- Store results in PostgreSQL
Why domharvest-playwright?
I evaluated several options:
| Tool | Pro | Con |
|---|---|---|
| Cheerio | Fast | No JavaScript execution |
| Puppeteer | Powerful | Complex API |
| Scrapy | Battle-tested | Python (team uses JS) |
| domharvest-playwright | Simple + JS rendering + built-in resilience | New tool |
domharvest-playwright won because it renders JavaScript-heavy pages and ships with retry logic, rate limiting, and batch processing built in.
Architecture Overview
┌─────────────┐
│ Cron Job │
└──────┬──────┘
│
▼
┌─────────────────────────┐
│ domharvest-playwright │
│ (with built-in:) │
│ - Retry logic │
│ - Rate limiting │
│ - Batch processing │
│ - Error handling │
└────────┬────────────────┘
│
┌────┴────┐
▼ ▼
┌───────┐ ┌──────────┐
│Change │ │ Alert │
│Detect │ │ System │
└───┬───┘ └──────────┘
│
▼
┌──────────┐
│PostgreSQL│
└──────────┘
Implementation
1. Production-Ready Configuration
domharvest-playwright comes with built-in resilience features:
import { DOMHarvester } from 'domharvest-playwright'
const harvester = new DOMHarvester({
headless: true,
timeout: 30000,
// Built-in rate limiting
rateLimit: {
requests: 10,
per: 60000 // 10 requests per minute
},
// Structured logging
logging: {
level: 'info'
},
// Centralized error handling
onError: (error, context) => {
console.error(`Error on ${context.url}:`, error.message)
// Send to monitoring service
if (error.name === 'TimeoutError') {
sendAlert(`Timeout on ${context.url}`)
}
}
})
await harvester.init()
Key features:
- Rate limiting prevents server overload
- Automatic error context tracking
- Structured logging for debugging
2. Batch Scraping with Retry Logic
Use harvestBatch() with automatic retries:
// Generate configs for all pages
const configs = Array.from({ length: 200 }, (_, i) => ({
url: `https://example.com/products?page=${i + 1}`,
selector: '.product-card',
extractor: (el) => ({
id: el.querySelector('[data-product-id]')?.getAttribute('data-product-id'),
name: el.querySelector('.product-name')?.textContent?.trim(),
price: parseFloat(el.querySelector('.price')?.textContent?.replace(/[^0-9.]/g, '') || '0'),
inStock: !el.querySelector('.out-of-stock'),
scrapedAt: new Date().toISOString()
}),
options: {
retries: 3,
backoff: 'exponential',
maxBackoff: 10000,
retryOn: ['TimeoutError', 'NavigationError']
}
}))
// Batch scrape with concurrency control
const results = await harvester.harvestBatch(configs, {
concurrency: 5,
onProgress: (done, total) => {
console.log(`Progress: ${done}/${total}`)
}
})
// Filter successful results
const allProducts = results
.filter(r => r.success)
.flatMap(r => r.data)
console.log(`Successfully scraped ${allProducts.length} products`)
console.log(`Failed: ${results.filter(r => !r.success).length}`)
Built-in resilience:
- Automatic retries with exponential backoff (1s, 2s, 4s, max 10s)
- Retry only specific error types
- Concurrency control prevents overwhelming servers
- Individual error handling (one failure doesn't stop batch)
- Progress tracking for long operations
3. Product Extraction Function
Extract structured data with defensive coding:
function extractProduct(element) {
return {
id: element.querySelector('[data-product-id]')
?.getAttribute('data-product-id'),
name: element.querySelector('.product-name')
?.textContent?.trim(),
price: parsePrice(
element.querySelector('.price')?.textContent
),
imageUrl: element.querySelector('.product-image')
?.getAttribute('src'),
inStock: !element.querySelector('.out-of-stock'),
url: element.querySelector('a')
?.getAttribute('href'),
scrapedAt: new Date().toISOString()
}
}
function parsePrice(priceText) {
if (!priceText) return null
const match = priceText.match(/[\d,]+\.?\d*/)?.[0]
return match ? parseFloat(match.replace(/,/g, '')) : null // strip every thousands separator, not just the first
}
Defensive extraction:
- Optional chaining everywhere (?.)
- Trim whitespace
- Parse prices consistently (see the quick checks below)
- Handle missing elements gracefully
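A few quick sanity checks on parsePrice (the inputs are made-up examples, not from the real site):
parsePrice('$1,299.99')       // → 1299.99
parsePrice('Sale: 49.50 USD') // → 49.5
parsePrice('Out of stock')    // → null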
4. Change Detection (Application Layer)
Only store what changed:
import { createHash } from 'crypto'
async function saveProducts(products, db) {
let newCount = 0
let updatedCount = 0
for (const product of products) {
const hash = hashProduct(product)
const existing = await db.findProduct(product.id)
if (!existing) {
await db.insertProduct({ ...product, hash })
newCount++
} else if (existing.hash !== hash) {
await db.updateProduct(product.id, { ...product, hash })
updatedCount++
}
// Skip unchanged products
}
return { newCount, updatedCount }
}
function hashProduct(product) {
const relevant = {
price: product.price,
inStock: product.inStock,
name: product.name
}
return createHash('md5')
.update(JSON.stringify(relevant))
.digest('hex')
}
This reduced database writes by 90%.
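The db.findProduct, db.insertProduct, and db.updateProduct helpers are application code, not part of domharvest-playwright. Here's a minimal sketch of what database.js could look like, assuming the pg driver and a products table keyed by the scraped product ID (the table layout, column names, and DATABASE_URL env var are illustrative):
// database.js - minimal sketch, not the author's actual module.
// Assumes the 'pg' driver and a table like:
//   CREATE TABLE products (
//     id TEXT PRIMARY KEY, name TEXT, price NUMERIC,
//     in_stock BOOLEAN, hash TEXT, scraped_at TIMESTAMPTZ
//   );
import pg from 'pg'

const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL })

export const db = {
  // Returns the stored row (or null) so the caller can compare hashes
  async findProduct(id) {
    const { rows } = await pool.query('SELECT id, hash FROM products WHERE id = $1', [id])
    return rows[0] ?? null
  },
  async insertProduct(p) {
    await pool.query(
      `INSERT INTO products (id, name, price, in_stock, hash, scraped_at)
       VALUES ($1, $2, $3, $4, $5, $6)`,
      [p.id, p.name, p.price, p.inStock, p.hash, p.scrapedAt]
    )
  },
  async updateProduct(id, p) {
    await pool.query(
      `UPDATE products
       SET name = $2, price = $3, in_stock = $4, hash = $5, scraped_at = $6
       WHERE id = $1`,
      [id, p.name, p.price, p.inStock, p.hash, p.scrapedAt]
    )
  },
  async close() {
    await pool.end()
  }
}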
5. Complete Production Script
Putting it all together:
import { DOMHarvester } from 'domharvest-playwright'
import { db } from './database.js'
import { sendMetrics, sendAlert } from './monitoring.js'
async function runDailyScraper() {
const startTime = Date.now()
const harvester = new DOMHarvester({
headless: true,
timeout: 30000,
rateLimit: {
requests: 10,
per: 60000
},
logging: {
level: 'info'
},
onError: (error, context) => {
console.error(`Error: ${error.message} at ${context.url}`)
}
})
try {
await harvester.init()
// Generate page configs
const configs = Array.from({ length: 200 }, (_, i) => ({
url: `https://example.com/products?page=${i + 1}`,
selector: '.product-card',
extractor: extractProduct,
options: {
retries: 3,
backoff: 'exponential'
}
}))
// Batch scrape
const results = await harvester.harvestBatch(configs, {
concurrency: 5,
onProgress: (done, total) => {
console.log(`Progress: ${done}/${total}`)
}
})
// Process results
const products = results
.filter(r => r.success)
.flatMap(r => r.data)
const { newCount, updatedCount } = await saveProducts(products, db)
// Send metrics
await sendMetrics({
totalScraped: products.length,
newProducts: newCount,
updatedProducts: updatedCount,
failedPages: results.filter(r => !r.success).length,
duration: Date.now() - startTime
})
console.log(`✅ Scraping complete: ${products.length} products`)
} catch (error) {
await sendAlert({
message: `Scraper failed: ${error.message}`,
stack: error.stack,
timestamp: new Date().toISOString()
})
throw error
} finally {
await harvester.close()
await db.close()
}
}
// Run
runDailyScraper().catch(console.error)
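The sendMetrics and sendAlert helpers are likewise project code rather than library features. A minimal sketch, assuming Node 18+ for the global fetch and two HTTP endpoints supplied via environment variables (both names are placeholders):
// monitoring.js - minimal sketch, not the author's actual module.
const METRICS_URL = process.env.METRICS_URL     // e.g. an internal metrics gateway
const ALERT_WEBHOOK = process.env.ALERT_WEBHOOK // e.g. a Slack/Teams incoming webhook

export async function sendMetrics(metrics) {
  await fetch(METRICS_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ job: 'product-scraper', ...metrics })
  })
}

// Accepts both call styles used above: sendAlert('message') and sendAlert({ message, stack, ... })
export async function sendAlert(alert) {
  const payload = typeof alert === 'string' ? { message: alert } : alert
  await fetch(ALERT_WEBHOOK, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: `Scraper alert: ${payload.message}`, ...payload })
  })
}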
6. Deployment with Systemd
Running on a small VPS with systemd:
# /etc/systemd/system/product-scraper.timer
[Unit]
Description=Daily product scraper
[Timer]
OnCalendar=*-*-* 02:00:00 UTC
Persistent=true
[Install]
WantedBy=timers.target
# /etc/systemd/system/product-scraper.service
[Unit]
Description=Product Scraper Service
[Service]
Type=oneshot
User=scraper
WorkingDirectory=/opt/scraper
ExecStart=/usr/bin/node scraper.js
StandardOutput=journal
StandardError=journal
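With both unit files in place, reload systemd and enable the timer; these are the standard systemctl and journalctl commands:
sudo systemctl daemon-reload
sudo systemctl enable --now product-scraper.timer
systemctl list-timers product-scraper.timer   # shows the next scheduled run
journalctl -u product-scraper.service -f      # follow the scraper's logs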
Built-in Solutions to Common Challenges
Automatic Resource Management
domharvest-playwright handles browser lifecycle automatically:
// No manual page management needed
const results = await harvester.harvestBatch(configs, {
concurrency: 5
})
// Pages are automatically created, used, and cleaned up
Benefit: Eliminates memory leaks from forgotten page.close() calls.
Rate Limiting Built-in
Configure once, enforced everywhere:
const harvester = new DOMHarvester({
rateLimit: {
global: { requests: 20, per: 60000 },
perDomain: { requests: 5, per: 60000 }
}
})
// Automatically enforced across all harvest operations
Benefit: No need to manually track request timestamps or implement delays.
Retry Logic for Flaky Networks
Handles transient failures automatically:
const data = await harvester.harvest(url, selector, extractor, {
retries: 3,
backoff: 'exponential', // 1s, 2s, 4s
retryOn: ['TimeoutError', 'NavigationError']
})
Benefit: Recovers from temporary network issues, slow page loads, and intermittent errors automatically.
Error Context for Debugging
Rich error information out of the box:
const harvester = new DOMHarvester({
onError: (error, context) => {
// Automatically includes:
// - error.name (TimeoutError, NavigationError, ExtractionError)
// - context.url
// - context.selector
// - context.operation
// - context.timestamp
logger.error('Scraping failed', {
type: error.name,
url: context.url,
operation: context.operation
})
}
})
Benefit: Know exactly what failed, where, and when.
Custom Browser Configuration
Control user agent, headers, viewport, and more:
const harvester = new DOMHarvester({
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
viewport: { width: 1920, height: 1080 },
extraHTTPHeaders: {
'Accept-Language': 'en-US,en;q=0.9'
}
})
Benefit: Appear as a regular browser, avoid basic bot detection.
Results
After 3 months in production using domharvest-playwright:
- ✅ 99.2% uptime
- ✅ ~300,000 products scraped
- ✅ Average runtime: 45 minutes (with built-in rate limiting)
- ✅ Zero manual interventions needed (thanks to retry logic)
- ✅ Automatic error recovery handled 95%+ of transient failures
- ✅ Database size: 2.3 GB
Lessons Learned
- Built-in resilience beats custom code - Using domharvest's retry logic and rate limiting saved weeks of debugging custom error handling
- Configuration over code - Declarative config (retries, backoff, rate limits) is clearer and more maintainable than imperative error handling
- Batch processing with concurrency control - harvestBatch() handles parallelism safely without overwhelming servers
- Centralized error handling - the onError callback provides one place to handle all failures
- Structured logging from day one - built-in logging made production debugging trivial
Feature-Rich Yet Simple
The production scraper is ~250 lines of application code. domharvest-playwright's built-in features (retry logic, rate limiting, batch processing, error handling, logging) eliminated the need for hundreds of lines of boilerplate.
Focus on business logic, not infrastructure.
Try It Yourself
npm install domharvest-playwright
Check out the full documentation for:
- Retry logic configuration
- Rate limiting options
- Batch processing examples
- Error handling patterns
- Custom browser configuration
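If you want the smallest possible first run, here's a sketch using only the calls shown in this post (the URL and selectors are placeholders):
import { DOMHarvester } from 'domharvest-playwright'

const harvester = new DOMHarvester({ headless: true, timeout: 30000 })
await harvester.init()

// Same harvest(url, selector, extractor, options) call used throughout this post
const names = await harvester.harvest(
  'https://example.com/products',   // placeholder URL
  '.product-card',                  // placeholder selector
  (el) => el.querySelector('.product-name')?.textContent?.trim(),
  { retries: 2, backoff: 'exponential' }
)

console.log(names)
await harvester.close()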
What's your experience with production scrapers? Share in the comments!