LucideCrawl: AI-Powered Web Ingestion and Phishing Detection API Built on Xano

Xano AI-Powered Backend Challenge: Public API Submission

This is a submission for the Xano AI-Powered Backend Challenge: Production-Ready Public API

What I Built

I built LucideCrawl — a production-ready public API that allows developers to safely ingest and analyze web content at scale while protecting users from phishing, scams, and malicious sites.

LucideCrawl provides four core capabilities, all implemented entirely in Xano with robust authentication, per-user rate limiting, usage tracking, and audit logging:

  • Phishing & Safety Detection – AI-powered, real-time evaluation of URLs to detect scams, impersonation, urgent threats, and security risks.
  • Ask Questions About Web Pages – Extract clean content and answer natural language questions grounded in the page content. Perfect for RAG, summarization, or compliance checks.
  • Sitemap-Based Bulk Ingestion – Crawl all pages from a sitemap.xml with include/exclude path filtering.
  • Full Website Crawl – Depth-controlled crawling of entire websites with domain/path rules, delivering structured, clean data.

LucideCrawl is ideal for:

  • Browser extensions and email tools needing instant phishing detection
  • AI agents requiring safe, grounded web data
  • Knowledge platforms building search indexes or SEO audits
  • Security teams monitoring brand impersonation

All core logic, authentication, API key management, and rate limiting are built natively in Xano.


API Documentation

Base URL: https://xmmh-djbw-xefx.n7e.xano.io/api:x9tl6bvx

Authentication:

  • All endpoints require an x-api-key header.
  • API keys are generated upon signup and displayed only once. Users can manage them in their account.

Rate Limits (monthly, per user):

| Plan       | trust_scan | ask_the_page | load_sitemap | site_crawl |
|------------|------------|--------------|--------------|------------|
| Free       | 4          | 5            | 5            | 2          |
| Pro        | 100        | 5,000        | 50           | 20         |
| Enterprise | 1,000      | 50,000       | 500          | 200        |

Core Endpoints:

  1. POST /trust_scan – AI-powered URL safety scan
    Input: { "url": "https://example.com" }
    Returns: safety_score, safety_label, confidence_level, phishing_category, impersonated_brand, detected_threats, risk_factors, details, and user_action_recommendation.

  2. POST /ask_the_page – Answer questions about a web page
    Input: { "url": "...", "question": "..." }
    Returns: Grounded AI answer with metadata.

  3. POST /load_sitemap – Bulk page ingestion from sitemap.xml
    Input: { "sitemap_url": "...", "include_paths": [...], "exclude_paths": [...] }
    Returns: Array of structured page data.

  4. POST /site_crawl – Depth-controlled crawl of a website
    Input:

   {
     "url": "https://example.com",
     "pageLimit": 100,
     "crawlDepth": 3,
     "includeSubdomains": false,
     "followExternalLinks": false,
     "includePaths": ["/blog/", "/docs/"],
     "excludePaths": ["/login", "/checkout"]
   }

Returns: Array of crawled pages in clean, structured format.

Each response includes a usage object with monthly consumption and remaining quota.
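
For example, a client can read that usage object after each call and warn before the quota runs out. A minimal TypeScript sketch (the helper name and the 10% warning threshold are my own, not part of the API):

// Shape of the usage object returned alongside each response (as documented above).
interface Usage {
  month: string; // e.g. "2025-12"
  used: number;
  limit: number;
  remaining: number;
}

// Illustrative helper: call a LucideCrawl endpoint and warn when the monthly quota runs low.
async function callWithQuotaCheck(path: string, body: unknown, apiKey: string) {
  const res = await fetch(`https://xmmh-djbw-xefx.n7e.xano.io/api:x9tl6bvx${path}`, {
    method: "POST",
    headers: { "x-api-key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  const payload = await res.json();
  const usage: Usage | undefined = payload.usage;
  if (usage && usage.remaining <= Math.ceil(usage.limit * 0.1)) {
    console.warn(`Only ${usage.remaining}/${usage.limit} calls left for ${usage.month}`);
  }
  return payload;
}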

Demo

/trust_scan API


curl -X POST https://xmmh-djbw-xefx.n7e.xano.io/api:x9tl6bvx/trust_scan \
  -H "x-api-key: sk_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://paypal-security-update-2025.com/login"}'

Response (simplified):

{
  "success": true,
  "data": {
    "safety_score": 0.08,
    "safety_label": "Danger",
    "confidence_level": "high",
    "phishing_category": "financial",
    "impersonated_brand": "PayPal",
    "detected_threats": [
      "URGENT_ACTION_REQUIRED",
      "FAKE_LOGIN_FORM"
    ],
    "details": "This page mimics PayPal's login interface and uses urgency tactics to steal credentials.",
    "user_action_recommendation": "Do not enter any information. Close immediately."
  },
  "usage": {
    "month": "2025-12",
    "used": 12,
    "limit": 100,
    "remaining": 88
  }
}
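A client can then map that result to a user-facing decision. A rough TypeScript sketch (the 0.3 / 0.7 score thresholds are my own assumptions, not values defined by the API):

interface TrustScanData {
  safety_score: number; // 0 = dangerous, 1 = safe, as in the example above
  safety_label: string; // e.g. "Danger"
  user_action_recommendation: string;
}

// Illustrative mapping from a scan result to a UI action.
function decideAction(scan: TrustScanData): "block" | "warn" | "allow" {
  if (scan.safety_label === "Danger" || scan.safety_score < 0.3) return "block";
  if (scan.safety_score < 0.7) return "warn";
  return "allow";
}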

/ask_the_page API

Ask a direct question about a webpage and receive an AI-generated explanation based on the page’s content and domain signals.

Example: Verify a banking login page

curl -X POST https://xmmh-djbw-xefx.n7e.xano.io/api:x9tl6bvx/ask_the_page \
  -H "x-api-key: sk_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://bankofamerica-secure-login-2025.com",
    "question": "Is this the real Bank of America login page?"
  }'

Response (simplified):

{
  "success": true,
  "data": {
    "url": "https://bankofamerica-secure-login-2025.com",
    "question": "Is this the real Bank of America login page?",
    "answer": "No, this is not a legitimate Bank of America page. The domain is not owned by Bank of America and uses urgency-driven login prompts commonly associated with phishing attacks.",
    "page_title": "Bank of America Secure Login"
  },
  "usage": {
    "month": "2025-12",
    "used": 13,
    "limit": 100,
    "remaining": 87
  }
}

/crawl_webpage API

Fetch and extract structured data from any public webpage.
This endpoint is useful for content analysis, AI training, indexing, or downstream security scans.

Example: Crawl a webpage

curl -X POST https://xmmh-djbw-xefx.n7e.xano.io/api:x9tl6bvx/crawl_webpage \
  -H "x-api-key: sk_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'

Response (simplified):

{
  "success": true,
  "data": {
    "url": "https://example.com",
    "raw_html": "<!doctype html>...</html>",
    "clean_text": "This domain is for use in documentation examples without needing permission. Avoid use in operations. Learn more",
    "metadata": {
      "title": "Example Domain",
      "description": null,
      "keywords": null,
      "canonical": null,
      "language": "en"
    },
    "headings": [
      {
        "level": "h1",
        "text": "Example Domain"
      }
    ],
    "links": [
      {
        "url": "https://iana.org/domains/example",
        "text": "Learn more",
        "is_external": true
      }
    ],
    "structured_page_json": {
      "title": "Example Domain",
      "url": "https://example.com",
      "language": "en",
      "clean_text": "This domain is for use in documentation examples without needing permission. Avoid use in operations. Learn more",
      "metadata": {
        "title": "Example Domain",
        "description": null,
        "keywords": null,
        "canonical": null,
        "language": "en"
      },
      "headings": [
        {
          "level": "h1",
          "text": "Example Domain"
        }
      ],
      "links": [
        {
          "url": "https://iana.org/domains/example",
          "text": "Learn more",
          "is_external": true
        }
      ]
    }
  },
  "usage": {
    "month": "2025-12",
    "used": 24,
    "limit": 100,
    "remaining": 76
  }
}
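One way to use this output for the downstream security scans mentioned above is to feed the extracted external links into /trust_scan. A hedged TypeScript sketch (error handling omitted; the function name is mine):

// Illustrative pipeline: crawl one page, then scan each external link it contains.
async function scanExternalLinks(pageUrl: string, apiKey: string) {
  const base = "https://xmmh-djbw-xefx.n7e.xano.io/api:x9tl6bvx";
  const headers = { "x-api-key": apiKey, "Content-Type": "application/json" };

  const crawl = await fetch(`${base}/crawl_webpage`, {
    method: "POST", headers, body: JSON.stringify({ url: pageUrl }),
  }).then(r => r.json());

  const links: { url: string; is_external: boolean }[] = crawl.data?.links ?? [];
  const results: { url: string; label: string | undefined }[] = [];
  for (const link of links.filter(l => l.is_external)) {
    const scan = await fetch(`${base}/trust_scan`, {
      method: "POST", headers, body: JSON.stringify({ url: link.url }),
    }).then(r => r.json());
    results.push({ url: link.url, label: scan.data?.safety_label });
  }
  return results;
}

Keep in mind that every /trust_scan call in a loop like this counts against the monthly quota shown in the usage object.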

/site_crawl API

Crawl an entire website (or section of it) and return structured content from multiple pages in a single request.
Ideal for bulk analysis, AI training, indexing, and large-scale security scans.

Input Parameters

  • url – Starting URL for the crawl
  • pageLimit – Maximum number of pages to crawl
  • crawlDepth – How deep the crawler should follow internal links
  • includeSubdomains – Whether to include subdomains
  • followExternalLinks – Whether to crawl external domains
  • includePaths – Optional allowlist of URL paths
  • excludePaths – Optional blocklist of URL paths

Example: Crawl a small website

curl -X POST https://xmmh-djbw-xefx.n7e.xano.io/api:x9tl6bvx/site_crawl \
  -H "x-api-key: sk_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "pageLimit": 3,
    "crawlDepth": 1,
    "includeSubdomains": false,
    "followExternalLinks": false,
    "includePaths": [],
    "excludePaths": []
  }'

Response (simplified):

{
  "success": true,
  "message": "Crawl completed successfully",
  "crawl_id": 6,
  "pages_crawled": 2,
  "formattedData": [
    {
      "url": "https://httpbin.org/forms/post",
      "title": null,
      "content": "Customer name: Telephone: E-mail address: Pizza Size Small Medium Large...",
      "metadata": {
        "title": null,
        "description": null,
        "keywords": null,
        "canonical": null,
        "language": null
      },
      "headings": []
    }
  ],
  "crawledData": [
    {
      "url": "https://httpbin.org",
      "raw_html": "<!DOCTYPE html>...</html>",
      "clean_text": "A simple HTTP Request & Response Service.",
      "metadata": {
        "title": "httpbin.org",
        "description": null,
        "keywords": null,
        "canonical": null,
        "language": "en"
      },
      "headings": [
        {
          "level": "h2",
          "text": "httpbin.org"
        }
      ],
      "links": [
        {
          "url": "https://github.com/requests/httpbin",
          "is_external": true
        }
      ]
    }
  ],
  "usage": {
    "current": 49,
    "limit": 50
  }
}

/load_sitemap API

Load and process an XML sitemap, extract valid URLs, and crawl only the pages you care about using path filters.
Ideal for bulk ingestion, SEO analysis, AI training, and large-scale monitoring without full site crawling.

Input Parameters

  • sitemap_url – Full URL to the sitemap XML
  • include_paths – Optional list of URL path prefixes to allow
  • exclude_paths – Optional list of URL path prefixes to block

include_paths and exclude_paths work together to precisely control which sitemap URLs are processed.
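
To make the semantics concrete, here is my mental model of the filtering as a TypeScript sketch (not the actual XanoScript): a URL is kept if it matches at least one include prefix, or the include list is empty, and matches no exclude prefix.

// Illustrative prefix filtering for sitemap URLs.
function keepUrl(url: string, includePaths: string[], excludePaths: string[]): boolean {
  const norm = (s: string) => s.replace(/\/+$/, ""); // ignore trailing slashes when comparing
  const path = norm(new URL(url).pathname);
  const included =
    includePaths.length === 0 || includePaths.some(p => path.startsWith(norm(p)));
  const excluded = excludePaths.some(p => path.startsWith(norm(p)));
  return included && !excluded;
}

// Under this sketch, keepUrl("https://octopus.do/sitemap/changelog", ["/sitemap/changelog/"], [])
// returns true, which matches the example below.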


Example: Load and filter a sitemap

curl -X POST https://xmmh-djbw-xefx.n7e.xano.io/api:x9tl6bvx/load_sitemap \
  -H "x-api-key: sk_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "sitemap_url": "https://octopus.do/sitemap.xml",
    "include_paths": [
      "/sitemap/changelog/"
    ],
    "exclude_paths": []
  }'

Response (simplified)

{
  "success": true,
  "message": "Sitemap processed successfully",
  "sitemap_id": 4,
  "data": {
    "message": "Sitemap processed. Found 1 valid pages.",
    "formattedData": [
      {
        "url": "https://octopus.do/sitemap/changelog",
        "title": "Check out our Updates and Roadmap | Octopus.do",
        "content": "November 28, 2025 Feature Export to Excel... Octopus core refactoring...",
        "metadata": {
          "title": "Check out our Updates and Roadmap | Octopus.do",
          "description": "Our changelog includes new product features and development updates.",
          "keywords": null,
          "canonical": null,
          "language": "en"
        },
        "headings": [
          { "level": "h1", "text": "Changelog" },
          { "level": "h2", "text": "Export to Excel" },
          { "level": "h2", "text": "Introducing Sitemap AI assistant BETA" }
        ]
      }
    ],
    "raw": [
      {
        "url": "https://octopus.do/sitemap/changelog",
        "raw_html": "<!DOCTYPE html>...</html>",
        "clean_text": "November 28, 2025 Feature Export to Excel...",
        "metadata": {
          "title": "Check out our Updates and Roadmap | Octopus.do",
          "description": "Our changelog includes new product features and development updates.",
          "keywords": null,
          "canonical": null,
          "language": "en"
        },
        "headings": [
          { "level": "h1", "text": "Changelog" }
        ],
        "links": [
          {
            "url": "https://x.com/octopusdoHQ",
            "text": "Follow for updates",
            "is_external": true
          }
        ]
      }
    ]
  },
  "usage": {
    "current": 70,
    "limit": 70,
    "remaining": 0
  }
}

| Endpoint       | Purpose                                                              |
|----------------|----------------------------------------------------------------------|
| /crawl_webpage | Crawl and extract structured data from a single webpage             |
| /site_crawl    | Crawl multiple pages across a website with depth and path controls  |
| /load_sitemap  | Load and process URLs from an XML sitemap with smart filtering      |
| /trust_scan    | Detect phishing, scams, and impersonation signals                   |
| /ask_the_page  | Explain page legitimacy and risks in clear, human-readable language |

Account & Key Management Endpoints

These endpoints allow developers to programmatically manage their access keys.

5. `GET /get_api_keys` – List Active Keys
Retrieves a paginated list of API keys for the authenticated user.

  • Input: None (uses Auth Token).
  • Returns: A list of keys with values masked (e.g., sk_123...890), creation date, and plan type.

6. `POST /generate_api_key` – Create New Key
Generates a new secure API key (24 random bytes, hex-encoded).

  • Input: { "name": "Production App" } (Optional label).
  • Returns: The full API key string (shown only once), id, and name.
  • Note: This action is logged in the event_log table for security auditing.
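
Since keys are 24 random bytes, hex-encoded, the generation step presumably boils down to something like this Node.js/TypeScript sketch (the sk_ prefix matches the examples in this post; the rest is my assumption, not the actual Xano implementation):

import { randomBytes } from "node:crypto";

// Illustrative key generation: 24 random bytes, hex-encoded, with the sk_ prefix used in this post.
function generateApiKey(): string {
  return "sk_" + randomBytes(24).toString("hex"); // 48 hex characters after the prefix
}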

7. `DELETE /delete_api_key` – Revoke Key
Permanently deletes an API key.

  • Input: { "key_id": 15 }
  • Returns: Success message.
  • Security: Validates that the key belongs to the authenticated user before deletion.
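
The ownership check is conceptually simple; here is a hedged TypeScript sketch of the flow (the KeyStore interface is illustrative, and in LucideCrawl this logic lives in the Xano function stack):

// Illustrative ownership check before deleting a key.
interface ApiKeyRecord { id: number; user_id: number; }

interface KeyStore {
  findById(id: number): Promise<ApiKeyRecord | null>;
  deleteById(id: number): Promise<void>;
}

async function deleteApiKey(store: KeyStore, authedUserId: number, keyId: number) {
  const record = await store.findById(keyId);
  if (!record || record.user_id !== authedUserId) {
    throw new Error("403: key not found or not owned by this user");
  }
  await store.deleteById(keyId);
  return { success: true, message: "API key deleted" };
}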


The AI Prompt I Used

“Build a production-ready backend for an application called LucideCrawl.
LucideCrawl is a Web Crawling + Content Extraction API that allows developers to crawl websites, extract clean data, and optionally run AI analysis using Google Gemini.
The backend must expose a secure public API with proper documentation, authentication, rate limiting, and usage logs.
Focus on practicality and simplicity—avoid overly complex ML/RAG pipelines.”

The AI produced strong foundations that I then refined for production readiness.


How I Refined the AI-Generated Code

I used the AI-generated Xano backend as a strong starting point, then refined and extended it to meet the needs of a production-ready, public API for website scanning and phishing detection. My goal was to align the generated structure with real-world usage patterns such as API access control, usage tracking, detailed scan results, and long-term data integrity.

Below are the key transformations I made, with examples where applicable.


1. API Key Management and Access Control

Initial state (AI-generated):

  • Basic API key support tied to a user

Improvements made:

  • Introduced a dedicated api_key table that supports:

    • Multiple API keys per user
    • A unique constraint on the key field
    • An optional name field so users can label keys (e.g., “Production”, “Testing”)
    • A plan_type field to support tiered access (free / pro)
api_key
- id
- user_id
- key (unique)
- name (optional)
- plan_type
- created_at

This makes API access more flexible, secure, and easier to manage for end users.


2. Monthly Usage Aggregation

Initial state:

  • Request-level usage was logged, but not aggregated

Improvements made:

  • Added an api_monthly_usage table to track request counts per user per month
  • Enforced a unique constraint on (user_id, month) to guarantee one row per user per billing cycle
api_monthly_usage
- user_id
- month (YYYY-MM)
- call_count

This structure enables efficient rate limiting, analytics, and future billing logic without relying on expensive log scans.
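
It also keeps the rate-limit check itself cheap: load (or create) the single (user_id, month) row, compare the counter to the plan limit, and increment. A TypeScript sketch of that flow (the UsageStore interface is illustrative; per-endpoint counters work the same way):

// Illustrative monthly quota check against a (user_id, month) counter row.
interface UsageStore {
  getOrCreate(userId: number, month: string): Promise<number>; // current call_count, 0 if new
  increment(userId: number, month: string): Promise<void>;
}

async function checkAndCount(store: UsageStore, userId: number, limit: number) {
  const month = new Date().toISOString().slice(0, 7); // "YYYY-MM"
  const used = await store.getOrCreate(userId, month);
  if (used >= limit) {
    throw new Error(`429: monthly limit of ${limit} calls reached for ${month}`);
  }
  await store.increment(userId, month);
  return { month, used: used + 1, limit, remaining: limit - used - 1 };
}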


3. Usage Log Enhancements

Initial state:

  • Core usage logging was already present

Improvements made:

  • Extended the existing usage_log table by adding a url field
  • Preserved all existing fields and historical data

This provides better visibility into how each endpoint is used and which URLs are being analyzed, improving traceability and auditing.


4. Detailed Scan Result Storage

Initial state:

  • Scan results were stored at a high level

Improvements made:

  • Expanded the scan_history table to store detailed analysis returned by the external scanning API
  • Added only missing columns to ensure idempotent and non-destructive schema updates

New fields include:

  • Safety score and label
  • Confidence level
  • Phishing category
  • Detected threats and risk factors (JSON)
  • User-facing explanations and recommendations
  • External verification results
  • Scan timestamp

This allows scan results to be both machine-readable and human-friendly.


5. Question & Answer History Tracking

New addition:

  • Created an ask_history table to persist user questions, AI responses, and related metadata such as:

    • Model used
    • Scraped text length
    • Timestamp of the request

This provides a clear audit trail and supports analytics, debugging, and future model evaluation.


6. Crawl and Sitemap History Separation

Improvements made:

  • Added a crawl_history table to store website crawl configurations and outcomes
  • Added a sitemap_history table to track sitemap processing independently

By separating crawl and sitemap data, the backend remains clean, easier to query, and more adaptable as crawling features expand.


7. Schema Evolution and Safety

All schema changes were applied in a way that:

  • Adds new capabilities without breaking existing functionality
  • Avoids recreating tables or modifying existing columns
  • Preserves historical data

This approach ensures smooth iteration while maintaining system stability.

The AI-generated backend provided a solid foundation in Xano. By thoughtfully extending it with additional tables, constraints, and fields, I tailored the system to support real-world API usage, detailed scan analysis, and long-term scalability—while staying fully aligned with Xano best practices.

Key improvements:

  • Robust header handling: Case-insensitive x-api-key detection (see the sketch below)
  • Per-endpoint rate limiting: Separate usage tracking for each endpoint
  • Atomic usage counting: Prevents accidental overcharging on failed requests
  • Comprehensive history tables: For auditing and dashboard support
  • Plan-based dynamic limits: Free, Pro, Enterprise
  • Long operation timeouts: Up to 600s for deep crawls
  • Consistent response format: Always { success, data, usage }

These refinements ensure fairness, transparency, and a developer-friendly API.
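
For the header-handling point above, the idea is simply to find x-api-key however the client cases it. A minimal TypeScript illustration (not the XanoScript used in the backend):

// Illustrative case-insensitive lookup of the x-api-key header.
function getApiKey(headers: Record<string, string>): string | undefined {
  const entry = Object.entries(headers).find(
    ([name]) => name.toLowerCase() === "x-api-key",
  );
  return entry?.[1];
}

// getApiKey({ "X-Api-Key": "sk_abc" }) and getApiKey({ "x-api-key": "sk_abc" }) both return "sk_abc".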


My Overall Experience Using Xano

Xano made it possible to build a fully featured, secure, public-facing API in a relatively short amount of time.

One of the most helpful aspects was the visual canvas and function stack, which clearly shows how data flows through each API—from authentication to rate limiting, processing, logging, and response handling. While I didn’t have enough time to explore the visual builder as deeply as I would have liked due to the deadline, it was still very useful for understanding and structuring complex logic.

I relied mostly on XanoScript for implementation. As someone still gaining experience with it, I encountered several syntax and structure errors along the way. However, the Logic Assistant was extremely helpful in resolving these issues. It not only identified errors quickly but often suggested cleaner, more efficient ways to write the logic, which significantly improved the quality of my code and helped me learn faster.

Overall, Xano struck a great balance between visual tooling and low-level control. Even when working primarily with XanoScript, the platform provided enough guidance and tooling to move quickly, fix mistakes, and ship with confidence.

LucideCrawl is now live, helping developers build safer, smarter web-powered applications.


Demo: Trust Shield Chrome Extension

To showcase LucideCrawl in a real-world scenario, I built Trust Shield — a lightweight Chrome extension that uses the POST /trust_scan endpoint to protect users from phishing in real time.

Trust Shield turns LucideCrawl from a backend API into always-on, user-facing protection that runs quietly in the background while users browse.


What Trust Shield Does

On every page visit:

  • The current URL is automatically scanned using LucideCrawl
  • The extension badge updates instantly:

    • 🛡️ Green — Safe
    • ⚠️ Red — Potentially dangerous
  • Risky sites trigger a Chrome notification with:

    • Safety label
    • Risk score
    • Clear next-step guidance
  • Clicking the extension icon opens a popup with:

    • Safety score and confidence
    • Detected threats and risk factors
    • Impersonated brand (if applicable)
    • Human-readable explanation of the risk

The user never has to think about scanning; protection happens by default.
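
The full source is linked below; as a rough sketch of the core loop (TypeScript, standard Chrome extension APIs, with details like the storage key and badge text chosen for illustration rather than copied from the real Trust Shield code):

// Sketch of a background service worker: scan each visited URL and colour the badge.
chrome.tabs.onUpdated.addListener(async (tabId, changeInfo, tab) => {
  if (changeInfo.status !== "complete" || !tab.url?.startsWith("http")) return;

  const { apiKey } = await chrome.storage.sync.get("apiKey"); // key saved from the popup
  if (!apiKey) return;

  const res = await fetch("https://xmmh-djbw-xefx.n7e.xano.io/api:x9tl6bvx/trust_scan", {
    method: "POST",
    headers: { "x-api-key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify({ url: tab.url }),
  });
  const { data } = await res.json();

  const danger = data?.safety_label === "Danger";
  await chrome.action.setBadgeText({ tabId, text: danger ? "!" : "OK" });
  await chrome.action.setBadgeBackgroundColor({ tabId, color: danger ? "#d93025" : "#188038" });
});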

Source Code

Open source:
trust-shield


Try It Yourself (≈ 2 Minutes)

1. Get an API key
Create a free LucideCrawl account via the POST /auth/signup endpoint using any REST client (Postman, cURL, etc.).
Copy the generated sk_... API key (shown once).

2. Install the Chrome extension

  1. Clone or download: trust-shield
  2. Open chrome://extensions/
  3. Enable Developer mode
  4. Click Load unpacked and select the project folder

3. Connect your API key

  • Click the Trust Shield icon
  • Paste your API key
  • Click Save

4. Test it

  • Visit a safe site → badge turns green 🛡️
  • Visit a phishing test site → badge turns red ⚠️ and a warning appears

No refreshes, no manual scans — protection is automatic.
