Building Indx.sh - Automating Content Discovery: How We Crawl GitHub for AI Resources

TL;DR: I built automated crawlers that discover AI coding prompts, skills, and MCP servers from GitHub, running daily via Vercel cron jobs. Here's how.


The Manual Content Problem

When I launched indx.sh, I had a content problem. The AI coding ecosystem moves fast:

  • New MCP servers pop up daily
  • Developers publish cursor rules and skill definitions constantly
  • Official repositories get updates
  • Star counts change

Manually tracking all this? Impossible.

The Solution: GitHub Crawlers

I built three automated crawlers that run daily:

  1. Prompts Crawler - Discovers .cursorrules, CLAUDE.md, and copilot-instructions.md files
  2. Skills Crawler - Finds repos with SKILL.md files
  3. MCP Crawler - Finds Model Context Protocol servers

All run as Vercel cron jobs, so the directory stays fresh without manual work.

How the Prompts Crawler Works

The newest crawler searches for AI coding rules across multiple tools:

const FILE_SEARCHES = [
  { query: 'filename:.cursorrules', tool: 'cursor' },
  { query: 'filename:CLAUDE.md', tool: 'claude-code' },
  { query: 'filename:copilot-instructions.md', tool: 'copilot' },
];

const REPO_SEARCHES = [
  'cursor-rules in:name,description',
  'awesome-cursorrules',
  'topic:cursor-rules',
];

For each file found:

  1. Fetch the content from GitHub
  2. Generate a slug from owner-repo-filename
  3. Infer category and tags from content
  4. Auto-verify repos with 100+ stars
  5. Upsert to database
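
A condensed sketch of that loop; fetchFileContent mirrors the skills crawler below, while makeSlug, inferMetadata, fetchStarCount, and the prompt model name are illustrative placeholders:

for (const item of codeSearchResults) {
  const [owner, repo] = item.repository.full_name.split('/');

  // 1. Fetch the raw file content
  const content = await fetchFileContent(owner, repo, item.path);

  // 2. Stable slug so reruns update instead of duplicating
  const slug = makeSlug(owner, repo, item.name);

  // 3. Infer category and tags from the content itself
  const { category, tags } = inferMetadata(content);

  // 4. Auto-verify popular repos
  const stars = await fetchStarCount(owner, repo);
  const verified = stars >= 100;

  // 5. Upsert so existing rows get refreshed, not duplicated
  await prisma.prompt.upsert({
    where: { slug },
    create: { slug, content, category, tags, verified, githubStars: stars },
    update: { content, githubStars: stars, verified },
  });
}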

First run indexed 175 prompts across Cursor, Claude Code, and Copilot.

How the Skills Crawler Works

// Search GitHub for SKILL.md files
const { items } = await searchGitHub('filename:SKILL.md');

for (const item of items) {
  const [owner, repo] = item.repository.full_name.split('/');

  // Fetch the actual SKILL.md content
  const content = await fetchFileContent(owner, repo, item.path);

  // Parse frontmatter (name, description, tags)
  const metadata = parseFrontmatter(content);

  // Upsert to database, keyed by a stable owner-repo-path slug
  const slug = `${owner}-${repo}-${item.path}`;
  await prisma.skill.upsert({
    where: { slug },
    create: { slug, ...metadata, content, githubStars },
    update: { githubStars }, // Keep stars fresh
  });
}

The key insight: GitHub's code search API lets you search by filename. A query like filename:SKILL.md returns every indexed repo containing that file.
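
Hitting that endpoint directly looks roughly like this (code search requires a token; GITHUB_TOKEN here is whatever env var holds it):

const res = await fetch(
  'https://api.github.com/search/code?q=filename:SKILL.md&per_page=50',
  {
    headers: {
      Accept: 'application/vnd.github+json',
      Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
    },
  }
);
const { total_count, items } = await res.json();
// Each item has name, path, and a repository object with full_name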

How the MCP Crawler Works

MCP servers are trickier - there's no single file convention. I use multiple search strategies:

const SEARCH_STRATEGIES = [
  'mcp server in:name,description',
  'model context protocol server',
  'topic:mcp',
  '@modelcontextprotocol/server',
  'mcp server typescript',
  'mcp server python',
];

For each strategy:

  1. Search GitHub repos sorted by stars
  2. Filter for MCP-related content
  3. Fetch package.json for npm package names
  4. Infer categories from description/topics
  5. Mark official repos (from modelcontextprotocol org) as verified
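
A rough sketch of one strategy pass; searchRepos, looksLikeMcpServer, fetchPackageJson, inferCategory, and upsertMcpServer are stand-ins for the real helpers:

for (const query of SEARCH_STRATEGIES) {
  // 1. Search repos, most-starred first, capped per run
  const { items } = await searchRepos(query, { sort: 'stars', order: 'desc', perPage: 50 });

  for (const repo of items) {
    // 2. Skip false positives: "mcp" in the name but not actually an MCP server
    if (!looksLikeMcpServer(repo)) continue;

    // 3. Grab package.json for the npm package name, if one exists
    const pkg = await fetchPackageJson(repo.owner.login, repo.name);

    // 4. Infer a category from description and topics
    const category = inferCategory(repo.description, repo.topics);

    // 5. Repos under the modelcontextprotocol org get auto-verified
    const verified = repo.owner.login === 'modelcontextprotocol';

    await upsertMcpServer({ repo, npmPackage: pkg?.name, category, verified });
  }
}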

The Cron Schedule

{
  "crons": [
    { "path": "/api/cron/sync-github-stats", "schedule": "0 3 * * *" },
    { "path": "/api/cron/crawl-skills", "schedule": "0 4 * * *" },
    { "path": "/api/cron/crawl-mcp", "schedule": "0 5 * * *" },
    { "path": "/api/cron/crawl-prompts", "schedule": "0 6 * * *" }
  ]
}

Every night (UTC):

  • 3:00 AM - Sync GitHub star counts for existing resources
  • 4:00 AM - Discover new skills
  • 5:00 AM - Discover new MCP servers
  • 6:00 AM - Discover new prompts/rules
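
Each path maps to a small route handler. A minimal sketch of one, assuming a Next.js App Router project, Vercel's optional CRON_SECRET header for authenticating cron calls, and a hypothetical crawlSkills entry point:

// app/api/cron/crawl-skills/route.ts
import { crawlSkills } from '@/lib/crawlers/skills'; // hypothetical module path

export async function GET(request: Request) {
  // Only allow Vercel's cron scheduler (it sends CRON_SECRET as a bearer token)
  const auth = request.headers.get('authorization');
  if (auth !== `Bearer ${process.env.CRON_SECRET}`) {
    return new Response('Unauthorized', { status: 401 });
  }

  // Process up to 50 items per invocation
  const result = await crawlSkills({ limit: 50 });
  return Response.json({ ok: true, ...result });
}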

Rate Limiting Matters

GitHub's API has limits. The search endpoints allow roughly 10 requests/minute (code search requires a token), while the core API gives authenticated callers 5,000 requests/hour.

I handle this carefully:

  • Small delays between requests
  • Process in batches (50 items per cron run)
  • Graceful retry on rate limit errors

if (res.status === 403) {
  const resetTime = Number(res.headers.get('X-RateLimit-Reset'));
  console.log(`Rate limited. Resets at ${new Date(resetTime * 1000)}`);
  await sleep(60000); // back off for a minute, then retry the request
}
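
sleep here doesn't need to be anything fancy; a promise-wrapped setTimeout also covers the small delay between requests:

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// A short pause between GitHub calls keeps the crawler under the search limit
for (const item of batch) {
  await processItem(item); // placeholder for the real per-item work
  await sleep(500);
}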

What I Learned

1. Incremental is better than bulk

Early versions tried to crawl everything at once. Timeouts, rate limits, chaos. Now I process 50 items per run and let the index accumulate over daily runs.
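
In code that's just a hard cap on how much one run touches (alreadyIndexed stands in for whatever dedup lookup you use):

const BATCH_SIZE = 50;

// Take the first 50 new items; tomorrow's run picks up where this one left off
const batch = searchResults
  .filter((item) => !alreadyIndexed.has(item.repository.full_name))
  .slice(0, BATCH_SIZE);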

2. Deduplication by slug

Same repo can appear in multiple search strategies. I generate consistent slugs (owner-repo-path) and upsert instead of insert.
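
A slug helper along these lines keeps reruns idempotent (a sketch, not the exact implementation):

function makeSlug(owner, repo, path) {
  return `${owner}-${repo}-${path}`
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse slashes, dots, etc. into dashes
    .replace(/^-|-$/g, '');      // trim leading/trailing dashes
}

// makeSlug('acme', 'cursor-rules', '.cursorrules') => 'acme-cursor-rules-cursorrules'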

3. Don't trust descriptions

Many repos have empty or useless descriptions. I fall back to: "AI rules from {owner}/{repo}". Not pretty, but it works.
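
In code it's a one-line fallback, roughly:

const description =
  repo.description?.trim() || `AI rules from ${repo.full_name}`;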

4. Official = trusted

Repos from modelcontextprotocol, anthropics, or anthropic-ai orgs get auto-verified badges. Community repos need manual verification.
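
The check is just an allowlist of the org names above, something like:

const OFFICIAL_ORGS = new Set(['modelcontextprotocol', 'anthropics', 'anthropic-ai']);

const autoVerified = OFFICIAL_ORGS.has(repo.owner.login.toLowerCase());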

Current Stats

With the crawlers running daily:

  • 790+ MCP servers indexed
  • 1,300+ skills discovered
  • 300+ prompts/rules indexed
  • Daily updates keep star counts fresh

The Honest Struggle

GitHub search isn't perfect. I get false positives - repos that mention "mcp" but aren't MCP servers. Manual review still matters for quality.

Also: the 50-item limit per cron run means it takes days to fully index everything. Vercel's 10-second timeout for hobby plans is real.

What's Next

  • Better category inference using AI
  • README parsing for richer descriptions
  • Automatic quality scoring based on stars, activity, docs
  • User submissions to fill gaps

Try It

Browse the auto-discovered resources at indx.sh.

Got a resource that's not indexed? Submit it or wait for the crawlers to find it.


This is part 2 of the "Building indx.sh" series.
