Building Indx.sh - Automating Content Discovery: How We Crawl GitHub for AI Resources

TL;DR: I built automated crawlers that discover AI coding prompts, skills, and MCP servers from GitHub, running daily via Vercel cron jobs. Here's how.


The Manual Content Problem

When I launched indx.sh, I had a content problem. The AI coding ecosystem moves fast:

  • New MCP servers pop up daily
  • Developers publish cursor rules and skill definitions constantly
  • Official repositories get updates
  • Star counts change

Manually tracking all this? Impossible.

The Solution: GitHub Crawlers

I built three automated crawlers that run daily:

  1. Prompts Crawler - Discovers .cursorrules, CLAUDE.md, and copilot-instructions.md files
  2. Skills Crawler - Finds repos with SKILL.md files
  3. MCP Crawler - Finds Model Context Protocol servers

All run as Vercel cron jobs, so the directory stays fresh without manual work.

How the Prompts Crawler Works

The newest crawler searches for AI coding rules across multiple tools:

const FILE_SEARCHES = [
  { query: 'filename:.cursorrules', tool: 'cursor' },
  { query: 'filename:CLAUDE.md', tool: 'claude-code' },
  { query: 'filename:copilot-instructions.md', tool: 'copilot' },
];

const REPO_SEARCHES = [
  'cursor-rules in:name,description',
  'awesome-cursorrules',
  'topic:cursor-rules',
];

For each file found:

  1. Fetch the content from GitHub
  2. Generate a slug from owner-repo-filename
  3. Infer category and tags from content
  4. Auto-verify repos with 100+ stars
  5. Upsert to database
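
A condensed sketch of that loop; fetchFileContent mirrors the skills crawler below, while makeSlug, inferMetadata, fetchStarCount, and the prompt model name are illustrative placeholders:

for (const item of codeSearchResults) {
  const [owner, repo] = item.repository.full_name.split('/');

  // 1. Fetch the raw file content
  const content = await fetchFileContent(owner, repo, item.path);

  // 2. Stable slug so reruns update instead of duplicating
  const slug = makeSlug(owner, repo, item.name);

  // 3. Infer category and tags from the content itself
  const { category, tags } = inferMetadata(content);

  // 4. Auto-verify popular repos
  const stars = await fetchStarCount(owner, repo);
  const verified = stars >= 100;

  // 5. Upsert so existing rows get refreshed, not duplicated
  await prisma.prompt.upsert({
    where: { slug },
    create: { slug, content, category, tags, verified, githubStars: stars },
    update: { content, githubStars: stars, verified },
  });
}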

First run indexed 175 prompts across Cursor, Claude Code, and Copilot.

How the Skills Crawler Works

// Search GitHub for SKILL.md files
const { items } = await searchGitHub('filename:SKILL.md');

for (const item of items) {
  const [owner, repo] = item.repository.full_name.split('/');

  // Fetch the actual SKILL.md content
  const content = await fetchFileContent(owner, repo, item.path);

  // Parse frontmatter (name, description, tags)
  const metadata = parseFrontmatter(content);

  // Upsert to database, keyed by a stable owner-repo-path slug
  const slug = `${owner}-${repo}-${item.path}`;
  await prisma.skill.upsert({
    where: { slug },
    create: { slug, ...metadata, content, githubStars },
    update: { githubStars }, // Keep stars fresh
  });
}

The key insight: GitHub's code search API lets you search by filename. A query like filename:SKILL.md returns every indexed repo containing that file.
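
Hitting that endpoint directly looks roughly like this (code search requires a token; GITHUB_TOKEN here is whatever env var holds it):

const res = await fetch(
  'https://api.github.com/search/code?q=filename:SKILL.md&per_page=50',
  {
    headers: {
      Accept: 'application/vnd.github+json',
      Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
    },
  }
);
const { total_count, items } = await res.json();
// Each item has name, path, and a repository object with full_name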

How the MCP Crawler Works

MCP servers are trickier - there's no single file convention. I use multiple search strategies:

const SEARCH_STRATEGIES = [
  'mcp server in:name,description',
  'model context protocol server',
  'topic:mcp',
  '@modelcontextprotocol/server',
  'mcp server typescript',
  'mcp server python',
];

For each strategy:

  1. Search GitHub repos sorted by stars
  2. Filter for MCP-related content
  3. Fetch package.json for npm package names
  4. Infer categories from description/topics
  5. Mark official repos (from modelcontextprotocol org) as verified
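
A rough sketch of one strategy pass; searchRepos, looksLikeMcpServer, fetchPackageJson, inferCategory, and upsertMcpServer are stand-ins for the real helpers:

for (const query of SEARCH_STRATEGIES) {
  // 1. Search repos, most-starred first, capped per run
  const { items } = await searchRepos(query, { sort: 'stars', order: 'desc', perPage: 50 });

  for (const repo of items) {
    // 2. Skip false positives: "mcp" in the name but not actually an MCP server
    if (!looksLikeMcpServer(repo)) continue;

    // 3. Grab package.json for the npm package name, if one exists
    const pkg = await fetchPackageJson(repo.owner.login, repo.name);

    // 4. Infer a category from description and topics
    const category = inferCategory(repo.description, repo.topics);

    // 5. Repos under the modelcontextprotocol org get auto-verified
    const verified = repo.owner.login === 'modelcontextprotocol';

    await upsertMcpServer({ repo, npmPackage: pkg?.name, category, verified });
  }
}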

The Cron Schedule

{
  "crons": [
    { "path": "/api/cron/sync-github-stats", "schedule": "0 3 * * *" },
    { "path": "/api/cron/crawl-skills", "schedule": "0 4 * * *" },
    { "path": "/api/cron/crawl-mcp", "schedule": "0 5 * * *" },
    { "path": "/api/cron/crawl-prompts", "schedule": "0 6 * * *" }
  ]
}

Every night (UTC):

  • 3:00 AM - Sync GitHub star counts for existing resources
  • 4:00 AM - Discover new skills
  • 5:00 AM - Discover new MCP servers
  • 6:00 AM - Discover new prompts/rules
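
Each path maps to a small route handler. A minimal sketch of one, assuming a Next.js App Router project, Vercel's optional CRON_SECRET header for authenticating cron calls, and a hypothetical crawlSkills entry point:

// app/api/cron/crawl-skills/route.ts
import { crawlSkills } from '@/lib/crawlers/skills'; // hypothetical module path

export async function GET(request: Request) {
  // Only allow Vercel's cron scheduler (it sends CRON_SECRET as a bearer token)
  const auth = request.headers.get('authorization');
  if (auth !== `Bearer ${process.env.CRON_SECRET}`) {
    return new Response('Unauthorized', { status: 401 });
  }

  // Process up to 50 items per invocation
  const result = await crawlSkills({ limit: 50 });
  return Response.json({ ok: true, ...result });
}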

Rate Limiting Matters

GitHub's API has limits. The search endpoints allow roughly 10 requests/minute (code search requires a token), while the core API gives authenticated callers 5,000 requests/hour.

I handle this carefully:

  • Small delays between requests
  • Process in batches (50 items per cron run)
  • Graceful retry on rate limit errors

if (res.status === 403) {
  const resetTime = Number(res.headers.get('X-RateLimit-Reset'));
  console.log(`Rate limited. Resets at ${new Date(resetTime * 1000)}`);
  await sleep(60000); // back off for a minute, then retry the request
}
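
sleep here doesn't need to be anything fancy; a promise-wrapped setTimeout also covers the small delay between requests:

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// A short pause between GitHub calls keeps the crawler under the search limit
for (const item of batch) {
  await processItem(item); // placeholder for the real per-item work
  await sleep(500);
}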

What I Learned

1. Incremental is better than bulk

Early versions tried to crawl everything at once. Timeouts, rate limits, chaos. Now I process 50 items per run and let the index accumulate over daily runs.
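
In code that's just a hard cap on how much one run touches (alreadyIndexed stands in for whatever dedup lookup you use):

const BATCH_SIZE = 50;

// Take the first 50 new items; tomorrow's run picks up where this one left off
const batch = searchResults
  .filter((item) => !alreadyIndexed.has(item.repository.full_name))
  .slice(0, BATCH_SIZE);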

2. Deduplication by slug

Same repo can appear in multiple search strategies. I generate consistent slugs (owner-repo-path) and upsert instead of insert.
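
A slug helper along these lines keeps reruns idempotent (a sketch, not the exact implementation):

function makeSlug(owner, repo, path) {
  return `${owner}-${repo}-${path}`
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse slashes, dots, etc. into dashes
    .replace(/^-|-$/g, '');      // trim leading/trailing dashes
}

// makeSlug('acme', 'cursor-rules', '.cursorrules') => 'acme-cursor-rules-cursorrules'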

3. Don't trust descriptions

Many repos have empty or useless descriptions. I fall back to: "AI rules from {owner}/{repo}". Not pretty, but it works.
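
In code it's a one-line fallback, roughly:

const description =
  repo.description?.trim() || `AI rules from ${repo.full_name}`;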

4. Official = trusted

Repos from modelcontextprotocol, anthropics, or anthropic-ai orgs get auto-verified badges. Community repos need manual verification.
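
The check is just an allowlist of the org names above, something like:

const OFFICIAL_ORGS = new Set(['modelcontextprotocol', 'anthropics', 'anthropic-ai']);

const autoVerified = OFFICIAL_ORGS.has(repo.owner.login.toLowerCase());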

Current Stats

With the crawlers running daily:

  • 790+ MCP servers indexed
  • 1,300+ skills discovered
  • 300+ prompts/rules indexed
  • Daily updates keep star counts fresh

The Honest Struggle

GitHub search isn't perfect. I get false positives - repos that mention "mcp" but aren't MCP servers. Manual review still matters for quality.

Also: the 50-item limit per cron run means it takes days to fully index everything. Vercel's 10-second timeout for hobby plans is real.

What's Next

  • Better category inference using AI
  • README parsing for richer descriptions
  • Automatic quality scoring based on stars, activity, docs
  • User submissions to fill gaps

Try It

Browse the auto-discovered resources at indx.sh.

Got a resource that's not indexed? Submit it or wait for the crawlers to find it.


This is part 2 of the "Building indx.sh" series.
