Ankur for Razorpay


Meet Bumblebee: Agentic AI Flagging Risky Merchants in Under 90 Seconds

contributors: @parin-k, @sumit12dec, @yashshree_shinde

If you're familiar with how a payments company operates, you know the drill. Risk agents manually review thousands of merchant websites every month, checking for red flags: sketchy privacy policies, misaligned pricing, questionable social media presence, suspicious domain registration patterns.

At Razorpay, our risk operations team was conducting 10,000 to 12,000 manual website reviews monthly, each taking roughly four minutes of human attention. That's 700 to 800 human hours consumed every month, and the quality was inconsistent because different agents would interpret the same signals differently.

The traditional approach to fraud detection involves throwing bodies at the problem or building rigid rule engines that break the moment fraudsters adapt their tactics. We needed something better, something that could scale with our transaction volume while actually getting smarter over time.

That's why we built what we're calling Agentic Risk, a multi-agent AI system that automates merchant website evaluation from end to end while maintaining the nuanced judgment that used to require human expertise.

Here's what makes this interesting: we didn't just replace humans with AI and call it done. We went through three distinct architectural iterations, each one teaching us hard lessons about what works and what doesn't when you're building AI agents for production fraud detection.

The journey from our initial n8n prototype, through a single ReAct-style agent, to our current multi-agent architecture reveals fundamental truths about building reliable AI systems at scale.

The Business Problem: When Manual Review Can't Keep Up

Let me paint the picture of what risk operations looked like before automation. When a new merchant signs up for Razorpay or when our fraud detection system flags an existing merchant, a case lands in our Risk Case Manager system. A human agent picks up that case and begins the investigation dance.

This process takes four minutes when everything goes smoothly, but that's rarely the case. Websites are structured differently, policy pages are hidden in weird places, domain information services have different interfaces, and social media handles aren't always obvious. The worst part isn't the time; it's the inconsistency. One agent might flag a merchant for having a generic privacy policy while another agent considers the same policy acceptable.

We were also paying thousands of dollars monthly for a third-party explicit content screening service, and it was generating about 50 alerts per month with less than 10% precision. Moreover, this service only caught one specific type of risk while ignoring dozens of other fraud indicators we cared about.

The fundamental issue was that we had excellent observability tools, structured data systems, and experienced risk analysts, but the connective tissue between all these components was human labor. Scaling meant hiring more agents, which meant more inconsistency, higher cost, and no improvement in detection speed or accuracy.

Phase 1: The n8n Prototype - When Visual Orchestration Hits Its Limits

We started with n8n, a visual workflow automation platform, to quickly prototype and validate our hypothesis. Within weeks, we had a working proof-of-concept integrating webhook ingestion, merchant metadata fetching, website content review via multimodal AI, domain lookups, GST enrichment, fraud metrics, and LLM-based risk analysis.

[Figure: Bumblebee n8n prototype workflow]

The prototype validated that automation was feasible and helped us identify the complete set of data points needed. However, n8n quickly revealed fundamental limitations: branch explosion (handling edge cases created unmaintainable 40-node workflows with duplicated logic), observability gaps (debugging failed nodes was painful with coarse logs), and platform instability (non-deterministic behavior in HTTP and merge operations). The n8n prototype taught us that production-grade risk automation would require a code-first approach with proper observability and the ability to use Python libraries directly.

Phase 2: Python + ReAct Agent - Better Control, New Bottlenecks

We rebuilt as a Python web application with an API frontend and task workers. This immediately solved several Phase 1 problems: native Python libraries, structured logging with trace IDs, proper exception handling with retry logic, and complex NLP preprocessing capabilities.

[Figure: Bumblebee Phase 2 sequence diagram]

The core was a single ReAct-style agent that iteratively reasoned about which tools to call, executed them, and incorporated results until producing a structured risk assessment. Phase 2 brought full observability, easy tool addition, and dynamic behavior that replaced brittle conditional logic.
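To make that concrete, here's a minimal sketch of what a ReAct-style loop of this shape looks like. The tool functions and the `call_llm` helper are illustrative placeholders, not our production code:

```python
import json

# Illustrative placeholder tools; the real fetch logic lived behind names like these.
def fetch_website_content(merchant_id):
    return {"html_length": 52_000, "has_privacy_policy": True}

def whois_lookup(merchant_id):
    return {"domain_age_days": 14, "registrar": "example-registrar"}

TOOLS = {"fetch_website_content": fetch_website_content, "whois_lookup": whois_lookup}

def react_agent(merchant_id, call_llm, max_steps=10):
    """Ask the LLM which tool to call, run it, append the observation,
    and repeat until the model emits a final risk assessment."""
    context = [f"Evaluate merchant {merchant_id}. Available tools: {list(TOOLS)}"]
    for _ in range(max_steps):
        decision = json.loads(call_llm("\n".join(context)))  # {"action": ...} or {"final": ...}
        if "final" in decision:
            return decision["final"]
        observation = TOOLS[decision["action"]](merchant_id)
        # Raw observations accumulate in the prompt -- the token bloat described below.
        context.append(f"{decision['action']} -> {json.dumps(observation)}")
    return {"status": "incomplete", "reason": "max steps reached"}
```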

However, new bottlenecks emerged. Token bloat became critical as the agent accumulated 50KB+ of HTML content, domain data, and fraud metrics in its context window, regularly hitting token limits. Sequential execution meant tool invocations happened one after another even when they had no dependencies, scaling linearly with tool count. Temperature conflation forced a compromise setting that was suboptimal for both exploration (tool selection) and exploitation (final scoring). Phase 2 proved agentic orchestration was right, but single-agent architecture couldn't scale to thousands of concurrent evaluations.

Phase 3: Multi-Agent Architecture - When Specialization Wins

The breakthrough came when we stopped treating fraud detection as a single AI task and started building a multi-agent collaboration system. Rather than one agent doing everything, we split responsibilities across specialized agents optimized for specific roles: Planner, Fetchers, and Analyzer.

[Figure: Bumblebee multi-agent architecture]

The Planner Agent receives the merchant case, examines available tools, checks system health and API quotas, and generates an execution plan. This isn't a rigid script; it's a structured specification of what information to gather, with priorities, timeouts, token budgets, and expected schemas. The Planner also enforces business rules deterministically: skip GST validation for non-Indian merchants, deprioritize social media checks for B2B merchants where social presence matters less. This reduces unnecessary API calls and focuses resources on high-signal checks.
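As a rough sketch of what such a plan can look like (the field names, tool names, and numbers below are illustrative, not our exact schema):

```python
from dataclasses import dataclass, field

@dataclass
class FetchTask:
    tool: str
    priority: int        # lower number = higher priority when workers are saturated
    timeout_s: int
    token_budget: int    # cap on the size of the summary the fetcher may return
    expected_schema: dict = field(default_factory=dict)

def build_plan(merchant: dict) -> list[FetchTask]:
    """Deterministic business rules the Planner applies before any LLM call."""
    plan = [
        FetchTask("website_content", priority=1, timeout_s=20, token_budget=1500),
        FetchTask("whois_lookup", priority=1, timeout_s=10, token_budget=300),
        FetchTask("fraud_metrics", priority=1, timeout_s=10, token_budget=500),
    ]
    # Skip GST validation entirely for non-Indian merchants.
    if merchant.get("country") == "IN":
        plan.append(FetchTask("gst_validation", priority=2, timeout_s=15, token_budget=300))
    # Deprioritize social media checks for B2B merchants.
    social_priority = 3 if merchant.get("business_type") == "B2B" else 2
    plan.append(FetchTask("social_media_metrics", priority=social_priority, timeout_s=20, token_budget=400))
    return plan
```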

Data Fetcher Agents execute in parallel, each owning one data source or tool: website scraping, WHOIS lookups, fraud database queries, social media metrics, pricing comparisons, policy verification. Here's the critical insight: fetchers don't just retrieve raw data. They perform local data pruning before returning results.

The website content reviewer doesn't send back 50KB of HTML. It extracts only relevant sections: privacy policies, contact information, pricing tables, product descriptions. Using keyword matching or lightweight NLP models, it returns a compact JSON payload with structured snippets, confidence scores, and provenance links. This solves the token bloat problem. Instead of accumulating full raw outputs, the system maintains small, information-dense summaries.
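A minimal sketch of that pruning step, assuming the HTML has already been stripped to plain text (the keyword lists and payload fields are illustrative):

```python
import re

SECTION_KEYWORDS = {
    "privacy_policy": ["privacy policy", "data protection"],
    "contact": ["contact us", "support@", "phone"],
    "pricing": ["price", "per month", "subscription"],
}

def prune_website_content(url: str, page_text: str) -> dict:
    """Return a compact, structured payload instead of raw HTML.
    page_text is assumed to be tag-stripped already (e.g. via an HTML parser)."""
    sections = {}
    for section, keywords in SECTION_KEYWORDS.items():
        hits = []
        for kw in keywords:
            for match in re.finditer(re.escape(kw), page_text, re.IGNORECASE):
                start = max(match.start() - 200, 0)
                hits.append(page_text[start:match.end() + 200])  # keep a small window around the hit
        if hits:
            sections[section] = {
                "snippets": hits[:3],                   # cap how much text travels downstream
                "confidence": min(1.0, len(hits) / 5),  # crude keyword-density score
                "provenance": url,
            }
    return {"url": url, "sections": sections, "raw_chars_dropped": len(page_text)}
```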

Each fetcher compresses its domain's data into a format optimized for downstream analysis. Fetchers also implement caching for data that doesn't change frequently. WHOIS information and domain reputation scores get cached with appropriate TTLs, reducing redundant external API calls and improving throughput during traffic spikes.
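A simple TTL cache is enough to illustrate the idea. This in-process version is a sketch; a shared cache such as Redis is a better fit once many workers are involved:

```python
import time
from functools import wraps

def ttl_cache(ttl_seconds: int):
    """Cache results per argument tuple for ttl_seconds."""
    def decorator(fn):
        store = {}
        @wraps(fn)
        def wrapper(*args):
            now = time.time()
            if args in store:
                value, expires_at = store[args]
                if now < expires_at:
                    return value
            value = fn(*args)
            store[args] = (value, now + ttl_seconds)
            return value
        return wrapper
    return decorator

@ttl_cache(ttl_seconds=24 * 3600)   # WHOIS data rarely changes within a day
def whois_lookup(domain: str) -> dict:
    # Call the external WHOIS provider here; the body is omitted in this sketch.
    return {"domain": domain, "age_days": 14}
```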

The Analyzer Agent consumes these structured payloads and produces the final risk assessment. It runs deterministic rules first: hard thresholds for fraud metrics, blacklist checks, compliance violations. These rules are fast, explainable, and don't require LLM inference.

Only after deterministic rules does the Analyzer invoke the LLM for interpretive tasks: generating human-readable summaries, explaining why certain indicators triggered, identifying nuanced patterns that don't fit simple rules. Because fetchers already pruned and structured the data, the Analyzer's LLM calls work with minimal context, avoiding token limit issues entirely.
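Roughly, the Analyzer's flow looks like the following. The thresholds, field names, and `call_llm` helper are illustrative:

```python
def analyze(payloads: dict, call_llm) -> dict:
    """Deterministic rules first; the LLM only ever sees the pruned summaries."""
    findings, score = [], 0

    metrics = payloads.get("fraud_metrics", {})
    if metrics.get("chargeback_rate", 0) > 0.02:            # hard threshold, no LLM needed
        findings.append("chargeback_rate_above_threshold")
        score += 40
    if payloads.get("whois_lookup", {}).get("domain_age_days", 9999) < 30:
        findings.append("very_new_domain")
        score += 20
    if "privacy_policy" not in payloads.get("website_content", {}).get("sections", {}):
        findings.append("missing_privacy_policy")
        score += 15

    # Interpretive step: a small, focused prompt over compact summaries only.
    narrative = call_llm(
        f"Explain these risk findings for a human reviewer: {findings}. "
        f"Supporting data: {payloads}"
    )
    return {"risk_score": min(score, 100), "findings": findings, "summary": narrative}
```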

Different agents use different temperature settings tuned for their roles. The Planner runs at medium temperature for flexible tool selection. The Analyzer uses very low temperature for deterministic risk scoring and higher temperature when generating business narratives where creative expression improves readability. This per-agent temperature control eliminates the compromises from Phase 2.
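In configuration terms the separation is simple; the values below are illustrative rather than our tuned numbers:

```python
# Illustrative per-agent sampling settings, tuned per role rather than per system.
AGENT_TEMPERATURE = {
    "planner": 0.5,              # medium: flexible tool selection
    "analyzer_scoring": 0.0,     # very low: deterministic risk scoring
    "analyzer_narrative": 0.7,   # higher: readable business narratives
}
```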

The execution model leverages Celery for orchestration. When a case arrives, the API enqueues a planning job. The Planner generates the execution plan and enqueues multiple fetcher jobs in parallel. As fetchers complete, their results stream into a shared state store. The Analyzer subscribes to fetcher completion events and begins processing as soon as enough data is available, not waiting for every fetcher if some are slow or failing.

If a fetcher fails entirely (website unreachable, API rate-limiting), the Planner degrades gracefully. The Analyzer proceeds with available data and flags the missing information for manual review rather than blocking the entire evaluation. This resilience was impossible in Phase 2's sequential architecture.
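A stripped-down Celery sketch of this fan-out/fan-in shape is below. It leans on the `build_plan` and `analyze` sketches above, assumes a `FETCHERS` registry mapping tool names to callables and a `call_llm` helper, and uses a chord for simplicity; the production design streams fetcher results into a shared state store so the Analyzer can start before every fetcher finishes.

```python
from celery import Celery, chord

app = Celery("bumblebee", broker="redis://localhost:6379/0")

@app.task(soft_time_limit=30)
def run_fetcher(tool_name: str, merchant_id: str) -> dict:
    """Run one fetcher; on failure return a marker instead of raising,
    so a single unreachable source never blocks the whole evaluation."""
    try:
        return {"tool": tool_name, "ok": True, "data": FETCHERS[tool_name](merchant_id)}
    except Exception as exc:
        return {"tool": tool_name, "ok": False, "error": str(exc)}

@app.task
def run_analyzer(fetcher_results: list, merchant_id: str) -> dict:
    available = {r["tool"]: r["data"] for r in fetcher_results if r["ok"]}
    missing = [r["tool"] for r in fetcher_results if not r["ok"]]
    assessment = analyze(available, call_llm)      # see the Analyzer sketch above
    assessment["missing_sources"] = missing        # flagged for manual review
    return assessment

@app.task
def evaluate_merchant(merchant: dict) -> None:
    plan = build_plan(merchant)                    # see the Planner sketch above
    fetch_jobs = [run_fetcher.s(task.tool, merchant["id"]) for task in plan]
    chord(fetch_jobs)(run_analyzer.s(merchant["id"]))   # fan out, then analyze
```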

The Results: When Architecture Meets Reality

The shift to multi-agent architecture produced measurable improvements across every dimension. Token usage dropped 60% through fetcher-level pruning and elimination of full raw data in LLM context. End-to-end latency fell from 35 seconds to 8-12 seconds via parallel fetcher execution and focused LLM calls. Success rate rose from 88% to 99%+, measured as cases completing without token limits or LLM failures.

Cost per evaluation decreased despite adding sophisticated analysis. Smaller context windows meant cheaper LLM calls. Caching at the fetcher level reduced external API expenses. The system now handles thousands of concurrent evaluations without bottlenecking, scaling horizontally by adding task workers rather than vertically with bigger servers.

The most important improvement is maintainability and extensibility. Adding a new risk signal requires writing a new fetcher agent with its pruning logic and output schema. The Planner automatically incorporates new tools once registered. The Analyzer adapts to new data sources without modification. This composability enables continuous fraud detection improvement by adding signals incrementally rather than requiring architectural rewrites.
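For illustration, registration can be as simple as a decorator that adds the fetcher and its output schema to a registry the Planner reads; the signal name below is hypothetical:

```python
FETCHERS = {}          # tool name -> callable, read by the Planner and the task workers
FETCHER_SCHEMAS = {}   # tool name -> expected output schema

def register_fetcher(name: str, output_schema: dict):
    def decorator(fn):
        FETCHERS[name] = fn
        FETCHER_SCHEMAS[name] = output_schema
        return fn
    return decorator

@register_fetcher("app_store_presence", output_schema={"listed": "bool", "rating": "float"})
def fetch_app_store_presence(merchant_id: str) -> dict:
    # Hypothetical new risk signal: does the merchant have a live app listing?
    return {"listed": False, "rating": 0.0}
```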

The multi-agent approach provides observability impossible in earlier phases. Each agent logs trace IDs, tokens consumed, latency, confidence scores, and reasoning. When a case produces unexpected results, we replay the exact sequence of fetcher outputs, examine what the Analyzer saw, and understand why it reached that conclusion. This audit trail is critical for debugging, regulatory compliance, and explaining decisions to merchants who dispute risk assessments.

What We Learned: Principles for Building Production AI Agents

Our journey from n8n through ReAct to multi-agent orchestration taught us several lessons that apply broadly to anyone building AI systems for production use cases.

Start simple, evolve deliberately. n8n was the right choice for Phase 1 even though we knew it wouldn't scale. Rapid prototyping and stakeholder validation matter more than architectural purity in early stages. What's critical is recognizing when you've outgrown your current approach and having the discipline to rebuild rather than patch over fundamental limitations.

Token budgets are real constraints. Many blog posts about AI agents gloss over token management, but in production systems with large, messy real-world data, token limits are where architectures break. Design explicitly for token efficiency: prune early, prune often, and never pass raw, unstructured data to LLMs when you can send structured summaries instead.

Specialization beats generalization at scale. A single agent trying to handle planning, data fetching, and analysis will hit walls that can't be solved with better prompts or bigger models. Splitting responsibilities across specialized agents with clear interfaces between them produces systems that are faster, more reliable, and easier to understand.

Temperature is not a hyperparameter you tune once. Different tasks need different temperature settings, and trying to find a compromise temperature for a single agent produces mediocre results everywhere. Per-agent temperature control is a fundamental architectural requirement, not an optimization detail.

Parallelism matters more than model size. Running multiple smaller, focused agents in parallel often outperforms running one large agent sequentially, both in terms of latency and cost. This runs counter to the instinct to throw the biggest model available at every problem.

Observability is not optional. Without structured logging, trace IDs, and the ability to replay decision sequences, debugging production AI systems is nearly impossible. Invest in observability infrastructure early, ideally before you have a production incident that requires it.

The Path Forward: Continuous Improvement by Design

What makes the multi-agent architecture particularly powerful is that it's designed for continuous improvement. As we accumulate more cases, we can identify patterns where the Analyzer produces low-confidence results or where human reviewers frequently override AI decisions. These cases become training data for improving fetcher pruning heuristics, refining Planner rules, and tuning Analyzer prompts.

We're exploring several extensions. One is fine-tuning small, specialized models for specific fetchers rather than relying entirely on general-purpose LLMs, which could further reduce cost and latency while improving accuracy for domain-specific tasks like policy compliance checking. Another is implementing feedback loops where human overrides automatically update Planner rules or Analyzer thresholds, creating a self-improving system that gets smarter as risk operators correct its mistakes.

Another direction is adding predictive agents that don't just evaluate merchant risk at onboarding but continuously monitor for behavioral changes that might indicate fraud. Imagine fetchers running periodically in the background, detecting when a merchant's website content changes significantly, when pricing diverges from competitors, or when social media presence suddenly evaporates. The same multi-agent architecture that handles point-in-time evaluation can drive continuous risk monitoring with minimal modification.

Why This Matters Beyond Fraud Detection

I've been talking about merchant risk evaluation specifically, but the architectural patterns we discovered apply broadly to any domain where AI agents need to process large amounts of heterogeneous data, make complex decisions, and produce explainable results. Financial services, healthcare, supply chain management, cybersecurity, and legal research all have similar characteristics: multiple data sources with different formats and latencies, domain expertise encoded in rules and models, and requirements for auditability and compliance.

The lesson isn't "use multi-agent architecture for everything." The lesson is that as AI systems scale from demos to production, the architecture that got you to the first prototype often becomes the main thing preventing you from scaling further. Having the discipline to recognize when you've hit architectural limits, the willingness to rebuild from first principles, and the engineering rigor to measure improvements objectively separates successful production AI from expensive science projects.

At Razorpay, we've taken fraud detection from a manual, inconsistent process consuming 800 agent hours monthly to an automated system that evaluates merchants in seconds with higher accuracy and comprehensive audit trails. We've reduced our per-review time by 75%, improved detection consistency, and freed up risk operators to focus on genuinely complex cases that require human judgment. And we've done it with an architecture that gets better over time rather than more fragile.

If you're building AI agents for production use cases, the technology is ready. The LLMs are capable, the orchestration frameworks exist, and the integration tools work. The hard part is designing systems that handle real-world messiness, scale with your business, and maintain reliability when things inevitably break. That's where architecture matters, and that's what we learned the hard way through three iterations of building Agentic Risk.

editor: @paaarth96
