Why Cosine Similarity Fails in RAG (And What to Use Instead)

You've built a RAG system. Your retriever returns chunks with 0.85 cosine similarity. Your LLM still hallucinates.

Sound familiar?

I've spent months debugging this exact problem across production RAG systems. The issue isn't your embedding model or your chunking strategy. It's that cosine similarity measures the wrong thing.

The Problem: "Relevant" ≠ "Useful"

Here's a real example that broke in production:

User Query: "How do I cancel my free trial?"

Top Retrieved Chunk (cosine: 0.78):
"Subscriptions renew monthly or yearly, depending on your plan."

LLM Output:
"You can cancel by not renewing at the end of your billing cycle."

This is completely wrong. The chunk mentions subscriptions and renewal, so it scores high on cosine similarity. But it doesn't actually explain how to cancel a trial.

Why Cosine Similarity Misleads

Cosine similarity measures vector proximity in embedding space, and embedding models are trained so that proximity captures:

  • Keyword overlap
  • Phrasing similarity
  • Topic relatedness

What it doesn't measure:

  • Whether the chunk can answer the question
  • Logical usefulness for the specific query
  • Semantic fitness for the user's intent

Think about it: both "cancel free trial" and "subscription renewal" contain similar vocabulary. Embedding models learn that these concepts are related. So they end up close in vector space.

But topic similarity ≠ answer capability.
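
You can check this directly with the same sentence-transformers setup used later in this post. Exact scores vary by model, but a topically related pair will typically score far higher than an unrelated one, even when neither text can answer the question:

from sentence_transformers import SentenceTransformer, util

# Same model used in the examples below; scores will vary by model.
model = SentenceTransformer('all-MiniLM-L6-v2')

query = "How do I cancel my free trial?"
candidates = [
    "Subscriptions renew monthly or yearly, depending on your plan.",  # related topic, no answer
    "Our office is closed on public holidays.",                        # unrelated topic
]

q_emb = model.encode(query, normalize_embeddings=True)
for text in candidates:
    c_emb = model.encode(text, normalize_embeddings=True)
    print(f"{util.cos_sim(q_emb, c_emb).item():.3f}  {text}")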

The Fix: Semantic Stress (ΔS)

Instead of measuring proximity, we need to measure semantic fitness - how well a chunk actually serves the question's intent.

Enter Semantic Stress (ΔS):

ΔS = 1 − cos(I, G)

Where:
  I = question embedding (Intent)
  G = chunk embedding (Grounding)

"Wait, Isn't That Just 1 Minus Cosine?"

Mathematically, yes. But the key difference is what you do with it.

Traditional RAG uses cosine to rank chunks:

# Standard approach
chunks = retriever.search(query, k=10)  # Returns top-10 by cosine
# Hope for the best

Semantic stress uses hard thresholds to filter chunks:

ΔS < 0.40       STABLE (chunk is semantically fit)
ΔS 0.40-0.60    TRANSITIONAL (risky, review before using)
ΔS ≥ 0.60       REJECT (will cause hallucinations)
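
As a minimal sketch, the bands can be wrapped in a small helper (the names and cut-offs mirror the table above):

def stress_band(delta_s: float) -> str:
    """Map a ΔS score onto the bands above."""
    if delta_s < 0.40:
        return "STABLE"        # chunk is semantically fit
    if delta_s < 0.60:
        return "TRANSITIONAL"  # risky, review before using
    return "REJECT"            # likely to cause hallucinations

print(stress_band(0.22))  # STABLE
print(stress_band(0.54))  # TRANSITIONAL
print(stress_band(0.71))  # REJECT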

Why This Works: Engineering Tolerances

Think of it like bridge engineering:

Cosine similarity is like measuring distance:

"These two points are 5 meters apart"

Semantic stress is like measuring load capacity:

"This bridge will collapse under 500kg"

Cosine tells you chunks are "related." ΔS tells you if they'll break your reasoning.

Real Example: The Numbers Tell the Story

Let's calculate both metrics for our subscription example:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

query = "How do I cancel my free trial?"
chunk = "Subscriptions renew monthly or yearly depending on plan."

q_emb = model.encode(query, normalize_embeddings=True)
c_emb = model.encode(chunk, normalize_embeddings=True)

cosine = float(util.cos_sim(q_emb, c_emb)[0][0])
delta_s = 1 - cosine

print(f"Cosine: {cosine:.3f}")  # 0.457
print(f"ΔS: {delta_s:.3f}")      # 0.543

The insight:

  • Cosine 0.457 might rank this chunk in your top-10
  • ΔS 0.543 tells you it's in the transitional danger zone
  • Traditional RAG would use this chunk
  • A stricter semantic filter (e.g., ΔS < 0.50 for transactional queries) would reject it and prevent the hallucination

A More Dramatic Example: High Cosine, Total Failure

Here's where cosine similarity really falls apart:

Query: "How do I cancel my subscription after the free trial?"

Retrieved Chunk (cosine: 0.78):
"Subscriptions renew monthly or yearly, depending on your plan."

Metrics:
- Cosine: 0.78 from the retriever's embedding model (HIGH - ranks #1 or #2)
- ΔS: 0.54 from the MiniLM check above (TRANSITIONAL - semantically weak)

Standard RAG output:
"Simply choose not to renew your plan at the end of the billing cycle."
❌ Doesn't explain trial cancellation process

With ΔS filtering (threshold 0.50 for transactional queries):
"This chunk discusses subscription renewal but doesn't address 
trial cancellation. Looking for content about trial-specific policies..."
✅ Identifies the gap, searches for better chunk

Why this happens:

  • Keyword overlap: "subscription" appears in both
  • Semantic proximity: Embeddings learn "cancel," "renew," "trial," "plan" are related
  • Surface match: Vectors are close in embedding space
  • Intent mismatch: Query asks about trial cancellation, chunk describes renewal billing

Cosine measures "are these about similar topics?" ΔS measures "can this chunk answer the question?"

Implementation: Add 5 Lines to Your RAG Pipeline

Here's the minimal semantic filter:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def filter_by_semantic_stress(query: str, chunks: list[str], 
                               threshold: float = 0.60) -> list[str]:
    """Filter chunks by semantic fitness."""
    q_emb = model.encode(query, normalize_embeddings=True)

    filtered = []
    for chunk in chunks:
        c_emb = model.encode(chunk, normalize_embeddings=True)
        cosine = float(util.cos_sim(c_emb, q_emb)[0][0])
        delta_s = 1 - cosine

        if delta_s < threshold:
            filtered.append(chunk)

    return filtered

# In your RAG pipeline:
chunks = retriever.search(query, k=20)
safe_chunks = filter_by_semantic_stress(query, chunks)

if safe_chunks:
    context = "\n\n".join(safe_chunks)
    answer = llm.complete(f"Context: {context}\n\nQuestion: {query}")
else:
    answer = "No relevant content found. Please refine your query."
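
The else branch is also the natural place for the gap-aware retry from the earlier example: if nothing passes the filter, re-query instead of answering from weak context. A rough sketch, where reformulate_query is a placeholder for whatever rewriting step you use (an LLM call, keyword expansion, etc.) and retriever is the same object as above:

def retrieve_with_fallback(query: str, threshold: float = 0.50,
                           k: int = 20) -> list[str]:
    """Filter by ΔS; if nothing survives, retry once with a reformulated query."""
    for attempt in range(2):
        # reformulate_query is a placeholder for your own rewriting step
        attempt_query = query if attempt == 0 else reformulate_query(query)
        chunks = retriever.search(attempt_query, k=k)
        # ΔS is always scored against the ORIGINAL query's intent
        safe_chunks = filter_by_semantic_stress(query, chunks, threshold)
        if safe_chunks:
            return safe_chunks
    return []  # surface the gap instead of answering from weak context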

When to Use Stricter Thresholds

Not all queries are equal. Adjust your threshold based on risk:

Use Case                            Threshold   Why
High-stakes (medical, legal)        < 0.35      Need very high confidence
Transactional (pricing, policies)   < 0.40      Accuracy critical
General FAQ                         < 0.50      Some tolerance acceptable
Exploratory search                  < 0.60      Broader matching OK
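
If you already tag queries by type, this table maps onto a small config. A sketch with illustrative labels (the classification step itself isn't shown here):

# Illustrative labels; how you classify queries (rules, a classifier,
# an LLM call) is up to your stack.
DELTA_S_THRESHOLDS = {
    "high_stakes":   0.35,  # medical, legal
    "transactional": 0.40,  # pricing, policies
    "faq":           0.50,  # general FAQ
    "exploratory":   0.60,  # broad search
}

def threshold_for(query_type: str) -> float:
    # Unknown types fall back to the strictest threshold.
    return DELTA_S_THRESHOLDS.get(query_type, 0.35)

safe_chunks = filter_by_semantic_stress(query, chunks,
                                        threshold=threshold_for("transactional"))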

Complete Diagnostic Pipeline

Here's a production-ready implementation with metrics:

import numpy as np
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def diagnose_and_filter(query: str, chunks: list[dict], 
                        threshold: float = 0.60):
    """
    Complete diagnostic pipeline with metrics.

    Args:
        query: User query
        chunks: List of dicts with 'text' and 'id' keys
        threshold: ΔS rejection threshold

    Returns:
        {
            'accepted': list[dict],
            'rejected': list[dict],
            'stats': dict
        }
    """
    q_emb = model.encode(query, normalize_embeddings=True)

    accepted = []
    rejected = []
    delta_s_scores = []

    for chunk in chunks:
        c_emb = model.encode(chunk['text'], normalize_embeddings=True)
        cosine = float(util.cos_sim(c_emb, q_emb)[0][0])
        delta_s = 1 - cosine

        delta_s_scores.append(delta_s)

        chunk_with_metrics = {
            **chunk,
            'delta_s': delta_s,
            'cosine': cosine
        }

        if delta_s < threshold:
            accepted.append(chunk_with_metrics)
        else:
            rejected.append(chunk_with_metrics)

    return {
        'accepted': accepted,
        'rejected': rejected,
        'stats': {
            'total': len(chunks),
            'accepted_count': len(accepted),
            'rejected_count': len(rejected),
            'delta_s_mean': np.mean(delta_s_scores),
            'delta_s_min': np.min(delta_s_scores),
            'delta_s_max': np.max(delta_s_scores)
        }
    }

# Example usage
query = "How do I cancel my free trial?"
chunks = retriever.search(query, k=20)

result = diagnose_and_filter(query, chunks)

print(f"Accepted: {result['stats']['accepted_count']}/{result['stats']['total']}")
print(f"ΔS range: {result['stats']['delta_s_min']:.2f} - {result['stats']['delta_s_max']:.2f}")

if result['accepted']:
    context = "\n\n".join([c['text'] for c in result['accepted']])
    answer = llm.complete(f"Context: {context}\n\nQuestion: {query}")
else:
    answer = "No sufficiently relevant chunks found."
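
One performance note: the pipeline above encodes one chunk per call. sentence-transformers also accepts a list of texts, so for larger k you can score everything in a single batched pass. A sketch of just the scoring step:

import numpy as np
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

def delta_s_batch(query: str, texts: list[str]) -> np.ndarray:
    """Compute ΔS for every text in one batched encode call."""
    q_emb = model.encode(query, normalize_embeddings=True)
    c_embs = model.encode(texts, normalize_embeddings=True)   # shape (n, dim)
    cosines = util.cos_sim(q_emb, c_embs)[0].cpu().numpy()    # shape (n,)
    return 1.0 - cosines

# e.g. scores for the chunk dicts used above
scores = delta_s_batch(query, [c['text'] for c in chunks])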

What This Tells You About Your System

Run this on 100 representative queries and look at delta_s_mean (a sketch of such a batch run follows the list below):

  • ΔS < 0.30: Your retrieval is excellent, chunks are highly aligned
  • ΔS 0.30-0.45: Good retrieval, acceptable for production
  • ΔS 0.45-0.60: Marginal quality, investigate further
  • ΔS > 0.60: Your retrieval is broken. Fix this before tuning prompts.
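
A sketch of that batch run, reusing diagnose_and_filter from above; load_eval_queries is a placeholder for however you collect a representative query set:

queries = load_eval_queries()  # placeholder: ~100 representative user queries

mean_scores = []
for q in queries:
    chunks = retriever.search(q, k=20)
    result = diagnose_and_filter(q, chunks)
    mean_scores.append(result['stats']['delta_s_mean'])

print(f"System-level mean ΔS: {np.mean(mean_scores):.2f}")
print(f"Queries over the 0.60 line: {sum(s > 0.60 for s in mean_scores)}/{len(queries)}")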

Key Takeaways

  1. Cosine measures proximity, ΔS measures fitness - same math, but a ranking signal vs. a hard gate
  2. Hard thresholds prevent hallucinations - ΔS < 0.60 accept, ≥ 0.60 reject
  3. 5-line semantic filter - easy to add to any RAG stack
  4. Measurable acceptance criteria - ΔS ≤ 0.45 for production readiness
  5. Works with any embedding model - just normalize embeddings and compute 1 − cosine (see the sketch below)
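
On point 5: once vectors are unit-normalized, cosine is just a dot product, so ΔS is model-agnostic and fits in a few lines:

import numpy as np

def delta_s(query_vec: np.ndarray, chunk_vec: np.ndarray) -> float:
    """ΔS for vectors from any embedding model: normalize, then 1 minus the dot product."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vec / np.linalg.norm(chunk_vec)
    return float(1.0 - np.dot(q, c))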

The Bottom Line

If you can't measure it, you can't fix it.

Cosine similarity measures "are these topics related?" That's useful for ranking. But for RAG, you need to know "will this chunk lead to a correct answer?"

That's what semantic stress gives you.

Stop guessing why your RAG system hallucinates. Start measuring semantic fitness.


Want to go deeper? I've written a comprehensive 400+ page RAG debugging guide walking through this and other steps: https://mossforge.gumroad.com/l/rag-firewall

Questions? Drop them in the comments. I've debugged this across multiple production systems and I'm happy to help troubleshoot!
