Ryan Giggs
Understanding Semantic Search: Vector Embeddings and Similarity Search

Semantic search represents a fundamental shift in how we retrieve information from databases and search engines. Unlike traditional keyword-based search that relies on exact text matches, semantic search understands the meaning and context behind queries, enabling more intuitive and accurate information retrieval.

What is Semantic Search?

Semantic search is an advanced search technique that goes beyond keyword matching to understand the intent and contextual meaning behind a query. Instead of looking for exact word matches, it retrieves results based on semantic similarity—finding content that means the same thing, even when different words are used.

For example, searching for "healthy dinner ideas" could return results like "nutritious meal prep for busy nights" even though the exact keywords don't match. This is possible because semantic search operates on the underlying meaning of the content.

Understanding Vector Data Distribution

Vector embeddings, which power semantic search, have unique characteristics in how they're distributed in vector space:

Key Characteristics of Vector Data:

1. Uneven Distribution
Vector data points are typically not uniformly distributed across the vector space. Instead, they tend to cluster around regions of semantic similarity. This natural clustering reflects how related concepts group together in meaning.

2. Semantic Clustering
Vectors representing similar concepts naturally cluster together in vector space. For instance:

  • Words like "king," "queen," "prince," and "princess" form a cluster related to royalty
  • Technical terms like "algorithm," "function," and "code" cluster in programming-related regions
  • Synonyms and semantically related phrases are positioned close to each other

This clustering property is fundamental to how semantic search works—we can find related content by finding nearby vectors in this space.

How Similarity Search Works

At its core, semantic search relies on a mathematical concept called k-Nearest Neighbors (k-NN) search.

The k-NN Principle

When you perform a similarity search based on a query vector, you're essentially:

  1. Converting your query into a vector embedding
  2. Finding the k nearest vectors to your query vector in the vector space
  3. Retrieving the corresponding documents or data points

The result is an ordered list ranked by similarity, with the most semantically similar items appearing first.
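The three steps above can be sketched with plain NumPy — a minimal brute-force k-NN over a toy set of pre-computed vectors (the document vectors here are made up for illustration):

```python
import numpy as np

def knn_search(query: np.ndarray, vectors: np.ndarray, k: int = 3):
    """Return indices and scores of the k vectors most similar to the query (cosine)."""
    # Normalize so that the dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                    # one similarity score per stored vector
    top_k = np.argsort(-scores)[:k]   # highest scores first
    return top_k, scores[top_k]

# Toy "embeddings": 5 vectors in a 4-dimensional space
docs = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],   # close in direction to the query below
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.05, 0.0, 0.0])

indices, scores = knn_search(query, docs, k=2)
print(indices)  # the two most similar document indices, best first
```

With a real system, `docs` would hold embeddings produced by a model and `query` would be the embedded search string; the ranking logic is unchanged.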

Distance Metrics

The "closeness" or similarity between vectors is measured using distance metrics such as:

  • Cosine Similarity: Measures the angle between vectors (commonly used for text)
  • Euclidean Distance: Straight-line distance between points in vector space
  • Dot Product: Useful for normalized vectors
  • Manhattan Distance: Sum of absolute differences along each dimension
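Each of these metrics is a one-liner with NumPy. A quick sketch on two small vectors that point in the same direction but differ in length — note how cosine similarity ignores magnitude while the others do not:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the length

# Cosine similarity: angle only — identical direction gives 1.0
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: straight-line distance between the points
euclidean = np.linalg.norm(a - b)

# Dot product: combines angle and magnitude (equals cosine for unit vectors)
dot = np.dot(a, b)

# Manhattan distance: sum of absolute per-dimension differences
manhattan = np.sum(np.abs(a - b))

print(cosine, euclidean, dot, manhattan)
```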

Types of Similarity Search

Modern semantic search systems employ two main approaches, each with distinct trade-offs:

1. Exact Search (Exhaustive Search)

How It Works:
Compares the query vector against every single vector in the database to find the truly closest matches.

Characteristics:

  • Accuracy: 100% accurate—guarantees finding the actual nearest neighbors
  • Performance: Computational cost grows linearly with dataset size O(n)
  • Speed: Slow for large datasets—every query must scan the entire collection
  • Use Cases: Small datasets (typically < 10,000 documents) or when perfect accuracy is critical

When to Use Exact Search:

  • Datasets with fewer than 10,000 documents
  • When you need guaranteed accuracy
  • For low-dimensional vectors (fewer dimensions mean faster computation)
  • In scenarios where query filters significantly reduce the search space

2. Approximate Search (ANN - Approximate Nearest Neighbor)

How It Works:
Uses specialized algorithms and data structures (like HNSW, IVF, or LSH) to efficiently search through large datasets by narrowing down the search space through clever indexing.

Characteristics:

  • Accuracy: High accuracy (typically 90-99%) but not guaranteed perfect
  • Performance: Sub-linear query time (often close to O(log n))
  • Speed: Dramatically faster—queries over millions of vectors return in milliseconds rather than seconds or minutes
  • Use Cases: Large datasets (hundreds of thousands to billions of vectors)

Popular ANN Algorithms:

  • HNSW (Hierarchical Navigable Small World): Graph-based, extremely fast for queries
  • IVF (Inverted File Index): Cluster-based, good for very large datasets
  • LSH (Locality-Sensitive Hashing): Hash-based, excellent for high-dimensional data
  • Product Quantization: Compression-based, reduces memory footprint

When to Use Approximate Search:

  • Large datasets (> 10,000 documents)
  • When slight accuracy trade-offs are acceptable
  • High-dimensional vector spaces (100+ dimensions)
  • Real-time or latency-sensitive applications
  • When memory constraints are a concern

Comparing the Two Approaches

| Aspect | Exact Search | Approximate Search |
| --- | --- | --- |
| Accuracy | 100% | 90-99% (configurable) |
| Speed | Slow (linear) | Fast (sub-linear) |
| Scalability | Poor for large datasets | Excellent |
| Memory | Lower | Higher (needs indexes) |
| Best For | < 10K documents | > 10K documents |

Real-World Example:
Finding the most similar sentence pairs in a 10,000-sentence collection:

  • Comparing every pair directly with a full transformer model: roughly 65 hours of inference
  • Embedding-based approach: create all embeddings in ~5 seconds, then answer each similarity query in ~0.01 seconds

For most production applications with large datasets, the 90-99% accuracy of approximate search combined with massive speed improvements makes it the clear choice.

Vector Embedding Models

Vector embeddings are the foundation of semantic search. They're the "translation layer" that converts human-readable content into machine-understandable numerical representations.

What Are Embedding Models?

Embedding models are machine learning models—typically based on transformer architectures—that convert data into dense vector representations. These models have been trained on massive datasets to understand semantic relationships.

Key Capabilities:

1. Contextual Understanding
Embedding models assign meaning based on context. For example:

  • The word "bank" in "river bank" vs. "financial bank" gets different embeddings
  • Each pixel in an image is understood in relation to surrounding pixels
  • Words in a sentence are interpreted based on their position and neighbors

2. Feature Extraction
These models identify and quantify relevant features or dimensions:

  • In text: semantic meaning, sentiment, topic, grammatical role
  • In images: shapes, colors, textures, objects
  • In audio: pitch, rhythm, timbre, speech patterns

3. Transformer Architecture
Most modern embedding models use transformer architectures, which excel at:

  • Processing sequences (text, time-series data)
  • Capturing long-range dependencies
  • Parallel processing for efficiency
  • Attention mechanisms to focus on relevant parts of the input

Popular Embedding Models

For Text:

  • Sentence Transformers (e.g., all-MiniLM-L6-v2, all-mpnet-base-v2)
    • Optimized for sentence and paragraph embeddings
    • 384 to 768 dimensions
    • Open-source and widely used
  • BERT (Bidirectional Encoder Representations from Transformers)
    • General-purpose language understanding
    • 768 dimensions (base), 1024 dimensions (large)
    • Foundation for many specialized models
  • GPT Embeddings (OpenAI)
    • text-embedding-ada-002: 1536 dimensions
    • Excellent for semantic search and clustering
  • E5 Models (multilingual-e5-large)
    • Strong multilingual support
    • Great for cross-language semantic search

For Images:

  • CLIP (Contrastive Language-Image Pre-training)
    • Jointly embeds images and text in the same space
    • Enables text-to-image and image-to-image search
  • ResNet (Residual Networks)
    • Deep convolutional neural network for image features
    • Available in various depths (ResNet-50, ResNet-101)
  • ViT (Vision Transformer)
    • Transformer-based image understanding
    • State-of-the-art performance on many vision tasks

For Audio:

  • Wav2Vec 2.0: Speech and audio embeddings
  • VGGish: Audio event detection and classification
  • CLAP: Contrastive Language-Audio Pre-training

Model Selection Criteria

When choosing an embedding model, consider:

  1. Task Requirements: Text, image, audio, or multimodal?
  2. Performance vs. Speed: Larger models are more accurate but slower
  3. Dimension Count: Higher dimensions = more detail but more storage
  4. Domain Specificity: General-purpose vs. specialized (medical, legal, etc.)
  5. Language Support: Monolingual vs. multilingual
  6. Deployment Environment: Cloud API vs. local inference

Example Comparison:

| Model | Type | Dimensions | Use Case | Speed |
| --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | Text | 384 | Fast, lightweight semantic search | Very Fast |
| all-mpnet-base-v2 | Text | 768 | Higher-quality embeddings | Fast |
| text-embedding-ada-002 | Text | 1536 | Production-grade, API-based | API latency |
| CLIP ViT-B/32 | Image + Text | 512 | Multimodal search | Medium |

Types of Embedding Models

Organizations have several options for deploying embedding models, each with different trade-offs:

1. Pre-trained Open Source Models

Characteristics:

  • Ready to use without additional training
  • Trained on massive public datasets (Wikipedia, Common Crawl, etc.)
  • Free to download and deploy
  • Wide variety available on platforms like Hugging Face

Advantages:

  • Zero training cost and time
  • Proven performance on general tasks
  • Large community support
  • Regular updates and improvements

Limitations:

  • May not capture domain-specific nuances
  • Fixed to the knowledge in training data
  • Can't adapt to proprietary terminology

Popular Examples:

  • Sentence Transformers library (15,000+ models)
  • BERT and its variants (RoBERTa, DistilBERT, ALBERT)
  • Universal Sentence Encoder
  • OpenAI's embedding models (via API)

When to Use:

  • General semantic search applications
  • Quick prototyping and proof of concepts
  • When your domain is well-represented in public data
  • Resource-constrained environments

2. Custom Models Based on Your Own Dataset

Characteristics:

  • Fine-tuned or trained from scratch on your specific data
  • Captures domain-specific language, jargon, and relationships
  • Learns organizational or industry-specific context

Advantages:

  • Optimal performance for your specific use case
  • Understands proprietary terminology and concepts
  • Can adapt to unique data distributions
  • Competitive advantage through specialized understanding

Process:

  1. Start with a pre-trained model (transfer learning)
  2. Fine-tune on your labeled data (typically 1,000+ examples)
  3. Evaluate on your specific tasks
  4. Iterate and optimize

Use Cases:

  • Medical applications with specialized terminology
  • Legal document analysis
  • E-commerce with unique product catalogs
  • Scientific research in niche fields
  • Internal corporate knowledge bases

Considerations:

  • Requires labeled training data
  • Needs computational resources for training
  • Ongoing maintenance and retraining
  • Expertise in machine learning required

Example Scenarios:

  • A hospital training a model on medical records to improve clinical search
  • An e-commerce site fine-tuning on product descriptions and user behavior
  • A law firm training on case law and legal documents
  • A financial institution fine-tuning on market reports and regulations

3. Hybrid Approach

Many organizations use a combination:

  • Base layer: Start with a pre-trained general model
  • Specialization layer: Fine-tune on domain-specific data
  • Multiple models: Use different models for different types of content

Generating Vector Embeddings

Once you've selected an embedding model, you need to generate embeddings for your data. There are two main approaches:

1. Outside the Database

Generate embeddings externally using:

Third-Party APIs:

  • OpenAI Embeddings API: text-embedding-ada-002
  • Cohere Embed API: Multiple model sizes available
  • Google Vertex AI: Various embedding models
  • Hugging Face Inference API: Access to thousands of models

Local Inference:

  • Python Libraries: sentence-transformers, transformers
  • ONNX Runtime: Optimized inference with ONNX models
  • TensorFlow/PyTorch: Direct model inference
  • Dedicated embedding services: Self-hosted or cloud-based

Workflow:

  1. Process your data through the embedding service
  2. Receive vector embeddings
  3. Store vectors in your database alongside original data
  4. Index vectors for efficient search

Example (Python):

```python
from sentence_transformers import SentenceTransformer

# Load a pre-trained embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings: returns a NumPy array of shape (2, 384)
texts = ["Semantic search is powerful", "Machine learning enables AI"]
embeddings = model.encode(texts)

# Store vectors in your database alongside the original texts
# db.insert(texts, embeddings)
```

Advantages:

  • Flexibility in model choice
  • Can use specialized or proprietary models
  • Control over the embedding pipeline
  • Can batch process large datasets

Disadvantages:

  • Requires additional infrastructure
  • Data movement between systems
  • Potential latency for real-time embedding generation
  • Need to manage model updates separately

2. Within the Database (ONNX)

Generate embeddings internally using database-integrated models:

ONNX (Open Neural Network Exchange):
ONNX is an open format for representing machine learning models that enables models trained in one framework to be deployed in another. Many modern databases support loading ONNX models directly.

Supported Databases:

  • Oracle Database 23ai: Native ONNX support with VECTOR_EMBEDDING() function
  • PostgreSQL (with extensions): pgvector + ONNX Runtime
  • Microsoft SQL Server: ONNX model inference
  • SingleStore: Built-in embedding generation

Workflow:

  1. Export your embedding model to ONNX format
  2. Load the ONNX model into the database
  3. Use database functions to generate embeddings automatically
  4. Vectors are generated on-demand or during data insertion

Example (Oracle Database):

```sql
-- Load ONNX model into database
BEGIN
  DBMS_VECTOR.LOAD_ONNX_MODEL(
    directory => 'MODEL_DIR',
    file_name => 'all-MiniLM-L6-v2.onnx',
    model_name => 'text_embedding_model'
  );
END;
/

-- Generate embeddings automatically
INSERT INTO documents (id, text, embedding)
VALUES (
  1,
  'Semantic search enables better information retrieval',
  VECTOR_EMBEDDING(text_embedding_model USING
    'Semantic search enables better information retrieval' AS data)
);

-- Or update existing data
UPDATE documents
SET embedding = VECTOR_EMBEDDING(text_embedding_model USING text AS data);
```

Advantages:

  • No data movement—embeddings generated where data lives
  • Reduced latency for real-time applications
  • Simplified architecture (fewer components)
  • Automatic embedding refresh when data updates
  • Database security and governance apply to embeddings
  • Transactional consistency between data and embeddings

Disadvantages:

  • Limited to models compatible with ONNX format
  • Database computational overhead
  • May require additional database resources
  • Less flexibility in model selection
  • Dependent on database's ONNX implementation

Choosing the Right Approach

| Factor | External Generation | In-Database (ONNX) |
| --- | --- | --- |
| Model Flexibility | High | Medium |
| Latency | Higher (data transfer) | Lower |
| Architecture Complexity | Higher | Lower |
| Data Security | Requires data export | Data stays in DB |
| Scalability | Independent scaling | Limited by DB resources |
| Best For | Batch processing, custom models | Real-time apps, integrated systems |

Recommendation:

  • Use external generation for: Batch processing, custom models, flexibility
  • Use in-database ONNX for: Real-time applications, simplified architecture, security requirements

Practical Implementation Considerations

1. Dimensionality

Vector dimensions typically range from:

  • Small models: 128-384 dimensions (faster, less accurate)
  • Medium models: 512-768 dimensions (balanced)
  • Large models: 1024-1536+ dimensions (slower, more accurate)

Trade-off: More dimensions = better semantic capture but higher computational cost and storage requirements.
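The storage side of this trade-off is easy to estimate: at float32 precision, each dimension costs 4 bytes. A quick sketch (the dimension counts below are just the common sizes mentioned above):

```python
def embedding_storage_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw storage for the vectors alone — search indexes add overhead on top."""
    return num_vectors * dimensions * bytes_per_value / 1024 ** 3

# 1 million vectors at three common dimension counts
for dims in (384, 768, 1536):
    print(f"{dims} dims: {embedding_storage_gb(1_000_000, dims):.2f} GB")
```

Doubling the dimension count doubles the raw storage, which is one reason smaller models remain popular for large corpora.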

2. Normalization

Many embedding models produce normalized vectors (unit length), which:

  • Makes cosine similarity equivalent to dot product (faster computation)
  • Ensures consistent scale across different embeddings
  • Simplifies distance calculations
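A quick NumPy check of the first point — once vectors are scaled to unit length, cosine similarity and the plain dot product give the same number:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 1.0])

def normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

a_n, b_n = normalize(a), normalize(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_normalized = np.dot(a_n, b_n)

print(cosine, dot_of_normalized)  # identical values
```

This is why many vector databases normalize at insertion time: the cheaper dot product can then be used at query time with no loss of ranking quality.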

3. Vector Storage

Modern vector databases optimize storage through:

  • Quantization: Reducing precision (float32 → int8) to save memory
  • Compression: Using Product Quantization or similar techniques
  • Sharding: Distributing vectors across multiple nodes
  • Memory-mapping: Efficient disk-to-memory loading
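A minimal sketch of the first technique, scalar quantization from float32 to int8. Real systems typically use per-dimension or learned scale factors; a single global scale is used here purely for illustration:

```python
import numpy as np

def quantize_int8(vectors: np.ndarray):
    """Map float32 values into int8 using one global scale factor."""
    scale = float(np.abs(vectors).max()) / 127.0
    return np.round(vectors / scale).astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximately recover the original float values."""
    return q.astype(np.float32) * scale

vectors = np.random.default_rng(0).normal(size=(1000, 64)).astype(np.float32)
q, scale = quantize_int8(vectors)
restored = dequantize(q, scale)

print(f"memory: {vectors.nbytes} -> {q.nbytes} bytes")       # 4x smaller
print(f"max error: {np.abs(vectors - restored).max():.4f}")  # small reconstruction error
```

The 4x memory saving comes at the cost of a bounded rounding error per value, which is usually acceptable for similarity ranking.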

4. Index Updates

Consider how often your data changes:

  • Static data: Build index once, optimize for query speed
  • Frequently updated data: Use indexes that support incremental updates
  • Streaming data: Consider real-time embedding and indexing strategies

Real-World Applications

1. Document Search and Retrieval

Find relevant documents based on meaning rather than keywords. Users can search using natural language questions and receive semantically relevant results.

2. Recommendation Systems

Recommend products, content, or services based on similarity to user preferences. E-commerce sites use this to show "similar items" or "you might also like" suggestions.

3. Question Answering Systems

Build intelligent Q&A systems that match user questions to the most relevant answers in a knowledge base, even when phrased differently.

4. Content Moderation

Identify similar or duplicate content, detect variations of prohibited material, and flag potentially harmful content based on semantic similarity.

5. Image and Video Search

Enable search by visual similarity—find similar images, locate objects in video content, or search images using text descriptions (via multimodal models like CLIP).

6. Customer Support

Automatically route support tickets to appropriate teams, find similar past issues and their resolutions, and provide agents with relevant knowledge articles.

7. Fraud Detection

Identify unusual patterns by detecting transactions or behaviors that are semantically similar to known fraud cases.

8. Code Search

Find similar code snippets, detect duplicate or near-duplicate code, and search codebases using natural language descriptions of desired functionality.

Performance Optimization Tips

1. Choose the Right Balance

  • For small datasets (< 10K): Use exact search
  • For large datasets (> 100K): Use approximate search with high accuracy settings
  • For real-time apps: Optimize for speed with acceptable accuracy trade-offs

2. Tune Approximate Search Parameters

  • Accuracy vs. Speed: Adjust parameters like ef_search (HNSW) or nprobe (IVF)
  • Index Build Time: Balance initial index construction time with query performance
  • Memory Usage: Consider index size vs. available memory

3. Optimize Vector Dimensions

  • Use dimensionality reduction if needed (e.g., PCA; t-SNE is better suited to visualization than to indexing new queries)
  • Choose models with appropriate dimension counts for your use case
  • Consider quantization to reduce memory footprint
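PCA-style reduction can be sketched in a few lines of NumPy via SVD on centered data (in production you would more likely reach for a library implementation such as scikit-learn's PCA; the sizes below are illustrative):

```python
import numpy as np

def pca_reduce(vectors: np.ndarray, target_dims: int):
    """Project vectors onto their top principal components."""
    mean = vectors.mean(axis=0)
    centered = vectors - mean
    # SVD of the centered data; rows of vt are the principal directions
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:target_dims]
    return centered @ components.T, components, mean

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(500, 768)).astype(np.float32)

reduced, components, mean = pca_reduce(embeddings, target_dims=128)
print(embeddings.shape, "->", reduced.shape)
```

New query vectors are reduced with the same `components` and `mean`, so the projection must be stored alongside the index.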

4. Implement Caching

  • Cache frequently accessed embeddings
  • Pre-compute embeddings for static content
  • Use result caching for common queries

5. Batch Processing

  • Generate embeddings in batches for efficiency
  • Use batch search for multiple similar queries
  • Leverage GPU acceleration for large-scale embedding generation
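Batch search is largely a matter of replacing a loop of vector-vector products with one matrix multiplication. A sketch with NumPy, using made-up sizes (10,000 documents, 32 queries):

```python
import numpy as np

rng = np.random.default_rng(7)
docs = rng.normal(size=(10_000, 384)).astype(np.float32)
queries = rng.normal(size=(32, 384)).astype(np.float32)

# Normalize once so the dot product equals cosine similarity
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

# One matmul scores all 32 queries against all 10,000 docs at once
scores = queries @ docs.T                   # shape (32, 10000)
top_k = np.argsort(-scores, axis=1)[:, :5]  # top 5 doc indices per query

print(top_k.shape)
```

The same batching idea applies to embedding generation (`model.encode` accepts a list) and maps directly onto GPU acceleration, since both steps reduce to dense matrix operations.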

The Future of Semantic Search

Semantic search continues to evolve rapidly:

  • Multimodal Models: Combining text, image, audio, and video in unified search
  • Improved Efficiency: Faster algorithms and better hardware acceleration
  • Smaller Models: Distilled models with comparable performance but lower resource requirements
  • Context-Aware Search: Better understanding of user intent and query context
  • Domain-Specific Models: More specialized embeddings for vertical applications
  • Real-Time Learning: Systems that continuously improve from user interactions
  • Privacy-Preserving Search: Encrypted embeddings and secure similarity computation

Semantic search, powered by vector embeddings and similarity search algorithms, represents a fundamental advancement in information retrieval. By understanding meaning rather than matching keywords, it enables more intuitive and powerful search experiences across diverse applications.

Key takeaways:

  • Vector embeddings capture semantic meaning in numerical form
  • Vector data naturally clusters by semantic similarity
  • Choose between exact search (perfectly accurate but slow) and approximate search (slightly less accurate but fast) based on your needs
  • Transformer-based embedding models provide state-of-the-art semantic understanding
  • Models can be pre-trained, custom-trained, or fine-tuned for specific domains
  • Embeddings can be generated externally or within databases using ONNX

Whether you're building a search engine, recommendation system, or AI-powered application, understanding these concepts is crucial for leveraging the full power of modern semantic search technologies.
