
Metadata Filtering and Hybrid Search for Vector Databases
In the first tutorial, we built a vector database with ChromaDB and ran semantic similarity searches across 5,000 arXiv papers. We discovered that vector search excels at understanding meaning: a query about “neural network training” successfully retrieved papers about optimization algorithms, even when they didn’t use those exact words.
But here’s what we couldn’t do yet: What if we only want papers from the last two years? What if we need to search specifically within the Machine Learning category? What if someone searches for a rare technical term that vector search might miss?
This tutorial teaches you how to enhance vector search with two powerful capabilities: metadata filtering and hybrid search. By the end, you’ll understand how to combine semantic similarity with traditional filters, when keyword search adds value, and how to make intelligent trade-offs between different search strategies.
What You’ll Learn
By the end of this tutorial, you’ll be able to:
- Design metadata schemas that enable powerful filtering without performance pitfalls
- Implement filtered vector searches in ChromaDB using metadata constraints
- Measure and understand the performance overhead of different filter types
- Build BM25 keyword search alongside your vector search
- Combine vector similarity and keyword matching using weighted hybrid scoring
- Evaluate different search strategies systematically using category precision
- Make informed decisions about when metadata filtering and hybrid search add value
Most importantly, you’ll learn to be honest about what works and what doesn’t. Our experiments revealed some surprising results that challenge common assumptions about hybrid search.
Dataset and Environment Setup
We’ll use the same 5,000 arXiv papers we used previously. If you completed the first tutorial, you already have these files. If you’re starting fresh, download them now:
- arxiv_papers_5k.csv download (7.7 MB) → Paper metadata
- embeddings_cohere_5k.npy download (61.4 MB) → Pre-generated embeddings
The dataset contains 5,000 papers perfectly balanced across five categories:
- cs.CL (Computational Linguistics): 1,000 papers
- cs.CV (Computer Vision): 1,000 papers
- cs.DB (Databases): 1,000 papers
- cs.LG (Machine Learning): 1,000 papers
- cs.SE (Software Engineering): 1,000 papers
Environment Setup
You’ll need the same packages from previous tutorials, plus one new library for BM25:
# Create virtual environment (if starting fresh)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install required packages
# Python 3.12 with these versions:
# chromadb==1.3.4
# numpy==2.0.2
# pandas==2.2.2
# cohere==5.20.0
# python-dotenv==1.1.1
# rank-bm25==0.2.2 # NEW for keyword search
pip install chromadb numpy pandas cohere python-dotenv rank-bm25
Make sure your .env file contains your Cohere API key:
COHERE_API_KEY=your_key_here
Loading the Dataset
Let’s load our data and verify everything is in place:
import numpy as np
import pandas as pd
import chromadb
from cohere import ClientV2
from dotenv import load_dotenv
import os
# Load environment variables
load_dotenv()
cohere_api_key = os.getenv('COHERE_API_KEY')
if not cohere_api_key:
    raise ValueError("COHERE_API_KEY not found in .env file")
co = ClientV2(api_key=cohere_api_key)
# Load the dataset
df = pd.read_csv('arxiv_papers_5k.csv')
embeddings = np.load('embeddings_cohere_5k.npy')
print(f"Loaded {len(df)} papers")
print(f"Embeddings shape: {embeddings.shape}")
print(f"\nPapers per category:")
print(df['category'].value_counts().sort_index())
# Check what metadata we have
print(f"\nAvailable metadata columns:")
print(df.columns.tolist())
Loaded 5000 papers
Embeddings shape: (5000, 1536)
Papers per category:
category
cs.CL 1000
cs.CV 1000
cs.DB 1000
cs.LG 1000
cs.SE 1000
Name: count, dtype: int64
Available metadata columns:
['arxiv_id', 'title', 'abstract', 'authors', 'published', 'category']
We have rich metadata to work with: paper IDs, titles, abstracts, authors, publication dates, and categories. This metadata will power our filtering and help evaluate our search strategies.
Designing Metadata Schemas
Before we start filtering, we need to think carefully about what metadata to store and how to structure it. Good metadata design makes search powerful and performant. Poor design creates headaches.
What Makes Good Metadata
Good metadata is:
- Filterable: Choose values that match how users actually search. If users filter by publication year, store year as an integer. If they filter by topic, store normalized category strings.
- Atomic: Store individual fields separately rather than dumping everything into a single JSON blob. Want to filter by year? Don’t make ChromaDB parse “Published: 2024-03-15” from a text field.
- Indexed: Most vector databases index metadata fields differently than vector embeddings. Keep metadata fields small and specific so indexing works efficiently.
- Consistent: Use the same data types and formats across all documents. Don’t store year as “2024” for one paper and “March 2024” for another.
What Doesn’t Belong in Metadata
Avoid storing:
- Long text in metadata fields: The paper abstract is content, not metadata. Store it as the document text, not in a metadata field.
- Nested structures: ChromaDB supports nested metadata, but complex JSON trees are hard to filter and often signal confused schema design.
- Redundant information: If you can derive a field from another (like “decade” from “year”), consider computing it at query time instead of storing it (see the sketch after this list).
- Frequently changing values: Metadata updates can be expensive. Don’t store view counts or frequently updated statistics in metadata.
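To make the “redundant information” point concrete, here’s a minimal sketch of deriving a decade at query time instead of storing it. The where_2020s name is just illustrative; it uses the same ChromaDB where-clause operators we cover later in this tutorial:
# "Papers from the 2020s" expressed as a range filter on the year field,
# rather than a stored "decade" field (illustrative sketch).
where_2020s = {
    "$and": [
        {"year": {"$gte": 2020}},
        {"year": {"$lt": 2030}}
    ]
}
# Pass this as the where= argument to collection.query() once the
# collection below is populated; no extra "decade" field is needed.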
Preparing Our Metadata
Let’s prepare metadata for our 5,000 papers:
def prepare_metadata(df):
    """
    Prepare metadata for ChromaDB from our dataframe.
    Returns list of metadata dictionaries, one per paper.
    """
    metadatas = []
    for _, row in df.iterrows():
        # Extract year from published date (format: YYYY-MM-DD)
        year = int(str(row['published'])[:4])
        # Truncate authors if too long (ChromaDB has reasonable limits)
        authors = row['authors'] if len(row['authors']) <= 200 else row['authors'][:197] + "..."
        metadata = {
            'title': row['title'],
            'category': row['category'],
            'year': year,  # Store as integer for range queries
            'authors': authors
        }
        metadatas.append(metadata)
    return metadatas
# Prepare metadata for all papers
metadatas = prepare_metadata(df)
# Check a sample
print("Sample metadata:")
print(metadatas[0])
Sample metadata:
{'title': 'Optimizing Mixture of Block Attention', 'category': 'cs.LG', 'year': 2025, 'authors': 'Tao He, Liang Ding, Zhenya Huang, Dacheng Tao'}
Notice we’re storing:
- title: The full paper title for display in results
- category: One of our five CS categories for topic filtering
- year: Extracted as an integer for range queries like “papers after 2024”
- authors: Truncated to avoid extremely long strings
This metadata schema supports the filtering patterns users actually want: search within a category, filter by publication date, or display author information in results.
Anti-Patterns to Avoid
Let’s look at what NOT to do:
Bad: JSON blob as metadata
# DON'T DO THIS
metadata = {
    'info': json.dumps({
        'title': title,
        'category': category,
        'year': year,
        # ... everything dumped in JSON
    })
}
This makes filtering painful. You can’t efficiently filter by year when it’s buried in a JSON string.
Bad: Long text as metadata
# DON'T DO THIS
metadata = {
    'abstract': full_abstract_text,  # This belongs in documents, not metadata
    'category': category
}
ChromaDB stores abstracts as document content. Duplicating them in metadata wastes space and doesn’t improve search.
Bad: Inconsistent types
# DON'T DO THIS
metadata1 = {'year': 2024} # Integer
metadata2 = {'year': '2024'} # String
metadata3 = {'year': 'March 2024'} # Unparseable
Consistent data types make filtering reliable. Always store years as integers if you want range queries.
Bad: Missing or inconsistent metadata fields
# DON'T DO THIS
paper1_metadata = {'title': 'Paper 1', 'category': 'cs.LG', 'year': 2024}
paper2_metadata = {'title': 'Paper 2', 'category': 'cs.CV'} # Missing year!
paper3_metadata = {'title': 'Paper 3', 'year': 2023} # Missing category!
Here’s a common source of frustration: if a document is missing a metadata field, ChromaDB’s filters won’t match it at all. If you filter by {"year": {"$gte": 2024}} and some papers lack a year field, those papers simply won’t appear in results. This causes the confusing “where did my document go?” problem.
The fix: Make sure all documents have the same metadata fields. If a value is unknown, store it as None or use a sensible default rather than omitting the field entirely. Consistency prevents documents from mysteriously disappearing when you add filters.
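One lightweight way to enforce this is to normalize every metadata dictionary against a fixed set of fields before insertion. Here’s a sketch with hypothetical names (REQUIRED_DEFAULTS, normalize_metadata) and sentinel defaults you’d adjust to your own schema:
# Hypothetical helper: guarantee every document carries the same metadata fields.
REQUIRED_DEFAULTS = {
    'title': 'unknown',
    'category': 'unknown',
    'year': -1,          # sentinel meaning "year not known"
    'authors': 'unknown'
}

def normalize_metadata(metadata):
    """Return a copy of metadata with every required field present."""
    fixed = dict(REQUIRED_DEFAULTS)
    fixed.update({k: v for k, v in metadata.items() if v is not None})
    return fixed

# A paper missing its year still gets a year field (with the sentinel value)
print(normalize_metadata({'title': 'Paper 2', 'category': 'cs.CV'}))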
Creating a Collection with Rich Metadata
Now let’s create a ChromaDB collection with all our metadata. Because we may re-run this code while experimenting, we’ll use the same delete-and-recreate pattern we used previously:
# Initialize ChromaDB client
client = chromadb.Client()
# Delete existing collection if present (useful for experimentation)
try:
    client.delete_collection(name="arxiv_with_metadata")
    print("Deleted existing collection")
except Exception:
    pass  # Collection didn't exist, that's fine
# Create collection with metadata
collection = client.create_collection(
    name="arxiv_with_metadata",
    metadata={
        "description": "5000 arXiv papers with rich metadata for filtering",
        "hnsw:space": "cosine"  # Using cosine similarity
    }
)
print(f"Created collection: {collection.name}")
Created collection: arxiv_with_metadata
Now let’s insert our papers with metadata. Remember that ChromaDB has a batch size limit:
# Prepare data for insertion
ids = [f"paper_{i}" for i in range(len(df))]
documents = df['abstract'].tolist()
# Insert with metadata
# Our 5000 papers fit in one batch (limit is ~5,461)
print(f"Inserting {len(df)} papers with metadata...")
collection.add(
ids=ids,
embeddings=embeddings.tolist(),
documents=documents,
metadatas=metadatas
)
print(f"✓ Collection contains {collection.count()} papers with metadata")
Inserting 5000 papers with metadata...
✓ Collection contains 5000 papers with metadata
We now have a collection where every paper has both its embedding and rich metadata. This enables powerful combinations of semantic search and traditional filtering.
Metadata Filtering in Practice
Let’s start filtering our searches using metadata. ChromaDB uses a where clause syntax similar to database queries.
Basic Filtering by Category
Suppose we want to search only within Machine Learning papers:
# First, let's create a helper function for queries
def search_with_filter(query_text, where_clause=None, n_results=5):
    """
    Search with optional metadata filtering.
    Args:
        query_text: The search query
        where_clause: Optional ChromaDB where clause for filtering
        n_results: Number of results to return
    Returns:
        Search results
    """
    # Embed the query
    response = co.embed(
        texts=[query_text],
        model='embed-v4.0',
        input_type='search_query',
        embedding_types=['float']
    )
    query_embedding = np.array(response.embeddings.float_[0])
    # Search with optional filter
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=n_results,
        where=where_clause  # Apply metadata filter here
    )
    return results
# Example: Search for "deep learning optimization" only in ML papers
query = "deep learning optimization techniques"
results_filtered = search_with_filter(
query,
where_clause={"category": "cs.LG"} # Only Machine Learning papers
)
print(f"Query: '{query}'")
print("Filter: category = 'cs.LG'")
print("\nTop 5 results:")
for i in range(len(results_filtered['ids'][0])):
    metadata = results_filtered['metadatas'][0][i]
    distance = results_filtered['distances'][0][i]
    print(f"\n{i+1}. {metadata['title']}")
    print(f" Category: {metadata['category']} | Year: {metadata['year']}")
    print(f" Distance: {distance:.4f}")
Query: 'deep learning optimization techniques'
Filter: category = 'cs.LG'
Top 5 results:
1. Adam symmetry theorem: characterization of the convergence of the stochastic Adam optimizer
Category: cs.LG | Year: 2025
Distance: 0.6449
2. Non-Euclidean SGD for Structured Optimization: Unified Analysis and Improved Rates
Category: cs.LG | Year: 2025
Distance: 0.6571
3. Training Neural Networks at Any Scale
Category: cs.LG | Year: 2025
Distance: 0.6674
4. Deep Progressive Training: scaling up depth capacity of zero/one-layer models
Category: cs.LG | Year: 2025
Distance: 0.6682
5. DP-AdamW: Investigating Decoupled Weight Decay and Bias Correction in Private Deep Learning
Category: cs.LG | Year: 2025
Distance: 0.6732
All five results are from cs.LG, exactly as we requested. The filtering worked correctly. The distances are also tightly clustered between 0.64 and 0.67.
This close grouping tells us we found papers that all match our query equally well. The lower distances (compared to the 1.1+ ranges we saw previously) show that filtering down to a specific category helped us find stronger semantic matches.
Filtering by Year Range
What if we want papers from a specific time period?
# Search for papers from 2024 or later
results_recent = search_with_filter(
"neural network architectures",
where_clause={"year": {"$gte": 2024}} # Greater than or equal to 2024
)
print("Recent papers (2024+) about neural network architectures:")
for i in range(3):  # Show top 3
    metadata = results_recent['metadatas'][0][i]
    print(f"{i+1}. {metadata['title']} ({metadata['year']})")
Recent papers (2024+) about neural network architectures:
1. Bearing Syntactic Fruit with Stack-Augmented Neural Networks (2025)
2. Preparation of Fractal-Inspired Computational Architectures for Advanced Large Language Model Analysis (2025)
3. Preparation of Fractal-Inspired Computational Architectures for Advanced Large Language Model Analysis (2025)
Notice that results #2 and #3 are the same paper. This happens because some arXiv papers get cross-posted to multiple categories. A paper about neural architectures for language models might appear in both cs.LG and cs.CL, so when we filter only by year, it shows up once for each category assignment.
You could deduplicate results by tracking paper IDs and skipping ones you’ve already seen, but whether you should depends on your use case. Sometimes knowing a paper appears in multiple categories is actually valuable information. For this tutorial, we’re keeping duplicates as-is because they reflect how real databases behave and help us understand what filtering does and doesn’t handle. If you were building a paper recommendation system, you’d probably deduplicate. If you were analyzing category overlap patterns, you’d want to keep them.
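If you do decide to deduplicate, one simple approach is to over-fetch and then keep only the first occurrence of each title while preserving rank order. A minimal sketch using the search_with_filter helper from above (dedupe_results is an illustrative name, not a ChromaDB feature):
def dedupe_results(results, n_results=5):
    """Keep the best-ranked occurrence of each title from a ChromaDB result."""
    seen_titles = set()
    deduped = []
    for metadata, distance in zip(results['metadatas'][0], results['distances'][0]):
        if metadata['title'] in seen_titles:
            continue  # cross-posted duplicate we've already shown
        seen_titles.add(metadata['title'])
        deduped.append((metadata, distance))
        if len(deduped) == n_results:
            break
    return deduped

# Over-fetch so enough candidates survive deduplication
results_recent = search_with_filter(
    "neural network architectures",
    where_clause={"year": {"$gte": 2024}},
    n_results=10
)
for rank, (metadata, distance) in enumerate(dedupe_results(results_recent), 1):
    print(f"{rank}. {metadata['title']} ({metadata['year']})")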
Comparison Operators
ChromaDB supports several comparison operators for numeric fields:
- $eq: Equal to
- $ne: Not equal to
- $gt: Greater than
- $gte: Greater than or equal to
- $lt: Less than
- $lte: Less than or equal to
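As a quick sketch of two operators we haven’t used yet, both reusing the search_with_filter helper defined above:
# Papers published before 2024
older_results = search_with_filter(
    "neural network architectures",
    where_clause={"year": {"$lt": 2024}}
)

# Papers from any category except Software Engineering
non_se_results = search_with_filter(
    "automated testing tools",
    where_clause={"category": {"$ne": "cs.SE"}}
)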
Combined Filters
The real power comes from combining multiple filters:
# Find Computer Vision papers from 2025
results_combined = search_with_filter(
"image recognition and classification",
where_clause={
"$and": [
{"category": "cs.CV"},
{"year": {"$eq": 2025}}
]
}
)
print("Computer Vision papers from 2025 about image recognition:")
for i in range(3):
    metadata = results_combined['metadatas'][0][i]
    print(f"{i+1}. {metadata['title']}")
    print(f" {metadata['category']} | {metadata['year']}")
Computer Vision papers from 2025 about image recognition:
1. SWAN -- Enabling Fast and Mobile Histopathology Image Annotation through Swipeable Interfaces
cs.CV | 2025
2. Covariance Descriptors Meet General Vision Encoders: Riemannian Deep Learning for Medical Image Classification
cs.CV | 2025
3. UniADC: A Unified Framework for Anomaly Detection and Classification
cs.CV | 2025
ChromaDB also supports $or for alternatives:
# Papers from either Database or Software Engineering categories
where_db_or_se = {
    "$or": [
        {"category": "cs.DB"},
        {"category": "cs.SE"}
    ]
}
These filtering capabilities let you narrow searches to exactly the subset you need.
Measuring Filtering Performance Overhead
Metadata filtering isn’t free. Let’s measure the actual performance impact of different filter types. We’ll run multiple queries to get stable measurements:
import time
def benchmark_filter(where_clause, n_iterations=100, description=""):
    """
    Benchmark query performance with a specific filter.
    Args:
        where_clause: The filter to apply (None for unfiltered)
        n_iterations: Number of times to run the query
        description: Description of what we're testing
    Returns:
        Average query time in milliseconds
    """
    # Use a fixed query embedding to keep comparisons fair
    query_text = "machine learning model training"
    response = co.embed(
        texts=[query_text],
        model='embed-v4.0',
        input_type='search_query',
        embedding_types=['float']
    )
    query_embedding = np.array(response.embeddings.float_[0])
    # Warm up (run once to load any caches)
    collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=5,
        where=where_clause
    )
    # Benchmark
    start_time = time.time()
    for _ in range(n_iterations):
        collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=5,
            where=where_clause
        )
    elapsed = time.time() - start_time
    avg_ms = (elapsed / n_iterations) * 1000
    print(f"{description}")
    print(f" Average query time: {avg_ms:.2f} ms")
    return avg_ms
print("Running filtering performance benchmarks (100 iterations each)...")
print("=" * 70)
# Baseline: No filtering
baseline_ms = benchmark_filter(None, description="Baseline (no filter)")
print()
# Category filter
category_ms = benchmark_filter(
{"category": "cs.LG"},
description="Category filter (category = 'cs.LG')"
)
category_overhead = (category_ms / baseline_ms)
print(f" Overhead: {category_overhead:.1f}x slower ({(category_overhead-1)*100:.0f}%)")
print()
# Year range filter
year_ms = benchmark_filter(
{"year": {"$gte": 2024}},
description="Year range filter (year >= 2024)"
)
year_overhead = (year_ms / baseline_ms)
print(f" Overhead: {year_overhead:.1f}x slower ({(year_overhead-1)*100:.0f}%)")
print()
# Combined filter
combined_ms = benchmark_filter(
{"$and": [{"category": "cs.LG"}, {"year": {"$gte": 2024}}]},
description="Combined filter (category AND year)"
)
combined_overhead = (combined_ms / baseline_ms)
print(f" Overhead: {combined_overhead:.1f}x slower ({(combined_overhead-1)*100:.0f}%)")
print("\n" + "=" * 70)
print("Summary: Filtering adds 3-10x overhead depending on filter type")
Running filtering performance benchmarks (100 iterations each)...
======================================================================
Baseline (no filter)
Average query time: 4.45 ms
Category filter (category = 'cs.LG')
Average query time: 14.82 ms
Overhead: 3.3x slower (233%)
Year range filter (year >= 2024)
Average query time: 35.67 ms
Overhead: 8.0x slower (702%)
Combined filter (category AND year)
Average query time: 22.34 ms
Overhead: 5.0x slower (402%)
======================================================================
Summary: Filtering adds 3-10x overhead depending on filter type
What these numbers tell us:
- Unfiltered queries are fast: Our baseline of 4.45ms means ChromaDB’s HNSW index works well.
- Category filtering costs 3.3x overhead: The query still completes in 14.82ms, which is totally usable, but it’s noticeably slower than unfiltered search.
- Numeric range queries are most expensive: Year filtering at 8x overhead (35.67ms) shows that range queries on numeric fields are particularly costly in ChromaDB.
- Combined filters fall in between: At 5x overhead (22.34ms), combining filters doesn’t just multiply the costs. There’s some optimization happening.
- Real-world variability: If you run these benchmarks yourself, you’ll see the exact numbers vary between runs. We saw category filtering range from 13.8-16.1ms across multiple benchmark sessions. This variability is normal. What stays consistent is the order: year filters are always most expensive, then combined filters, then category filters.
Understanding the Performance Trade-off
This overhead is significant. A multi-fold slowdown matters when you’re processing hundreds of queries per second. But context is important:
When filtering makes sense despite overhead:
- Users explicitly request filters (“Show me recent papers”)
- The filtered results are substantially better than unfiltered
- Your query volume is manageable (even 35ms per query handles 28 queries/second)
- User experience benefits outweigh the performance cost
When to reconsider filtering:
- Very high query volume with tight latency requirements
- Filters don’t meaningfully improve results for most queries
- You need sub-10ms response times at scale
Important context: This overhead is how ChromaDB implements filtering at this scale. When we explore production vector databases in the next tutorial, you’ll see how systems like Qdrant handle filtering more efficiently. This isn’t a fundamental limitation of vector databases; it’s a characteristic of how different systems approach the problem.
For now, understand that metadata filtering in ChromaDB works and is usable, but it comes with measurable performance costs. Design your metadata schema carefully and filter only when the value justifies the overhead.
Implementing BM25 Keyword Search
Vector search excels at understanding semantic meaning, but it can struggle with rare keywords, specific technical terms, or exact name matches. BM25 keyword search complements vector search by ranking documents based on term frequency and document length.
Understanding BM25
BM25 (Best Matching 25) is a ranking function that scores documents based on:
- How many times query terms appear in the document (term frequency)
- How rare those terms are across all documents (inverse document frequency)
- Document length normalization (shorter documents aren’t penalized)
BM25 treats words as independent tokens. If you search for “SQL query optimization,” BM25 looks for documents containing those exact words, giving higher scores to documents where these terms appear frequently.
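For reference, the standard Okapi BM25 scoring function (the variant that rank_bm25’s BM25Okapi implements, with default parameters around k1 = 1.5 and b = 0.75) is:
$$\text{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q) \cdot \frac{f(q, D)\,(k_1 + 1)}{f(q, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$
Here f(q, D) is how often term q appears in document D, |D| is the document length in tokens, avgdl is the average document length across the corpus, and IDF(q) grows as the term becomes rarer in the collection.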
Building a BM25 Index
Let’s implement BM25 search on our arXiv abstracts:
from rank_bm25 import BM25Okapi
import string
def simple_tokenize(text):
    """
    Basic tokenization for BM25.
    Lowercase text, remove punctuation, split on whitespace.
    """
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.split()
# Tokenize all abstracts
print("Building BM25 index from 5000 abstracts...")
tokenized_corpus = [simple_tokenize(abstract) for abstract in df['abstract']]
# Create BM25 index
bm25 = BM25Okapi(tokenized_corpus)
print("✓ BM25 index created")
# Test it with a sample query
query = "SQL query optimization indexing"
tokenized_query = simple_tokenize(query)
# Get BM25 scores for all documents
bm25_scores = bm25.get_scores(tokenized_query)
# Find top 5 papers by BM25 score
top_indices = np.argsort(bm25_scores)[::-1][:5]
print(f"\nQuery: '{query}'")
print("Top 5 by BM25 keyword matching:")
for rank, idx in enumerate(top_indices, 1):
    score = bm25_scores[idx]
    title = df.iloc[idx]['title']
    category = df.iloc[idx]['category']
    print(f"{rank}. [{category}] {title[:60]}...")
    print(f" BM25 Score: {score:.2f}")
Building BM25 index from 5000 abstracts...
✓ BM25 index created
Query: 'SQL query optimization indexing'
Top 5 by BM25 keyword matching:
1. [cs.DB] Learned Adaptive Indexing...
BM25 Score: 13.34
2. [cs.DB] LLM4Hint: Leveraging Large Language Models for Hint Recommen...
BM25 Score: 13.25
3. [cs.LG] Cortex AISQL: A Production SQL Engine for Unstructured Data...
BM25 Score: 12.83
4. [cs.DB] Cortex AISQL: A Production SQL Engine for Unstructured Data...
BM25 Score: 12.83
5. [cs.DB] A Functional Data Model and Query Language is All You Need...
BM25 Score: 11.91
BM25 correctly identified Database papers about query optimization, with 4 out of 5 results from cs.DB. The third result is from Machine Learning but still relevant to SQL processing (Cortex AISQL), showing how keyword matching can surface related papers from adjacent domains. When the query contains specific technical terms, keyword matching works well.
A note about scale: The rank-bm25 library works great for our 5,000 abstracts and similar small datasets. It’s perfect for learning BM25 concepts without complexity. For larger datasets or production systems, you’d typically use faster BM25 implementations found in search engines like Elasticsearch, OpenSearch, or Apache Lucene. These are optimized for millions of documents and high query volumes. For now, rank-bm25 gives us everything we need to understand how keyword search complements vector search.
Comparing BM25 to Vector Search
Let’s run the same query through vector search:
# Vector search for the same query
results_vector = search_with_filter(query, n_results=5)
print(f"\nSame query: '{query}'")
print("Top 5 by vector similarity:")
for i in range(5):
    metadata = results_vector['metadatas'][0][i]
    distance = results_vector['distances'][0][i]
    print(f"{i+1}. [{metadata['category']}] {metadata['title'][:60]}...")
    print(f" Distance: {distance:.4f}")
Same query: 'SQL query optimization indexing'
Top 5 by vector similarity:
1. [cs.DB] VIDEX: A Disaggregated and Extensible Virtual Index for the ...
Distance: 0.5510
2. [cs.DB] AMAZe: A Multi-Agent Zero-shot Index Advisor for Relational ...
Distance: 0.5586
3. [cs.DB] AutoIndexer: A Reinforcement Learning-Enhanced Index Advisor...
Distance: 0.5602
4. [cs.DB] LLM4Hint: Leveraging Large Language Models for Hint Recommen...
Distance: 0.5837
5. [cs.DB] Training-Free Query Optimization via LLM-Based Plan Similari...
Distance: 0.5856
Interesting! While only one paper (LLM4Hint) appears in both top 5 lists, both approaches successfully identify relevant Database papers. The keywords “SQL” and “query” and “optimization” appear frequently in database papers, and the semantic meaning also points to that domain. The different rankings show how keyword matching and semantic search can prioritize different aspects of relevance, even when both correctly identify the target category.
This convergence of categories (both returning cs.DB papers) is common when queries contain domain-specific terminology that appears naturally in relevant documents.
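If you want to quantify that agreement, you can compare the sets of paper indices each approach returned. A small sketch reusing top_indices from the BM25 example and results_vector from above (note it treats cross-posted copies of a paper as distinct entries):
# Row indices of each approach's top 5
bm25_top = {int(idx) for idx in top_indices}
vector_top = {int(pid.split('_')[1]) for pid in results_vector['ids'][0]}

overlap = bm25_top & vector_top
print(f"Papers in both top-5 lists: {len(overlap)}")
print(f"Jaccard overlap: {len(overlap) / len(bm25_top | vector_top):.2f}")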
Hybrid Search: Combining Vector and Keyword Search
Hybrid search combines the strengths of both approaches: vector search for semantic understanding, keyword search for exact term matching. Let’s implement weighted hybrid scoring.
Our Implementation
Before we dive into the code, let’s be clear about what we’re building. This is a simplified implementation designed to teach you the core concepts of hybrid search: score normalization, weighted combination, and balancing semantic versus keyword signals.
Production vector databases often handle hybrid scoring internally or use more sophisticated approaches like rank-based fusion (combining rankings rather than scores) or learned rerankers (neural models that re-score results). We’ll explore these production systems in the next tutorial. For now, our implementation focuses on the fundamentals that apply across all hybrid approaches.
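To give you a flavor of rank-based fusion, here’s a minimal sketch of Reciprocal Rank Fusion (RRF), which combines rank positions instead of normalized scores. The k = 60 smoothing constant is the commonly used default; the paper IDs shown are purely illustrative:
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of IDs: each item scores sum(1 / (k + rank))."""
    fused = {}
    for ranked in ranked_lists:
        for rank, item_id in enumerate(ranked, start=1):
            fused[item_id] = fused.get(item_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

# Usage sketch: fuse a vector ranking with a BM25 ranking of paper IDs
vector_ranking = ["paper_12", "paper_7", "paper_99"]  # illustrative IDs
bm25_ranking = ["paper_7", "paper_45", "paper_12"]
print(reciprocal_rank_fusion([vector_ranking, bm25_ranking])[:3])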
The Challenge: Normalizing Different Score Scales
BM25 scores range from 0 to potentially 20+ (higher is better). ChromaDB distances range from 0 to 2+ (lower is better). We can’t just add them together. We need to:
- Normalize both score types to the same 0-1 range
- Convert ChromaDB distances to similarities (flip the scale)
- Apply weights to combine them
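To make the weighting concrete before we look at the full function, here’s a tiny back-of-the-envelope example with made-up numbers for a single paper:
# Made-up scores for one paper, just to illustrate the weighted combination
bm25_normalized = 0.90                  # BM25 score already scaled to 0-1
distance = 0.55                         # ChromaDB cosine distance
vector_similarity = 1 / (1 + distance)  # ~0.645, higher = more similar

alpha = 0.3  # 30% keyword, 70% semantic
hybrid = alpha * bm25_normalized + (1 - alpha) * vector_similarity
print(f"{hybrid:.3f}")  # ~0.722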
Implementation
Here’s our complete hybrid search function:
def hybrid_search(query_text, alpha=0.5, n_results=10):
    """
    Combine BM25 keyword search with vector similarity search.
    Args:
        query_text: The search query
        alpha: Weight for BM25 (0 = pure vector, 1 = pure keyword)
        n_results: Number of results to return
    Returns:
        Combined results with hybrid scores
    """
    # Get BM25 scores
    tokenized_query = simple_tokenize(query_text)
    bm25_scores = bm25.get_scores(tokenized_query)
    # Get vector similarities (we'll search more to ensure good coverage)
    response = co.embed(
        texts=[query_text],
        model='embed-v4.0',
        input_type='search_query',
        embedding_types=['float']
    )
    query_embedding = np.array(response.embeddings.float_[0])
    vector_results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=100  # Get more candidates for better coverage
    )
    # Extract vector distances and convert to similarities
    # ChromaDB returns cosine distance (0 to 2, lower = more similar)
    # We'll convert to similarity scores where higher = better for easier combination
    vector_distances = {}
    for i, paper_id in enumerate(vector_results['ids'][0]):
        distance = vector_results['distances'][0][i]
        # Convert distance to similarity (simple inversion)
        similarity = 1 / (1 + distance)
        vector_distances[paper_id] = similarity
    # Normalize BM25 scores to 0-1 range
    max_bm25 = max(bm25_scores) if max(bm25_scores) > 0 else 1
    min_bm25 = min(bm25_scores)
    bm25_normalized = {}
    for i, score in enumerate(bm25_scores):
        paper_id = f"paper_{i}"
        normalized = (score - min_bm25) / (max_bm25 - min_bm25) if max_bm25 > min_bm25 else 0
        bm25_normalized[paper_id] = normalized
    # Combine scores using weighted average
    # hybrid_score = alpha * bm25 + (1 - alpha) * vector
    hybrid_scores = {}
    all_paper_ids = set(bm25_normalized.keys()) | set(vector_distances.keys())
    for paper_id in all_paper_ids:
        bm25_score = bm25_normalized.get(paper_id, 0)
        vector_score = vector_distances.get(paper_id, 0)
        hybrid_score = alpha * bm25_score + (1 - alpha) * vector_score
        hybrid_scores[paper_id] = hybrid_score
    # Get top N by hybrid score
    top_paper_ids = sorted(hybrid_scores.items(), key=lambda x: x[1], reverse=True)[:n_results]
    # Format results
    results = []
    for paper_id, score in top_paper_ids:
        paper_idx = int(paper_id.split('_')[1])
        results.append({
            'paper_id': paper_id,
            'title': df.iloc[paper_idx]['title'],
            'category': df.iloc[paper_idx]['category'],
            'abstract': df.iloc[paper_idx]['abstract'][:200] + "...",
            'hybrid_score': score,
            'bm25_score': bm25_normalized.get(paper_id, 0),
            'vector_score': vector_distances.get(paper_id, 0)
        })
    return results
# Test with different alpha values
query = "neural network training optimization"
print(f"Query: '{query}'")
print("=" * 80)
# Pure vector (alpha = 0)
print("\nPure Vector Search (alpha=0.0):")
results = hybrid_search(query, alpha=0.0, n_results=5)
for i, r in enumerate(results, 1):
    print(f"{i}. [{r['category']}] {r['title'][:60]}...")
    print(f" Hybrid: {r['hybrid_score']:.3f} (Vector: {r['vector_score']:.3f}, BM25: {r['bm25_score']:.3f})")
# Hybrid 30% keyword, 70% vector
print("\nHybrid 30/70 (alpha=0.3):")
results = hybrid_search(query, alpha=0.3, n_results=5)
for i, r in enumerate(results, 1):
    print(f"{i}. [{r['category']}] {r['title'][:60]}...")
    print(f" Hybrid: {r['hybrid_score']:.3f} (Vector: {r['vector_score']:.3f}, BM25: {r['bm25_score']:.3f})")
# Hybrid 50/50
print("\nHybrid 50/50 (alpha=0.5):")
results = hybrid_search(query, alpha=0.5, n_results=5)
for i, r in enumerate(results, 1):
    print(f"{i}. [{r['category']}] {r['title'][:60]}...")
    print(f" Hybrid: {r['hybrid_score']:.3f} (Vector: {r['vector_score']:.3f}, BM25: {r['bm25_score']:.3f})")
# Pure keyword (alpha = 1.0)
print("\nPure BM25 Keyword (alpha=1.0):")
results = hybrid_search(query, alpha=1.0, n_results=5)
for i, r in enumerate(results, 1):
    print(f"{i}. [{r['category']}] {r['title'][:60]}...")
    print(f" Hybrid: {r['hybrid_score']:.3f} (Vector: {r['vector_score']:.3f}, BM25: {r['bm25_score']:.3f})")
Query: 'neural network training optimization'
================================================================================
Pure Vector Search (alpha=0.0):
1. [cs.LG] Training Neural Networks at Any Scale...
Hybrid: 0.642 (Vector: 0.642, BM25: 0.749)
2. [cs.LG] On the Convergence of Overparameterized Problems: Inherent P...
Hybrid: 0.630 (Vector: 0.630, BM25: 1.000)
3. [cs.LG] Adam symmetry theorem: characterization of the convergence o...
Hybrid: 0.617 (Vector: 0.617, BM25: 0.381)
4. [cs.LG] A Distributed Training Architecture For Combinatorial Optimi...
Hybrid: 0.617 (Vector: 0.617, BM25: 0.884)
5. [cs.LG] Can Training Dynamics of Scale-Invariant Neural Networks Be ...
Hybrid: 0.609 (Vector: 0.609, BM25: 0.566)
Hybrid 30/70 (alpha=0.3):
1. [cs.LG] On the Convergence of Overparameterized Problems: Inherent P...
Hybrid: 0.741 (Vector: 0.630, BM25: 1.000)
2. [cs.LG] Neuronal Fluctuations: Learning Rates vs Participating Neuro...
Hybrid: 0.714 (Vector: 0.603, BM25: 0.971)
3. [cs.CV] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
Hybrid: 0.709 (Vector: 0.601, BM25: 0.960)
4. [cs.LG] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
Hybrid: 0.708 (Vector: 0.601, BM25: 0.960)
5. [cs.LG] N-ReLU: Zero-Mean Stochastic Extension of ReLU...
Hybrid: 0.707 (Vector: 0.603, BM25: 0.948)
Hybrid 50/50 (alpha=0.5):
1. [cs.LG] On the Convergence of Overparameterized Problems: Inherent P...
Hybrid: 0.815 (Vector: 0.630, BM25: 1.000)
2. [cs.LG] Neuronal Fluctuations: Learning Rates vs Participating Neuro...
Hybrid: 0.787 (Vector: 0.603, BM25: 0.971)
3. [cs.CV] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
Hybrid: 0.780 (Vector: 0.601, BM25: 0.960)
4. [cs.LG] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
Hybrid: 0.780 (Vector: 0.601, BM25: 0.960)
5. [cs.LG] N-ReLU: Zero-Mean Stochastic Extension of ReLU...
Hybrid: 0.775 (Vector: 0.603, BM25: 0.948)
Pure BM25 Keyword (alpha=1.0):
1. [cs.LG] On the Convergence of Overparameterized Problems: Inherent P...
Hybrid: 1.000 (Vector: 0.630, BM25: 1.000)
2. [cs.LG] Neuronal Fluctuations: Learning Rates vs Participating Neuro...
Hybrid: 0.971 (Vector: 0.603, BM25: 0.971)
3. [cs.LG] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
Hybrid: 0.960 (Vector: 0.601, BM25: 0.960)
4. [cs.CV] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
Hybrid: 0.960 (Vector: 0.601, BM25: 0.960)
5. [cs.LG] N-ReLU: Zero-Mean Stochastic Extension of ReLU...
Hybrid: 0.948 (Vector: 0.603, BM25: 0.948)
The output shows how different alpha values affect which papers surface. With pure vector search (alpha=0), you’ll see papers that semantically relate to neural network training. As you increase alpha toward 1, you’ll increasingly weight papers that literally contain the words “neural,” “network,” “training,” and “optimization.”
Evaluating Search Strategies Systematically
We’ve implemented three search approaches: pure vector, pure keyword, and hybrid. But which one actually works better? We need systematic evaluation.
The Evaluation Metric: Category Precision
For our balanced 5k dataset, we can use category precision as our success metric:
Category precision @k: What percentage of the top k results are in the expected category?
If we search for “SQL query optimization,” we expect Database papers (cs.DB). If 4 out of 5 top results are from cs.DB, we have 80% precision@5.
This metric works because our dataset is perfectly balanced and we can predict which category should dominate for specific queries.
Creating Test Queries
Let’s create 10 diverse queries targeting different categories:
test_queries = [
    {
        "text": "natural language processing transformers",
        "expected_category": "cs.CL",
        "description": "NLP query"
    },
    {
        "text": "image segmentation computer vision",
        "expected_category": "cs.CV",
        "description": "Vision query"
    },
    {
        "text": "database query optimization indexing",
        "expected_category": "cs.DB",
        "description": "Database query"
    },
    {
        "text": "neural network training deep learning",
        "expected_category": "cs.LG",
        "description": "ML query with clear terms"
    },
    {
        "text": "software testing debugging quality assurance",
        "expected_category": "cs.SE",
        "description": "Software engineering query"
    },
    {
        "text": "attention mechanisms sequence models",
        "expected_category": "cs.CL",
        "description": "NLP architecture query"
    },
    {
        "text": "convolutional neural networks image recognition",
        "expected_category": "cs.CV",
        "description": "Vision with technical terms"
    },
    {
        "text": "distributed systems database consistency",
        "expected_category": "cs.DB",
        "description": "Database systems query"
    },
    {
        "text": "reinforcement learning policy gradient",
        "expected_category": "cs.LG",
        "description": "RL query"
    },
    {
        "text": "code review static analysis",
        "expected_category": "cs.SE",
        "description": "SE development query"
    }
]
print(f"Created {len(test_queries)} test queries")
print("Expected category distribution:")
categories = [q['expected_category'] for q in test_queries]
print(pd.Series(categories).value_counts().sort_index())
Created 10 test queries
Expected category distribution:
cs.CL 2
cs.CV 2
cs.DB 2
cs.LG 2
cs.SE 2
Name: count, dtype: int64
Our test set is balanced across categories, ensuring fair evaluation.
Running the Evaluation
Now let’s test pure vector, pure keyword, and hybrid approaches:
def calculate_category_precision(query_text, expected_category, search_type="vector", alpha=0.5):
    """
    Calculate what percentage of top 5 results match expected category.
    Args:
        query_text: The search query
        expected_category: Expected category (e.g., 'cs.LG')
        search_type: 'vector', 'bm25', or 'hybrid'
        alpha: Weight for BM25 if using hybrid
    Returns:
        Precision (0.0 to 1.0)
    """
    if search_type == "vector":
        results = search_with_filter(query_text, n_results=5)
        categories = [r['category'] for r in results['metadatas'][0]]
    elif search_type == "bm25":
        tokenized_query = simple_tokenize(query_text)
        bm25_scores = bm25.get_scores(tokenized_query)
        top_indices = np.argsort(bm25_scores)[::-1][:5]
        categories = [df.iloc[idx]['category'] for idx in top_indices]
    elif search_type == "hybrid":
        results = hybrid_search(query_text, alpha=alpha, n_results=5)
        categories = [r['category'] for r in results]
    # Calculate precision
    matches = sum(1 for cat in categories if cat == expected_category)
    precision = matches / len(categories)
    return precision, categories
# Evaluate all strategies
results_summary = {
    'Pure Vector': [],
    'Hybrid 30/70': [],
    'Hybrid 50/50': [],
    'Pure BM25': []
}
print("Evaluating search strategies on 10 test queries...")
print("=" * 80)
for query_info in test_queries:
    query = query_info['text']
    expected = query_info['expected_category']
    print(f"\nQuery: {query}")
    print(f"Expected: {expected}")
    # Pure vector
    precision, _ = calculate_category_precision(query, expected, "vector")
    results_summary['Pure Vector'].append(precision)
    print(f" Pure Vector: {precision*100:.0f}% precision")
    # Hybrid 30/70
    precision, _ = calculate_category_precision(query, expected, "hybrid", alpha=0.3)
    results_summary['Hybrid 30/70'].append(precision)
    print(f" Hybrid 30/70: {precision*100:.0f}% precision")
    # Hybrid 50/50
    precision, _ = calculate_category_precision(query, expected, "hybrid", alpha=0.5)
    results_summary['Hybrid 50/50'].append(precision)
    print(f" Hybrid 50/50: {precision*100:.0f}% precision")
    # Pure BM25
    precision, _ = calculate_category_precision(query, expected, "bm25")
    results_summary['Pure BM25'].append(precision)
    print(f" Pure BM25: {precision*100:.0f}% precision")
# Calculate average precision for each strategy
print("\n" + "=" * 80)
print("OVERALL RESULTS")
print("=" * 80)
for strategy, precisions in results_summary.items():
    avg_precision = sum(precisions) / len(precisions)
    print(f"{strategy}: {avg_precision*100:.0f}% average category precision")
Evaluating search strategies on 10 test queries...
================================================================================
Query: natural language processing transformers
Expected: cs.CL
Pure Vector: 80% precision
Hybrid 30/70: 60% precision
Hybrid 50/50: 60% precision
Pure BM25: 60% precision
Query: image segmentation computer vision
Expected: cs.CV
Pure Vector: 80% precision
Hybrid 30/70: 80% precision
Hybrid 50/50: 80% precision
Pure BM25: 80% precision
[... additional queries ...]
================================================================================
OVERALL RESULTS
================================================================================
Pure Vector: 84% average category precision
Hybrid 30/70: 78% average category precision
Hybrid 50/50: 78% average category precision
Pure BM25: 78% average category precision
Understanding What the Results Tell Us
These results deserve careful interpretation. Let’s be honest about what we discovered.
Finding 1: Pure Vector Performed Best on This Dataset
Pure vector search achieved 84% category precision compared to 78% for hybrid and 78% for BM25. This might surprise you if you’ve read guides claiming hybrid search always outperforms pure approaches.
Why pure vector dominated on academic abstracts:
Academic papers have rich vocabulary and technical terminology. ML papers naturally use words like “training,” “optimization,” “neural networks.” Database papers naturally use words like “query,” “index,” “transaction.” The semantic embeddings capture these domain-specific patterns well.
Adding BM25 keyword matching introduced false positives. Papers that coincidentally used similar words in different contexts got boosted incorrectly. For example, a database paper might mention “model training” when discussing query optimization models, causing it to rank high for “neural network training” queries even though it’s not about neural networks.
Finding 2: Hybrid Search Can Still Add Value
Just because pure vector won on this dataset doesn’t mean hybrid search is worthless. There are scenarios where keyword matching helps:
When hybrid might outperform pure vector:
- Searching structured data (product catalogs, API documentation)
- Queries with rare technical terms that might not embed well
- Domains where exact keyword presence is meaningful
- Documents with inconsistent writing quality where semantic meaning is unclear
On our academic abstracts: The rich vocabulary gave vector search everything it needed. Keyword matching added noise more than signal.
Finding 3: The Vocabulary Mismatch Problem
Some queries failed across ALL strategies. For example, we tested “reducing storage requirements for system event data” hoping to find a paper about log compression. None of the approaches found it. Why?
The query used “reducing storage requirements” but the paper said “compression” and “resource savings.” These are semantically equivalent, but the vocabulary differs. At 5k scale with multiple papers legitimately matching each query, vocabulary mismatches become visible.
This isn’t a failure of vector search or hybrid search. It’s the reality of semantic retrieval: users search with general terms, papers use technical jargon. Sometimes the gap is too wide.
Finding 4: Query Quality Matters More Than Strategy
Throughout our evaluation, we noticed that well-crafted queries with clear technical terms performed well across all strategies, while vague queries struggled everywhere.
A query like “neural network training optimization techniques” succeeded because it used the same language papers use. A query like “making models work better” failed because it’s too general and uses informal language.
The lesson: Before optimizing your search strategy, make sure your queries match how your documents are written. Understanding your corpus matters more than choosing between vector and keyword search.
Practical Guidance for Real Projects
Let’s consolidate what we’ve learned into actionable advice.
When to Use Metadata Filtering
Use filtering when:
- Users explicitly request filters (“show me papers from 2024”)
- Filtering meaningfully improves result quality
- Your query volume is manageable (ChromaDB can handle dozens of filtered queries per second)
- The performance cost is acceptable for your use case
Design your schema carefully:
- Store filterable fields as atomic values (integers for years, strings for categories)
- Avoid nested JSON blobs or long text in metadata
- Keep metadata consistent across documents
- Test filtering performance on your actual data before deploying
Accept the overhead:
- Filtered queries run slower than unfiltered ones in ChromaDB
- This is a characteristic of how ChromaDB approaches the problem
- Production databases handle filtering with different tradeoffs (we’ll see this in the next tutorial)
- Design for the database you’re actually using
When to Consider Hybrid Search
Try hybrid search when:
- Your documents have structured fields where exact matches matter
- Queries include rare technical terms that might not embed well
- Testing shows hybrid outperforms pure vector on your test queries
- You can afford the implementation and maintenance complexity
Stick with pure vector when:
- Your documents have rich natural language (like our academic abstracts)
- Vector search already achieves high precision on test queries
- Simplicity and maintainability matter
- Your embedding model captures domain terminology well
The decision framework:
- Build pure vector search first
- Create representative test queries
- Measure precision/recall on pure vector
- Only if results are inadequate, implement hybrid
- Compare hybrid against pure vector on same test queries
- Choose the approach with measurably better results
Don’t add complexity without evidence it helps.
Start Simple, Measure, Then Optimize
The pattern that emerged across our experiments:
- Start with pure vector search: It’s simpler to implement and maintain
- Build evaluation framework: Create test queries with expected results
- Measure performance: Calculate precision, recall, or domain-specific metrics
- Identify gaps: Where does pure vector fail?
- Add complexity thoughtfully: Try metadata filtering or hybrid search
- Re-evaluate: Does the added complexity improve results?
- Choose based on data: Not based on what tutorials claim always works
This approach keeps your system maintainable while ensuring each added feature provides real value.
Looking Ahead to Production Databases
Throughout this tutorial, we’ve explored filtering and hybrid search using ChromaDB. We’ve seen that:
- Filtering adds measurable overhead, but remains usable for moderate query volumes
- ChromaDB excels at local development and prototyping
- Production systems optimize these patterns differently
ChromaDB is designed to be lightweight, easy to use, and perfect for learning. We’ve used it to understand vector database concepts without worrying about infrastructure. The patterns we learned (metadata schema design, hybrid scoring, evaluation frameworks) transfer directly to production systems.
In the next tutorial, we’ll explore production vector databases:
- PostgreSQL with pgvector: See how vector search integrates with SQL and existing infrastructure
- Pinecone: Experience managed services with auto-scaling
- Qdrant: Explore Rust-backed performance and efficient filtering
You’ll discover how different systems approach filtering, when managed services make sense, and how to choose the right database for your needs. The core concepts remain the same, but production systems offer different tradeoffs in performance, features, and operational complexity.
But you needed to understand these concepts with an accessible tool first. ChromaDB gave us that foundation.
Practical Exercises
Before moving on, try these experiments to deepen your understanding:
Exercise 1: Explore Different Queries
Test pure vector vs hybrid search on queries from your own domain:
my_queries = [
    "your domain-specific query here",
    "another query relevant to your work",
    # Add more
]
for query in my_queries:
    print(f"\n{'='*70}")
    print(f"Query: {query}")
    # Try pure vector
    results_vector = search_with_filter(query, n_results=5)
    # Try hybrid
    results_hybrid = hybrid_search(query, alpha=0.5, n_results=5)
    # Compare the categories returned
    # Which approach surfaces more relevant papers?
Exercise 2: Tune Hybrid Alpha
Find the optimal alpha value for a specific query:
query = "your challenging query here"
for alpha in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]:
    results = hybrid_search(query, alpha=alpha, n_results=5)
    categories = [r['category'] for r in results]
    print(f"Alpha={alpha}: {categories}")
# Which alpha gives the best results for this query?
Exercise 3: Analyze Filter Combinations
Test different metadata filter combinations:
# Try various filter patterns
filters_to_test = [
    {"category": "cs.LG"},
    {"year": {"$gte": 2024}},
    {"$and": [{"category": "cs.LG"}, {"year": {"$eq": 2025}}]},
    {"$or": [{"category": "cs.LG"}, {"category": "cs.CV"}]}
]
query = "deep learning applications"
for where_clause in filters_to_test:
    results = search_with_filter(query, where_clause, n_results=5)
    categories = [r['category'] for r in results['metadatas'][0]]
    print(f"Filter {where_clause}: {categories}")
Exercise 4: Build Your Own Evaluation
Create test queries for a different domain:
# If you have expertise in a specific field,
# create queries where you KNOW which papers should match
domain_specific_queries = [
    {
        "text": "your expert query",
        "expected_category": "cs.XX",
        "notes": "why this query should return this category"
    },
    # Add more
]
# Run evaluation and see which strategy performs best
# on YOUR domain-specific queries
Summary: What You’ve Learned
We’ve covered a lot of ground in this tutorial. Here’s what you can now do:
Core Skills
Metadata Schema Design:
- Store filterable fields as atomic, consistent values
- Avoid anti-patterns like JSON blobs and long text in metadata
- Ensure all documents have the same metadata fields to prevent filtering issues
- Understand that good schema design enables powerful filtering
Metadata Filtering in ChromaDB:
- Implement category filters, numeric range filters, and combinations
- Measure the performance overhead of filtering
- Make informed decisions about when filtering justifies the cost
BM25 Keyword Search:
- Build BM25 indexes from document text
- Understand term frequency and inverse document frequency
- Recognize when keyword matching complements vector search
- Know the scale limitations of different BM25 implementations
Hybrid Search Implementation:
- Normalize different score scales (BM25 and vector similarity)
- Combine scores using weighted averages
- Test different alpha values to balance keyword vs semantic search
- Understand this is a teaching implementation of fundamental concepts
Systematic Evaluation:
- Create test queries with ground truth expectations
- Calculate precision metrics to compare strategies
- Make data-driven decisions rather than assuming one approach always wins
Key Insights
1. Pure vector search performed best on our academic abstracts (84% category precision vs 78% for hybrid/BM25). This challenges the assumption that hybrid always wins. The rich vocabulary in academic papers gave vector search everything it needed.
2. Filtering overhead is real but manageable for moderate query volumes. ChromaDB’s approach to filtering creates measurable costs that production databases handle differently.
3. Vocabulary mismatch is the biggest challenge. Users search with general terms (“reducing storage”), papers use jargon (“compression”). This gap affects all search strategies.
4. Query quality matters more than search strategy. Well-crafted queries using domain terminology succeed across approaches. Vague queries struggle everywhere.
5. Start simple, measure, then optimize. Build pure vector first, evaluate systematically, add complexity only when data shows it helps.
What’s Next
We now understand how to enhance vector search with metadata filtering and hybrid approaches. We’ve seen what works, what doesn’t, and how to measure the difference.
In the next tutorial, we’ll explore production vector databases:
- Set up PostgreSQL with pgvector and see how vector search integrates with SQL
- Create a Pinecone index and experience managed vector database services
- Run Qdrant locally and compare its filtering performance
- Learn decision frameworks for choosing the right database for your needs
You’ll get hands-on experience with multiple production systems and develop the judgment to choose appropriately for different scenarios.
Before moving on, make sure you understand:
- How to design metadata schemas that enable effective filtering
- The performance tradeoffs of metadata filtering
- When hybrid search adds value vs adding complexity
- How to evaluate search strategies systematically using precision metrics
- Why pure vector search can outperform hybrid on certain datasets
When you’re comfortable with these concepts, you’re ready to explore production vector databases and learn when to move beyond ChromaDB.
Key Takeaways:
- Metadata schema design matters: store filterable fields as atomic, consistent values and ensure all documents have the same fields
- Filtering adds overhead in ChromaDB (category cheapest, year range most expensive, combined in between)
- Pure vector achieved 84% category precision vs 78% for hybrid/BM25 on academic abstracts due to rich vocabulary
- Hybrid search has value in specific scenarios (structured data, rare keywords) but adds complexity
- Vocabulary mismatch between queries and documents affects all search strategies, not just one
- Start with pure vector search, measure systematically, add complexity only when data justifies it
- ChromaDB taught us filtering concepts; production databases optimize differently
- Evaluation frameworks with test queries matter more than assumptions about “best practices”