
Metadata Filtering and Hybrid Search for Vector Databases
In the first tutorial, we built a vector database with ChromaDB and ran semantic similarity searches across 5,000 arXiv papers. We discovered that vector search excels at understanding meaning: a query about “neural network training” successfully retrieved papers about optimization algorithms, even when they didn’t use those exact words.
But here’s what we couldn’t do yet: What if we only want papers from the last two years? What if we need to search specifically within the Machine Learning category? What if someone searches for a rare technical term that vector search might miss?
This tutorial teaches you how to enhance vector search with two powerful capabilities: metadata filtering and hybrid search. By the end, you’ll understand how to combine semantic similarity with traditional filters, when keyword search adds value, and how to make intelligent trade-offs between different search strategies.
What You’ll Learn
By the end of this tutorial, you’ll be able to:
- Design metadata schemas that enable powerful filtering without performance pitfalls
- Implement filtered vector searches in ChromaDB using metadata constraints
- Measure and understand the performance overhead of different filter types
- Build BM25 keyword search alongside your vector search
- Combine vector similarity and keyword matching using weighted hybrid scoring
- Evaluate different search strategies systematically using category precision
- Make informed decisions about when metadata filtering and hybrid search add value
Most importantly, you’ll learn to be honest about what works and what doesn’t. Our experiments revealed some surprising results that challenge common assumptions about hybrid search.
Dataset and Environment Setup
We’ll use the same 5,000 arXiv papers we used previously. If you completed the first tutorial, you already have these files. If you’re starting fresh, download them now:
- arxiv_papers_5k.csv download (7.7 MB) → Paper metadata
- embeddings_cohere_5k.npy download (61.4 MB) → Pre-generated embeddings
The dataset contains 5,000 papers perfectly balanced across five categories:
- cs.CL (Computational Linguistics): 1,000 papers
- cs.CV (Computer Vision): 1,000 papers
- cs.DB (Databases): 1,000 papers
- cs.LG (Machine Learning): 1,000 papers
- cs.SE (Software Engineering): 1,000 papers
Environment Setup
You’ll need the same packages from previous tutorials, plus one new library for BM25:
# Create virtual environment (if starting fresh)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install required packages
# Python 3.12 with these versions:
# chromadb==1.3.4
# numpy==2.0.2
# pandas==2.2.2
# cohere==5.20.0
# python-dotenv==1.1.1
# rank-bm25==0.2.2 # NEW for keyword search
pip install chromadb numpy pandas cohere python-dotenv rank-bm25
Make sure your .env file contains your Cohere API key:
COHERE_API_KEY=your_key_here
Loading the Dataset
Let’s load our data and verify everything is in place:
import numpy as np
import pandas as pd
import chromadb
from cohere import ClientV2
from dotenv import load_dotenv
import os
# Load environment variables
load_dotenv()
cohere_api_key = os.getenv('COHERE_API_KEY')
if not cohere_api_key:
    raise ValueError("COHERE_API_KEY not found in .env file")
co = ClientV2(api_key=cohere_api_key)
# Load the dataset
df = pd.read_csv('arxiv_papers_5k.csv')
embeddings = np.load('embeddings_cohere_5k.npy')
print(f"Loaded {len(df)} papers")
print(f"Embeddings shape: {embeddings.shape}")
print(f"\nPapers per category:")
print(df['category'].value_counts().sort_index())
# Check what metadata we have
print(f"\nAvailable metadata columns:")
print(df.columns.tolist())
Loaded 5000 papers
Embeddings shape: (5000, 1536)
Papers per category:
category
cs.CL 1000
cs.CV 1000
cs.DB 1000
cs.LG 1000
cs.SE 1000
Name: count, dtype: int64
Available metadata columns:
['arxiv_id', 'title', 'abstract', 'authors', 'published', 'category']
We have rich metadata to work with: paper IDs, titles, abstracts, authors, publication dates, and categories. This metadata will power our filtering and help evaluate our search strategies.
Designing Metadata Schemas
Before we start filtering, we need to think carefully about what metadata to store and how to structure it. Good metadata design makes search powerful and performant. Poor design creates headaches.
What Makes Good Metadata
Good metadata is:
- Filterable: Choose values that match how users actually search. If users filter by publication year, store year as an integer. If they filter by topic, store normalized category strings.
- Atomic: Store individual fields separately rather than dumping everything into a single JSON blob. Want to filter by year? Don’t make ChromaDB parse “Published: 2024-03-15” from a text field.
- Indexed: Most vector databases index metadata fields differently than vector embeddings. Keep metadata fields small and specific so indexing works efficiently.
- Consistent: Use the same data types and formats across all documents. Don’t store year as “2024” for one paper and “March 2024” for another.
What Doesn’t Belong in Metadata
Avoid storing:
- Long text in metadata fields: The paper abstract is content, not metadata. Store it as the document text, not in a metadata field.
- Nested structures: ChromaDB supports nested metadata, but complex JSON trees are hard to filter and often signal confused schema design.
- Redundant information: If you can derive a field from another (like “decade” from “year”), consider computing it at query time instead of storing it (see the sketch after this list).
- Frequently changing values: Metadata updates can be expensive. Don’t store view counts or frequently updated statistics in metadata.
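To make the “redundant information” point concrete, here’s a minimal sketch of deriving a decade at query time instead of storing it. The where_2020s name is just illustrative; it uses the same ChromaDB where-clause operators we cover later in this tutorial:
# "Papers from the 2020s" expressed as a range filter on the year field,
# rather than a stored "decade" field (illustrative sketch).
where_2020s = {
    "$and": [
        {"year": {"$gte": 2020}},
        {"year": {"$lt": 2030}}
    ]
}
# Pass this as the where= argument to collection.query() once the
# collection below is populated; no extra "decade" field is needed.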
Preparing Our Metadata
Let’s prepare metadata for our 5,000 papers:
def prepare_metadata(df):
    """
    Prepare metadata for ChromaDB from our dataframe.
    Returns list of metadata dictionaries, one per paper.
    """
    metadatas = []
    for _, row in df.iterrows():
        # Extract year from published date (format: YYYY-MM-DD)
        year = int(str(row['published'])[:4])
        # Truncate authors if too long (ChromaDB has reasonable limits)
        authors = row['authors'] if len(row['authors']) <= 200 else row['authors'][:197] + "..."
        metadata = {
            'title': row['title'],
            'category': row['category'],
            'year': year,  # Store as integer for range queries
            'authors': authors
        }
        metadatas.append(metadata)
    return metadatas
# Prepare metadata for all papers
metadatas = prepare_metadata(df)
# Check a sample
print("Sample metadata:")
print(metadatas[0])
Sample metadata:
{'title': 'Optimizing Mixture of Block Attention', 'category': 'cs.LG', 'year': 2025, 'authors': 'Tao He, Liang Ding, Zhenya Huang, Dacheng Tao'}
Notice we’re storing:
- title: The full paper title for display in results
- category: One of our five CS categories for topic filtering
- year: Extracted as an integer for range queries like “papers after 2024”
- authors: Truncated to avoid extremely long strings
This metadata schema supports the filtering patterns users actually want: search within a category, filter by publication date, or display author information in results.
Anti-Patterns to Avoid
Let’s look at what NOT to do:
Bad: JSON blob as metadata
# DON'T DO THIS
metadata = {
    'info': json.dumps({
        'title': title,
        'category': category,
        'year': year,
        # ... everything dumped in JSON
    })
}
This makes filtering painful. You can’t efficiently filter by year when it’s buried in a JSON string.
Bad: Long text as metadata
# DON'T DO THIS
metadata = {
    'abstract': full_abstract_text,  # This belongs in documents, not metadata
    'category': category
}
ChromaDB stores abstracts as document content. Duplicating them in metadata wastes space and doesn’t improve search.
Bad: Inconsistent types
# DON'T DO THIS
metadata1 = {'year': 2024} # Integer
metadata2 = {'year': '2024'} # String
metadata3 = {'year': 'March 2024'} # Unparseable
Consistent data types make filtering reliable. Always store years as integers if you want range queries.
Bad: Missing or inconsistent metadata fields
# DON'T DO THIS
paper1_metadata = {'title': 'Paper 1', 'category': 'cs.LG', 'year': 2024}
paper2_metadata = {'title': 'Paper 2', 'category': 'cs.CV'} # Missing year!
paper3_metadata = {'title': 'Paper 3', 'year': 2023} # Missing category!
Here’s a common source of frustration: if a document is missing a metadata field, ChromaDB’s filters won’t match it at all. If you filter by {"year": {"$gte": 2024}} and some papers lack a year field, those papers simply won’t appear in results. This causes the confusing “where did my document go?” problem.
The fix: Make sure all documents have the same metadata fields. If a value is unknown, store it as None or use a sensible default rather than omitting the field entirely. Consistency prevents documents from mysteriously disappearing when you add filters.
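One lightweight way to enforce this is to normalize every metadata dictionary against a fixed set of fields before insertion. Here’s a sketch with hypothetical names (REQUIRED_DEFAULTS, normalize_metadata) and sentinel defaults you’d adjust to your own schema:
# Hypothetical helper: guarantee every document carries the same metadata fields.
REQUIRED_DEFAULTS = {
    'title': 'unknown',
    'category': 'unknown',
    'year': -1,          # sentinel meaning "year not known"
    'authors': 'unknown'
}

def normalize_metadata(metadata):
    """Return a copy of metadata with every required field present."""
    fixed = dict(REQUIRED_DEFAULTS)
    fixed.update({k: v for k, v in metadata.items() if v is not None})
    return fixed

# A paper missing its year still gets a year field (with the sentinel value)
print(normalize_metadata({'title': 'Paper 2', 'category': 'cs.CV'}))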
Creating a Collection with Rich Metadata
Now let’s create a ChromaDB collection with all our metadata. Because we may re-run this code while experimenting, we’ll use the same delete-and-recreate pattern we used previously:
# Initialize ChromaDB client
client = chromadb.Client()
# Delete existing collection if present (useful for experimentation)
try:
    client.delete_collection(name="arxiv_with_metadata")
    print("Deleted existing collection")
except Exception:
    pass  # Collection didn't exist, that's fine
# Create collection with metadata
collection = client.create_collection(
    name="arxiv_with_metadata",
    metadata={
        "description": "5000 arXiv papers with rich metadata for filtering",
        "hnsw:space": "cosine"  # Using cosine similarity
    }
)
print(f"Created collection: {collection.name}")
Created collection: arxiv_with_metadata
Now let’s insert our papers with metadata. Remember that ChromaDB has a batch size limit:
# Prepare data for insertion
ids = [f"paper_{i}" for i in range(len(df))]
documents = df['abstract'].tolist()
# Insert with metadata
# Our 5000 papers fit in one batch (limit is ~5,461)
print(f"Inserting {len(df)} papers with metadata...")
collection.add(
ids=ids,
embeddings=embeddings.tolist(),
documents=documents,
metadatas=metadatas
)
print(f"✓ Collection contains {collection.count()} papers with metadata")
Inserting 5000 papers with metadata...
✓ Collection contains 5000 papers with metadata
We now have a collection where every paper has both its embedding and rich metadata. This enables powerful combinations of semantic search and traditional filtering.
Metadata Filtering in Practice
Let’s start filtering our searches using metadata. ChromaDB uses a where clause syntax similar to database queries.
Basic Filtering by Category
Suppose we want to search only within Machine Learning papers:
# First, let's create a helper function for queries
def search_with_filter(query_text, where_clause=None, n_results=5):
    """
    Search with optional metadata filtering.
    Args:
        query_text: The search query
        where_clause: Optional ChromaDB where clause for filtering
        n_results: Number of results to return
    Returns:
        Search results
    """
    # Embed the query
    response = co.embed(
        texts=[query_text],
        model='embed-v4.0',
        input_type='search_query',
        embedding_types=['float']
    )
    query_embedding = np.array(response.embeddings.float_[0])
    # Search with optional filter
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=n_results,
        where=where_clause  # Apply metadata filter here
    )
    return results
# Example: Search for "deep learning optimization" only in ML papers
query = "deep learning optimization techniques"
results_filtered = search_with_filter(
query,
where_clause={"category": "cs.LG"} # Only Machine Learning papers
)
print(f"Query: '{query}'")
print("Filter: category = 'cs.LG'")
print("\nTop 5 results:")
for i in range(len(results_filtered['ids'][0])):
    metadata = results_filtered['metadatas'][0][i]
    distance = results_filtered['distances'][0][i]
    print(f"\n{i+1}. {metadata['title']}")
    print(f" Category: {metadata['category']} | Year: {metadata['year']}")
    print(f" Distance: {distance:.4f}")
Query: 'deep learning optimization techniques'
Filter: category = 'cs.LG'
Top 5 results:
1. Adam symmetry theorem: characterization of the convergence of the stochastic Adam optimizer
Category: cs.LG | Year: 2025
Distance: 0.6449
2. Non-Euclidean SGD for Structured Optimization: Unified Analysis and Improved Rates
Category: cs.LG | Year: 2025
Distance: 0.6571
3. Training Neural Networks at Any Scale
Category: cs.LG | Year: 2025
Distance: 0.6674
4. Deep Progressive Training: scaling up depth capacity of zero/one-layer models
Category: cs.LG | Year: 2025
Distance: 0.6682
5. DP-AdamW: Investigating Decoupled Weight Decay and Bias Correction in Private Deep Learning
Category: cs.LG | Year: 2025
Distance: 0.6732
All five results are from cs.LG, exactly as we requested. The filtering worked correctly. The distances are also tightly clustered between 0.64 and 0.67.
This close grouping tells us we found papers that all match our query equally well. The lower distances (compared to the 1.1+ ranges we saw previously) show that filtering down to a specific category helped us find stronger semantic matches.
Filtering by Year Range
What if we want papers from a specific time period?
# Search for papers from 2024 or later
results_recent = search_with_filter(
"neural network architectures",
where_clause={"year": {"$gte": 2024}} # Greater than or equal to 2024
)
print("Recent papers (2024+) about neural network architectures:")
for i in range(3):  # Show top 3
    metadata = results_recent['metadatas'][0][i]
    print(f"{i+1}. {metadata['title']} ({metadata['year']})")
Recent papers (2024+) about neural network architectures:
1. Bearing Syntactic Fruit with Stack-Augmented Neural Networks (2025)
2. Preparation of Fractal-Inspired Computational Architectures for Advanced Large Language Model Analysis (2025)
3. Preparation of Fractal-Inspired Computational Architectures for Advanced Large Language Model Analysis (2025)
Notice that results #2 and #3 are the same paper. This happens because some arXiv papers get cross-posted to multiple categories. A paper about neural architectures for language models might appear in both cs.LG and cs.CL, so when we filter only by year, it shows up once for each category assignment.
You could deduplicate results by tracking paper IDs and skipping ones you’ve already seen, but whether you should depends on your use case. Sometimes knowing a paper appears in multiple categories is actually valuable information. For this tutorial, we’re keeping duplicates as-is because they reflect how real databases behave and help us understand what filtering does and doesn’t handle. If you were building a paper recommendation system, you’d probably deduplicate. If you were analyzing category overlap patterns, you’d want to keep them.
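If you do decide to deduplicate, one simple approach is to over-fetch and then keep only the first occurrence of each title while preserving rank order. A minimal sketch using the search_with_filter helper from above (dedupe_results is an illustrative name, not a ChromaDB feature):
def dedupe_results(results, n_results=5):
    """Keep the best-ranked occurrence of each title from a ChromaDB result."""
    seen_titles = set()
    deduped = []
    for metadata, distance in zip(results['metadatas'][0], results['distances'][0]):
        if metadata['title'] in seen_titles:
            continue  # cross-posted duplicate we've already shown
        seen_titles.add(metadata['title'])
        deduped.append((metadata, distance))
        if len(deduped) == n_results:
            break
    return deduped

# Over-fetch so enough candidates survive deduplication
results_recent = search_with_filter(
    "neural network architectures",
    where_clause={"year": {"$gte": 2024}},
    n_results=10
)
for rank, (metadata, distance) in enumerate(dedupe_results(results_recent), 1):
    print(f"{rank}. {metadata['title']} ({metadata['year']})")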
Comparison Operators
ChromaDB supports several comparison operators for numeric fields:
- $eq: Equal to
- $ne: Not equal to
- $gt: Greater than
- $gte: Greater than or equal to
- $lt: Less than
- $lte: Less than or equal to
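As a quick sketch of two operators we haven’t used yet, both reusing the search_with_filter helper defined above:
# Papers published before 2024
older_results = search_with_filter(
    "neural network architectures",
    where_clause={"year": {"$lt": 2024}}
)

# Papers from any category except Software Engineering
non_se_results = search_with_filter(
    "automated testing tools",
    where_clause={"category": {"$ne": "cs.SE"}}
)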
Combined Filters
The real power comes from combining multiple filters:
# Find Computer Vision papers from 2025
results_combined = search_with_filter(
"image recognition and classification",
where_clause={
"$and": [
{"category": "cs.CV"},
{"year": {"$eq": 2025}}
]
}
)
print("Computer Vision papers from 2025 about image recognition:")
for i in range(3):
    metadata = results_combined['metadatas'][0][i]
    print(f"{i+1}. {metadata['title']}")
    print(f" {metadata['category']} | {metadata['year']}")
Computer Vision papers from 2025 about image recognition:
1. SWAN -- Enabling Fast and Mobile Histopathology Image Annotation through Swipeable Interfaces
cs.CV | 2025
2. Covariance Descriptors Meet General Vision Encoders: Riemannian Deep Learning for Medical Image Classification
cs.CV | 2025
3. UniADC: A Unified Framework for Anomaly Detection and Classification
cs.CV | 2025
ChromaDB also supports $or for alternatives:
# Papers from either Database or Software Engineering categories
where_db_or_se = {
    "$or": [
        {"category": "cs.DB"},
        {"category": "cs.SE"}
    ]
}
These filtering capabilities let you narrow searches to exactly the subset you need.
Measuring Filtering Performance Overhead
Metadata filtering isn’t free. Let’s measure the actual performance impact of different filter types. We’ll run multiple queries to get stable measurements:
import time
def benchmark_filter(where_clause, n_iterations=100, description=""):
    """
    Benchmark query performance with a specific filter.
    Args:
        where_clause: The filter to apply (None for unfiltered)
        n_iterations: Number of times to run the query
        description: Description of what we're testing
    Returns:
        Average query time in milliseconds
    """
    # Use a fixed query embedding to keep comparisons fair
    query_text = "machine learning model training"
    response = co.embed(
        texts=[query_text],
        model='embed-v4.0',
        input_type='search_query',
        embedding_types=['float']
    )
    query_embedding = np.array(response.embeddings.float_[0])
    # Warm up (run once to load any caches)
    collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=5,
        where=where_clause
    )
    # Benchmark
    start_time = time.time()
    for _ in range(n_iterations):
        collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=5,
            where=where_clause
        )
    elapsed = time.time() - start_time
    avg_ms = (elapsed / n_iterations) * 1000
    print(f"{description}")
    print(f" Average query time: {avg_ms:.2f} ms")
    return avg_ms
print("Running filtering performance benchmarks (100 iterations each)...")
print("=" * 70)
# Baseline: No filtering
baseline_ms = benchmark_filter(None, description="Baseline (no filter)")
print()
# Category filter
category_ms = benchmark_filter(
{"category": "cs.LG"},
description="Category filter (category = 'cs.LG')"
)
category_overhead = (category_ms / baseline_ms)
print(f" Overhead: {category_overhead:.1f}x slower ({(category_overhead-1)*100:.0f}%)")
print()
# Year range filter
year_ms = benchmark_filter(
{"year": {"$gte": 2024}},
description="Year range filter (year >= 2024)"
)
year_overhead = (year_ms / baseline_ms)
print(f" Overhead: {year_overhead:.1f}x slower ({(year_overhead-1)*100:.0f}%)")
print()
# Combined filter
combined_ms = benchmark_filter(
{"$and": [{"category": "cs.LG"}, {"year": {"$gte": 2024}}]},
description="Combined filter (category AND year)"
)
combined_overhead = (combined_ms / baseline_ms)
print(f" Overhead: {combined_overhead:.1f}x slower ({(combined_overhead-1)*100:.0f}%)")
print("\n" + "=" * 70)
print("Summary: Filtering adds 3-10x overhead depending on filter type")
Running filtering performance benchmarks (100 iterations each)...
======================================================================
Baseline (no filter)
Average query time: 4.45 ms
Category filter (category = 'cs.LG')
Average query time: 14.82 ms
Overhead: 3.3x slower (233%)
Year range filter (year >= 2024)
Average query time: 35.67 ms
Overhead: 8.0x slower (702%)
Combined filter (category AND year)
Average query time: 22.34 ms
Overhead: 5.0x slower (402%)
======================================================================
Summary: Filtering adds 3-10x overhead depending on filter type
What these numbers tell us:
- Unfiltered queries are fast: Our baseline of 4.45ms means ChromaDB’s HNSW index works well.
- Category filtering costs 3.3x overhead: The query still completes in 14.82ms, which is totally usable, but it’s noticeably slower than unfiltered search.
- Numeric range queries are most expensive: Year filtering at 8x overhead (35.67ms) shows that range queries on numeric fields are particularly costly in ChromaDB.
- Combined filters fall in between: At 5x overhead (22.34ms), combining filters doesn’t just multiply the costs. There’s some optimization happening.
- Real-world variability: If you run these benchmarks yourself, you’ll see the exact numbers vary between runs. We saw category filtering range from 13.8-16.1ms across multiple benchmark sessions. This variability is normal. What stays consistent is the order: year filters are always most expensive, then combined filters, then category filters.
Understanding the Performance Trade-off
This overhead is significant. A multi-fold slowdown matters when you’re processing hundreds of queries per second. But context is important:
When filtering makes sense despite overhead:
- Users explicitly request filters (“Show me recent papers”)
- The filtered results are substantially better than unfiltered
- Your query volume is manageable (even 35ms per query handles 28 queries/second)
- User experience benefits outweigh the performance cost
When to reconsider filtering:
- Very high query volume with tight latency requirements
- Filters don’t meaningfully improve results for most queries
- You need sub-10ms response times at scale
Important context: This overhead is how ChromaDB implements filtering at this scale. When we explore production vector databases in the next tutorial, you’ll see how systems like Qdrant handle filtering more efficiently. This isn’t a fundamental limitation of vector databases; it’s a characteristic of how different systems approach the problem.
For now, understand that metadata filtering in ChromaDB works and is usable, but it comes with measurable performance costs. Design your metadata schema carefully and filter only when the value justifies the overhead.
Implementing BM25 Keyword Search
Vector search excels at understanding semantic meaning, but it can struggle with rare keywords, specific technical terms, or exact name matches. BM25 keyword search complements vector search by ranking documents based on term frequency and document length.
Understanding BM25
BM25 (Best Matching 25) is a ranking function that scores documents based on:
- How many times query terms appear in the document (term frequency)
- How rare those terms are across all documents (inverse document frequency)
- Document length normalization (shorter documents aren’t penalized)
BM25 treats words as independent tokens. If you search for “SQL query optimization,” BM25 looks for documents containing those exact words, giving higher scores to documents where these terms appear frequently.
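For reference, the standard Okapi BM25 scoring function (the variant that rank_bm25’s BM25Okapi implements, with default parameters around k1 = 1.5 and b = 0.75) is:
$$\text{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q) \cdot \frac{f(q, D)\,(k_1 + 1)}{f(q, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$
Here f(q, D) is how often term q appears in document D, |D| is the document length in tokens, avgdl is the average document length across the corpus, and IDF(q) grows as the term becomes rarer in the collection.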
Building a BM25 Index
Let’s implement BM25 search on our arXiv abstracts:
from rank_bm25 import BM25Okapi
import string
def simple_tokenize(text):
    """
    Basic tokenization for BM25.
    Lowercase text, remove punctuation, split on whitespace.
    """
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.split()
# Tokenize all abstracts
print("Building BM25 index from 5000 abstracts...")
tokenized_corpus = [simple_tokenize(abstract) for abstract in df['abstract']]
# Create BM25 index
bm25 = BM25Okapi(tokenized_corpus)
print("✓ BM25 index created")
# Test it with a sample query
query = "SQL query optimization indexing"
tokenized_query = simple_tokenize(query)
# Get BM25 scores for all documents
bm25_scores = bm25.get_scores(tokenized_query)
# Find top 5 papers by BM25 score
top_indices = np.argsort(bm25_scores)[::-1][:5]
print(f"\nQuery: '{query}'")
print("Top 5 by BM25 keyword matching:")
for rank, idx in enumerate(top_indices, 1):
    score = bm25_scores[idx]
    title = df.iloc[idx]['title']
    category = df.iloc[idx]['category']
    print(f"{rank}. [{category}] {title[:60]}...")
    print(f" BM25 Score: {score:.2f}")
Building BM25 index from 5000 abstracts...
✓ BM25 index created
Query: 'SQL query optimization indexing'
Top 5 by BM25 keyword matching:
1. [cs.DB] Learned Adaptive Indexing...
BM25 Score: 13.34
2. [cs.DB] LLM4Hint: Leveraging Large Language Models for Hint Recommen...
BM25 Score: 13.25
3. [cs.LG] Cortex AISQL: A Production SQL Engine for Unstructured Data...
BM25 Score: 12.83
4. [cs.DB] Cortex AISQL: A Production SQL Engine for Unstructured Data...
BM25 Score: 12.83
5. [cs.DB] A Functional Data Model and Query Language is All You Need...
BM25 Score: 11.91
BM25 correctly identified Database papers about query optimization, with 4 out of 5 results from cs.DB. The third result is from Machine Learning but still relevant to SQL processing (Cortex AISQL), showing how keyword matching can surface related papers from adjacent domains. When the query contains specific technical terms, keyword matching works well.
A note about scale: The rank-bm25 library works great for our 5,000 abstracts and similar small datasets. It’s perfect for learning BM25 concepts without complexity. For larger datasets or production systems, you’d typically use faster BM25 implementations found in search engines like Elasticsearch, OpenSearch, or Apache Lucene. These are optimized for millions of documents and high query volumes. For now, rank-bm25 gives us everything we need to understand how keyword search complements vector search.
Comparing BM25 to Vector Search
Let’s run the same query through vector search:
# Vector search for the same query
results_vector = search_with_filter(query, n_results=5)
print(f"\nSame query: '{query}'")
print("Top 5 by vector similarity:")
for i in range(5):
    metadata = results_vector['metadatas'][0][i]
    distance = results_vector['distances'][0][i]
    print(f"{i+1}. [{metadata['category']}] {metadata['title'][:60]}...")
    print(f" Distance: {distance:.4f}")
Same query: 'SQL query optimization indexing'
Top 5 by vector similarity:
1. [cs.DB] VIDEX: A Disaggregated and Extensible Virtual Index for the ...
Distance: 0.5510
2. [cs.DB] AMAZe: A Multi-Agent Zero-shot Index Advisor for Relational ...
Distance: 0.5586
3. [cs.DB] AutoIndexer: A Reinforcement Learning-Enhanced Index Advisor...
Distance: 0.5602
4. [cs.DB] LLM4Hint: Leveraging Large Language Models for Hint Recommen...
Distance: 0.5837
5. [cs.DB] Training-Free Query Optimization via LLM-Based Plan Similari...
Distance: 0.5856
Interesting! While only one paper (LLM4Hint) appears in both top 5 lists, both approaches successfully identify relevant Database papers. The keywords “SQL” and “query” and “optimization” appear frequently in database papers, and the semantic meaning also points to that domain. The different rankings show how keyword matching and semantic search can prioritize different aspects of relevance, even when both correctly identify the target category.
This convergence of categories (both returning cs.DB papers) is common when queries contain domain-specific terminology that appears naturally in relevant documents.
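If you want to quantify that agreement, you can compare the sets of paper indices each approach returned. A small sketch reusing top_indices from the BM25 example and results_vector from above (note it treats cross-posted copies of a paper as distinct entries):
# Row indices of each approach's top 5
bm25_top = {int(idx) for idx in top_indices}
vector_top = {int(pid.split('_')[1]) for pid in results_vector['ids'][0]}

overlap = bm25_top & vector_top
print(f"Papers in both top-5 lists: {len(overlap)}")
print(f"Jaccard overlap: {len(overlap) / len(bm25_top | vector_top):.2f}")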
Hybrid Search: Combining Vector and Keyword Search
Hybrid search combines the strengths of both approaches: vector search for semantic understanding, keyword search for exact term matching. Let’s implement weighted hybrid scoring.
Our Implementation
Before we dive into the code, let’s be clear about what we’re building. This is a simplified implementation designed to teach you the core concepts of hybrid search: score normalization, weighted combination, and balancing semantic versus keyword signals.
Production vector databases often handle hybrid scoring internally or use more sophisticated approaches like rank-based fusion (combining rankings rather than scores) or learned rerankers (neural models that re-score results). We’ll explore these production systems in the next tutorial. For now, our implementation focuses on the fundamentals that apply across all hybrid approaches.
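To give you a flavor of rank-based fusion, here’s a minimal sketch of Reciprocal Rank Fusion (RRF), which combines rank positions instead of normalized scores. The k = 60 smoothing constant is the commonly used default; the paper IDs shown are purely illustrative:
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of IDs: each item scores sum(1 / (k + rank))."""
    fused = {}
    for ranked in ranked_lists:
        for rank, item_id in enumerate(ranked, start=1):
            fused[item_id] = fused.get(item_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)

# Usage sketch: fuse a vector ranking with a BM25 ranking of paper IDs
vector_ranking = ["paper_12", "paper_7", "paper_99"]  # illustrative IDs
bm25_ranking = ["paper_7", "paper_45", "paper_12"]
print(reciprocal_rank_fusion([vector_ranking, bm25_ranking])[:3])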
The Challenge: Normalizing Different Score Scales
BM25 scores range from 0 to potentially 20+ (higher is better). ChromaDB distances range from 0 to 2+ (lower is better). We can’t just add them together. We need to:
- Normalize both score types to the same 0-1 range
- Convert ChromaDB distances to similarities (flip the scale)
- Apply weights to combine them
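To make the weighting concrete before we look at the full function, here’s a tiny back-of-the-envelope example with made-up numbers for a single paper:
# Made-up scores for one paper, just to illustrate the weighted combination
bm25_normalized = 0.90                  # BM25 score already scaled to 0-1
distance = 0.55                         # ChromaDB cosine distance
vector_similarity = 1 / (1 + distance)  # ~0.645, higher = more similar

alpha = 0.3  # 30% keyword, 70% semantic
hybrid = alpha * bm25_normalized + (1 - alpha) * vector_similarity
print(f"{hybrid:.3f}")  # ~0.722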
Implementation
Here’s our complete hybrid search function:
def hybrid_search(query_text, alpha=0.5, n_results=10):
    """
    Combine BM25 keyword search with vector similarity search.
    Args:
        query_text: The search query
        alpha: Weight for BM25 (0 = pure vector, 1 = pure keyword)
        n_results: Number of results to return
    Returns:
        Combined results with hybrid scores
    """
    # Get BM25 scores
    tokenized_query = simple_tokenize(query_text)
    bm25_scores = bm25.get_scores(tokenized_query)
    # Get vector similarities (we'll search more to ensure good coverage)
    response = co.embed(
        texts=[query_text],
        model='embed-v4.0',
        input_type='search_query',
        embedding_types=['float']
    )
    query_embedding = np.array(response.embeddings.float_[0])
    vector_results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=100  # Get more candidates for better coverage
    )
    # Extract vector distances and convert to similarities
    # ChromaDB returns cosine distance (0 to 2, lower = more similar)
    # We'll convert to similarity scores where higher = better for easier combination
    vector_distances = {}
    for i, paper_id in enumerate(vector_results['ids'][0]):
        distance = vector_results['distances'][0][i]
        # Convert distance to similarity (simple inversion)
        similarity = 1 / (1 + distance)
        vector_distances[paper_id] = similarity
    # Normalize BM25 scores to 0-1 range
    max_bm25 = max(bm25_scores) if max(bm25_scores) > 0 else 1
    min_bm25 = min(bm25_scores)
    bm25_normalized = {}
    for i, score in enumerate(bm25_scores):
        paper_id = f"paper_{i}"
        normalized = (score - min_bm25) / (max_bm25 - min_bm25) if max_bm25 > min_bm25 else 0
        bm25_normalized[paper_id] = normalized
    # Combine scores using weighted average
    # hybrid_score = alpha * bm25 + (1 - alpha) * vector
    hybrid_scores = {}
    all_paper_ids = set(bm25_normalized.keys()) | set(vector_distances.keys())
    for paper_id in all_paper_ids:
        bm25_score = bm25_normalized.get(paper_id, 0)
        vector_score = vector_distances.get(paper_id, 0)
        hybrid_score = alpha * bm25_score + (1 - alpha) * vector_score
        hybrid_scores[paper_id] = hybrid_score
    # Get top N by hybrid score
    top_paper_ids = sorted(hybrid_scores.items(), key=lambda x: x[1], reverse=True)[:n_results]
    # Format results
    results = []
    for paper_id, score in top_paper_ids:
        paper_idx = int(paper_id.split('_')[1])
        results.append({
            'paper_id': paper_id,
            'title': df.iloc[paper_idx]['title'],
            'category': df.iloc[paper_idx]['category'],
            'abstract': df.iloc[paper_idx]['abstract'][:200] + "...",
            'hybrid_score': score,
            'bm25_score': bm25_normalized.get(paper_id, 0),
            'vector_score': vector_distances.get(paper_id, 0)
        })
    return results
# Test with different alpha values
query = "neural network training optimization"
print(f"Query: '{query}'")
print("=" * 80)
# Pure vector (alpha = 0)
print("\nPure Vector Search (alpha=0.0):")
results = hybrid_search(query, alpha=0.0, n_results=5)
for i, r in enumerate(results, 1):
    print(f"{i}. [{r['category']}] {r['title'][:60]}...")
    print(f" Hybrid: {r['hybrid_score']:.3f} (Vector: {r['vector_score']:.3f}, BM25: {r['bm25_score']:.3f})")
# Hybrid 30% keyword, 70% vector
print("\nHybrid 30/70 (alpha=0.3):")
results = hybrid_search(query, alpha=0.3, n_results=5)
for i, r in enumerate(results, 1):
    print(f"{i}. [{r['category']}] {r['title'][:60]}...")
    print(f" Hybrid: {r['hybrid_score']:.3f} (Vector: {r['vector_score']:.3f}, BM25: {r['bm25_score']:.3f})")
# Hybrid 50/50
print("\nHybrid 50/50 (alpha=0.5):")
results = hybrid_search(query, alpha=0.5, n_results=5)
for i, r in enumerate(results, 1):
    print(f"{i}. [{r['category']}] {r['title'][:60]}...")
    print(f" Hybrid: {r['hybrid_score']:.3f} (Vector: {r['vector_score']:.3f}, BM25: {r['bm25_score']:.3f})")
# Pure keyword (alpha = 1.0)
print("\nPure BM25 Keyword (alpha=1.0):")
results = hybrid_search(query, alpha=1.0, n_results=5)
for i, r in enumerate(results, 1):
    print(f"{i}. [{r['category']}] {r['title'][:60]}...")
    print(f" Hybrid: {r['hybrid_score']:.3f} (Vector: {r['vector_score']:.3f}, BM25: {r['bm25_score']:.3f})")
Query: 'neural network training optimization'
================================================================================
Pure Vector Search (alpha=0.0):
1. [cs.LG] Training Neural Networks at Any Scale...
Hybrid: 0.642 (Vector: 0.642, BM25: 0.749)
2. [cs.LG] On the Convergence of Overparameterized Problems: Inherent P...
Hybrid: 0.630 (Vector: 0.630, BM25: 1.000)
3. [cs.LG] Adam symmetry theorem: characterization of the convergence o...
Hybrid: 0.617 (Vector: 0.617, BM25: 0.381)
4. [cs.LG] A Distributed Training Architecture For Combinatorial Optimi...
Hybrid: 0.617 (Vector: 0.617, BM25: 0.884)
5. [cs.LG] Can Training Dynamics of Scale-Invariant Neural Networks Be ...
Hybrid: 0.609 (Vector: 0.609, BM25: 0.566)
Hybrid 30/70 (alpha=0.3):
1. [cs.LG] On the Convergence of Overparameterized Problems: Inherent P...
Hybrid: 0.741 (Vector: 0.630, BM25: 1.000)
2. [cs.LG] Neuronal Fluctuations: Learning Rates vs Participating Neuro...
Hybrid: 0.714 (Vector: 0.603, BM25: 0.971)
3. [cs.CV] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
Hybrid: 0.709 (Vector: 0.601, BM25: 0.960)
4. [cs.LG] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
Hybrid: 0.708 (Vector: 0.601, BM25: 0.960)
5. [cs.LG] N-ReLU: Zero-Mean Stochastic Extension of ReLU...
Hybrid: 0.707 (Vector: 0.603, BM25: 0.948)
Hybrid 50/50 (alpha=0.5):
1. [cs.LG] On the Convergence of Overparameterized Problems: Inherent P...
Hybrid: 0.815 (Vector: 0.630, BM25: 1.000)
2. [cs.LG] Neuronal Fluctuations: Learning Rates vs Participating Neuro...
Hybrid: 0.787 (Vector: 0.603, BM25: 0.971)
3. [cs.CV] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
Hybrid: 0.780 (Vector: 0.601, BM25: 0.960)
4. [cs.LG] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
Hybrid: 0.780 (Vector: 0.601, BM25: 0.960)
5. [cs.LG] N-ReLU: Zero-Mean Stochastic Extension of ReLU...
Hybrid: 0.775 (Vector: 0.603, BM25: 0.948)
Pure BM25 Keyword (alpha=1.0):
1. [cs.LG] On the Convergence of Overparameterized Problems: Inherent P...
Hybrid: 1.000 (Vector: 0.630, BM25: 1.000)
2. [cs.LG] Neuronal Fluctuations: Learning Rates vs Participating Neuro...
Hybrid: 0.971 (Vector: 0.603, BM25: 0.971)
3. [cs.LG] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
Hybrid: 0.960 (Vector: 0.601, BM25: 0.960)
4. [cs.CV] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Nor...
Hybrid: 0.960 (Vector: 0.601, BM25: 0.960)
5. [cs.LG] N-ReLU: Zero-Mean Stochastic Extension of ReLU...
Hybrid: 0.948 (Vector: 0.603, BM25: 0.948)
The output shows how different alpha values affect which papers surface. With pure vector search (alpha=0), you’ll see papers that semantically relate to neural network training. As you increase alpha toward 1, you’ll increasingly weight papers that literally contain the words “neural,” “network,” “training,” and “optimization.”
Evaluating Search Strategies Systematically
We’ve implemented three search approaches: pure vector, pure keyword, and hybrid. But which one actually works better? We need systematic evaluation.
The Evaluation Metric: Category Precision
For our balanced 5k dataset, we can use category precision as our success metric:
Category precision @k: What percentage of the top k results are in the expected category?
If we search for “SQL query optimization,” we expect Database papers (cs.DB). If 4 out of 5 top results are from cs.DB, we have 80% precision@5.
This metric works because our dataset is perfectly balanced and we can predict which category should dominate for specific queries.
Creating Test Queries
Let’s create 10 diverse queries targeting different categories:
test_queries = [
    {
        "text": "natural language processing transformers",
        "expected_category": "cs.CL",
        "description": "NLP query"
    },
    {
        "text": "image segmentation computer vision",
        "expected_category": "cs.CV",
        "description": "Vision query"
    },
    {
        "text": "database query optimization indexing",
        "expected_category": "cs.DB",
        "description": "Database query"
    },
    {
        "text": "neural network training deep learning",
        "expected_category": "cs.LG",
        "description": "ML query with clear terms"
    },
    {
        "text": "software testing debugging quality assurance",
        "expected_category": "cs.SE",
        "description": "Software engineering query"
    },
    {
        "text": "attention mechanisms sequence models",
        "expected_category": "cs.CL",
        "description": "NLP architecture query"
    },
    {
        "text": "convolutional neural networks image recognition",
        "expected_category": "cs.CV",
        "description": "Vision with technical terms"
    },
    {
        "text": "distributed systems database consistency",
        "expected_category": "cs.DB",
        "description": "Database systems query"
    },
    {
        "text": "reinforcement learning policy gradient",
        "expected_category": "cs.LG",
        "description": "RL query"
    },
    {
        "text": "code review static analysis",
        "expected_category": "cs.SE",
        "description": "SE development query"
    }
]
print(f"Created {len(test_queries)} test queries")
print("Expected category distribution:")
categories = [q['expected_category'] for q in test_queries]
print(pd.Series(categories).value_counts().sort_index())
Created 10 test queries
Expected category distribution:
cs.CL 2
cs.CV 2
cs.DB 2
cs.LG 2
cs.SE 2
Name: count, dtype: int64
Our test set is balanced across categories, ensuring fair evaluation.
Running the Evaluation
Now let’s test pure vector, pure keyword, and hybrid approaches:
def calculate_category_precision(query_text, expected_category, search_type="vector", alpha=0.5):
    """
    Calculate what percentage of top 5 results match expected category.
    Args:
        query_text: The search query
        expected_category: Expected category (e.g., 'cs.LG')
        search_type: 'vector', 'bm25', or 'hybrid'
        alpha: Weight for BM25 if using hybrid
    Returns:
        Precision (0.0 to 1.0)
    """
    if search_type == "vector":
        results = search_with_filter(query_text, n_results=5)
        categories = [r['category'] for r in results['metadatas'][0]]
    elif search_type == "bm25":
        tokenized_query = simple_tokenize(query_text)
        bm25_scores = bm25.get_scores(tokenized_query)
        top_indices = np.argsort(bm25_scores)[::-1][:5]
        categories = [df.iloc[idx]['category'] for idx in top_indices]
    elif search_type == "hybrid":
        results = hybrid_search(query_text, alpha=alpha, n_results=5)
        categories = [r['category'] for r in results]
    # Calculate precision
    matches = sum(1 for cat in categories if cat == expected_category)
    precision = matches / len(categories)
    return precision, categories
# Evaluate all strategies
results_summary = {
    'Pure Vector': [],
    'Hybrid 30/70': [],
    'Hybrid 50/50': [],
    'Pure BM25': []
}
print("Evaluating search strategies on 10 test queries...")
print("=" * 80)
for query_info in test_queries:
    query = query_info['text']
    expected = query_info['expected_category']
    print(f"\nQuery: {query}")
    print(f"Expected: {expected}")
    # Pure vector
    precision, _ = calculate_category_precision(query, expected, "vector")
    results_summary['Pure Vector'].append(precision)
    print(f" Pure Vector: {precision*100:.0f}% precision")
    # Hybrid 30/70
    precision, _ = calculate_category_precision(query, expected, "hybrid", alpha=0.3)
    results_summary['Hybrid 30/70'].append(precision)
    print(f" Hybrid 30/70: {precision*100:.0f}% precision")
    # Hybrid 50/50
    precision, _ = calculate_category_precision(query, expected, "hybrid", alpha=0.5)
    results_summary['Hybrid 50/50'].append(precision)
    print(f" Hybrid 50/50: {precision*100:.0f}% precision")
    # Pure BM25
    precision, _ = calculate_category_precision(query, expected, "bm25")
    results_summary['Pure BM25'].append(precision)
    print(f" Pure BM25: {precision*100:.0f}% precision")
# Calculate average precision for each strategy
print("\n" + "=" * 80)
print("OVERALL RESULTS")
print("=" * 80)
for strategy, precisions in results_summary.items():
    avg_precision = sum(precisions) / len(precisions)
    print(f"{strategy}: {avg_precision*100:.0f}% average category precision")
Evaluating search strategies on 10 test queries...
================================================================================
Query: natural language processing transformers
Expected: cs.CL
Pure Vector: 80% precision
Hybrid 30/70: 60% precision
Hybrid 50/50: 60% precision
Pure BM25: 60% precision
Query: image segmentation computer vision
Expected: cs.CV
Pure Vector: 80% precision
Hybrid 30/70: 80% precision
Hybrid 50/50: 80% precision
Pure BM25: 80% precision
[... additional queries ...]
================================================================================
OVERALL RESULTS
================================================================================
Pure Vector: 84% average category precision
Hybrid 30/70: 78% average category precision
Hybrid 50/50: 78% average category precision
Pure BM25: 78% average category precision
Understanding What the Results Tell Us
These results deserve careful interpretation. Let’s be honest about what we discovered.
Finding 1: Pure Vector Performed Best on This Dataset
Pure vector search achieved 84% category precision compared to 78% for hybrid and 78% for BM25. This might surprise you if you’ve read guides claiming hybrid search always outperforms pure approaches.
Why pure vector dominated on academic abstracts:
Academic papers have rich vocabulary and technical terminology. ML papers naturally use words like “training,” “optimization,” “neural networks.” Database papers naturally use words like “query,” “index,” “transaction.” The semantic embeddings capture these domain-specific patterns well.
Adding BM25 keyword matching introduced false positives. Papers that coincidentally used similar words in different contexts got boosted incorrectly. For example, a database paper might mention “model training” when discussing query optimization models, causing it to rank high for “neural network training” queries even though it’s not about neural networks.
Finding 2: Hybrid Search Can Still Add Value
Just because pure vector won on this dataset doesn’t mean hybrid search is worthless. There are scenarios where keyword matching helps:
When hybrid might outperform pure vector:
- Searching structured data (product catalogs, API documentation)
- Queries with rare technical terms that might not embed well
- Domains where exact keyword presence is meaningful
- Documents with inconsistent writing quality where semantic meaning is unclear
On our academic abstracts: The rich vocabulary gave vector search everything it needed. Keyword matching added noise more than signal.
Finding 3: The Vocabulary Mismatch Problem
Some queries failed across ALL strategies. For example, we tested “reducing storage requirements for system event data” hoping to find a paper about log compression. None of the approaches found it. Why?
The query used “reducing storage requirements” but the paper said “compression” and “resource savings.” These are semantically equivalent, but the vocabulary differs. At 5k scale with multiple papers legitimately matching each query, vocabulary mismatches become visible.
This isn’t a failure of vector search or hybrid search. It’s the reality of semantic retrieval: users search with general terms, papers use technical jargon. Sometimes the gap is too wide.
Finding 4: Query Quality Matters More Than Strategy
Throughout our evaluation, we noticed that well-crafted queries with clear technical terms performed well across all strategies, while vague queries struggled everywhere.
A query like “neural network training optimization techniques” succeeded because it used the same language papers use. A query like “making models work better” failed because it’s too general and uses informal language.
The lesson: Before optimizing your search strategy, make sure your queries match how your documents are written. Understanding your corpus matters more than choosing between vector and keyword search.
Practical Guidance for Real Projects
Let’s consolidate what we’ve learned into actionable advice.
When to Use Metadata Filtering
Use filtering when:
- Users explicitly request filters (“show me papers from 2024”)
- Filtering meaningfully improves result quality
- Your query volume is manageable (ChromaDB can handle dozens of filtered queries per second)
- The performance cost is acceptable for your use case
Design your schema carefully:
- Store filterable fields as atomic values (integers for years, strings for categories)
- Avoid nested JSON blobs or long text in metadata
- Keep metadata consistent across documents
- Test filtering performance on your actual data before deploying
Accept the overhead:
- Filtered queries run slower than unfiltered ones in ChromaDB
- This is a characteristic of how ChromaDB approaches the problem
- Production databases handle filtering with different tradeoffs (we’ll see this in the next tutorial)
- Design for the database you’re actually using
When to Consider Hybrid Search
Try hybrid search when:
- Your documents have structured fields where exact matches matter
- Queries include rare technical terms that might not embed well
- Testing shows hybrid outperforms pure vector on your test queries
- You can afford the implementation and maintenance complexity
Stick with pure vector when:
- Your documents have rich natural language (like our academic abstracts)
- Vector search already achieves high precision on test queries
- Simplicity and maintainability matter
- Your embedding model captures domain terminology well
The decision framework:
- Build pure vector search first
- Create representative test queries
- Measure precision/recall on pure vector
- Only if results are inadequate, implement hybrid
- Compare hybrid against pure vector on same test queries
- Choose the approach with measurably better results
Don’t add complexity without evidence it helps.
Start Simple, Measure, Then Optimize
The pattern that emerged across our experiments:
- Start with pure vector search: It’s simpler to implement and maintain
- Build evaluation framework: Create test queries with expected results
- Measure performance: Calculate precision, recall, or domain-specific metrics
- Identify gaps: Where does pure vector fail?
- Add complexity thoughtfully: Try metadata filtering or hybrid search
- Re-evaluate: Does the added complexity improve results?
- Choose based on data: Not based on what tutorials claim always works
This approach keeps your system maintainable while ensuring each added feature provides real value.
Looking Ahead to Production Databases
Throughout this tutorial, we’ve explored filtering and hybrid search using ChromaDB. We’ve seen that:
- Filtering adds measurable overhead, but remains usable for moderate query volumes
- ChromaDB excels at local development and prototyping
- Production systems optimize these patterns differently
ChromaDB is designed to be lightweight, easy to use, and perfect for learning. We’ve used it to understand vector database concepts without worrying about infrastructure. The patterns we learned (metadata schema design, hybrid scoring, evaluation frameworks) transfer directly to production systems.
In the next tutorial, we’ll explore production vector databases:
- PostgreSQL with pgvector: See how vector search integrates with SQL and existing infrastructure
- Pinecone: Experience managed services with auto-scaling
- Qdrant: Explore Rust-backed performance and efficient filtering
You’ll discover how different systems approach filtering, when managed services make sense, and how to choose the right database for your needs. The core concepts remain the same, but production systems offer different tradeoffs in performance, features, and operational complexity.
But you needed to understand these concepts with an accessible tool first. ChromaDB gave us that foundation.
Practical Exercises
Before moving on, try these experiments to deepen your understanding:
Exercise 1: Explore Different Queries
Test pure vector vs hybrid search on queries from your own domain:
my_queries = [
    "your domain-specific query here",
    "another query relevant to your work",
    # Add more
]
for query in my_queries:
    print(f"\n{'='*70}")
    print(f"Query: {query}")
    # Try pure vector
    results_vector = search_with_filter(query, n_results=5)
    # Try hybrid
    results_hybrid = hybrid_search(query, alpha=0.5, n_results=5)
    # Compare the categories returned
    # Which approach surfaces more relevant papers?
Exercise 2: Tune Hybrid Alpha
Find the optimal alpha value for a specific query:
query = "your challenging query here"
for alpha in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]:
    results = hybrid_search(query, alpha=alpha, n_results=5)
    categories = [r['category'] for r in results]
    print(f"Alpha={alpha}: {categories}")
# Which alpha gives the best results for this query?
Exercise 3: Analyze Filter Combinations
Test different metadata filter combinations:
# Try various filter patterns
filters_to_test = [
    {"category": "cs.LG"},
    {"year": {"$gte": 2024}},
    {"$and": [{"category": "cs.LG"}, {"year": {"$eq": 2025}}]},
    {"$or": [{"category": "cs.LG"}, {"category": "cs.CV"}]}
]
query = "deep learning applications"
for where_clause in filters_to_test:
    results = search_with_filter(query, where_clause, n_results=5)
    categories = [r['category'] for r in results['metadatas'][0]]
    print(f"Filter {where_clause}: {categories}")
Exercise 4: Build Your Own Evaluation
Create test queries for a different domain:
# If you have expertise in a specific field,
# create queries where you KNOW which papers should match
domain_specific_queries = [
    {
        "text": "your expert query",
        "expected_category": "cs.XX",
        "notes": "why this query should return this category"
    },
    # Add more
]
# Run evaluation and see which strategy performs best
# on YOUR domain-specific queries
Summary: What You’ve Learned
We’ve covered a lot of ground in this tutorial. Here’s what you can now do:
Core Skills
Metadata Schema Design:
- Store filterable fields as atomic, consistent values
- Avoid anti-patterns like JSON blobs and long text in metadata
- Ensure all documents have the same metadata fields to prevent filtering issues
- Understand that good schema design enables powerful filtering
Metadata Filtering in ChromaDB:
- Implement category filters, numeric range filters, and combinations
- Measure the performance overhead of filtering
- Make informed decisions about when filtering justifies the cost
BM25 Keyword Search:
- Build BM25 indexes from document text
- Understand term frequency and inverse document frequency
- Recognize when keyword matching complements vector search
- Know the scale limitations of different BM25 implementations
Hybrid Search Implementation:
- Normalize different score scales (BM25 and vector similarity)
- Combine scores using weighted averages
- Test different alpha values to balance keyword vs semantic search
- Understand this is a teaching implementation of fundamental concepts
Systematic Evaluation:
- Create test queries with ground truth expectations
- Calculate precision metrics to compare strategies
- Make data-driven decisions rather than assuming one approach always wins
Key Insights
1. Pure vector search performed best on our academic abstracts (84% category precision vs 78% for hybrid/BM25). This challenges the assumption that hybrid always wins. The rich vocabulary in academic papers gave vector search everything it needed.
2. Filtering overhead is real but manageable for moderate query volumes. ChromaDB’s approach to filtering creates measurable costs that production databases handle differently.
3. Vocabulary mismatch is the biggest challenge. Users search with general terms (“reducing storage”), papers use jargon (“compression”). This gap affects all search strategies.
4. Query quality matters more than search strategy. Well-crafted queries using domain terminology succeed across approaches. Vague queries struggle everywhere.
5. Start simple, measure, then optimize. Build pure vector first, evaluate systematically, add complexity only when data shows it helps.
What’s Next
We now understand how to enhance vector search with metadata filtering and hybrid approaches. We’ve seen what works, what doesn’t, and how to measure the difference.
In the next tutorial, we’ll explore production vector databases:
- Set up PostgreSQL with pgvector and see how vector search integrates with SQL
- Create a Pinecone index and experience managed vector database services
- Run Qdrant locally and compare its filtering performance
- Learn decision frameworks for choosing the right database for your needs
You’ll get hands-on experience with multiple production systems and develop the judgment to choose appropriately for different scenarios.
Before moving on, make sure you understand:
- How to design metadata schemas that enable effective filtering
- The performance tradeoffs of metadata filtering
- When hybrid search adds value vs adding complexity
- How to evaluate search strategies systematically using precision metrics
- Why pure vector search can outperform hybrid on certain datasets
When you’re comfortable with these concepts, you’re ready to explore production vector databases and learn when to move beyond ChromaDB.
Key Takeaways:
- Metadata schema design matters: store filterable fields as atomic, consistent values and ensure all documents have the same fields
- Filtering adds overhead in ChromaDB (category cheapest, year range most expensive, combined in between)
- Pure vector achieved 84% category precision vs 78% for hybrid/BM25 on academic abstracts due to rich vocabulary
- Hybrid search has value in specific scenarios (structured data, rare keywords) but adds complexity
- Vocabulary mismatch between queries and documents affects all search strategies, not just one
- Start with pure vector search, measure systematically, add complexity only when data justifies it
- ChromaDB taught us filtering concepts; production databases optimize differently
- Evaluation frameworks with test queries matter more than assumptions about “best practices”