Text Analyzers
The vectlite.analyzers module provides a configurable text processing pipeline for generating sparse term vectors. This is useful for fine-tuning BM25 keyword search behavior.
info
Analyzers are currently available in the Python binding only.
Basic Usage
from vectlite.analyzers import Analyzer
analyzer = Analyzer().lowercase().stopwords("en").stemmer("english")
terms = analyzer.sparse_terms("How to authenticate users with SSO")
# {'authent': 0.333, 'user': 0.333, 'sso': 0.333}
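To make the pipeline concrete, here is a minimal plain-Python sketch of what the call above produces: lowercase, drop stopwords, stem, then assign each surviving term a uniform weight normalized to sum to 1. The stopword set and suffix rule here are illustrative stand-ins, not the vectlite implementation.

```python
# Illustrative stand-ins for the built-in "en" stopword list and the
# Snowball stemmer (this is a sketch, not the vectlite internals).
STOPWORDS = {"how", "to", "with"}

def sketch_sparse_terms(text):
    tokens = [t.lower() for t in text.split()]
    tokens = [t for t in tokens if t not in STOPWORDS]
    # crude stand-in for Snowball stemming
    tokens = [t[:-5] if t.endswith("icate") else t for t in tokens]
    weight = 1.0 / len(tokens)
    return {t: round(weight, 3) for t in tokens}

print(sketch_sparse_terms("How to authenticate users with SSO"))
# -> {'authent': 0.333, 'users': 0.333, 'sso': 0.333}
```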
Pipeline Steps
The analyzer applies steps in the order they are added:
Tokenizer
Replace the default tokenizer, which splits text into alphanumeric words:
analyzer = Analyzer().tokenizer(lambda text: text.split("-"))
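A tokenizer is simply a callable from text to a list of tokens. The regex version below approximates the default alphanumeric word splitting described above; it is a sketch, not vectlite's actual implementation.

```python
import re

def alnum_tokenizer(text):
    # Keep runs of letters and digits; drop punctuation and whitespace.
    return re.findall(r"[A-Za-z0-9]+", text)

print(alnum_tokenizer("SSO-based auth, v2"))
# -> ['SSO', 'based', 'auth', 'v2']
```

Any callable with this shape can be passed to `Analyzer().tokenizer(...)`.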
Lowercase
Convert all tokens to lowercase:
analyzer = Analyzer().lowercase()
Stopwords
Remove common words. Built-in stopword lists are available for English and French:
analyzer = Analyzer().stopwords("en") # English stopwords
analyzer = Analyzer().stopwords("fr") # French stopwords
analyzer = Analyzer().stopwords({"my", "custom", "words"}) # Custom set
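What the stopwords step does, sketched in plain Python with a custom set (the built-in "en" and "fr" lists ship with vectlite; this tiny set is only for illustration):

```python
CUSTOM_STOPWORDS = {"my", "custom", "words"}

def drop_stopwords(tokens, stopwords):
    # Keep only tokens that are not in the stopword set.
    return [t for t in tokens if t not in stopwords]

print(drop_stopwords(["my", "search", "words"], CUSTOM_STOPWORDS))
# -> ['search']
```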
Stemmer
Reduce words to their root form using Snowball stemming. Requires the PyStemmer package:
pip install PyStemmer
analyzer = Analyzer().stemmer("english")
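A toy illustration of what stemming achieves: mapping inflected forms of a word to a shared root, so "authenticate" and "authentication" match the same term. Real stemming uses the Snowball algorithm via PyStemmer; the suffix table below is only a sketch.

```python
# Longest suffixes first, so "ication" is tried before "icate".
SUFFIXES = ("ication", "icate", "ing", "ed", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Require a reasonably long remaining root before stripping.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(toy_stem("authenticate"), toy_stem("authentication"))
# -> authent authent
```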
N-grams
Generate character n-grams from tokens:
analyzer = Analyzer().ngrams(3)
# "hello" -> ["hel", "ell", "llo"]
Custom Filters
Add any function that transforms a token list:
def remove_short(tokens):
    return [t for t in tokens if len(t) > 2]
analyzer = Analyzer().filter(remove_short)
Weighted Fields
Generate sparse vectors from multiple text fields with different weights:
analyzer = Analyzer().lowercase().stopwords("en")
terms = analyzer.sparse_terms_weighted(
fields={"title": "Auth Setup Guide", "body": "How to configure SSO for your organization"},
weights={"title": 2.0, "body": 1.0},
)
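One way per-field weighting can combine into a single sparse vector, sketched in plain Python: each field's term weights are scaled by its field weight, summed across fields, then renormalized. This illustrates the idea, not vectlite's exact formula.

```python
def merge_weighted(field_terms, weights):
    merged = {}
    for field, terms in field_terms.items():
        for term, w in terms.items():
            # Scale each term weight by its field's weight and accumulate.
            merged[term] = merged.get(term, 0.0) + w * weights[field]
    total = sum(merged.values())
    return {t: w / total for t, w in merged.items()}

merged = merge_weighted(
    {"title": {"auth": 0.5, "setup": 0.5}, "body": {"sso": 1.0}},
    {"title": 2.0, "body": 1.0},
)
print(merged)
# -> auth, setup, and sso each end up with weight 1/3
```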
Using with Search
Pass analyzer-generated terms to the search API:
analyzer = Analyzer().lowercase().stopwords("en").stemmer("english")
# Index
terms = analyzer.sparse_terms("How to configure SSO authentication")
db.upsert("doc1", embedding, {"text": "..."}, sparse=terms)
# Search
query_terms = analyzer.sparse_terms("SSO setup guide")
results = db.search(query_embedding, sparse=query_terms, k=10)
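Indexing and querying must share one analyzer because matching happens on the analyzed terms, not the raw text. A sketch of sparse overlap scoring as a dot product over shared terms (the real ranking uses BM25; this is only illustrative):

```python
def sparse_dot(query_terms, doc_terms):
    # Sum the products of weights for terms present in both vectors.
    return sum(w * doc_terms.get(t, 0.0) for t, w in query_terms.items())

doc = {"sso": 0.5, "authent": 0.5}    # analyzed document terms
query = {"sso": 0.5, "setup": 0.5}    # analyzed query terms
print(sparse_dot(query, doc))
# -> 0.25 (only "sso" overlaps)
```

If the index used a stemmer but the query did not, "authentication" in the query would stay unstemmed and never match the indexed term "authent".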