What it covers
Three recommender approaches compared on the same 10,000-book dataset:
- TF-IDF similarity — classical bag-of-words with spaCy lemmatization and cosine similarity
- Sentence embeddings via ChromaDB — semantic similarity using all-MiniLM-L6-v2
- Hybrid — content-based candidate retrieval (ChromaDB) ranked by collaborative filtering scores (surprise SVD)
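To make the first approach concrete, here is a minimal TF-IDF sketch using scikit-learn with a tiny hypothetical corpus. The spaCy lemmatization step is omitted for brevity (in the full pipeline each description would be lemmatized before vectorizing), and the titles and texts are illustrative, not from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny illustrative corpus (hypothetical descriptions, not the real dataset).
books = {
    "Twilight": "vampire romance saga forks teenage love",
    "New Moon": "vampire romance saga werewolf heartbreak",
    "Dune": "desert planet spice politics empire",
}
titles = list(books)

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(books.values())

def most_similar(query: str) -> str:
    """Return the title whose TF-IDF vector is closest to the query."""
    sims = cosine_similarity(tfidf.transform([query]), matrix)[0]
    return titles[sims.argmax()]

print(most_similar("desert spice empire"))  # matches Dune's surface terms
```

Because TF-IDF only matches surface terms, a query sharing no vocabulary with a book scores zero against it, which is exactly the behavior contrasted with embeddings below.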
Key observations
- TF-IDF matches specific terms: Queries with strong proper nouns (e.g. “Twilight”) return exact saga members
- Embeddings capture theme: Abstract queries like “Animal Farm” yield thematically related political allegories rather than surface-term matches
- Hybrid beats both: Content-based retrieval narrows the candidate set to relevant books; SVD personalizes the ranking to the target user’s predicted preferences
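The hybrid flow above can be sketched in a few lines. Here `retrieve` and `predicted_rating` are hypothetical stand-ins: in the real pipeline they would be a ChromaDB `collection.query` call and surprise's `SVD.predict`, respectively; the catalog and ratings below are made up for illustration.

```python
# Hybrid step: content-based retrieval proposes candidates,
# then a collaborative-filtering score re-ranks them for one user.

def retrieve(query: str, k: int = 3) -> list[str]:
    # Stand-in for a ChromaDB semantic query returning the k nearest books.
    catalog = ["Animal Farm", "1984", "Brave New World", "Twilight"]
    return catalog[:k]

RATINGS = {"Animal Farm": 4.1, "1984": 4.7, "Brave New World": 3.9}

def predicted_rating(user_id: int, title: str) -> float:
    # Stand-in for a trained SVD model's predicted rating for this user.
    return RATINGS.get(title, 3.0)

def hybrid_recommend(user_id: int, query: str, k: int = 3) -> list[str]:
    candidates = retrieve(query, k)  # narrow to relevant books
    return sorted(candidates, key=lambda t: predicted_rating(user_id, t),
                  reverse=True)    # personalize the order

print(hybrid_recommend(42, "political allegory"))
# → ['1984', 'Animal Farm', 'Brave New World']
```

The design choice is the key point: retrieval guarantees topical relevance, while the rating model only decides the order within that relevant set.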
Stack
- NLP preprocessing: spaCy (en_core_web_sm)
- Classical similarity: scikit-learn TfidfVectorizer + cosine_similarity
- Embedding search: ChromaDB (sentence transformers under the hood)
- Collaborative filtering: surprise library — SVD matrix factorization
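For intuition on the SVD matrix-factorization piece, here is a numpy-only sketch of the underlying idea: factor a user-item rating matrix into low-rank factors and read predictions off the reconstruction. This is a toy illustration of the principle, not surprise's actual training procedure (which fits biased factors by SGD on observed ratings only); the ratings matrix is invented.

```python
import numpy as np

# Toy user-item rating matrix (0 = unrated); rows are users, columns books.
R = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
])

# Truncated SVD: keep the top-k latent factors.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]

# The low-rank reconstruction fills in a score for the unrated slot.
print(round(float(R_hat[0, 2]), 2))
```

The low-rank constraint is what generalizes: users with similar rating patterns share latent factors, so the reconstruction borrows evidence across users to score unrated items.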