What it covers
Three recommender approaches compared on the same 10,000-book dataset:
- TF-IDF similarity — classical bag-of-words with spaCy lemmatization and cosine similarity
- Sentence embeddings via ChromaDB — semantic similarity using all-MiniLM-L6-v2
- Hybrid — content-based candidate retrieval (ChromaDB) ranked by collaborative filtering scores (surprise SVD)
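To make the first approach concrete, here is a minimal TF-IDF sketch using scikit-learn with a tiny hypothetical corpus. The spaCy lemmatization step is omitted for brevity (in the full pipeline each description would be lemmatized before vectorizing), and the titles and texts are illustrative, not from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny illustrative corpus (hypothetical descriptions, not the real dataset).
books = {
    "Twilight": "vampire romance saga forks teenage love",
    "New Moon": "vampire romance saga werewolf heartbreak",
    "Dune": "desert planet spice politics empire",
}
titles = list(books)

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(books.values())

def most_similar(query: str) -> str:
    """Return the title whose TF-IDF vector is closest to the query."""
    sims = cosine_similarity(tfidf.transform([query]), matrix)[0]
    return titles[sims.argmax()]

print(most_similar("desert spice empire"))  # matches Dune's surface terms
```

Because TF-IDF only matches surface terms, a query sharing no vocabulary with a book scores zero against it, which is exactly the behavior contrasted with embeddings below.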
Key observations
- TF-IDF matches specific terms: Queries with strong proper nouns (e.g. “Twilight”) return exact saga members
- Embeddings capture theme: Abstract queries like “Animal Farm” yield thematically related political allegories rather than surface-term matches
- Hybrid beats both: Content-based retrieval narrows the candidate set to relevant books; SVD personalizes the ranking to the target user’s predicted preferences
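The hybrid flow above can be sketched in a few lines. Here `retrieve` and `predicted_rating` are hypothetical stand-ins: in the real pipeline they would be a ChromaDB `collection.query` call and surprise's `SVD.predict`, respectively; the catalog and ratings below are made up for illustration.

```python
# Hybrid step: content-based retrieval proposes candidates,
# then a collaborative-filtering score re-ranks them for one user.

def retrieve(query: str, k: int = 3) -> list[str]:
    # Stand-in for a ChromaDB semantic query returning the k nearest books.
    catalog = ["Animal Farm", "1984", "Brave New World", "Twilight"]
    return catalog[:k]

RATINGS = {"Animal Farm": 4.1, "1984": 4.7, "Brave New World": 3.9}

def predicted_rating(user_id: int, title: str) -> float:
    # Stand-in for a trained SVD model's predicted rating for this user.
    return RATINGS.get(title, 3.0)

def hybrid_recommend(user_id: int, query: str, k: int = 3) -> list[str]:
    candidates = retrieve(query, k)  # narrow to relevant books
    return sorted(candidates, key=lambda t: predicted_rating(user_id, t),
                  reverse=True)    # personalize the order

print(hybrid_recommend(42, "political allegory"))
# → ['1984', 'Animal Farm', 'Brave New World']
```

The design choice is the key point: retrieval guarantees topical relevance, while the rating model only decides the order within that relevant set.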
Stack
- NLP preprocessing: spaCy (en_core_web_sm)
- Classical similarity: scikit-learn TfidfVectorizer + cosine_similarity
- Embedding search: ChromaDB (sentence transformers under the hood)
- Collaborative filtering: surprise library — SVD matrix factorization
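For intuition on the SVD matrix-factorization piece, here is a numpy-only sketch of the underlying idea: factor a user-item rating matrix into low-rank factors and read predictions off the reconstruction. This is a toy illustration of the principle, not surprise's actual training procedure (which fits biased factors by SGD on observed ratings only); the ratings matrix is invented.

```python
import numpy as np

# Toy user-item rating matrix (0 = unrated); rows are users, columns books.
R = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
])

# Truncated SVD: keep the top-k latent factors.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]

# The low-rank reconstruction fills in a score for the unrated slot.
print(round(float(R_hat[0, 2]), 2))
```

The low-rank constraint is what generalizes: users with similar rating patterns share latent factors, so the reconstruction borrows evidence across users to score unrated items.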