The problem
Train a binary sentiment classifier on the full Amazon Reviews dataset (~17M reviews) using PySpark MLlib on AWS Glue, with the trained pipeline persisted to S3 for downstream batch inference.
Architecture
Two-notebook split — training (notebook 1) and inference (notebook 2) separated by the S3 model artifact.
┌─────────────────────────────┐
│ amazon-reviews-pds-parquet │
│ (~17M reviews, partitioned) │
└──────────────┬──────────────┘
▼
┌──────────────────┐
│ Electronics │ .filter(product_category == "Electronics")
│ subset (3.1M) │ .repartition(32)
└─────────┬────────┘
▼ (write to own S3 bucket)
┌──────────────────┐
│ s3://.../ │
│ electronics/ │
└─────────┬────────┘
▼
┌────────────────────┐
│ Feature pipeline │ Tokenizer → StopWordsRemover
│ │ → HashingTF + IDF (or Word2Vec)
└─────────┬──────────┘
▼
┌────────────────────┐
│ LogisticRegression │ → fit, evaluate on 30% test split
│  or DecisionTree   │
└─────────┬──────────┘
▼
┌────────────────────┐
│ Model serialized │
│ to S3 (.save) │
└─────────┬──────────┘
▼
┌────────────────────┐
│ Inference notebook │ → load model from S3
│ │ → batch score test set
│ │ → AUC evaluation
└────────────────────┘
Key decisions
Filter and repartition before caching
The source dataset has ~17M rows, but all of the work happens on the ~3.1M-row Electronics slice. Filtering first reduces scan volume; repartition(32) then sets a reasonable file size for re-reads. Only after those two operations is the DataFrame cached, so we never pay to cache data we'll never touch.
Write filtered data to our own S3 bucket
Public datasets have latency and throughput constraints. Writing the filtered slice to our own bucket gives stable, fast reads for all downstream work.
Binary target from the ordinal star rating
star_rating >= 3 → sentiment = 1, else 0. A simple cutoff that lets us use standard binary classification metrics (AUC, precision/recall at thresholds).
Split the pipeline at the serialization boundary
Training and inference notebooks are deliberately separate. The only dependency between them is the serialized model artifact in S3 — the same contract a production system would have. The model can be retrained independently; inference can be re-run on new data without retraining.
Evaluate on a persisted test set, not a fresh split
Before training, the 30% test partition is written to electronics_test/ in S3. That guarantees training and inference notebooks see the same held-out data — reproducible evaluation even across separate sessions.
Stack
- Compute: AWS Glue (Spark)
- Storage: S3 Parquet
- Framework: PySpark MLlib (Tokenizer, StopWordsRemover, HashingTF, IDF, Word2Vec, LogisticRegression)
- Evaluation: BinaryClassificationEvaluator → AUC