The problem
Train a binary sentiment classifier on the full Amazon Reviews dataset (~17M reviews) using PySpark MLlib on AWS Glue, with the trained pipeline persisted to S3 for downstream batch inference.
Architecture
Two-notebook split — training (notebook 1) and inference (notebook 2) separated by the S3 model artifact.
┌─────────────────────────────┐
│ amazon-reviews-pds-parquet │
│ (~17M reviews, partitioned) │
└──────────────┬──────────────┘
▼
┌──────────────────┐
│ Electronics │ .filter(product_category == "Electronics")
│ subset (3.1M) │ .repartition(32)
└─────────┬────────┘
▼ (write to own S3 bucket)
┌──────────────────┐
│ s3://.../ │
│ electronics/ │
└─────────┬────────┘
▼
┌────────────────────┐
│ Feature pipeline │ Tokenizer → StopWordsRemover
│ │ → HashingTF + IDF (or Word2Vec)
└─────────┬──────────┘
▼
┌────────────────────┐
│ LogisticRegression │ → fit, evaluate on 30% test split
│  or DecisionTree   │
└─────────┬──────────┘
▼
┌────────────────────┐
│ Model serialized │
│ to S3 (.save) │
└─────────┬──────────┘
▼
┌────────────────────┐
│ Inference notebook │ → load model from S3
│ │ → batch score test set
│ │ → AUC evaluation
└────────────────────┘
Key decisions
Filter and repartition before caching
The source dataset has ~17M rows, but all of the work happens on the ~3.1M-row Electronics slice. Filtering first reduces scan volume; repartition(32) then sets a reasonable file size for re-reads. Only after those two operations is the DataFrame cached, so we never pay to cache data we'll never touch.
Write filtered data to our own S3 bucket
Public datasets have latency and throughput constraints. Writing the filtered slice to our own bucket gives stable, fast reads for all downstream work.
Binary target from the ordinal star rating
star_rating >= 3 → sentiment = 1, else 0. A simple cutoff that lets us use standard binary classification metrics (AUC, precision/recall at thresholds).
Split the pipeline at the serialization boundary
Training and inference notebooks are deliberately separate. The only dependency between them is the serialized model artifact in S3 — the same contract a production system would have. The model can be retrained independently; inference can be re-run on new data without retraining.
Evaluate on a persisted test set, not a fresh split
Before training, the 30% test partition is written to electronics_test/ in S3. That guarantees training and inference notebooks see the same held-out data — reproducible evaluation even across separate sessions.
Stack
- Compute: AWS Glue (Spark)
- Storage: S3 Parquet
- Framework: PySpark MLlib (Tokenizer, StopWordsRemover, HashingTF, IDF, Word2Vec, LogisticRegression)
- Evaluation: BinaryClassificationEvaluator → AUC