What it covers
Three ML use cases built on a shared scikit-learn foundation:
- Churn prevention — binary classification to flag at-risk telco customers (logistic regression, decision tree, SVM with GridSearchCV hyperparameter tuning)
- Customer Lifetime Value — regression to estimate long-term customer revenue (linear regression vs. depth-tuned decision tree)
- Customer segmentation — KMeans clustering on RFM (Recency, Frequency, Monetary) features with elbow-method k selection
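
The elbow-method k selection mentioned for segmentation can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the RFM table is synthetic, and the names `rfm`, `X`, and `inertias` are assumptions made for the example. The idea is to fit KMeans for a range of k values and look for the point where inertia (within-cluster sum of squares) stops dropping sharply.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical RFM table (synthetic, for illustration only):
# recency in days, frequency as order count, monetary as total spend.
rfm = np.column_stack([
    rng.exponential(60, 500),
    rng.poisson(5, 500) + 1,
    rng.lognormal(4, 1, 500),
])

# Log-transform to reduce skew, then standardize so each feature
# contributes comparably to Euclidean distance.
X = StandardScaler().fit_transform(np.log1p(rfm))

# Fit KMeans for k = 1..8 and record the inertia for each k.
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting `inertias` against k and picking the bend in the curve is a judgment call, so the chosen k is usually checked against whether the resulting segments are interpretable.
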
AE-adjacent themes
- ColumnTransformer for parallel numeric/categorical preprocessing
- Pipeline discipline — separate pandas-level cleanup (type casts, ordinal encoding) from scikit-learn’s fit-dependent transforms (imputation, scaling)
- Classification threshold tuning as a business decision, not a default
- AUC for threshold-independent model comparison
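
A minimal sketch of how these themes might fit together is shown below. It assumes a synthetic telco-like table; the column names (`tenure`, `monthly_charges`, `contract`) and the label construction are hypothetical and chosen only for illustration. Pandas handles the training-independent cast, while imputation and scaling live inside the `Pipeline` so they are fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in for a churn table (hypothetical columns).
df = pd.DataFrame({
    "tenure": rng.integers(1, 72, n),
    "monthly_charges": rng.normal(70, 20, n),
    "contract": rng.choice(["monthly", "yearly"], n),
})

# Synthetic label: short tenure and monthly contracts raise churn odds.
score = (-0.05 * df["tenure"]
         + 1.0 * (df["contract"] == "monthly")
         + rng.normal(0, 1, n))
y = (score > score.median()).astype(int)

# Pandas-level cleanup: a type cast that does not depend on training data.
df["tenure"] = df["tenure"].astype(float)
# Inject some missing values so the imputer has work to do.
df.loc[rng.choice(n, 20, replace=False), "monthly_charges"] = np.nan

numeric = ["tenure", "monthly_charges"]
categorical = ["contract"]

# Fit-dependent transforms: numeric and categorical branches side by side.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, stratify=y, random_state=0
)
model.fit(X_train, y_train)

# AUC is computed from predicted probabilities, so it needs no threshold.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"test AUC: {auc:.3f}")
```

Swapping the `clf` step for another estimator keeps the preprocessing identical, which is what makes an AUC comparison across models fair.
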
Key decisions
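- Split preprocessing between pandas (column-level, training-independent steps) and an sklearn Pipeline (fit-dependent steps). Because fit-dependent transforms learn their parameters from the training split only, this separation avoids subtle data leakage and lets the same fitted pipeline be reapplied to new data.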
- Manual threshold sweep (0.25, 0.5, 0.75) to discuss precision/recall tradeoffs in business terms — retention campaigns are cheaper than losing customers, so bias toward recall.
- RFM features were log-transformed and standardized before clustering: KMeans is distance-based, so skewed or differently scaled features distort cluster geometry.
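
The manual threshold sweep can be illustrated with a short sketch. The data here is synthetic (`make_classification`), and the classifier choice and variable names are assumptions for the example, not the project's code. Only the three thresholds come from the text above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: roughly 25% positives stand in for churners.
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.75, 0.25], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

# Sweep the decision threshold instead of accepting the 0.5 default.
results = {}
for threshold in (0.25, 0.5, 0.75):
    pred = (proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_test, pred, zero_division=0),
        recall_score(y_test, pred),
    )
    precision, recall = results[threshold]
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Lowering the threshold can only add predicted positives, so recall never decreases as the threshold drops. Whether the extra false positives are acceptable depends on the cost of a retention offer relative to a lost customer.
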
Results
- Churn model: SVM wins on AUC (~0.85)
- CLV: depth-tuned tree (max depth 10, 0.78 test R²) beats both the unrestricted tree and linear regression
- Segmentation: k=4 clusters map cleanly to {premium / growing / occasional / churned}