Lab 2 - Interactive Explainer
Run distributed PySpark data processing without managing EMR clusters. Learn the StringIndexer → OneHotEncoder → VectorAssembler pipeline pattern for scalable feature engineering.
In this lab, you run a distributed Apache Spark job on SageMaker Processing to transform the Adult Census dataset into ML-ready features. Unlike Lab 1 (visual Data Wrangler), this lab uses code-based PySpark — the approach you'd use at AnyCompany for production pipelines processing millions of payroll records.
Create a serverless Spark cluster on SageMaker — 2x ml.m5.xlarge instances. No EMR provisioning, no cluster management. Pay only for processing time.
Chain StringIndexer → OneHotEncoder → VectorAssembler into a reproducible Spark ML Pipeline. Same pattern used at enterprise scale.
80/20 random split into training and validation datasets. Output as CSV to S3 for downstream SageMaker training jobs.
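As a preview, the end of the preprocessing script performs that split and write. A minimal sketch, assuming transformed_df is the output of the feature pipeline described below; the bucket path and seed are placeholders, not the lab's exact values:

```python
# Sketch: 80/20 split of the transformed data, written as CSV for downstream SageMaker training.
# "transformed_df" and the S3 paths are placeholders, not the lab's exact names.
train_df, validation_df = transformed_df.randomSplit([0.8, 0.2], seed=42)

train_df.write.mode("overwrite").csv("s3://<your-bucket>/lab2/output/train/")
validation_df.write.mode("overwrite").csv("s3://<your-bucket>/lab2/output/validation/")
```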
| Column | Type | Role in Pipeline | HCM Equivalent |
|---|---|---|---|
| age | Integer | Numeric feature (passthrough) | years_at_company |
| workclass | String | Categorical → StringIndex → OneHot | department |
| education | String | Categorical → StringIndex → OneHot | highest_degree |
| marital_status | String | Categorical → StringIndex → OneHot | employment_type |
| occupation | String | Categorical → StringIndex → OneHot | job_role |
| hours_per_week | Integer | Numeric feature (passthrough) | avg_weekly_hours |
| income | String | Target label (≤50K / >50K) | left_company (0/1) |
Spark ML Pipelines chain transformers into a single reproducible workflow. Click any node to explore that stage, or auto-play to watch data flow through the full pipeline.
The core of Lab 2 is transforming categorical strings into numeric vectors that Spark ML algorithms can consume. Here's exactly what happens to each column.
| Original Value | Frequency (in 999-row sample) | Index |
|---|---|---|
| Private | ~697 (most common) | 0 |
| Self-emp-not-inc | ~78 | 1 |
| Local-gov | ~67 | 2 |
| State-gov | ~42 | 3 |
| Self-emp-inc | ~36 | 4 |
| Federal-gov | ~30 | 5 |
| ? (missing) | ~49 | 6 |
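A minimal sketch of the indexing step that produces this kind of mapping (the input DataFrame df is assumed, and handleInvalid is one reasonable choice, not necessarily the lab's):

```python
from pyspark.ml.feature import StringIndexer

# By default StringIndexer orders labels by descending frequency,
# so the most common value ("Private") receives index 0.
workclass_indexer = StringIndexer(
    inputCol="workclass",
    outputCol="workclass_index",
    handleInvalid="keep",  # assumption: keep unseen/missing values rather than fail
)
indexed_df = workclass_indexer.fit(df).transform(df)
```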
| Index | Category | One-Hot Vector |
|---|---|---|
| 0 | Private | [1, 0, 0, 0, 0, 0] |
| 1 | Self-emp-not-inc | [0, 1, 0, 0, 0, 0] |
| 2 | Local-gov | [0, 0, 1, 0, 0, 0] |
| 3 | State-gov | [0, 0, 0, 1, 0, 0] |
| 4 | Self-emp-inc | [0, 0, 0, 0, 1, 0] |
| 5 | Federal-gov | [0, 0, 0, 0, 0, 1] |
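A minimal sketch of the encoding step that produces these vectors, picking up the indexed column from the previous step:

```python
from pyspark.ml.feature import OneHotEncoder

# dropLast=True (the default) yields a 6-dimensional vector for the 7 indexed
# categories; the highest index maps to the all-zeros vector.
workclass_encoder = OneHotEncoder(
    inputCols=["workclass_index"],
    outputCols=["workclass_vec"],
    dropLast=True,
)
encoded_df = workclass_encoder.fit(indexed_df).transform(indexed_df)
```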
Spark stores these one-hot vectors in sparse format: for native_country, with 28 unique values in this sample, that saves substantial memory versus storing 27 zeros per row. Spark's OneHotEncoder uses N-1 encoding (the last category maps to the all-zeros vector) to avoid multicollinearity.

All 8 one-hot vectors + 6 numeric columns are assembled into a single feature vector per row:
[workclass_onehot(6) | education_onehot(15) | marital_onehot(6) | occupation_onehot(14) | relationship_onehot(5) | race_onehot(4) | sex_onehot(1) | country_onehot(27) | age | fnlwgt | education_num | capital_gain | capital_loss | hours_per_week]
Total dimensions: ~84 features per row (78 one-hot + 6 numeric). Spark uses N-1 encoding (drops last category to avoid multicollinearity).
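Putting it together, the full pipeline over all eight categorical and six numeric columns might look like this sketch (the output column names are illustrative; the lab script's exact names may differ):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

categorical_cols = ["workclass", "education", "marital_status", "occupation",
                    "relationship", "race", "sex", "native_country"]
numeric_cols = ["age", "fnlwgt", "education_num", "capital_gain",
                "capital_loss", "hours_per_week"]

# One StringIndexer per categorical column, one shared OneHotEncoder, one assembler.
indexers = [
    StringIndexer(inputCol=c, outputCol=f"{c}_index", handleInvalid="keep")
    for c in categorical_cols
]
encoder = OneHotEncoder(
    inputCols=[f"{c}_index" for c in categorical_cols],
    outputCols=[f"{c}_vec" for c in categorical_cols],
)
assembler = VectorAssembler(
    inputCols=[f"{c}_vec" for c in categorical_cols] + numeric_cols,
    outputCol="features",
)

pipeline = Pipeline(stages=indexers + [encoder, assembler])
transformed_df = pipeline.fit(df).transform(df)
```

Because the stages live in a single Pipeline object, fitting and applying the whole transformation is one call, and the same fitted pipeline can be reused on new data.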
SageMaker Processing lets you run Spark jobs without managing infrastructure. Click any node to explore the architecture.
| Parameter | Lab 2 Value | Production (AnyCompany) |
|---|---|---|
| instance_count | 2 | 10-20 (for millions of records) |
| instance_type | ml.m5.xlarge | ml.m5.4xlarge or ml.r5.xlarge (memory-optimized) |
| framework_version | 3.1 | 3.3+ (latest stable) |
| max_runtime_in_seconds | 1200 (20 min) | 7200 (2 hours for large jobs) |
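With the Lab 2 values from the table, the processor configuration might look like this sketch (the job name is a placeholder and role is assumed to hold your SageMaker execution role):

```python
from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
    base_job_name="lab2-census-preprocess",  # placeholder job name
    framework_version="3.1",
    role=role,                               # your SageMaker execution role ARN
    instance_count=2,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=1200,
)
```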
Here's how you'd adapt Lab 2's PySpark pipeline for AnyCompany's employee attrition prediction at scale — processing data from multiple HR systems across countries.
| Lab 2 Column | AnyCompany Column | Encoding | Rationale |
|---|---|---|---|
| workclass | department | OneHot | No natural order between Engineering, Sales, Support |
| education | highest_degree | OneHot (or Ordinal) | Could argue ordinal: HS < BS < MS < PhD |
| marital_status | work_mode | OneHot | Remote, Hybrid, Office — no inherent order |
| occupation | job_role | OneHot | SDE, PM, Designer, QA — nominal categories |
| native_country | office_location | OneHot | Hyderabad, Pune, Chennai, Roseland — 10+ values |
| age | years_at_company | Numeric (passthrough) | Continuous, already numeric |
| hours_per_week | avg_weekly_hours | Numeric (passthrough) | Continuous, already numeric |
| income | left_company | Target label (0/1) | Binary classification target |
Remove exit_interview_scheduled and resignation_notice_days — these only exist for leavers and would give the model the answer.
Create months_since_promotion, salary_vs_band_median, manager_changes_2yr — these require joins across multiple source tables.
Normalize currencies to USD, handle country-specific leave policies, align fiscal calendars. Spark's distributed joins handle this at scale.
Hash or drop employee_id, name, email before training. Use SageMaker Processing's VPC config to keep data in private subnets.
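In the PySpark script, the leakage and PII steps above might look roughly like this (column names follow this section's examples; the salted hash is one illustrative approach, not the only option):

```python
from pyspark.sql import functions as F

# Drop columns that only exist for employees who already left (label leakage).
hr_df = hr_df.drop("exit_interview_scheduled", "resignation_notice_days")

# Pseudonymize the identifier, then drop direct PII before training.
# The salt literal is a placeholder; keep real salts in a secrets manager.
hr_df = (
    hr_df
    .withColumn("employee_hash",
                F.sha2(F.concat_ws("|", F.col("employee_id"), F.lit("<salt>")), 256))
    .drop("employee_id", "name", "email")
)
```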
Lab scale: 999 rows, 2x ml.m5.xlarge, ~5 min, ~$0.50
Production scale: 500K employee records across 40+ countries, 4 source systems
Cluster: PySparkProcessor(instance_count=10, instance_type='ml.m5.4xlarge')
Schedule: Monthly trigger via SageMaker Pipelines (1st of each month)
Output: Parquet (not CSV) to S3 with 70/15/15 train/val/test split
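The production output step might look like this sketch (bucket and prefix are placeholders; transformed_df is the pipeline output):

```python
# Sketch: 70/15/15 split written as Parquet for efficient downstream reads.
train_df, val_df, test_df = transformed_df.randomSplit([0.70, 0.15, 0.15], seed=42)

output_base = "s3://<anycompany-ml-bucket>/attrition/monthly/"
train_df.write.mode("overwrite").parquet(output_base + "train/")
val_df.write.mode("overwrite").parquet(output_base + "validation/")
test_df.write.mode("overwrite").parquet(output_base + "test/")
```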
Both labs transform the same dataset — but they represent two different approaches to data processing. Understanding when to use each is a key skill for ML engineers at AnyCompany.
| Dimension | Lab 1: Data Wrangler | Lab 2: PySpark |
|---|---|---|
| Interface | Visual GUI (drag-and-drop) | Code (Python + Spark) |
| Compute Environment | SageMaker Studio instance (single machine) | SageMaker Processing (managed Spark cluster, 2+ nodes) |
| Skill Level | Data scientists, analysts | ML engineers, developers |
| Scale | < 100K rows (single machine) | Millions+ (distributed cluster) |
| Reproducibility | Exportable flow (JSON) | Version-controlled .py scripts |
| Complex Joins | Limited | Full SQL + Spark DataFrame joins |
| CI/CD Integration | Difficult | Native (script in Git repo) |
| Cost | Studio instance time (always-on while open) | Processing job time only (auto-terminates after job) |
| Encoding Approach | Manual per-column transforms | Pipeline pattern (batch all categoricals) |
| AnyCompany Use | Prototyping, EDA, small datasets | Production pipelines, scheduled retraining |
You write a standard PySpark script and PySparkProcessor.run() ships it to a provisioned cluster. The cluster auto-provisions, runs your code, writes output to S3, and auto-terminates. You never install Spark locally.

Choose Data Wrangler when: you're exploring a new dataset for the first time, prototyping quickly, non-engineers need to contribute, the dataset fits in memory (<100K rows), or you want visual debugging of transforms.

Choose PySpark when: the pipeline runs on a production schedule, the dataset exceeds single-machine memory, complex multi-table joins are required, you need version control and CI/CD, or you're processing millions of payroll records monthly.
SageMaker Processing provisions a Spark cluster on demand. You submit a .py script — it runs on ml.m5.xlarge instances, not your laptop. Cluster auto-terminates after job completes.
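Submitting the script, reusing the spark_processor configured earlier, might look like this sketch (the script path, argument names, and S3 URIs are placeholders; the lab notebook's actual arguments may differ):

```python
# Submit the local PySpark script; SageMaker provisions the cluster,
# runs the job, and tears the cluster down when it finishes.
spark_processor.run(
    submit_app="./code/preprocess.py",  # placeholder path to the PySpark script
    arguments=[
        "--s3_input_prefix", "s3://<your-bucket>/lab2/raw/",
        "--s3_output_prefix", "s3://<your-bucket>/lab2/processed/",
    ],
)
```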
StringIndexer → OneHotEncoder → VectorAssembler. Composable, reproducible, parallelizable across the cluster.
The same script scales from the lab's 999-row sample to millions of production records: just increase instance_count. Code lives in version control.