Lab 2 - Interactive Explainer

Spark Processing on SageMaker

Run distributed PySpark data processing without managing EMR clusters. Learn the StringIndexer → OneHotEncoder → VectorAssembler pipeline pattern for scalable feature engineering.

⚡ PySpark ☁️ SageMaker Processing 🏢 HCM Context 🧪 Lab 2

📋 Lab 2 Overview

In this lab, you run a distributed Apache Spark job on SageMaker Processing to transform the Adult Census dataset into ML-ready features. Unlike Lab 1 (visual Data Wrangler), this lab uses code-based PySpark — the approach you'd use at AnyCompany for production pipelines processing millions of payroll records.

What You Build

⚙️

PySparkProcessor

Create a serverless Spark cluster on SageMaker — 2x ml.m5.xlarge instances. No EMR provisioning, no cluster management. Pay only for processing time.

🔄

ML Pipeline

Chain StringIndexer → OneHotEncoder → VectorAssembler into a reproducible Spark ML Pipeline. Same pattern used at enterprise scale.

✂️

Train/Val Split

80/20 random split into training and validation datasets. Output as CSV to S3 for downstream SageMaker training jobs.
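For orientation, here is a minimal sketch of that split-and-write step, assuming the pipeline's output DataFrame is named transformed_df and the S3 path is a placeholder rather than the lab's actual bucket:

```python
# 80/20 random split; the seed and bucket paths are illustrative
train_df, val_df = transformed_df.randomSplit([0.8, 0.2], seed=42)

train_df.write.mode("overwrite").csv("s3://your-bucket/lab2/output/train/")
val_df.write.mode("overwrite").csv("s3://your-bucket/lab2/output/validation/")
```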

Input Dataset: Adult Census

| Column | Type | Role in Pipeline | HCM Equivalent |
|---|---|---|---|
| age | Integer | Numeric feature (passthrough) | years_at_company |
| workclass | String | Categorical → StringIndex → OneHot | department |
| education | String | Categorical → StringIndex → OneHot | highest_degree |
| marital_status | String | Categorical → StringIndex → OneHot | employment_type |
| occupation | String | Categorical → StringIndex → OneHot | job_role |
| hours_per_week | Integer | Numeric feature (passthrough) | avg_weekly_hours |
| income | String | Target label (≤50K / >50K) | left_company (0/1) |
💡
15 columns, 999 rows: 8 categorical (need encoding) + 6 numeric (passthrough) + 1 target. This is a sample of the full Adult Census dataset (~48K rows). The PySpark pipeline handles all 8 categoricals in parallel across the cluster — at AnyCompany scale, the same code processes millions of records by simply increasing instance_count.

🔄 The PySpark ML Pipeline

Spark ML Pipelines chain transformers into a single reproducible workflow. The stages below trace how data flows through the full pipeline, one transformation at a time.

📋 StringIndexer: Converts categorical string values into numeric indices. "Private" → 0, "Self-emp" → 1, "Gov" → 2. Assigns indices by frequency (most common = 0). Applied to all 8 categorical columns in parallel.
Pipeline flow: 🔠 StringIndexer (str → int) → 🔥 OneHotEncoder (int → binary vector) → 🧩 VectorAssembler (combine all features) → ✂️ 80/20 Split (train / validation) → 💾 S3 (CSV output)

🔠 StringIndexer
Input: 8 categorical string columns
Output: 8 numeric index columns (*-index)
Method: frequency-based (most common = 0)
Example: workclass: "Private" → 0, "Self-emp" → 1, "Gov" → 2
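A minimal sketch of this stage, assuming the eight categorical column names below (the lab script's exact names may differ):

```python
from pyspark.ml.feature import StringIndexer

# Assumed list of the 8 categorical columns described above
categorical_cols = [
    "workclass", "education", "marital_status", "occupation",
    "relationship", "race", "sex", "native_country",
]

# One StringIndexer per column; "frequencyDesc" assigns index 0 to the most common value
indexers = [
    StringIndexer(inputCol=c, outputCol=f"{c}-index", stringOrderType="frequencyDesc")
    for c in categorical_cols
]
```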

🔢 Encoding Deep Dive

The core of Lab 2 is transforming categorical strings into numeric vectors that Spark ML algorithms can consume. Here's exactly what happens to each column.

Step 1: StringIndexer (String → Integer)

| Original Value | Frequency (in 999-row sample) | Index |
|---|---|---|
| Private | ~697 (most common) | 0 |
| Self-emp-not-inc | ~78 | 1 |
| Local-gov | ~67 | 2 |
| State-gov | ~42 | 3 |
| Self-emp-inc | ~36 | 4 |
| Federal-gov | ~30 | 5 |
| ? (missing) | ~49 | 6 |
⚠️
Why not use indices directly? Index 5 is NOT "better" than index 2. These are nominal categories with no order. If you feed raw indices to a model, it will incorrectly assume Federal-gov (5) > Local-gov (2). That's why we need OneHotEncoder next.

Step 2: OneHotEncoder (Integer → Binary Vector)

| Index | Category | One-Hot Vector |
|---|---|---|
| 0 | Private | [1, 0, 0, 0, 0, 0] |
| 1 | Self-emp-not-inc | [0, 1, 0, 0, 0, 0] |
| 2 | Local-gov | [0, 0, 1, 0, 0, 0] |
| 3 | State-gov | [0, 0, 0, 1, 0, 0] |
| 4 | Self-emp-inc | [0, 0, 0, 0, 1, 0] |
| 5 | Federal-gov | [0, 0, 0, 0, 0, 1] |
💡
Sparse representation: Spark stores these as sparse vectors internally — only the position of the "1" is stored. For native_country with 28 unique values in this sample, this saves massive memory vs storing 27 zeros per row. Spark's OneHotEncoder uses N-1 encoding (last category is all zeros) to avoid multicollinearity.
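A sketch of the encoder stage under the same assumed column list; dropLast=True is Spark's default and produces the N-1 behavior described above:

```python
from pyspark.ml.feature import OneHotEncoder

# Encode each index column into a sparse binary vector (N-1 encoding via dropLast)
encoder = OneHotEncoder(
    inputCols=[f"{c}-index" for c in categorical_cols],
    outputCols=[f"{c}-onehot" for c in categorical_cols],
    dropLast=True,
)
```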

Step 3: VectorAssembler (Combine Everything)

All 8 one-hot vectors + 6 numeric columns are assembled into a single feature vector per row:

Final Feature Vector Structure

[workclass_onehot(6) | education_onehot(15) | marital_onehot(6) | occupation_onehot(14) | relationship_onehot(5) | race_onehot(4) | sex_onehot(1) | country_onehot(27) | age | fnlwgt | education_num | capital_gain | capital_loss | hours_per_week]

Total dimensions: ~84 features per row (78 one-hot + 6 numeric). Spark uses N-1 encoding (drops last category to avoid multicollinearity).
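Putting the three stages together, a hedged sketch of the full pipeline; df stands in for the loaded Adult Census DataFrame, and the column lists repeat the assumptions above:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler

# Numeric passthrough columns listed in the feature-vector structure above
numeric_cols = ["age", "fnlwgt", "education_num",
                "capital_gain", "capital_loss", "hours_per_week"]

assembler = VectorAssembler(
    inputCols=[f"{c}-onehot" for c in categorical_cols] + numeric_cols,
    outputCol="features",
)

# Indexers -> encoder -> assembler as one reproducible Spark ML Pipeline
pipeline = Pipeline(stages=indexers + [encoder, assembler])
model = pipeline.fit(df)              # learns index and encoding mappings
transformed_df = model.transform(df)  # adds the assembled "features" column
```

From here, transformed_df feeds the 80/20 split and CSV output shown in the overview.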

☁️ SageMaker Processing Architecture

SageMaker Processing lets you run Spark jobs without managing infrastructure. The nodes below trace the path from notebook to S3 output.

💻 Studio Notebook: You define the PySparkProcessor in a Jupyter notebook. Specify instance type, count, Spark version, and the preprocessing script path. One API call launches the entire distributed job.
Flow: 💻 Studio Notebook → ⚙️ PySparkProcessor (2x ml.m5.xlarge) → Spark Job (pyspark_preprocessing.py) → 🪣 S3 Output (train/ + validation/)
💻 Studio Notebook
Action: Define PySparkProcessor + call .run()
SDK: sagemaker.spark.processing.PySparkProcessor
Spark Version: 3.1
Duration: ~5 minutes total (incl. provisioning)

Key Configuration Parameters

| Parameter | Lab 2 Value | Production (AnyCompany) |
|---|---|---|
| instance_count | 2 | 10-20 (for millions of records) |
| instance_type | ml.m5.xlarge | ml.m5.4xlarge or ml.r5.xlarge (memory-optimized) |
| framework_version | 3.1 | 3.3+ (latest stable) |
| max_runtime_in_seconds | 1200 (20 min) | 7200 (2 hours for large jobs) |
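A sketch of how these parameters come together in the notebook; role, the bucket variable, and the script's argument names are assumptions, not values copied from the lab:

```python
from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
    base_job_name="lab2-spark-preprocess",   # illustrative job name
    framework_version="3.1",                 # Spark version used in the lab
    role=role,                               # assumes a SageMaker execution role is already defined
    instance_count=2,                        # production: 10-20
    instance_type="ml.m5.xlarge",            # production: ml.m5.4xlarge or ml.r5.xlarge
    max_runtime_in_seconds=1200,             # 20 min; raise for larger jobs
)

spark_processor.run(
    submit_app="pyspark_preprocessing.py",   # the preprocessing script
    arguments=[                              # argument names are illustrative
        "--s3_input_bucket", bucket,
        "--s3_output_prefix", "lab2/output",
    ],
)
```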

🏢 HCM Mapping: AnyCompany Attrition Pipeline

Here's how you'd adapt Lab 2's PySpark pipeline for AnyCompany's employee attrition prediction at scale — processing data from multiple HR systems across countries.

Column-by-Column Mapping

| Lab 2 Column | AnyCompany Column | Encoding | Rationale |
|---|---|---|---|
| workclass | department | OneHot | No natural order between Engineering, Sales, Support |
| education | highest_degree | OneHot (or Ordinal) | Could argue ordinal: HS < BS < MS < PhD |
| marital_status | work_mode | OneHot | Remote, Hybrid, Office; no inherent order |
| occupation | job_role | OneHot | SDE, PM, Designer, QA; nominal categories |
| native_country | office_location | OneHot | Hyderabad, Pune, Chennai, Roseland; 10+ values |
| age | years_at_company | Numeric (passthrough) | Continuous, already numeric |
| hours_per_week | avg_weekly_hours | Numeric (passthrough) | Continuous, already numeric |
| income | left_company | Target label (0/1) | Binary classification target |
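Re-parameterizing the Lab 2 pipeline for this schema is a small change. The column lists below follow the mapping table, and handleInvalid="keep" is an added assumption so categories unseen at fit time don't fail the job:

```python
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

# Hypothetical AnyCompany column lists, following the mapping table above
anycompany_categoricals = ["department", "highest_degree", "work_mode",
                           "job_role", "office_location"]
anycompany_numerics = ["years_at_company", "avg_weekly_hours"]

indexers = [
    StringIndexer(inputCol=c, outputCol=f"{c}-index", handleInvalid="keep")
    for c in anycompany_categoricals
]
encoder = OneHotEncoder(
    inputCols=[f"{c}-index" for c in anycompany_categoricals],
    outputCols=[f"{c}-onehot" for c in anycompany_categoricals],
)
assembler = VectorAssembler(
    inputCols=[f"{c}-onehot" for c in anycompany_categoricals] + anycompany_numerics,
    outputCol="features",
)
```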

AnyCompany-Specific Additions

⚠️

Drop Leaky Features

Remove exit_interview_scheduled and resignation_notice_days — these only exist for leavers and would give the model the answer.

📊

Add Derived Features

Create months_since_promotion, salary_vs_band_median, manager_changes_2yr — these require joins across multiple source tables.

🌍

Multi-Country Handling

Normalize currencies to USD, handle country-specific leave policies, align fiscal calendars. Spark's distributed joins handle this at scale.

🔒

PII Protection

Hash or drop employee_id, name, email before training. Use SageMaker Processing's VPC config to keep data in private subnets.
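A sketch of the first two additions in PySpark; raw_df and every column name here are hypothetical, mirroring the cards above:

```python
from pyspark.sql import functions as F

# Drop leaky and PII columns before feature engineering (column names are hypothetical)
clean_df = raw_df.drop(
    "exit_interview_scheduled", "resignation_notice_days",   # leaky: only populated for leavers
    "employee_id", "name", "email",                           # PII
)

# Example derived feature: months since last promotion
# (assumes a last_promotion_date column exists after the upstream joins)
clean_df = clean_df.withColumn(
    "months_since_promotion",
    F.months_between(F.current_date(), F.col("last_promotion_date")),
)
```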

AnyCompany Production Pipeline

Lab scale: 999 rows, 2x ml.m5.xlarge, ~5 min, ~$0.50

Production scale: 500K employee records across 40+ countries, 4 source systems

Cluster: PySparkProcessor(instance_count=10, instance_type='ml.m5.4xlarge')

Schedule: Monthly trigger via SageMaker Pipelines (1st of each month)

Output: Parquet (not CSV) to S3 with 70/15/15 train/val/test split
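A sketch of that production output step, assuming the transformed DataFrame from the pipeline and illustrative S3 paths:

```python
# Three-way split and Parquet output for the production pipeline (paths are illustrative)
train_df, val_df, test_df = transformed_df.randomSplit([0.70, 0.15, 0.15], seed=42)

train_df.write.mode("overwrite").parquet("s3://anycompany-hr-ml/attrition/train/")
val_df.write.mode("overwrite").parquet("s3://anycompany-hr-ml/attrition/validation/")
test_df.write.mode("overwrite").parquet("s3://anycompany-hr-ml/attrition/test/")
```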

⚖️ Lab 1 (Data Wrangler) vs Lab 2 (PySpark)

Both labs transform the same dataset — but they represent two different approaches to data processing. Understanding when to use each is a key skill for ML engineers at AnyCompany.

Head-to-Head Comparison

| Dimension | Lab 1: Data Wrangler | Lab 2: PySpark |
|---|---|---|
| Interface | Visual GUI (drag-and-drop) | Code (Python + Spark) |
| Compute Environment | SageMaker Studio instance (single machine) | SageMaker Processing (managed Spark cluster, 2+ nodes) |
| Skill Level | Data scientists, analysts | ML engineers, developers |
| Scale | < 100K rows (single machine) | Millions+ (distributed cluster) |
| Reproducibility | Exportable flow (JSON) | Version-controlled .py scripts |
| Complex Joins | Limited | Full SQL + Spark DataFrame joins |
| CI/CD Integration | Difficult | Native (script in Git repo) |
| Cost | Studio instance time (always-on while open) | Processing job time only (auto-terminates after job) |
| Encoding Approach | Manual per-column transforms | Pipeline pattern (batch all categoricals) |
| AnyCompany Use | Prototyping, EDA, small datasets | Production pipelines, scheduled retraining |
☁️
PySpark does NOT run on your local machine. In Lab 2, the PySpark script runs on a SageMaker Processing job — a fully managed Spark cluster (2x ml.m5.xlarge instances in the lab). You write the script locally or in a Studio notebook, then PySparkProcessor.run() ships it to a provisioned cluster. The cluster auto-provisions, runs your code, writes output to S3, and auto-terminates. You never install Spark locally.

Decision Framework

🔬

Use Data Wrangler When...

Exploring a new dataset for the first time. Quick prototyping. Non-engineers need to contribute. Dataset fits in memory (<100K rows). Visual debugging of transforms.

Use PySpark When...

Production pipeline that runs on schedule. Dataset exceeds single-machine memory. Complex multi-table joins required. Need version control + CI/CD. Processing millions of payroll records monthly.

🎯
The AnyCompany pattern: Start with Data Wrangler to explore and prototype transforms on a sample. Once validated, translate the logic into a PySpark script for production. Data Wrangler can even export to PySpark code automatically.

📝 Lab 2 Summary

Managed Spark (Not Local)

SageMaker Processing provisions a Spark cluster on demand. You submit a .py script — it runs on ml.m5.xlarge instances, not your laptop. Cluster auto-terminates after job completes.

ML Pipeline Pattern

StringIndexer → OneHotEncoder → VectorAssembler. Composable, reproducible, parallelizable across the cluster.

Production Ready

Same script scales from the 999-row lab sample to tens of millions of rows in production. Just increase instance_count. Code lives in version control.