Lab 2 - Interactive Explainer

Spark Processing on SageMaker

Run distributed PySpark data processing without managing EMR clusters. Learn the StringIndexer → OneHotEncoder → VectorAssembler pipeline pattern for scalable feature engineering.

⚡ PySpark ☁️ SageMaker Processing 🏢 HCM Context 🧪 Lab 2

📋 Lab 2 Overview

In this lab, you run a distributed Apache Spark job on SageMaker Processing to transform the Adult Census dataset into ML-ready features. Unlike Lab 1 (visual Data Wrangler), this lab uses code-based PySpark — the approach you'd use at AnyCompany for production pipelines processing millions of payroll records.

What You Build

⚙️

PySparkProcessor

Create a serverless Spark cluster on SageMaker — 2x ml.m5.xlarge instances. No EMR provisioning, no cluster management. Pay only for processing time.

🔄

ML Pipeline

Chain StringIndexer → OneHotEncoder → VectorAssembler into a reproducible Spark ML Pipeline. Same pattern used at enterprise scale.

✂️

Train/Val Split

80/20 random split into training and validation datasets. Output as CSV to S3 for downstream SageMaker training jobs.
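For orientation, here is a minimal sketch of that split-and-write step, assuming the pipeline's output DataFrame is named transformed_df and the S3 path is a placeholder rather than the lab's actual bucket:

```python
# 80/20 random split; the seed and bucket paths are illustrative
train_df, val_df = transformed_df.randomSplit([0.8, 0.2], seed=42)

train_df.write.mode("overwrite").csv("s3://your-bucket/lab2/output/train/")
val_df.write.mode("overwrite").csv("s3://your-bucket/lab2/output/validation/")
```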

Input Dataset: Adult Census

| Column | Type | Role in Pipeline | HCM Equivalent |
|---|---|---|---|
| age | Integer | Numeric feature (passthrough) | years_at_company |
| workclass | String | Categorical → StringIndex → OneHot | department |
| education | String | Categorical → StringIndex → OneHot | highest_degree |
| marital_status | String | Categorical → StringIndex → OneHot | employment_type |
| occupation | String | Categorical → StringIndex → OneHot | job_role |
| hours_per_week | Integer | Numeric feature (passthrough) | avg_weekly_hours |
| income | String | Target label (≤50K / >50K) | left_company (0/1) |
💡
15 columns, 999 rows: 8 categorical (need encoding) + 6 numeric (passthrough) + 1 target. This is a sample of the full Adult Census dataset (~48K rows). The PySpark pipeline handles all 8 categoricals in parallel across the cluster — at AnyCompany scale, the same code processes millions of records by simply increasing instance_count.

🔄 The PySpark ML Pipeline

Spark ML Pipelines chain transformers into a single reproducible workflow. The stages below trace how data flows through the full pipeline, one transformation at a time.

📋 StringIndexer: Converts categorical string values into numeric indices. "Private" → 0, "Self-emp" → 1, "Gov" → 2. Assigns indices by frequency (most common = 0). Applied to all 8 categorical columns in parallel.
Pipeline flow: 🔠 StringIndexer (str → int) → 🔥 OneHotEncoder (int → binary vector) → 🧩 VectorAssembler (combine all features) → ✂️ 80/20 Split (train / validation) → 💾 S3 (CSV output)

🔠 StringIndexer
Input: 8 categorical string columns
Output: 8 numeric index columns (*-index)
Method: frequency-based (most common = 0)
Example: workclass: "Private" → 0, "Self-emp" → 1, "Gov" → 2
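A minimal sketch of this stage, assuming the eight categorical column names below (the lab script's exact names may differ):

```python
from pyspark.ml.feature import StringIndexer

# Assumed list of the 8 categorical columns described above
categorical_cols = [
    "workclass", "education", "marital_status", "occupation",
    "relationship", "race", "sex", "native_country",
]

# One StringIndexer per column; "frequencyDesc" assigns index 0 to the most common value
indexers = [
    StringIndexer(inputCol=c, outputCol=f"{c}-index", stringOrderType="frequencyDesc")
    for c in categorical_cols
]
```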

🔢 Encoding Deep Dive

The core of Lab 2 is transforming categorical strings into numeric vectors that Spark ML algorithms can consume. Here's exactly what happens to each column.

Step 1: StringIndexer (String → Integer)

| Original Value | Frequency (in 999-row sample) | Index |
|---|---|---|
| Private | ~697 (most common) | 0 |
| Self-emp-not-inc | ~78 | 1 |
| Local-gov | ~67 | 2 |
| State-gov | ~42 | 3 |
| Self-emp-inc | ~36 | 4 |
| Federal-gov | ~30 | 5 |
| ? (missing) | ~49 | 6 |
⚠️
Why not use indices directly? Index 5 is NOT "better" than index 2. These are nominal categories with no order. If you feed raw indices to a model, it will incorrectly assume Federal-gov (5) > Local-gov (2). That's why we need OneHotEncoder next.

Step 2: OneHotEncoder (Integer → Binary Vector)

| Index | Category | One-Hot Vector |
|---|---|---|
| 0 | Private | [1, 0, 0, 0, 0, 0] |
| 1 | Self-emp-not-inc | [0, 1, 0, 0, 0, 0] |
| 2 | Local-gov | [0, 0, 1, 0, 0, 0] |
| 3 | State-gov | [0, 0, 0, 1, 0, 0] |
| 4 | Self-emp-inc | [0, 0, 0, 0, 1, 0] |
| 5 | Federal-gov | [0, 0, 0, 0, 0, 1] |
💡
Sparse representation: Spark stores these as sparse vectors internally — only the position of the "1" is stored. For native_country with 28 unique values in this sample, this saves massive memory vs storing 27 zeros per row. Spark's OneHotEncoder uses N-1 encoding (last category is all zeros) to avoid multicollinearity.
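A sketch of the encoder stage under the same assumed column list; dropLast=True is Spark's default and produces the N-1 behavior described above:

```python
from pyspark.ml.feature import OneHotEncoder

# Encode each index column into a sparse binary vector (N-1 encoding via dropLast)
encoder = OneHotEncoder(
    inputCols=[f"{c}-index" for c in categorical_cols],
    outputCols=[f"{c}-onehot" for c in categorical_cols],
    dropLast=True,
)
```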

Step 3: VectorAssembler (Combine Everything)

All 8 one-hot vectors + 6 numeric columns are assembled into a single feature vector per row:

Final Feature Vector Structure

[workclass_onehot(6) | education_onehot(15) | marital_onehot(6) | occupation_onehot(14) | relationship_onehot(5) | race_onehot(4) | sex_onehot(1) | country_onehot(27) | age | fnlwgt | education_num | capital_gain | capital_loss | hours_per_week]

Total dimensions: ~84 features per row (78 one-hot + 6 numeric). Spark uses N-1 encoding (drops last category to avoid multicollinearity).
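Putting the three stages together, a hedged sketch of the full pipeline; df stands in for the loaded Adult Census DataFrame, and the column lists repeat the assumptions above:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler

# Numeric passthrough columns listed in the feature-vector structure above
numeric_cols = ["age", "fnlwgt", "education_num",
                "capital_gain", "capital_loss", "hours_per_week"]

assembler = VectorAssembler(
    inputCols=[f"{c}-onehot" for c in categorical_cols] + numeric_cols,
    outputCol="features",
)

# Indexers -> encoder -> assembler as one reproducible Spark ML Pipeline
pipeline = Pipeline(stages=indexers + [encoder, assembler])
model = pipeline.fit(df)              # learns index and encoding mappings
transformed_df = model.transform(df)  # adds the assembled "features" column
```

From here, transformed_df feeds the 80/20 split and CSV output shown in the overview.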

☁️ SageMaker Processing Architecture

SageMaker Processing lets you run Spark jobs without managing infrastructure. The nodes below trace the path from notebook to S3 output.

💻 Studio Notebook: You define the PySparkProcessor in a Jupyter notebook. Specify instance type, count, Spark version, and the preprocessing script path. One API call launches the entire distributed job.
Flow: 💻 Studio Notebook → ⚙️ PySparkProcessor (2x ml.m5.xlarge) → Spark Job (pyspark_preprocessing.py) → 🪣 S3 Output (train/ + validation/)
💻 Studio Notebook
Action: Define PySparkProcessor + call .run()
SDK: sagemaker.spark.processing.PySparkProcessor
Spark Version: 3.1
Duration: ~5 minutes total (incl. provisioning)

Key Configuration Parameters

| Parameter | Lab 2 Value | Production (AnyCompany) |
|---|---|---|
| instance_count | 2 | 10-20 (for millions of records) |
| instance_type | ml.m5.xlarge | ml.m5.4xlarge or ml.r5.xlarge (memory-optimized) |
| framework_version | 3.1 | 3.3+ (latest stable) |
| max_runtime_in_seconds | 1200 (20 min) | 7200 (2 hours for large jobs) |
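A sketch of how these parameters come together in the notebook; role, the bucket variable, and the script's argument names are assumptions, not values copied from the lab:

```python
from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
    base_job_name="lab2-spark-preprocess",   # illustrative job name
    framework_version="3.1",                 # Spark version used in the lab
    role=role,                               # assumes a SageMaker execution role is already defined
    instance_count=2,                        # production: 10-20
    instance_type="ml.m5.xlarge",            # production: ml.m5.4xlarge or ml.r5.xlarge
    max_runtime_in_seconds=1200,             # 20 min; raise for larger jobs
)

spark_processor.run(
    submit_app="pyspark_preprocessing.py",   # the preprocessing script
    arguments=[                              # argument names are illustrative
        "--s3_input_bucket", bucket,
        "--s3_output_prefix", "lab2/output",
    ],
)
```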

🏢 HCM Mapping: AnyCompany Attrition Pipeline

Here's how you'd adapt Lab 2's PySpark pipeline for AnyCompany's employee attrition prediction at scale — processing data from multiple HR systems across countries.

Column-by-Column Mapping

| Lab 2 Column | AnyCompany Column | Encoding | Rationale |
|---|---|---|---|
| workclass | department | OneHot | No natural order between Engineering, Sales, Support |
| education | highest_degree | OneHot (or Ordinal) | Could argue ordinal: HS < BS < MS < PhD |
| marital_status | work_mode | OneHot | Remote, Hybrid, Office; no inherent order |
| occupation | job_role | OneHot | SDE, PM, Designer, QA; nominal categories |
| native_country | office_location | OneHot | Hyderabad, Pune, Chennai, Roseland; 10+ values |
| age | years_at_company | Numeric (passthrough) | Continuous, already numeric |
| hours_per_week | avg_weekly_hours | Numeric (passthrough) | Continuous, already numeric |
| income | left_company | Target label (0/1) | Binary classification target |
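Re-parameterizing the Lab 2 pipeline for this schema is a small change. The column lists below follow the mapping table, and handleInvalid="keep" is an added assumption so categories unseen at fit time don't fail the job:

```python
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

# Hypothetical AnyCompany column lists, following the mapping table above
anycompany_categoricals = ["department", "highest_degree", "work_mode",
                           "job_role", "office_location"]
anycompany_numerics = ["years_at_company", "avg_weekly_hours"]

indexers = [
    StringIndexer(inputCol=c, outputCol=f"{c}-index", handleInvalid="keep")
    for c in anycompany_categoricals
]
encoder = OneHotEncoder(
    inputCols=[f"{c}-index" for c in anycompany_categoricals],
    outputCols=[f"{c}-onehot" for c in anycompany_categoricals],
)
assembler = VectorAssembler(
    inputCols=[f"{c}-onehot" for c in anycompany_categoricals] + anycompany_numerics,
    outputCol="features",
)
```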

AnyCompany-Specific Additions

⚠️

Drop Leaky Features

Remove exit_interview_scheduled and resignation_notice_days — these only exist for leavers and would give the model the answer.

📊

Add Derived Features

Create months_since_promotion, salary_vs_band_median, manager_changes_2yr — these require joins across multiple source tables.

🌍

Multi-Country Handling

Normalize currencies to USD, handle country-specific leave policies, align fiscal calendars. Spark's distributed joins handle this at scale.

🔒

PII Protection

Hash or drop employee_id, name, email before training. Use SageMaker Processing's VPC config to keep data in private subnets.
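A sketch of the first two additions in PySpark; raw_df and every column name here are hypothetical, mirroring the cards above:

```python
from pyspark.sql import functions as F

# Drop leaky and PII columns before feature engineering (column names are hypothetical)
clean_df = raw_df.drop(
    "exit_interview_scheduled", "resignation_notice_days",   # leaky: only populated for leavers
    "employee_id", "name", "email",                           # PII
)

# Example derived feature: months since last promotion
# (assumes a last_promotion_date column exists after the upstream joins)
clean_df = clean_df.withColumn(
    "months_since_promotion",
    F.months_between(F.current_date(), F.col("last_promotion_date")),
)
```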

AnyCompany Production Pipeline

Lab scale: 999 rows, 2x ml.m5.xlarge, ~5 min, ~$0.50

Production scale: 500K employee records across 40+ countries, 4 source systems

Cluster: PySparkProcessor(instance_count=10, instance_type='ml.m5.4xlarge')

Schedule: Monthly trigger via SageMaker Pipelines (1st of each month)

Output: Parquet (not CSV) to S3 with 70/15/15 train/val/test split
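A sketch of that production output step, assuming the transformed DataFrame from the pipeline and illustrative S3 paths:

```python
# Three-way split and Parquet output for the production pipeline (paths are illustrative)
train_df, val_df, test_df = transformed_df.randomSplit([0.70, 0.15, 0.15], seed=42)

train_df.write.mode("overwrite").parquet("s3://anycompany-hr-ml/attrition/train/")
val_df.write.mode("overwrite").parquet("s3://anycompany-hr-ml/attrition/validation/")
test_df.write.mode("overwrite").parquet("s3://anycompany-hr-ml/attrition/test/")
```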

⚖️ Lab 1 (Data Wrangler) vs Lab 2 (PySpark)

Both labs transform the same dataset — but they represent two different approaches to data processing. Understanding when to use each is a key skill for ML engineers at AnyCompany.

Head-to-Head Comparison

| Dimension | Lab 1: Data Wrangler | Lab 2: PySpark |
|---|---|---|
| Interface | Visual GUI (drag-and-drop) | Code (Python + Spark) |
| Compute Environment | SageMaker Studio instance (single machine) | SageMaker Processing (managed Spark cluster, 2+ nodes) |
| Skill Level | Data scientists, analysts | ML engineers, developers |
| Scale | < 100K rows (single machine) | Millions+ (distributed cluster) |
| Reproducibility | Exportable flow (JSON) | Version-controlled .py scripts |
| Complex Joins | Limited | Full SQL + Spark DataFrame joins |
| CI/CD Integration | Difficult | Native (script in Git repo) |
| Cost | Studio instance time (always-on while open) | Processing job time only (auto-terminates after job) |
| Encoding Approach | Manual per-column transforms | Pipeline pattern (batch all categoricals) |
| AnyCompany Use | Prototyping, EDA, small datasets | Production pipelines, scheduled retraining |
☁️
PySpark does NOT run on your local machine. In Lab 2, the PySpark script runs on a SageMaker Processing job — a fully managed Spark cluster (2x ml.m5.xlarge instances in the lab). You write the script locally or in a Studio notebook, then PySparkProcessor.run() ships it to a provisioned cluster. The cluster auto-provisions, runs your code, writes output to S3, and auto-terminates. You never install Spark locally.

Decision Framework

🔬

Use Data Wrangler When...

Exploring a new dataset for the first time. Quick prototyping. Non-engineers need to contribute. Dataset fits in memory (<100K rows). Visual debugging of transforms.

Use PySpark When...

Production pipeline that runs on schedule. Dataset exceeds single-machine memory. Complex multi-table joins required. Need version control + CI/CD. Processing millions of payroll records monthly.

🎯
The AnyCompany pattern: Start with Data Wrangler to explore and prototype transforms on a sample. Once validated, translate the logic into a PySpark script for production. Data Wrangler can even export to PySpark code automatically.

📝 Lab 2 Summary

Managed Spark (Not Local)

SageMaker Processing provisions a Spark cluster on demand. You submit a .py script — it runs on ml.m5.xlarge instances, not your laptop. Cluster auto-terminates after job completes.

ML Pipeline Pattern

StringIndexer → OneHotEncoder → VectorAssembler. Composable, reproducible, parallelizable across the cluster.

Production Ready

Same script scales from the 999-row lab sample to tens of millions of rows in production. Just increase instance_count. Code lives in version control.