Amazon SageMaker Data Wrangler

Low-code data preparation for ML β€” import, transform, analyze, and export datasets without writing code. Now integrated into SageMaker Canvas with natural language support.

πŸ”¬ Lab 1 Service πŸ“Š Data Preparation 🎯 Interactive ⚑ No-Code / Low-Code

πŸ”¬ What is SageMaker Data Wrangler?

Amazon SageMaker Data Wrangler is a low-code data preparation tool that reduces the time to prepare data for machine learning from weeks to minutes. Now integrated into SageMaker Canvas, it provides both a visual interface and natural language support to import, transform, analyze, and export data without writing code.

πŸ’‘
Key Insight: Data scientists spend over 66% of their time on data management, EDA, feature selection, and feature engineering. Data Wrangler automates the repetitive parts so you can focus on model design.

πŸ—οΈ Core Capabilities

πŸ“₯

Import

Connect to 50+ data sources including Amazon S3, Athena, Redshift, Snowflake, and Databricks with point-and-click selection.

πŸ”„

Transform

300+ built-in transforms for cleaning, encoding, scaling, and featurizing. Custom PySpark, Pandas, and SQL also supported.

πŸ“Š

Analyze

Built-in visualizations, data quality reports, target leakage detection, and quick model evaluation to guide decisions.

πŸ“€

Export

Export to S3, SageMaker Pipelines, Feature Store, or Python code. Schedule automated processing with EventBridge.

πŸ—ΊοΈ Where It Fits in the ML Lifecycle

πŸ’Ύ Raw Data (S3, Redshift) β†’ πŸ”¬ Data Wrangler (Prepare) β†’ πŸ‹οΈ Training (SageMaker) β†’ πŸš€ Deploy (Endpoint) β†’ πŸ“ˆ Monitor (Model Monitor)
🎯
Course Connection: Data Wrangler spans Module 3 (Data Processing) and Module 4 (Feature Engineering). It is the primary tool in Lab 1.

πŸ”€ Understanding Data Flows

A Data Flow is the core concept in Data Wrangler. It is a visual pipeline of sequential steps that define how your raw data gets transformed into ML-ready features. Each step modifies the dataframe and passes it to the next.
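Conceptually, a data flow behaves like a chain of dataframe-to-dataframe functions: each step receives the output of the previous one. A minimal sketch in pandas (the column names are illustrative, borrowed from the Lab 1 dataset; this mirrors the idea, not Data Wrangler's internals):

```python
import pandas as pd

# Hypothetical raw HR extract standing in for an S3 import.
df = pd.DataFrame({
    "age": [39, 50],
    "occupation": [" Adm-clerical", "Exec-managerial "],
    "income": ["<=50K", ">50K"],
})

# Each function mirrors one step in a data flow: it takes a
# dataframe and returns the modified dataframe for the next step.
def strip_strings(df):
    df = df.copy()
    df["occupation"] = df["occupation"].str.strip()
    return df

def binarize_target(df):
    df = df.copy()
    df["income"] = df["income"].map({"<=50K": 1, ">50K": 0})
    return df

# Chaining with .pipe() reproduces the sequential flow.
prepared = df.pipe(strip_strings).pipe(binarize_target)
```

Because every step is a pure dataframe transformation, you can reorder, remove, or insert steps without touching the others β€” the same property the visual flow gives you.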

πŸ“‚ Source (S3 / Athena) β†’ πŸ” Analyze (Quality Report) β†’ 🧹 Clean (Drop / Fill) β†’ βš™οΈ Transform (Encode / Scale) β†’ βœ‚οΈ Split (Train/Test/Val) β†’ πŸ“€ Export (S3 Destination)

πŸ“₯ Data Sources

Data Wrangler connects to a wide range of data sources. In Lab 1, you import a CSV from S3, but in production HCM scenarios you might pull from data warehouses or lakes.

| Source | Type | HCM Example |
|---|---|---|
| Amazon S3 | Object Storage | Exported payroll CSVs, HR data extracts |
| Amazon Athena | Serverless Query | Query data lake for workforce analytics |
| Amazon Redshift | Data Warehouse | Historical compensation and benefits data |
| Snowflake | Cloud DW | Third-party benchmarking datasets |
| Databricks | Lakehouse | Unified analytics on employee engagement |

πŸ“€ Export Destinations

After transforming data, you can export to multiple targets depending on your ML workflow stage:

πŸͺ£

Amazon S3

Direct export to S3 buckets for training jobs. Supports partitioned output (train/test/validation splits).

πŸ”—

SageMaker Pipelines

Integrate your data flow directly into an automated ML pipeline for repeatable, production-grade workflows.

πŸͺ

Feature Store

Store engineered features for reuse across teams and models. Supports both online and offline stores.

🐍

Python Code

Export as a Python script or Jupyter notebook for custom integration into existing codebases.

⏰
Automation: You can schedule data flows with Amazon EventBridge to automatically process new data on a recurring basis β€” ideal for weekly payroll cycles or monthly workforce reports.

⚑ Processing at Scale

While you work interactively on a sample of your data, Data Wrangler processes the full dataset when you export. It automatically scales using:

πŸ–₯️

SageMaker Processing

Managed compute instances that run your data flow as a processing job. You control instance type and count.

🌐

EMR Serverless

For large datasets, Canvas automatically uses EMR Serverless to distribute processing across multiple nodes.

AnyCompany Scale Example

Processing millions of payroll records across multiple countries? EMR Serverless auto-scales to handle the volume, while you design transforms on a manageable sample in the visual UI.

βš™οΈ Built-in Transform Categories

Data Wrangler provides 300+ built-in transforms organized into categories. Each transform adds a step to your data flow and modifies the dataframe sequentially.

| Category | Transforms | Lab 1 Usage |
|---|---|---|
| Manage Columns | Drop, rename, move, duplicate, concatenate columns | Drop fnlwgt, native_country; move income to first column |
| Handle Missing | Drop missing, fill with mean/median/mode/custom, impute | Drop rows with missing occupation, workclass |
| Manage Rows | Drop duplicates, sort, filter, sample | Drop duplicate records |
| Format String | Strip whitespace, pad, lowercase, uppercase, regex | Strip left/right spaces from categorical columns |
| Search & Edit | Find and replace substring, regex replace | Replace <=50K with 1, >50K with 0 |
| Handle Outliers | Min-max numeric, standard deviation, quantile clipping | Remove capital_gain outliers (>80000 or <0) |
| Encode Categorical | Ordinal encode, one-hot encode, similarity encode | Ordinal: education, occupation; one-hot: sex, race, workclass |
| Split Data | Randomized split, stratified split | 70/20/10 train/test/validation split |
| Process Numeric | Normalize, standardize, bin, log transform | β€” |
| Balance Data | Random oversample, undersample, SMOTE | β€” |
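Several of these categories map directly onto one-line pandas operations. A sketch of the Lab 1 cleaning steps on a tiny made-up dataframe (not the real dataset; the built-in transforms generate equivalent Spark code for you):

```python
import pandas as pd

# Illustrative rows; column names follow the Lab 1 dataset.
df = pd.DataFrame({
    "fnlwgt": [77516, 83311, 83311],
    "workclass": ["State-gov", None, None],
    "capital_gain": [2174, 99999, 99999],
})

df = df.drop(columns=["fnlwgt"])               # Manage Columns: drop
df = df.dropna(subset=["workclass"])           # Handle Missing: drop rows
df = df.drop_duplicates()                      # Manage Rows: drop duplicates
df = df[df["capital_gain"].between(0, 80000)]  # Handle Outliers: min-max
```

Only the clean first row survives: the rows with a missing workclass are dropped, and the 99999 capital gain would have been clipped by the outlier rule regardless.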

πŸ”’ Encoding Deep Dive

Encoding converts categorical text into numbers that ML algorithms can process. Choosing the right encoding prevents introducing artificial relationships.

πŸ“

Ordinal Encoding

When: Categories have a natural order (education level, seniority).
How: Maps each category to an integer (0, 1, 2...).
Risk: Implies distance between categories that may not exist.

🎯

One-Hot Encoding

When: Categories are nominal with no order (sex, department, country).
How: Creates a binary column for each category value.
Risk: High cardinality = many columns (curse of dimensionality).

🧬

Similarity Encoding

When: High-cardinality text with typos (job titles, city names).
How: Uses string similarity to group similar values.
Risk: Computationally expensive for very large vocabularies.

AnyCompany Encoding Decisions

Ordinal: education_level (High School < Bachelor < Master < PhD) β€” natural hierarchy.
One-Hot: department (Engineering, Sales, HR, Finance) β€” no inherent order.
Similarity: job_title (thousands of variations like "Sr. Engineer" vs "Senior Engineer") β€” group similar titles.
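The ordinal and one-hot choices above are easy to see in code. A sketch in pandas, using the hypothetical AnyCompany columns (an explicit mapping for the ordered category, get_dummies for the nominal one):

```python
import pandas as pd

df = pd.DataFrame({
    "education_level": ["High School", "Master", "Bachelor"],
    "department": ["Sales", "HR", "Engineering"],
})

# Ordinal: an explicit mapping preserves the natural hierarchy.
order = {"High School": 0, "Bachelor": 1, "Master": 2, "PhD": 3}
df["education_level"] = df["education_level"].map(order)

# One-hot: one binary column per department, no implied order.
df = pd.get_dummies(df, columns=["department"])
```

If you had instead ordinal-encoded department, the model would treat Sales as "twice" HR β€” exactly the artificial relationship the card above warns about.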

🐍 Custom Transforms

When built-in transforms are not enough, write custom code directly in the Data Wrangler UI:

⚑

PySpark

Full Apache Spark DataFrame API. Best for distributed processing on large datasets.

🐼

Pandas

Familiar Python data manipulation. Runs on a single node β€” great for complex logic on sampled data.

πŸ—ƒοΈ

PySpark SQL

Write SQL queries against your dataframe. Ideal for analysts comfortable with SQL syntax.

πŸ’¬

Natural Language

New in Canvas: describe your transform in plain English and Data Wrangler generates the code.

πŸ“Š Data Quality & Insights Report

Data Wrangler can automatically generate a comprehensive report that analyzes your dataset and identifies issues before you start transforming. This is the first thing you run in Lab 1.

πŸ“‹

Summary

Feature count, data types, missing percentage, duplicate rows β€” your dataset at a glance.

πŸ”

Duplicate Rows

Identifies records that appear multiple times. Duplicates can bias your model toward repeated patterns.

⚠️

Anomalous Samples

Flags records with high probability of being outliers using statistical modeling.

🎯

Target Column

Shows class distribution. Imbalanced targets need special handling (oversampling, class weights).

⚑

Quick Model

Trains a fast model on raw data to establish a baseline. Shows how much room for improvement exists.

πŸ“ˆ

Feature Summary

Prediction power of each feature. Low-power features (like fnlwgt) can be dropped to improve efficiency.

🎯 Target Leakage Detection

Target leakage occurs when training data contains information that would not be available at prediction time. It leads to models that perform perfectly in training but fail in production.

⚠️
Example: Including "termination_date" when predicting employee attrition. The date only exists AFTER the employee leaves β€” it leaks the answer into the features.

Data Wrangler calculates a ROC AUC score (area under the Receiver Operating Characteristic curve) for each feature independently via cross-validation:

| ROC Score | Interpretation | Action |
|---|---|---|
| 0.9 – 1.0 | Possible target leakage | Investigate β€” likely a proxy for the target |
| 0.6 – 0.9 | Good predictive power | Keep β€” valuable feature |
| ~0.5 | No predictive power (random) | Consider dropping β€” adds noise |
| < 0.5 | Inverse correlation | Investigate β€” may need transformation |

AnyCompany Target Leakage Scenario

Predicting payroll fraud: If your dataset includes a "fraud_investigation_id" column, it perfectly predicts fraud (ROC = 1.0) but would never be available at prediction time. Data Wrangler flags this immediately.
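The per-feature score is straightforward to approximate with scikit-learn: fit a single-feature model and measure cross-validated ROC AUC. A sketch on synthetic data (this mirrors the idea, not Data Wrangler's exact implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)

# A leaky feature (a copy of the target) vs. a pure-noise feature.
leaky = y.reshape(-1, 1).astype(float)
noise = rng.normal(size=(200, 1))

def feature_auc(x, y):
    # Cross-validated ROC AUC of a model trained on one feature.
    return cross_val_score(LogisticRegression(), x, y,
                           cv=5, scoring="roc_auc").mean()

# The leaky feature scores ~1.0 (flag for leakage);
# the noise feature hovers near 0.5 (no signal).
```

A feature that single-handedly achieves near-perfect AUC is almost always a proxy for the label, like fraud_investigation_id above.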

πŸ“‰ Visualization Types

Data Wrangler provides built-in chart types for exploratory data analysis:

πŸ“Š

Histogram

Distribution of values. Supports color-by and facet-by for multi-dimensional analysis (used in Lab 1 for income by race/sex).

πŸ”΅

Scatter Plot

Relationship between two numeric features. Reveals correlations, clusters, and outliers visually.

πŸ“¦

Box Plot

Shows median, quartiles, and outliers. Great for comparing distributions across categories.

πŸ”₯

Correlation Matrix

Heatmap of feature correlations. Identifies multicollinearity that can destabilize models.

πŸ§ͺ Lab 1 Walkthrough: Income Prediction Data Prep

In Lab 1, you prepare demographic data to predict whether an individual earns $50K or less per year. The pipeline runs through ten steps, from import to export.

πŸŽ™οΈ Click any step in the pipeline below to learn what it does, or press Auto-play to watch the full flow.
1. πŸ“₯ Import
2. πŸ“‹ Quality Report
3. 🎯 Leakage Check
4. πŸ—‘οΈ Drop Columns
5. 🧹 Clean Missing
6. βœ‚οΈ Strip Strings
7. πŸ”„ Target β†’ Binary
8. πŸ“‰ Outliers
9. πŸ”’ Encode
10. πŸ“€ Split & Export


πŸ“‹ Before & After: Data Transformation

Here is a simplified view of how the data changes through the pipeline:

Before (Raw)

| age | workclass | fnlwgt | education | occupation | sex | capital_gain | native_country | income |
|---|---|---|---|---|---|---|---|---|
| 39 | State-gov | 77516 | Bachelors | Adm-clerical | Male | 2174 | United-States | <=50K |
| 50 | Self-emp | 83311 | Bachelors | Exec-managerial | Male | 99999 | United-States | <=50K |

After (Transformed)

| income | age | education | occupation | capital_gain | sex_Male | sex_Female | workclass_State-gov |
|---|---|---|---|---|---|---|---|
| 1 | 39 | 9 | 0 | 2174 | 1 | 0 | 1 |
Row removed β€” capital_gain outlier (99999 > 80000)
πŸ”‘
Key changes: Target column moved to position 1 (XGBoost requirement), categorical text β†’ numbers, outliers removed, uninformative columns dropped, whitespace cleaned.
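The final pipeline step, the 70/20/10 split, can be reproduced outside Data Wrangler with two passes of scikit-learn's train_test_split. A sketch on synthetic data (the stratify argument preserves the class balance in every partition):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 750 + [1] * 250)  # imbalanced target

# Pass 1: carve off 30% for test + validation.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Pass 2: split the 30% remainder 2:1 into test and validation.
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=1/3, stratify=y_rest, random_state=42)
```

This yields 700/200/100 rows, and each partition keeps the original 3:1 class ratio.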

🏒 AnyCompany HCM Use Cases for Data Wrangler

Data Wrangler is not just for lab exercises. Here is how AnyCompany teams would use it across real workforce analytics and payroll ML projects.

πŸ’° Scenario 1: Payroll Anomaly Detection

πŸ”

The Problem

Detect ghost employees, duplicate payments, and unusual payroll spikes across millions of transactions per cycle.

βš™οΈ

Data Wrangler Steps

Import payroll data from Redshift β†’ Handle missing employee IDs β†’ Encode pay frequency β†’ Create deviation features β†’ Export to training.

🎯

Key Transforms

Handle outliers (flag payments >3 std dev), one-hot encode pay_type, create rolling average features via custom PySpark.

Module Connection

This scenario connects to Module 5 (Choosing a Modeling Approach β€” Isolation Forest for anomaly detection) and Module 7 (Evaluation β€” optimizing recall to catch all fraud).

πŸšͺ Scenario 2: Employee Attrition Prediction

πŸ“Š

The Problem

Predict which employees are likely to leave within 6 months so HR can intervene with retention strategies.

βš™οΈ

Data Wrangler Steps

Import HR records from S3 β†’ Join with performance data β†’ Encode tenure bands β†’ Balance classes (SMOTE) β†’ Split and export.

🎯

Key Transforms

Ordinal encode satisfaction_level, one-hot encode department, create tenure_months from hire_date, balance with SMOTE (attrition is rare).

Module Connection

This scenario connects to Module 6 (Training β€” XGBoost binary classification) and Module 8 (Deployment β€” real-time endpoint for HR dashboard integration).
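SMOTE synthesizes new minority samples rather than duplicating existing ones; its simpler sibling, random oversampling (also a built-in Balance Data transform), is easy to sketch with scikit-learn. Synthetic data below β€” attrition labels and feature values are made up:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)  # attrition is rare (10%)

# Resample the minority class with replacement up to majority size.
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=90, random_state=42)

X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])
```

Balance only the training split, never test or validation β€” otherwise evaluation metrics no longer reflect the real class distribution.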

πŸ’΅ Scenario 3: Salary Benchmarking

πŸ“ˆ

The Problem

Predict fair market salary for any role/location/experience combination to power AnyCompany DataCloud benchmarking insights.

βš™οΈ

Data Wrangler Steps

Import compensation data from Athena β†’ Normalize salary to USD β†’ Log-transform salary (right-skewed) β†’ Encode job_family β†’ Feature selection.

🎯

Key Transforms

Process numeric (log transform on salary), ordinal encode experience_band, drop low-importance features identified by Data Quality Report.

Module Connection

This scenario connects to Module 4 (Feature Engineering β€” normalization and log transforms) and Module 7 (Evaluation β€” RMSE in dollars for regression).
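The log transform used above is a one-liner in NumPy. A sketch on made-up salary values β€” log1p compresses the long right tail, and expm1 inverts the transform to get predictions back in dollars:

```python
import numpy as np

salaries = np.array([45_000, 60_000, 250_000, 1_200_000], dtype=float)

# log1p = log(1 + x): compresses the right-skewed tail so the
# regression target is closer to normally distributed.
log_salary = np.log1p(salaries)

# Model predictions come back in log space; expm1 inverts exactly.
recovered = np.expm1(log_salary)
```

Evaluating RMSE in dollars (as Module 7 does) requires applying expm1 to predictions first; RMSE computed in log space is not comparable.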

πŸ”‘ Key Takeaways

1️⃣

Start with Analysis

Always run the Data Quality Report first. It reveals duplicates, missing values, and feature importance before you write any transforms.

2️⃣

Iterate on Samples

Work on sampled data interactively, then scale to full dataset on export. This keeps the feedback loop fast.

3️⃣

Choose Encoding Wisely

Ordinal for ordered categories, one-hot for nominal. Wrong encoding introduces false relationships into your model.

4️⃣

Automate for Production

Export to SageMaker Pipelines or schedule with EventBridge. Manual data prep does not scale to weekly payroll cycles.