Amazon SageMaker Data Wrangler is a low-code data preparation tool that reduces the time to prepare data for machine learning from weeks to minutes. Now integrated into SageMaker Canvas, it provides both a visual interface and natural language support to import, transform, analyze, and export data without writing code.
- **Import:** Connect to 50+ data sources including Amazon S3, Athena, Redshift, Snowflake, and Databricks with point-and-click selection.
- **Transform:** 300+ built-in transforms for cleaning, encoding, scaling, and featurizing, plus custom PySpark, Pandas, and SQL.
- **Analyze:** Built-in visualizations, data quality reports, target leakage detection, and quick model evaluation to guide decisions.
- **Export:** Send prepared data to S3, SageMaker Pipelines, Feature Store, or Python code, and schedule automated processing with EventBridge.
A Data Flow is the core concept in Data Wrangler. It is a visual pipeline of sequential steps that define how your raw data gets transformed into ML-ready features. Each step modifies the dataframe and passes it to the next.
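To make the idea concrete, here is a rough pandas analogy (not Data Wrangler's actual engine): each step is a function that takes the dataframe and returns a modified copy for the next step. Column names follow the Lab 1 dataset; the values are illustrative.

```python
import pandas as pd

def drop_unused(df: pd.DataFrame) -> pd.DataFrame:
    # Step 1: drop a column the model will not use.
    return df.drop(columns=["fnlwgt"])

def strip_whitespace(df: pd.DataFrame) -> pd.DataFrame:
    # Step 2: clean string formatting before encoding.
    return df.assign(workclass=df["workclass"].str.strip())

raw = pd.DataFrame({
    "age": [39, 50],
    "fnlwgt": [77516, 83311],
    "workclass": [" State-gov", "Self-emp "],
})

# Steps run in order, each receiving the previous step's output,
# just like nodes in a Data Wrangler flow.
prepared = raw.pipe(drop_unused).pipe(strip_whitespace)
print(prepared)
```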
Data Wrangler connects to a wide range of data sources. In Lab 1, you import a CSV from S3, but in production HCM scenarios you might pull from data warehouses or lakes.
| Source | Type | HCM Example |
|---|---|---|
| Amazon S3 | Object Storage | Exported payroll CSVs, HR data extracts |
| Amazon Athena | Serverless Query | Query data lake for workforce analytics |
| Amazon Redshift | Data Warehouse | Historical compensation and benefits data |
| Snowflake | Cloud DW | Third-party benchmarking datasets |
| Databricks | Lakehouse | Unified analytics on employee engagement |
After transforming data, you can export to multiple targets depending on your ML workflow stage:
- **Amazon S3:** Direct export to S3 buckets for training jobs. Supports partitioned output (train/test/validation splits).
- **SageMaker Pipelines:** Integrate your data flow directly into an automated ML pipeline for repeatable, production-grade workflows.
- **SageMaker Feature Store:** Store engineered features for reuse across teams and models. Supports both online and offline stores.
- **Python code:** Export as a Python script or Jupyter notebook for custom integration into existing codebases.
While you work interactively on a sample of your data, Data Wrangler processes the full dataset when you export. It automatically scales using:
- **SageMaker Processing:** Managed compute instances that run your data flow as a processing job. You control the instance type and count.
- **Amazon EMR Serverless:** For large datasets, Canvas automatically uses EMR Serverless to distribute processing across multiple nodes.
Processing millions of payroll records across multiple countries? EMR Serverless auto-scales to handle the volume, while you design transforms on a manageable sample in the visual UI.
Data Wrangler provides 300+ built-in transforms organized into categories. Each transform adds a step to your data flow and modifies the dataframe sequentially.
| Category | Transforms | Lab 1 Usage |
|---|---|---|
| Manage Columns | Drop, rename, move, duplicate, concatenate columns | Drop fnlwgt, native_country; Move income to first column |
| Handle Missing | Drop missing, fill with mean/median/mode/custom, impute | Drop rows with missing occupation, workclass |
| Manage Rows | Drop duplicates, sort, filter, sample | Drop duplicate records |
| Format String | Strip whitespace, pad, lowercase, uppercase, regex | Strip left/right spaces from categorical columns |
| Search & Edit | Find and replace substring, regex replace | Replace <=50K with 1, >50K with 0 |
| Handle Outliers | Min-max numeric, standard deviation, quantile clipping | Remove capital_gain outliers (>80000 or <0) |
| Encode Categorical | Ordinal encode, one-hot encode, similarity encode | Ordinal: education, occupation; One-hot: sex, race, workclass |
| Split Data | Randomized split, stratified split | 70/20/10 train/test/validation split |
| Process Numeric | Normalize, standardize, bin, log transform | Not used in Lab 1 |
| Balance Data | Random oversample, undersample, SMOTE | Not used in Lab 1 |
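As a rough illustration, the Lab 1 steps from the table map to plain pandas like this. This is a hand-written sketch on toy rows, not the code Data Wrangler generates:

```python
import pandas as pd

# Toy rows shaped like the Lab 1 census extract (values illustrative).
df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": [" State-gov ", "Self-emp", None],
    "fnlwgt": [77516, 83311, 215646],
    "occupation": ["Adm-clerical", "Exec-managerial", "Handlers-cleaners"],
    "capital_gain": [2174, 99999, 0],
    "native_country": ["United-States"] * 3,
    "income": ["<=50K", "<=50K", ">50K"],
})

df = df.drop(columns=["fnlwgt", "native_country"])          # Manage Columns
df = df.dropna(subset=["workclass", "occupation"])          # Handle Missing
df = df.drop_duplicates()                                   # Manage Rows
df["workclass"] = df["workclass"].str.strip()               # Format String
df["income"] = df["income"].map({"<=50K": 1, ">50K": 0})    # Search & Edit
df = df[df["capital_gain"].between(0, 80000)]               # Handle Outliers
df.insert(0, "income", df.pop("income"))                    # Move target first
print(df)
```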
Encoding converts categorical text into numbers that ML algorithms can process. Choosing the right encoding prevents introducing artificial relationships.
**Ordinal encoding**
When: Categories have a natural order (education level, seniority).
How: Maps each category to an integer (0, 1, 2...).
Risk: Implies a distance between categories that may not exist.

**One-hot encoding**
When: Categories are nominal with no order (sex, department, country).
How: Creates a binary column for each category value.
Risk: High cardinality means many columns (curse of dimensionality).

**Similarity encoding**
When: High-cardinality text with typos (job titles, city names).
How: Uses string similarity to group similar values.
Risk: Computationally expensive for very large vocabularies.
HCM examples:

- Ordinal: education_level (High School < Bachelor < Master < PhD), a natural hierarchy.
- One-Hot: department (Engineering, Sales, HR, Finance), no inherent order.
- Similarity: job_title (thousands of variations like "Sr. Engineer" vs "Senior Engineer"), where similar titles should be grouped.
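A minimal pandas sketch of the first two choices, using the example columns above (hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({
    "education_level": ["High School", "Bachelor", "Master", "PhD"],
    "department": ["Engineering", "Sales", "HR", "Finance"],
})

# Ordinal: an explicit mapping preserves the natural order.
order = {"High School": 0, "Bachelor": 1, "Master": 2, "PhD": 3}
df["education_level"] = df["education_level"].map(order)

# One-hot: one binary column per department, with no implied order.
df = pd.get_dummies(df, columns=["department"])
print(df)
```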
When built-in transforms are not enough, write custom code directly in the Data Wrangler UI:
- **PySpark:** Full Apache Spark DataFrame API. Best for distributed processing on large datasets.
- **Pandas:** Familiar Python data manipulation. Runs on a single node, which is great for complex logic on sampled data.
- **SQL:** Write SQL queries against your dataframe. Ideal for analysts comfortable with SQL syntax.
- **Natural language (new in Canvas):** Describe your transform in plain English and Data Wrangler generates the code.
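For example, a custom PySpark transform might look like the following. In Data Wrangler's editor the incoming dataframe is exposed to custom code (conventionally as `df`) and the result is assigned back to it; here a local SparkSession and stand-in data are added so the sketch runs on its own. The cap reuses Lab 1's 80,000 outlier bound.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Stand-in for the dataframe Data Wrangler hands to custom code.
df = spark.createDataFrame([(2174,), (99999,)], ["capital_gain"])

# Transform body: cap extreme capital gains at the 80,000 threshold,
# then assign the result back to `df` so the next step receives it.
df = df.withColumn(
    "capital_gain",
    F.when(F.col("capital_gain") > 80000, F.lit(80000))
     .otherwise(F.col("capital_gain")),
)
df.show()
```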
Data Wrangler can automatically generate a comprehensive report that analyzes your dataset and identifies issues before you start transforming. This is the first thing you run in Lab 1.
Feature count, data types, missing percentage, duplicate rows β your dataset at a glance.
Identifies records that appear multiple times. Duplicates can bias your model toward repeated patterns.
Flags records with high probability of being outliers using statistical modeling.
Shows class distribution. Imbalanced targets need special handling (oversampling, class weights).
Trains a fast model on raw data to establish a baseline. Shows how much room for improvement exists.
Prediction power of each feature. Low-power features (like fnlwgt) can be dropped to improve efficiency.
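You can approximate the summary section's headline numbers yourself in pandas; `adult.csv` is a hypothetical local copy of the Lab 1 extract:

```python
import pandas as pd

df = pd.read_csv("adult.csv")   # hypothetical local copy of the Lab 1 data

summary = {
    "rows": len(df),
    "features": df.shape[1],
    "duplicate_rows": int(df.duplicated().sum()),
    # Percentage of missing values per column.
    "missing_pct": (df.isna().mean() * 100).round(2).to_dict(),
}
print(summary)
```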
Target leakage occurs when training data contains information that would not be available at prediction time. It leads to models that perform perfectly in training but fail in production.
Data Wrangler scores each feature independently with a cross-validated ROC AUC (area under the receiver operating characteristic curve):

| ROC AUC | Interpretation | Action |
|---|---|---|
| 0.9 - 1.0 | Possible target leakage | Investigate: likely a proxy for the target |
| 0.6 - 0.9 | Good predictive power | Keep: a valuable feature |
| ~0.5 | No predictive power (random) | Consider dropping: adds noise |
| < 0.5 | Inverse correlation | Investigate: may need transformation |
Predicting payroll fraud: If your dataset includes a "fraud_investigation_id" column, it perfectly predicts fraud (ROC = 1.0) but would never be available at prediction time. Data Wrangler flags this immediately.
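A sketch of the per-feature check, assuming a shallow model scored with cross-validated ROC AUC (Data Wrangler's internal model may differ). The synthetic `fraud_investigation_id` column is a deliberately leaky proxy:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, n)                    # binary target: fraud / not fraud

X = pd.DataFrame({
    # Leaky proxy: only populated when fraud occurred, so it encodes the target.
    "fraud_investigation_id": y * rng.integers(1000, 9999, n),
    # Uninformative feature: random noise relative to the target.
    "age": rng.integers(18, 65, n),
})

for col in X.columns:
    auc = cross_val_score(
        DecisionTreeClassifier(max_depth=3),
        X[[col]], y, cv=5, scoring="roc_auc",
    ).mean()
    print(f"{col}: ROC AUC = {auc:.2f}")   # ~1.0 flags leakage, ~0.5 is noise
```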
Data Wrangler provides built-in chart types for exploratory data analysis:
- **Histogram:** Distribution of values. Supports color-by and facet-by for multi-dimensional analysis (used in Lab 1 for income by race/sex).
- **Scatter plot:** Relationship between two numeric features. Reveals correlations, clusters, and outliers visually.
- **Box plot:** Shows the median, quartiles, and outliers. Great for comparing distributions across categories.
- **Correlation matrix:** Heatmap of feature correlations. Identifies multicollinearity that can destabilize models.
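For instance, the Lab 1 histogram with color-by could be approximated in pandas/matplotlib like this (`adult.csv` is again a hypothetical local extract):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("adult.csv")   # hypothetical local copy of the Lab 1 data

# One overlaid histogram per income class, roughly what the built-in
# histogram produces when you set "color by" to income.
for label, group in df.groupby("income"):
    plt.hist(group["age"], bins=30, alpha=0.5, label=str(label))
plt.xlabel("age")
plt.legend(title="income")
plt.show()
```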
In Lab 1, you prepare demographic data to predict whether an individual earns $50K or less per year, flowing it through a pipeline of 10 transforms.
Here is a simplified view of how the data changes through the pipeline:
Before (raw input):

| age | workclass | fnlwgt | education | occupation | sex | capital_gain | native_country | income |
|---|---|---|---|---|---|---|---|---|
| 39 | State-gov | 77516 | Bachelors | Adm-clerical | Male | 2174 | United-States | <=50K |
| 50 | Self-emp | 83311 | Bachelors | Exec-managerial | Male | 99999 | United-States | <=50K |

After (transformed output):

| income | age | education | occupation | capital_gain | sex_Male | sex_Female | workclass_State-gov |
|---|---|---|---|---|---|---|---|
| 1 | 39 | 9 | 0 | 2174 | 1 | 0 | 1 |

The second row does not survive the pipeline: its capital_gain of 99999 exceeds the 80,000 outlier threshold, so the row is removed.
Data Wrangler is not just for lab exercises. Here is how AnyCompany teams would use it across real workforce analytics and payroll ML projects.
**Payroll anomaly detection.** Detect ghost employees, duplicate payments, and unusual payroll spikes across millions of transactions per cycle.

Flow: Import payroll data from Redshift → Handle missing employee IDs → Encode pay frequency → Create deviation features → Export to training.

Key transforms: Handle outliers (flag payments >3 std dev), one-hot encode pay_type, and create rolling average features via custom PySpark (see the sketch below).

This scenario connects to Module 5 (Choosing a Modeling Approach: Isolation Forest for anomaly detection) and Module 7 (Evaluation: optimizing recall to catch all fraud).
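A sketch of the rolling-average feature in PySpark, with hypothetical employee_id / pay_period / net_pay columns:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Hypothetical payroll rows: (employee_id, pay_period, net_pay).
pay = spark.createDataFrame(
    [("e1", 1, 5000.0), ("e1", 2, 5100.0), ("e1", 3, 19000.0)],
    ["employee_id", "pay_period", "net_pay"],
)

# Rolling 3-period average per employee; a large deviation from it is
# a candidate anomaly feature for the fraud model.
w = Window.partitionBy("employee_id").orderBy("pay_period").rowsBetween(-2, 0)
pay = pay.withColumn("net_pay_avg_3", F.avg("net_pay").over(w))
pay = pay.withColumn("deviation", F.col("net_pay") - F.col("net_pay_avg_3"))
pay.show()
```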
**Employee attrition prediction.** Predict which employees are likely to leave within 6 months so HR can intervene with retention strategies.

Flow: Import HR records from S3 → Join with performance data → Encode tenure bands → Balance classes (SMOTE) → Split and export.

Key transforms: Ordinal encode satisfaction_level, one-hot encode department, create tenure_months from hire_date, and balance with SMOTE, since attrition is rare (see the sketch below).

This scenario connects to Module 6 (Training: XGBoost binary classification) and Module 8 (Deployment: real-time endpoint for HR dashboard integration).
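A sketch of the SMOTE balancing step with imbalanced-learn, on synthetic stand-in features:

```python
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                 # stand-in HR features
y = (rng.random(1000) < 0.05).astype(int)      # attrition is rare (~5%)

# SMOTE synthesizes new minority samples until the classes are balanced.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(f"positives before: {y.mean():.1%}, after: {y_res.mean():.1%}")
```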
**Salary benchmarking.** Predict a fair market salary for any role/location/experience combination to power AnyCompany DataCloud benchmarking insights.

Flow: Import compensation data from Athena → Normalize salary to USD → Log-transform salary (right-skewed) → Encode job_family → Feature selection.

Key transforms: Process numeric (log transform on salary, sketched below), ordinal encode experience_band, and drop low-importance features identified by the Data Quality Report.

This scenario connects to Module 4 (Feature Engineering: normalization and log transforms) and Module 7 (Evaluation: RMSE in dollars for regression).
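A sketch of why the log transform matters for evaluation: fit on log1p(salary), then invert with expm1 so RMSE is reported in dollars. The "model" here is a stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed salaries in USD.
salaries = rng.lognormal(mean=11.0, sigma=0.5, size=1000)

# Model the target on a log scale, where the distribution is more symmetric.
y_log = np.log1p(salaries)
pred_log = y_log + rng.normal(0.0, 0.1, size=1000)   # stand-in for model output

# Invert with expm1 so the error metric reads in dollars (Module 7).
pred_usd = np.expm1(pred_log)
rmse_usd = np.sqrt(np.mean((pred_usd - salaries) ** 2))
print(f"RMSE: ${rmse_usd:,.0f}")
```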
- **Run the Data Quality Report first.** It reveals duplicates, missing values, and feature importance before you write any transforms.
- **Sample interactively, scale on export.** Working on sampled data keeps the feedback loop fast; Data Wrangler applies the flow to the full dataset when you export.
- **Choose encodings deliberately.** Use ordinal for ordered categories and one-hot for nominal ones; the wrong encoding introduces false relationships into your model.
- **Automate production prep.** Export to SageMaker Pipelines or schedule with EventBridge; manual data prep does not scale to weekly payroll cycles.