Module 4 - Interactive Explainer

Data Transformation & Feature Engineering

Clean messy data, encode categories, scale numbers, select features, and leverage AWS services to build ML-ready datasets at enterprise scale.

🧹 Data Cleaning ⚡ Interactive 🏢 HCM Context 🧪 Labs 1-2

🧹 Data Cleaning Techniques

Raw data is messy. One of the first steps after collection and storage is cleaning — a time-consuming but essential process for training effective ML models. At AnyCompany, payroll data spans multiple countries with different formats, currencies, and rules. Poor quality data leads to inaccurate models, biased predictions, and wasted resources.

💡
Note: This module builds on Module 3 (data ingestion and exploration). You've already collected, ingested, and stored your data. Now you transform it into a clean, encoded, and engineered dataset ready for model training.

The Cleaning Pipeline

Click any node to explore that cleaning stage, or auto-play to walk through the full flow:

📋 Profile Data: Scan every column for null rates, unique counts, data types, and value distributions. At AnyCompany: check 50+ columns across payroll, HR, and time-tracking tables. This reveals the scope of cleaning needed.
🔍 Profile (Scan & Assess) → 📏 Standardize (Fix Formats) → 🧹 Deduplicate (Remove Copies) → Impute (Handle Missing) → 📈 Outliers (Detect & Handle)

Three Categories of Data Issues

Incorrect & Duplicate Data

Typos, format inconsistencies (50K vs $50,000 vs 50000), language differences, copies from multiple source systems. Fix formatting or deduplicate.

Missing Data

Null values, blank fields, incomplete records. Determine if missing at random (MAR), completely at random (MCAR), or not at random (MNAR). Then impute or drop.

📈

Outliers & Inconsistencies

Values that deviate from normal behavior. A salary of $25M could be a data entry error or a legitimate CEO record. Investigate before removing.

🔧 Techniques by Issue Type

| Issue | Examples | Techniques | HCM Context |
| --- | --- | --- | --- |
| Incorrect | 50K vs $50,000; NYC vs New York | Standardize formats, fix typos, validate against rules | Income formats vary across countries (USD, INR, EUR) |
| Duplicate | Same employee from two HR systems | Deduplication by key fields (employee ID + date) | Mergers create duplicate records across legacy systems |
| Missing (MAR) | Performance score blank for new hires | Impute with group median, flag as missing | New employees have no performance history — missingness depends on observed data (hire date) |
| Missing (MCAR) | Random blank fields from system glitch | Safe to drop rows or impute with mean/median | Software bug randomly blanks fields — unrelated to any data value |
| Missing (MNAR) | Salary blank for high earners who skip surveys | Domain expert input, separate model for group | People with high salaries specifically avoid answering — missingness depends on the missing value itself |
| Outliers | Age = 154, Salary = $25M | Remove if error, keep if legitimate (with flag) | Executive compensation is legitimately extreme — do not auto-remove |
Outliers shift your mean. With outliers, the mean gets pulled toward extreme values while the median stays stable. Always compare mean vs median — a large gap signals outlier influence. Consider whether to remove, cap, or keep outliers based on domain knowledge.
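A minimal sketch of that effect in plain Python, using hypothetical AnyCompany-style salaries (the values are illustrative, not from the dataset):

```python
# Sketch: how a single outlier shifts the mean but not the median.
from statistics import mean, median

salaries = [48_000, 52_000, 55_000, 61_000, 65_000]
with_outlier = salaries + [25_000_000]  # e.g. a legitimate CEO record

print("clean:   mean =", mean(salaries), " median =", median(salaries))
print("outlier: mean =", mean(with_outlier), " median =", median(with_outlier))
# The mean jumps from 56,200 to 4,213,500; the median barely
# moves (55,000 -> 58,000). A large mean-median gap flags outliers.
```

Comparing the two lines is a quick profiling check before deciding whether to remove, cap, or keep the extreme values.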

🏷 Categorical Feature Encoding

ML algorithms work with numbers, not text. Categorical features (department names, education levels, states) must be converted to numeric representations. The encoding method depends on the type of category — binary, ordinal, or nominal. Getting this wrong can mislead your model into seeing relationships that don't exist.

Pitfall example: Encoding days of the week as [Monday=1, Tuesday=2, ... Sunday=7] implies Sunday is "7 times Monday" — which makes no sense. One-hot encoding avoids this false ordinal relationship. Always consider whether your categories have a meaningful order before choosing ordinal encoding.
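The pitfall can be sketched in a few lines of plain Python — each day gets its own 0/1 column, so no day is "larger" than another:

```python
# Sketch: one-hot encoding weekdays by hand, avoiding the false
# ordinal relationship that Monday=1 ... Sunday=7 would imply.
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def one_hot(day):
    """Return a 7-element vector with a single 1 at the day's position."""
    return [1 if d == day else 0 for d in DAYS]

print(one_hot("Sun"))  # [0, 0, 0, 0, 0, 0, 1] -- no ordering implied
print(one_hot("Mon"))  # [1, 0, 0, 0, 0, 0, 0]
```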

Encoding Methods Explorer

Click any encoding type to see how it transforms AnyCompany data:

📋 Binary Encoding: Two options only — map to 0 and 1. Examples: Active subscription (Yes=1, No=0), Employee status (Active=1, Terminated=0), Fraud flag (Yes=1, No=0). Simplest encoding, no information loss.
Raw category: "Engineering", "Sales", "HR". Methods: 🔘 Binary (0/1, two classes only) · 📊 Ordinal (ranked integers) · 🎨 One-Hot (binary column per category) · 🔢 Label (arbitrary integers)
🔄 Encoding Output
Input: Active_subscription: Yes / No
Method: Binary (0/1)
Output: Yes → 1, No → 0
Best for: Two-class features, simple toggle flags
Watch out: Only works for exactly 2 categories

When to Use Each Method

| Method | How It Works | Best For | Watch Out |
| --- | --- | --- | --- |
| Binary (0/1) | Map Yes=1, No=0 | Two-class features | Simple, no issues |
| Ordinal (integers) | Map ordered categories to 1, 2, 3... | Ranked categories (education, rating) | Implies equal spacing between levels |
| One-Hot Encoding | Create binary column per category | Nominal with few categories (<20) | High cardinality explodes dimensions |
| Label Encoding | Assign arbitrary integer per category | Tree-based models (XGBoost, RF) | Linear models may interpret as ordinal |
💡
Not all ML algorithms require encoding. Decision trees and random forests can often handle categorical variables directly. Neural networks and linear models always need numeric encoding. One-hot encoding with high cardinality (1000+ cities) can reduce training efficiency — consider target encoding or embeddings instead.
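The ordinal vs. label distinction from the table can be sketched in plain Python. The education ranking below is a hypothetical AnyCompany mapping, not from the source data:

```python
# Sketch: ordinal vs. label encoding. Education has a real order;
# Department does not, so its integers are arbitrary.
EDUCATION_ORDER = {"HighSchool": 1, "Bachelors": 2, "Masters": 3, "PhD": 4}

def encode_ordinal(values, order):
    # Explicit mapping preserves the ranking the model should see.
    return [order[v] for v in values]

def encode_label(values):
    # Arbitrary integers in first-seen order -- fine for tree models,
    # misleading for linear models (implies a fake ordering).
    mapping = {}
    return [mapping.setdefault(v, len(mapping)) for v in values]

print(encode_ordinal(["Bachelors", "PhD"], EDUCATION_ORDER))   # [2, 4]
print(encode_label(["Sales", "HR", "Sales", "Engineering"]))   # [0, 1, 0, 2]
```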

📐 Numeric Feature Transformations

Numeric features often present challenges that confuse ML algorithms: features with vastly different scales can dominate training, extreme values disproportionately influence model behavior, and time-based features require special handling. Transformations bring features to comparable scales and reduce the impact of problematic data characteristics.

Scaling Technique Explorer

Click a technique to see how it transforms salary data:

📋 Normalization (Min-Max): Scales values to a fixed range (0 to 1). Formula: (x - min) / (max - min). Sensitive to outliers since min/max define the range. Best for neural networks and distance-based algorithms.
Raw salary $15K … $65K … $2M → normalized 0.0 … 0.025 … 1.0. Techniques: 📏 Normalization (Min-Max, 0 to 1) · 📊 Standardization (Z-Score, mean = 0) · 📈 Log Transform (compress skew)

Comparison Table

| Technique | Range | Outlier Impact | Best For |
| --- | --- | --- | --- |
| Normalization | 0 to 1 | High — outliers compress other values | Neural networks, distance-based algorithms (KNN, SVM) |
| Standardization | Centered at 0 | Reduced — uses mean/std, not min/max | Linear regression, logistic regression, PCA |
| Log Transform | Compressed | Greatly reduced — compresses extremes | Right-skewed data (income, transaction amounts) |
| Binning | Discrete buckets | Eliminated — values grouped into ranges | Non-linear relationships, interpretable categories |
🎯
Rule of thumb: Use standardization as your default for most algorithms. Use normalization for neural networks. Use log transform for heavily skewed data (income, transaction amounts). Use binning when you need interpretable categories.
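The three formulas behind the table, sketched in plain Python on the $15K/$65K/$2M salary example from the explorer above:

```python
# Sketch: min-max normalization, z-score standardization, log transform.
import math

def min_max(xs):
    # (x - min) / (max - min): sensitive to outliers, range [0, 1]
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    # (x - mean) / std: centers at 0, less sensitive to extremes
    m = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return [(x - m) / sd for x in xs]

def log_transform(xs):
    # log1p handles zeros safely; compresses right-skewed values
    return [math.log1p(x) for x in xs]

salaries = [15_000, 65_000, 2_000_000]
print(min_max(salaries))  # [0.0, ~0.025, 1.0] -- the $2M outlier
                          # squashes the other values toward 0
```

Note how the $2M record dominates the min-max range, exactly the outlier sensitivity the comparison table warns about.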

📦 Binning

Convert continuous values into discrete buckets. Reduces noise and handles non-linear relationships.

Salary Binning Example

Low: < $50K (entry-level, interns)

Medium: $50K — $100K (mid-career professionals)

High: $100K — $200K (senior engineers, managers)

Executive: > $200K (directors, VPs, C-suite)
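The bucket boundaries above translate directly into a small binning function; a sketch in plain Python:

```python
# Sketch: bin continuous salaries into the four bands listed above.
def salary_band(salary):
    if salary < 50_000:
        return "Low"          # entry-level, interns
    if salary < 100_000:
        return "Medium"       # mid-career professionals
    if salary <= 200_000:
        return "High"         # senior engineers, managers
    return "Executive"        # directors, VPs, C-suite

print([salary_band(s) for s in (42_000, 75_000, 150_000, 300_000)])
# ['Low', 'Medium', 'High', 'Executive']
```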

Feature Selection Techniques

More features is not always better. Irrelevant, redundant, or noisy features hurt model performance, increase training time, and reduce interpretability. Feature selection finds the signal in the noise — choosing the most relevant features that help your ML algorithm efficiently recognize patterns in the dataset.

When to Remove Features

| Problem | Example | Action |
| --- | --- | --- |
| All unique values | Phone numbers, employee IDs | Remove — no predictive pattern possible |
| All identical values | Country = "USA" for all rows | Remove — zero variance, no information |
| Sparse data | Optional fields with 95% nulls | Remove or engineer a "has_value" binary flag |
| Sensitive data | SSN, credit card numbers | Remove — privacy risk, no ML value |
| Irrelevant features | Badge color for attrition prediction | Remove — no causal or correlational relationship |
💡
Benefits of feature selection: Reduce dimensionality (faster training), improve interpretability (explain predictions), prevent overfitting (model generalizes better), and reduce storage/compute costs.

🔀 Split and Combine Features

Feature Splitting

Break one feature into multiple. "Full Address" splits into Street, City, State, Zip. "Full Name" splits into First, Last. Unlocks more granular patterns.

🔗

Feature Combining

Merge related features into one. Room1_sqft + Room2_sqft + Room3_sqft = Total_sqft. Reduces dimensions while preserving information.

📅

Feature Derivation

Create new features from existing ones. Current_date - Hire_date = Tenure_days. Salary / Hours_worked = Effective_hourly_rate.

AnyCompany Feature Engineering

Split: Location "New York, NY" → City="New York", State="NY"

Combine: Base_salary + Bonus + Stock_value → Total_compensation

Derive: Current_date - Hire_date → Tenure_days (new feature from existing data)
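The three operations above, sketched in plain Python on a single hypothetical employee record (the dates and amounts are illustrative; "today" is pinned so the derivation is reproducible):

```python
# Sketch: split, combine, and derive features from one record.
from datetime import date

emp = {
    "location": "New York, NY",
    "base_salary": 120_000, "bonus": 15_000, "stock_value": 30_000,
    "hire_date": date(2021, 3, 1),
}

# Split: one field -> two finer-grained features
emp["city"], emp["state"] = (s.strip() for s in emp["location"].split(","))

# Combine: three correlated fields -> one total
emp["total_compensation"] = emp["base_salary"] + emp["bonus"] + emp["stock_value"]

# Derive: new feature from existing data (fixed reference date)
emp["tenure_days"] = (date(2024, 3, 1) - emp["hire_date"]).days

print(emp["city"], emp["state"], emp["total_compensation"], emp["tenure_days"])
```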

🔬 Principal Component Analysis (PCA)

PCA reduces many correlated features into fewer uncorrelated "principal components" that capture most of the variance in the data.

📊

How PCA Works

Transforms features into new axes (components) ranked by how much variance they explain. Keep top N components that capture 95%+ of variance.

🎯

When to Use PCA

Many correlated features (50+ columns), need to reduce dimensions for visualization, or algorithm struggles with high-dimensional data.

PCA Example — Compensation Analysis

Original features: Base salary, Bonus, Stock grants, Benefits value, Tax rate, Location cost index

PC1 (Total Comp): Base + Bonus + Stock + Benefits (correlated compensation metrics)

PC2 (Cost of Living): Tax rate + Location cost index (correlated location metrics)

Result: 6 features → 2 components capturing 92% of variance
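A from-scratch sketch of the PCA mechanics with NumPy, on two perfectly correlated toy "compensation" features (hypothetical values, not the six-feature AnyCompany dataset): center the data, eigendecompose the covariance matrix, rank components by variance explained.

```python
# Sketch: PCA via covariance eigendecomposition.
import numpy as np

X = np.array([[100.0, 10.0], [120.0, 14.0], [140.0, 18.0],
              [160.0, 22.0], [180.0, 26.0]])   # base salary, bonus ($K)

Xc = X - X.mean(axis=0)                  # 1. center each feature
cov = np.cov(Xc, rowvar=False)           # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # 3. eigendecompose (ascending order)
order = np.argsort(eigvals)[::-1]        # 4. re-rank by variance explained
explained = eigvals[order] / eigvals.sum()

print(explained)   # PC1 captures ~100% here: bonus = 0.2*salary - 10,
                   # so the two features collapse onto one component
pc1 = Xc @ eigvecs[:, order[0]]          # project rows onto PC1
```

In practice you would keep the top N components whose cumulative `explained` passes your threshold (e.g. 95%), mirroring the 6-features-to-2-components result above.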

🎮 Data Transformation Lab

Walk through a complete data transformation pipeline. Select a scenario, then click any pipeline node to explore that stage — or auto-play to watch the full flow with animated data particles.

🎯 Select a Transformation Scenario

💰

Income Prediction Dataset

Clean and transform demographics + income data for binary classification.

👤

Employee Attrition Features

Engineer features from HR data to predict which employees will leave.

🌐

Salary Benchmarking

Normalize compensation across 40+ countries into comparable metrics.

📋 Step 1 — Standardize Income Format: Convert all formats (50K, $42,000, 15/hr) to annual USD integer. Drop unparseable rows. This single step fixes the most common data quality issue in multi-source datasets.
📏 Standardize (Fix Formats) → 🔧 Clean (Missing & Outliers) → 🏷 Encode (Categoricals) → 📐 Scale (Numerics) → Export (ML-Ready)
📋 Transformation Summary
Scenario: Income Prediction
Input: Raw CSV with mixed formats
Output: Parquet, standardized numerics
Encoding: Education (ordinal), State (one-hot)
Target: Binary (income above/below $50K)
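Step 1 of the pipeline can be sketched as a small normalizer. The formats handled are the three from the step description; the 2,080 paid hours/year used for hourly rates is an assumption, and unparseable rows return `None` so the caller can drop them:

```python
# Sketch: normalize mixed income formats ("50K", "$42,000", "15/hr")
# to an annual USD integer. ASSUMES 2,080 work hours/year.
def to_annual_usd(raw):
    s = raw.strip().replace("$", "").replace(",", "").lower()
    try:
        if s.endswith("/hr"):
            return int(float(s[:-3]) * 2080)   # hourly -> annual
        if s.endswith("k"):
            return int(float(s[:-1]) * 1000)   # "50K" -> 50000
        return int(float(s))                   # already a plain number
    except ValueError:
        return None                            # unparseable: drop the row

print([to_annual_usd(v) for v in ("50K", "$42,000", "15/hr", "n/a")])
# [50000, 42000, 31200, None]
```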

AWS Data Transformation Services

AWS provides managed services for every stage of data transformation — from visual no-code tools to distributed processing at petabyte scale. The key is matching the right tool to your data volume and team expertise. These services help you transform data efficiently without managing infrastructure.

💡
Note: Feature engineering requires understanding both the data and the business problem. There's often no single correct way to engineer features — it requires experimentation and domain intuition. AWS tools accelerate this experimentation cycle.

Service Explorer

Click any service to see its capabilities and when to use it:

📋 SageMaker Data Wrangler: Visual, no-code interface with 300+ built-in transforms and 50+ data source connectors. Drag-and-drop data preparation. Best for exploration and prototyping transforms before productionizing.
ML Pipeline: Transform → Train → Deploy. Services: 🎨 Data Wrangler (visual, no-code) · Processing (custom scripts) · 🔥 Amazon EMR (big data scale) · 🗄 Feature Store (governance)
☁ Service Details
Service: SageMaker Data Wrangler
Code required: None (visual drag-and-drop)
Scale: Small to medium datasets
Best for: Exploration, prototyping, analyst-friendly transforms
AnyCompany use: Data scientists exploring new attrition features

📤 Transformation Output for Training

After transformation, prepare the final dataset for SageMaker model training:

🥧

Split Dataset

Training (70-80%), Validation (10-15%), Testing (10-15%). Never let test data leak into training.

🔀

Increase Diversity

Shuffle rows to remove ordering bias. Augment minority classes if imbalanced (SMOTE, oversampling).

💾

Format for Training

Export as CSV (simple), Parquet (efficient), or RecordIO-protobuf (SageMaker optimized). Match format to algorithm.
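The shuffle-then-split step from the cards above can be sketched in plain Python; a fixed seed keeps the split reproducible so the same rows never drift from test into training between runs:

```python
# Sketch: shuffle rows, then split 70/15/15 (train/validation/test).
import random

def split_dataset(rows, seed=42, train=0.70, val=0.15):
    rows = rows[:]                       # copy; caller's list is untouched
    random.Random(seed).shuffle(rows)    # remove ordering bias
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * val)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])      # remainder becomes the test set

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 70 15 15
```

Each row lands in exactly one split, which is the property that prevents test-data leakage.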

🎯
Typical AnyCompany workflow: Prototype transforms in Data Wrangler (visual, fast iteration). Productionize with SageMaker Processing (scripted, repeatable). Store results in Feature Store (shared, versioned). Use EMR only when data exceeds single-node capacity.

📝 Module Summary

Data Cleaning

Handle incorrect, duplicate, missing, and outlier data with appropriate techniques for each issue type.

Feature Engineering

Encode categoricals (binary, ordinal, one-hot), scale numerics (normalize, standardize, log, bin).

Feature Selection

Remove irrelevant features, split/combine for better signal, use PCA for dimensionality reduction.

AWS Services

Data Wrangler (visual), SageMaker Processing (scripts), EMR (scale), Feature Store (governance).