Module 4 - Interactive Explainer
Clean messy data, encode categories, scale numbers, select features, and leverage AWS services to build ML-ready datasets at enterprise scale.
Raw data is messy. One of the first steps after collection and storage is cleaning — a time-consuming but essential process for training effective ML models. At AnyCompany, payroll data spans multiple countries with different formats, currencies, and rules. Poor quality data leads to inaccurate models, biased predictions, and wasted resources.
Click any node to explore that cleaning stage, or auto-play to walk through the full flow:
Typos, inconsistent formats (50K vs $50,000 vs 50000), language differences, and duplicate copies from multiple source systems. Standardize formats and deduplicate.
Null values, blank fields, incomplete records. Determine whether values are missing at random (MAR), missing completely at random (MCAR), or missing not at random (MNAR), then impute or drop accordingly.
Values that deviate from normal behavior. A salary of $25M could be a data entry error or a legitimate CEO record. Investigate before removing.
| Issue | Examples | Techniques | HCM Context |
|---|---|---|---|
| Incorrect | 50K vs $50,000; NYC vs New York | Standardize formats, fix typos, validate against rules | Income formats vary across countries (USD, INR, EUR) |
| Duplicate | Same employee from two HR systems | Deduplication by key fields (employee ID + date) | Mergers create duplicate records across legacy systems |
| Missing (MAR) | Performance score blank for new hires | Impute with group median, flag as missing | New employees have no performance history — missingness depends on observed data (hire date) |
| Missing (MCAR) | Random blank fields from system glitch | Safe to drop rows or impute with mean/median | Software bug randomly blanks fields — unrelated to any data value |
| Missing (MNAR) | Salary blank for high earners who skip surveys | Domain expert input, separate model for group | People with high salaries specifically avoid answering — missingness depends on the missing value itself |
| Outliers | Age = 154, Salary = $25M | Remove if error, keep if legitimate (with flag) | Executive compensation is legitimately extreme — do not auto-remove |
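A minimal pandas sketch of these fixes: normalize salary formats, deduplicate on employee ID plus pay date, impute a MAR performance score with the department median (keeping a missing flag), and flag rather than remove extreme salaries. The file name and column names are illustrative assumptions, not AnyCompany's actual schema.

```python
import pandas as pd

# Hypothetical payroll extract; file and column names are illustrative only.
df = pd.read_csv("payroll_raw.csv")

# Incorrect formats: strip currency symbols and thousands separators so
# "$50,000" and "50000" compare equal.
df["salary"] = (
    df["salary"]
    .astype(str)
    .str.replace(r"[$,]", "", regex=True)
    .astype(float)
)

# Duplicates: keep one row per employee ID + pay date.
df = df.drop_duplicates(subset=["employee_id", "pay_date"])

# Missing (MAR): impute performance score with the department median, keep a flag.
df["perf_score_missing"] = df["performance_score"].isna().astype(int)
df["performance_score"] = df["performance_score"].fillna(
    df.groupby("department")["performance_score"].transform("median")
)

# Outliers: flag extreme salaries for human review instead of auto-removing them.
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
df["salary_outlier"] = (df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)
```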
ML algorithms work with numbers, not text. Categorical features (department names, education levels, states) must be converted to numeric representations. The encoding method depends on the type of category — binary, ordinal, or nominal. Getting this wrong can mislead your model into seeing relationships that don't exist.
Click any encoding type to see how it transforms AnyCompany data:
| Method | How It Works | Best For | Watch Out |
|---|---|---|---|
| Binary (0/1) | Map Yes=1, No=0 | Two-class features | Simple, no issues |
| Ordinal (integers) | Map ordered categories to 1, 2, 3... | Ranked categories (education, rating) | Implies equal spacing between levels |
| One-Hot Encoding | Create binary column per category | Nominal with few categories (<20) | High cardinality explodes dimensions |
| Label Encoding | Assign arbitrary integer per category | Tree-based models (XGBoost, RF) | Linear models may interpret as ordinal |
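A short pandas sketch of the four approaches; the column names and category values are made-up assumptions for illustration.

```python
import pandas as pd

# Made-up employee frame; columns and values are assumptions.
df = pd.DataFrame({
    "is_manager": ["Yes", "No", "Yes", "No"],
    "education":  ["Bachelors", "Masters", "PhD", "Bachelors"],
    "department": ["Finance", "Engineering", "HR", "Engineering"],
})

# Binary: map a two-class feature straight to 0/1.
df["is_manager"] = df["is_manager"].map({"No": 0, "Yes": 1})

# Ordinal: map ranked categories to integers (note: implies equal spacing between levels).
df["education"] = df["education"].map({"Bachelors": 1, "Masters": 2, "PhD": 3})

# Label encoding: arbitrary integer per category -- acceptable for tree-based models.
df["dept_label"] = df["department"].astype("category").cat.codes

# One-hot: one binary column per category -- safer for linear models, low cardinality only.
df = pd.get_dummies(df, columns=["department"], prefix="dept")
print(df)
```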
Numeric features often present challenges that confuse ML algorithms: features with vastly different scales can dominate training, extreme values disproportionately influence model behavior, and time-based features require special handling. Transformations bring features to comparable scales and reduce the impact of problematic data characteristics.
Click a technique to see how it transforms salary data:
| Technique | Range | Outlier Impact | Best For |
|---|---|---|---|
| Normalization | 0 to 1 | High — outliers compress other values | Neural networks, distance-based algorithms (KNN, SVM) |
| Standardization | Centered at 0 | Reduced — uses mean/std not min/max | Linear regression, logistic regression, PCA |
| Log Transform | Compressed | Greatly reduced — compresses extremes | Right-skewed data (income, transaction amounts) |
| Binning | Discrete buckets | Eliminated — values grouped into ranges | Non-linear relationships, interpretable categories |
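A quick sketch of the first three techniques applied to a toy salary series (values are invented); binning is shown separately after the salary bands below.

```python
import numpy as np
import pandas as pd

# Toy salary series, right-skewed with one extreme value.
salary = pd.Series([42_000, 55_000, 61_000, 98_000, 250_000], dtype=float)

# Normalization (min-max): rescales to [0, 1]; the $250K value compresses the rest.
normalized = (salary - salary.min()) / (salary.max() - salary.min())

# Standardization (z-score): centers at 0 with unit variance; less outlier-sensitive.
standardized = (salary - salary.mean()) / salary.std()

# Log transform: compresses the right tail of skewed income data.
log_salary = np.log1p(salary)
```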
Convert continuous values into discrete buckets. Reduces noise and handles non-linear relationships.
Low: < $50K (entry-level, interns)
Medium: $50K–$100K (mid-career professionals)
High: $100K–$200K (senior engineers, managers)
Executive: > $200K (directors, VPs, C-suite)
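A minimal pd.cut sketch using these bands; the sample salaries are invented.

```python
import numpy as np
import pandas as pd

# Bucket salaries into the bands listed above.
salary = pd.Series([30_000, 75_000, 150_000, 450_000], dtype=float)
bands = pd.cut(
    salary,
    bins=[0, 50_000, 100_000, 200_000, np.inf],
    labels=["Low", "Medium", "High", "Executive"],
)
# -> Low, Medium, High, Executive
```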
More features is not always better. Irrelevant, redundant, or noisy features hurt model performance, increase training time, and reduce interpretability. Feature selection finds the signal in the noise — choosing the most relevant features that help your ML algorithm efficiently recognize patterns in the dataset.
| Problem | Example | Action |
|---|---|---|
| All unique values | Phone numbers, employee IDs | Remove — no predictive pattern possible |
| All identical values | Country = "USA" for all rows | Remove — zero variance, no information |
| Sparse data | Optional fields with 95% nulls | Remove or engineer a "has_value" binary flag |
| Sensitive data | SSN, credit card numbers | Remove — privacy risk, no ML value |
| Irrelevant features | Badge color for attrition prediction | Remove — no causal or correlational relationship |
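A hedged pandas sketch of these checks; the file name, the 95% sparsity threshold, and the sensitive column names are assumptions.

```python
import pandas as pd

df = pd.read_csv("employees.csv")  # hypothetical dataset

# All-unique identifiers (employee IDs, phone numbers): no predictive pattern possible.
unique_cols = [c for c in df.columns if df[c].nunique() == len(df)]

# Zero-variance columns (same value in every row): no information.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]

# Sparse optional fields: keep a "has_value" flag, then drop the raw column.
sparse_cols = [c for c in df.columns if df[c].isna().mean() > 0.95]
for c in sparse_cols:
    df[f"has_{c}"] = df[c].notna().astype(int)

# Sensitive fields named explicitly (assumed column names).
sensitive_cols = ["ssn", "credit_card_number"]

to_drop = set(unique_cols + constant_cols + sparse_cols + sensitive_cols) & set(df.columns)
df = df.drop(columns=list(to_drop))
```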
Break one feature into multiple. "Full Address" splits into Street, City, State, Zip. "Full Name" splits into First, Last. Unlocks more granular patterns.
Merge related features into one. Room1_sqft + Room2_sqft + Room3_sqft = Total_sqft. Reduces dimensions while preserving information.
Create new features from existing ones. Current_date - Hire_date = Tenure_days. Salary / Hours_worked = Effective_hourly_rate.
Split: Location "New York, NY" → City="New York", State="NY"
Combine: Base_salary + Bonus + Stock_value → Total_compensation
Derive: Current_date - Hire_date → Tenure_days (new feature from existing data)
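The three patterns side by side in pandas, with invented rows and column names mirroring the examples above.

```python
import pandas as pd

# Invented sample data; column names mirror the examples above.
df = pd.DataFrame({
    "location": ["New York, NY", "Austin, TX"],
    "base_salary": [120_000, 95_000],
    "bonus": [15_000, 8_000],
    "stock_value": [30_000, 10_000],
    "hire_date": pd.to_datetime(["2019-03-01", "2022-07-15"]),
})

# Split: one feature becomes several.
df[["city", "state"]] = df["location"].str.split(", ", expand=True)

# Combine: related features collapse into one.
df["total_compensation"] = df["base_salary"] + df["bonus"] + df["stock_value"]

# Derive: a new feature computed from existing data.
df["tenure_days"] = (pd.Timestamp.today() - df["hire_date"]).dt.days
```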
PCA reduces many correlated features into fewer uncorrelated "principal components" that capture most of the variance in the data.
Transforms features into new axes (components) ranked by how much variance they explain. Keep top N components that capture 95%+ of variance.
Use PCA when you have many correlated features (50+ columns), need fewer dimensions for visualization, or your algorithm struggles with high-dimensional data.
Original features: Base salary, Bonus, Stock grants, Benefits value, Tax rate, Location cost index
PC1 (Total Comp): Base + Bonus + Stock + Benefits (correlated compensation metrics)
PC2 (Cost of Living): Tax rate + Location cost index (correlated location metrics)
Result: 6 features → 2 components capturing 92% of variance
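A scikit-learn sketch of the same idea on synthetic data built from two latent drivers (total compensation and cost of living). The 95% variance threshold follows the description above; all numbers and column construction are made up.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data driven by two latent factors: compensation and cost of living.
rng = np.random.default_rng(0)
comp = rng.normal(100_000, 20_000, size=200)   # latent compensation driver
loc = rng.normal(1.0, 0.2, size=200)           # latent location driver
X = np.column_stack([
    comp,                                       # base salary
    0.15 * comp + rng.normal(0, 2_000, 200),    # bonus
    0.30 * comp + rng.normal(0, 5_000, 200),    # stock grants
    0.10 * comp + rng.normal(0, 1_000, 200),    # benefits value
    0.30 * loc + rng.normal(0, 0.02, 200),      # tax rate
    loc + rng.normal(0, 0.05, 200),             # location cost index
])

# Standardize first so the large salary columns do not dominate the components.
X_scaled = StandardScaler().fit_transform(X)

# Keep however many components explain at least 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.cumsum())
```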
Walk through a complete data transformation pipeline. Select a scenario, then click any pipeline node to explore that stage — or auto-play to watch the full flow with animated data particles.
AWS provides managed services for every stage of data transformation — from visual no-code tools to distributed processing at petabyte scale. The key is matching the right tool to your data volume and team expertise. These services help you transform data efficiently without managing infrastructure.
Click any service to see its capabilities and when to use it:
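As one illustration of the scripted option, here is a minimal SageMaker Processing sketch using the SageMaker Python SDK. The role ARN, S3 paths, script name, and framework version are placeholders and assumptions; check the versions and permissions available in your account.

```python
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

# Placeholders: the role ARN, S3 paths, and framework_version must match your account.
processor = SKLearnProcessor(
    framework_version="1.2-1",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

processor.run(
    code="preprocess.py",  # your cleaning/encoding/scaling script
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/processed/")],
)
```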
After transformation, prepare the final dataset for SageMaker model training:
Training (70-80%), Validation (10-15%), Testing (10-15%). Never let test data leak into training.
Shuffle rows to remove ordering bias. Augment minority classes if imbalanced (SMOTE, oversampling).
Export as CSV (simple), Parquet (efficient), or RecordIO-protobuf (SageMaker optimized). Match format to algorithm.
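A small scikit-learn sketch of an 80/10/10 split plus export; the file names are placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("transformed_employees.csv")  # hypothetical transformed dataset

# Carve off 10% for testing first, then split the remainder into train/validation.
train_val, test = train_test_split(df, test_size=0.10, shuffle=True, random_state=42)
train, val = train_test_split(train_val, test_size=0.111, shuffle=True, random_state=42)
# train ~80%, val ~10%, test ~10% of the original rows

# Export in a format the training algorithm expects; Parquet is compact, CSV is simplest.
train.to_parquet("train.parquet", index=False)
val.to_parquet("validation.parquet", index=False)
test.to_parquet("test.parquet", index=False)
```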
Handle incorrect, duplicate, missing, and outlier data with appropriate techniques for each issue type.
Encode categoricals (binary, ordinal, one-hot), scale numerics (normalize, standardize, log, bin).
Remove irrelevant features, split/combine for better signal, use PCA for dimensionality reduction.
Data Wrangler (visual), SageMaker Processing (scripts), EMR (scale), Feature Store (governance).