Module 4 - Interactive Explainer
Clean messy data, encode categories, scale numbers, select features, and leverage AWS services to build ML-ready datasets at enterprise scale.
Raw data is messy. One of the first steps after collection and storage is cleaning — a time-consuming but essential process for training effective ML models. At AnyCompany, payroll data spans multiple countries with different formats, currencies, and rules. Poor quality data leads to inaccurate models, biased predictions, and wasted resources.
Click any node to explore that cleaning stage, or auto-play to walk through the full flow:
Typos, inconsistent formats (50K vs $50,000 vs 50000), language differences, and duplicate copies from multiple source systems. Standardize formats and deduplicate.
Null values, blank fields, incomplete records. Determine whether values are missing at random (MAR), missing completely at random (MCAR), or missing not at random (MNAR), then impute or drop accordingly.
Values that deviate from normal behavior. A salary of $25M could be a data entry error or a legitimate CEO record. Investigate before removing.
| Issue | Examples | Techniques | HCM Context |
|---|---|---|---|
| Incorrect | 50K vs $50,000; NYC vs New York | Standardize formats, fix typos, validate against rules | Income formats vary across countries (USD, INR, EUR) |
| Duplicate | Same employee from two HR systems | Deduplication by key fields (employee ID + date) | Mergers create duplicate records across legacy systems |
| Missing (MAR) | Performance score blank for new hires | Impute with group median, flag as missing | New employees have no performance history — missingness depends on observed data (hire date) |
| Missing (MCAR) | Random blank fields from system glitch | Safe to drop rows or impute with mean/median | Software bug randomly blanks fields — unrelated to any data value |
| Missing (MNAR) | Salary blank for high earners who skip surveys | Domain expert input, separate model for group | People with high salaries specifically avoid answering — missingness depends on the missing value itself |
| Outliers | Age = 154, Salary = $25M | Remove if error, keep if legitimate (with flag) | Executive compensation is legitimately extreme — do not auto-remove |
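A minimal pandas sketch of these fixes: normalize salary formats, deduplicate on employee ID plus pay date, impute a MAR performance score with the department median (keeping a missing flag), and flag rather than remove extreme salaries. The file name and column names are illustrative assumptions, not AnyCompany's actual schema.

```python
import pandas as pd

# Hypothetical payroll extract; file and column names are illustrative only.
df = pd.read_csv("payroll_raw.csv")

# Incorrect formats: strip currency symbols and thousands separators so
# "$50,000" and "50000" compare equal.
df["salary"] = (
    df["salary"]
    .astype(str)
    .str.replace(r"[$,]", "", regex=True)
    .astype(float)
)

# Duplicates: keep one row per employee ID + pay date.
df = df.drop_duplicates(subset=["employee_id", "pay_date"])

# Missing (MAR): impute performance score with the department median, keep a flag.
df["perf_score_missing"] = df["performance_score"].isna().astype(int)
df["performance_score"] = df["performance_score"].fillna(
    df.groupby("department")["performance_score"].transform("median")
)

# Outliers: flag extreme salaries for human review instead of auto-removing them.
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
df["salary_outlier"] = (df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)
```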
ML algorithms work with numbers, not text. Categorical features (department names, education levels, states) must be converted to numeric representations. The encoding method depends on the type of category — binary, ordinal, or nominal. Getting this wrong can mislead your model into seeing relationships that don't exist.
Click any encoding type to see how it transforms AnyCompany data:
| Method | How It Works | Best For | Watch Out |
|---|---|---|---|
| Binary (0/1) | Map Yes=1, No=0 | Two-class features | Simple, no issues |
| Ordinal (integers) | Map ordered categories to 1, 2, 3... | Ranked categories (education, rating) | Implies equal spacing between levels |
| One-Hot Encoding | Create binary column per category | Nominal with few categories (<20) | High cardinality explodes dimensions |
| Label Encoding | Assign arbitrary integer per category | Tree-based models (XGBoost, RF) | Linear models may interpret as ordinal |
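A short pandas sketch of the four approaches; the column names and category values are made-up assumptions for illustration.

```python
import pandas as pd

# Made-up employee frame; columns and values are assumptions.
df = pd.DataFrame({
    "is_manager": ["Yes", "No", "Yes", "No"],
    "education":  ["Bachelors", "Masters", "PhD", "Bachelors"],
    "department": ["Finance", "Engineering", "HR", "Engineering"],
})

# Binary: map a two-class feature straight to 0/1.
df["is_manager"] = df["is_manager"].map({"No": 0, "Yes": 1})

# Ordinal: map ranked categories to integers (note: implies equal spacing between levels).
df["education"] = df["education"].map({"Bachelors": 1, "Masters": 2, "PhD": 3})

# Label encoding: arbitrary integer per category -- acceptable for tree-based models.
df["dept_label"] = df["department"].astype("category").cat.codes

# One-hot: one binary column per category -- safer for linear models, low cardinality only.
df = pd.get_dummies(df, columns=["department"], prefix="dept")
print(df)
```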
Numeric features often present challenges that confuse ML algorithms: features with vastly different scales can dominate training, extreme values disproportionately influence model behavior, and time-based features require special handling. Transformations bring features to comparable scales and reduce the impact of problematic data characteristics.
Click a technique to see how it transforms salary data:
| Technique | Range | Outlier Impact | Best For |
|---|---|---|---|
| Normalization | 0 to 1 | High — outliers compress other values | Neural networks, distance-based algorithms (KNN, SVM) |
| Standardization | Centered at 0 | Reduced — uses mean/std not min/max | Linear regression, logistic regression, PCA |
| Log Transform | Compressed | Greatly reduced — compresses extremes | Right-skewed data (income, transaction amounts) |
| Binning | Discrete buckets | Eliminated — values grouped into ranges | Non-linear relationships, interpretable categories |
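A quick sketch of the first three techniques applied to a toy salary series (values are invented); binning is shown separately after the salary bands below.

```python
import numpy as np
import pandas as pd

# Toy salary series, right-skewed with one extreme value.
salary = pd.Series([42_000, 55_000, 61_000, 98_000, 250_000], dtype=float)

# Normalization (min-max): rescales to [0, 1]; the $250K value compresses the rest.
normalized = (salary - salary.min()) / (salary.max() - salary.min())

# Standardization (z-score): centers at 0 with unit variance; less outlier-sensitive.
standardized = (salary - salary.mean()) / salary.std()

# Log transform: compresses the right tail of skewed income data.
log_salary = np.log1p(salary)
```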
Convert continuous values into discrete buckets. Reduces noise and handles non-linear relationships.
Low: < $50K (entry-level, interns)
Medium: $50K–$100K (mid-career professionals)
High: $100K–$200K (senior engineers, managers)
Executive: > $200K (directors, VPs, C-suite)
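A minimal pd.cut sketch using these bands; the sample salaries are invented.

```python
import numpy as np
import pandas as pd

# Bucket salaries into the bands listed above.
salary = pd.Series([30_000, 75_000, 150_000, 450_000], dtype=float)
bands = pd.cut(
    salary,
    bins=[0, 50_000, 100_000, 200_000, np.inf],
    labels=["Low", "Medium", "High", "Executive"],
)
# -> Low, Medium, High, Executive
```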
More features is not always better. Irrelevant, redundant, or noisy features hurt model performance, increase training time, and reduce interpretability. Feature selection finds the signal in the noise — choosing the most relevant features that help your ML algorithm efficiently recognize patterns in the dataset.
| Problem | Example | Action |
|---|---|---|
| All unique values | Phone numbers, employee IDs | Remove — no predictive pattern possible |
| All identical values | Country = "USA" for all rows | Remove — zero variance, no information |
| Sparse data | Optional fields with 95% nulls | Remove or engineer a "has_value" binary flag |
| Sensitive data | SSN, credit card numbers | Remove — privacy risk, no ML value |
| Irrelevant features | Badge color for attrition prediction | Remove — no causal or correlational relationship |
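A hedged pandas sketch of these checks; the file name, the 95% sparsity threshold, and the sensitive column names are assumptions.

```python
import pandas as pd

df = pd.read_csv("employees.csv")  # hypothetical dataset

# All-unique identifiers (employee IDs, phone numbers): no predictive pattern possible.
unique_cols = [c for c in df.columns if df[c].nunique() == len(df)]

# Zero-variance columns (same value in every row): no information.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]

# Sparse optional fields: keep a "has_value" flag, then drop the raw column.
sparse_cols = [c for c in df.columns if df[c].isna().mean() > 0.95]
for c in sparse_cols:
    df[f"has_{c}"] = df[c].notna().astype(int)

# Sensitive fields named explicitly (assumed column names).
sensitive_cols = ["ssn", "credit_card_number"]

to_drop = set(unique_cols + constant_cols + sparse_cols + sensitive_cols) & set(df.columns)
df = df.drop(columns=list(to_drop))
```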
Break one feature into multiple. "Full Address" splits into Street, City, State, Zip. "Full Name" splits into First, Last. Unlocks more granular patterns.
Merge related features into one. Room1_sqft + Room2_sqft + Room3_sqft = Total_sqft. Reduces dimensions while preserving information.
Create new features from existing ones. Current_date - Hire_date = Tenure_days. Salary / Hours_worked = Effective_hourly_rate.
Split: Location "New York, NY" → City="New York", State="NY"
Combine: Base_salary + Bonus + Stock_value → Total_compensation
Derive: Current_date - Hire_date → Tenure_days (new feature from existing data)
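The three patterns side by side in pandas, with invented rows and column names mirroring the examples above.

```python
import pandas as pd

# Invented sample data; column names mirror the examples above.
df = pd.DataFrame({
    "location": ["New York, NY", "Austin, TX"],
    "base_salary": [120_000, 95_000],
    "bonus": [15_000, 8_000],
    "stock_value": [30_000, 10_000],
    "hire_date": pd.to_datetime(["2019-03-01", "2022-07-15"]),
})

# Split: one feature becomes several.
df[["city", "state"]] = df["location"].str.split(", ", expand=True)

# Combine: related features collapse into one.
df["total_compensation"] = df["base_salary"] + df["bonus"] + df["stock_value"]

# Derive: a new feature computed from existing data.
df["tenure_days"] = (pd.Timestamp.today() - df["hire_date"]).dt.days
```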
PCA reduces many correlated features into fewer uncorrelated "principal components" that capture most of the variance in the data.
Transforms features into new axes (components) ranked by how much variance they explain. Keep top N components that capture 95%+ of variance.
Use PCA when you have many correlated features (50+ columns), need fewer dimensions for visualization, or your algorithm struggles with high-dimensional data.
Original features: Base salary, Bonus, Stock grants, Benefits value, Tax rate, Location cost index
PC1 (Total Comp): Base + Bonus + Stock + Benefits (correlated compensation metrics)
PC2 (Cost of Living): Tax rate + Location cost index (correlated location metrics)
Result: 6 features → 2 components capturing 92% of variance
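A scikit-learn sketch of the same idea on synthetic data built from two latent drivers (total compensation and cost of living). The 95% variance threshold follows the description above; all numbers and column construction are made up.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data driven by two latent factors: compensation and cost of living.
rng = np.random.default_rng(0)
comp = rng.normal(100_000, 20_000, size=200)   # latent compensation driver
loc = rng.normal(1.0, 0.2, size=200)           # latent location driver
X = np.column_stack([
    comp,                                       # base salary
    0.15 * comp + rng.normal(0, 2_000, 200),    # bonus
    0.30 * comp + rng.normal(0, 5_000, 200),    # stock grants
    0.10 * comp + rng.normal(0, 1_000, 200),    # benefits value
    0.30 * loc + rng.normal(0, 0.02, 200),      # tax rate
    loc + rng.normal(0, 0.05, 200),             # location cost index
])

# Standardize first so the large salary columns do not dominate the components.
X_scaled = StandardScaler().fit_transform(X)

# Keep however many components explain at least 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.cumsum())
```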
Walk through a complete data transformation pipeline. Select a scenario, then click any pipeline node to explore that stage — or auto-play to watch the full flow with animated data particles.
AWS provides managed services for every stage of data transformation — from visual no-code tools to distributed processing at petabyte scale. The key is matching the right tool to your data volume and team expertise. These services help you transform data efficiently without managing infrastructure.
Click any service to see its capabilities and when to use it:
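As one illustration of the scripted option, here is a minimal SageMaker Processing sketch using the SageMaker Python SDK. The role ARN, S3 paths, script name, and framework version are placeholders and assumptions; check the versions and permissions available in your account.

```python
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

# Placeholders: the role ARN, S3 paths, and framework_version must match your account.
processor = SKLearnProcessor(
    framework_version="1.2-1",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

processor.run(
    code="preprocess.py",  # your cleaning/encoding/scaling script
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/processed/")],
)
```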
After transformation, prepare the final dataset for SageMaker model training:
Training (70-80%), Validation (10-15%), Testing (10-15%). Never let test data leak into training.
Shuffle rows to remove ordering bias. Augment minority classes if imbalanced (SMOTE, oversampling).
Export as CSV (simple), Parquet (efficient), or RecordIO-protobuf (SageMaker optimized). Match format to algorithm.
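A small scikit-learn sketch of an 80/10/10 split plus export; the file names are placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("transformed_employees.csv")  # hypothetical transformed dataset

# Carve off 10% for testing first, then split the remainder into train/validation.
train_val, test = train_test_split(df, test_size=0.10, shuffle=True, random_state=42)
train, val = train_test_split(train_val, test_size=0.111, shuffle=True, random_state=42)
# train ~80%, val ~10%, test ~10% of the original rows

# Export in a format the training algorithm expects; Parquet is compact, CSV is simplest.
train.to_parquet("train.parquet", index=False)
val.to_parquet("validation.parquet", index=False)
test.to_parquet("test.parquet", index=False)
```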
Handle incorrect, duplicate, missing, and outlier data with appropriate techniques for each issue type.
Encode categoricals (binary, ordinal, one-hot), scale numerics (normalize, standardize, log, bin).
Remove irrelevant features, split/combine for better signal, use PCA for dimensionality reduction.
Data Wrangler (visual), SageMaker Processing (scripts), EMR (scale), Feature Store (governance).