Module 3 - Interactive Explainer

Data Processing for Machine Learning

From raw data sources to ML-ready datasets — understand data types, formats, exploratory analysis, and AWS storage options that power production ML pipelines.

📦 Data Engineering ⚡ Interactive 🏢 HCM Context

📦 Data Sources for Machine Learning

Every ML project starts with data. Effective data processing lays the foundation for successful model development — the quality and relevance of your data directly impacts model performance. At AnyCompany, data flows from multiple systems — payroll engines, HR platforms, time-tracking, benefits administration, and compliance databases. Understanding where your data lives and what it looks like is the first step.

💡
Note: This module covers data ingestion and exploratory analysis. Data transformation (cleaning, feature engineering, selection) is covered in Module 4. Think of this as "understanding what you have" before "shaping it for ML."

The Data Processing Pipeline

Click any node to explore that stage. Or hit auto-play to walk through the full flow.

📋 Ingest from Sources: Collect raw data from databases, APIs, streaming platforms, and file storage. At AnyCompany, this means pulling from HRIS, payroll engines, time-tracking, and compliance systems.
📥 Ingest from Sources → 🔄 Transform (Clean & Shape) → 🔍 Validate (Quality Checks) → 🧪 ML-Ready Dataset

Where ML Data Lives

It's critical to know the characteristics of your data sources — attribute names, record counts, data types, and value ranges. A data dictionary or Data Catalog is essential. If one doesn't exist, you'll need to build it.
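If no data dictionary exists, generating a starter one is straightforward. Below is a minimal sketch in pandas; the `employees` frame and its column names are hypothetical stand-ins for an HRIS extract.

```python
import pandas as pd

# Hypothetical extract from an HRIS table; the columns are illustrative only.
employees = pd.DataFrame({
    "employee_id": [101, 102, 103],
    "department": ["Payroll", "HR", "Payroll"],
    "salary": [82000.0, 75000.0, 91000.0],
    "hire_date": pd.to_datetime(["2019-03-01", "2021-07-15", "2018-01-09"]),
})

# One row per attribute: data type, record counts, distinct values, value range.
data_dictionary = pd.DataFrame({
    "dtype": employees.dtypes.astype(str),
    "non_null": employees.notna().sum(),
    "distinct": employees.nunique(),
    "min": employees.min(),
    "max": employees.max(),
})
print(data_dictionary)
```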

🏊

Data Lakes

Centralized repositories for vast amounts of raw data in various formats — structured (CSV), unstructured (images, text), and semi-structured (JSON, XML). Flexible but requires significant preprocessing for ML use. AnyCompany DataCloud aggregates payroll and HR data from thousands of clients.

🗄

Data Warehouses

Structured data optimized for analytics — organized in well-defined schemas, efficient for complex queries and aggregations. Often uses columnar storage for improved performance. AnyCompany stores aggregated workforce analytics here.

💾

Databases

Structured storage optimized for transactional operations — organized in tables with defined relationships, efficient for frequent updates and retrievals. Relational (RDS, Aurora) and NoSQL (DynamoDB). AnyCompany stores transactional payroll records and employee profiles here.

🌊

Streaming Data

Real-time event streams via Kinesis or Kafka. Continuous generation and processing, ideal for near real-time predictions. AnyCompany captures login events, API calls, and transaction signals for real-time fraud detection.

🔌

APIs & External Sources

Third-party data feeds: labor market data, economic indicators, regulatory updates. Enriches internal data for better predictions. Choose a central storage location (like S3) and select appropriate ingestion methods.
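A minimal sketch of that centralization step with boto3, assuming a hypothetical landing bucket named `anycompany-ml-raw`: a batch export from the payroll engine is uploaded as a file, and an API-sourced payload is written directly as an object.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "anycompany-ml-raw"  # hypothetical landing bucket

# Batch file drop: copy a local payroll export into the raw zone.
s3.upload_file(
    Filename="payroll_export_2024_06.csv",
    Bucket=BUCKET,
    Key="raw/payroll/2024/06/payroll_export.csv",
)

# API-sourced data: write the payload directly as an S3 object.
labor_stats = {"region": "IN", "unemployment_rate": 4.1}  # placeholder payload
s3.put_object(
    Bucket=BUCKET,
    Key="raw/external/labor_stats/2024-06.json",
    Body=json.dumps(labor_stats).encode("utf-8"),
)
```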

Identifying High-Performing Data

Not all data is useful for ML. Evaluating data quality up front helps build the best dataset for the problem. High-performing data must meet four criteria:

🎯

Representative

Training data should reflect real-world scenarios. If 20% of employees typically leave after a year, your data should represent that rate accurately. A model trained only on US employees will fail for workforces in India or Europe.

🔗

Relevant

Include attributes that expose patterns related to your prediction goal. For attrition: membership duration, usage frequency, support interactions. Irrelevant information can negatively impact model performance.

📊

Feature Rich

A comprehensive set of diverse, relevant attributes increases the chances of discovering meaningful correlations. Demographic information might reveal important trends in employee behavior that a single feature cannot.

🔄

Consistent

Use consistent formatting, units, and metadata across datasets. Inconsistencies confuse algorithms and reduce accuracy. Data from various sources must adhere to a standardized format before training.

AnyCompany Data Quality Check

Good: 5 years of payroll transactions across 40+ countries with labeled fraud cases = representative, relevant, feature-rich, high volume.

Bad: 6 months of data from one office with no fraud labels = not representative, insufficient volume, no labels for supervised learning.

🗂 Data Types and Formats

ML algorithms are picky about their input. Data types significantly influence algorithm selection and model performance — numerical data suits linear regression, image data needs CNNs, and time series requires specialized forecasting algorithms. Understanding types, categories, and file formats helps you choose the right preprocessing pipeline.

Four Data Types in ML

📝

Text

Documents, chat logs, support tickets. AnyCompany: employee feedback surveys, support ticket descriptions, policy documents. Requires NLP preprocessing (tokenization, embeddings).

📊

Tabular

Rows and columns — the most common ML format. AnyCompany: payroll records, employee demographics, time entries. Each row is a sample, each column is a feature.

📈

Time Series

Sequential data points indexed by time. AnyCompany: monthly payroll totals, daily login patterns, quarterly attrition rates. Order matters — cannot shuffle rows.

🖼

Image

Visual data as pixel arrays. AnyCompany: scanned tax forms (W-2, I-9), ID documents for verification, signature images. Requires computer vision models.

📂 Data Categories

| Category | Structure | Examples | ML Readiness |
|---|---|---|---|
| Structured | Fixed schema, rows/columns | Payroll tables, employee records, time entries | High — directly usable |
| Semi-structured | Flexible schema, nested | JSON API responses, XML config files, log entries | Medium — needs flattening |
| Unstructured | No predefined schema | PDF documents, images, audio recordings, emails | Low — needs heavy preprocessing |
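The "needs flattening" step for semi-structured data is often a single call in pandas. A minimal sketch with a made-up nested API response: `pd.json_normalize` spreads nested objects into columns and explodes nested lists into rows.

```python
import pandas as pd

# Made-up nested API response from an HR system.
records = [
    {
        "employee_id": 101,
        "name": {"first": "Ana", "last": "Silva"},
        "time_entries": [
            {"date": "2024-06-03", "hours": 8.0},
            {"date": "2024-06-04", "hours": 7.5},
        ],
    },
]

# record_path explodes the nested list (one row per time entry);
# meta carries parent fields down onto each row.
flat = pd.json_normalize(
    records,
    record_path="time_entries",
    meta=["employee_id", ["name", "first"], ["name", "last"]],
)
print(flat)  # columns: date, hours, employee_id, name.first, name.last
```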

💾 File Formats for ML

How you store data affects read speed, storage cost, and compatibility with ML frameworks. Row-based formats are fastest for writing (each record goes to disk sequentially), while columnar formats are fastest for reading specific features (skip irrelevant columns entirely).

| Format | Type | Best For | HCM Use Case |
|---|---|---|---|
| CSV | Row-based | Simple tabular data, quick exports, human-readable | Employee data exports, payroll summaries |
| Avro | Row-based | Streaming data, schema evolution, compact binary | Real-time event streams (Kafka/Kinesis) |
| Parquet | Columnar | Analytics queries, large datasets, column selection | DataCloud analytics, historical payroll analysis |
| ORC | Columnar | Hive/Spark workloads, high compression | Data lake batch processing jobs |
| JSON/JSONL | Object notation | Nested data, API responses, flexible schemas | API logs, configuration data, audit trails |
| RecordIO | Binary | SageMaker optimized, fast training data loading | Training datasets for SageMaker built-in algorithms |
🎯
Rule of thumb: Use Parquet for analytics and ML training (columnar = fast feature selection). Use CSV for quick prototyping. Use RecordIO when training with SageMaker built-in algorithms for maximum throughput.
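A minimal sketch of the trade-off in pandas (the dataset is illustrative; Parquet support assumes `pyarrow` or `fastparquet` is installed):

```python
import pandas as pd

# Illustrative tabular dataset.
df = pd.DataFrame({
    "employee_id": range(5),
    "salary": [80000, 95000, 72000, 88000, 101000],
    "department": ["Payroll", "HR", "HR", "Payroll", "Finance"],
})

# Row-based CSV: human-readable, fine for quick prototyping and exports.
df.to_csv("employees.csv", index=False)

# Columnar Parquet with Snappy compression: compact and fast for analytics.
df.to_parquet("employees.parquet", compression="snappy", index=False)

# The columnar payoff: read only the features you need.
salaries = pd.read_parquet("employees.parquet", columns=["employee_id", "salary"])
print(salaries)
```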

🔄 Data Ingestion: Batch vs Streaming

📦

Batch Ingestion

Collected and processed in groups on a schedule (hourly, daily, weekly). AnyCompany: nightly payroll data sync, weekly HR data refresh, monthly compliance reports.

🌊

Streaming Ingestion

Processed in real-time as events occur. AnyCompany: real-time fraud detection on transactions, live login anomaly detection, instant compliance alerts.

When to Use Each

Batch: Attrition model retraining (monthly), salary benchmarking updates (quarterly), workforce analytics reports

Streaming: Payroll fraud detection (real-time), AnyCompany Assist responses (real-time), security anomaly alerts
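On the streaming side, a minimal boto3 sketch of pushing a payroll event into a Kinesis stream; the stream name and event fields are hypothetical.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# A payroll transaction event, pushed as it occurs so a downstream fraud
# model can score it in near real time. Field names are made up.
event = {
    "transaction_id": "txn-00123",
    "employee_id": 101,
    "amount": 4200.00,
    "event_time": "2024-06-14T09:21:07Z",
}

kinesis.put_record(
    StreamName="anycompany-payroll-events",  # hypothetical stream
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=str(event["employee_id"]),  # keeps one employee's events ordered
)
```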

📊 Exploratory Data Analysis (EDA)

Before training any model, you must understand your data. EDA serves as a bridge between raw data and actionable insights — helping you identify issues like deviations, skew, and outliers that would degrade model accuracy. By detecting these problems early, you reduce the chances of building a model on flawed foundations.

Three Goals of Data Visualization

Each goal contributes to more accurate and reliable models. They work together to ensure your data is well-understood before feeding it into algorithms.

🔍

Understand the Data

Grasp the overall structure and characteristics. Identify patterns, trends, and relationships between variables. A correlation matrix heatmap quickly shows which features relate most strongly to your target — guiding feature selection.

🧹

Identify Quality Issues

Visual inspection reveals problems missed in tabular formats — outliers, missing values, inconsistencies. A box plot highlights outliers that need addressing before training. Are there employees with negative tenure? Salaries of $0?

Shape the Data

Make informed preprocessing decisions based on visual insights. If a histogram shows highly skewed salary distribution, apply a log transformation to normalize it. Decide what to keep, transform, or remove.
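A minimal sketch of that skew check and log transform on synthetic salary data:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed salary distribution.
rng = np.random.default_rng(42)
salaries = pd.Series(rng.lognormal(mean=11, sigma=0.5, size=10_000))
print(f"skew before: {salaries.skew():.2f}")

# log1p compresses the long right tail toward a more symmetric shape.
salaries_log = np.log1p(salaries)
print(f"skew after:  {salaries_log.skew():.2f}")
```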

🔬 EDA Methods Explorer

Click any method to see how it applies to workforce data. Each method reveals different insights from the same AnyCompany employee dataset.

🔗 Relationship Analysis: Use scatter plots and correlation matrices to find connections between features. At AnyCompany: plot salary vs tenure, performance vs attrition risk, or overtime hours vs satisfaction score. Strong correlations become powerful ML features.
EDA Methods:
📉 Relationships: Scatter • Correlation • Pair Plot
📊 Distributions: Histogram • Box Plot • KDE
⚖️ Comparisons: Grouped Bar • Violin • Swarm
🗺️ Composition: Heatmap • Stacked • Treemap
🔗 Relationship Analysis — HCM Application
Chart Types: Scatter plot, Correlation matrix, Pair plot
Question: Which features predict attrition?
HCM Example: salary_percentile vs months_since_promotion (r = -0.42)
Insight: Underpaid + stalled employees leave 3x more often
Action: Create interaction feature: compensation_gap × promotion_delay
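A minimal sketch of this relationship analysis in pandas; the frame is tiny and fabricated, but the column names follow the panel above.

```python
import pandas as pd

# Tiny fabricated frame; column names follow the panel above.
df = pd.DataFrame({
    "salary_percentile":      [0.30, 0.80, 0.45, 0.90, 0.20],
    "months_since_promotion": [30, 6, 22, 4, 36],
    "left_company":           [1, 0, 1, 0, 1],  # binary target
})

# Pairwise Pearson correlations; the target column shows which features
# relate most strongly to attrition.
print(df.corr()["left_company"].sort_values())

# The interaction feature from the Action row: underpaid AND stalled.
df["compensation_gap_x_promotion_delay"] = (
    (1 - df["salary_percentile"]) * df["months_since_promotion"]
)
```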

📉 Visualization by Data Type

| Data Type | Chart Type | Purpose | HCM Example |
|---|---|---|---|
| Categorical | Bar Chart | Compare counts across categories | Attrition count by department |
| Categorical | Pie Chart | Show proportions of a whole | Employee distribution by role type |
| Categorical | Heatmap | Show patterns across two categories | Attrition rate by department × tenure band |
| Numerical | Scatter Plot | Show relationships between two variables | Salary vs years of experience |
| Numerical | Histogram | Show distribution of a single variable | Salary distribution across all employees |
| Numerical | Box Plot | Show spread, median, and outliers | Compensation by department (spot outliers) |
| Numerical | Line Chart | Show trends over time | Monthly attrition rate over 3 years |
Heatmaps for hidden patterns: When you need to identify hidden patterns like seasonal purchase spikes or periodic transaction anomalies, heatmaps are the best choice. They reveal patterns across two dimensions simultaneously that other charts miss.
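A minimal sketch of building the matrix behind such a heatmap; the records and attrition flags are made up.

```python
import pandas as pd

# Made-up employee records with an attrition flag.
df = pd.DataFrame({
    "department":  ["HR", "HR", "Payroll", "Payroll", "Finance", "Finance"],
    "tenure_band": ["0-1y", "1-3y", "0-1y", "1-3y", "0-1y", "1-3y"],
    "left":        [1, 0, 1, 1, 0, 0],
})

# Rows = department, columns = tenure band, cells = attrition rate:
# exactly the matrix a heatmap renders.
pivot = df.pivot_table(index="department", columns="tenure_band",
                       values="left", aggfunc="mean")
print(pivot)

# To render it (requires seaborn):
# import seaborn as sns; sns.heatmap(pivot, annot=True, cmap="Reds")
```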

🧪 EDA Checklist for AnyCompany Data

Missing Values

What percentage of each column is null? Is missingness random or systematic? Employees without performance scores may have just joined.

📏

Outliers

Are there extreme values? A salary of $1M in a team averaging $80K could be a data error or a legitimate executive record.

Class Imbalance

Is the target variable balanced? Fraud is 0.01% of transactions. Attrition might be 15%. Imbalance requires special handling.

🔗

Correlations

Which features correlate with the target? Which correlate with each other (multicollinearity)? Drop redundant features.
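The four checks above translate directly into a few pandas calls. A minimal sketch, assuming a DataFrame with a numeric binary target column:

```python
import pandas as pd

def eda_checklist(df: pd.DataFrame, target: str) -> None:
    # Missing values: fraction of nulls per column.
    print("null fraction:\n", df.isna().mean().sort_values(ascending=False))

    numeric = df.select_dtypes("number")

    # Outliers: counts beyond 1.5 * IQR, a common rule of thumb.
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outliers = ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum()
    print("outlier counts:\n", outliers)

    # Class imbalance: distribution of the target variable.
    print("target balance:\n", df[target].value_counts(normalize=True))

    # Correlations: numeric features vs the (numeric) target.
    print("correlation with target:\n",
          numeric.corr()[target].drop(target).sort_values())
```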

🎮 Data Pipeline Lab

Walk through a complete data processing pipeline for an AnyCompany ML use case. Select a scenario, then click any pipeline node to explore that stage — or auto-play to watch the full flow with animated data particles.

🎯 Select a Pipeline Scenario

👤

Employee Attrition Dataset

Build a training dataset from HR systems to predict which employees will leave.

💰

Payroll Fraud Detection

Process transaction streams to identify anomalous payroll patterns in real-time.

📄

Document Processing (OCR)

Extract structured data from scanned tax forms and compliance documents.

📋 Step 1 — Identify Data Sources: HRIS (demographics, tenure), Compensation (salary, raises), Performance (scores, reviews), Time (attendance, PTO). Four systems, one employee ID to join them.
📥 Sources (Identify Data) → 🧹 Quality (Assess Issues) → 🔗 Join (Align Data) → 🔄 Handle Missing Data → Export (ML-Ready)
📋 Pipeline Configuration
Scenario: Employee Attrition
Input format: Multiple SQL tables (HRIS, Comp, Perf, Time)
Output format: Parquet on S3
Final shape: 50,000 rows × 25 features
Target label: Binary (stayed / left)
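A minimal sketch of the join-and-export stages for this scenario; the table variables, column names, and S3 path are hypothetical, and writing to `s3://` paths assumes `s3fs` and `pyarrow` are installed.

```python
import pandas as pd

def build_attrition_dataset(hris, comp, perf, time_df):
    # Join: one employee ID aligns all four systems.
    df = (
        hris.merge(comp, on="employee_id", how="left")
            .merge(perf, on="employee_id", how="left")
            .merge(time_df, on="employee_id", how="left")
    )

    # Handle missing: new joiners may lack performance scores.
    df["performance_score"] = df["performance_score"].fillna(
        df["performance_score"].median()
    )

    # Export ML-ready: Parquet on S3.
    df.to_parquet(
        "s3://anycompany-ml-training/attrition/v1/train.parquet",
        compression="snappy",
        index=False,
    )
    return df
```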

AWS Storage Services for ML

Choosing the right storage service is a critical architectural decision that impacts training speed, cost, and operational complexity. Several factors come into play: performance requirements, scalability needs, cost constraints, data access patterns (sequential vs random), and integration with other AWS services like SageMaker.

Storage Services Overview

Click any storage node to see its details and best-fit ML use cases:

📋 Amazon S3 — Object Storage: The default data lake for ML. Flexible, scalable, cost-effective. Store training data, model artifacts, and results. Access via API, stream or copy between services. Supports Pipe mode for direct SageMaker streaming.
ML Training Job (SageMaker / EC2) reads from:
🪣 Amazon S3 (Object Storage)
💿 Amazon EBS (Block Storage)
📁 Amazon EFS (File Storage)
Amazon FSx (High-Performance File Storage)

📊 Detailed Comparison

| Service | Type | Performance | Shared Access | Best ML Use Case |
|---|---|---|---|---|
| S3 | Object | High throughput, higher latency | Yes (API-based) | Data lake, training data storage, model artifacts |
| EBS | Block | Low latency, high IOPS | Limited (multi-attach) | Fast local storage during training, checkpoints |
| EFS | File (NFS) | Scalable throughput | Yes (multi-AZ) | Shared datasets across distributed training jobs |
| FSx Lustre | File (Lustre) | Sub-ms latency, massive throughput | Yes | Large-scale distributed training, HPC workloads |

🪣 Amazon S3 Deep Dive

S3 is the foundation of most ML data architectures. Key capabilities for ML engineers:

🔌

API Access & Pipe Mode

Stream data directly to SageMaker training jobs via Pipe mode. No need to copy data to local storage first — reduces startup time. See the sketch after these cards.

💰

Storage Classes

Standard for active training data. Intelligent-Tiering for variable access. Glacier for archived datasets you rarely retrain on.

🔒

Security

Encryption at rest (SSE-S3, SSE-KMS), bucket policies, VPC endpoints. Critical for AnyCompany PII (SSNs, salaries).

📋

Versioning

Track dataset versions for reproducibility. Know exactly which data version trained which model. Essential for ML governance.
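To make the Pipe mode card concrete, here is a minimal sketch using the SageMaker Python SDK. The image URI, role, and S3 path are placeholders, and the content type assumes a RecordIO-protobuf dataset; treat this as one plausible wiring, not the only one.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri="<training-image-uri>",      # placeholder
    role="<execution-role-arn>",           # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    input_mode="Pipe",                     # stream from S3, no local copy
    sagemaker_session=sagemaker.Session(),
)

train_input = TrainingInput(
    s3_data="s3://anycompany-ml-training/attrition/v1/",  # hypothetical path
    content_type="application/x-recordio-protobuf",
    input_mode="Pipe",
)

estimator.fit({"train": train_input})
```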

🧩 Making Storage Decisions

Choosing the right storage depends on your access pattern, performance requirements, and cost constraints. Here is a decision framework for AnyCompany ML workloads.

Data Access Patterns

📋

Copy and Load

Download entire dataset to local storage before training. Simple but slow for large datasets. Use S3 + EBS.

🌊

Sequential Streaming

Stream data record-by-record during training. Memory efficient, good for large datasets. Use S3 Pipe mode.

🎲

Random Access

Read arbitrary records on demand. Required for some training algorithms. Use EBS or FSx for low-latency random reads.

🎯 Decision Matrix

| If you need... | Use... | Why |
|---|---|---|
| Store large datasets cheaply | Amazon S3 | Lowest cost per GB, unlimited scale, lifecycle policies |
| Real-time data sharing across jobs | Amazon EFS | Shared file system, concurrent access, auto-scaling |
| Fast local storage during training | Amazon EBS | Low latency, high IOPS, attached to compute instance |
| Distributed training at massive scale | Amazon FSx Lustre | Sub-ms latency, integrates with S3, parallel file system |

Data Structure Best Practices

🔄

Transform Data

Convert raw formats to ML-optimized formats before training. Do not train on raw CSVs at scale — preprocess first.

📊

Use Columnar Storage

Parquet or ORC for training data. Read only the columns (features) you need. 10x faster than scanning full CSV rows.

🗜

Compress Data

Snappy or GZIP compression reduces storage costs and network transfer time. Parquet with Snappy is the gold standard.

📁

Partition Data

Partition by date, region, or category. Enables efficient queries: "give me all India payroll data for Q4" without scanning everything.
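A minimal sketch of partitioned writes and partition-pruned reads in pandas (requires `pyarrow`); the paths and values are illustrative.

```python
import pandas as pd

# Hypothetical processed payroll frame.
df = pd.DataFrame({
    "region":  ["IN", "IN", "US", "US"],
    "quarter": ["Q4", "Q4", "Q4", "Q3"],
    "amount":  [1200.0, 950.0, 4100.0, 3900.0],
})

# Partitioned write: one directory per value, e.g. payroll/region=IN/quarter=Q4/
df.to_parquet("payroll/", partition_cols=["region", "quarter"], index=False)

# "All India payroll data for Q4" touches only that partition.
india_q4 = pd.read_parquet(
    "payroll/", filters=[("region", "=", "IN"), ("quarter", "=", "Q4")]
)
```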

AnyCompany Storage Architecture

Raw Zone (S3): Incoming data in original format (CSV, JSON, API dumps). Immutable — never modify raw data.

Processed Zone (S3, Parquet): Cleaned, transformed, partitioned by date and region. Ready for analytics.

Training Zone (S3 or FSx): ML-ready datasets with features engineered, labels attached, split into train/validation/test.
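As an illustration of how those zones might map onto S3 prefixes (the bucket name and paths are hypothetical):

```python
# Illustrative prefix conventions for the three zones; the bucket and paths
# are hypothetical.
ZONES = {
    # Raw: original format, write-once, never modified in place.
    "raw": "s3://anycompany-ml/raw/payroll/2024/06/export.json",
    # Processed: cleaned Parquet, partitioned by region and date.
    "processed": "s3://anycompany-ml/processed/payroll/region=IN/date=2024-06-14/part-0.parquet",
    # Training: feature-engineered, labeled, pre-split datasets.
    "training": "s3://anycompany-ml/training/attrition/v1/train/part-0.parquet",
}
```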

📝 Module Summary

Data Formats & Ingestion

Understand row-based vs columnar formats, batch vs streaming ingestion, and when to use each.

Visualization Methods

Choose the right chart for your data type. Use EDA to find patterns, quality issues, and inform feature selection.

AWS Storage Selection

Match storage service to access pattern and performance needs. S3 for lakes, EBS for speed, EFS for sharing, FSx for scale.