Module 3 - Interactive Explainer

Data Processing for Machine Learning

From raw data sources to ML-ready datasets — understand data types, formats, exploratory analysis, and AWS storage options that power production ML pipelines.

📦 Data Engineering ⚡ Interactive 🏢 HCM Context

📦 Data Sources for Machine Learning

Every ML project starts with data. Effective data processing lays the foundation for successful model development — the quality and relevance of your data directly impacts model performance. At AnyCompany, data flows from multiple systems — payroll engines, HR platforms, time-tracking, benefits administration, and compliance databases. Understanding where your data lives and what it looks like is the first step.

💡
Note: This module covers data ingestion and exploratory analysis. Data transformation (cleaning, feature engineering, selection) is covered in Module 4. Think of this as "understanding what you have" before "shaping it for ML."

The Data Processing Pipeline

Click any node to explore that stage. Or hit auto-play to walk through the full flow.

📋 Ingest from Sources: Collect raw data from databases, APIs, streaming platforms, and file storage. At AnyCompany, this means pulling from HRIS, payroll engines, time-tracking, and compliance systems.
📥 Ingest from Sources → 🔄 Transform (Clean & Shape) → 🔍 Validate (Quality Checks) → 🧪 ML-Ready Dataset

Where ML Data Lives

It's critical to know the characteristics of your data sources — attribute names, record counts, data types, and value ranges. A data dictionary or Data Catalog is essential. If one doesn't exist, you'll need to build it.
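If no data dictionary exists, generating a starter one is straightforward. Below is a minimal sketch in pandas; the `employees` frame and its column names are hypothetical stand-ins for an HRIS extract.

```python
import pandas as pd

# Hypothetical extract from an HRIS table; the columns are illustrative only.
employees = pd.DataFrame({
    "employee_id": [101, 102, 103],
    "department": ["Payroll", "HR", "Payroll"],
    "salary": [82000.0, 75000.0, 91000.0],
    "hire_date": pd.to_datetime(["2019-03-01", "2021-07-15", "2018-01-09"]),
})

# One row per attribute: data type, record counts, distinct values, value range.
data_dictionary = pd.DataFrame({
    "dtype": employees.dtypes.astype(str),
    "non_null": employees.notna().sum(),
    "distinct": employees.nunique(),
    "min": employees.min(),
    "max": employees.max(),
})
print(data_dictionary)
```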

🏊

Data Lakes

Centralized repositories for vast amounts of raw data in various formats — structured (CSV), unstructured (images, text), and semi-structured (JSON, XML). Flexible but requires significant preprocessing for ML use. AnyCompany DataCloud aggregates payroll and HR data from thousands of clients.

🗄

Data Warehouses

Structured data optimized for analytics — organized in well-defined schemas, efficient for complex queries and aggregations. Often uses columnar storage for improved performance. AnyCompany stores aggregated workforce analytics here.

💾

Databases

Structured storage optimized for transactional operations — organized in tables with defined relationships, efficient for frequent updates and retrievals. Relational (RDS, Aurora) and NoSQL (DynamoDB). AnyCompany stores transactional payroll records and employee profiles here.

🌊

Streaming Data

Real-time event streams via Kinesis or Kafka. Continuous generation and processing, ideal for near real-time predictions. AnyCompany captures login events, API calls, and transaction signals for real-time fraud detection.

🔌

APIs & External Sources

Third-party data feeds: labor market data, economic indicators, regulatory updates. Enriches internal data for better predictions. Choose a central storage location (like S3) and select appropriate ingestion methods.
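A minimal sketch of that centralization step with boto3, assuming a hypothetical landing bucket named `anycompany-ml-raw`: a batch export from the payroll engine is uploaded as a file, and an API-sourced payload is written directly as an object.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "anycompany-ml-raw"  # hypothetical landing bucket

# Batch file drop: copy a local payroll export into the raw zone.
s3.upload_file(
    Filename="payroll_export_2024_06.csv",
    Bucket=BUCKET,
    Key="raw/payroll/2024/06/payroll_export.csv",
)

# API-sourced data: write the payload directly as an S3 object.
labor_stats = {"region": "IN", "unemployment_rate": 4.1}  # placeholder payload
s3.put_object(
    Bucket=BUCKET,
    Key="raw/external/labor_stats/2024-06.json",
    Body=json.dumps(labor_stats).encode("utf-8"),
)
```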

Identifying High-Performing Data

Not all data is useful for ML. Evaluating data quality up front helps build the best dataset for the problem. High-performing data must meet four criteria:

🎯

Representative

Training data should reflect real-world scenarios. If 20% of employees typically leave after a year, your data should represent that rate accurately. A model trained only on US employees will fail for workforces in India or Europe.

🔗

Relevant

Include attributes that expose patterns related to your prediction goal. For attrition: membership duration, usage frequency, support interactions. Irrelevant information can negatively impact model performance.

📊

Feature Rich

A comprehensive set of diverse, relevant attributes increases the chances of discovering meaningful correlations. Demographic information might reveal important trends in employee behavior that a single feature cannot.

🔄

Consistent

Use consistent formatting, units, and metadata across datasets. Inconsistencies confuse algorithms and reduce accuracy. Data from various sources must adhere to a standardized format before training.

AnyCompany Data Quality Check

Good: 5 years of payroll transactions across 40+ countries with labeled fraud cases = representative, relevant, feature-rich, high volume.

Bad: 6 months of data from one office with no fraud labels = not representative, insufficient volume, no labels for supervised learning.

🗂 Data Types and Formats

ML algorithms are picky about their input. Data types significantly influence algorithm selection and model performance — numerical data suits linear regression, image data needs CNNs, and time series requires specialized forecasting algorithms. Understanding types, categories, and file formats helps you choose the right preprocessing pipeline.

Four Data Types in ML

📝

Text

Documents, chat logs, support tickets. AnyCompany: employee feedback surveys, support ticket descriptions, policy documents. Requires NLP preprocessing (tokenization, embeddings).

📊

Tabular

Rows and columns — the most common ML format. AnyCompany: payroll records, employee demographics, time entries. Each row is a sample, each column is a feature.

📈

Time Series

Sequential data points indexed by time. AnyCompany: monthly payroll totals, daily login patterns, quarterly attrition rates. Order matters — cannot shuffle rows.

🖼

Image

Visual data as pixel arrays. AnyCompany: scanned tax forms (W-2, I-9), ID documents for verification, signature images. Requires computer vision models.

📂 Data Categories

| Category | Structure | Examples | ML Readiness |
|---|---|---|---|
| Structured | Fixed schema, rows/columns | Payroll tables, employee records, time entries | High — directly usable |
| Semi-structured | Flexible schema, nested | JSON API responses, XML config files, log entries | Medium — needs flattening |
| Unstructured | No predefined schema | PDF documents, images, audio recordings, emails | Low — needs heavy preprocessing |
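The "needs flattening" step for semi-structured data is often a single call in pandas. A minimal sketch with a made-up nested API response: `pd.json_normalize` spreads nested objects into columns and explodes nested lists into rows.

```python
import pandas as pd

# Made-up nested API response from an HR system.
records = [
    {
        "employee_id": 101,
        "name": {"first": "Ana", "last": "Silva"},
        "time_entries": [
            {"date": "2024-06-03", "hours": 8.0},
            {"date": "2024-06-04", "hours": 7.5},
        ],
    },
]

# record_path explodes the nested list (one row per time entry);
# meta carries parent fields down onto each row.
flat = pd.json_normalize(
    records,
    record_path="time_entries",
    meta=["employee_id", ["name", "first"], ["name", "last"]],
)
print(flat)  # columns: date, hours, employee_id, name.first, name.last
```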

💾 File Formats for ML

How you store data affects read speed, storage cost, and compatibility with ML frameworks. Row-based formats are fastest for writing (each record goes to disk sequentially), while columnar formats are fastest for reading specific features (skip irrelevant columns entirely).

| Format | Type | Best For | HCM Use Case |
|---|---|---|---|
| CSV | Row-based | Simple tabular data, quick exports, human-readable | Employee data exports, payroll summaries |
| Avro | Row-based | Streaming data, schema evolution, compact binary | Real-time event streams (Kafka/Kinesis) |
| Parquet | Columnar | Analytics queries, large datasets, column selection | DataCloud analytics, historical payroll analysis |
| ORC | Columnar | Hive/Spark workloads, high compression | Data lake batch processing jobs |
| JSON/JSONL | Object notation | Nested data, API responses, flexible schemas | API logs, configuration data, audit trails |
| RecordIO | Binary | SageMaker optimized, fast training data loading | Training datasets for SageMaker built-in algorithms |
🎯
Rule of thumb: Use Parquet for analytics and ML training (columnar = fast feature selection). Use CSV for quick prototyping. Use RecordIO when training with SageMaker built-in algorithms for maximum throughput.
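A minimal sketch of the trade-off in pandas (the dataset is illustrative; Parquet support assumes `pyarrow` or `fastparquet` is installed):

```python
import pandas as pd

# Illustrative tabular dataset.
df = pd.DataFrame({
    "employee_id": range(5),
    "salary": [80000, 95000, 72000, 88000, 101000],
    "department": ["Payroll", "HR", "HR", "Payroll", "Finance"],
})

# Row-based CSV: human-readable, fine for quick prototyping and exports.
df.to_csv("employees.csv", index=False)

# Columnar Parquet with Snappy compression: compact and fast for analytics.
df.to_parquet("employees.parquet", compression="snappy", index=False)

# The columnar payoff: read only the features you need.
salaries = pd.read_parquet("employees.parquet", columns=["employee_id", "salary"])
print(salaries)
```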

🔄 Data Ingestion: Batch vs Streaming

📦

Batch Ingestion

Collected and processed in groups on a schedule (hourly, daily, weekly). AnyCompany: nightly payroll data sync, weekly HR data refresh, monthly compliance reports.

🌊

Streaming Ingestion

Processed in real-time as events occur. AnyCompany: real-time fraud detection on transactions, live login anomaly detection, instant compliance alerts.

When to Use Each

Batch: Attrition model retraining (monthly), salary benchmarking updates (quarterly), workforce analytics reports

Streaming: Payroll fraud detection (real-time), AnyCompany Assist responses (real-time), security anomaly alerts
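On the streaming side, a minimal boto3 sketch of pushing a payroll event into a Kinesis stream; the stream name and event fields are hypothetical.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# A payroll transaction event, pushed as it occurs so a downstream fraud
# model can score it in near real time. Field names are made up.
event = {
    "transaction_id": "txn-00123",
    "employee_id": 101,
    "amount": 4200.00,
    "event_time": "2024-06-14T09:21:07Z",
}

kinesis.put_record(
    StreamName="anycompany-payroll-events",  # hypothetical stream
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=str(event["employee_id"]),  # keeps one employee's events ordered
)
```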

📊 Exploratory Data Analysis (EDA)

Before training any model, you must understand your data. EDA serves as a bridge between raw data and actionable insights — helping you identify issues like deviations, skew, and outliers that would degrade model accuracy. By detecting these problems early, you reduce the chances of building a model on flawed foundations.

Three Goals of Data Visualization

Each goal contributes to more accurate and reliable models. They work together to ensure your data is well-understood before feeding it into algorithms.

🔍

Understand the Data

Grasp the overall structure and characteristics. Identify patterns, trends, and relationships between variables. A correlation matrix heatmap quickly shows which features relate most strongly to your target — guiding feature selection.

🧹

Identify Quality Issues

Visual inspection reveals problems missed in tabular formats — outliers, missing values, inconsistencies. A box plot highlights outliers that need addressing before training. Are there employees with negative tenure? Salaries of $0?

Shape the Data

Make informed preprocessing decisions based on visual insights. If a histogram shows highly skewed salary distribution, apply a log transformation to normalize it. Decide what to keep, transform, or remove.
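A minimal sketch of that skew check and log transform on synthetic salary data:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed salary distribution.
rng = np.random.default_rng(42)
salaries = pd.Series(rng.lognormal(mean=11, sigma=0.5, size=10_000))
print(f"skew before: {salaries.skew():.2f}")

# log1p compresses the long right tail toward a more symmetric shape.
salaries_log = np.log1p(salaries)
print(f"skew after:  {salaries_log.skew():.2f}")
```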

🔬 EDA Methods Explorer

Click any method to see how it applies to workforce data. Each method reveals different insights from the same AnyCompany employee dataset.

🔗 Relationship Analysis: Use scatter plots and correlation matrices to find connections between features. At AnyCompany: plot salary vs tenure, performance vs attrition risk, or overtime hours vs satisfaction score. Strong correlations become powerful ML features.
EDA Methods:
📉 Relationships: Scatter • Correlation • Pair Plot
📊 Distributions: Histogram • Box Plot • KDE
⚖️ Comparisons: Grouped Bar • Violin • Swarm
🗺️ Composition: Heatmap • Stacked • Treemap
🔗 Relationship Analysis — HCM Application
Chart Types: Scatter plot, Correlation matrix, Pair plot
Question: Which features predict attrition?
HCM Example: salary_percentile vs months_since_promotion (r = -0.42)
Insight: Underpaid + stalled employees leave 3x more often
Action: Create interaction feature: compensation_gap × promotion_delay
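A minimal sketch of this relationship analysis in pandas; the frame is tiny and fabricated, but the column names follow the panel above.

```python
import pandas as pd

# Tiny fabricated frame; column names follow the panel above.
df = pd.DataFrame({
    "salary_percentile":      [0.30, 0.80, 0.45, 0.90, 0.20],
    "months_since_promotion": [30, 6, 22, 4, 36],
    "left_company":           [1, 0, 1, 0, 1],  # binary target
})

# Pairwise Pearson correlations; the target column shows which features
# relate most strongly to attrition.
print(df.corr()["left_company"].sort_values())

# The interaction feature from the Action row: underpaid AND stalled.
df["compensation_gap_x_promotion_delay"] = (
    (1 - df["salary_percentile"]) * df["months_since_promotion"]
)
```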

📉 Visualization by Data Type

| Data Type | Chart Type | Purpose | HCM Example |
|---|---|---|---|
| Categorical | Bar Chart | Compare counts across categories | Attrition count by department |
| Categorical | Pie Chart | Show proportions of a whole | Employee distribution by role type |
| Categorical | Heatmap | Show patterns across two categories | Attrition rate by department × tenure band |
| Numerical | Scatter Plot | Show relationships between two variables | Salary vs years of experience |
| Numerical | Histogram | Show distribution of a single variable | Salary distribution across all employees |
| Numerical | Box Plot | Show spread, median, and outliers | Compensation by department (spot outliers) |
| Numerical | Line Chart | Show trends over time | Monthly attrition rate over 3 years |
Heatmaps for hidden patterns: When you need to identify hidden patterns like seasonal purchase spikes or periodic transaction anomalies, heatmaps are the best choice. They reveal patterns across two dimensions simultaneously that other charts miss.
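A minimal sketch of building the matrix behind such a heatmap; the records and attrition flags are made up.

```python
import pandas as pd

# Made-up employee records with an attrition flag.
df = pd.DataFrame({
    "department":  ["HR", "HR", "Payroll", "Payroll", "Finance", "Finance"],
    "tenure_band": ["0-1y", "1-3y", "0-1y", "1-3y", "0-1y", "1-3y"],
    "left":        [1, 0, 1, 1, 0, 0],
})

# Rows = department, columns = tenure band, cells = attrition rate:
# exactly the matrix a heatmap renders.
pivot = df.pivot_table(index="department", columns="tenure_band",
                       values="left", aggfunc="mean")
print(pivot)

# To render it (requires seaborn):
# import seaborn as sns; sns.heatmap(pivot, annot=True, cmap="Reds")
```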

🧪 EDA Checklist for AnyCompany Data

Missing Values

What percentage of each column is null? Is missingness random or systematic? Employees without performance scores may have just joined.

📏

Outliers

Are there extreme values? A salary of $1M in a team averaging $80K could be a data error or a legitimate executive record.

Class Imbalance

Is the target variable balanced? Fraud is 0.01% of transactions. Attrition might be 15%. Imbalance requires special handling.

🔗

Correlations

Which features correlate with the target? Which correlate with each other (multicollinearity)? Drop redundant features.
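The four checks above translate directly into a few pandas calls. A minimal sketch, assuming a DataFrame with a numeric binary target column:

```python
import pandas as pd

def eda_checklist(df: pd.DataFrame, target: str) -> None:
    # Missing values: fraction of nulls per column.
    print("null fraction:\n", df.isna().mean().sort_values(ascending=False))

    numeric = df.select_dtypes("number")

    # Outliers: counts beyond 1.5 * IQR, a common rule of thumb.
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    outliers = ((numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)).sum()
    print("outlier counts:\n", outliers)

    # Class imbalance: distribution of the target variable.
    print("target balance:\n", df[target].value_counts(normalize=True))

    # Correlations: numeric features vs the (numeric) target.
    print("correlation with target:\n",
          numeric.corr()[target].drop(target).sort_values())
```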

🎮 Data Pipeline Lab

Walk through a complete data processing pipeline for an AnyCompany ML use case. Select a scenario, then click any pipeline node to explore that stage — or auto-play to watch the full flow with animated data particles.

🎯 Select a Pipeline Scenario

👤

Employee Attrition Dataset

Build a training dataset from HR systems to predict which employees will leave.

💰

Payroll Fraud Detection

Process transaction streams to identify anomalous payroll patterns in real-time.

📄

Document Processing (OCR)

Extract structured data from scanned tax forms and compliance documents.

📋 Step 1 — Identify Data Sources: HRIS (demographics, tenure), Compensation (salary, raises), Performance (scores, reviews), Time (attendance, PTO). Four systems, one employee ID to join them.
📥 Sources (Identify Data) → 🧹 Quality (Assess Issues) → 🔗 Join (Align Data) → 🔄 Handle Missing Data → Export (ML-Ready)
📋 Pipeline Configuration
Scenario: Employee Attrition
Input format: Multiple SQL tables (HRIS, Comp, Perf, Time)
Output format: Parquet on S3
Final shape: 50,000 rows × 25 features
Target label: Binary (stayed / left)
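A minimal sketch of the join-and-export stages for this scenario; the table variables, column names, and S3 path are hypothetical, and writing to `s3://` paths assumes `s3fs` and `pyarrow` are installed.

```python
import pandas as pd

def build_attrition_dataset(hris, comp, perf, time_df):
    # Join: one employee ID aligns all four systems.
    df = (
        hris.merge(comp, on="employee_id", how="left")
            .merge(perf, on="employee_id", how="left")
            .merge(time_df, on="employee_id", how="left")
    )

    # Handle missing: new joiners may lack performance scores.
    df["performance_score"] = df["performance_score"].fillna(
        df["performance_score"].median()
    )

    # Export ML-ready: Parquet on S3.
    df.to_parquet(
        "s3://anycompany-ml-training/attrition/v1/train.parquet",
        compression="snappy",
        index=False,
    )
    return df
```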

AWS Storage Services for ML

Choosing the right storage service is a critical architectural decision that impacts training speed, cost, and operational complexity. Several factors come into play: performance requirements, scalability needs, cost constraints, data access patterns (sequential vs random), and integration with other AWS services like SageMaker.

Storage Services Overview

Click any storage node to see its details and best-fit ML use cases:

📋 Amazon S3 — Object Storage: The default data lake for ML. Flexible, scalable, cost-effective. Store training data, model artifacts, and results. Access via API, stream or copy between services. Supports Pipe mode for direct SageMaker streaming.
ML Training Job (SageMaker / EC2) reads from:
🪣 Amazon S3 (Object Storage)
💿 Amazon EBS (Block Storage)
📁 Amazon EFS (File Storage)
Amazon FSx (High-Performance File Storage)

📊 Detailed Comparison

| Service | Type | Performance | Shared Access | Best ML Use Case |
|---|---|---|---|---|
| S3 | Object | High throughput, higher latency | Yes (API-based) | Data lake, training data storage, model artifacts |
| EBS | Block | Low latency, high IOPS | Limited (multi-attach) | Fast local storage during training, checkpoints |
| EFS | File (NFS) | Scalable throughput | Yes (multi-AZ) | Shared datasets across distributed training jobs |
| FSx Lustre | File (Lustre) | Sub-ms latency, massive throughput | Yes | Large-scale distributed training, HPC workloads |

🪣 Amazon S3 Deep Dive

S3 is the foundation of most ML data architectures. Key capabilities for ML engineers:

🔌

API Access & Pipe Mode

Stream data directly to SageMaker training jobs via Pipe mode. No need to copy data to local storage first — reduces startup time. See the sketch after these cards.

💰

Storage Classes

Standard for active training data. Intelligent-Tiering for variable access. Glacier for archived datasets you rarely retrain on.

🔒

Security

Encryption at rest (SSE-S3, SSE-KMS), bucket policies, VPC endpoints. Critical for AnyCompany PII (SSNs, salaries).

📋

Versioning

Track dataset versions for reproducibility. Know exactly which data version trained which model. Essential for ML governance.
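To make the Pipe mode card concrete, here is a minimal sketch using the SageMaker Python SDK. The image URI, role, and S3 path are placeholders, and the content type assumes a RecordIO-protobuf dataset; treat this as one plausible wiring, not the only one.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    image_uri="<training-image-uri>",      # placeholder
    role="<execution-role-arn>",           # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    input_mode="Pipe",                     # stream from S3, no local copy
    sagemaker_session=sagemaker.Session(),
)

train_input = TrainingInput(
    s3_data="s3://anycompany-ml-training/attrition/v1/",  # hypothetical path
    content_type="application/x-recordio-protobuf",
    input_mode="Pipe",
)

estimator.fit({"train": train_input})
```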

🧩 Making Storage Decisions

Choosing the right storage depends on your access pattern, performance requirements, and cost constraints. Here is a decision framework for AnyCompany ML workloads.

Data Access Patterns

📋

Copy and Load

Download entire dataset to local storage before training. Simple but slow for large datasets. Use S3 + EBS.

🌊

Sequential Streaming

Stream data record-by-record during training. Memory efficient, good for large datasets. Use S3 Pipe mode.

🎲

Random Access

Read arbitrary records on demand. Required for some training algorithms. Use EBS or FSx for low-latency random reads.

🎯 Decision Matrix

| If you need... | Use... | Why |
|---|---|---|
| Store large datasets cheaply | Amazon S3 | Lowest cost per GB, unlimited scale, lifecycle policies |
| Real-time data sharing across jobs | Amazon EFS | Shared file system, concurrent access, auto-scaling |
| Fast local storage during training | Amazon EBS | Low latency, high IOPS, attached to compute instance |
| Distributed training at massive scale | Amazon FSx Lustre | Sub-ms latency, integrates with S3, parallel file system |

Data Structure Best Practices

🔄

Transform Data

Convert raw formats to ML-optimized formats before training. Do not train on raw CSVs at scale — preprocess first.

📊

Use Columnar Storage

Parquet or ORC for training data. Read only the columns (features) you need. 10x faster than scanning full CSV rows.

🗜

Compress Data

Snappy or GZIP compression reduces storage costs and network transfer time. Parquet with Snappy is the gold standard.

📁

Partition Data

Partition by date, region, or category. Enables efficient queries: "give me all India payroll data for Q4" without scanning everything.
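A minimal sketch of partitioned writes and partition-pruned reads in pandas (requires `pyarrow`); the paths and values are illustrative.

```python
import pandas as pd

# Hypothetical processed payroll frame.
df = pd.DataFrame({
    "region":  ["IN", "IN", "US", "US"],
    "quarter": ["Q4", "Q4", "Q4", "Q3"],
    "amount":  [1200.0, 950.0, 4100.0, 3900.0],
})

# Partitioned write: one directory per value, e.g. payroll/region=IN/quarter=Q4/
df.to_parquet("payroll/", partition_cols=["region", "quarter"], index=False)

# "All India payroll data for Q4" touches only that partition.
india_q4 = pd.read_parquet(
    "payroll/", filters=[("region", "=", "IN"), ("quarter", "=", "Q4")]
)
```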

AnyCompany Storage Architecture

Raw Zone (S3): Incoming data in original format (CSV, JSON, API dumps). Immutable — never modify raw data.

Processed Zone (S3, Parquet): Cleaned, transformed, partitioned by date and region. Ready for analytics.

Training Zone (S3 or FSx): ML-ready datasets with features engineered, labels attached, split into train/validation/test.
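As an illustration of how those zones might map onto S3 prefixes (the bucket name and paths are hypothetical):

```python
# Illustrative prefix conventions for the three zones; the bucket and paths
# are hypothetical.
ZONES = {
    # Raw: original format, write-once, never modified in place.
    "raw": "s3://anycompany-ml/raw/payroll/2024/06/export.json",
    # Processed: cleaned Parquet, partitioned by region and date.
    "processed": "s3://anycompany-ml/processed/payroll/region=IN/date=2024-06-14/part-0.parquet",
    # Training: feature-engineered, labeled, pre-split datasets.
    "training": "s3://anycompany-ml/training/attrition/v1/train/part-0.parquet",
}
```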

📝 Module Summary

Data Formats & Ingestion

Understand row-based vs columnar formats, batch vs streaming ingestion, and when to use each.

Visualization Methods

Choose the right chart for your data type. Use EDA to find patterns, quality issues, and inform feature selection.

AWS Storage Selection

Match storage service to access pattern and performance needs. S3 for lakes, EBS for speed, EFS for sharing, FSx for scale.