Module 3 - Interactive Explainer
From raw data sources to ML-ready datasets — understand data types, formats, exploratory analysis, and AWS storage options that power production ML pipelines.
Every ML project starts with data. Effective data processing lays the foundation for successful model development — the quality and relevance of your data directly impacts model performance. At AnyCompany, data flows from multiple systems — payroll engines, HR platforms, time-tracking, benefits administration, and compliance databases. Understanding where your data lives and what it looks like is the first step.
Click any node to explore that stage. Or hit auto-play to walk through the full flow.
It's critical to know the characteristics of your data sources — attribute names, record counts, data types, and value ranges. A data dictionary or Data Catalog is essential. If one doesn't exist, you'll need to build it.
Centralized repositories for vast amounts of raw data in various formats — structured (CSV), unstructured (images, text), and semi-structured (JSON, XML). Flexible but requires significant preprocessing for ML use. AnyCompany DataCloud aggregates payroll and HR data from thousands of clients.
Structured data optimized for analytics — organized in well-defined schemas, efficient for complex queries and aggregations. Often uses columnar storage for improved performance. AnyCompany stores aggregated workforce analytics here.
Structured storage optimized for transactional operations — organized in tables with defined relationships, efficient for frequent updates and retrievals. Relational (RDS, Aurora) and NoSQL (DynamoDB). AnyCompany stores transactional payroll records and employee profiles here.
Real-time event streams via Kinesis or Kafka. Continuous generation and processing, ideal for near real-time predictions. AnyCompany captures login events, API calls, and transaction signals for real-time fraud detection.
Third-party data feeds: labor market data, economic indicators, regulatory updates. Enriches internal data for better predictions. Choose a central storage location (like S3) and select appropriate ingestion methods.
Not all data is useful for ML. Evaluating data quality up front helps you build the best dataset for the problem. Data that supports high-performing models must meet four criteria:
Training data should reflect real-world scenarios. If 20% of employees typically leave after a year, your data should represent this rate accurately. A model trained only on US employees will generalize poorly to India or Europe.
Include attributes that expose patterns related to your prediction goal. For attrition: membership duration, usage frequency, support interactions. Irrelevant information can negatively impact model performance.
A comprehensive set of diverse, relevant attributes increases the chances of discovering meaningful correlations. Demographic information might reveal important trends in employee behavior that a single feature cannot.
Use consistent formatting, units, and metadata across datasets. Inconsistencies confuse algorithms and reduce accuracy. Data from various sources must adhere to a standardized format before training.
Good: 5 years of payroll transactions across 40+ countries with labeled fraud cases = representative, relevant, feature-rich, high volume.
Bad: 6 months of data from one office with no fraud labels = not representative, insufficient volume, no labels for supervised learning.
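These criteria can be checked programmatically before any modeling begins. A minimal sketch using pandas, with a hypothetical mini-dataset standing in for AnyCompany payroll transactions (column names are illustrative, not the real schema):

```python
import pandas as pd

# Hypothetical mini-dataset standing in for AnyCompany payroll transactions.
df = pd.DataFrame({
    "country": ["US", "US", "IN", "DE", "US", "IN"],
    "amount": [1200.0, 950.0, 300.0, 1100.0, 15000.0, 280.0],
    "is_fraud": [0, 0, 0, 0, 1, 0],
})

# Representative: which countries are covered, and in what proportion?
country_share = df["country"].value_counts(normalize=True)

# Labeled: supervised learning needs a complete target column.
has_labels = "is_fraud" in df.columns and df["is_fraud"].notna().all()

# Balance: a heavily imbalanced target needs special handling later.
fraud_rate = df["is_fraud"].mean()

print(country_share.to_dict())
print(bool(has_labels), round(fraud_rate, 3))
```

Running checks like these on each refresh turns "good vs bad dataset" from a judgment call into a repeatable gate.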
ML algorithms are picky about their input. Data types significantly influence algorithm selection and model performance — numerical data suits linear regression, image data needs CNNs, and time series requires specialized forecasting algorithms. Understanding types, categories, and file formats helps you choose the right preprocessing pipeline.
Documents, chat logs, support tickets. AnyCompany: employee feedback surveys, support ticket descriptions, policy documents. Requires NLP preprocessing (tokenization, embeddings).
Rows and columns — the most common ML format. AnyCompany: payroll records, employee demographics, time entries. Each row is a sample, each column is a feature.
Sequential data points indexed by time. AnyCompany: monthly payroll totals, daily login patterns, quarterly attrition rates. Order matters — cannot shuffle rows.
Visual data as pixel arrays. AnyCompany: scanned tax forms (W-2, I-9), ID documents for verification, signature images. Requires computer vision models.
| Category | Structure | Examples | ML Readiness |
|---|---|---|---|
| Structured | Fixed schema, rows/columns | Payroll tables, employee records, time entries | High — directly usable |
| Semi-structured | Flexible schema, nested | JSON API responses, XML config files, log entries | Medium — needs flattening |
| Unstructured | No predefined schema | PDF documents, images, audio recordings, emails | Low — needs heavy preprocessing |
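The "needs flattening" step for semi-structured data is often a one-liner. A sketch with `pandas.json_normalize`, using a made-up API response shape (the field names are assumptions, not a real AnyCompany payload):

```python
import pandas as pd

# Hypothetical semi-structured API response, e.g. from an HR platform.
records = [
    {"id": 1, "name": "Ana", "job": {"dept": "Payroll", "grade": 3},
     "locations": ["US"]},
    {"id": 2, "name": "Raj", "job": {"dept": "Support", "grade": 2},
     "locations": ["IN", "SG"]},
]

# Flatten nested objects into columns: job.dept -> job_dept, job.grade -> job_grade.
flat = pd.json_normalize(records, sep="_")
print(list(flat.columns))
```

Nested objects become flat columns, while list-valued fields like `locations` stay as objects and typically need a further explode or encoding step.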
How you store data affects read speed, storage cost, and compatibility with ML frameworks. Row-based formats are fastest for writing (each record goes to disk sequentially), while columnar formats are fastest for reading specific features (skip irrelevant columns entirely).
| Format | Type | Best For | HCM Use Case |
|---|---|---|---|
| CSV | Row-based | Simple tabular data, quick exports, human-readable | Employee data exports, payroll summaries |
| Avro | Row-based | Streaming data, schema evolution, compact binary | Real-time event streams (Kafka/Kinesis) |
| Parquet | Columnar | Analytics queries, large datasets, column selection | DataCloud analytics, historical payroll analysis |
| ORC | Columnar | Hive/Spark workloads, high compression | Data lake batch processing jobs |
| JSON/JSONL | Object notation | Nested data, API responses, flexible schemas | API logs, configuration data, audit trails |
| RecordIO | Binary | SageMaker optimized, fast training data loading | Training datasets for SageMaker built-in algorithms |
Collected and processed in groups on a schedule (hourly, daily, weekly). AnyCompany: nightly payroll data sync, weekly HR data refresh, monthly compliance reports.
Processed in real-time as events occur. AnyCompany: real-time fraud detection on transactions, live login anomaly detection, instant compliance alerts.
Batch: Attrition model retraining (monthly), salary benchmarking updates (quarterly), workforce analytics reports
Streaming: Payroll fraud detection (real-time), AnyCompany Assist responses (real-time), security anomaly alerts
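The batch/streaming split above can be contrasted with a toy sketch in pure Python (the event shape and threshold are hypothetical): batch aggregates a whole window of records at once, while streaming scores each record as it arrives.

```python
from typing import Iterable, List

def batch_process(events: List[dict]) -> dict:
    """Batch: aggregate an accumulated window of events (e.g. a nightly sync)."""
    total = sum(e["amount"] for e in events)
    return {"count": len(events), "total": total}

def stream_process(events: Iterable[dict], threshold: float) -> List[dict]:
    """Streaming: evaluate each event on arrival (e.g. real-time fraud checks)."""
    flagged = []
    for e in events:  # in production this loop would consume Kinesis/Kafka records
        if e["amount"] > threshold:
            flagged.append(e)
    return flagged

events = [{"id": 1, "amount": 1200.0}, {"id": 2, "amount": 98000.0}]
print(batch_process(events))
print(stream_process(events, threshold=10000.0))
```

The design trade-off: batch gets full-window context (totals, averages) at the cost of latency; streaming gets per-event latency at the cost of only seeing one record (plus whatever state you maintain) at a time.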
Before training any model, you must understand your data. EDA serves as a bridge between raw data and actionable insights — helping you identify issues like missing values, skewed distributions, and outliers that would degrade model accuracy. By detecting these problems early, you reduce the chances of building a model on flawed foundations.
Each goal contributes to more accurate and reliable models. They work together to ensure your data is well-understood before feeding it into algorithms.
Grasp the overall structure and characteristics. Identify patterns, trends, and relationships between variables. A correlation matrix heatmap quickly shows which features relate most strongly to your target — guiding feature selection.
Visual inspection reveals problems missed in tabular formats — outliers, missing values, inconsistencies. A box plot highlights outliers that need addressing before training. Are there employees with negative tenure? Salaries of $0?
Make informed preprocessing decisions based on visual insights. If a histogram shows highly skewed salary distribution, apply a log transformation to normalize it. Decide what to keep, transform, or remove.
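As a small illustration of the skew fix mentioned above, a log transform pulls in a long right tail. This sketch uses synthetic lognormal "salaries", not real AnyCompany data:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed salary sample (lognormal, seeded for reproducibility).
rng = np.random.default_rng(42)
salaries = pd.Series(rng.lognormal(mean=11.0, sigma=0.6, size=500))

skew_before = salaries.skew()
# log1p compresses the long right tail and is safe at zero.
log_salaries = np.log1p(salaries)
skew_after = log_salaries.skew()

print(round(skew_before, 2), round(skew_after, 2))
```

After the transform the distribution is close to symmetric, which many algorithms (and many scaling steps) handle much better than a heavy-tailed original.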
Click any method to see how it applies to workforce data. Each method reveals different insights from the same AnyCompany employee dataset.
| Data Type | Chart Type | Purpose | HCM Example |
|---|---|---|---|
| Categorical | Bar Chart | Compare counts across categories | Attrition count by department |
| Categorical | Pie Chart | Show proportions of a whole | Employee distribution by role type |
| Categorical | Heatmap | Show patterns across two categories | Attrition rate by department x tenure band |
| Numerical | Scatter Plot | Show relationships between two variables | Salary vs years of experience |
| Numerical | Histogram | Show distribution of a single variable | Salary distribution across all employees |
| Numerical | Box Plot | Show spread, median, and outliers | Compensation by department (spot outliers) |
| Numerical | Line Chart | Show trends over time | Monthly attrition rate over 3 years |
What percentage of each column is null? Is missingness random or systematic? Employees without performance scores may have just joined.
Are there extreme values? A salary of $1M in a team averaging $80K could be a data error or a legitimate executive record.
Is the target variable balanced? Fraud is 0.01% of transactions. Attrition might be 15%. Imbalance requires special handling.
Which features correlate with the target? Which correlate with each other (multicollinearity)? Drop redundant features.
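The four checks above fit in a short pandas script. A sketch on a hypothetical employee snapshot with quality problems deliberately baked in (negative tenure, a missing value, an extreme salary):

```python
import numpy as np
import pandas as pd

# Hypothetical employee snapshot with common quality problems baked in.
df = pd.DataFrame({
    "tenure_years": [1.0, 3.5, 2.0, np.nan, 4.0, -2.0],
    "salary": [80000, 82000, 79000, 81000, 1_000_000, 78000],
    "left_company": [0, 0, 1, 0, 0, 0],
})

# 1. Missing values: percent null per column.
missing_pct = df.isna().mean() * 100

# 2. Outliers: flag salaries beyond 1.5 * IQR.
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)]

# 3. Class balance: how rare is the positive target?
attrition_rate = df["left_company"].mean()

# 4. Correlations between numeric features and the target.
corr_with_target = df.corr()["left_company"]

print(missing_pct.round(1).to_dict())
print(len(outliers), round(attrition_rate, 3))
```

Each flagged record still needs a human judgment call: the $1M salary could be an error or a legitimate executive, and the negative tenure is almost certainly a data-entry bug.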
Walk through a complete data processing pipeline for an AnyCompany ML use case. Select a scenario, then click any pipeline node to explore that stage — or auto-play to watch the full flow with animated data particles.
Choosing the right storage service is a critical architectural decision that impacts training speed, cost, and operational complexity. Several factors come into play: performance requirements, scalability needs, cost constraints, data access patterns (sequential vs random), and integration with other AWS services like SageMaker.
Click any storage node to see its details and best-fit ML use cases:
| Service | Type | Performance | Shared Access | Best ML Use Case |
|---|---|---|---|---|
| S3 | Object | High throughput, higher latency | Yes (API-based) | Data lake, training data storage, model artifacts |
| EBS | Block | Low latency, high IOPS | Limited (multi-attach) | Fast local storage during training, checkpoints |
| EFS | File (NFS) | Scalable throughput | Yes (multi-AZ) | Shared datasets across distributed training jobs |
| FSx for Lustre | File (Lustre) | Sub-ms latency, massive throughput | Yes | Large-scale distributed training, HPC workloads |
S3 is the foundation of most ML data architectures. Key capabilities for ML engineers:
Stream data directly to SageMaker training jobs via Pipe mode. No need to copy data to local storage first — reduces startup time.
Standard for active training data. Intelligent-Tiering for variable access. Glacier for archived datasets you rarely retrain on.
Encryption at rest (SSE-S3, SSE-KMS), bucket policies, VPC endpoints. Critical for AnyCompany PII (SSNs, salaries).
Track dataset versions for reproducibility. Know exactly which data version trained which model. Essential for ML governance.
Choosing the right storage depends on your access pattern, performance requirements, and cost constraints. Here is a decision framework for AnyCompany ML workloads.
Download entire dataset to local storage before training. Simple but slow for large datasets. Use S3 + EBS.
Stream data record-by-record during training. Memory efficient, good for large datasets. Use S3 Pipe mode.
Read arbitrary records on demand. Required for some training algorithms. Use EBS or FSx for low-latency random reads.
| If you need... | Use... | Why |
|---|---|---|
| Store large datasets cheaply | Amazon S3 | Lowest cost per GB, unlimited scale, lifecycle policies |
| Real-time data sharing across jobs | Amazon EFS | Shared file system, concurrent access, auto-scaling |
| Fast local storage during training | Amazon EBS | Low latency, high IOPS, attached to compute instance |
| Distributed training at massive scale | Amazon FSx for Lustre | Sub-ms latency, integrates with S3, parallel file system |
Convert raw formats to ML-optimized formats before training. Do not train on raw CSVs at scale — preprocess first.
Parquet or ORC for training data. Read only the columns (features) you need — often an order of magnitude faster than scanning full CSV rows.
Snappy or GZIP compression reduces storage costs and network transfer time. Parquet with Snappy is a widely used default for ML data lakes.
Partition by date, region, or category. Enables efficient queries: "give me all India payroll data for Q4" without scanning everything.
Raw Zone (S3): Incoming data in original format (CSV, JSON, API dumps). Immutable — never modify raw data.
Processed Zone (S3, Parquet): Cleaned, transformed, partitioned by date and region. Ready for analytics.
Training Zone (S3 or FSx): ML-ready datasets with features engineered, labels attached, split into train/validation/test.
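A consistent key convention makes the three-zone layout enforceable in code. A minimal sketch, assuming a made-up naming scheme (the zone names match the layout above; the `dt=`/`region=` pattern is one common convention, not a fixed standard):

```python
from datetime import date

def s3_key(zone: str, dataset: str, dt: date, region: str, filename: str) -> str:
    """Build a partitioned S3 key, e.g. processed/payroll/dt=2024-10-01/region=IN/part-0.parquet."""
    assert zone in {"raw", "processed", "training"}, f"unknown zone: {zone}"
    return f"{zone}/{dataset}/dt={dt.isoformat()}/region={region}/{filename}"

key = s3_key("processed", "payroll", date(2024, 10, 1), "IN", "part-0.parquet")
print(key)
```

Centralizing key construction in one helper keeps every pipeline writing to the same layout, which is what makes downstream partition pruning and lifecycle policies work.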
Understand row-based vs columnar formats, batch vs streaming ingestion, and when to use each.
Choose the right chart for your data type. Use EDA to find patterns, quality issues, and inform feature selection.
Match storage service to access pattern and performance needs. S3 for lakes, EBS for speed, EFS for sharing, FSx for scale.