Data Engineering for ML: Pipelines, Annotation, and Data Quality
"We have lots of data" — phrase often meaning "we have lots of raw logs in S3 that nobody touched two years ago." Before training models, need to understand what exists: structure, duplicates, schema changes, representativeness.
Data engineering for ML is not just ETL. It is building reproducible data infrastructure that makes model training reliable and retraining predictable.
ETL Pipelines for ML: Differences from BI
ETL for analytics and ETL for ML solve different problems. Analytics needs aggregates; ML needs individual records with history. Analytics can live without a train/val/test split; for ML it is critical. In analytics, skew hinders interpretation; in ML it directly degrades model quality.
Tools. Apache Spark for large volumes (10GB+): PySpark DataFrames, optimizations via partitioning and caching. dbt for transformations on top of a DWH (Snowflake, BigQuery, Redshift): declarative, versioned, tested. Pandas or Polars for datasets under a few GB; Polars is 5-10x faster on typical transformations.
Temporal splits. With temporal data (transactions, user events), split by time, not randomly: a random split leaks "future" data into training. Rule: train on T1-T2, validate on T2-T3 (with a gap to prevent leakage), test on T3-T4.
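A temporal split with a gap can be sketched in plain Python (the dates, gap width, and record format are illustrative):

```python
from datetime import date, timedelta

def temporal_split(records, train_end, test_start, gap=timedelta(days=7)):
    """Split time-stamped records into train/val/test with a leakage gap.

    records: iterable of (timestamp, payload) tuples.
    Train: t < train_end; val: train_end + gap <= t < test_start;
    test: t >= test_start. Records falling inside the gap are dropped.
    """
    train, val, test = [], [], []
    for ts, payload in records:
        if ts < train_end:
            train.append(payload)
        elif train_end + gap <= ts < test_start:
            val.append(payload)
        elif ts >= test_start:
            test.append(payload)
    return train, val, test

records = [
    (date(2024, 1, 10), "a"),
    (date(2024, 2, 5), "b"),   # lands in the gap, dropped
    (date(2024, 2, 20), "c"),
    (date(2024, 4, 1), "d"),
]
train, val, test = temporal_split(
    records, train_end=date(2024, 2, 1), test_start=date(2024, 3, 1)
)
```

Dropping the gap records is the point: examples too close to the train boundary often share context with training data.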
Incremental pipelines. A model retrained weekly on new data needs a pipeline that incrementally appends new records instead of reloading everything. Delta Lake and Apache Iceberg are table formats with ACID transactions, Change Data Capture, and time travel; data lives in S3/GCS and is read via Spark or DuckDB.
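Conceptually, an incremental load reduces to an upsert: update rows whose keys match, insert the rest. A minimal in-memory sketch of these MERGE semantics (a dict stands in for the table; real formats do this transactionally with versioned snapshots):

```python
def merge_increment(table: dict, increment: list, key: str = "id") -> dict:
    """Upsert new records into an existing table instead of a full reload.

    Matching keys are updated, unmatched keys are inserted; the
    original table is left untouched (the "previous snapshot").
    """
    merged = dict(table)
    for row in increment:
        merged[row[key]] = row
    return merged

table = {1: {"id": 1, "amount": 10}, 2: {"id": 2, "amount": 20}}
increment = [{"id": 2, "amount": 25}, {"id": 3, "amount": 30}]
new_table = merge_increment(table, increment)
```

The untouched input snapshot is what makes time travel possible: each merge produces a new table version instead of mutating the old one.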
Feature Engineering and Feature Store
Feature Store solves training-serving skew. One of the most insidious ML errors: a feature computed differently in training and in production. The model learns on "correct" data; inference receives something else.
Feast (open source) — offline store on Parquet/Delta in S3 for training, online store on Redis for <10ms inference. Feature definitions as Python code:
```python
from datetime import timedelta

from feast import FeatureView, Field
from feast.types import Float32, Int64

# user_features_source is assumed to be defined elsewhere (e.g. a FileSource)
user_features = FeatureView(
    name="user_features",
    entities=["user_id"],
    schema=[
        Field(name="purchase_count_7d", dtype=Int64),
        Field(name="avg_session_duration", dtype=Float32),
    ],
    ttl=timedelta(days=7),
    source=user_features_source,
)
```
One definition, used everywhere. No mismatches.
Streaming features. When a feature must update in real time (transaction count in the last 10 minutes), you need stream processing: Apache Kafka plus Apache Flink or Kafka Streams compute features continuously and push them to the online store. More complex and more expensive; worth it only when feature staleness is critical to quality.
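The core of such a feature (count of events in the last 10 minutes) is a sliding window. A minimal single-process sketch of the per-key state that Flink or Kafka Streams would maintain:

```python
from collections import deque
from datetime import datetime, timedelta

class SlidingCounter:
    """Count events per key within a fixed time window."""

    def __init__(self, window=timedelta(minutes=10)):
        self.window = window
        self.events = {}  # key -> deque of timestamps

    def add(self, key, ts):
        self.events.setdefault(key, deque()).append(ts)

    def count(self, key, now):
        q = self.events.get(key, deque())
        while q and now - q[0] > self.window:
            q.popleft()  # evict events older than the window
        return len(q)

c = SlidingCounter()
t0 = datetime(2024, 1, 1, 12, 0)
c.add("user_1", t0)
c.add("user_1", t0 + timedelta(minutes=8))
c.add("user_1", t0 + timedelta(minutes=12))
# At t0+12min only the 8min and 12min events are inside the window.
```

A real stream processor adds exactly what this sketch lacks: partitioning by key, fault-tolerant state, and event-time handling for late arrivals.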
Data Annotation
Annotation is the most labor-intensive and most underestimated part of an ML project. No architecture fixes poorly annotated data.
Label Studio — open source, supports images (bounding box, polygon, segmentation), text (NER, classification), audio, video. Deploys in 10 minutes via Docker. For small teams — first choice.
Annotation quality assessment. Inter-annotator agreement measures how consistently annotators label the same examples. Cohen's Kappa > 0.8 is good, 0.6-0.8 acceptable, < 0.6 means the task is ambiguous or the instructions are poor. Double-annotating 10-20% of examples with independent annotators is mandatory.
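Cohen's Kappa for two annotators is straightforward to compute from scratch (the labels below are illustrative):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Kappa = (p_o - p_e) / (1 - p_e): observed vs chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of examples both annotators label the same.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "pos", "neg", "pos", "pos", "neg"]
kappa = cohens_kappa(a, b)  # p_o = 5/6, p_e = 1/2 -> kappa = 2/3
```

Here the annotators agree on 5 of 6 examples, yet kappa is only about 0.67: chance agreement eats into the raw score, which is exactly why kappa, not raw agreement, is the standard.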
Active learning. Don't annotate random examples; select those the model is most uncertain about (low confidence, high uncertainty). Cycle: train a baseline → find uncertain examples → annotate → retrain. Reaches the same quality with 50-70% of the annotation volume. modAL, Prodigy, and Label Studio support active learning.
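The selection step itself is simple. A sketch of least-confidence sampling: given each example's predicted class probabilities, annotate the ones where the model's top probability is lowest:

```python
def least_confident(probs, k):
    """Return indices of the k examples with the lowest max class probability.

    probs[i] is the model's class-probability vector for example i.
    """
    uncertainty = [(1.0 - max(p), i) for i, p in enumerate(probs)]
    uncertainty.sort(reverse=True)  # most uncertain first
    return [i for _, i in uncertainty[:k]]

probs = [
    [0.95, 0.05],  # confident
    [0.55, 0.45],  # uncertain
    [0.80, 0.20],
    [0.51, 0.49],  # most uncertain
]
to_annotate = least_confident(probs, k=2)  # -> [3, 1]
```

Entropy or margin sampling are drop-in alternatives; only the uncertainty formula changes.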
Synthetic data. For when real data is scarce or expensive. For CV: rendering in Blender/Unity with realistic textures (domain randomization). For NLP: paraphrasing via LLM, backtranslation. Risk: the model learns the synthetic distribution, not the real one; validate carefully on a real holdout.
Data Quality: Validation and Monitoring
Great Expectations: the de facto standard for data validation in ML pipelines. Expectations are declarative assertions about data: "column age contains values 0-120", "user_id has no nulls", "amount distribution deviates no more than 20% from baseline." They run inside the pipeline and block downstream steps on failure.
Pandera — more Pythonic for pandas/polars DataFrames. Schema-based with type hints:
```python
import pandera as pa

schema = pa.DataFrameSchema({
    "user_id": pa.Column(int, nullable=False),
    "score": pa.Column(float, pa.Check.between(0, 1)),
    "label": pa.Column(str, pa.Check.isin(["positive", "negative", "neutral"])),
})
# schema.validate(df) raises SchemaError on any violation
```
Data freshness. The model expects data from the last N days. If ETL fails and data stops updating, the model silently uses stale features. Monitor the latest record timestamp per table; alert when the delay exceeds a threshold.
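A freshness check reduces to comparing each table's latest record timestamp against a per-table threshold (table names and thresholds below are illustrative):

```python
from datetime import datetime, timedelta

def stale_tables(last_updated, now, thresholds, default=timedelta(hours=24)):
    """Return tables whose latest record is older than their threshold.

    last_updated: table name -> timestamp of the latest record.
    thresholds: per-table overrides; anything else uses the default.
    """
    return sorted(
        table for table, ts in last_updated.items()
        if now - ts > thresholds.get(table, default)
    )

now = datetime(2024, 1, 2, 12, 0)
last_updated = {
    "transactions": datetime(2024, 1, 2, 11, 30),  # 30 min old: fresh
    "user_events": datetime(2023, 12, 30, 9, 0),   # 3+ days old: stale
}
alerts = stale_tables(
    last_updated, now, thresholds={"transactions": timedelta(hours=1)}
)
```

In production this runs on a schedule (Airflow sensor, cron job) and fires an alert per returned table.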
Deduplication. Duplicates in training inflate metrics (the same examples land in both train and val) and skew weights. MinHash LSH for approximate dedup on large datasets; for exact dedup, hash the normalized content.
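Exact dedup via a hash of normalized content fits in a few lines (MinHash LSH follows the same keep-first pattern, just with approximate signatures instead of exact hashes):

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash equally."""
    return " ".join(text.lower().split())

def dedup_exact(docs):
    """Keep the first occurrence of each normalized-content hash."""
    seen, unique = set(), []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique

docs = ["Hello  World", "hello world", "Goodbye"]
unique = dedup_exact(docs)  # "hello world" is a duplicate of the first doc
```

The normalization step is where most of the judgment lives: how aggressively to strip punctuation, casing, and markup depends on what counts as "the same example" for the task.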
Storage and Formats
| Format | Better for | Features |
|---|---|---|
| Parquet | Batch training, analytics | Columnar, efficient compression |
| Delta Lake | Incremental updates, ACID | Time travel, schema evolution |
| Apache Iceberg | Enterprise, multi-engine | Best catalog, hidden partitioning |
| HDF5 | Numeric arrays (CV datasets) | Hierarchical structure |
| TFDS / datasets | Standardized ML datasets | Hugging Face datasets — convenient for NLP |
For most ML projects starting out: Parquet in S3 plus DVC for versioning. Move to Delta Lake or Iceberg when you need incremental updates or time travel.
Workflow
Audit existing data. Profiling: ydata-profiling generates HTML report with statistics, distributions, correlations, missing values in minutes. First step in any project.
Pipeline design. Determine data sources, update frequency, feature latency requirements, volumes. Choose tools for task.
Implementation and testing. Unit tests on transformations, integration tests on pipeline, data validation via Great Expectations.
Production monitoring. Alerts on freshness, quality checks, data volume anomalies.
A simple ETL pipeline with validation: 2-3 weeks. A complete data platform with Feature Store and monitoring: 2-3 months. An audit of existing pipelines plus a roadmap: 1 week.