Data Management

Data versioning, synthetic data, pipelines, feature stores, and practices for development

Why Version Data in Development

Reproducibility and auditability depend on knowing exactly which data was used for each run.

Principles

Track dataset identity (name + version or commit) for every training run
Store metadata: schema, row count, checksums, creation date
Use immutable versions: new data → new version, never overwrite
Link data versions to code (e.g. Git tag or commit) in experiment logs

Tools & Patterns

In development, you can start simple and scale up:

File-based: DVC, Git LFS, or object storage with versioned paths (e.g. s3://bucket/datasets/raw/v1/)
Catalog: Data catalogs (OpenLineage, DataHub) for lineage and discovery
ML-specific: MLflow Datasets, Kubeflow artifact tracking, or custom version tables

Example: Logging data version in config

# config/train_v1.yaml
data:
  train_path: "s3://my-bucket/datasets/train/v2.3"
  val_path: "s3://my-bucket/datasets/val/v2.3"
  schema_version: "1.0"
model:
  ...