Skip to main content
MLOps Academy
Back to Development Environment

Data Management

Data versioning, synthetic data, pipelines, feature stores, and practices for development

Why Version Data in Development
Reproducibility and auditability depend on knowing exactly which data was used for each run.

Principles

  • Track dataset identity (name + version or commit) for every training run
  • Store metadata: schema, row count, checksums, creation date
  • Use immutable versions: new data → new version, never overwrite
  • Link data versions to code (e.g. Git tag or commit) in experiment logs

Tools & Patterns

In development, you can start simple and scale up:

  • File-based: DVC, Git LFS, or object storage with versioned paths (e.g. s3://bucket/datasets/raw/v1/)
  • Catalog: Data catalogs (OpenLineage, DataHub) for lineage and discovery
  • ML-specific: MLflow Datasets, Kubeflow artifact tracking, or custom version tables

Example: Logging data version in config

# config/train_v1.yaml
data:
  train_path: "s3://my-bucket/datasets/train/v2.3"
  val_path: "s3://my-bucket/datasets/val/v2.3"
  schema_version: "1.0"
model:
  ...