Data Management
Data versioning, synthetic data, pipelines, feature stores, and practices for development
Why Version Data in Development
Reproducibility and auditability depend on knowing exactly which data was used for each run.
Principles
- Track dataset identity (name + version or commit) for every training run
- Store metadata: schema, row count, checksums, creation date
- Use immutable versions: new data → new version, never overwrite
- Link data versions to code (e.g. Git tag or commit) in experiment logs
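The principles above can be sketched with the standard library alone. This is a minimal illustration, not a replacement for a versioning tool: it assumes the dataset is a single local CSV file whose header row stands in for the schema, and the names `build_manifest` and `write_manifest_once` are hypothetical helpers invented for this example.

```python
import csv
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_file: Path, name: str, version: str) -> dict:
    """Collect identity and metadata for one CSV dataset version.

    Assumption: the file is non-empty and its first row is a header.
    """
    with data_file.open("r", encoding="utf-8", newline="") as fh:
        reader = csv.reader(fh)
        columns = next(reader)              # header row stands in for the schema
        row_count = sum(1 for _ in reader)  # data rows, header excluded
    return {
        "name": name,
        "version": version,
        "columns": columns,
        "row_count": row_count,
        "sha256": sha256_of_file(data_file),
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


def write_manifest_once(manifest: dict, manifest_path: Path) -> None:
    """Write the manifest, refusing to overwrite an existing version."""
    # Mode "x" raises FileExistsError if the file already exists,
    # which enforces "new data, new version" at the manifest level.
    with manifest_path.open("x", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2)
```

Calling `write_manifest_once` a second time with the same path fails instead of silently replacing the record, so publishing changed data requires choosing a new version identifier.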
Tools & Patterns
In development, you can start simple and scale up:
- File-based: DVC, Git LFS, or object storage with versioned paths (e.g. s3://bucket/datasets/raw/v1/)
- Catalog: Metadata and lineage tools (OpenLineage for lineage events, DataHub for cataloging and discovery)
- ML-specific: MLflow Datasets, Kubeflow artifact tracking, or custom version tables
Example: Logging data version in config
# config/train_v1.yaml
data:
  train_path: "s3://my-bucket/datasets/train/v2.3"
  val_path: "s3://my-bucket/datasets/val/v2.3"
  schema_version: "1.0"
model:
  ...
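To show how the data section of this config might end up in an experiment log, here is a hedged Python sketch. It assumes the version is the last path segment and looks like `v2.3`, which matches the paths above but is only a convention. The config is written as a plain dict so the example needs no YAML library, and `dataset_version`, `current_git_commit`, and `build_run_record` are hypothetical helper names.

```python
import json
import re
import subprocess
from typing import Optional

# Mirrors the data section of the config above; in practice this dict
# would come from whatever YAML loader the project already uses.
CONFIG = {
    "data": {
        "train_path": "s3://my-bucket/datasets/train/v2.3",
        "val_path": "s3://my-bucket/datasets/val/v2.3",
        "schema_version": "1.0",
    }
}

# Assumed convention: the final path segment is "v" plus dotted numbers.
_VERSION_RE = re.compile(r"/(v\d+(?:\.\d+)*)/?$")


def dataset_version(path: str) -> str:
    """Extract the trailing version segment (e.g. 'v2.3') from a dataset path."""
    match = _VERSION_RE.search(path)
    if match is None:
        raise ValueError(f"no version segment at end of path: {path!r}")
    return match.group(1)


def current_git_commit() -> Optional[str]:
    """Return the current commit hash, or None outside a Git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def build_run_record(config: dict, code_commit: Optional[str]) -> dict:
    """Combine data versions and the code commit into one log record."""
    data = config["data"]
    return {
        "train_path": data["train_path"],
        "train_version": dataset_version(data["train_path"]),
        "val_path": data["val_path"],
        "val_version": dataset_version(data["val_path"]),
        "schema_version": data["schema_version"],
        "code_commit": code_commit,
    }


if __name__ == "__main__":
    record = build_run_record(CONFIG, current_git_commit())
    print(json.dumps(record, indent=2))
```

Writing a record like this next to each run's metrics keeps the link between data version and code commit in one place, whichever tracking tool stores it.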