Dataset Versioning Without the Tools: A Practical Approach for Reproducible Machine Learning

Introduction

Reproducibility is a cornerstone of rigorous machine learning practice. Yet in production ML systems, reproducibility often breaks at the data layer. A model trained on dataset-v1.2.3 performs differently from one trained on dataset-v1.2.4, but engineers struggle to articulate why.

The conventional wisdom is to adopt specialised data versioning tools: DVC (Data Version Control), MLflow, or Weights & Biases. These tools are powerful. They’re also often unnecessary when you’re starting out.

This article presents a lightweight, pragmatic approach to data versioning that achieves reproducibility without the operational overhead of specialised tools. The approach is based on production experience versioning TB-scale datasets across ML pipelines in computer vision and autonomous systems.

The core insight: data versioning is not about the tool. It’s about maintaining a consistent contract between your data pipeline and your model training process. You can achieve that with discipline and automation, before graduating to specialised tooling.

The Reproducibility Crisis in ML

The machine learning community has long recognised that reproducibility is critical. Yet reproducibility often means model reproducibility: “Can I retrain this model and get the same results?”

Data reproducibility is equally important but less frequently addressed: “Can I access the exact data this model trained on?”

Consider a common scenario:

  • A production model trained on dataset-v1.2.0 performs at 94% accuracy
  • Months later, the same model retrains on dataset-v1.2.5 and drops to 89% accuracy
  • The team investigates the model code (no changes), the training pipeline (no changes), the hardware (identical)
  • The only variable is the data

Without data versioning, the investigation stalls. Was there a quality degradation in the incoming data? Were certain classes undersampled? Did a preprocessing step change? Without a record of what’s in each dataset version, debugging becomes impossible.

This is why data versioning matters. Not as an academic exercise, but as a practical requirement for maintaining ML systems in production.

Why Specialised Tools Are Great but Can Be Overkill

DVC is powerful. MLflow is comprehensive. These tools exist for good reasons, but they introduce operational complexity that many teams can’t justify early on:

  • Setup overhead: Configuration, API keys, service dependencies
  • Operational burden: Another service to monitor, update, and maintain
  • Learning curve: Teams must understand specialised concepts (remotes, stages, registries)
  • Cost: Hosted tiers, and the storage behind these tools, scale in price with data volume
  • Vendor lock-in: Switching tools requires migration of metadata

For a startup with a single model in production, or a research team exploring a new problem, this overhead is premature. You need something simpler first.

What you actually need:

  • A way to identify which data trained which model
  • The ability to reproduce that exact dataset state
  • Documentation of what’s in each version (size, class distribution, etc.)
  • Traceability of where the data came from

You can achieve all of this with a JSON file, S3, and discipline.

A Manifest-Based Approach to Data Versioning

The core principle is simple: treat a dataset as a versioned artifact with associated metadata. Rather than version the data itself (which is often impractical at scale), version a manifest file that describes the data.

A data manifest should capture:

  1. Data lineage: Which raw data sources were used to create this version
  2. Transformations: What processing steps were applied (filtering, normalization, augmentation)
  3. Dataset metadata: Size, count of samples, class distribution, temporal range
  4. Integrity verification: Checksums of data splits to detect tampering and corruption
  5. Temporal information: When the dataset was created and by whom

These five elements are sufficient for reproducibility in most production ML systems. They answer the critical question: “Given this model and this dataset version, can someone else reproduce the same training run?”

Below is an example manifest:

{
  "version": "dataset-v1.2.3",
  "created_at": "2026-05-11T09:30:00Z",
  "source_batches": [
    "raw-data/batch-2026-05-08/",
    "raw-data/batch-2026-05-09/",
    "raw-data/batch-2026-05-10/"
  ],
  "transformations": [
    "removed frames with timestamp gaps > 5s",
    "filtered out frames with <3 modalities",
    "normalized timestamps to UTC"
  ],
  "output_location": "s3://datasets/training/dataset-v1.2.3/",
  "metadata": {
    "total_frames": 45000,
    "total_size_gb": 247,
    "class_distribution": {
      "class_a": 0.42,
      "class_b": 0.31,
      "class_c": 0.27
    },
    "split": {
      "train": 0.75,
      "val": 0.15,
      "test": 0.1
    }
  },
  "checksums": {
    "train_split": "sha256:abc123...",
    "val_split": "sha256:def456...",
    "test_split": "sha256:ghi789..."
  },
  "lineage": {
    "previous_version": "dataset-v1.2.2",
    "reason_for_update": "added 3 new batches, filtered corrupt frames"
  }
}

That’s it: one JSON file. It is transparent, Git can version it, and your pipelines can parse it.
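Because pipelines will parse the manifest, it can help to check its shape before anything depends on it. The following is a minimal sketch, not part of the original pipeline: the function name `validate_manifest`, the `REQUIRED_KEYS` set, and the tolerance are illustrative assumptions based on the example manifest above.

```python
# Hypothetical helper: the required keys mirror the example manifest above.
REQUIRED_KEYS = {
    "version", "created_at", "source_batches", "transformations",
    "output_location", "metadata", "checksums",
}


def validate_manifest(manifest, tolerance=1e-6):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    # Every top-level section of the manifest must be present.
    missing = REQUIRED_KEYS - manifest.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    # Fractions in the class distribution and the split should each sum to 1.
    metadata = manifest.get("metadata", {})
    for field in ("class_distribution", "split"):
        fractions = metadata.get(field, {})
        total = sum(fractions.values())
        if fractions and abs(total - 1.0) > tolerance:
            problems.append(f"{field} sums to {total:.4f}, expected 1.0")

    return problems
```

A pipeline could refuse to upload a manifest whenever this function returns a non-empty list.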

The Implementation

Step 1: Choose a naming convention

Store your datasets like this:

s3://datasets/training/
├── dataset-v1.0.0/
│   ├── train/
│   ├── val/
│   ├── test/
│   └── manifest.json
├── dataset-v1.1.0/
│   ├── train/
│   ├── val/
│   ├── test/
│   └── manifest.json
└── dataset-v1.2.0/
    ├── train/
    ├── val/
    ├── test/
    └── manifest.json

Every version gets its own folder. The manifest.json lives in the root of that folder.
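A naming convention is easier to keep when code builds the paths instead of people typing them. Here is a small sketch under stated assumptions: `manifest_key` and `VERSION_PATTERN` are hypothetical names, and the `training` prefix is taken from the layout above.

```python
import re

# Assumed convention from the layout above: dataset-v<major>.<minor>.<patch>
VERSION_PATTERN = re.compile(r"^dataset-v(\d+)\.(\d+)\.(\d+)$")


def manifest_key(version, prefix="training"):
    """Build the object key of a version's manifest, rejecting malformed names."""
    if not VERSION_PATTERN.match(version):
        raise ValueError(f"not a valid dataset version: {version!r}")
    return f"{prefix}/{version}/manifest.json"
```

Rejecting malformed names at this point keeps stray folders such as `dataset-final-2` out of the bucket.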

Step 2: Create the manifest after data processing

After your data pipeline completes (ingestion, validation, transformation, splitting), generate the manifest:

import json
import hashlib  # used by your compute_checksum helper
from datetime import datetime, timezone

import boto3

s3_client = boto3.client("s3")

# count_frames, get_size_gb, compute_class_distribution, and compute_checksum
# are helpers you supply for your own storage layout.


def create_manifest(version, source_batches, transformations,
                    train_split, val_split, test_split, output_location):
    """Create a dataset manifest for versioning."""
    # Strip the trailing slash so joined paths don't contain "//"
    location = output_location.rstrip("/")
    manifest = {
        "version": version,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "source_batches": source_batches,
        "transformations": transformations,
        "output_location": output_location,
        "metadata": {
            "total_frames": count_frames(output_location),
            "total_size_gb": get_size_gb(output_location),
            "class_distribution": compute_class_distribution(output_location),
            "split": {
                "train": train_split,
                "val": val_split,
                "test": test_split
            }
        },
        "checksums": {
            "train_split": compute_checksum(f"{location}/train/"),
            "val_split": compute_checksum(f"{location}/val/"),
            "test_split": compute_checksum(f"{location}/test/")
        }
    }
    return manifest


# Write manifest to S3 (manifest is the dict returned by create_manifest)
s3_client.put_object(
    Bucket="datasets",
    Key=f"training/{version}/manifest.json",
    Body=json.dumps(manifest, indent=2)
)
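The manifest code calls a `compute_checksum` helper that the article does not define. One possible implementation is sketched below, assuming the split has been synced to a local directory rather than read directly from S3. It folds every file's relative path and contents into a single SHA-256 digest, in sorted order so the result does not depend on filesystem listing order.

```python
import hashlib
from pathlib import Path


def compute_checksum(directory):
    """Hash all files under a local directory into one SHA-256 digest.

    Assumption: the split is available on local disk. Files are visited in
    sorted order of their relative paths so the digest is deterministic.
    """
    root = Path(directory)
    files = [p for p in root.rglob("*") if p.is_file()]
    digest = hashlib.sha256()
    for path in sorted(files, key=lambda p: p.relative_to(root).as_posix()):
        # Include the relative path so renames change the checksum too.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        with path.open("rb") as handle:
            # Read in 1 MiB chunks to keep memory use flat on large files.
            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                digest.update(chunk)
        digest.update(b"\0")
    return f"sha256:{digest.hexdigest()}"
```

For TB-scale splits, hashing every byte on each run may be too slow; hashing a listing of per-object checksums is a common alternative, at the cost of trusting the stored values.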

Step 3: Reference the manifest in your training job

When you train a model, record which dataset version you used:

import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

# Load the manifest
response = s3.get_object(Bucket="datasets", Key="training/dataset-v1.2.0/manifest.json")
manifest = json.load(response["Body"])

# Strip the trailing slash so joined paths don't contain "//"
data_root = manifest["output_location"].rstrip("/")

# Train your model (train_model is your own training entry point)
model = train_model(
    train_data=f"{data_root}/train/",
    val_data=f"{data_root}/val/"
)

# Save the dataset version alongside the model
model_metadata = {
    "model_version": "model-v1.0.0",
    "training_date": datetime.now(timezone.utc).isoformat(),
    "dataset_version": manifest["version"],
    "dataset_manifest": manifest
}

# Store this with your model (save_model_metadata is your own helper)
save_model_metadata(model_metadata)

Now when the model regresses, you can trace it back to the exact dataset version and know what was in it.
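The checksums in the manifest only help if something compares them before training starts. A minimal sketch of that comparison follows; `find_checksum_mismatches` is a hypothetical name, and `actual_checksums` is assumed to come from whatever checksum routine produced the manifest values.

```python
def find_checksum_mismatches(manifest, actual_checksums):
    """Compare recomputed checksums against the manifest.

    Returns {split_name: (expected, actual)} for every split that differs.
    A split missing from `actual_checksums` is reported with actual=None.
    """
    mismatches = {}
    for split_name, expected in manifest["checksums"].items():
        actual = actual_checksums.get(split_name)
        if actual != expected:
            mismatches[split_name] = (expected, actual)
    return mismatches
```

A training job could abort when the returned dictionary is non-empty, so a silently modified split never reaches the model.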

Automation

You shouldn’t create manifests by hand. Manifest generation should be the last step of your data pipeline, fully automated:

# In your CI/CD pipeline (GitLab CI syntax shown; adapt for GitHub Actions, etc.)

create-dataset-manifest:
  stage: data-processing
  script:
    - |
      python scripts/create_manifest.py \
        --version dataset-v1.2.3 \
        --source-batches raw-data/batch-* \
        --output-location s3://datasets/training/dataset-v1.2.3/
  only:
    - main

Every time your data pipeline completes, the manifest is created automatically. No manual steps.
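The CI job above invokes `scripts/create_manifest.py` with three flags, but the script itself is not shown. The argument parsing might look roughly like this sketch; only the flag names come from the job definition, and everything else (the function name, the help text, accepting several batch paths) is an assumption.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags used by the CI job; argv=None reads sys.argv."""
    parser = argparse.ArgumentParser(description="Create a dataset manifest.")
    parser.add_argument("--version", required=True,
                        help="e.g. dataset-v1.2.3")
    # nargs="+" accepts the several paths a shell glob expands to.
    parser.add_argument("--source-batches", nargs="+", required=True,
                        help="one or more raw batch paths")
    parser.add_argument("--output-location", required=True,
                        help="where the processed dataset was written")
    return parser.parse_args(argv)
```

Passing an explicit list to `parse_args` makes the parsing testable without touching the real command line.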

When to Add Data Lineage

As your data pipeline becomes more complex, you’ll want to track which transformations created which dataset.

Extend the lineage field of your manifest with per-transformation detail:

{
  "lineage": {
    "previous_version": "dataset-v1.2.2",
    "reason_for_update": "added 3 new batches, removed 15 corrupt frames",
    "transformations_applied": [
      {
        "name": "filter_corrupt_frames",
        "removed_frames": 15,
        "reason": "timestamp gaps > 5 seconds"
      },
      {
        "name": "normalize_timestamps",
        "modified_frames": 44985
      }
    ]
  }
}

This lets you answer: “Why is v1.2.3 different from v1.2.2?” Just read the lineage.
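Lineage records what was intended; comparing two manifests shows what actually changed. The sketch below is one way to do that, assuming both manifests follow the example structure shown earlier. The function name `diff_manifests` and the shape of its result are illustrative.

```python
def diff_manifests(old, new):
    """Summarise what changed between two manifests of the same dataset."""
    old_meta, new_meta = old["metadata"], new["metadata"]
    old_dist = old_meta["class_distribution"]
    new_dist = new_meta["class_distribution"]

    # Per-class change in share; classes absent on one side count as 0.
    class_shift = {
        name: round(new_dist.get(name, 0.0) - old_dist.get(name, 0.0), 6)
        for name in sorted(set(old_dist) | set(new_dist))
    }

    old_batches, new_batches = set(old["source_batches"]), set(new["source_batches"])
    return {
        "from": old["version"],
        "to": new["version"],
        "frames_delta": new_meta["total_frames"] - old_meta["total_frames"],
        "added_batches": sorted(new_batches - old_batches),
        "removed_batches": sorted(old_batches - new_batches),
        "class_shift": class_shift,
    }
```

In the accuracy-drop scenario from earlier, a large entry in `class_shift` would be the first place to look.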

Versioning Strategy

Use semantic versioning:

dataset-v1.0.0: Initial release
dataset-v1.1.0: Minor change (new batch added)
dataset-v1.1.1: Patch (filtered out bad frames)
dataset-v2.0.0: Major change (new modality, new collection process)

Only increment the major version when the data collection process fundamentally changes. New batches? Minor version. Filtering? Patch version.
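The bump rules in the paragraph above (collection process change is major, new batches are minor, filtering is a patch) are simple enough to encode, which removes one more manual decision. This is a sketch; the name `bump_version` and the change labels are assumptions.

```python
import re


def bump_version(version, change):
    """Return the next dataset version for a 'major', 'minor' or 'patch' change."""
    match = re.fullmatch(r"dataset-v(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"not a valid dataset version: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())

    if change == "major":      # data collection process fundamentally changed
        major, minor, patch = major + 1, 0, 0
    elif change == "minor":    # new batches added
        minor, patch = minor + 1, 0
    elif change == "patch":    # filtering or other cleanup of existing data
        patch += 1
    else:
        raise ValueError(f"unknown change type: {change!r}")

    return f"dataset-v{major}.{minor}.{patch}"
```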

When to Graduate to New Tools

This approach works until:

  • You have 50+ dataset versions
  • Your pipeline is complex enough that lineage becomes hard to track
  • You need collaborative features (teams labeling, reviewing)
  • You want to deduplicate identical data across versions

At that point, DVC or MLflow makes sense. But you probably don’t need it for 6–12 months.

Pros:

  • Simple: A JSON file and S3. Everyone understands it.
  • Reproducible: Any engineer can download a dataset version and train on it.
  • Auditable: You know exactly which data trained which model.
  • Version controlled: The manifest lives in your Git repo (or alongside your data).
  • Automatic: Once you set it up, it runs without maintenance.
  • Vendor independent: No proprietary format, no lock-in.

Cons:

You won’t have:

  • A beautiful dashboard showing all your datasets
  • Automatic deduplication across versions
  • Collaborative labelling workflows
  • Integration with every ML framework

That’s okay. Those are luxuries, not requirements.

When to Graduate to New Tooling

This manifest-based approach is sustainable until the complexity of your data ecosystem makes manual manifest management untenable. Indicators that you should consider adopting specialised tooling:

  • You maintain 50+ concurrent dataset versions with complex interdependencies
  • Multiple teams are simultaneously creating and consuming datasets
  • Your data lineage forms a directed acyclic graph (DAG) that’s difficult to reason about manually
  • You need collaborative features (simultaneous labelling, peer review of dataset changes)
  • You require automatic deduplication across versions to optimise storage
  • You need integration with specialised ML frameworks (PyTorch Lightning, TensorFlow Data Services)

For most organisations, this inflection point occurs 6–12 months into production operation, not immediately.

Principles and Tradeoffs

This approach prioritises:

  • Simplicity: Manifest files are human-readable and tool-agnostic
  • Operational independence: No reliance on external services
  • Auditability: All versioning decisions are recorded in Git
  • Reproducibility: The training data can be reconstructed exactly, as long as each version’s folder is kept immutable

The tradeoffs:

  • Scalability: Manual manifest management doesn’t scale beyond ~50 versions
  • Discoverability: Finding relevant datasets requires searching through files, not querying metadata
  • Integration: No built-in integrations with visualisation or experiment tracking tools
  • Automation: Lineage tracking requires discipline and scripting

These tradeoffs are acceptable for the majority of ML projects, where a few carefully-versioned datasets support model development. They become unacceptable when your data ecosystem grows beyond that scope.
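The discoverability tradeoff can be softened with a few lines of scripting while the number of versions is small. The sketch below assumes the manifests are mirrored in a local directory tree (for example, a Git checkout) with one `manifest.json` per version folder; `find_manifests` is a hypothetical helper.

```python
import json
from pathlib import Path


def find_manifests(root, predicate):
    """Return the versions of all manifests under `root` that satisfy `predicate`."""
    matches = []
    for path in sorted(Path(root).rglob("manifest.json")):
        with path.open(encoding="utf-8") as handle:
            manifest = json.load(handle)
        if predicate(manifest):
            matches.append(manifest["version"])
    return matches
```

For example, `find_manifests("manifests/", lambda m: m["metadata"]["total_frames"] > 40000)` would list every version with more than 40,000 frames. This is a linear scan, which is fine for dozens of versions but is not a substitute for a metadata index.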


Conclusion

Reproducible machine learning requires reproducible data. The common assumption that this requires specialised tooling is incorrect. For teams early in their ML journey, a simple manifest-based approach provides reproducibility, auditability, and operational simplicity without the overhead of specialised tools.

Start with manifests and discipline. When your data ecosystem becomes complex enough to justify it, graduate to specialised tooling. But recognise that the complexity is often optional, not inevitable.

I’m Raj Alamuri, a Senior MLOps Engineer in London specialising in production ML infrastructure for computer vision and autonomous systems. I’ve implemented data versioning across research, startup, and enterprise contexts using both specialised tools and lightweight approaches.


Dataset Versioning Without the Tools: A Practical Approach for Reproducible Machine Learning was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
