Best examples of ML projects with good dataset/task code abstractions? [D]

I am working on a benchmark and need to manage several interlocking components: datasets and metadata, diverse ML tasks (varying inputs and outputs), and baseline experiments covering models, training, and evaluations. Any pointers to projects that handle these through clean, minimal data structures such as dataclasses or Pydantic models? Specifically, I want to see how others manage:

  1. Dataset Information: Representing dataset cards, metadata, and split definitions as first-class objects.
  2. Task Schemas: Defining ML tasks with specific input and output types to ensure consistency across different models.
  3. Experiment Composition: Structures that link a model and training configuration to a specific evaluation and prediction set.
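
To make the three items above concrete, here is a rough sketch of the kind of structure in question, using only the standard library's `dataclasses`. Every name in it (`SplitInfo`, `DatasetInfo`, `TaskSchema`, `TrainConfig`, `Experiment`, and their fields) is a hypothetical placeholder for illustration, not taken from any existing project, and the runtime type check is deliberately simplistic.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Generic, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


@dataclass(frozen=True)
class SplitInfo:
    """One named split of a dataset (hypothetical structure)."""
    name: str
    num_examples: int


@dataclass(frozen=True)
class DatasetInfo:
    """Dataset card, metadata, and split definitions as one object."""
    name: str
    description: str
    splits: tuple[SplitInfo, ...]
    metadata: dict[str, str] = field(default_factory=dict)

    def split(self, name: str) -> SplitInfo:
        # Look a split up by name; fail loudly if it is not defined.
        for s in self.splits:
            if s.name == name:
                return s
        raise KeyError(name)


@dataclass(frozen=True)
class TaskSchema(Generic[In, Out]):
    """An ML task described by its input and output types."""
    name: str
    input_type: type[In]
    output_type: type[Out]

    def check(self, x: object, y: object) -> bool:
        # Simplistic runtime check that an (input, output) pair fits the task.
        return isinstance(x, self.input_type) and isinstance(y, self.output_type)


@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float = 1e-3
    epochs: int = 10


@dataclass(frozen=True)
class Experiment:
    """Links a model and training config to a dataset, task, and evaluation."""
    dataset: DatasetInfo
    task: TaskSchema
    model_name: str
    train: TrainConfig
    eval_split: str = "test"
    metrics: tuple[str, ...] = ("accuracy",)

    def __post_init__(self) -> None:
        # Reject experiments that point at a split the dataset does not have.
        self.dataset.split(self.eval_split)


# Usage: compose an experiment from the pieces above.
toy = DatasetInfo(
    name="toy-sentiment",
    description="Placeholder dataset for illustration.",
    splits=(SplitInfo("train", 80), SplitInfo("test", 20)),
)
sentiment = TaskSchema("sentiment", input_type=str, output_type=int)
baseline = Experiment(toy, sentiment, model_name="logreg", train=TrainConfig())
```

Frozen dataclasses keep each object immutable once built, and validating in `__post_init__` surfaces a mismatched split name at construction time instead of mid-run. The same shapes could presumably be written as Pydantic models if field-level validation is needed.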

If you have seen repositories that maintain these abstractions with minimal boilerplate and high type safety, please share them. I am interested in internal code organization rather than external tools like W&B or MLflow. I am already aware of Cookiecutter Data Science; I am looking for data structures, not project layouts.

submitted by /u/LetsTacoooo
