Skip to content

Overriding Data Files

Applying a task or pre-trained model to train, evaluate, or predict with your data (locally or from Hugging Face) is straightforward. There are two requirements:

  1. You must know which Task and DataModule you are using. If you are using a finetuned model checkpoint, find this in the associated config.yaml.
  2. Your data must be in a format that the Hugging Face load_dataset can read (e.g. tsv, csv, json, etc).

If these are satisified, congrats! You can immediately use your data with AIDO.ModelGenerator by overriding the data paths and files in the config.

Overriding Data Paths

Suppose a model was trained with data from Hugging Face but now we want to predict or retrain with a local dataset.

my_dataset/
  # Columns: `id`, `dna_sequence`, `expression`
  my_train.tsv
  my_test.tsv

The config.yaml data section might look like this:

data:
  class_path: PromoterExpressionRegression
  init_args:
    path: genbio-ai/100m-random-promoters
    train_split_files: train.tsv
    test_split_files: test.tsv
    x_col: sequence
    y_col: label
    val_split_size: 0.1

To re-run with your data, just override the necessary args

data:
  class_path: PromoterExpressionRegression
  init_args:
    path: my_dataset
    train_split_files: my_train.tsv
    test_split_files: my_test.tsv
    x_col: dna_sequence
    y_col: expression
    val_split_size: 0.1

If some of the overrides aren't immediately obvious from the config, you can go to the Data API Reference to find documentation for usage. Many datasets are convenience datasets, and aren't explicitly documented, so you'll need to look for their parent class in the Data API Reference.

For example, a quick look at the codebase shows that PromoterExpressionRegression is a convenience subclass of SequenceRegressionDataModule. The above overrides are equivalent to

data:
  class_path: SequenceRegressionDataModule
  init_args:
    path: my_dataset
    train_split_files: my_train.tsv
    test_split_files: my_test.tsv
    x_col: dna_sequence
    y_col: expression
    val_split_size: 0.1