Overriding Data Files
Applying a task or pre-trained model to train, evaluate, or predict with your data (locally or from Hugging Face) is straightforward. There are two requirements:
- You must know which Task and DataModule you are using. If you are using a finetuned model checkpoint, find this in the associated config.yaml.
- Your data must be in a format that the Hugging Face load_dataset can read (e.g. tsv, csv, json).
If these are satisfied, congrats! You can immediately use your data with AIDO.ModelGenerator by overriding the data paths and files in the config.
Overriding Data Paths
Suppose a model was trained with data from Hugging Face but now we want to predict or retrain with a local dataset.
my_dataset/
  # Columns: `id`, `dna_sequence`, `expression`
  my_train.tsv
  my_test.tsv
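Before pointing a config at files like these, it can help to confirm that they parse as tab-separated text and that the header contains the columns you plan to reference. The following is a minimal sketch using only the Python standard library. The helper names `read_header` and `missing_columns` are hypothetical, and the in-memory sample (with made-up values) stands in for `my_dataset/my_train.tsv`.

```python
import csv
import io


def read_header(handle, delimiter="\t"):
    """Return the column names from the first row of a delimited text file."""
    reader = csv.reader(handle, delimiter=delimiter)
    return next(reader)


def missing_columns(header, required):
    """Return the required column names that are absent from the header."""
    return [name for name in required if name not in header]


# In-memory stand-in for my_dataset/my_train.tsv (values are made up).
sample = io.StringIO("id\tdna_sequence\texpression\nseq1\tACGT\t0.5\n")
header = read_header(sample)

# The columns that the config will later reference as inputs and labels.
missing = missing_columns(header, ["dna_sequence", "expression"])
print(header, missing)
```

For a real file, replace the stand-in with `open("my_dataset/my_train.tsv", newline="")`. This only checks the header; whether the file loads is ultimately decided by Hugging Face load_dataset itself.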
The config.yaml data section might look like this:
data:
  class_path: PromoterExpressionRegression
  init_args:
    path: genbio-ai/100m-random-promoters
    train_split_files: train.tsv
    test_split_files: test.tsv
    x_col: sequence
    y_col: label
    val_split_size: 0.1
To re-run with your data, override the necessary args:
data:
  class_path: PromoterExpressionRegression
  init_args:
    path: my_dataset
    train_split_files: my_train.tsv
    test_split_files: my_test.tsv
    x_col: dna_sequence
    y_col: expression
    val_split_size: 0.1
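To make explicit which arguments change and which carry over, the override can be written out as an update on plain Python dicts. This is only an illustrative sketch: the function `override_init_args` is hypothetical, and it does not describe how AIDO.ModelGenerator itself reads or merges configs.

```python
import copy

# The data section of the original config, written as a plain dict.
base_data = {
    "class_path": "PromoterExpressionRegression",
    "init_args": {
        "path": "genbio-ai/100m-random-promoters",
        "train_split_files": "train.tsv",
        "test_split_files": "test.tsv",
        "x_col": "sequence",
        "y_col": "label",
        "val_split_size": 0.1,
    },
}

# Only the arguments that differ for the local dataset.
local_overrides = {
    "path": "my_dataset",
    "train_split_files": "my_train.tsv",
    "test_split_files": "my_test.tsv",
    "x_col": "dna_sequence",
    "y_col": "expression",
}


def override_init_args(data_section, overrides):
    """Return a copy of data_section with init_args updated by overrides."""
    updated = copy.deepcopy(data_section)  # leave the original untouched
    updated["init_args"].update(overrides)
    return updated


new_data = override_init_args(base_data, local_overrides)
print(new_data["init_args"])
```

Here `class_path` and `val_split_size` are untouched, while the path, split files, and column names are replaced with the local dataset's values.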
If an override isn't obvious from the config, consult the Data API Reference for usage documentation. Many datasets are convenience subclasses that aren't explicitly documented, so you'll need to look up their parent class in the Data API Reference instead.
For example, a quick look at the codebase shows that PromoterExpressionRegression is a convenience subclass of SequenceRegressionDataModule.
The above overrides are equivalent to:
data:
  class_path: SequenceRegressionDataModule
  init_args:
    path: my_dataset
    train_split_files: my_train.tsv
    test_split_files: my_test.tsv
    x_col: dna_sequence
    y_col: expression
    val_split_size: 0.1
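The reason the two configs behave the same can be illustrated with a toy pair of classes. This is a sketch of the general convenience-subclass pattern only: `ToySequenceRegression` and `ToyPromoterExpression` are hypothetical stand-ins, and their arguments and defaults are assumptions rather than the library's actual signatures, which are listed in the Data API Reference.

```python
class ToySequenceRegression:
    """Toy stand-in for a general data module; not the library's class."""

    def __init__(self, path, x_col="sequence", y_col="label"):
        self.path = path
        self.x_col = x_col
        self.y_col = y_col


class ToyPromoterExpression(ToySequenceRegression):
    """Toy convenience subclass: same behavior, dataset-specific default path."""

    def __init__(self, path="genbio-ai/100m-random-promoters", **kwargs):
        super().__init__(path=path, **kwargs)


# Once every dataset-specific argument is overridden, the subclass and its
# parent end up configured identically.
args = {"path": "my_dataset", "x_col": "dna_sequence", "y_col": "expression"}
child = ToyPromoterExpression(**args)
parent = ToySequenceRegression(**args)
same = vars(child) == vars(parent)
print(same)
```

In this pattern the subclass contributes only defaults, so when all of them are overridden, naming the parent class in `class_path` gives the same result.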